To use the RDD API, you need a SparkContext.

In [4]:
import findspark
findspark.init()

import pyspark
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

from lib.logger import Log4j
from lib.utils import get_spark_app_config, load_survey_df, count_by_country

In [2]:
conf = SparkConf() \
        .setMaster("local[3]") \
        .setAppName("HelloRDD")
#conf = get_spark_app_config()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext
logger = Log4j(spark)
logger.info("Starting HelloSpark")

Create RDD

In [5]:
linesrdd = sc.textFile("data/samplerdd.csv")
linesrdd.collect()

['2014-08-27 11:29:31,37,"Female","United States","IL",NA,"No","Yes","Often","6-25","No","Yes","Yes","Not sure","No","Yes","Yes","Somewhat easy","No","No","Some of them","Yes","No","Maybe","Yes","No",NA',
 '2014-08-27 11:29:37,44,"M","United States","IN",NA,"No","No","Rarely","More than 1000","No","No","Don\'t know","No","Don\'t know","Don\'t know","Don\'t know","Don\'t know","Maybe","No","No","No","No","No","Don\'t know","No",NA',
 '2014-08-27 11:29:44,32,"Male","Canada",NA,NA,"No","No","Rarely","6-25","No","Yes","No","No","No","No","Don\'t know","Somewhat difficult","No","No","Yes","Yes","Yes","Yes","No","No",NA',
 '2014-08-27 11:29:46,31,"Male","United Kingdom",NA,NA,"Yes","Yes","Often","26-100","No","Yes","No","Yes","No","No","No","Somewhat difficult","Yes","Yes","Some of them","No","Maybe","Maybe","No","Yes",NA',
 '2014-08-27 11:30:22,31,"Male","United States","TX",NA,"No","No","Never","100-500","Yes","Yes","Yes","No","Don\'t know","Don\'t know","Don\'t know","Don\'t know","No","N

How do we process an RDD?
The records in the RDD are plain lines of text. We do not have a schema or a row/column structure.

In [6]:
partitionedrdd = linesrdd.repartition(2)

Let's give structure to our records.

In [7]:
colsRDD = partitionedrdd.map(lambda line: line.replace('"','').split(","))

We use the map transformation. map takes the lambda function and calls it once for each line.
With replace we remove the double quotes, and with split we break the line on commas,
so the output for each line is a list of strings (text).
Now we have columns, but we need a schema as well. How do we give each column a name and a data type?
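To see what the lambda does to a single record, here is a minimal plain-Python sketch that needs no Spark. The sample line is a shortened version of the first row shown above, and parse_line is a hypothetical helper name used only for illustration.

```python
# Plain-Python sketch of the per-line logic inside the map() call above.
# `parse_line` and `sample` are illustrative names, not part of the notebook.
def parse_line(line):
    # Drop every double quote, then split the remaining text on commas.
    return line.replace('"', '').split(",")

sample = '2014-08-27 11:29:31,37,"Female","United States","IL",NA'
cols = parse_line(sample)
print(cols)
# ['2014-08-27 11:29:31', '37', 'Female', 'United States', 'IL', 'NA']
```

Every element is still a string, including the age '37'. Note also that this simple split would break a quoted field that itself contains a comma; the rows shown above do not appear to have any.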

In [8]:
from collections import namedtuple
SurveyRecord = namedtuple("SurveyRecord", ["Age", "Gender", "Country", "State"])

We will use the SurveyRecord namedtuple to give a schema to our RDD.
Let's process colsRDD.
Use the map method to process each row, building a SurveyRecord from only four columns (Age, Gender, Country, State) and casting Age to int.

In [9]:
selectRDD = colsRDD.map(lambda cols: SurveyRecord(int(cols[1]), cols[2], cols[3], cols[4]))
print(selectRDD.collect())

[SurveyRecord(Age=37, Gender='Female', Country='United States', State='IL'), SurveyRecord(Age=44, Gender='M', Country='United States', State='IN'), SurveyRecord(Age=32, Gender='Male', Country='Canada', State='NA'), SurveyRecord(Age=31, Gender='Male', Country='United Kingdom', State='NA'), SurveyRecord(Age=31, Gender='Male', Country='United States', State='TX'), SurveyRecord(Age=33, Gender='Male', Country='United States', State='TN'), SurveyRecord(Age=35, Gender='Female', Country='United States', State='MI'), SurveyRecord(Age=39, Gender='M', Country='Canada', State='NA'), SurveyRecord(Age=42, Gender='Female', Country='United States', State='IL')]
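The namedtuple step can also be tried on its own, without Spark. The following is a minimal sketch; the cols list is a shortened, already-split row assumed only for illustration.

```python
from collections import namedtuple

# Same field names as the notebook's SurveyRecord.
SurveyRecord = namedtuple("SurveyRecord", ["Age", "Gender", "Country", "State"])

# `cols` is an illustrative, already-split row (an assumption, not notebook state).
cols = ['2014-08-27 11:29:31', '37', 'Female', 'United States', 'IL', 'NA']

# Pick columns 1 to 4 and cast the age to int, as the map() lambda does.
record = SurveyRecord(int(cols[1]), cols[2], cols[3], cols[4])

print(record.Age + 1)    # Age is an int now, so arithmetic works
print(record.Country)    # fields are read by name instead of by index
```

A namedtuple only provides field names. It does not enforce types, so the int() cast is what actually gives Age a numeric type.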


In [10]:
filteredRDD = selectRDD.filter(lambda r: r.Age < 40)

Now we want the filtered records (Age below 40) grouped by country and counted.
The first step is to create key-value pairs: the country becomes the key and the value is a hardcoded 1.

In [11]:
kvRDD = filteredRDD.map(lambda r: (r.Country, 1))

Now we have a key-value RDD. The next step is to use the reduceByKey method to sum up the hardcoded 1s for each key; that sum is the count.
Then we collect the counts.

In [12]:
countRDD = kvRDD.reduceByKey(lambda v1, v2: v1 + v2)
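Conceptually, reduceByKey folds together all values that share a key using the supplied function. The sketch below imitates that in plain Python. It is not how Spark implements it (Spark combines values within each partition first and then across partitions), and reduce_by_key and the sample pairs are hypothetical names and data used only for illustration.

```python
# Plain-Python imitation of reduceByKey: values with the same key are
# merged pairwise using `func`. Partitioning is ignored in this sketch.
def reduce_by_key(pairs, func):
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are folded in.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Illustrative (country, 1) pairs, shaped like the records in kvRDD.
pairs = [("United States", 1), ("Canada", 1),
         ("United States", 1), ("United Kingdom", 1)]

counts = reduce_by_key(pairs, lambda v1, v2: v1 + v2)
print(counts)
# {'United States': 2, 'Canada': 1, 'United Kingdom': 1}
```

Because every value is 1 and the function is addition, the result for each key is simply the number of records with that country.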

Collect the counts and write them to the log file.

In [13]:
colsList = countRDD.collect()
for x in colsList:
    logger.info(x)

In [14]:
spark.stop()

However, the group-by implementation on an RDD is not obvious and might not make sense at first.
That is the challenge we face when using RDDs: we need to hand-code everything, such as grouping and aggregating.
The Spark engine has no knowledge of the data structure inside the RDD, and it does not look inside your lambda functions. These two things limit Spark's ability to create an optimized execution plan.

We will not go further into RDDs, as they are a raw and outdated API for Spark developers.