This tutorial covers basic Spark RDD operations.

In [3]:
# Steps: 1. create a SparkSession, 2. get a SparkContext, 3. use it to create RDDs
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Spark RDD').getOrCreate()

In [6]:
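from pyspark.context import SparkContext
# Direct method: getOrCreate() returns the active SparkContext instance (or creates one).
# Assigning the bare class (sc = SparkContext) would not give a usable context.
sc = SparkContext.getOrCreate()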

In [8]:
sc_new = spark.sparkContext

In [9]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving sample.txt to sample.txt
User uploaded file "sample.txt" with length 1815 bytes


In [11]:
# Read the file as an RDD of lines, then redistribute it across 3 partitions
readRDD = sc_new.textFile('sample.txt')
partionedRDD = readRDD.repartition(3)

#readRDD.collect()

In [17]:
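# Strip double quotes and split each line into a list of columns
# (a plain split(',') assumes no field contains an embedded comma)
colsRDD = partionedRDD.map(lambda line: line.replace('"','').split(','))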
#colsRDD.collect()

In [16]:
# Create a schema: a namedtuple gives each record named fields
from collections import namedtuple
surveyRecords = namedtuple("SurveyRecord",["Age","Gender","Country","State"])

In [18]:
selectRDD = colsRDD.map(lambda cols: surveyRecords(int(cols[1]),cols[2],cols[3],cols[4]))
selectRDD.collect()

[SurveyRecord(Age=37, Gender='Female', Country='United States', State='IL'),
 SurveyRecord(Age=44, Gender='M', Country='United States', State='IN'),
 SurveyRecord(Age=32, Gender='Male', Country='Canada', State='NA'),
 SurveyRecord(Age=31, Gender='Male', Country='United Kingdom', State='NA'),
 SurveyRecord(Age=31, Gender='Male', Country='United States', State='TX'),
 SurveyRecord(Age=33, Gender='Male', Country='United States', State='TN'),
 SurveyRecord(Age=35, Gender='Female', Country='United States', State='MI'),
 SurveyRecord(Age=39, Gender='M', Country='Canada', State='NA'),
 SurveyRecord(Age=42, Gender='Female', Country='United States', State='IL')]
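
The `int(cols[1])` call above raises an exception on any row whose age column is not a number (a header row, for instance). A minimal sketch of a more defensive parser follows, written in plain Python so it can be tried without a cluster. The function name `parse_line` and the three input lines are assumptions made for illustration; the real contents of sample.txt are not shown in this notebook.

```python
from collections import namedtuple
from typing import Optional

SurveyRecord = namedtuple("SurveyRecord", ["Age", "Gender", "Country", "State"])

def parse_line(line: str) -> Optional[SurveyRecord]:
    """Return a SurveyRecord, or None if the line is malformed."""
    cols = line.replace('"', '').split(',')
    if len(cols) < 5:
        return None  # not enough columns to build a record
    try:
        age = int(cols[1])
    except ValueError:
        return None  # age column is not an integer (e.g. a header row)
    return SurveyRecord(age, cols[2], cols[3], cols[4])

# Hypothetical input lines, invented for this sketch.
lines = [
    '"2014-08-27 11:29:31","37","Female","United States","IL"',
    '"Timestamp","Age","Gender","Country","State"',  # header row
    '"2014-08-27 11:29:44","32"',                    # too few columns
]
records = [r for r in map(parse_line, lines) if r is not None]
print(records)
# [SurveyRecord(Age=37, Gender='Female', Country='United States', State='IL')]
```

With Spark, the same function could presumably be applied to the RDD of raw lines with `map(parse_line)` followed by `filter(lambda r: r is not None)`.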

In [19]:
# filter data
filterRDD = selectRDD.filter(lambda r: r.Age<40)
filterRDD.collect()

[SurveyRecord(Age=37, Gender='Female', Country='United States', State='IL'),
 SurveyRecord(Age=32, Gender='Male', Country='Canada', State='NA'),
 SurveyRecord(Age=31, Gender='Male', Country='United Kingdom', State='NA'),
 SurveyRecord(Age=31, Gender='Male', Country='United States', State='TX'),
 SurveyRecord(Age=33, Gender='Male', Country='United States', State='TN'),
 SurveyRecord(Age=35, Gender='Female', Country='United States', State='MI'),
 SurveyRecord(Age=39, Gender='M', Country='Canada', State='NA')]

In [20]:
# Map each record to a (Country, 1) pair so the records can be counted per country
kvRDD = filterRDD.map(lambda a: (a.Country,1))
kvRDD.collect()

[('United States', 1),
 ('Canada', 1),
 ('United Kingdom', 1),
 ('United States', 1),
 ('United States', 1),
 ('United States', 1),
 ('Canada', 1)]

In [22]:
# Sum the values for each key (Country)
countRDD = kvRDD.reduceByKey(lambda v1,v2: v1+v2)
countRDD.collect()

[('Canada', 2), ('United States', 4), ('United Kingdom', 1)]
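
To make the behaviour of `reduceByKey` concrete, here is a rough single-machine sketch in plain Python using the seven pairs printed by `kvRDD.collect()` above. The helper `reduce_by_key` is a hypothetical name for this illustration and is not how Spark implements the operation: Spark first combines values within each partition and then merges the partial results across partitions, which is why the function passed in should be associative and commutative.

```python
from collections import defaultdict
from functools import reduce

# The (Country, 1) pairs shown in the kvRDD output above
pairs = [
    ("United States", 1), ("Canada", 1), ("United Kingdom", 1),
    ("United States", 1), ("United States", 1), ("United States", 1),
    ("Canada", 1),
]

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (illustration only)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

counts = reduce_by_key(pairs, lambda v1, v2: v1 + v2)
print(counts)
# {'United States': 4, 'Canada': 2, 'United Kingdom': 1}
```

The totals match the Spark result; only the ordering differs, since the order of keys returned by `collect()` after a shuffle is not guaranteed.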

In [23]:
coll = countRDD.collect()

for x in coll:
  print(x)

('Canada', 2)
('United States', 4)
('United Kingdom', 1)


Spark SQL Engine - This optimizes your code and generates efficient Java bytecode. The optimization can be broken down into 4 phases:
1. Analysis - The Spark engine reads your code and generates an abstract syntax tree for your SQL or DataFrame query. In this stage the column names, table names, and data types are resolved. An error (AnalysisException) is raised at run time if any of them cannot be resolved.
2. Optimization - The SQL engine applies rule-based optimization and constructs a set of candidate execution plans. A cost is then assigned to each plan.
3. Physical Planning - The engine picks the most cost-effective logical plan and generates a physical plan from it. The physical plan is a set of RDD operations, which determines how the work will be carried out on the cluster.
4. Code Generation - In this phase the engine generates efficient Java bytecode to run on each machine.