### Application
- A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.

### SparkSession
- An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession object yourself.

### Job
- A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()).

### Stage
- Each job gets divided into smaller sets of tasks called stages that depend on each other.

### Task
- A single unit of work or execution that will be sent to a Spark executor.

In [None]:
# import findspark
# findspark.init()
import pyspark

In [None]:
sc

In [None]:
spark

In [None]:
# SparkContext and SparkSession
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Option 1: configure and create a SparkContext directly.
# Only one SparkContext can be active at a time, so getOrCreate() reuses an existing one.
sconf = SparkConf().setMaster('local[*]').setAppName('RDD')
sc = SparkContext.getOrCreate(conf=sconf)

# Option 2: build a SparkSession and take its SparkContext.
spark = SparkSession.builder.appName('RDD').master('local[*]').getOrCreate()

sc = spark.sparkContext

In [None]:
from pyspark.sql import SparkSession

spark2 = SparkSession.builder.appName('RDD').master('local[*]').getOrCreate()

23/04/08 10:13:56 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [None]:
sc3 = spark.sparkContext

## Important Terms

Let's quickly go over some important terms:

Term                   |Definition
----                   |-------
RDD                    |Resilient Distributed Dataset
Transformation         |Spark operation that produces an RDD
Action                 |Spark operation that produces a local object
Spark Job              |Sequence of transformations on data with a final action

## Creating an RDD

There are two ways to create RDDs: <b>parallelizing</b> an existing collection in your driver program, or <b>referencing a dataset</b> in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Method                      |Result
----------                               |-------
`sc.parallelize(array)`                  |Create RDD of elements of array (or list)
`sc.textFile(path/to/file)`                      |Create RDD of lines from file

In [None]:
import numpy as np

In [None]:
data = np.arange(1,51)

In [None]:
data

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

In [None]:
rdd1 = sc.parallelize(data)

##### Once created, the distributed dataset (`rdd1`) can be operated on in parallel.

In [None]:
rdd1

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [None]:
rdd2 = rdd1.map(lambda x:x**2)

In [None]:
rdd2

PythonRDD[1] at RDD at PythonRDD.scala:53

In [None]:
def f(x):
    return np.sqrt(x)

In [None]:
rdd3 = rdd2.map(f)

In [None]:
rdd3

PythonRDD[2] at RDD at PythonRDD.scala:53

In [None]:
rdd3.getNumPartitions()

4

In [None]:
rdd1.getNumPartitions()

4
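
The partition count determines how many tasks can work on the data at once. As a rough plain-Python model (not Spark's actual code; the helper `split_into_partitions` is hypothetical), a collection can be cut into contiguous, near-equal slices, and a `map` is then applied to each slice independently:

```python
def split_into_partitions(items, num_partitions):
    """Cut a sequence into contiguous, near-equal slices (illustrative only)."""
    items = list(items)
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

partitions = split_into_partitions(range(1, 51), 4)
sizes = [len(part) for part in partitions]

# A map works on every partition separately; no data moves between them.
squared = [[x ** 2 for x in part] for part in partitions]
```

In a live session, `rdd1.glom().collect()` returns the real contents of each partition, which may be split differently from this model.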

In [None]:
rdd1.collect()

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 29,
 30,
 31,
 32,
 33,
 34,
 35,
 36,
 37,
 38,
 39,
 40,
 41,
 42,
 43,
 44,
 45,
 46,
 47,
 48,
 49,
 50]

In [None]:
rdd2.collect()

[1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 100,
 121,
 144,
 169,
 196,
 225,
 256,
 289,
 324,
 361,
 400,
 441,
 484,
 529,
 576,
 625,
 676,
 729,
 784,
 841,
 900,
 961,
 1024,
 1089,
 1156,
 1225,
 1296,
 1369,
 1444,
 1521,
 1600,
 1681,
 1764,
 1849,
 1936,
 2025,
 2116,
 2209,
 2304,
 2401,
 2500]

In [None]:
rdd3.collect()

[1.0,
 2.0,
 3.0,
 4.0,
 5.0,
 6.0,
 7.0,
 8.0,
 9.0,
 10.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 16.0,
 17.0,
 18.0,
 19.0,
 20.0,
 21.0,
 22.0,
 23.0,
 24.0,
 25.0,
 26.0,
 27.0,
 28.0,
 29.0,
 30.0,
 31.0,
 32.0,
 33.0,
 34.0,
 35.0,
 36.0,
 37.0,
 38.0,
 39.0,
 40.0,
 41.0,
 42.0,
 43.0,
 44.0,
 45.0,
 46.0,
 47.0,
 48.0,
 49.0,
 50.0]

In [None]:
rdd4 = rdd3.map(lambda x:x+2)

In [None]:
rdd4

PythonRDD[3] at RDD at PythonRDD.scala:53

In [None]:
rdd5 = rdd4.map(lambda x:x/3)

In [None]:
rdd5

PythonRDD[4] at RDD at PythonRDD.scala:53

In [None]:
rdd5.collect()

[1.0,
 1.3333333333333333,
 1.6666666666666667,
 2.0,
 2.3333333333333335,
 2.6666666666666665,
 3.0,
 3.3333333333333335,
 3.6666666666666665,
 4.0,
 4.333333333333333,
 4.666666666666667,
 5.0,
 5.333333333333333,
 5.666666666666667,
 6.0,
 6.333333333333333,
 6.666666666666667,
 7.0,
 7.333333333333333,
 7.666666666666667,
 8.0,
 8.333333333333334,
 8.666666666666666,
 9.0,
 9.333333333333334,
 9.666666666666666,
 10.0,
 10.333333333333334,
 10.666666666666666,
 11.0,
 11.333333333333334,
 11.666666666666666,
 12.0,
 12.333333333333334,
 12.666666666666666,
 13.0,
 13.333333333333334,
 13.666666666666666,
 14.0,
 14.333333333333334,
 14.666666666666666,
 15.0,
 15.333333333333334,
 15.666666666666666,
 16.0,
 16.333333333333332,
 16.666666666666668,
 17.0,
 17.333333333333332]

In [None]:
spark

In [None]:
rdd6 = rdd4.map(lambda x:x+x)

In [None]:
rdd6

PythonRDD[5] at RDD at PythonRDD.scala:53

In [None]:
rdd6.collect()

[6.0,
 8.0,
 10.0,
 12.0,
 14.0,
 16.0,
 18.0,
 20.0,
 22.0,
 24.0,
 26.0,
 28.0,
 30.0,
 32.0,
 34.0,
 36.0,
 38.0,
 40.0,
 42.0,
 44.0,
 46.0,
 48.0,
 50.0,
 52.0,
 54.0,
 56.0,
 58.0,
 60.0,
 62.0,
 64.0,
 66.0,
 68.0,
 70.0,
 72.0,
 74.0,
 76.0,
 78.0,
 80.0,
 82.0,
 84.0,
 86.0,
 88.0,
 90.0,
 92.0,
 94.0,
 96.0,
 98.0,
 100.0,
 102.0,
 104.0]

In [None]:
rdd5.count()

50

In [None]:
rdd5.reduce(lambda a,b:a+b)

458.33333333333337

In [None]:
rdd1.map(lambda x:x**2).map(lambda x:x+x).map(f).reduce(lambda a,b:a+b)

1803.122292025696
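
The two totals above can be cross-checked without Spark. The sketch below repeats the same arithmetic in plain Python and compares it with a closed form; it uses `math.sqrt` in place of `np.sqrt`, which is assumed to give the same values here.

```python
import math
from functools import reduce
from operator import add

data = range(1, 51)

# Same pipeline as the RDD chain: square, double, square root, then sum.
pipeline_total = reduce(add, (math.sqrt(2 * x ** 2) for x in data))

# Closed form: sqrt(2 * x^2) = sqrt(2) * x, and 1 + 2 + ... + 50 = 1275.
closed_form = math.sqrt(2) * 1275

# rdd5 holds (x + 2) / 3 for x in 1..50, so its sum is (1275 + 100) / 3.
rdd5_total = sum((math.sqrt(x ** 2) + 2) / 3 for x in data)
```

Both values agree with the Spark results up to floating-point rounding, which is a quick way to confirm that a chain of transformations does what was intended.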

In [None]:
spark

In [None]:
rdd3.cache()

PythonRDD[2] at RDD at PythonRDD.scala:53

PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

In [None]:
%%writefile example.txt
first line
second line
third line
fourth line

Overwriting example.txt


In [None]:
distFile = sc.textFile('example.txt')
distFile

example.txt MapPartitionsRDD[10] at textFile at DirectMethodHandleAccessor.java:104

In [None]:
distFile2 = sc.textFile('csv')

In [None]:
distFile2

csv MapPartitionsRDD[12] at textFile at DirectMethodHandleAccessor.java:104

In [None]:
distFile.getNumPartitions()

2

In [None]:
distFile2.getNumPartitions()

6

In [None]:
distFile.count()

4

In [None]:
distFile2.count()

1508

In [None]:
distFile.collect()

['first line', 'second line', 'third line', 'fourth line']

In [None]:
distFile2.collect()

['DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count',
 'United States,Romania,1',
 'United States,Ireland,264',
 'United States,India,69',
 'Egypt,United States,24',
 'Equatorial Guinea,United States,1',
 'United States,Singapore,25',
 'United States,Grenada,54',
 'Costa Rica,United States,477',
 'Senegal,United States,29',
 'United States,Marshall Islands,44',
 'Guyana,United States,17',
 'United States,Sint Maarten,53',
 'Malta,United States,1',
 'Bolivia,United States,46',
 'Anguilla,United States,21',
 'Turks and Caicos Islands,United States,136',
 'United States,Afghanistan,2',
 'Saint Vincent and the Grenadines,United States,1',
 'Italy,United States,390',
 'United States,Russia,156',
 'United States,Federated States of Micronesia,48',
 'Pakistan,United States,9',
 'United States,Netherlands,570',
 'Iceland,United States,118',
 'Marshall Islands,United States,77',
 'Luxembourg,United States,91',
 'Honduras,United States,391',
 'The Bahamas,United States,903',
 'El Salvador,United State

In [None]:
distFile3 = sc.textFile('json')

In [None]:
distFile3.count()

1502

In [None]:
distFile3.collect()

['{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":1}',
 '{"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":264}',
 '{"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":69}',
 '{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":24}',
 '{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Equatorial Guinea","count":1}',
 '{"ORIGIN_COUNTRY_NAME":"Singapore","DEST_COUNTRY_NAME":"United States","count":25}',
 '{"ORIGIN_COUNTRY_NAME":"Grenada","DEST_COUNTRY_NAME":"United States","count":54}',
 '{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Costa Rica","count":477}',
 '{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Senegal","count":29}',
 '{"ORIGIN_COUNTRY_NAME":"Marshall Islands","DEST_COUNTRY_NAME":"United States","count":44}',
 '{"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Guyana","count":17}',
 '{"ORIGIN_COUNTRY_NAME":"Sint Maarten","DEST_

In [None]:
distFile4 = sc.textFile('parquet/')

In [None]:
distFile4.count()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: Path: /home/hatemelattar/AI Intake 43/PySpark/Ubuntu_Final_Spark_Intake_2/L2_RDD_DataFrames/parquet/2010-summary.parquet is a directory, which is not supported by the record reader when `mapreduce.input.fileinputformat.input.dir.recursive` is false.
	at org.apache.spark.errors.SparkCoreErrors$.pathNotSupportedError(SparkCoreErrors.scala:69)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:240)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)
	at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:55)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:292)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:288)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:1589)


In [None]:
distFile4 = sc.textFile('parquet/2010-summary.parquet')

In [None]:
distFile4.count()

24
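
The count of 24 is not a record count: `textFile` treats the binary file as text and splits it wherever a newline byte happens to occur. Parquet files begin with the 4-byte magic `PAR1`, so a cheap check can flag a file that should not be read line by line. The sketch below is a minimal illustration; `looks_like_parquet` is a hypothetical helper and the temporary files are made-up stand-ins, not real Parquet data.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"

def looks_like_parquet(path):
    """Return True if the file starts with the Parquet magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(4) == PARQUET_MAGIC

# Demonstrate on two throwaway files (the second only imitates Parquet).
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "example.txt")
    fake_parquet_path = os.path.join(tmp, "fake.parquet")
    with open(text_path, "w") as fh:
        fh.write("first line\nsecond line\n")
    with open(fake_parquet_path, "wb") as fh:
        fh.write(PARQUET_MAGIC + b"\x15\x04binary\npayload\n" + PARQUET_MAGIC)
    text_is_parquet = looks_like_parquet(text_path)
    fake_is_parquet = looks_like_parquet(fake_parquet_path)
```

For real Parquet data, the DataFrame reader (`spark.read.parquet(...)`) is the usual entry point rather than `sc.textFile`.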

In [None]:
distFile4.collect()

['PAR1\x15\x04\x15�\x1a\x15�\x0eL\x15�\x01\x15\x04\x00\x00\x1f�\x08\x00\x00\x00\x00\x00\x00\x03UUMs\x1b7\x0c�!V�|�ۤ��:�=\x7fBVdY��Q,ۇް��.".��VV��\x00\x18�Ӌ�%A���\x01zwvv��(be�\x11"��|0��.�.Ə\x04�\x07\x02k�\x1c�k>��>���\x12~�-:���\u061c�\x01\x1cH�\x15ب�W��3�\x05�\x13W\'�\x16�b�>�]o�Uf',
 'T��,z�_���@.�Gr%�\x7f�',
 '�y@\x07\x15CP��\x08v��\x1b�\x11�q�kQ�\x04����o��c\\�}�~b[�\x14jyx�]�\x02�o\x04',
 'ǿ�\x06��9�f\x0b�\x19*\x1f�\x15O��%\x1c`�H6���)\x1e0H���)�W\x00!�\x1b���Y�\x0f� ��',
 '*���\x17P{��\x16�/n��?s%�N�',
 '-��',
 '\x7f2=RܬL\x02�B���-\x08�a��\x1e���\x06H�H�G��!1\x1b#���i3V>���\x1f�\x19��\x04',
 '�',
 'B�*-q�\x15\x15@�s�kܛ\x7f\x11$�<��\x02Y',
 'b�$WC\x18F�\x03V��Z�\x01�',
 "�Yl�w�!�\x15�������\\��=��*r�ϵ�\x12x��\x057(�C��é�A�{��\x05C��\x0e]H*�o\x10!�b㚾ӟ�TK�1+m���˛{�=8ys\x03\x14I��iϦ\x06d�*�[�p]\x01^\x06��'��2x�`f�+��S�\x07�_|�E���\x1dv��T��C��_�\x13�r6\t��wrƊ.\x1b��_C�p\x15�'!$0���\x19\x085���Gz\x03\x07��/�7�\x0e����\x1cO\x02\x14f֒���Y�5t>`n*S��Uڦ",
 '�M ~A�B�\t�\x1b\x0ci�Pk\x1eA�\x

In [None]:
distFile.first()

'first line'

In [None]:
lst = distFile.collect()

In [None]:
lst

['first line', 'second line', 'third line', 'fourth line']

In [None]:
lst[0]

'first line'

In [None]:
secfind = distFile.filter(lambda line:'second' in line)

In [None]:
secfind

PythonRDD[25] at RDD at PythonRDD.scala:53

In [None]:
secfind.collect()

['second line']

In [None]:
EGYfind = distFile2.filter(lambda line:'Egypt'in line)

In [None]:
EGYfind

PythonRDD[26] at RDD at PythonRDD.scala:53

In [None]:
EGYfind.collect()

['Egypt,United States,24',
 'United States,Egypt,25',
 'Egypt,United States,15',
 'United States,Egypt,12',
 'Egypt,United States,13',
 'United States,Egypt,12',
 'Egypt,United States,13',
 'United States,Egypt,12',
 'Egypt,United States,11',
 'United States,Egypt,11',
 'Egypt,United States,13',
 'United States,Egypt,15']

In [None]:
EGYfind.count()

12

In [None]:
thrdfind = distFile.filter(lambda line: 'third' in line)

In [None]:
thrdfind

PythonRDD[28] at RDD at PythonRDD.scala:53

In [None]:
thrdfind.collect()

['third line']

In [None]:
distFile_mapped = distFile.map(lambda s:len(s))

In [None]:
distFile_mapped

PythonRDD[29] at RDD at PythonRDD.scala:53

In [None]:
distFile_mapped.collect()

[10, 11, 10, 11]

In [None]:
distFile_mapped.reduce(lambda a,b : a+b)

42

In [None]:
distFile2_mapped = distFile2.map(lambda s:s.split(','))

In [None]:
distFile2_mapped

PythonRDD[31] at RDD at PythonRDD.scala:53

In [None]:
distFile2_mapped.take(5)

[['DEST_COUNTRY_NAME', 'ORIGIN_COUNTRY_NAME', 'count'],
 ['United States', 'Romania', '1'],
 ['United States', 'Ireland', '264'],
 ['United States', 'India', '69'],
 ['Egypt', 'United States', '24']]
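
Because the file is comma-separated, its fields are recovered by splitting on `,` rather than on whitespace. The sketch below shows one way to parse such lines in plain Python, skipping the header and converting the count to an integer. `parse_flight_line` is a hypothetical helper, and the naive split assumes no field contains a quoted comma.

```python
def parse_flight_line(line):
    """Split one CSV line into (dest, origin, count); hypothetical helper."""
    dest, origin, count = line.split(",")
    return dest, origin, int(count)

lines = [
    "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count",
    "United States,Romania,1",
    "United States,Ireland,264",
    "Egypt,United States,24",
]

header = lines[0]
# Drop the header row, then parse every remaining line.
records = [parse_flight_line(line) for line in lines if line != header]
```

The same function could be handed to an RDD chain, for example `distFile2.filter(lambda l: l != header).map(parse_flight_line)`, assuming `header` was obtained with `distFile2.first()`.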

## RDD Transformations

We can use transformations to build up a set of instructions to perform on an RDD; nothing runs until we call an action.

A transformation creates a new RDD from an existing one. Transformations are evaluated lazily: execution does not start until an action is triggered.

Transformation Example                          |Result
----------                               |-------
`filter(lambda x: x % 2 == 0)`           |Discard non-even elements
`map(lambda x: x * 2)`                   |Multiply each RDD element by `2`
`map(lambda x: x.split())`               |Split each string into words
`flatMap(lambda x: x.split())`           |Split each string into words and flatten sequence
`sample(withReplacement=True, fraction=0.25)` |Sample roughly 25% of elements with replacement
`union(rdd)`                             |Append `rdd` to existing RDD
`distinct()`                             |Remove duplicates in RDD
`sortBy(lambda x: x, ascending=False)`   |Sort elements in descending order
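
Lazy evaluation can be imitated with a Python generator. This is only an analogy, not how Spark is implemented: defining the pipeline runs nothing, and the work happens when the result is consumed, much as an action triggers the transformations before it. The function `traced_square` is a hypothetical helper that records each call.

```python
calls = []

def traced_square(x):
    """Square x and record that the function actually ran."""
    calls.append(x)
    return x * x

# Building the pipeline runs nothing, like chaining transformations.
pipeline = (traced_square(x) for x in range(8) if x % 2 == 0)
calls_before_action = len(calls)

# Consuming the generator plays the role of an action such as collect().
result = list(pipeline)
calls_after_action = len(calls)
```

Here `calls_before_action` is 0 and `calls_after_action` is 4, which mirrors why an RDD printed before an action shows only its lineage and no data.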

## RDD Actions

Once you have your 'recipe' of transformations ready, what you will do next is execute them by calling an action.

An action triggers Spark to run the pending computation on an RDD and return the result to the driver.

Here are some common actions:

Action                             |Result
----------                             |-------
`collect()`                            |Convert RDD to in-memory list 
`take(3)`                              |First 3 elements of RDD 
`top(3)`                               |Largest 3 elements of RDD, in descending order
`takeSample(withReplacement=True, num=3)` |Create sample of 3 elements with replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)

In [None]:
%%writefile example2.txt
first 
second line
the third line
then a fourth line

Overwriting example2.txt


In [None]:
# Show RDD
sc.textFile('example2.txt')

example2.txt MapPartitionsRDD[1] at textFile at DirectMethodHandleAccessor.java:104

In [None]:
# Save a reference to this RDD
text_rdd = sc.textFile('example2.txt')

In [None]:
text_rdd

example2.txt MapPartitionsRDD[3] at textFile at DirectMethodHandleAccessor.java:104

In [None]:
text_rdd.take(2)

['first ', 'second line']

In [None]:
text_rdd.collect()

['first ', 'second line', 'the third line', 'then a fourth line']

### Collect

Action / To Driver: Return all items in the RDD to the driver in a single list

![](http://i.imgur.com/DUO6ygB.png)

In [None]:
text_rdd.collect()

['first ', 'second line', 'the third line', 'then a fourth line']

## Transformation
In Spark, the core data structures are immutable: they cannot be changed once created. This might seem strange at first. If you cannot change a dataset, how are you supposed to use it? To "change" an RDD or DataFrame, you instruct Spark how to derive the one you want from the one you have. These instructions are called transformations, and they are how you express your business logic in Spark. There are two types of transformations: those that specify narrow dependencies and those that specify wide dependencies.
https://databricks.com/glossary/what-are-transformations 

<b>Narrow transformation — specify narrow dependencies</b>
Narrow transformations are those where each input partition contributes to only one output partition.
![image.png](attachment:image.png)

<b>Wide transformation — specify wide dependencies.</b>
Wide transformations have input partitions contributing to many output partitions.
You will often hear this referred to as a <i><b>shuffle</b></i>, where Spark exchanges partitions across the cluster.
![image-2.png](attachment:image-2.png)
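
The difference can be modelled with plain Python lists standing in for partitions. This is a simplified sketch, not Spark's shuffle implementation: `shuffle_by_key` is a hypothetical helper, and routing by `ord()` of a one-letter key is an assumption chosen only to keep the result deterministic.

```python
from collections import defaultdict

partitions = [["John", "Fred"], ["Anna", "James"], ["Frank"]]

# Narrow: each output partition is computed from exactly one input partition.
upper = [[name.upper() for name in part] for part in partitions]

def shuffle_by_key(parts, key_func, num_out):
    """Route every record to an output partition chosen from its key."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for record in part:
            key = key_func(record)
            # ord() keeps the routing deterministic for this illustration.
            out[ord(key) % num_out][key].append(record)
    return [dict(groups) for groups in out]

# Wide: records with the same key are gathered from several input partitions.
grouped = shuffle_by_key(partitions, lambda name: name[0], num_out=2)
```

In `grouped`, the key `'J'` collects `'John'` from the first input partition and `'James'` from the second, which is exactly the cross-partition movement that makes a wide transformation more expensive than a narrow one.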

### Map

Transformation / Narrow: Return a new RDD by applying a function to each element of this RDD

![](http://i.imgur.com/PxNJf0U.png)

In [None]:
rdd = sc.parallelize(list(range(8)))
print('rdd elements:         ',rdd.collect())

rdd elements:          [0, 1, 2, 3, 4, 5, 6, 7]


In [None]:
rdd_squared = rdd.map(lambda x: x ** 2).collect() # Square each element
print('rdd elements squared: ', rdd_squared)

rdd elements squared:  [0, 1, 4, 9, 16, 25, 36, 49]


In [None]:
def sq(no):
    return no**2

In [None]:
rdd_squared = rdd.map(sq).collect() # Square each element
print('rdd elements squared: ', rdd_squared)

rdd elements squared:  [0, 1, 4, 9, 16, 25, 36, 49]


In [None]:
# Map a function (or lambda expression) to each line
# Then collect the results.
text_rdd.map(lambda line: line.split()).collect()

[['first'],
 ['second', 'line'],
 ['the', 'third', 'line'],
 ['then', 'a', 'fourth', 'line']]

In [None]:
textRddlst = text_rdd.map(lambda line: line.split()).collect()

In [None]:
textRddlst[1]

['second', 'line']

In [None]:
textRddlst[1][0]

'second'

## Map vs flatMap

### FlatMap

Transformation / Narrow: Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results

![](http://i.imgur.com/TsSUex8.png)

In [None]:
text_rdd.map(lambda line: line.split()).collect()

[['first'],
 ['second', 'line'],
 ['the', 'third', 'line'],
 ['then', 'a', 'fourth', 'line']]

In [None]:
# Map vs flatMap
# Collect everything as a single flat list
text_rdd.flatMap(lambda line: line.split()).collect()

['first',
 'second',
 'line',
 'the',
 'third',
 'line',
 'then',
 'a',
 'fourth',
 'line']

In [None]:
lstFlatMap = text_rdd.flatMap(lambda line: line.split()).collect()

In [None]:
lstFlatMap[8] 

'fourth'
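
For comparison, the same two results can be reproduced in plain Python. The sketch below is only an analogy for the semantics: `map` keeps one output element per input element, while `flatMap` applies the function and then flattens one level, which `itertools.chain.from_iterable` does locally.

```python
from itertools import chain

lines = ["first ", "second line", "the third line", "then a fourth line"]

# map: one output element per input element, so the result is nested.
mapped = [line.split() for line in lines]

# flatMap: apply the same function, then flatten one level.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))
```

`mapped` has 4 elements (one list per line), while `flat_mapped` has 10 (one per word), so indexing into the two results means different things.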

### Filter

Transformation / Narrow: Return a new RDD containing only the elements that satisfy a predicate

![](http://i.imgur.com/GFyji4U.png)

In [None]:
rdd = sc.parallelize(list(range(8)))
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7]

In [None]:
rdd.filter(lambda x : x%2==0).collect()

[0, 2, 4, 6]

### GroupBy

Transformation / Wide: Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.

![](http://i.imgur.com/gdj0Ey8.png)

In [None]:
rdd = sc.parallelize(['John','Fred','Anna','James','Frank'])
rdd2 = rdd.groupBy(lambda w : w[0])
rdd2.collect()

[('J', <pyspark.resultiterable.ResultIterable at 0x7f904084c3a0>),
 ('F', <pyspark.resultiterable.ResultIterable at 0x7f90281bf310>),
 ('A', <pyspark.resultiterable.ResultIterable at 0x7f90281bf370>)]

In [None]:
import numpy as np
rdd3 = rdd2.map(lambda T:(T[0],len(T[1])))
rdd3.collect()

[('J', 2), ('F', 2), ('A', 1)]

In [None]:
rdd4 = rdd2.map(lambda T:(T[0],sorted(T[1])))
rdd4.collect()

[('J', ['James', 'John']), ('F', ['Frank', 'Fred']), ('A', ['Anna'])]

In [None]:
rdd2.first()

('J', <pyspark.resultiterable.ResultIterable at 0x7f90209c7040>)

In [None]:
rdd2.first()[1]

<pyspark.resultiterable.ResultIterable at 0x7f90209c7d00>

In [None]:
list(rdd2.first()[1])

['John', 'James']

In [None]:
rdd2_lst = rdd2.collect()
[(k,list(v)) for (k,v) in rdd2_lst]

[('J', ['John', 'James']), ('F', ['Fred', 'Frank']), ('A', ['Anna'])]

In [None]:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
rdd_g = rdd.groupBy(lambda x: x % 2)
result = rdd_g.collect()

In [None]:
result

[(0, <pyspark.resultiterable.ResultIterable at 0x7f9020938af0>),
 (1, <pyspark.resultiterable.ResultIterable at 0x7f90209f52e0>)]

In [None]:
sorted([(x, sorted(y)) for (x, y) in result])

[(0, [2, 8]), (1, [1, 1, 3, 5])]

In [None]:
rdd_sorted = rdd_g.map(lambda T:(T[0],sorted(T[1])))
rdd_sorted.collect()

[(0, [2, 8]), (1, [1, 1, 3, 5])]

### GroupByKey

Transformation / Wide: Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.

![](http://i.imgur.com/TlWRGr2.png)

In [None]:
rdd = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
rdd2 = rdd.groupByKey()
rdd2.collect()

[('B', <pyspark.resultiterable.ResultIterable at 0x7f90209c78e0>),
 ('A', <pyspark.resultiterable.ResultIterable at 0x7f90209f53d0>)]

In [None]:
[(j[0], list(j[1])) for j in rdd2.collect()]

[('B', [5, 4]), ('A', [3, 2, 1])]

In [None]:
sorted([(j[0], list(j[1])) for j in rdd2.collect()])

[('A', [3, 2, 1]), ('B', [5, 4])]

In [None]:
sorted([(j[0], sorted(list(j[1]))) for j in rdd2.collect()])

[('A', [1, 2, 3]), ('B', [4, 5])]

In [None]:
rdd_sorted = rdd2.map(lambda T:(T[0],sorted(T[1]))).sortByKey()

rdd_sorted.collect()

[('A', [1, 2, 3]), ('B', [4, 5])]

### Join

Transformation / Wide: Return a new RDD containing all pairs of elements having the same key in the original RDDs

![](http://i.imgur.com/YXL42Nl.png)

In [None]:
rdd1 = sc.parallelize([("a", 1), ("b", 2)])
rdd2 = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
rdd3 = rdd1.join(rdd2)
rdd3.collect()

[('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]

In [None]:
rdd4 = rdd3.groupByKey()
rdd4.collect()

[('a', <pyspark.resultiterable.ResultIterable at 0x7f901b8a0f10>),
 ('b', <pyspark.resultiterable.ResultIterable at 0x7f901bbc0430>)]

In [None]:
[(k,list(v)) for (k,v) in rdd4.collect()]

[('a', [(1, 3), (1, 4)]), ('b', [(2, 5)])]

In [None]:
rdd5 = rdd4.map(lambda x:(x[0],list(x[1])))
rdd5.collect()

[('a', [(1, 3), (1, 4)]), ('b', [(2, 5)])]
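
The pairing rule behind `join` can be written out in plain Python. This is a minimal model of an inner join on keys, not Spark's distributed implementation; `inner_join` is a hypothetical helper, and the extra key `"c"` is added here only to show that unmatched keys are dropped.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair up values that share a key (plain-Python model of join)."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (lv, rv))
        for key, lv in left
        for rv in right_by_key.get(key, [])
    ]

left = [("a", 1), ("b", 2), ("c", 9)]
right = [("a", 3), ("a", 4), ("b", 5)]
joined = inner_join(left, right)
```

Each left value is paired with every right value that has the same key, so a key occurring m times on one side and n times on the other yields m × n pairs.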

### Distinct

Transformation / Wide: Return a new RDD containing distinct items from the original RDD (omitting all duplicates)

![](http://i.imgur.com/Vqgy2a4.png)

In [None]:
rdd = sc.parallelize([1,2,3,3,4,5,10,5,5,5,2,2,2])
rdd.distinct().collect()

[4, 1, 5, 2, 10, 3]

In [None]:
txtrdd_flat = text_rdd.flatMap(lambda line : line.split())

In [None]:
txtrdd_flat.collect()

['first',
 'second',
 'line',
 'the',
 'third',
 'line',
 'then',
 'a',
 'fourth',
 'line']

In [None]:
txtrdd_flat.distinct().collect()

['line', 'third', 'fourth', 'first', 'second', 'the', 'then', 'a']
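
Note that the order of the distinct elements above does not follow the input order. A plain-Python set behaves similarly, which the sketch below uses as an analogy; when a reproducible order matters, sort the result or, for local data, keep first-seen order explicitly.

```python
words = ["first", "second", "line", "the", "third",
         "line", "then", "a", "fourth", "line"]

# A set drops duplicates but, like distinct(), promises no particular order.
unique_unordered = set(words)

# Sorting afterwards gives a reproducible result.
unique_sorted = sorted(unique_unordered)

# dict.fromkeys keeps first-seen order, if that is what is wanted locally.
unique_first_seen = list(dict.fromkeys(words))
```

On an RDD, the equivalent of the sorted variant would be to sort after `distinct()`, for example with `sortBy(lambda x: x)`.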

### KeyBy

Transformation / Narrow: Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function.

![](http://i.imgur.com/nqYhDW5.png)

In [None]:
rdd = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
rdd.keyBy(lambda w: w[0]).collect()

[('J', 'John'), ('F', 'Fred'), ('A', 'Anna'), ('J', 'James')]

## Actions

![](http://i.imgur.com/R72uzwX.png)

In [None]:
rdd = sc.parallelize(list(range(8)))

In [None]:
rdd2 = rdd.map(lambda x:x**2)

In [None]:
rdd2.reduce(lambda a,b:a+b) # reduce is an action!

140

In [None]:
from operator import add

In [None]:
rdd2.reduce(add)

140
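
`reduce` expects a function that is commutative and associative, because each partition is reduced on its own and the partial results are then combined. The plain-Python sketch below models that two-step process (`partitioned_reduce` is a hypothetical helper) and shows what goes wrong with subtraction, which is neither.

```python
from functools import reduce
from operator import add, sub

def partitioned_reduce(func, partitions):
    """Reduce each partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)

data = [x ** 2 for x in range(8)]
two_parts = [data[:4], data[4:]]
four_parts = [data[:2], data[2:4], data[4:6], data[6:]]

sum_two = partitioned_reduce(add, two_parts)
sum_four = partitioned_reduce(add, four_parts)

# Subtraction is not associative, so the answer depends on the split.
diff_two = partitioned_reduce(sub, two_parts)
diff_four = partitioned_reduce(sub, four_parts)
```

Addition gives 140 for either split, whereas subtraction gives 80 with two partitions and 26 with four, so its result would change with the partitioning.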

### Max, Min, Sum, Mean, Variance, Stdev

Action / To Driver: Compute the respective function (maximum value, minimum value, sum, mean, variance, or standard deviation) from a numeric RDD

![](http://i.imgur.com/HUCtib1.png)

In [None]:
rdd2.collect()

[0, 1, 4, 9, 16, 25, 36, 49]

In [None]:
# Using actions
print('Max: ',rdd2.max())
print('Min: ',rdd2.min())
print('Sum: ',rdd2.sum())
print('Mean: ',rdd2.mean())
print('Variance: ',rdd2.variance())
print('Stdev: ',rdd2.stdev())

Max:  49
Min:  0
Sum:  140
Mean:  17.5
Variance:  278.25000000000006
Stdev:  16.68082731761228
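
The variance printed above is the population variance (dividing by N), which can be confirmed with Python's `statistics` module. The sketch below recomputes the figures locally; small differences in the last digits are floating-point rounding.

```python
import statistics

values = [0, 1, 4, 9, 16, 25, 36, 49]

mean = statistics.mean(values)
population_variance = statistics.pvariance(values)   # divides by N
population_stdev = statistics.pstdev(values)
sample_variance = statistics.variance(values)        # divides by N - 1
```

When the sample versions are wanted instead, RDDs also provide `sampleVariance()` and `sampleStdev()`.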


### CountByKey

Action / To Driver: Return a map of keys and counts of their occurrences in the RDD

![](http://i.imgur.com/jvQTGv6.png)

In [None]:
rdd = sc.parallelize([('J', 'James'), ('F','Fred'), 
                    ('A','Anna'), ('J','John')])

In [None]:
rdd.countByKey()

defaultdict(int, {'J': 2, 'F': 1, 'A': 1})

In [None]:
dic = rdd.countByKey()

In [None]:
dic

defaultdict(int, {'J': 2, 'F': 1, 'A': 1})

In [None]:
dic.get('J')

2

In [None]:
dic.get('F')

1
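
The counting itself can be modelled locally with `collections.Counter`. This is an analogy for what `countByKey` returns to the driver, not its implementation.

```python
from collections import Counter

pairs = [("J", "James"), ("F", "Fred"), ("A", "Anna"), ("J", "John")]

# Count how often each key occurs; the values are ignored.
key_counts = Counter(key for key, _ in pairs)

j_count = key_counts.get("J")
missing = key_counts.get("Z")        # None: .get() never inserts a key
```

One detail worth knowing about the `defaultdict(int)` shown above: `dic.get('Z')` returns `None` for an absent key, while `dic['Z']` returns 0 and inserts the key as a side effect.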

In [None]:
# Stop the local spark cluster
sc.stop()

### Spark Stages

A stage is the physical unit of execution for a set of tasks that can run without a shuffle. Spark's scheduler derives the stages of a job from the Directed Acyclic Graph (DAG) of transformations on the RDDs, cutting the graph at shuffle boundaries. There are two kinds of stage: ShuffleMapStage and ResultStage. A ShuffleMapStage is an intermediate stage whose tasks write shuffle output for later stages to read, whereas a ResultStage is the final stage of a job; its tasks compute the result of the action that started the job.
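
How a job breaks into stages can be sketched with a toy model. This is a deliberate simplification, assuming a linear chain of operations and the same number of partitions throughout; `split_into_stages` and the example lineage are hypothetical, and Spark's real scheduler works on a full DAG.

```python
def split_into_stages(operations):
    """Group a linear chain of (name, is_wide) operations into stages.

    A wide operation needs a shuffle, so it starts a new stage.
    """
    stages = [[]]
    for name, is_wide in operations:
        if is_wide and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages

# Hypothetical lineage: narrow steps, one shuffle, then another narrow step.
lineage = [
    ("textFile", False),
    ("flatMap", False),
    ("keyBy", False),
    ("groupByKey", True),
    ("map", False),
]

stages = split_into_stages(lineage)
num_partitions = 4
# One task per partition in each stage (assumed equal here).
tasks_per_stage = [num_partitions for _ in stages]
```

In this model the first stage would play the role of a ShuffleMapStage and the last one that of the ResultStage, giving 2 stages and 8 tasks for the job.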