### Application
- A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.

### SparkSession
- An object that provides a point of entry to interact with underlying Spark functionality and allows programming Spark with its APIs. In an interactive Spark shell, the Spark driver instantiates a SparkSession for you, while in a Spark application, you create a SparkSession object yourself.

### Job
- A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()).

### Stage
- Each job gets divided into smaller sets of tasks called stages that depend on each other.

### Task
- A single unit of work or execution that will be sent to a Spark executor.

In [1]:
#!pip install pyspark

In [2]:
#import findspark
#findspark.init()
import pyspark

In [3]:
from pyspark import SparkContext
sc = SparkContext()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/06/02 13:09:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.getOrCreate()

In [6]:
sc = spark.sparkContext

## Important Terms

Let's quickly go over some important terms:

Term                   |Definition
----                   |-------
RDD                    |Resilient Distributed Dataset
Transformation         |Spark operation that produces an RDD
Action                 |Spark operation that produces a local object
Spark Job              |Sequence of transformations on data with a final action

## Creating an RDD

There are two ways to create RDDs: <b>parallelizing</b> an existing collection in your driver program, or <b>referencing a dataset</b> in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Method                      |Result
----------                               |-------
`sc.parallelize(array)`                  |Create RDD of elements of array (or list)
`sc.textFile('path/to/file')`            |Create RDD of lines from file

In [7]:
data = [1,2,3,4,5]
distDataRDD = sc.parallelize(data)

In [8]:
distDataRDD

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [9]:
distDataRDD.collect()

[1, 2, 3, 4, 5]

Once created, the distributed dataset (`distDataRDD`) can be operated on in parallel.

In [10]:
distDataRDD.reduce(lambda a,b: a+b)

15

PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

In [11]:
%%writefile example.txt
first line
second line
third line
fourth line

Overwriting example.txt


In [12]:
distFile = sc.textFile('example.txt')

In [13]:
distFile

example.txt MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

In [14]:
distFile.getNumPartitions()

2

In [15]:
distFile.count()

4

In [16]:
distFile.collect()

['first line', 'second line', 'third line', 'fourth line']

In [17]:
distFile.first()

'first line'

In [18]:
lst = distFile.collect()

In [19]:
lst

['first line', 'second line', 'third line', 'fourth line']

In [20]:
lst[0]

'first line'

In [21]:
secfind = distFile.filter(lambda line:'second' in line)

In [22]:
secfind

PythonRDD[6] at RDD at PythonRDD.scala:53

In [23]:
secfind.collect()

['second line']

In [24]:
thrdfind = distFile.filter(lambda line: 'third' in line)

In [25]:
thrdfind

PythonRDD[7] at RDD at PythonRDD.scala:53

In [26]:
thrdfind.collect()

['third line']

In [27]:
distFile_mapped = distFile.map(lambda s:len(s))

In [28]:
distFile_mapped

PythonRDD[8] at RDD at PythonRDD.scala:53

In [29]:
distFile_mapped.collect()

[10, 11, 10, 11]

In [30]:
distFile_mapped.reduce(lambda a,b : a+b)

42

## RDD Transformations

We can use transformations to build up a set of instructions that we want to perform on the RDD (before we call an action and actually execute them).

A transformation is an operation that creates a new RDD from an existing one. Transformations follow the principle of lazy evaluation: execution does not start until an action is triggered.

Transformation Example                          |Result
----------                               |-------
`filter(lambda x: x % 2 == 0)`           |Discard non-even elements
`map(lambda x: x * 2)`                   |Multiply each RDD element by `2`
`map(lambda x: x.split())`               |Split each string into words
`flatMap(lambda x: x.split())`           |Split each string into words and flatten sequence
`sample(withReplacement=True, fraction=0.25)` |Sample roughly 25% of elements with replacement
`union(rdd)`                             |Append `rdd` to existing RDD
`distinct()`                             |Remove duplicates in RDD
`sortBy(lambda x: x, ascending=False)`   |Sort elements in descending order
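
To make lazy evaluation concrete, here is a minimal plain-Python sketch. It does not use Spark: `LazyDataset` and its methods are hypothetical names chosen for illustration, and they only mimic the idea that transformations are recorded first and executed when an action such as `collect()` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, steps=None):
        self._data = list(data)
        self._steps = steps or []   # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: return a new object with one more recorded step
        return LazyDataset(self._data, self._steps + [("map", fn)])

    def filter(self, pred):
        # Transformation: again, only the recipe grows
        return LazyDataset(self._data, self._steps + [("filter", pred)])

    def collect(self):
        # Action: only now is the recorded recipe actually executed
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)          # lets us observe when the function really runs
    return x * 2

recipe = LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(double)
print(len(calls))            # 0: nothing has executed yet
print(recipe.collect())      # [0, 4, 8]
print(len(calls))            # 3: the work happened during collect()
```

Each transformation returns a new object instead of modifying the old one, which mirrors the immutability of RDDs.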

## RDD Actions

Once you have your 'recipe' of transformations ready, the next step is to execute it by calling an action.

An action is an operation applied to an RDD that makes Spark run the computation and pass the result back to the driver.

Here are some common actions:

Action                             |Result
----------                             |-------
`collect()`                            |Return all RDD elements to the driver as a list
`take(3)`                              |First 3 elements of RDD
`top(3)`                               |3 largest elements of RDD, in descending order
`takeSample(withReplacement=True, num=3)` |Create sample of 3 elements with replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)

In [31]:
%%writefile example2.txt
first 
second line
the third line
then a fourth line

Overwriting example2.txt


In [32]:
# Show RDD
sc.textFile('example2.txt')

example2.txt MapPartitionsRDD[11] at textFile at NativeMethodAccessorImpl.java:0

In [33]:
# Save a reference to this RDD
text_rdd = sc.textFile('example2.txt')

In [34]:
text_rdd

example2.txt MapPartitionsRDD[13] at textFile at NativeMethodAccessorImpl.java:0

In [35]:
text_rdd.take(2)

['first ', 'second line']

In [36]:
text_rdd.collect()

['first ', 'second line', 'the third line', 'then a fourth line']

### Exercise

Create a file `sample.txt` with a few lines of text. Read it and load it into an RDD with the `textFile` Spark function.

### Collect

Action / To Driver: Return all items in the RDD to the driver in a single list

![](http://i.imgur.com/DUO6ygB.png)

In [37]:
text_rdd.collect()

['first ', 'second line', 'the third line', 'then a fourth line']

### Exercise 

Collect the text you read earlier from the `sample.txt` file.

## Transformation
In Spark, the core data structures are immutable, meaning they cannot be changed once created. This might seem like a strange concept at first: if you cannot change a data structure, how are you supposed to use it? To “change” an RDD or DataFrame, you instruct Spark how to derive the one you want from the one you have. These instructions are called transformations, and they are the core of how you express your business logic in Spark. There are two types of transformations: those that specify narrow dependencies and those that specify wide dependencies.
https://databricks.com/glossary/what-are-transformations 

<b>Narrow transformations (narrow dependencies)</b>
In a narrow transformation, each input partition contributes to only one output partition.
![image.png](attachment:image.png)

<b>Wide transformations (wide dependencies)</b>
In a wide transformation, input partitions contribute to many output partitions.
This is often referred to as a <i><b>shuffle</b></i>, in which Spark exchanges data between partitions across the cluster.
![image-2.png](attachment:image-2.png)
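
The difference between the two kinds of dependency can be sketched in plain Python, with partitions modelled as ordinary lists. This is only an illustration under simplifying assumptions: the partition layout, the grouping key `x % 3`, and the routing rule are all made up for the example and are not how Spark partitions data internally.

```python
# Four input "partitions", modelled as plain Python lists
partitions = [[0, 1], [2, 3], [4, 5], [6, 7]]

# Narrow: each output partition is computed from exactly one input partition
narrow = [[x * 10 for x in part] for part in partitions]
print(narrow)   # [[0, 10], [20, 30], [40, 50], [60, 70]]

# Wide: records are regrouped by key, so one input partition can feed
# several output partitions (this regrouping is the shuffle)
num_out = 2
wide = [[] for _ in range(num_out)]
for part in partitions:
    for x in part:
        key = x % 3                      # hypothetical grouping key
        wide[key % num_out].append(x)    # route the record by its key
print(wide)     # [[0, 2, 3, 5, 6], [1, 4, 7]]
```

In the wide case the first input partition `[0, 1]` already ends up split across both output partitions, which is why the data has to move between workers.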

### Map

Transformation / Narrow: Return a new RDD by applying a function to each element of this RDD

![](http://i.imgur.com/PxNJf0U.png)

In [38]:
rdd = sc.parallelize(list(range(8)))
print('rdd elements:         ',rdd.collect())

rdd elements:          [0, 1, 2, 3, 4, 5, 6, 7]


In [39]:
rdd_squared = rdd.map(lambda x: x ** 2).collect() # Square each element
print('rdd elements squared: ', rdd_squared)

rdd elements squared:  [0, 1, 4, 9, 16, 25, 36, 49]


In [40]:
def sq(no):
    return no**2

In [41]:
rdd_squared = rdd.map(sq).collect() # Square each element
print('rdd elements squared: ', rdd_squared)

rdd elements squared:  [0, 1, 4, 9, 16, 25, 36, 49]


In [42]:
# Map a function (or lambda expression) to each line
# Then collect the results.
text_rdd.map(lambda line: line.split()).collect()

[['first'],
 ['second', 'line'],
 ['the', 'third', 'line'],
 ['then', 'a', 'fourth', 'line']]

In [43]:
textRddlst = text_rdd.map(lambda line: line.split()).collect()

In [44]:
textRddlst[1]

['second', 'line']

In [45]:
textRddlst[1][0]

'second'

### Exercise

1- Create an RDD from a list of numbers and apply any built-in math function to it.<br>
2- Apply any math operation using a lambda expression.

## Map vs flatMap

### FlatMap

Transformation / Narrow: Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results

![](http://i.imgur.com/TsSUex8.png)

In [46]:
text_rdd.map(lambda line: line.split()).collect()

[['first'],
 ['second', 'line'],
 ['the', 'third', 'line'],
 ['then', 'a', 'fourth', 'line']]

In [47]:
# Map vs flatMap
# Collect everything as a single flat list
text_rdd.flatMap(lambda line: line.split()).collect()

['first',
 'second',
 'line',
 'the',
 'third',
 'line',
 'then',
 'a',
 'fourth',
 'line']

In [48]:
lstFlatMap = text_rdd.flatMap(lambda line: line.split()).collect()

In [49]:
lstFlatMap[8] 

'fourth'
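
For comparison, the same map versus flatMap distinction can be reproduced with plain Python lists. This sketch does not use Spark; it copies the four lines of `example2.txt` into a hypothetical `lines` list and uses `itertools.chain` to do the flattening.

```python
from itertools import chain

lines = ['first ', 'second line', 'the third line', 'then a fourth line']

mapped = [line.split() for line in lines]                                # like map
flat_mapped = list(chain.from_iterable(line.split() for line in lines))  # like flatMap

print(len(mapped))       # 4: one list per line
print(len(flat_mapped))  # 10: one entry per word
print(flat_mapped[8])    # fourth
```

`map` always yields exactly one output element per input element, while `flatMap` can yield zero, one, or many.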

### Filter

Transformation / Narrow: Return a new RDD containing only the elements that satisfy a predicate

![](http://i.imgur.com/GFyji4U.png)

In [50]:
rdd.collect()

[0, 1, 2, 3, 4, 5, 6, 7]

In [51]:
rdd.filter(lambda x : x%2==0).collect()

[0, 2, 4, 6]

### GroupBy

Transformation / Wide: Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.

![](http://i.imgur.com/gdj0Ey8.png)

In [52]:
rdd = sc.parallelize(['John','Fred','Anna','James','Frank'])
rdd2 = rdd.groupBy(lambda w : w[0])
rdd2.collect()

[('J', <pyspark.resultiterable.ResultIterable at 0x7fccf8751280>),
 ('A', <pyspark.resultiterable.ResultIterable at 0x7fccf8749ee0>),
 ('F', <pyspark.resultiterable.ResultIterable at 0x7fccf8749be0>)]

In [53]:
rdd2.first()

('J', <pyspark.resultiterable.ResultIterable at 0x7fccf86e4070>)

In [54]:
rdd2.first()[1]

<pyspark.resultiterable.ResultIterable at 0x7fccf86e4dc0>

In [55]:
list(rdd2.first()[1])

['John', 'James']

In [56]:
rdd2_lst = rdd2.collect()
[(k,list(v)) for (k,v) in rdd2_lst]

[('J', ['John', 'James']), ('A', ['Anna']), ('F', ['Fred', 'Frank'])]

In [57]:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
result = rdd.groupBy(lambda x: x % 2).collect()

In [58]:
result

[(0, <pyspark.resultiterable.ResultIterable at 0x7fccf86e4040>),
 (1, <pyspark.resultiterable.ResultIterable at 0x7fccf8749070>)]

In [59]:
sorted([(x, sorted(y)) for (x, y) in result])

[(0, [2, 8]), (1, [1, 1, 3, 5])]
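
As a rough mental model, `groupBy` behaves like the following plain-Python helper. The function `group_by` is a hypothetical name for this sketch; unlike Spark, it runs on a single machine and returns ordinary lists rather than `ResultIterable` objects.

```python
from collections import defaultdict

def group_by(items, key_fn):
    """Plain-Python analogue of groupBy: key -> list of items with that key."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)

print(group_by([1, 1, 2, 3, 5, 8], lambda x: x % 2))
# {1: [1, 1, 3, 5], 0: [2, 8]}
print(group_by(['John', 'Fred', 'Anna', 'James', 'Frank'], lambda w: w[0]))
# {'J': ['John', 'James'], 'F': ['Fred', 'Frank'], 'A': ['Anna']}
```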

### GroupByKey

Transformation / Wide: Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.

![](http://i.imgur.com/TlWRGr2.png)

In [60]:
rdd = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
rdd2 = rdd.groupByKey()
rdd2.collect()

[('A', <pyspark.resultiterable.ResultIterable at 0x7fccf8751790>),
 ('B', <pyspark.resultiterable.ResultIterable at 0x7fccf86f4c70>)]

In [61]:
[(j[0], list(j[1])) for j in rdd2.collect()]

[('A', [3, 2, 1]), ('B', [5, 4])]

In [62]:
sorted([(j[0], list(j[1])) for j in rdd2.collect()])

[('A', [3, 2, 1]), ('B', [5, 4])]

In [63]:
sorted([(j[0], sorted(list(j[1]))) for j in rdd2.collect()])

[('A', [1, 2, 3]), ('B', [4, 5])]
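
The grouping that `groupByKey` performs on a pair RDD can be sketched in plain Python as follows. This is a single-machine illustration, not Spark code; the `pairs` list simply repeats the data from the cell above.

```python
from collections import defaultdict

pairs = [('B', 5), ('B', 4), ('A', 3), ('A', 2), ('A', 1)]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)   # keep only the value; the key labels the group

print(sorted(grouped.items()))   # [('A', [3, 2, 1]), ('B', [5, 4])]
```

The contrast with `groupBy` is that the key already exists in each pair, so only the values are collected.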

### Join

Transformation / Wide: Return a new RDD containing all pairs of elements having the same key in the original RDDs

![](http://i.imgur.com/YXL42Nl.png)

In [64]:
rdd1 = sc.parallelize([("a", 1), ("b", 2)])
rdd2 = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
rdd1.join(rdd2).collect()

[('b', (2, 5)), ('a', (1, 3)), ('a', (1, 4))]
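
The pairing logic of an inner join can be written out in plain Python to show where the repeated `'a'` rows come from. `inner_join` is a hypothetical helper for this sketch, and the extra `("c", 9)` pair is added only to show that unmatched keys are dropped. The output order here follows the left list, whereas the order of a Spark join result is not guaranteed.

```python
from collections import defaultdict

def inner_join(left, right):
    """Plain-Python analogue of an inner join on (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one output pair for every matching (left value, right value) combination
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

left = [("a", 1), ("b", 2), ("c", 9)]       # "c" has no match on the right
right = [("a", 3), ("a", 4), ("b", 5)]
print(inner_join(left, right))
# [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]
```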

### Distinct

Transformation / Wide: Return a new RDD containing distinct items from the original RDD (omitting all duplicates)

![](http://i.imgur.com/Vqgy2a4.png)

In [65]:
rdd = sc.parallelize([1,2,3,3,4,5,10,5,5,5,2,2,2])
rdd.distinct().collect()

[1, 2, 3, 4, 5, 10]

In [66]:
txtrdd_flat = text_rdd.flatMap(lambda line : line.split())

In [67]:
txtrdd_flat.collect()

['first',
 'second',
 'line',
 'the',
 'third',
 'line',
 'then',
 'a',
 'fourth',
 'line']

In [68]:
txtrdd_flat.distinct().collect()

['line', 'third', 'fourth', 'first', 'second', 'the', 'then', 'a']

### KeyBy

Transformation / Narrow: Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function.

![](http://i.imgur.com/nqYhDW5.png)

In [69]:
rdd = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
rdd.keyBy(lambda w: w[0]).collect()

[('J', 'John'), ('F', 'Fred'), ('A', 'Anna'), ('J', 'James')]

## Actions

![](http://i.imgur.com/R72uzwX.png)

In [70]:
rdd = sc.parallelize(list(range(8)))

In [71]:
rdd2 = rdd.map(lambda x:x**2)

In [72]:
rdd2.reduce(lambda a,b:a+b) # reduce is an action!

140

In [73]:
from operator import add

In [74]:
rdd2.reduce(add)

140
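
Spark's `reduce` expects a function that is commutative and associative, because partial results are computed per partition and then combined. The plain-Python sketch below imitates that with a made-up two-partition split of the squared values; `partitioned_reduce` is a hypothetical helper, not a Spark function.

```python
from functools import reduce
from operator import add, sub

partitions = [[0, 1, 4, 9], [16, 25, 36, 49]]   # hypothetical split of rdd2

def partitioned_reduce(fn, parts):
    # Reduce inside each partition first, then combine the partial results
    partials = [reduce(fn, part) for part in parts if part]
    return reduce(fn, partials)

flat = [x for part in partitions for x in part]
print(partitioned_reduce(add, partitions), reduce(add, flat))   # 140 140
print(partitioned_reduce(sub, partitions), reduce(sub, flat))   # 80 -140
```

Addition gives the same answer either way, while subtraction does not, which is why a function like `sub` is unsafe to pass to `reduce` on an RDD.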

## Exercise
### Max, Min, Sum, Mean, Variance, Stdev

Action / To Driver: Compute the respective function (maximum value, minimum value, sum, mean, variance, or standard deviation) from a numeric RDD

![](http://i.imgur.com/HUCtib1.png)

In [75]:
rdd2.collect()

[0, 1, 4, 9, 16, 25, 36, 49]

In [76]:
# Using actions
print('Max: ',rdd2.max())
print('Min: ',rdd2.min())
print('Sum: ',rdd2.sum())
print('Mean: ',rdd2.mean())
print('Variance: ',rdd2.variance())
print('Stdev: ',rdd2.stdev())

Max:  49
Min:  0
Sum:  140
Mean:  17.5
Variance:  278.25
Stdev:  16.680827317612277
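
The values printed above correspond to the population variance and standard deviation (dividing by N), which can be cross-checked with Python's `statistics` module. This sketch does not use Spark; `values` is a hypothetical local copy of the data. PySpark RDDs also offer `sampleVariance()` and `sampleStdev()` for the N - 1 versions.

```python
import math
import statistics

values = [float(x ** 2) for x in range(8)]   # 0.0, 1.0, 4.0, ..., 49.0

print(statistics.mean(values))        # 17.5
print(statistics.pvariance(values))   # 278.25 (population variance, divides by N)
print(statistics.variance(values))    # 318.0  (sample variance, divides by N - 1)

# The population standard deviation matches the Stdev value printed above
print(math.isclose(statistics.pstdev(values), 16.680827317612277))   # True
```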


### CountByKey

Action / To Driver: Return a map of keys and counts of their occurrences in the RDD

![](http://i.imgur.com/jvQTGv6.png)

In [77]:
rdd = sc.parallelize([('J', 'James'), ('F','Fred'), 
                    ('A','Anna'), ('J','John')])

In [78]:
rdd.countByKey()

defaultdict(int, {'J': 2, 'F': 1, 'A': 1})
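
The same counting can be expressed in plain Python with `collections.Counter`, which makes it clear that only the keys matter. This is a local sketch, not Spark code; `pairs` repeats the data from the cell above.

```python
from collections import Counter

pairs = [('J', 'James'), ('F', 'Fred'), ('A', 'Anna'), ('J', 'John')]
counts = Counter(key for key, _ in pairs)   # values are ignored, only keys are counted
print(dict(counts))   # {'J': 2, 'F': 1, 'A': 1}
```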

In [79]:
# Stop the local spark cluster
sc.stop()

![image.png](attachment:image.png)

Spark stages are the physical units of execution for the computation of multiple tasks. Stages are determined by the Directed Acyclic Graph (DAG) that Spark builds for the processing and transformations applied to resilient distributed datasets (RDDs). There are two kinds of stage in Spark: ShuffleMapStage and ResultStage. A ShuffleMapStage is an intermediate stage whose tasks prepare data for subsequent stages, whereas a ResultStage is the final stage for a particular set of tasks in a Spark job. ResultSet is associated with the initialization of parameters, counters, and registry values in Spark.
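
One simplified way to picture how a job is cut into stages is to start a new stage at every wide (shuffle) transformation. The sketch below is a toy model under that assumption, not Spark's scheduler: `WIDE` and `split_into_stages` are hypothetical names, and the real scheduler works on the dependency graph rather than on a flat list of operation names.

```python
# Operations labelled "Wide" earlier in this notebook
WIDE = {"groupBy", "groupByKey", "join", "distinct"}

def split_into_stages(lineage):
    """Start a new stage at every wide (shuffle) transformation."""
    stages = [[]]
    for op in lineage:
        if op in WIDE and stages[-1]:
            stages.append([])    # shuffle boundary: close the current stage
        stages[-1].append(op)
    return stages

print(split_into_stages(["textFile", "flatMap", "keyBy", "groupByKey", "map"]))
# [['textFile', 'flatMap', 'keyBy'], ['groupByKey', 'map']]
```

In this picture, every stage except the last plays the role of a ShuffleMapStage, and the last one plays the role of the ResultStage.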