<p align="center">
   <img src="spark.png">
</p>

# Spark Hands-on: Spark RDD

In this notebook you will discover the **Spark** framework and its Python API (PySpark). We will cover the basic transformations and actions with **Spark Core** and its main component: the **Resilient Distributed Dataset** (RDD).

In the next notebooks, we will go through the **SQL API**, which allows higher-level queries that are optimized by the Spark engine (Catalyst and Tungsten). Recall that **Spark SQL**, which is built on top of RDDs, **is the recommended API to analyse data** whenever possible, unless you need low-level operations or work with very unstructured data.
See [RDD vs DataFrame vs DataSet](https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/) for a comprehensive comparison (the typed *Dataset* API is available only in Scala and Java, not in Python).


Finally, we will see a basic application of a machine learning algorithm trained at scale through **SparkML/MLlib pipelines**.

*Nota*: Spark streaming (DStreams and Structured Streaming) and the Graph API are not covered in this notebook.

For an introduction to the main Spark concepts, please refer to the [confluence page](https://confluence.airbus.corp/display/E2H89ATAR/Introduction+to+Spark).

## 1. Spark Configuration

Before doing any computation, Spark needs to be configured. To do that, you create a **SparkSession**, which is the entry point to the cluster and points to its master node. With this object, you can either connect to the master node and inherit the default configuration of the cluster, or set the resources you need for your application.

*Nota*: In earlier versions (before Spark 2.0), you had to create a SparkContext first; it is now encapsulated in the SparkSession.

In [1]:
# Some imports

# Spark
import pyspark
from pyspark.sql import SparkSession


# Pandas
import pandas as pd

# Plot
%matplotlib inline
import matplotlib.pyplot as plt

# Foundry
import foundrywrapper
from foundrywrapper import FoundryWrapper

In [2]:
# Required for Foundry; not needed by Spark itself
fw = FoundryWrapper()
fw

The SparkSession is set with the master node URL, and we override the defaults with a custom config.
Here the driver node holds the main application. It negotiates resources with the resource manager (e.g. Kubernetes, YARN ...) and it is where the application is launched (typically the JAR of a Spark application). We allocate 1 GB of RAM and 2 CPUs to it.

Then, for the workers, we use 2 executors (each one is a separate JVM) with 4 GB of RAM and 1 core each.

This configuration should be tuned according to the computations to run, the data size ...
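
To make the sizing above concrete, the following plain-Python sketch (it does not touch Spark) adds up what this configuration requests from the cluster. The dictionary simply mirrors the values used in this notebook. The parallelism figure assumes Spark's default `spark.task.cpus=1`, and the memory total ignores any per-container overhead added by the resource manager.

```python
# Plain-Python arithmetic on the settings used in this notebook (no Spark needed).
requested = {
    "driver_memory_gb": 1,
    "driver_cores": 2,
    "executor_instances": 2,
    "executor_memory_gb": 4,
    "executor_cores": 1,
}

total_executor_memory_gb = requested["executor_instances"] * requested["executor_memory_gb"]
# Assumption: default spark.task.cpus=1, so each executor core runs one task at a time.
max_parallel_tasks = requested["executor_instances"] * requested["executor_cores"]
total_memory_gb = requested["driver_memory_gb"] + total_executor_memory_gb

print(f"Executor memory in total: {total_executor_memory_gb} GB")
print(f"Tasks running at the same time: {max_parallel_tasks}")
print(f"Memory requested overall (driver + executors): {total_memory_gb} GB")
```

With only 2 tasks running at the same time, adding cores per executor is usually the first setting to revisit when a job is CPU-bound.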

In [3]:
spark = (SparkSession.builder
        .master('spark://spark-master:7077') # master node URL
        .appName('~ Spark hands-on: RDD~') # my app name
        .config('spark.driver.memory', '1g') # memory allocated to the driver
        .config('spark.driver.cores', '2') # CPUs allocated to the driver
        .config('spark.executor.instances', '2') # how many executors
        .config('spark.executor.memory', '4g') # memory per executor (where the data is stored)
        .config('spark.executor.cores', '1') # CPUs per executor
        .config(conf=fw.spark.get_spark_app_config())
        .getOrCreate())
# Configuring Spark to use Foundry
spark = fw.configure_spark(spark) # Foundry specific

Printing the Spark session shows which version of Spark we are using (3.0) and gives a link to the *Spark UI*. This UI is really useful to manage the workload and to see the stages of the running applications. For instance, it can be used to monitor the resources consumed and adjust them accordingly.

In [4]:
spark

## 2. Loading Datasets

Spark has many connectors to interact with different data formats. The most common ones are Parquet, JSON and CSV ...


In [5]:
# Showing dataset stats
# Foundry wrapper
file_rid = fw.compass.get_rid('/datasources/flight_radar_24/data/cleaned/subsets/flight_radar_24_week')
fw.catalog.get_dataset_view_stats(file_rid)

KeyError: 'exceptionClass'

Let's use the Flight Radar 24 dataset. This is a Parquet file and we can load it into a DataFrame object with a single line.

In [None]:
%%time
df = spark.read.parquet(fw.spark.foundryfs_uri(file_rid))

In [None]:
df.printSchema() #helps visualizing what's inside a dataframe

The DataFrame API is recommended to manipulate data, but let's first use RDDs to gain some insight into the Spark core components ...

In [8]:
%%time
rdd = df.rdd

CPU times: user 773 µs, sys: 2.51 ms, total: 3.28 ms
Wall time: 1.38 s


What does the data look like in RDD format?

In [9]:
%%time
rdd.take(5)

CPU times: user 10.2 ms, sys: 4.53 ms, total: 14.8 ms
Wall time: 9.2 s


[Row(flight_id=87404301, aircraft_id=11012303, aircraft_registration='N617AR', equipment='rv6', departure_airport_code=None, scheduled_arrival_airport_code=None, arrival_airport_code=None, flight_number=None, callsign=None, msn=None, elapsed_time_seconds=151, departure_date_time=datetime.datetime(2015, 1, 6, 15, 40, 27), arrival_date_time=datetime.datetime(2015, 1, 6, 15, 42, 58), airport_separation=None, track_distance=4.562991608717113, adsb_start_flight_phase='ascent', adsb_end_flight_phase='descent', out_time=datetime.datetime(2015, 1, 6, 15, 30, 6, 229200), off_time=datetime.datetime(2015, 1, 6, 15, 40, 27), on_time=datetime.datetime(2015, 1, 6, 15, 42, 58), in_time=datetime.datetime(2015, 1, 6, 15, 50, 3, 639400), OOOI_quality_rating='WOOD', next_flight_id=None, next_arrival_airport_code=None, turnaround_time=None, dist_between_in_and_next_out=None, trt_quality_rating=None, manufacturer=None, aircraft_family=None, aircraft_type=None, airline_iata=None, airline_icao=None, turnarou

A few comments here:
- We first loaded a DataFrame object and then converted it to an RDD object. This is an anti-pattern, and we will use the DataFrame directly later on.
- We see that the RDD contains elements of the **Row** data type. This means we cannot manipulate it like we would a standard pandas DataFrame. This will also be handled using Spark SQL.
- Note that the first two instructions (reading the Parquet file and converting to an RDD) were processed very quickly. This is because, like **transformations**, they are lazy. Recall that Spark is a lazy computing engine: it processes no data until we execute an **action**, here the **take** function, which returns the first elements to the driver.
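
The laziness described in the last point can be imitated without Spark. The sketch below is a plain-Python analogy, not Spark code: the built-in `map` plays the role of a transformation and `itertools.islice` plays the role of `take`. The `(departure, arrival)` values are hypothetical numbers in seconds.

```python
from itertools import islice

calls = []  # records every element the function is actually applied to

def duration_seconds(pair):
    calls.append(pair)
    departure, arrival = pair
    return arrival - departure

# Hypothetical (departure, arrival) times in seconds, standing in for RDD rows.
flights = [(0, 151), (10, 2162), (20, 1795), (30, 2096)]

lazy_result = map(duration_seconds, flights)   # "transformation": nothing runs yet
calls_before = len(calls)

first_two = list(islice(lazy_result, 2))       # "action": pulls only what it needs
calls_after = len(calls)

print(calls_before, first_two, calls_after)
```

The function is not called at all until elements are requested, and then only for the two elements that are needed, which is the same reason `take` is usually much cheaper than a full `count`.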

## 3. Transformations and actions

There are two types of operations you can apply to an RDD: transformations and actions. (The same holds for DataFrames.)

- A Transformation creates a new RDD from an existing one.
- An Action runs a computation on the RDD and returns a value to the driver (like the .count function).

*Nota:* Recall that an RDD is a resilient data format. Each RDD keeps a link to its parent (its lineage), which allows the cluster to recreate it in case of failure.
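
To illustrate the idea of lineage, here is a toy model in plain Python. `ToyRDD` is a hypothetical class written for this notebook, not part of PySpark, and it ignores partitioning and distribution entirely. It only shows how remembering the parent and the derivation function is enough to recompute a result from the root data.

```python
class ToyRDD:
    """Toy stand-in for an RDD: remembers its parent and how it was derived."""

    def __init__(self, source=None, parent=None, func=None):
        self.source = source    # only set for the root dataset
        self.parent = parent    # link to the parent ToyRDD
        self.func = func        # how to derive this ToyRDD from its parent

    def map(self, f):
        return ToyRDD(parent=self, func=lambda items: [f(x) for x in items])

    def filter(self, pred):
        return ToyRDD(parent=self, func=lambda items: [x for x in items if pred(x)])

    def collect(self):
        # Recompute from the root by walking up the chain of parents.
        if self.parent is None:
            return list(self.source)
        return self.func(self.parent.collect())

    def lineage(self):
        # Number of transformations between this ToyRDD and the root data.
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth

a = ToyRDD(source=[1, 2, 3])
c = a.map(lambda x: x ** 2).filter(lambda x: x < 3)
print(c.lineage(), c.collect())
```

Nothing is stored for `c` except the recipe, so calling `collect` again simply replays the two steps from the source list.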

### Map
**Map** is the most common **transformation**. It applies the same function to each entry of the RDD. Let's compute the duration of each flight.

*Nota*: This is structured data, so the fields of each Row can be accessed by name.

In [26]:
time_duration = rdd.map(lambda x: x.arrival_date_time - x.departure_date_time)

### Take

**Take** is a common **action** that returns the first few elements of the result. Be careful: calling take executes all the transformations that lead to the result!

In [27]:
time_duration.take(5)

[datetime.timedelta(0, 151),
 datetime.timedelta(0, 2152),
 datetime.timedelta(0, 1775),
 datetime.timedelta(0, 2066),
 datetime.timedelta(0, 1827)]

Too bad, we lose the link between the observation and the result. Pairing the result with the original row inside the map keeps this link. But take care: our RDD now has elements of type (Row, datetime.timedelta).

In [28]:
time_duration_concatenated = rdd.map(lambda x: (x,x.arrival_date_time - x.departure_date_time))
time_duration_concatenated.take(1)

[(Row(flight_id=87404301, aircraft_id=11012303, aircraft_registration='N617AR', equipment='rv6', departure_airport_code=None, scheduled_arrival_airport_code=None, arrival_airport_code=None, flight_number=None, callsign=None, msn=None, elapsed_time_seconds=151, departure_date_time=datetime.datetime(2015, 1, 6, 15, 40, 27), arrival_date_time=datetime.datetime(2015, 1, 6, 15, 42, 58), airport_separation=None, track_distance=4.562991608717113, adsb_start_flight_phase='ascent', adsb_end_flight_phase='descent', out_time=datetime.datetime(2015, 1, 6, 15, 30, 6, 229200), off_time=datetime.datetime(2015, 1, 6, 15, 40, 27), on_time=datetime.datetime(2015, 1, 6, 15, 42, 58), in_time=datetime.datetime(2015, 1, 6, 15, 50, 3, 639400), OOOI_quality_rating='WOOD', next_flight_id=None, next_arrival_airport_code=None, turnaround_time=None, dist_between_in_and_next_out=None, trt_quality_rating=None, manufacturer=None, aircraft_family=None, aircraft_type=None, airline_iata=None, airline_icao=None, turnaro
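
Carrying the whole Row alongside the result makes every element heavy. A lighter option, sketched below in plain Python, is to keep only an identifier and the duration in seconds. The `Flight` namedtuple is a hypothetical stand-in for the Spark `Row`, reduced to three of the fields shown above. With an actual RDD, the same function could be passed to `rdd.map`.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for a Spark Row, reduced to the fields used here.
Flight = namedtuple("Flight", ["flight_id", "departure_date_time", "arrival_date_time"])

flights = [
    Flight(87404301, datetime(2015, 1, 6, 15, 40, 27), datetime(2015, 1, 6, 15, 42, 58)),
]

def id_and_duration(x):
    # (key, value) pair: the flight id keeps the link to the original observation
    return (x.flight_id, (x.arrival_date_time - x.departure_date_time).total_seconds())

pairs = list(map(id_and_duration, flights))
print(pairs)
```

The resulting 151 seconds matches the `elapsed_time_seconds` field of the first row shown above.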

Inside the map function, you can use your own function as well. Let's write an example that concatenates departure_airport_code with arrival_airport_code.

In [29]:
def concatenate_airport_codes(x):
    # Return "" when either airport code is missing
    if x.departure_airport_code is None or x.arrival_airport_code is None:
        res = ""
    else:
        res = x.departure_airport_code + "_" + x.arrival_airport_code
    return res

In [31]:
rdd.map(concatenate_airport_codes).take(2) #One transformation followed by one action

['', 'BTS_PRG']

### Count

Let's count how many flights departed from the "TAT" airport and compute their percentage within the whole dataset:

In [34]:
%%time
tat_departure_count = rdd.filter(lambda x : x.departure_airport_code == "TAT") \
                .count()
percentage_tat_departure = tat_departure_count / rdd.count() * 100
print(percentage_tat_departure)

0.0024280132023217875
CPU times: user 22.9 ms, sys: 6.86 ms, total: 29.7 ms
Wall time: 31.7 s
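
The cell above runs two actions (`count` is called twice), so unless the RDD is cached the data is scanned twice. One possible alternative, sketched here in plain Python with `functools.reduce`, is to carry a pair (matches, total) through a single pass. The airport codes below are made up for illustration. With an RDD, the same two functions could in principle be given to `rdd.map(...)` and `.reduce(...)`.

```python
from functools import reduce

# Made-up departure codes standing in for the RDD rows (None = unknown airport).
codes = ["TAT", None, "BTS", "TAT", "PRG", None, "ORD", "ATL"]

def to_pair(code):
    # (1 if the row departs from TAT else 0, 1 for the row itself)
    return (1 if code == "TAT" else 0, 1)

def add_pairs(a, b):
    # Element-wise sum of two (matches, total) pairs
    return (a[0] + b[0], a[1] + b[1])

tat_count, total = reduce(add_pairs, map(to_pair, codes))
percentage = tat_count / total * 100
print(tat_count, total, percentage)
```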


### Collect

This operation retrieves all the data into a Python list. All the data is sent from the different workers to the driver node. It is therefore an action to use with care when working with a large dataset.

In [35]:
# We've seen that only a small percentage of flights depart from TAT airport, so let's collect them safely!
rdd.filter(lambda x : x.departure_airport_code == "TAT").collect()

[Row(flight_id=86761336, aircraft_id=5267244, aircraft_registration='OMBYL', equipment='YK40', departure_airport_code='TAT', scheduled_arrival_airport_code='BTS', arrival_airport_code='BTS', flight_number=None, callsign='SSG1', msn=None, elapsed_time_seconds=1775, departure_date_time=datetime.datetime(2015, 1, 1, 8, 11, 43), arrival_date_time=datetime.datetime(2015, 1, 1, 8, 41, 18), airport_separation=132.12782100251258, track_distance=115.24480734281174, adsb_start_flight_phase='ascent', adsb_end_flight_phase='descent', out_time=datetime.datetime(2015, 1, 1, 7, 35, 45, 803200), off_time=datetime.datetime(2015, 1, 1, 8, 7, 27, 918814), on_time=datetime.datetime(2015, 1, 1, 8, 41, 18), in_time=datetime.datetime(2015, 1, 1, 8, 47, 11, 326800), OOOI_quality_rating='WOOD', next_flight_id=86771171, next_arrival_airport_code='PRG', turnaround_time=9835, dist_between_in_and_next_out=0.2837841078157432, trt_quality_rating='WOOD', manufacturer='Yakovlev', aircraft_family='Yak-40', aircraft_typ

*Nota*: Every transformation creates a new RDD. It is good practice to chain transformations into a pipeline, which keeps the code concise and gives a clear view of the processing steps.

### Operations between two RDDs

These apply a function to two RDDs, A and B, and produce a third RDD.

For instance: **subtract**. Let's create a new RDD that does not contain the rows with departure_airport_code == "TAT", using the preceding RDDs. This toy example is equivalent to a filter transformation.

In [36]:
# We have our initial RDD called rdd
# First, let's compute a RDD containing the TAT code

tat_rdd = rdd.filter(lambda x : x.departure_airport_code == "TAT")

not_tat_rdd = rdd.subtract(tat_rdd) #would be the same as doing rdd.filter(lambda x : x.departure_airport_code != "TAT")

not_tat_rdd.take(1)

[Row(flight_id=86826850, aircraft_id=11092606, aircraft_registration='N696RG', equipment='WW24', departure_airport_code=None, scheduled_arrival_airport_code=None, arrival_airport_code=None, flight_number=None, callsign='4', msn=None, elapsed_time_seconds=1010, departure_date_time=datetime.datetime(2015, 1, 1, 23, 7, 20), arrival_date_time=datetime.datetime(2015, 1, 1, 23, 24, 10), airport_separation=None, track_distance=133.73038445955518, adsb_start_flight_phase='ascent', adsb_end_flight_phase='cruise', out_time=datetime.datetime(2015, 1, 1, 22, 7, 32, 100200), off_time=datetime.datetime(2015, 1, 1, 22, 57, 28, 451132), on_time=datetime.datetime(2015, 1, 1, 23, 48, 29, 203840), in_time=datetime.datetime(2015, 1, 2, 0, 19, 9, 119800), OOOI_quality_rating='WOOD', next_flight_id=None, next_arrival_airport_code=None, turnaround_time=None, dist_between_in_and_next_out=None, trt_quality_rating=None, manufacturer=None, aircraft_family=None, aircraft_type=None, airline_iata=None, airline_icao

### Reduce

This is an aggregation function that reduces the RDD to a single element.
For instance, let's implement another way to count the rows with departure_airport_code == "TAT", in a map/reduce style.


In [45]:
tat_rdd_keyed = tat_rdd.map(lambda x : 1) # transformation: we map a 1 to every observation
tat_count = tat_rdd_keyed.reduce(lambda x,y :  x+y) # action: sum all the 1s
print(tat_count) # a plain Python int, not an RDD!

16
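
Spark reduces each partition separately and then combines the partial results, which is why the function passed to `reduce` should be associative and commutative. The plain-Python sketch below imitates this with a hypothetical helper, `partitioned_reduce` (not a Spark function), that splits a list into chunks before reducing.

```python
from functools import reduce

def partitioned_reduce(data, func, num_partitions):
    """Hypothetical helper: reduce each chunk, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

ones = [1] * 16  # one "1" per TAT flight, as in the cell above

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

# Addition gives the same answer however the data is split ...
print(partitioned_reduce(ones, add, 1), partitioned_reduce(ones, add, 4))
# ... whereas subtraction depends on the partitioning.
print(partitioned_reduce(ones, sub, 1), partitioned_reduce(ones, sub, 4))
```

Addition yields 16 in both cases, while subtraction yields -14 with one chunk and 4 with four chunks, so a function like subtraction would give results that change with the cluster layout.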


### countByKey, reduceByKey ...
What if we want to compute this for all the different departure airports?
The countByKey function does this for key/value RDDs. There are many other operations on this type of RDD (see the docs :) ).

In [46]:
keyed_rdd = rdd.map(lambda x : (x.departure_airport_code, 1)) # create a key value (departure_airport, 1)

We could have put whatever we wanted in the value field. For instance, we could have written (x.departure_airport_code, x), so that the full rows could be accessed by key.
The point of doing this is to get grouped results (comparable to a GroupBy in Spark SQL or pandas).


In [48]:
counted_departure = keyed_rdd.reduceByKey(lambda x,y : x+y)
# The input elements have type (String, Int): the values are grouped by key and summed efficiently
# The lambda only ever receives the values of one key, never the (key, value) tuples themselves
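
What reduceByKey computes can be imitated in a few lines of plain Python. The function `reduce_by_key` below is a hypothetical single-machine imitation written for this notebook. It says nothing about how Spark distributes the work, and the input pairs are made up.

```python
def reduce_by_key(pairs, func):
    """Plain-Python imitation of reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        # func only ever sees two values that belong to the same key
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# Made-up (departure_airport_code, 1) pairs; None marks an unknown airport.
keyed = [("ORD", 1), (None, 1), ("ATL", 1), ("ORD", 1), (None, 1), ("ORD", 1)]

counts = dict(reduce_by_key(keyed, lambda x, y: x + y))
print(counts)
```

As in the real output, `None` is treated as a key like any other, which is why unknown airports show up as their own group.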

In [48]:
counted_departure.collect() 

[(None, 154461),
 ('FLL', 2167),
 ('SHV', 141),
 ('SIN', 3260),
 ('MCO', 2772),
 ('TMB', 6),
 ('MSY', 807),
 ('MAN', 1307),
 ('TLV', 1480),
 ('EYW', 134),
 ('PHF', 75),
 ('CRQ', 4),
 ('SFB', 214),
 ('MNZ', 11),
 ('DME', 2230),
 ('GRU', 2093),
 ('UFA', 150),
 ('SIP', 156),
 ('TFS', 654),
 ('CUN', 1323),
 ('CEK', 84),
 ('UIO', 258),
 ('SSH', 139),
 ('FNJ', 2),
 ('VVO', 102),
 ('HTA', 20),
 ('VKO', 1048),
 ('ONT', 624),
 ('EDC', 9),
 ('AUS', 1005),
 ('MFE', 110),
 ('AGU', 49),
 ('BJX', 109),
 ('TJM', 68),
 ('VSA', 70),
 ('SJD', 288),
 ('CJS', 47),
 ('CME', 26),
 ('ZIH', 49),
 ('REX', 20),
 ('IAH', 5140),
 ('OTP', 759),
 ('VER', 82),
 ('OVS', 1),
 ('RYG', 104),
 ('MST', 77),
 ('GRI', 18),
 ('MTH', 7),
 ('LGB', 445),
 ('GYN', 138),
 ('SAN', 1683),
 ('CBG', 46),
 ('BZZ', 25),
 ('ACT', 80),
 ('DWH', 16),
 ('LCH', 38),
 ('DBQ', 20),
 ('ACH', 18),
 ('AOH', 1),
 ('LPL', 307),
 ('MRY', 180),
 ('MVY', 15),
 ('GSE', 20),
 ('OXF', 125),
 ('DUS', 1544),
 ('BRQ', 36),
 ('SYD', 2630),
 ('TXL', 1383),
 

Whoa! That is a long list of results! Maybe you are only interested in the busiest airports?
Let's sort this out!

In [54]:
counted_departure_sorted = counted_departure.sortBy(lambda x: x[1], ascending=False)
counted_departure_sorted.take(5)

[(None, 154461), ('ORD', 7605), ('ATL', 7442), ('DFW', 6614), ('LAX', 5752)]
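
Sorting the whole RDD just to read five elements does more work than needed. PySpark also offers `takeOrdered` and `top` for this. The plain-Python sketch below shows the same idea with `heapq.nlargest`, and drops the `None` key, which only counts flights without a known departure airport. The pairs are copied from the outputs above.

```python
import heapq

# A few (airport, count) pairs copied from the outputs above.
counted = [(None, 154461), ('FLL', 2167), ('SIN', 3260), ('ORD', 7605),
           ('ATL', 7442), ('DFW', 6614), ('LAX', 5752), ('IAH', 5140)]

# Keep only the pairs whose airport code is known.
known = [pair for pair in counted if pair[0] is not None]

# Select the three largest counts without sorting the whole list.
top3 = heapq.nlargest(3, known, key=lambda pair: pair[1])
print(top3)
```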

*Nota*: There are many ways to compute the same thing ... We could have used functions like **keyBy, mapValues** ... Refer to the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html) for an overview!

### Save our result to a text file locally

This is an action that triggers every transformation in Spark's execution plan!


In [57]:
# Add foundry stuff

#counted_departure_sorted.coalesce(1).saveAsTextFile("res.txt")
#saveAsHadoop ...



In [19]:

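sc = spark.sparkContext
a = sc.parallelize([1, 2, 3])       # small RDD built from a Python list
b = a.map(lambda x: x**2)           # transformation
c = b.filter(lambda x: x < 3)       # transformation
print(c.toDebugString())            # lineage of c (returned as bytes in PySpark)
c.collect()                         # action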

b'(2) PythonRDD[34] at RDD at PythonRDD.scala:55 []\n |  ParallelCollectionRDD[33] at parallelize at PythonRDD.scala:198 []'


[1]

In [20]:
spark.stop()