<p align="center">
   <img src="spark.png">
</p>

# Spark Hands-on: Spark RDD

In this notebook you will discover the **Spark** framework and its Python API (PySpark). We will cover the basic transformations and actions with **Spark Core** and its main component: the **Resilient Distributed Dataset** (RDD).

In the next notebooks, we will go through the **SQL API**, which allows higher-level queries that are optimized by the Spark engine (Catalyst and Tungsten). Recall that, when possible, **Spark SQL is the recommended API to analyse data**. It is built on top of RDDs, so fall back to the RDD API only when you need low-level operations or work with very unstructured data.
See [RDD vs DataFrame vs DataSet](https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/) for a comprehensive comparison (the typed *Dataset* API is available only in Scala and Java, not in Python).


Finally we will see a basic application of a machine learning algorithm that will be trained at scale through **SparkML/MLlib pipelines**.

*Note*: Spark Streaming (DStreams and Structured Streaming) and the Graph API are not covered in this notebook.

For an introduction to the main Spark concepts, please refer to the [confluence page](https://confluence.airbus.corp/display/E2H89ATAR/Introduction+to+Spark).

## 1. Spark Configuration

Before doing any computation, Spark needs to be configured. To do that, you create a **SparkSession**, which is the entry point to the cluster through its master node. With this object, you can either connect to the master node and inherit the default configuration of the cluster, or set the resources you need for your application.

*Note*: In earlier versions (before Spark 2.0), you had to create a SparkContext first; it is now encapsulated in the SparkSession.

In [1]:
# Some imports

# Spark
import pyspark
from pyspark.sql import SparkSession


# Pandas
import pandas as pd

# Plot
%matplotlib inline
import matplotlib.pyplot as plt

# Foundry
import foundrywrapper
from foundrywrapper import FoundryWrapper

In [4]:
#Required for foundry, not useful for spark purposes
fw = FoundryWrapper()
fw

The SparkSession is created with the master node URL, and we override the default settings with a custom configuration.
The driver node holds the main application. It negotiates resources with the resource manager (e.g. Kubernetes, YARN ...) and it is where the application is launched (typically the JAR of a Spark application). We allocate it 1 GB of RAM and 2 CPU cores.

Then, for the workers, we use 2 executors (2 JVMs), each with 4 GB of RAM and 1 core.

This configuration should be tuned according to the computations to be run, the data size ...

In [5]:
spark = (SparkSession.builder
        .master('spark://spark-master:7077')      # master node URL
        .appName('~ Spark hands-on: RDD~')        # application name
        .config('spark.driver.memory', '1g')      # memory allocated to the driver
        .config('spark.driver.cores', '2')        # CPUs allocated to the driver
        .config('spark.executor.instances', '2')  # number of executors
        .config('spark.executor.memory', '4g')    # memory per executor (where the data is stored)
        .config('spark.executor.cores', '1')      # CPUs per executor
        .config('spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
        .config('spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT', 1)
        .config(conf=fw.spark.get_spark_app_config())
        .getOrCreate())
# Configuring spark to use Foundry
spark = fw.configure_spark(spark) # foundry specific

Printing the Spark session shows which version of Spark we are using (palentir3.0) and gives a link to the *Spark UI*. This UI is really useful to manage the workload and to see the stages of the running applications. For instance, it can be used to monitor the resources consumed and adapt the configuration accordingly.

In [6]:
spark

## 2. Loading Datasets

Spark has many connectors to interact with different data formats. The most common ones are Parquet, JSON and CSV ...


### The dataset

This is the dataset used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between bad connections, called intrusions or attacks, and good, normal connections.

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

In [7]:
data_file = "kddcup.data_10_percent"

In [8]:
# Read the file on the driver and distribute its lines as an RDD
with open(data_file, 'r') as f:
    rdd = spark.sparkContext.parallelize(f.readlines())

What does the data look like in an RDD?

In [9]:
%%time
rdd.take(5)

CPU times: user 5.45 ms, sys: 5.84 ms, total: 11.3 ms
Wall time: 1.93 s


['0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.\n',
 '0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.\n',
 '0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.\n',
 '0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.\n',
 '0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.\n']

A few comments here:
- The data is not split into fields as a CSV reader would do; with RDDs you have to do it yourself.
- Creating the RDD did not launch any Spark job. Recall that Spark is a lazy computing engine: **transformations** only describe the computation and return immediately. Nothing is computed on the cluster until we execute an **action**, here the **take** function, which returns the first elements of the RDD to the driver.
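
The lazy behaviour described above can be illustrated without a cluster. The following is a plain-Python analogy (generators instead of RDDs, not the Spark API): the mapping step is only declared, and the function runs only when elements are actually requested. The names `traced_split` and `calls` and the shortened sample lines are hypothetical.

```python
from itertools import islice

calls = []  # records every line that actually gets processed


def traced_split(line):
    """Split a CSV line and remember that we were called."""
    calls.append(line)
    return line.rstrip("\n").split(",")


# Hypothetical sample lines, shaped like the KDD records but much shorter
lines = [
    "0,tcp,http,normal.\n",
    "0,tcp,http,normal.\n",
    "184,tcp,telnet,buffer_overflow.\n",
]

# "Transformation": building the generator does not call traced_split yet
lazy_rows = (traced_split(line) for line in lines)
calls_before = len(calls)

# "Action": asking for the first two rows triggers the work, for two rows only
first_two = list(islice(lazy_rows, 2))
calls_after = len(calls)

print(calls_before, calls_after)
print(first_two)
```

As with `take` on an RDD, only the elements that are requested get processed; the third line is never split.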

## 3. Transformations and actions

There are two types of operations you can apply to an RDD: Transformations and Actions. (The same holds for DataFrames.)

- A Transformation creates a new RDD from an existing one.
- An Action runs a computation on the RDD and returns a value to the driver (like the .count function).

*Note:* Recall that an RDD is resilient: each RDD keeps a reference to its parent (its lineage), which allows the cluster to recompute it in case of failure.
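
To make the idea of a link to the parent concrete, here is a toy sketch in plain Python. It is not how Spark is implemented: `MiniRDD` and its methods are hypothetical names that only mimic the principle that a transformation records a step, while an action replays the recorded steps from the root.

```python
class MiniRDD:
    """Toy stand-in for an RDD: stores its parent and the step that derives it."""

    def __init__(self, source=None, parent=None, step=None):
        self.source = source  # only set on the root dataset
        self.parent = parent  # link to the parent dataset
        self.step = step      # function turning the parent's elements into ours

    def map(self, fn):
        # Transformation: returns a new MiniRDD, computes nothing
        return MiniRDD(parent=self, step=lambda items: [fn(x) for x in items])

    def filter(self, pred):
        # Transformation: returns a new MiniRDD, computes nothing
        return MiniRDD(parent=self, step=lambda items: [x for x in items if pred(x)])

    def collect(self):
        # Action: walk back to the root, then replay every recorded step
        if self.parent is None:
            return list(self.source)
        return self.step(self.parent.collect())

    def lineage(self):
        # Number of datasets in the chain, this one included
        return 1 if self.parent is None else 1 + self.parent.lineage()


root = MiniRDD(source=[1, 2, 3])
squares = root.map(lambda x: x ** 2)
small = squares.filter(lambda x: x < 3)

result = small.collect()  # recomputed from the root each time it is called
print(result, small.lineage())
```

Because every dataset can be rebuilt from its parent chain, losing an intermediate result is not fatal: it is enough to replay the steps, which is the intuition behind resilience.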

### Map
**Map** is the most common **transformation**. It applies the same function to each entry of the RDD. Let's split each record on commas.


In [10]:
rdd_splitted = rdd.map(lambda x: x.split(","))

### take

**Take** is a common **action** that returns the first few elements of the result. Be careful: calling take executes the transformations that lead to the result!

In [11]:
rdd_splitted.take(2)

[['0',
  'tcp',
  'http',
  'SF',
  '181',
  '5450',
  '0',
  '0',
  '0',
  '0',
  '0',
  '1',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '8',
  '8',
  '0.00',
  '0.00',
  '0.00',
  '0.00',
  '1.00',
  '0.00',
  '0.00',
  '9',
  '9',
  '1.00',
  '0.00',
  '0.11',
  '0.00',
  '0.00',
  '0.00',
  '0.00',
  '0.00',
  'normal.\n'],
 ['0',
  'tcp',
  'http',
  'SF',
  '239',
  '486',
  '0',
  '0',
  '0',
  '0',
  '0',
  '1',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '0',
  '8',
  '8',
  '0.00',
  '0.00',
  '0.00',
  '0.00',
  '1.00',
  '0.00',
  '0.00',
  '19',
  '19',
  '1.00',
  '0.00',
  '0.05',
  '0.00',
  '0.00',
  '0.00',
  '0.00',
  '0.00',
  'normal.\n']]

Every feature is now a separate list element, with the label at the end. Let's represent each record as a tuple (features, label) and clean the label by removing the trailing dot and line separator.
To do so, let's use map again. This time the function is more complex, so we define a named function instead of a one-line lambda.

In [12]:
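def parse_interaction(elems):
    """Turn a split record into a (features, label) tuple."""
    features = []
    for e in elems[:-1]:
        try:
            features.append(float(e))  # numeric field
        except ValueError:
            features.append(e)         # categorical field (e.g. 'tcp'), kept as a string
    label = elems[-1][:-2]             # drop the trailing '.\n' from the label
    return (features, label)

rdd_key = rdd_splitted.map(parse_interaction)
print(rdd_key.take(2))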

[([0.0, 'tcp', 'http', 'SF', 181.0, 5450.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 9.0, 9.0, 1.0, 0.0, 0.11, 0.0, 0.0, 0.0, 0.0, 0.0], 'normal'), ([0.0, 'tcp', 'http', 'SF', 239.0, 486.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 19.0, 19.0, 1.0, 0.0, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0], 'normal')]


### Count

Let's count the number of normal and abnormal connections.

In [13]:
# transformations
normal = rdd_key.filter(lambda x: x[1] == "normal")
abnormal = rdd_key.filter(lambda x: x[1] != "normal")

#actions
print(normal.count())
print(abnormal.count())

97278
396743


In [14]:
normal

PythonRDD[6] at RDD at PythonRDD.scala:55

### Collect

**Collect** is similar to **take**, except that it retrieves the **whole** dataset to the driver! Use it with care: if the data is too large, it can cause an out-of-memory error.

In [15]:
normal_list = normal.collect()
normal_list[0]

([0.0,
  'tcp',
  'http',
  'SF',
  181.0,
  5450.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  8.0,
  8.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  0.0,
  0.0,
  9.0,
  9.0,
  1.0,
  0.0,
  0.11,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 'normal')

### Operations between two RDDs

These apply a function to RDDs A and B and produce a third RDD.

For instance: **subtract**. Let's create a new RDD that contains only the abnormal records, using the preceding RDDs. This toy example is equivalent to a filter transformation.
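
The equivalence between subtracting and filtering can be checked on a few records in plain Python. This is only a sketch of the semantics, not the Spark implementation; the helper `subtract` and the shortened sample lines are hypothetical.

```python
def subtract(left, right):
    """Keep every element of `left` that does not appear in `right`."""
    excluded = set(right)
    return [x for x in left if x not in excluded]


# Hypothetical raw lines, much shorter than the real records
raw = [
    "0,tcp,http,normal.\n",
    "0,icmp,ecr_i,smurf.\n",
    "0,tcp,private,neptune.\n",
    "0,udp,domain_u,normal.\n",
]

normal_raw = [line for line in raw if "normal." in line]

abnormal_by_subtract = subtract(raw, normal_raw)
abnormal_by_filter = [line for line in raw if "normal." not in line]

print(len(abnormal_by_subtract))
```

On a cluster the two approaches do not cost the same: `subtract` has to compare the content of two RDDs, which involves a shuffle, whereas `filter` works partition by partition. When a simple predicate exists, `filter` is usually the better choice.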

In [18]:
normal_raw = rdd.filter(lambda x: "normal." in x)
abnormal_raw = rdd.subtract(normal_raw)  #would be the same as doing rdd.filter(lambda x :"normal." not in x)

print(abnormal_raw.count())

396743


*Note*: Every transformation creates a new RDD. It is good practice to chain transformations into a pipeline, which keeps the code concise and gives a clear view of the processing steps.

### Reduce

This is an aggregation action that combines all the elements into a single value.
For instance, let's implement another way to count the number of normal connections, in a map/reduce style.
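
The same map/reduce pattern can be tried locally with `functools.reduce`. This is a plain-Python sketch with hypothetical labels, not the Spark API. It also shows why the function passed to `reduce` on an RDD should be associative and commutative: partial results computed on separate chunks must combine to the same total.

```python
from functools import reduce

# Hypothetical labels standing in for the parsed records
labels = ["normal", "smurf", "normal", "neptune", "normal"]

# Map step: emit a 1 for every normal record
ones = [1 for label in labels if label == "normal"]

# Reduce step: fold the 1s pairwise into a single value
normal_count = reduce(lambda x, y: x + y, ones)

# Reducing two chunks separately, then combining the partial results,
# gives the same answer because addition is associative and commutative
partials = [reduce(lambda x, y: x + y, chunk) for chunk in (ones[:1], ones[1:])]
regrouped = reduce(lambda x, y: x + y, partials)

print(normal_count, regrouped)
```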


In [19]:
rdd_keyed = normal.map(lambda x : 1) # transformation: we map a 1 to every observation
normal_count = rdd_keyed.reduce(lambda x,y :  x+y) # action
print(normal_count) #not a rdd !

97278


Then let's compute the total duration of the abnormal connections (intrusions).

In [20]:
abnormal.take(1)

[([184.0,
   'tcp',
   'telnet',
   'SF',
   1511.0,
   2957.0,
   0.0,
   0.0,
   0.0,
   3.0,
   0.0,
   1.0,
   2.0,
   1.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   1.0,
   0.0,
   0.0,
   0.0,
   0.0,
   1.0,
   0.0,
   0.0,
   1.0,
   3.0,
   1.0,
   0.0,
   1.0,
   0.67,
   0.0,
   0.0,
   0.0,
   0.0],
  'buffer_overflow')]

In [21]:
abnormal_attack_duration = abnormal.map(lambda x : x[0][0]) # duration is the first element
abnormal_attack_duration_total = abnormal_attack_duration.reduce(lambda x,y: x+y)
print(abnormal_attack_duration_total)

2626792.0



### countByKey, reduceByKey ...
What if we want to compute the total duration for normal connections and for each of the labels referring to bad connections?
On a key/value RDD, the **reduceByKey** transformation does exactly this by merging the values of each key, while the **countByKey** action counts the number of records per key. There are many other functions for this type of RDD (see the docs :) ).

In [31]:
keyed_rdd = rdd_splitted.map(lambda x : (x[41][:-2], float(x[0]))) # create a key value (type, duration)

We could have put whatever we wanted in the value field. For instance, we could have written (x.key, x), so that whole records are accessed by a particular key.
The point of doing this is to get grouped results (comparable to a GroupBy in Spark SQL or pandas).
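
What a per-key aggregation computes can be sketched in plain Python. This mimics the result only, not the distributed execution; `reduce_by_key` is a hypothetical helper and the (label, duration) pairs are made up. It also contrasts summing values per key with counting records per key.

```python
from collections import Counter


def reduce_by_key(pairs, fn):
    """Merge the values of each key with `fn`, like a per-key reduce."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# Hypothetical (label, duration) pairs
pairs = [
    ("normal", 2.0),
    ("smurf", 0.0),
    ("normal", 5.0),
    ("back", 3.0),
    ("back", 1.0),
]

# Total duration per label (what a per-key sum returns)
total_duration = reduce_by_key(pairs, lambda x, y: x + y)

# Number of records per label (what a per-key count returns)
records_per_key = Counter(key for key, _ in pairs)

print(total_duration)
print(dict(records_per_key))
```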


In [32]:
duration_attack_by_key = keyed_rdd.reduceByKey(lambda x,y : x+y)
# reduceByKey takes (str, float) pairs, groups them by key and sums the values of each key efficiently
# A plain reduce would receive whole (key, value) tuples, and + on tuples concatenates them instead of summing the durations

In [33]:
duration_attack_by_key.collect() 

[('loadmodule', 326.0),
 ('neptune', 0.0),
 ('guess_passwd', 144.0),
 ('portsweep', 1991911.0),
 ('ftp_write', 259.0),
 ('multihop', 1288.0),
 ('warezmaster', 301.0),
 ('warezclient', 627563.0),
 ('smurf', 0.0),
 ('normal', 21075991.0),
 ('satan', 64.0),
 ('ipsweep', 43.0),
 ('buffer_overflow', 2751.0),
 ('perl', 124.0),
 ('pod', 0.0),
 ('teardrop', 0.0),
 ('land', 0.0),
 ('rootkit', 1008.0),
 ('phf', 18.0),
 ('back', 284.0),
 ('imap', 72.0),
 ('nmap', 0.0),
 ('spy', 636.0)]

Let's sort this out!

In [34]:
duration_attack_by_key_sorted = duration_attack_by_key.sortBy(lambda x: x[1], ascending=False)
duration_attack_by_key_sorted.take(5)

[('normal', 21075991.0),
 ('portsweep', 1991911.0),
 ('warezclient', 627563.0),
 ('buffer_overflow', 2751.0),
 ('multihop', 1288.0)]

*Note*: There are many ways to compute the same thing ... We could have used functions like **keyBy, mapValues** ... refer to the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html) for an overview!

## 4. Debugging

Debugging can be tedious in Spark. To help understand the behaviour of the code, we can print the lineage of an RDD (the chain of RDDs it was derived from) with **toDebugString**.
For a more comprehensive view, use the Spark UI.

In [19]:

sc = spark.sparkContext
a = sc.parallelize([1,2,3])
b = a.map(lambda x : x**2)
c= b.filter(lambda x: x< 3)
print(c.toDebugString())
c.collect()

b'(2) PythonRDD[34] at RDD at PythonRDD.scala:55 []\n |  ParallelCollectionRDD[33] at parallelize at PythonRDD.scala:198 []'


[1]

In [35]:
spark.stop()