# Introduction to Interactive Analysis with Spark

## Table of Contents
1. [Initialization](#1.-Initialization)
2. [Creating an RDD](#2.-Creating-an-RDD)
3. [Getting Help](#3.-Getting-Help)
4. [Action on a Dataset](#4.-Action-on-a-Dataset)
5. [Dataset Transformation](#5.-Dataset-Transformation)
6. [Caching a Dataset](#6.-Caching-a-Dataset)
7. [Filtering a Dataset](#7.-Filtering-a-Dataset)
8. [Reduction Operation](#8.-Reduction-Operation)  
  8.1 [Filtering Unrelated Entries](#8.1-Filtering-Unrelated-Entries)  
  8.2 [Transforming in Key-Value Pairs](#8.2-Transforming-in-Key-Value-Pairs)  
  8.3 [Aggregating the Results by Key](#8.3-Aggregating-the-Results-by-Key)  
  8.4 [Finding the Most Processing Hungry Projects](#8.4-Finding-the-Most-Processing-Hungry-Projects)    
  8.5 [Bar Chart](#8.5-Bar-Chart)
9. [Mini-Project](#9.-Mini-Project)
10. [Ending the Analysis](#10.-Ending-the-Analysis)
11. [Recap](#11.-Recap)
12. [References](#12.-References)

## List of Exercises
1. [Exercise 1: How to Cache?](#Exercise-1)
2. [Exercise 2: How to Filter?](#Exercise-2)
3. [Exercise 3: How to Peek?](#Exercise-3)
4. [Exercise 4: How to Sort?](#Exercise-4)

## 1. Initialization

In this notebook, we will use Spark to analyze a structured dataset.

First, we need to import Spark's Python module named `pyspark`.

In [None]:
import pyspark

Then, we need to create a SparkContext.

In [None]:
try:
    sc = pyspark.SparkContext()
except ValueError:
    print("Warning : a SparkContext already exists.")

When Spark runs locally, creating the context ourselves gives us access to Spark's functions. It is also possible to launch Spark together with the Python interpreter or notebook through the script named `pyspark`. In that case, a context already exists (under the name `sc`), and trying to create a second one raises a `ValueError` exception. The exception has no effect, so we simply catch it and display a warning message.

## 2. Creating an RDD

We will now create an RDD from text files representing the logs of a supercomputer named Colosse 2014. This data is in the folder `/project/datasets/colosse`.

In [None]:
moabevents = sc.textFile('/project/datasets/colosse/*.ssv')

The first lines of our log file look like this:

```
00:00:08 1388552403:2422699 job 10563398 JOBEND 1 8 shm 54000 Completed short 1388519145 1388541168 1388541168 1388552397 zcv-890-aa
00:00:15 1388552413:2422700 job 10563514 JOBSTART 1 8 shm 54000 Idle short 1388519476 1388552413 1388552413 1388552413 zcv-890-aa
00:04:53 1388552691:2422701 job 10563883 JOBEND 3 24 mpourbag 1800 Completed short 1388551677 1388552392 1388552392 1388552670 bem-651-ac
00:04:58 1388552698:2422702 job 10563887 JOBSUBMIT 3 24 mpourbag 1800 Idle short 1388552698 0 0 1388552698 bem-651-ac
00:06:20 1388552780:2422703 node r101-n75 NODEUP r101-n75 STATE=Idle PARTITION=torque AMEMORY=24155 ASWAP=21803 CDISK=1 CMEMORY=24155 CPROC=8 CSWAP=24155 RACK=1
```

It is a tabular file, where each line is a single event and the columns represent:
1. Event time;
2. Event epoch;
3. Event type;

If the event is related to a job, the rest of the columns will be:  
4. Job id;
5. Job event : `{JOBSUBMIT, JOBSTART, JOBEND}`;
6. Number of nodes;
7. Number of cores;
8. Username;
9. Wallclock limit;
10. Status;
11. Queue;
12. Submit time;
13. Dispatch time;
14. Start time;
15. End time;
16. Project id.

We can look at a few entries with the RDD's method `take` to get the first `K` elements of the dataset. Here, `K = 10`.

In [None]:
first10 = moabevents.take(10)
print(first10)

Since `take` returns a list, we can iterate on the result and print each line from the file on a separate line on the screen.

In [None]:
from pprint import pprint
for item in first10:
    pprint(item)

## 3. Getting Help

At any moment, you can get help on a Python object using the `help()` function. For example, here is how to learn more about the RDD's `take()` method.

In [None]:
help(moabevents.take)

## 4. Action on a Dataset

The `take()` method is one among multiple available *actions* we can apply on an RDD. An exhaustive list of actions is available at the following URL:
https://spark.apache.org/docs/latest/programming-guide.html#actions

If we would rather not leave the notebook tab, we can call `help()` directly on an RDD.

In [None]:
help(moabevents)

Among the available actions, let's take the method `count()`. What does it return?

In [None]:
moabevents.count()

Each action applied to an RDD creates one or more tasks and produces a result. Every task executed in the same app can be viewed in the Spark dashboard. In this interface, we can track the progress of a task and check performance measures such as its duration and cache statistics.

## 5. Dataset Transformation

Let us display again the first 10 elements of our dataset that we retrieved earlier.

In [None]:
first10

We can see that the RDD contains each line of the input text files, but that individual columns cannot be accessed. **Why?**

In [None]:
first1 = moabevents.first()
first1

The action `first()`, as its name states, returns the first entry of the dataset. We see that each entry is a single string. We need to transform this first RDD into a second one in which each string is divided into a list. To do this, we will use the `split()` method of Python string objects.

We first test the function on the first entry of the dataset.

In [None]:
str.split(first1)
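To see how the split tokens line up with the column list from section 2, here is a small local sketch that does not need Spark. It splits one of the sample log lines shown earlier and pairs each token with a field name. The names in `JOB_FIELDS` are labels chosen for this illustration, not identifiers defined by Moab or Spark.

```python
# Illustrative labels that follow the column list in section 2 (an assumption).
JOB_FIELDS = [
    "event_time", "event_epoch", "event_type", "job_id", "job_event",
    "nodes", "cores", "username", "wallclock_limit", "status", "queue",
    "submit_time", "dispatch_time", "start_time", "end_time", "project_id",
]

# One of the sample job lines shown in section 2.
sample = ("00:00:08 1388552403:2422699 job 10563398 JOBEND 1 8 shm 54000 "
          "Completed short 1388519145 1388541168 1388541168 1388552397 zcv-890-aa")

tokens = sample.split()                  # same result as str.split(sample)
record = dict(zip(JOB_FIELDS, tokens))   # pair each token with its column name

print(record["job_event"], record["wallclock_limit"], record["project_id"])
```

Every value is still a string at this point, and the mapping only applies to job events. Node events such as `NODEUP` have a different layout.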

We now want to apply this transformation to every entry of the RDD. The RDD's method `map(func)` returns a new RDD formed by processing each element of the source with a function *func*.

In [None]:
moabevents_tab = moabevents.map(str.split)

The evaluation of this transformation is *lazy*: Spark does not compute anything until an action requests a result. To convince yourself, execute the preceding cell, then visit the Spark dashboard. You should see that no job has been added to the list.
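As a rough analogy only, Python's built-in `map` is lazy as well, which makes the idea easy to observe without a cluster. The sketch below records every call to a hypothetical helper, `split_and_log`, and shows that nothing runs until an element is actually requested. Spark's laziness works on whole RDDs and their lineage rather than on a Python iterator, so this is an illustration of the concept, not of Spark's internals.

```python
calls = []  # records every line the function was actually applied to

def split_and_log(line):
    """Split a line into fields and remember that we were called."""
    calls.append(line)
    return line.split()

lines = ["a b c", "d e f"]

lazy = map(split_and_log, lines)  # builds the pipeline, computes nothing yet
calls_before = len(calls)

first = next(lazy)                # requesting a result forces one evaluation
calls_after = len(calls)

print(calls_before, first, calls_after)
```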

## 6. Caching a Dataset

When we expect to operate frequently on the same dataset, it can be useful to tell Spark to keep it in memory.

To do so, we use the `cache()` method.

In [None]:
moabevents_tab.cache()

The RDDs stored in memory are displayed in the **Storage** section of Spark web interface. Note that datasets are not loaded in memory until an action is called on them. 

To free the memory used by a cached RDD that we no longer need, we call the `unpersist()` method.

In [None]:
moabevents_tab.unpersist()

## 7. Filtering a Dataset

Now that we have an RDD that is easier to manipulate, we can start the analysis. First, we are interested in events related to jobs. The following line filters the last RDD we created, keeping only the entries related to jobs.

Try to answer the following quiz before executing the cell:  
* What sort of argument does the `filter()` method take?
* What type is `x`?
* Is filter an action or a transformation?
* What does `filter()` return?

In [None]:
moabjobs = moabevents_tab.filter(lambda x: x[2] == "job")
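The same predicate can be tried locally on a couple of lines before running it on the cluster. The sketch below uses Python's built-in `filter` on plain lists as a stand-in for the RDD method. The job line comes from the sample in section 2, and the node line is a shortened copy of the `NODEUP` sample.

```python
lines = [
    "00:00:15 1388552413:2422700 job 10563514 JOBSTART 1 8 shm 54000 Idle "
    "short 1388519476 1388552413 1388552413 1388552413 zcv-890-aa",
    # Shortened copy of the NODEUP sample line
    "00:06:20 1388552780:2422703 node r101-n75 NODEUP r101-n75 STATE=Idle",
]

records = [line.split() for line in lines]  # local equivalent of map(str.split)

# Keep only the records whose third column (index 2) is "job"
jobs = list(filter(lambda x: x[2] == "job", records))

print(len(jobs), jobs[0][3])
```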

#### Exercise 1
**Write the code to cache the new RDD in the following cell**.

We can now count the number of job events. The notebook magic command `%time` reports how long Spark took to count the entries in the RDD.

In [None]:
%time moabjobs.count()

Since we told Spark to keep the dataset in memory, the time required to count the number of job events should be shorter for the second execution.

In [None]:
%time moabjobs.count()

Since we ran an action on a cached RDD, it should now appear in the **Storage** section of our app's dashboard.

## 8. Reduction Operation

We are now interested in producing a bar chart of the total requested walltime per project in our dataset. In order to do this, we will need to aggregate the requested walltime for each project. This type of operation is called a *reduction*.

### 8.1 Filtering Unrelated Entries

The requested wallclock time limit shows up in several events related to the same job. Some of these events are:
```
JOBSTART
JOBEND
JOBSUBMIT
```

#### Exercise 2
**Design a filter to only keep the entries related to job submission.**

In [None]:
moabjobends = moabjobs.<FILL IN>

### 8.2 Transforming in Key-Value Pairs

We now need to transform our dataset to keep only the project and the requested walltime. The requested walltime field is a string, so we use `int()` to convert it to an integer.

In [None]:
project_walltime = moabjobends.map(lambda entry: (entry[-1], int(entry[8])))
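To check the indices used in the lambda, the same extraction can be run locally on two of the sample lines from section 2. This is a sketch on plain Python lists rather than on the RDD; `to_pair` is a hypothetical name for the function passed to `map`.

```python
raw_lines = [
    "00:04:53 1388552691:2422701 job 10563883 JOBEND 3 24 mpourbag 1800 "
    "Completed short 1388551677 1388552392 1388552392 1388552670 bem-651-ac",
    "00:00:08 1388552403:2422699 job 10563398 JOBEND 1 8 shm 54000 "
    "Completed short 1388519145 1388541168 1388541168 1388552397 zcv-890-aa",
]

def to_pair(entry):
    """Return (project id, requested walltime in seconds) for one split record."""
    # entry[-1] is the last column (project id); entry[8] is the wallclock limit.
    return entry[-1], int(entry[8])

pairs = [to_pair(line.split()) for line in raw_lines]
print(pairs)
```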

#### Exercise 3
**Take a look at the first 5 elements of the new RDD to confirm the transformation is correct.**

In [None]:
project_walltime.<FILL IN>

Spark provides functions to work with key-value pairs. In our new dataset, the key is the project id and the wallclock time limit is the value. The `keys()` method of an RDD returns a new RDD composed only of the keys.

In [None]:
project_walltime.keys().take(5)

A `values()` method is also available.

In [None]:
project_walltime.values().take(5)

### 8.3 Aggregating the Results by Key

We want to compute the total requested walltime for each project. In order to do this, we use the `reduceByKey()` method. As its name states, this function expects the RDD to be composed of key-value pairs.

When called on a dataset of $(K, V)$ pairs, `reduceByKey()` returns a dataset of $(K, V)$ pairs where the values for each key are aggregated using the given reduce function *func*, which must be of type $(V,V) \rightarrow V$.

Since we want the total requested walltime per project, our aggregating function will be the addition.

In [None]:
agg_limits = project_walltime.reduceByKey(lambda x, y: x + y).cache()
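To make the semantics of `reduceByKey()` concrete, here is a single-machine sketch of the same aggregation on a small list of pairs. `reduce_by_key` is a hypothetical helper written for this illustration, not part of the Spark API, and the input values are made up from the sample log lines.

```python
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, a function of type (V, V) -> V."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)  # merge with the running value
        else:
            result[key] = value                     # first value seen for this key
    return sorted(result.items())

pairs = [("zcv-890-aa", 54000), ("bem-651-ac", 1800), ("zcv-890-aa", 54000)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)
```

Spark combines values within each partition and then across partitions, in no guaranteed order, so the reduce function should be associative and commutative. Addition satisfies both.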

`reduceByKey()` is a transformation, therefore the result is a new RDD. To visualize the entire content of the latter, we can use the `collect()` action.

In [None]:
agg_limits.collect()

### 8.4 Finding the Most Processing Hungry Projects

We now wish to determine the 5 projects that requested the most walltime in 2014. Several solutions are possible.

#### 8.4.1 Sort locally

In [None]:
top5 = sorted(agg_limits.collect(), key=lambda x: x[1], reverse=True)[:5]
print(top5)

#### 8.4.2 Ask Spark to sort the dataset using the `sortBy()` method.

In [None]:
top5 = agg_limits.sortBy(keyfunc=lambda x: x[1], ascending=False).take(5)
print(top5)

#### 8.4.3 Ask Spark for the top 5 using the `top()` method.

In [None]:
top5 = agg_limits.top(5, lambda x: x[1])
print(top5)

#### Exercise 4
**Can you think of another method to get the same result?**

### 8.5 Bar Chart

Since we are in an interactive notebook, we can use the Python plotting library `plotly` to plot our bar chart.

In [None]:
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

init_notebook_mode() # run at the start of every ipython notebook to use plotly.offline
                     # this injects the plotly.js source files into the notebook

We finally collect our data. We first sort the data in descending order of walltime limit. We then retrieve the keys of our RDD, which correspond to the project ids. Finally, we collect the values, which correspond to the total walltime limit per project.

In [None]:
sorted_agg_limits = agg_limits.sortBy(keyfunc=lambda x: x[1], ascending=False)
projects = sorted_agg_limits.keys().collect()
times = sorted_agg_limits.values().collect()

In [None]:
data = [
    go.Bar(
        x=projects,
        y=times,
    )
]

layout = go.Layout(
    title="Total requested walltime per project in 2014",
    xaxis=dict(
        title='Project ID',
        tickangle=-45
    ),
    yaxis=dict(
        title='Total requested walltime',
    ),    
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

## 9. Mini-Project

## Are users good at predicting their jobs' walltime?

We would now like to know if the users are good at predicting the walltime of their jobs. For this, you will first need to compute the effective walltime used for each project. Then, create a new chart where we can see the used walltime over the requested walltime.

While Moab provides the start and end times for each job, these timestamps are sometimes inaccurate or null. To work around this, we will first compute the beginning and ending timestamps of each job ourselves, based on the timestamps of the event log. Then, we will compute how long each job ran.

### Filtering to keep only relevant events

We are only interested in job events named `JOBSTART` or `JOBEND`.

**Create a new RDD based on `moabjobs` which should only contain events named `JOBSTART` or `JOBEND`**

### Manipulating time and date

Python provides the `datetime` module to manipulate dates and times and to parse timestamps.

In [None]:
from datetime import datetime

The function `fromtimestamp` can be used to transform an integer representing a Unix timestamp into a datetime object. Here is a simple example:

In [None]:
datetime.fromtimestamp(1388552691)

`datetime` objects can be subtracted to get a `timedelta`.

In [None]:
delta = datetime.fromtimestamp(1388552691) - datetime.fromtimestamp(1388552000)
delta

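`timedelta` objects have a `seconds` attribute that holds the seconds component of the delta. Whole days are stored separately in `days`, and the `total_seconds()` method returns the full duration in seconds.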

In [None]:
delta.seconds
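The distinction between `seconds` and `total_seconds()` matters for jobs that run longer than a day. The sketch below uses a made-up job lasting 25 hours. The timestamps are interpreted in UTC (an assumption made here so that the result does not depend on the local timezone).

```python
from datetime import datetime, timezone

start = datetime.fromtimestamp(1388552000, tz=timezone.utc)
end = datetime.fromtimestamp(1388552000 + 90000, tz=timezone.utc)  # 25 hours later

delta = end - start

# days and seconds are separate components; total_seconds() combines them
print(delta.days, delta.seconds, delta.total_seconds())
```

Here `delta.seconds` reports only the hour left over after the full day, so `total_seconds()` is the safer choice when computing job durations.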

**Transform the previous RDD so the new one will have the following structure:**
- key: (job id, project id)
- value: a dictionary with one entry
    - key: event name (i.e., `JOBSTART` or `JOBEND`)
    - value: timestamp as `datetime`
    
Fill in the missing parts of the function `convert`.

In [None]:
def convert(rec):
    job_id = <FILL_IN>
    event_name = <FILL_IN>
    timestamp = <FILL_IN>
    return job_id, {event_name : timestamp}

timestamps = <FILL_IN>

### Aggregating JOBSTARTs and JOBENDs

For the same job id, it is possible that multiple events with the same name were logged. We next want to keep only one `JOBSTART` and one `JOBEND` for each job id. To do so, we have written a small function that, given the previous RDD structure, merges entries in such a way that we keep the very first `JOBSTART` and the very last `JOBEND`.

The function is called `merge_jobtypes`.

In [None]:
def merge_jobtypes(a, b):
    c = {}
    if 'JOBSTART' in a or 'JOBSTART' in b:
        c['JOBSTART'] = min(a.get('JOBSTART', b.get('JOBSTART')), 
                            b.get('JOBSTART', a.get('JOBSTART')))
    if 'JOBEND' in a or 'JOBEND' in b:
        c['JOBEND'] = max(a.get('JOBEND', b.get('JOBEND')), 
                          b.get('JOBEND', a.get('JOBEND')))
    return c
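To see what the function does, here is a small local check on two hand-made dictionaries. The function is repeated so that the snippet runs on its own, and the dates are invented for the illustration.

```python
from datetime import datetime

def merge_jobtypes(a, b):
    # Same function as above: earliest JOBSTART, latest JOBEND
    c = {}
    if 'JOBSTART' in a or 'JOBSTART' in b:
        c['JOBSTART'] = min(a.get('JOBSTART', b.get('JOBSTART')),
                            b.get('JOBSTART', a.get('JOBSTART')))
    if 'JOBEND' in a or 'JOBEND' in b:
        c['JOBEND'] = max(a.get('JOBEND', b.get('JOBEND')),
                          b.get('JOBEND', a.get('JOBEND')))
    return c

# Two partial records for the same job (made-up timestamps)
a = {'JOBSTART': datetime(2014, 1, 1, 0, 5)}
b = {'JOBSTART': datetime(2014, 1, 1, 0, 2), 'JOBEND': datetime(2014, 1, 1, 1, 0)}

merged = merge_jobtypes(a, b)
print(merged)
```

The merged record keeps the earlier of the two `JOBSTART` values and the only `JOBEND` available. The argument order does not change the result.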

**Create a new RDD by applying the correct transformation using `merge_jobtypes` as the transformation function.**

**Filter out the jobs that do not have both a JOBSTART and a JOBEND**

**Compute wallclock time per job**

**Compute the total wallclock time per project**

**Build an RDD that will contain both the processed walltime and the total requested walltime per project.**

**Sort the RDD per total processed walltime in descending order.**

### Visualizing the Results

In [None]:
projects = project_con_pred.keys().collect()
times, limits = zip(*project_con_pred.values().collect())

data = [
    go.Bar(
        x=projects,
        y=times,
        name='processing time'
    ),
    go.Bar(
        x=projects,
        y=limits,
        name='predicted time'
    )    
]

layout = go.Layout(
    title="Walltime per project in 2014",
    xaxis=dict(
        title='Project ID',
        tickangle=-45
    ),
    yaxis=dict(
        title='Walltime',
    ),
#     barmode='stack'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

## 10. Ending the Analysis

Once the analysis is done, we need to tell Spark to free the resources and destroy the context using the `SparkContext`'s method `stop()`.

In [None]:
sc.stop()

## 11. Recap

In this notebook, we used and learned about the following parts of the
**[Python Spark API](http://spark.apache.org/docs/latest/api/python/)**:
1. Import Spark Python module:
**[`import pyspark`](http://spark.apache.org/docs/latest/api/python/pyspark.html)**
2. Create a SparkContext:
**[`pyspark.SparkContext()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext)**
3. Create an RDD from text files:
**[`SparkContext.textFile(path)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.textFile)**
4. Take the first *num* elements from an RDD:
**[`RDD.take(num)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.take)**
5. Count the number of elements in an RDD:
**[`RDD.count()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.count)**
6. Retrieve the first element of an RDD:
**[`RDD.first()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.first)**
7. Apply a transformation on each element of an RDD:
**[`RDD.map(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.map)**
8. Cache an RDD:
**[`RDD.cache()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.cache)**
9. Remove an RDD from memory:
**[`RDD.unpersist()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.unpersist)**
10. Filter an RDD:
**[`RDD.filter(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.filter)**
11. Merge the values for each key:
**[`RDD.reduceByKey(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.reduceByKey)**
12. Get all elements of an RDD:
**[`RDD.collect()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.collect)**
13. Sort the elements of an RDD:
**[`RDD.sortBy(func)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy)**
14. Get the top $N$ elements from an RDD:
**[`RDD.top(N)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.top)**
15. Join two RDDs by key:
**[`RDD.join(RDD)`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.join)**
16. End the SparkContext:
**[`SparkContext.stop()`](http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.stop)**

## 12. References

* [Berkeley AmpCamp 5 - Data Exploration Using Spark](http://ampcamp.berkeley.edu/5/exercises/data-exploration-using-spark.html)
* [edX - Introduction to Big Data with Apache Spark](https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x)
* [edX - Introduction to Big Data with Apache Spark (Github repo)](https://github.com/spark-mooc/mooc-setup)