# <center>Big Data for Engineers – Exercises</center>
## <center>Spring 2022 – Week 8 – ETH Zurich</center>
## <center>Spark </center>

## 1. Start Docker

In your exercise 08 directory, start Docker:

```
docker compose up
```

After Docker finishes downloading the images, you should be able to open the Jupyter notebook by copying the following URL into your browser:

```
http://127.0.0.1:8888/
```

## 2. Apache Spark Architecture

Spark is a cluster computing platform designed to be fast and general purpose. Spark extends the MapReduce model to efficiently cover a wide range of workloads that previously required separate distributed systems, including interactive queries and stream processing. Spark offers the ability to run computations in memory.

At a high level, every Spark application consists of a **driver program** that launches various parallel operations on a cluster. The driver program contains your application's main function and defines distributed datasets on the cluster, then applies operations to them.

Driver programs access Spark through a **SparkContext** object, which represents a connection to a computing cluster. There is no need to create a SparkContext; it is created for you automatically when you run the first code cell in the Jupyter notebook.

The driver communicates with a potentially large number of distributed workers called **executors**. The driver runs in its own process and each executor is a separate process. A driver and its executors are together termed a Spark application.

![Spark cluster overview](http://spark.apache.org/docs/latest/img/cluster-overview.png)

### 2.1 Understand resilient distributed datasets (RDD)

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. 

##### What are RDD operations?
RDDs offer two types of operations: **transformations** and **actions**.

* **Transformations** create a new dataset from an existing one. Transformations are lazy, meaning that no transformation is executed until you execute an action.
* **Actions** compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS).




Transformations and actions are different because of the way Spark computes RDDs. Although you can define new RDDs any time, Spark computes them only in a **lazy** fashion, that is, the first time they are used in an action.

##### How do I make an RDD?

RDDs can be created from stable storage or by transforming other RDDs. In this exercise, we will run the cells below to create RDDs from local files. In general, data can also be read from other sources by prefixing the path with one of the following URI schemes:

* `file`: Read from the local file system.
* `hdfs`: Read from a Hadoop Distributed File System.
* `s3`  : Read from AWS S3 Storage.
* `wasb`: Read from Azure Blob Storage.

In [None]:
from pyspark.context import SparkContext
# sc is the Spark Context object 
sc = SparkContext('local', 'test')

In [None]:
# sc is the SparkContext object created in the previous cell
fruits = sc.textFile('./fruits.txt')
yellowThings = sc.textFile('./yellowthings.txt')

##### RDD transformations
The following are examples of some of the common transformations available. For a detailed list, see [RDD Transformations](https://spark.apache.org/docs/2.0.0/programming-guide.html#transformations).

Run some transformations below to understand this better.

**Note:** If some of the queries are taking too long to complete, try restarting the kernel, and rerunning the cell *above*.

In [None]:
# map
fruitsReversed = fruits.map(lambda fruit: fruit[::-1])
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
fruitsReversed.collect()

In [None]:
# filter
shortFruits = fruits.filter(lambda fruit: len(fruit) <= 5)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
shortFruits.collect()

In [None]:
# flatMap
characters = fruits.flatMap(lambda fruit: list(fruit))
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
characters.collect()

In [None]:
# union
fruitsAndYellowThings = fruits.union(yellowThings)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
fruitsAndYellowThings.collect()

In [None]:
# intersection
yellowFruits = fruits.intersection(yellowThings)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
yellowFruits.collect()

In [None]:
# distinct
distinctFruitsAndYellowThings = fruitsAndYellowThings.distinct()
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
distinctFruitsAndYellowThings.collect()

In [None]:
# groupByKey
yellowThingsByFirstLetter = yellowThings.map(lambda thing: (thing[0], thing)).groupByKey()
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
for letter, lst in yellowThingsByFirstLetter.collect():
    print("For letter", letter)
    for obj in lst:
        print(" > ", obj)

In [None]:
# reduceByKey
numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
numFruitsByLength.collect()

##### RDD actions
Following are examples of some of the common actions available. For a detailed list, see [RDD Actions](https://spark.apache.org/docs/2.3.0/programming-guide.html#actions).

Run some actions below to understand this better.

In [None]:
# collect
fruitsArray = fruits.collect()
yellowThingsArray = yellowThings.collect()
print(fruitsArray)
print(yellowThingsArray)

In [None]:
# count
numFruits = fruits.count()
numFruits

In [None]:
# take
first3Fruits = fruits.take(3)
first3Fruits

In [None]:
# reduce
letterSet = fruits.map(lambda fruit: set(fruit)).reduce(lambda x, y: x.union(y))
letterSet

##### Lazy evaluation
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling `map()`), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data that we build up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are. So, when we call `sc.textFile()`, the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can occur multiple times.


Finally, as you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. For instance, the code below corresponds to the following graph:

<img src="resources/stages.png" style="width: 300px;">

In [None]:
apples = fruits.filter(lambda x: "apple" in x)
lemons = yellowThings.filter(lambda x: "lemon" in x)
applesAndLemons = apples.union(lemons)
print(applesAndLemons.collect())
print(applesAndLemons.toDebugString())

##### Persistence (Caching)

Spark's RDDs are by default recomputed each time you run an action on
them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using `RDD.persist()`. After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in your cluster), and reuse them in future actions. Persisting RDDs on disk instead of memory is also possible.

If you attempt to cache too much data to fit in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed,
while for the memory-and-disk ones, it will write them out to disk. In either case, this means that you don't have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can lead to eviction of useful data
and more recomputation time. Finally, RDDs come with a method called `unpersist()` that lets you manually remove them from the cache.


##### Working with Key/Value Pairs


Spark provides special operations on RDDs containing key/value pairs. These RDDs
are called *pair RDDs*. Pair RDDs are a useful building block in many programs, as
they expose operations that allow you to act on each key in parallel or regroup data
across the network. For example, pair RDDs have a `reduceByKey()` method that can
aggregate data separately for each key, and a `join()` method that can merge two
RDDs together by grouping elements with the same key. Pair RDDs are also still RDDs. 



In [None]:
# Example
rdd = sc.parallelize([("key1", 0) ,("key2", 3),("key1", 8) ,("key3", 3),("key3", 9)])
rdd2 = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(rdd2.collect())
print(rdd2.toDebugString())

##### Converting a user program into tasks

A Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical **directed acyclic graph (DAG)** of operations.
When the driver runs, it converts this logical graph into a physical execution plan.

Spark performs several optimizations, such as "pipelining" map transformations together to merge them, and converts the execution graph into a set of **stages**.
Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.

Each RDD maintains a pointer to one or more parents along with metadata about what
type of relationship they have. For instance, when you call `b = a.map(...)` on an
RDD `a`, the RDD `b` keeps a reference to its parent `a`. These pointers allow an RDD to be
traced to all of its ancestors.

The following phases occur during Spark execution:
* User code defines a DAG (directed acyclic graph) of RDDs. Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.
* Actions force translation of the DAG to an execution plan. When you call an action on an RDD, it must be computed. This requires computing its parent RDDs as well. 
* Spark's scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage will correspond to one or more RDDs in the DAG. A single stage can correspond to multiple RDDs due to pipelining.
* Tasks are scheduled and executed on a cluster.
* Stages are processed in order, with individual tasks launching to compute segments of the RDD. Once the final stage is finished in a job, the action is complete.

## 3. The Great Language Game

Now, you will get to write some queries yourself on a larger dataset. You will be using the [language confusion dataset](https://quietlyamused.org/blog/2014/03/12/language-confusion/).

This exercise is a little bit different, in that it is part of a small project you will be doing over the following 3 weeks to compare Spark, Spark with DataFrames/SQL, and Sparksoniq. You will hear more about it in the coming weeks.

Apart from that, you will have to submit the results of this exercise to Moodle to obtain the weekly bonus. You will need four things:
- The query you wrote
- Something related to its output (which you will be graded on)
- The time it took you to write it
- The time it took you to run it

On your own laptop, download and decompress the dataset into the ex08 folder using the commands below. You can also copy the URL into your browser to download it, then decompress it using the default decompression tools on Windows/Mac. Alternatively, you can run the commands in the Jupyter notebook, but decompressing the dataset inside the Docker container takes several minutes.

```
wget https://cloud.inf.ethz.ch/s/a8FoHew6dHKGYKK/download/confusion20140302.tbz2
tar -jxvf confusion20140302.tbz2
```

In [None]:
data = sc.textFile('./confusion-2014-03-02/confusion-2014-03-02.json')

Since the entries are JSON records, you will need to parse them and use their respective object representations. You can use this mapping for all queries. Since some of the queries take a long time to execute on the dataset, you may want to answer these queries on the first `100000` entries. 

**For the quiz, fill in the results by running the queries on the 100000-entry subset (test_entries as defined in the following cell) instead of the entire dataset.**

In [None]:
import json

testset = sc.parallelize(data.take(100000))
test_entries = testset.map(json.loads)

# entries = data.map(json.loads)
print(testset) 
print(test_entries)
# print(entries)

And test it. Is it working? You have probably noticed that we are just declaring RDDs without evaluating them. Now let's evaluate one using the `take` action and look at the JSON objects.

In [None]:
target_german = test_entries.filter(lambda e: e["target"] == "German").take(1)
print(json.dumps(target_german, indent = 4))

Good! Let's get to work. A few last things:
- Take into account that some of the queries might have very large outputs, which Jupyter (or sometimes even Spark) won't be able to handle. It is normal for the queries to take some time, but if the notebook crashes or stops responding, try restarting the kernel. Avoid printing large outputs. You can print the first few entries to confirm the query has worked, as shown in query 1.
- Remember to delete the cluster if you want to stop working! You can recreate it using the same container name and your resources will still be there.
- Refer to the [documentation](http://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.RDD), as well as the programming guides on actions and transformations linked to above.

And now to the actual queries. *Please make sure that your queries use **only** the PySpark RDD API and avoid DataFrames (they will be covered in next week's exercises).*

1\. Find all games such that the guessed language is correct (=target), and such that this language is Russian. What is the length of the resulting sequence?

In [None]:
import time

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

2\. Find the number of all distinct values of the *target* languages (i.e. the *target* field). What is the length of the resulting sequence?

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

3\. Return the top three games where the guessed language is correct (=target) ordered by language (ascending), then country (ascending), then date (ascending). What is the date of the 3rd item in the list? Enter it without quotes, for example 2013-10-02 

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

4\. Aggregate all games by country and target language, counting the number of guessing games that were done for each pair (country, target). How many guesses have been made for Maltese from the Netherlands (NL, Maltese)?

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

5\. Among all the games where the guess was correct (=target), what is the percentage of cases where the first choice (among the array of possible answers) was the target?

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

6\. For each target language, compute the percentage of successful guess games (i.e. *guess* == *target*) relative to all games for that target language, and display the pairs `(target_language, percentage)` in ascending order of the percentage. What is the second language in this list? 

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

7\. How many games were played on the last day? 

In [None]:
start = time.time()
# Query:

end = time.time()
print('Time consumption {} sec'.format(end - start))

## 4. Exercise

1. Why is Spark faster than Hadoop MapReduce?
2. Which of the graphs below are DAGs?
<img src="resources/dags.png" style="width: 700px;">

## 5. True or False
Say if the following statements are *true* or *false*, and explain why.

1. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
1. Transformations construct a new RDD from a previous one and immediately calculate the result.
1. Spark's RDDs are by default recomputed each time you run an action on them.
1. After computing an RDD, Spark will store its contents in memory and reuse them in future actions.
1. When you derive new RDDs using transformations, Spark keeps track of the set of dependencies between different RDDs.