# <center>Big Data &ndash; Exercises </center>
## <center>Fall 2025 &ndash; Week 8 &ndash; ETH Zurich</center>
## <center>YARN + Spark</center>

## Overview

In this exercise you will analyze YARN and core Spark concepts (architecture, RDDs, evaluation, caching, partitioning). The following is a brief summary of the exercises and their corresponding topics:

[0 – Preparation for the exercise](#0)

[1 – What is YARN?](#1)

[2 – Apache Spark Architecture](#2)

[3.1 – Understand resilient distributed datasets (RDD)](#31)

[3.2 – RDD transformations](#32)

[3.3 – RDD actions](#33)

[4 – Lazy evaluation](#4)

[5 – Persistence (Caching)](#5)

[6 – Working with Key/Value Pairs](#6)

[7 – Spark Partitioning](#7)

[8 – Converting a user program into tasks](#8)

[9 – Theory Exercises](#9)

[10 – The Great Language Game](#10)

[11 – TF-IDF in Spark (OPTIONAL)](#11)


<a id='0'></a>
## 0. Preparation for the exercise

1. Drag this notebook and everything else in the exercise08 folder into the `notebooks` folder of your exam magic box

2. Start docker with ```docker-compose up -d```

3. Launch Jupyter from the docker container

Note that you can also copy the data to your docker container using the CLI (<b>you may need to change the paths or the container name</b>):<br>

`docker cp fruits.txt exam-magic-box-v10-jupyter-1:/home/week8/` <br>
`docker cp yellowthings.txt exam-magic-box-v10-jupyter-1:/home/week8/`

Access the Jupyter notebook at http://localhost:8888/lab/tree/work/Exercise08_Spark.ipynb

<a id='1'></a>
## 1 &ndash; What is YARN?

**YARN** stands for “**Y**et **A**nother **R**esource **N**egotiator”. Fundamentally, it is a resource scheduler designed to work on existing and new Hadoop clusters.

YARN supports pluggable schedulers. The task of the scheduler is to share the resources of a large cluster among different tenants (applications) while trying to meet application demands (memory, CPU). A user may have several applications active at a time. 

### Exercise 1.1 &ndash; List at least 3 main shortcomings of MapReduce v1 that are addressed by YARN.

### Exercise 1.2 &ndash; State which of the following statements are true:

1. The ResourceManager has to provide fault tolerance for resources across the cluster 

1. Container allocation/deallocation can take place in a dynamic fashion as the application progresses. 

1. YARN plans to allow applications to only request resources in terms of memory usage and number of CPUs.

1. Communications between the ResourceManager and NodeManagers are heartbeat-based. 

1. The ResourceManager does not have a global view of all usage of cluster resources. Therefore, it tries to make better scheduling decisions based on probabilistic prediction. 

1. ResourceManager has the ability to request resources back from a running application.

### Exercise 1.3 &ndash; Whose responsibility is it? Say which component of YARN is responsible for each of the following tasks.

1. Fault Tolerance of running applications *[ResourceManager | ApplicationMaster | NodeManager ]*
1. Asking for resources needed for an application *[ResourceManager | ApplicationMaster | NodeManager ]*

1. Providing leases to use containers *[ResourceManager | ApplicationMaster | NodeManager]*

1. Tracking status and progress of running applications *[ResourceManager | ApplicationMaster | NodeManager]*

### Exercise 1.4 &ndash; What is the typical configuration for YARN? For each of the following components, choose how many instances there are in a cluster.

```
1. ResourceManager                  a. One per cluster

2. ApplicationMaster                b. One per node

3. NodeManager                      c. Many per cluster, but usually not per node

4. Container                        d. Many per node 
```

<a id='2'></a>
## 2 &ndash; Apache Spark Architecture

Spark is a cluster computing platform designed to be fast and general purpose. Spark extends the MapReduce model to efficiently cover a wide range of workloads that previously required separate distributed systems, including interactive queries and stream processing. Spark offers the ability to run computations in memory.

At a high level, every Spark application consists of a **driver program** that launches various parallel operations on a cluster. The driver program contains your application's main function and defines distributed datasets on the cluster, then applies operations to them.

Driver programs access Spark through a **SparkContext** object, which represents a connection to a computing cluster. There is no need to create a SparkContext; it is created for you automatically when you run the first code cell in the Jupyter notebook.

The driver communicates with a potentially large number of distributed workers called **executors**. The driver runs in its own process and each executor is a separate process. A driver and its executors are together termed a Spark application.

![Spark cluster overview](http://spark.apache.org/docs/latest/img/cluster-overview.png)

<a id='31'></a>
## 3.1 &ndash; Understand resilient distributed datasets (RDD)

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. 

##### What are RDD operations?
RDDs offer two types of operations: **transformations** and **actions**.

* **Transformations** create a new dataset (still an RDD) from an existing one. Transformations are lazy, meaning that no transformation is executed until you execute an action.
* **Actions** compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS). Actions trigger the execution of all requested transformations.


Transformations and actions are different because of the way Spark computes RDDs. Although you can define new RDDs any time, Spark computes them only in a **lazy** fashion, that is, the first time they are used in an action. We are going to talk more about this in a bit.

##### Create Spark session (Spark install within jupyter docker image) and context

In [1]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

#### How do I make an RDD?

RDDs can be created from stable storage or by transforming other RDDs. In this exercise, we will run the cells below to create RDDs from local files. Generally it is possible to read data from other resources using the following tokens:

* `file`: Read from the local file system.
* `hdfs`: Read from a Hadoop Distributed File System.
* `s3`  : Read from AWS S3 Storage.
* `wasb`: Read from Azure Blob Storage.

In [2]:
# sc is the SparkContext obtained from the SparkSession created above
fruits = sc.textFile('fruits.txt')
yellowThings = sc.textFile('yellowthings.txt')

<a id='32'></a>
## 3.2 &ndash; RDD transformations
Following are examples of some of the common transformations available. For a detailed list, see [RDD Transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations).

Run some transformations below to understand this better.

**Note1:** the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! \
**Note2:** If some of the queries are taking too long to complete, try restarting the kernel, and rerunning the cell *above*.

In [5]:
# map
fruitsReversed = fruits.map(lambda fruit: fruit[::-1])
fruitsReversed.collect()

In [6]:
# filter
shortFruits = fruits.filter(lambda fruit: len(fruit) <= 5)
shortFruits.collect()

In [7]:
# flatMap
characters = fruits.flatMap(lambda fruit: list(fruit))
characters.collect()

In [9]:
# union between fruits and yellowthings datasets
fruitsAndYellowThings = fruits.union(yellowThings)
fruitsAndYellowThings.collect()

In [10]:
# intersection between fruits and yellowthings datasets
yellowFruits = fruits.intersection(yellowThings)
yellowFruits.collect()

In [11]:
# distinct elements in the two datasets
distinctFruitsAndYellowThings = fruitsAndYellowThings.distinct()
distinctFruitsAndYellowThings.collect()

In [12]:
# groupByKey
yellowThingsByFirstLetter = yellowThings.map(lambda thing: (thing[0], thing)).groupByKey()
for letter, lst in yellowThingsByFirstLetter.collect():
  print("For letter", letter)
  for obj in lst:
      print(" > ", obj)

In [13]:
# reduceByKey; key is the number of characters of the fruit name (len(fruit))
numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
numFruitsByLength.collect()

<a id='33'></a>
## 3.3 &ndash; RDD actions
Following are examples of some of the common actions available. For a detailed list, see [RDD Actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions).

Run some actions below to understand this better.

In [14]:
# collect
fruitsArray = fruits.collect()
yellowThingsArray = yellowThings.collect()
print(fruitsArray)
print(yellowThingsArray)

In [15]:
# count - how many fruits there are
numFruits = fruits.count()
numFruits

In [16]:
# take - show the first three fruits
first3Fruits = fruits.take(3)
first3Fruits

In [17]:
# reduce - what letters are used
letterSet = fruits.map(lambda fruit: set(fruit)).reduce(lambda x, y: x.union(y))
# Note1: the set() operation converts each string into a set of unique characters.
# Note2: The union used here is a set union, meaning it combines the elements of two sets and removes duplicates automatically.
letterSet

<a id='4'></a>
## 4 &ndash; Lazy evaluation
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling `map()`), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as
consisting of instructions on how to compute the data that we build up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are. So, when we call `sc.textFile()`, the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can
occur multiple times.
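
The idea of "recording instructions instead of data" can be illustrated with a small analogy in plain Python. This is not Spark code: it only uses Python generators, and the names `load`, `pipeline` and `read_log` are made up for this sketch. Building the pipeline reads nothing; consuming it (the analogue of an action) triggers the reads, and consuming a fresh pipeline reads everything again.

```python
# Plain-Python analogy of lazy evaluation (no Spark involved).
read_log = []  # records every "read" so we can see when work actually happens

def load():
    # Analogous to sc.textFile(): nothing is read until someone iterates.
    for line in ["apple", "banana", "kiwi"]:
        read_log.append(line)
        yield line

def pipeline():
    # Analogous to chaining filter() and map(): this only builds a recipe.
    return (fruit.upper() for fruit in load() if len(fruit) <= 5)

recipe = pipeline()
reads_before_action = len(read_log)   # still 0: no data has been read

first_result = list(recipe)           # analogous to an action such as collect()
reads_after_first = len(read_log)     # the three lines have now been read

second_result = list(pipeline())      # a second "action" re-runs the recipe
reads_after_second = len(read_log)    # the input was read a second time

print(reads_before_action, reads_after_first, reads_after_second)
print(first_result)
```

The second read of the input is what persistence avoids in Spark: without it, every action walks the recorded instructions from the start.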


Finally, as you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. For instance, the code below corresponds to the following graph:

<img src="https://polybox.ethz.ch/index.php/s/TH3rEYtgRDByKqE/download/Stage5.png" width="400">

**Remember that in order to obtain data from the node on the lineage graph (our RDD of interest) you have to call an action!**

In [18]:
apples = fruits.filter(lambda x: "apple" in x)
lemons = yellowThings.filter(lambda x: "lemon" in x)
applesAndLemons = apples.union(lemons)
print(applesAndLemons.collect())
print(applesAndLemons.toDebugString().decode("utf-8")) # decode used for nice formatting

### 4.1 Exercise

1. What does the code below do?
1. Draw the lineage graph for the code
1. List actions and transformations used in it
1. When are all computations executed?
1. If we call `result.collect()` again, what will Spark do to perform the action? 

In [21]:
text = sc.textFile('fruits.txt')
words = text.flatMap(lambda x: x.split(" "))
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
result.collect()

In [20]:
print(result.toDebugString().decode("utf-8")) # decode used for nice formatting

<a id='5'></a>
## 5 &ndash; Persistence (Caching)

Spark's RDDs are by default recomputed each time you run an action on
them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using `RDD.persist()`. After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in your cluster), and reuse them in future actions. Persisting RDDs on disk instead of memory is also possible.

If you attempt to cache too much data to fit in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed,
while for the memory-and-disk ones, it will write them out to disk. In either case, this means that you don't have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can lead to eviction of useful data
and more recomputation time. Finally, RDDs come with a method called `unpersist()` that lets you manually remove them from the cache.

Please note that both `persist()` and `cache()` are lazy operations themselves (`cache()` is a simple wrapper that calls `persist(storageLevel=StorageLevel.MEMORY_ONLY)`; see [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.cache.html) for details). The caching will, in fact, only take place when the first action is called. Subsequent actions then use the cached RDD.

To give a simple example to motivate the use of caching, let us consider the following case. Assume we have a sample of points in our RDD and we want to compute different statistics: mean, minimum and maximum values. However, we also want to first prepare our data by doing some heavy preprocessing. If we do not use caching, we have to recompute the preprocessing stage for each statistic, whereas caching the preprocessed RDD lets us preprocess the data only once.

### 5.1 Exercise:
1. Write some code which can benefit from caching.
1. Where should we ask Spark to persist the RDD in Exercise 4.1 to prevent it from re-executing the code when we call `collect()` again?

#### Solution

In [None]:
# Solution 1
fruits = sc.textFile('fruits.txt')

In [None]:
# Solution 2
...

<a id='6'></a>
## 6 &ndash; Working with Key/Value Pairs


Spark provides special operations on RDDs containing key/value pairs. These RDDs
are called *pair RDDs*. Pair RDDs are a useful building block in many programs, as
they expose operations that allow you to act on each key in parallel or regroup data
across the network. For example, pair RDDs have a `reduceByKey()` method that can
aggregate data separately for each key, and a `join()` method that can merge two
RDDs together by grouping elements with the same key. Pair RDDs are also still RDDs. 
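
To make the semantics of `join()` concrete, here is a minimal sketch in plain Python (not Spark). The helper `join_pairs` is hypothetical and only mimics what an inner join on pair RDDs returns: one `(key, (left_value, right_value))` element for every combination of values that share a key. In Spark the order of the output is not guaranteed, whereas this sketch simply follows the order of the left input.

```python
from collections import defaultdict

def join_pairs(left, right):
    """Inner join of two lists of (key, value) pairs, in the spirit of RDD.join()."""
    # Index the right side by key so each left element finds its matches quickly.
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Emit one output pair per matching (left value, right value) combination.
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

prices = [("apple", 3), ("banana", 2)]              # made-up sample data
colors = [("apple", "red"), ("apple", "green"), ("lemon", "yellow")]

joined = join_pairs(prices, colors)
print(joined)  # [('apple', (3, 'red')), ('apple', (3, 'green'))]
```

Keys that appear on only one side (`banana`, `lemon`) are dropped, which is the inner-join behaviour.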

In [None]:
# Example
rdd = sc.parallelize([("key1", 0) ,("key2", 3),("key1", 8) ,("key3", 3),("key3", 9)])
rdd2 = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(rdd2.collect())
print(rdd2.toDebugString().decode("utf-8"))

### 6.1 Exercise
1. What does the code above do? 

Now consider the following RDD `sales_rdd` defined in the cell below.

In [31]:
sales_rdd = sc.parallelize([
    ("productA", 5),
    ("productB", 3),
    ("productA", 2),
    ("productC", 7),
    ("productB", 1),
    ("productC", 4),
    ("productC", 2),
    ("productD", 12),
    ("productA", 1),
    ("productB", 8),
    ("productB", 9)
])


2. Write a Spark program to compute the total quantity sold for each product.
3. Extend your program to also calculate the average quantity per transaction for each product, and return an RDD with the following format `[("productA", (total_quantity, average_quantity)), ...]`

<a id='7'></a>
## 7 &ndash; Spark Partitioning
Spark programs can choose to control their RDDs' partitioning
to reduce communication. Partitioning is not helpful in all applications:
for example, if a given RDD is scanned only once, there is no point in partitioning it in
advance. It is useful only when a dataset is reused multiple times in key-oriented
operations such as joins.

Spark's partitioning is available on all RDDs of key/value pairs, and causes the system
to group elements based on a function of each key. Although Spark does not give
explicit control of which worker node each key goes to (partly because the system is
designed to work even if specific nodes fail), it lets the program ensure that a set of
keys will appear together on some node.


Many of Spark's operations involve shuffling data by key across the network. All of
these will benefit from partitioning. Examples of operations that benefit from
partitioning are `cogroup()`, `groupWith()`, `join()`, `leftOuterJoin()`, `rightOuterJoin()`, `groupByKey()`, `reduceByKey()`, `combineByKey()`, and `lookup()`.

By default PySpark uses hash partitioning as the partitioning function. A way to define a custom partition is by using the function `partitionBy()`. To use `partitionBy()` the RDD must consist of tuple objects. This function is a transformation, therefore a new RDD will be returned. In the following example we are going to see a default partitioning scheme of Spark as well as a custom partitioning.

Partitioning allows some Spark code to run more efficiently. In particular, key-preserving operations on an already partitioned pair RDD can avoid shuffling: `mapValues` preserves the existing partitioning, and `reduceByKey` does not need to move data across the cluster when it reuses the same partitioner and number of partitions.

In [None]:
nums = [(1, 1), (2, 2), (3, 3)]

In [None]:
pairs = sc.parallelize(nums)

In [None]:
print("Number of partitions: {}".format(pairs.getNumPartitions()))
print("Partitions structure: {}".format(pairs.glom().collect()))

Let's try to define a custom partitioning now.

In [None]:
pairs = sc.parallelize(nums).partitionBy(2)

In [None]:
print("Number of partitions: {}".format(pairs.getNumPartitions()))
print("Partitions structure: {}".format(pairs.glom().collect()))
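
To see why keys end up in a given partition, the following plain-Python sketch (not Spark) mimics hash partitioning: each pair goes to partition `hash(key) % num_partitions`. The helper `hash_partition` is hypothetical, and using Python's built-in `hash` is an assumption that is reasonable for small integer keys like the ones above; PySpark uses its own portable hash function by default.

```python
def hash_partition(pairs, num_partitions, partition_func=hash):
    """Distribute (key, value) pairs over partitions by hashing the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # All pairs with the same key always land in the same partition.
        index = partition_func(key) % num_partitions
        partitions[index].append((key, value))
    return partitions

nums = [(1, 1), (2, 2), (3, 3)]
parts = hash_partition(nums, 2)
print(parts)  # [[(2, 2)], [(1, 1), (3, 3)]]
```

Passing a different `partition_func` changes the placement, which is the idea behind supplying a custom partitioning function.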

<a id='8'></a>
## 8 &ndash; Converting a user program into tasks

A Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical **directed acyclic graph (DAG)** of operations.
When the driver runs, it converts this logical graph into a physical execution plan.

Spark performs several optimizations, such as "pipelining" map transformations together to merge them, and converts the execution graph into a set of **stages**.
Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.

Each RDD maintains a pointer to one or more parents along with metadata about what
type of relationship they have. For instance, when you call `b = a.map(...)` on an
RDD, the RDD `b` keeps a reference to its parent `a`. These pointers allow an RDD to be
traced to all of its ancestors.

The following phases occur during Spark execution:
* User code defines a DAG (directed acyclic graph) of RDDs. Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.
* Actions force translation of the DAG to an execution plan. When you call an action on an RDD, it must be computed. This requires computing its parent RDDs as well. 
* Spark's scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage will correspond to one or more RDDs in the DAG. A single stage can correspond to multiple RDDs due to pipelining.
* Tasks are scheduled and executed on a cluster.
* Stages are processed in order, with individual tasks launching to compute segments of the RDD. Once the final stage is finished in a job, the action is complete.
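
How a chain of operations is cut into stages can be sketched in plain Python (not Spark). This is a deliberately simplified model: the helper `split_into_stages` is hypothetical, and whether an operation needs a shuffle is hard-coded here as an assumption, while Spark derives it from the dependencies between RDDs. Consecutive operations that need no shuffle are pipelined into one stage, and each shuffle starts a new stage.

```python
def split_into_stages(operations):
    """Group a linear chain of (name, needs_shuffle) operations into stages."""
    stages = [[]]
    for name, needs_shuffle in operations:
        # A shuffle marks a stage boundary: start a new stage before this operation.
        if needs_shuffle and stages[-1]:
            stages.append([])
        stages[-1].append(name)
    return stages

# Made-up chain; the shuffle flags are assumptions for illustration only.
chain = [
    ("parallelize", False),
    ("map", False),
    ("filter", False),
    ("groupByKey", True),
    ("mapValues", False),
    ("distinct", True),
]

stages = split_into_stages(chain)
for number, stage in enumerate(stages):
    print("Stage", number, "->", " + ".join(stage))
```

In this model `map` and `filter` share a stage with the input, which corresponds to the pipelining described above.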

If you visit the application's web UI, you will see how many stages occur in order to
fulfill an action. The Spark UI can be accessed via the Azure Portal; see [Spark job debugging](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-job-debugging).

<a id='9'></a>
## 9 &ndash; Theory Exercises 

1. Why is Spark faster than Hadoop MapReduce?
1. What are scenarios in which one would still prefer Hadoop MapReduce to Spark?
1. Study the examples above via Spark UI. Observe how many stages they have. 
1. Which of the graphs below are DAGs?

<img src="https://polybox.ethz.ch/index.php/s/P4GMvUmQDLjFTRK/download" style="width: 700px;">

### 9.1 True or False
Say if the following statements are *true* or *false*, and explain why.

1. Each RDD is split into multiple partitions, which may be processed on different nodes of the cluster.
1. Transformations construct a new RDD from a previous one and immediately calculate the result.
1. Spark's RDDs are by default recomputed each time you run an action on them
1. After computing an RDD, Spark will store its contents in memory and reuse them in future actions.
1. When you derive new RDDs using transformations, Spark keeps track of the set of dependencies between different RDDs.

<a id='10'></a>
## 10 &ndash; The Great Language Game

We will now explore the dataset that is going to be used in the graded exercise of this week. Follow the instructions below to build ```confusion-part.json```.

1. Move to the `notebooks` folder in the terminal
2. Download the data: <br>
   ```wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2``` <br>
   __or__ <br>
   ```curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2```
3. Extract the data: <br>
   ```tar -jxvf confusion-2014-03-02.tbz2```
4. Change directory to ```confusion-2014-03-02```
5. Extract the part of the dataset that we will work with in this exercise: <br>
   ```head -n 3000000 confusion-2014-03-02.json > confusion-part.json```

Alternatively, you can download the preprocessed dataset from our [Polybox](https://polybox.ethz.ch/index.php/s/zgZY2dCPAXn6HA3).

Next, load the data.

In [None]:
path = "./confusion-part.json"
dataset = sc.textFile(path)

Since the entries are JSON records, you will need to parse them and use their respective object representations. You can use this mapping for all queries. 

**For the rest of the exercise, fill in the results by running the queries on the created subset instead of the entire dataset.**

In [None]:
import json

entries = dataset.map(json.loads).cache()
print(entries)

And test it. Is it working? You have probably noticed that we are just declaring RDDs without evaluating them. Now let's evaluate it using the `take` action and look at the JSON objects. We are going to look at one of the entries for which the target language is German.

In [None]:
target_german = entries.filter(lambda e: e["target"] == "German").take(1)
# We use json.dumps() to turn the entry into a nicely formatted string
print(json.dumps(target_german, indent = 4))

Good! Let's get to work. A few last things:
- Take into account that some of the queries might have very large outputs, which Jupyter (or sometimes even Spark) won't be able to handle. It is normal for the queries to take some time, but if the notebook crashes or stops responding, try restarting the kernel. Avoid printing large outputs. You can print the first few entries to confirm the query has worked, as shown in query 1 with ```take(n)```.
- Refer to the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html), as well as the programming guides on actions and transformations linked to above.

And now to the actual queries: *Please make sure that in your queries you **only** use the Spark RDD API, and avoid any dataframes (they will be covered in next week's exercises).*

### 10.1 How many games have been played by Germans (country = "DE")?

### 10.2 Find the number of games such that the target language is German and such that the country is Germany (DE). 

### 10.3 Find the percentage of games where Germans got to hear their native language. Round your answer to an integer. For example: 15, NOT 0.146, 14.6, 14, 14%, 15.1

### 10.4 Find games where Swiss players (country = CH) got to hear one of the official Swiss languages (target is in `official_swiss_languages`). Group the number of games by language and return the language with the highest number of games.

**Hint**: use `countByKey`.

In [None]:
official_swiss_languages = ["German", "French", "Italian", "Romansh"]

### 10.5 For each country, compute the percentage of games with more than 3 choices (i.e. *len(choices)* > *3*) relative to all games for that country, and display the pairs `(country, percentage)` in the descending order of the percentage. What is the second country in this list? 

**Hint**: use `groupByKey` to aggregate per country and `mapValues` to compute the percentage.

<a id='11'></a>
## 11 &ndash; TF-IDF in Spark (OPTIONAL)
In this exercise you will implement a simple query engine over the Gutenberg dataset using Spark.
The [Gutenberg dataset](https://www.gutenberg.org/) consists of 3036 free ebooks. The goal of this exercise is to develop a search engine to find the most relevant books given a text query.

### 11.1 Get the data
1. You can download the dataset (the smallest one) from: https://zenodo.org/record/3360392

2. Unzip and put all .txt files under a directory gutenberg

3. Copy the directory to your docker container

### 11.2 Understand TF-IDF

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a statistic to determine the relative importance of the words in a set of documents. It is computed as the product of two statistics, term frequency (`tf`) and inverse document frequency (`idf`).

Given a word `t`, a document `d` (in this case, a book) and the collection of all documents `D`, we can define `tf(t, d)` as the number of times `t` appears in `d`. This gives us some information about the content of a document, but because some terms (e.g. "the") are so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms.

The inverse document frequency `idf(t, D)` is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. It can be computed as:

<img src="https://polybox.ethz.ch/index.php/s/3G8rL9Ni6f6G3Z3/download" width="300">

where $|D|$ is the total number of documents and the denominator represents how many documents contain the word $t$ at least once. However, this would cause a division-by-zero exception if the user queries a word that never appears in the dataset. A better formulation would be:

<img src="https://polybox.ethz.ch/index.php/s/EsAMYdFi6899szA/download" width="300">

Then, `tfidf(t, d, D)` is calculated as follows:

<img src="https://polybox.ethz.ch/index.php/s/Xzqw5fyRg3RDQ9x/download" width="300">

A high weight in `tfidf` is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.
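
The definitions can be tried out on a toy corpus in plain Python (not Spark). This is a minimal sketch under stated assumptions: the smoothed `idf` is taken to be `log(|D| / (1 + df))` with the natural logarithm, which may differ from the exact form in the figures above, and the three tiny "documents" are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up): document name -> list of tokens.
corpus = {
    "d1": "the cat sat".split(),
    "d2": "the dog sat".split(),
    "d3": "the cat ran".split(),
}

def tf(term, doc_tokens):
    """Number of times `term` appears in one document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Smoothed inverse document frequency; assumed form log(|D| / (1 + df))."""
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / (1 + df))

def tfidf(term, doc_name, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, corpus[doc_name]) * idf(term, corpus)

for term in ["the", "dog", "unicorn"]:
    scores = {name: round(tfidf(term, name, corpus), 3) for name in sorted(corpus)}
    print(term, scores)
```

The rare word "dog" scores highest in the only document that contains it, the unseen word "unicorn" causes no division by zero, and with this particular smoothing a word that occurs in every document ("the") receives a negative weight.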

Having already implemented TF-IDF last week in pseudocode, this week we are going to implement it in Spark. The following code snippet imports the whole dataset into an RDD.

In [34]:
# sc is automatically defined as SparkContext
# docs will be an RDD in the format [(docName, content)]
docs = sc.wholeTextFiles("gutenberg/*.txt", minPartitions=100)

# number of documents in the folder
docs_number = docs.count()

# display the [(docName, content)] values
#docs.collect()

#### TF-IDF solution code