# <center>Big Data – Exercises</center>
## <center>Fall 2025 – Week 8 – ETH Zurich</center>
## <center>Notebook for the Spark RDD Moodle Quiz</center>

## Start docker

1. Drag this notebook (and everything else in the exercise08 folder) into the `notebooks` folder of your exam magic box

2. Start Docker with ```docker-compose up -d```

3. Launch Jupyter from the Docker container

In [None]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

## The Moodle Quiz

For this quiz we will be using the [language confusion dataset](https://lars.yencken.org/datasets/great-language-game).


For the Moodle quiz we ask you to submit the following things:
- The query you wrote (although it is not graded, having your query is helpful for arguing about points)
- Something related to its output (**the only part that is graded**)

You don't need to submit this notebook, only the queries you wrote.

You will need the same language game dataset used in the exercises. If you have not yet created the `confusion-part.json` file, follow these instructions.

1. Move to the `notebooks` folder in the terminal
2. Download the data: <br>
   ```wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2``` <br>
   __or__ <br>
   ```curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2```
3. Extract the data: <br>
   ```tar -jxvf confusion-2014-03-02.tbz2```
4. Change directory to ```confusion-2014-03-02```
5. Extract the part of the dataset that we will work with in this exercise: <br>
   ```head -n 3000000 confusion-2014-03-02.json > confusion-part.json```

Alternatively, you can download the preprocessed dataset from our [Polybox](https://polybox.ethz.ch/index.php/s/zgZY2dCPAXn6HA3).

## Preprocessing commands
In your newly created notebook, run these commands to load the dataset into an RDD:

In [None]:
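# Make sure confusion-part.json is in the same folder as this notebook!
path = "confusion-part.json"
# Read the file as an RDD of text lines (one JSON record per line)
raw_data = sc.textFile(path)
# Parse each line into a Python dict and cache the result for reuse across queries
dataset = raw_data.map(json.loads).cache()
# This prints a description of the RDD, not its contents (nothing is computed yet)
print(dataset)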

Since the entries are JSON records, you will need to parse them and use their respective object representations. You can use this mapping for all queries.
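To make the parsing step concrete, here is a minimal sketch of what `json.loads` does to a single line. The record below is made up for illustration, and the field names (`target`, `guess`, `choices`, `country`, `date`, `sample`) are an assumption; check them against the output of `dataset.take(1)` on your own data.

```python
import json

# Hypothetical line in the style of the dataset (values are invented)
line = (
    '{"target": "Turkish", "guess": "Maltese", '
    '"choices": ["Hindi", "Lao", "Maltese", "Turkish"], '
    '"country": "AU", "date": "2013-08-19", "sample": "abc123"}'
)

# json.loads turns the JSON text into a regular Python dict
record = json.loads(line)

target = record["target"]                  # the correct language
first_two = record["choices"][:2]          # first two offered choices
is_correct = record["guess"] == target     # whether the guess was right

print(target, first_two, is_correct)
```

After the mapping, every element of the RDD is a dict like `record`, so lambdas passed to `filter` or `map` can access fields by key.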

**For the quiz, fill in the results by running the queries on the ```confusion-part.json``` subset instead of the entire dataset.**

After that, you will be able to run the queries for this week's Moodle questions. Perform your queries on the ```dataset``` RDD. For example, the following command returns one element of the dataset:

In [None]:
dataset.take(1)

Good! Let's get to work. A few last things:
- Take into account that some of the queries might have very large outputs, which Jupyter (or sometimes even Spark) won't be able to handle. It is normal for the queries to take some time, but if the notebook crashes or stops responding, try restarting the kernel. Avoid printing large outputs. You can print the first few entries to confirm the query has worked, as shown in the query above.
- Refer to the [documentation](https://spark.apache.org/docs/latest/rdd-programming-guide.html) whenever in doubt.

And now to the actual queries: *Please make sure that in your queries you **only** use PySpark RDDs, and avoid any dataframes (they will be covered in next week's exercises).*

## Assignment 1
Count the games where the guess is incorrect but the correct language (target) appears within the first two choices.

## Assignment 2
Return the five most common confusion pairs (target, guess) where the guess was wrong, in descending order of frequency.

**Hint**: You may want to have a look at the documentation for [sortByKey](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.sortByKey.html).

## Assignment 3
Find the three countries with the highest accuracy (percentage of correct guesses), considering only countries with at least 100 games.

## Assignment 4
Return the date with the most games played and the corresponding count (dates are formatted as YYYY-MM-DD).

## Assignment 5
Return the number of distinct sample IDs for games played in 2013 (dates are formatted as YYYY-MM-DD).
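A short note on the YYYY-MM-DD format, assuming the dates are stored as plain strings: the year is the first four characters, and such strings compare in chronological order. The sketch below uses invented dates and plain Python; the same expressions can be used inside RDD lambdas.

```python
from datetime import date

# Invented example dates in YYYY-MM-DD format
dates = ["2013-08-19", "2014-01-02", "2013-12-31"]

# The year is the first four characters of the string
years = [d[:4] for d in dates]

# String comparison matches chronological order for this format
latest = max(dates)

# Alternatively, parse the string into a date object
parsed_year = date.fromisoformat(dates[0]).year

print(years, latest, parsed_year)
```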

## Assignment 6
Among languages with accuracy $\ge$ $0.9$, return the one with the largest number of games (along with its total count and accuracy).