# <center>Big Data &ndash; Moodle Exercises </center>
## <center>Fall 2025 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and Spark SQL, Moodle exercise</center>

# Preparation for the Moodle exercise in Spark

In this Jupyter notebook we preprocess the dataset used in this week's graded exercise. It is the same language game dataset as in the exercises. If you have not yet created the `confusion-part.json` file, follow the instructions below. If you only have it in the `exercise08` folder, please copy the `confusion-2014-03-02` folder into the `notebooks` folder of your exam magic box.

1. In the terminal, change to the `notebooks` folder of the magic box
2. Download the data <br>
   ```bash
   wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2
   # or
   curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2
   ```
3. Extract the data <br>
   ```bash
   tar -jxvf confusion-2014-03-02.tbz2
   ```
4. Change directory to the extracted folder <br>
   ```bash
   cd confusion-2014-03-02/
   ```
5. Extract the part of the dataset that we will work with in this exercise <br>
   ```bash
   head -n 3000000 confusion-2014-03-02.json > confusion-part.json
   ```
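
If `head` is not available in your environment, the truncation in step 5 can also be done in Python. The following is only a minimal sketch and not part of the official setup: the helper name `take_first_lines` is hypothetical, and it relies on the file being line-oriented, which is what the `head -n` command above relies on as well.

```python
from itertools import islice


def take_first_lines(src_path, dst_path, n):
    """Copy at most the first n lines of src_path into dst_path; return the count."""
    copied = 0
    # Binary mode copies the bytes unchanged, without decoding them.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in islice(src, n):
            dst.write(line)
            copied += 1
    return copied


# Hypothetical usage, run from inside the extracted folder:
# take_first_lines("confusion-2014-03-02.json", "confusion-part.json", 3_000_000)
```
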
## More Info about the data
You can find more information about the dataset (as well as the schema and examples) in the `README.md` file inside the data bundle.

## Instructions:

In every query we ask you for three quantities: 
- the query itself,
- the result of the query,
- the productivity time, i.e. the development time of each query (the time elapsed from when you start writing the query until the correct, final query is ready).

Note that the timing part of every question is optional and not graded. To make time recording easier, we provide two functions that do it automatically. Run the cell below to define them in the current notebook. Before each query there is a `start_exercise()` cell that you have to run to start the time recording. Once you have finished your query and are sure about the answer, run the `finish_exercise()` cell to get the time measurement.

In [None]:
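import time

# Timestamp of the most recent start_exercise() call (None until it is called).
last = None

def start_exercise():
    """Record the start time of the current exercise."""
    global last
    last = time.time()

def finish_exercise():
    """Print the whole seconds elapsed since the last start_exercise() call."""
    if last is None:
        print("Run start_exercise() first")
        return
    print("This exercise took {0}s".format(int(time.time() - last)))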
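
If you would like to keep all measured times in one place, for example to copy them into Moodle at the end, a small variation of the two functions can store one result per exercise. This is only an optional sketch: the class name `ExerciseTimer` and the label `"assignment-1"` are hypothetical, and the notebook itself uses only `start_exercise()` and `finish_exercise()`.

```python
import time


class ExerciseTimer:
    """Hypothetical helper that keeps one elapsed time per named exercise."""

    def __init__(self):
        self._started = {}  # exercise name -> start timestamp
        self.elapsed = {}   # exercise name -> whole seconds elapsed

    def start(self, name):
        self._started[name] = time.time()

    def finish(self, name):
        # Whole seconds, the same unit that finish_exercise() prints.
        self.elapsed[name] = int(time.time() - self._started[name])
        return self.elapsed[name]


timer = ExerciseTimer()
timer.start("assignment-1")
time.sleep(0.1)  # stand-in for the time spent developing the query
timer.finish("assignment-1")
print(timer.elapsed)
```

Because the timer object lives in the notebook's global scope, `start` and `finish` can be called from different cells, just like the two functions above.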

## <center>Spark Dataframes and SparkSQL</center>

Write queries for the same dataset as last week, but this time using Spark DataFrames or Spark SQL operations. Loading the data will take a couple of minutes.

### Data preprocessing

In [None]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local[*]').getOrCreate()
sc = spark.sparkContext

path = "confusion-2014-03-02/confusion-part.json"
dataset = spark.read.json(path).cache()

In [None]:
#test it out
dataset.limit(3).show()

Let's set up the SparkSQL magic extension.

In [None]:
# !pip install sparksql-magic --quiet
%load_ext sparksql_magic
dataset.createOrReplaceTempView("dataset")

In [None]:
%%sparksql
-- test it out
SELECT *
FROM dataset
LIMIT 3

## Assignment 1

Which language was most often wrongly guessed instead of German?

In [None]:
start_exercise()

In [None]:
finish_exercise()

## Assignment 2

Find the language with the third-highest guessed percentage among all countries whose country codes contain the letter ‘S’.

In [None]:
start_exercise()

In [None]:
finish_exercise()

## Assignment 3

In which month and year were the most games played?

In [None]:
start_exercise()

In [None]:
finish_exercise()

## Assignment 4

Focus on the first playing day of August 2013. Which language had the worst accuracy?

(In case of a tie, return the language with more games played.)


In [None]:
start_exercise()

In [None]:
finish_exercise()

## Assignment 5

Return the number of games in which the guessed language is correct and appears last in the choices list.

Hint: Check the docs of function [`element_at`](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/api/pyspark.sql.functions.element_at.html).

In [None]:
start_exercise()

In [None]:
finish_exercise()

## Assignment 6

Return the language choice that was offered in the most games. How many times did it appear?

In [None]:
start_exercise()

In [None]:
finish_exercise()