# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2024 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and Spark SQL, Moodle exercise</center>

# Preparation for the Moodle exercise in Spark

In this Jupyter notebook we preprocess the dataset used in this week's graded exercise. It is the same language game dataset as in the exercises. If you have not yet created the `confusion-part.json` file, follow the instructions below. If you only have it in the `exercise08` folder, copy the `confusion-2014-03-02` folder into the `notebooks` folder of your exam magic box.

1. Move to the `notebooks` folder in the terminal
2. Download the data: <br>
   ```wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2``` <br>
   __or__ <br>
   ```curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2```
3. Extract the data: <br>
   ```tar -jxvf confusion-2014-03-02.tbz2```
4. Change directory to ```confusion-2014-03-02```
5. Extract the part of the dataset that we will work with in this exercise: <br>
   ```head -n 3000000 confusion-2014-03-02.json > confusion-part.json```
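
Step 5 can also be done from Python, for example on a system without `head`. The following is a minimal sketch and not part of the original instructions: the helper name `take_first_lines` is hypothetical, and it assumes the file holds one record per line, which the `head -n` command above already relies on.

```python
from itertools import islice


def take_first_lines(src_path, dst_path, n):
    """Copy the first n lines of src_path to dst_path; return the number written."""
    written = 0
    # Binary mode copies the bytes unchanged, whatever the text encoding is.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in islice(src, n):
            dst.write(line)
            written += 1
    return written


# Usage, run from inside the confusion-2014-03-02 folder:
# take_first_lines("confusion-2014-03-02.json", "confusion-part.json", 3_000_000)
```

If the source file has fewer than `n` lines, the sketch simply copies all of them, as `head` would.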

## More Info about the data
You can find more information about the dataset (as well as the schema and examples) in the `README.md` file inside the data bundle.

## Instructions:

For every query we ask for three quantities: the query itself, its result, and the productivity time, i.e. the development time of the query (the time elapsed from when you start writing it until the correct, final query is ready). The time part of every question is optional and not graded. To make time recording easier, we provide two functions that do it automatically. Run the cell below to define them in the current notebook. Before each query there is a ```start_exercise()``` cell that you run to start the timer. Once you have finished your query and are sure about the answer, run the ```finish_exercise()``` cell to get the time measurement.

In [1]:
import time

def start_exercise():
    global last
    last = time.time()
    
def finish_exercise():
    global last
    print("This exercise took {0}s".format(int(time.time()-last)))

## <center>1. Spark Dataframes</center>

Write queries for the same dataset as last week, but this time using Spark DataFrame operations (loading the data will take a couple of minutes).

### 1.0. Data preprocessing

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

path = "confusion-2014-03-02/confusion-part.json"
dataset = spark.read.json(path).cache()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/14 18:45:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

In [3]:
#test it out
dataset.limit(3).show()

+--------------------+-------+----------+---------+--------------------+---------+
|             choices|country|      date|    guess|              sample|   target|
+--------------------+-------+----------+---------+--------------------+---------+
|[Maori, Mandarin,...|     AU|2013-08-19|Norwegian|48f9c924e0d98c959...|Norwegian|
|[Danish, Dinka, K...|     AU|2013-08-19|    Dinka|af5e8f27cef9e689a...|    Dinka|
|[German, Hungaria...|     AU|2013-08-19|  Turkish|509c36eb58dbce009...|   Samoan|
+--------------------+-------+----------+---------+--------------------+---------+



                                                                                

## Assignment 1
Find the number of games where one of the choices is Italian.

In [4]:
start_exercise()

In [5]:
from pyspark.sql.functions import array_contains

dataset.filter(array_contains("choices", "Italian")).count()

                                                                                

164496

In [6]:
finish_exercise()

This exercise took 5s


## Assignment 2
Return the number of (distinct) countries.

In [7]:
start_exercise()

In [8]:
dataset.select("country").distinct().count()

188

In [9]:
finish_exercise()

This exercise took 0s


## Assignment 3
Return the sample IDs (i.e., the "sample" field) of the top 2 games played in Switzerland (CH) where the guessed language is correct (equal to the target one) ordered by date (ascending), then by language (descending).

In [10]:
start_exercise()

In [11]:
# Filter and sort on the full rows first, then project the "sample" column
dataset.filter((dataset["guess"] == dataset["target"]) & (dataset["country"] == "CH"))\
.orderBy(dataset["date"].asc(), dataset["target"].desc()).select("sample").limit(2).collect()

[Row(sample='74b5340a230b1e0c1d45787bc4280b05'),
 Row(sample='b7df3f9d67cef259fbcaa5abcad9d774')]

In [12]:
finish_exercise()

This exercise took 0s
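
The ordering in this assignment mixes directions: date ascending, then target language descending. As a rough illustration of what that means, here is a plain-Python sketch on made-up rows (the sample IDs `s1` to `s4` and the helper name `top_samples` are invented and not from the dataset). It relies on Python's sort being stable: sort by the secondary key first, then by the primary key. Dates in `YYYY-MM-DD` form sort chronologically as plain strings.

```python
# Toy rows: (sample, date, target). Invented for illustration only.
rows = [
    ("s1", "2013-08-20", "French"),
    ("s2", "2013-08-19", "German"),
    ("s3", "2013-08-19", "Italian"),
    ("s4", "2013-08-20", "Spanish"),
]


def top_samples(rows, k):
    # Secondary key first: target language, descending.
    by_target = sorted(rows, key=lambda r: r[2], reverse=True)
    # Primary key second: date, ascending. The stable sort keeps the
    # descending target order among rows that share a date.
    ordered = sorted(by_target, key=lambda r: r[1])
    return [sample for sample, _, _ in ordered[:k]]


print(top_samples(rows, 2))  # prints ['s3', 's2']
```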


## Assignment 4
Aggregate all games by guess and target language, ignoring games with correct guesses. Count the number of guesses in each group and return the pair with the highest count, i.e. the most confused pair.

In [13]:
start_exercise()

In [14]:
from pyspark.sql.functions import desc

dataset.filter(dataset["guess"] != dataset["target"]).groupBy(["guess", "target"]).count()\
.orderBy(desc("count")).select("guess", "target", "count").limit(1).collect()

[Row(guess='Mandarin', target='Cantonese', count=822)]

In [15]:
finish_exercise()

This exercise took 0s


## Assignment 5
Find the language with the second-highest percentage of correct guesses in Switzerland.

In [16]:
start_exercise()

In [17]:
correct = dataset.filter(dataset["country"] == "CH").filter(dataset["guess"] == dataset["target"]) \
    .groupBy("target").count().withColumnRenamed("count", "correct_guesses")  
total = dataset.filter(dataset["country"] == "CH").groupBy("target").count().withColumnRenamed("count", "total_guesses")
joined = correct.join(total, on="target")
joined.select("target",(joined["correct_guesses"]/joined["total_guesses"]).alias("percentage")).orderBy(desc("percentage")).limit(3).show()

+-------+------------------+
| target|        percentage|
+-------+------------------+
| German|0.9824561403508771|
|Italian|0.9789227166276346|
| French|0.9697802197802198|
+-------+------------------+



In [18]:
finish_exercise()

This exercise took 0s
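
The query above computes, for each target language, the number of correct guesses divided by the total number of games. The same ratio can be read as the average of a 0/1 "correct" indicator per language, which needs only one pass and no join. A minimal plain-Python sketch of that idea on invented toy rows (the numbers are not from the dataset, and the function name `accuracy_by_target` is hypothetical):

```python
from collections import defaultdict

# Toy games as (guess, target) pairs. Invented for illustration only.
games = [
    ("German", "German"), ("German", "German"),
    ("French", "French"), ("Italian", "French"),
    ("Italian", "Italian"), ("Italian", "Italian"),
    ("Italian", "Italian"), ("French", "Italian"),
]


def accuracy_by_target(games):
    correct = defaultdict(int)
    total = defaultdict(int)
    for guess, target in games:
        total[target] += 1
        correct[target] += int(guess == target)
    # Share of correct guesses per target, sorted from highest to lowest.
    ratios = {t: correct[t] / total[t] for t in total}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


ranking = accuracy_by_target(games)
second_best = ranking[1][0]  # index 1 holds the second-highest ratio
print(second_best)  # prints Italian
```

One difference worth noting: a language that was never guessed correctly has no row in `correct`, so the inner join drops it, whereas the single-pass version keeps it with a ratio of 0.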


## Assignment 6
On which day was a single sample played the most times? Hint: you might need to import `max()`.

In [19]:
start_exercise()

In [20]:
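# Import under an alias so that Python's built-in max() is not shadowed
from pyspark.sql.functions import max as max_

# Number of games per (date, sample) pair
sample_counts = dataset.groupBy("date", "sample").count()
# Largest count; the explicit alias avoids relying on the generated name "max(count)"
max_count = sample_counts.agg(max_("count").alias("max_count")).first()["max_count"]
sample_counts.filter(sample_counts["count"] == max_count).select("date").collect()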

[Row(date='2013-09-04')]

In [21]:
finish_exercise()

This exercise took 0s
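
The approach above first counts games per (date, sample) pair, then keeps every pair whose count equals the maximum, so ties are all returned. A small plain-Python sketch of the same two steps on invented rows (the dates, sample IDs, and the helper name `busiest_days` are made up):

```python
from collections import Counter

# Toy games as (date, sample) pairs. Invented for illustration only.
plays = [
    ("2013-09-01", "a"), ("2013-09-01", "a"), ("2013-09-01", "b"),
    ("2013-09-02", "a"), ("2013-09-02", "c"), ("2013-09-02", "c"),
]


def busiest_days(plays):
    # Step 1: number of games per (date, sample) pair.
    counts = Counter(plays)
    # Step 2: the largest count, then every date that reaches it.
    top = counts.most_common(1)[0][1]
    return sorted({date for (date, _), n in counts.items() if n == top})


print(busiest_days(plays))  # prints ['2013-09-01', '2013-09-02']
```

In the toy data two pairs tie at two plays each, so both dates are returned.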


## <center>2. Spark SQL</center>

Now, for the second weekly quiz, write Spark SQL queries for the same questions as earlier.

### 2.0. Data preprocessing

In [22]:
!pip install sparksql-magic --quiet


In [23]:
%load_ext sparksql_magic

In [24]:
path = "confusion-2014-03-02/confusion-part.json"
dataset = spark.read.json(path).cache()
dataset.createOrReplaceTempView("dataset")

24/11/14 18:45:56 WARN CacheManager: Asked to cache already cached data.        


In [25]:
%%sparksql
-- test it out
SELECT *
FROM dataset
LIMIT 3

0,1,2,3,4,5
choices,country,date,guess,sample,target
"['Maori', 'Mandarin', 'Norwegian', 'Tongan']",AU,2013-08-19,Norwegian,48f9c924e0d98c959d8a6f1862b3ce9a,Norwegian
"['Danish', 'Dinka', 'Khmer', 'Lao']",AU,2013-08-19,Dinka,af5e8f27cef9e689a070b8814dcc02c3,Dinka
"['German', 'Hungarian', 'Samoan', 'Turkish']",AU,2013-08-19,Turkish,509c36eb58dbce009ccf93f375358d53,Samoan


## Assignment 1
Find the number of games where one of the choices is Italian.

In [26]:
start_exercise()

In [27]:
%%sparksql
SELECT count(*) 
FROM dataset
WHERE array_contains(choices, "Italian")

0
count(1)
164496


In [28]:
finish_exercise()

This exercise took 0s


## Assignment 2
Return the number of (distinct) countries.

In [29]:
start_exercise()

In [30]:
%%sparksql
SELECT count(distinct(country))
FROM dataset

0
count(DISTINCT country)
188


In [31]:
finish_exercise()

This exercise took 0s


## Assignment 3
Return the sample IDs (i.e., the "sample" field) of the top 2 games played in Switzerland (CH) where the guessed language is correct (equal to the target one) ordered by date (ascending), then by language (descending).

In [32]:
start_exercise()

In [33]:
%%sparksql
SELECT sample
FROM dataset
WHERE guess = target AND country = "CH"
ORDER BY date ASC, target DESC
LIMIT 2

0
sample
74b5340a230b1e0c1d45787bc4280b05
b7df3f9d67cef259fbcaa5abcad9d774


In [34]:
finish_exercise()

This exercise took 0s


## Assignment 4
Aggregate all games by guess and target language, ignoring games with correct guesses. Count the number of guesses in each group and return the pair with the highest count, i.e. the most confused pair.

In [35]:
start_exercise()

In [36]:
%%sparksql
SELECT guess, target, COUNT(*)
FROM dataset
WHERE guess != target
GROUP BY guess, target
ORDER BY COUNT(*) DESC
LIMIT 1

0,1,2
guess,target,count(1)
Mandarin,Cantonese,822


In [37]:
finish_exercise()

This exercise took 0s


## Assignment 5
Find the language with the second-highest percentage of correct guesses in Switzerland.

In [38]:
start_exercise()

In [39]:
%%sparksql
WITH correct AS (
    SELECT target, COUNT(*) AS correct_guesses
    FROM dataset
    WHERE guess = target AND country = "CH"
    GROUP BY target
    ),
total AS (
    SELECT target, COUNT(*) AS total_guesses
    FROM dataset
    WHERE country = "CH"
    GROUP BY target
    )
SELECT target, correct_guesses/total_guesses
FROM correct JOIN total USING(target)
ORDER BY correct_guesses/total_guesses DESC
LIMIT 3

0,1
target,(correct_guesses / total_guesses)
German,0.9824561403508771
Italian,0.9789227166276346
French,0.9697802197802198


In [40]:
finish_exercise()

This exercise took 0s


## Assignment 6
On which day was a single sample played the most times?

In [41]:
start_exercise()

In [42]:
%%sparksql
WITH sample_counts AS (
    SELECT date, sample, COUNT(*) AS count
    FROM dataset
    GROUP BY date, sample
),
max_count AS (
    SELECT MAX(count) AS max_count
    FROM sample_counts
)
SELECT date
FROM sample_counts
WHERE count = (SELECT max_count FROM max_count)


0
date
2013-09-04


In [43]:
finish_exercise()

This exercise took 0s
