# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2021 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and Spark SQL, Moodle exercise</center>

# Preparation for the Moodle exercise in Spark

In this Jupyter notebook we preprocess the dataset that is used in this week's graded exercise.
It is the same language game dataset as in exercise08.

1. Change to exercise09 repository

2. Start docker <br>
```docker-compose up -d```

3. Get the data by following the procedure described below. The dataset can be found here: http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2

More specifically, do the following:
- download the data:<br> ```wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2```
- extract the data:<br> ```tar -jxvf confusion-2014-03-02.tbz2```

4. Copy the data to Docker:<br> ```docker cp confusion-2014-03-02/confusion-2014-03-02.json jupyter:/home/jovyan/work``` <br>
(Copying the data to Docker only needs to be done once and might take 1-2 minutes.)
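
Before loading the file into Spark, it can help to look at a few raw records. The sketch below assumes the extracted file stores one JSON object per line, which is the layout `spark.read.json` expects by default. `peek_json_lines` is a hypothetical helper name, and the demo line reuses a few fields of the first row shown in the data preview of this notebook.

```python
import json
from itertools import islice

def peek_json_lines(lines, n=3):
    """Parse the first n non-empty lines of an iterable as JSON objects."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, n)]

# Demo on an in-memory line; for the real file, pass an open file handle instead:
#   with open("confusion-2014-03-02/confusion-2014-03-02.json") as f:
#       print(peek_json_lines(f))
demo_lines = [
    '{"target": "Norwegian", "guess": "Norwegian", "country": "AU", "date": "2013-08-19"}\n',
]
first = peek_json_lines(demo_lines)[0]
print(sorted(first))  # field names of the parsed record
```

Because the helper only consumes the first few lines, it stays fast even on a large file.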

## More Info about the data
You can find more information about the dataset (including the schema and examples) at this link: http://lars.yencken.org/datasets/languagegame/

## Instructions:

For every query we ask you for three quantities: the query itself, the result of the query, and the productivity time, i.e. the development time of the query (the time elapsed between the moment you start writing the query and the moment the correct, final query is ready). The time part of every question is optional and not graded.

To make time recording easier, we created two functions that do it automatically. Run the cell below to import them into the current notebook. Before each query there is a ```start_exercise()``` cell that you have to run to start the time recording. Once you have finished your query and are sure about the answer, run the ```finish_exercise()``` cell to get the time measurement.

In [32]:
import time

def start_exercise():
    global last
    last = time.time()
    
def finish_exercise():
    global last
    print("This exercise took {0}s".format(int(time.time()-last)))
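
If you also want to know how long a single cell runs, a context manager is one possible alternative. This is only a sketch: `timed_exercise` is a hypothetical name that is not part of the exercise material, and it measures the execution of one `with` block rather than the development time across several cells, so it complements the two functions above instead of replacing them.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_exercise(label="This exercise"):
    """Print the wall-clock seconds spent inside the `with` block."""
    start = time.time()
    try:
        yield
    finally:
        # Same message format as finish_exercise(), printed even if the block raises.
        print("{0} took {1}s".format(label, int(time.time() - start)))

# Usage: wrap the work to be timed (a trivial stand-in computation here).
with timed_exercise():
    total = sum(range(1000))
```

The `try`/`finally` makes sure the elapsed time is reported even when the query inside the block fails.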

## <center>1. Spark Dataframes</center>

Write queries for the same questions as last week, but this time using Spark DataFrame operations (loading the data will take a minute).

### 1.0. Data preprocessing

In [33]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

path = "confusion-2014-03-02.json"
dataset = spark.read.json(path).cache()

In [34]:
#test it out
dataset.limit(3).show()

+--------------------+-------+----------+---------+--------------------+---------+
|             choices|country|      date|    guess|              sample|   target|
+--------------------+-------+----------+---------+--------------------+---------+
|[Maori, Mandarin,...|     AU|2013-08-19|Norwegian|48f9c924e0d98c959...|Norwegian|
|[Danish, Dinka, K...|     AU|2013-08-19|    Dinka|af5e8f27cef9e689a...|    Dinka|
|[German, Hungaria...|     AU|2013-08-19|  Turkish|509c36eb58dbce009...|   Samoan|
+--------------------+-------+----------+---------+--------------------+---------+



In [5]:
dataset.printSchema()

root
 |-- choices: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- country: string (nullable = true)
 |-- date: string (nullable = true)
 |-- guess: string (nullable = true)
 |-- sample: string (nullable = true)
 |-- target: string (nullable = true)



## Assignment 1
Find the number of games where the guessed language is correct (meaning equal to the target one) and that language is Russian.

In [5]:
start_exercise()

In [15]:
dataset.filter((dataset['guess']==dataset['target']) & (dataset['target'] == 'Russian')).count()

290818

In [16]:
finish_exercise()

This exercise took 449s


## Assignment 2
Return the number of distinct "target" languages.

In [17]:
start_exercise()

In [18]:
dataset.select('target').distinct().count()

78

In [19]:
finish_exercise()

This exercise took 47s


## Assignment 3
Return the sample IDs (i.e., the *sample* field) of the top three games where the guessed language is correct (equal to the target one) ordered by language (ascending), then by country (ascending), then by date (ascending).

In [20]:
start_exercise()

In [35]:
from pyspark.sql.functions import asc

# Keep correct guesses, sort by language, country and date, then take three sample IDs.
(dataset
    .filter(dataset['guess'] == dataset['target'])
    .orderBy(asc('guess'), asc('country'), asc('date'))
    .select('sample')
    .limit(3)
    .collect())

[Row(sample='00b85faa8b878a14f8781be334deb137'),
 Row(sample='efcd813daec1c836d9f030b30caa07ce'),
 Row(sample='efcd813daec1c836d9f030b30caa07ce')]

In [22]:
finish_exercise()

This exercise took 318s
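
To check the ordering logic independently of Spark, the same steps can be sketched in plain Python on a few rows. The rows below are hand-made and hypothetical (they are not taken from the real dataset). The point is only that a tuple key compares left to right, which mirrors ordering by language, then country, then date.

```python
from operator import itemgetter

# Hand-made toy rows (hypothetical, not taken from the real dataset).
games = [
    {"sample": "s1", "guess": "Dutch", "target": "Dutch", "country": "US", "date": "2013-09-01"},
    {"sample": "s2", "guess": "Czech", "target": "Czech", "country": "NZ", "date": "2013-08-20"},
    {"sample": "s3", "guess": "Czech", "target": "Czech", "country": "AU", "date": "2013-08-25"},
    {"sample": "s4", "guess": "Thai", "target": "Czech", "country": "AU", "date": "2013-08-19"},
    {"sample": "s5", "guess": "Czech", "target": "Czech", "country": "AU", "date": "2013-08-21"},
]

# Step 1: keep only the games that were guessed correctly.
correct = [g for g in games if g["guess"] == g["target"]]
# Step 2: sort by (language, country, date); tuples compare element by element.
ordered = sorted(correct, key=itemgetter("target", "country", "date"))
# Step 3: project the sample IDs of the first three games.
top3_samples = [g["sample"] for g in ordered[:3]]
print(top3_samples)
```

The wrongly guessed game `s4` has the earliest date but is filtered out before sorting, so it cannot appear in the result.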


## Assignment 4
Aggregate all games by country and target language, count the number of guesses in each group, and return the frequencies of the three most frequent country/language combinations.

In [24]:
start_exercise()

In [28]:
from pyspark.sql.functions import desc

# Count games per (country, target) pair and show the three largest groups.
(dataset
    .groupBy('country', 'target')
    .count()
    .orderBy(desc('count'))
    .show(3))

+-------+-------+------+
|country| target| count|
+-------+-------+------+
|     US| French|112934|
|     US| German|112007|
|     US|Spanish|110919|
+-------+-------+------+
only showing top 3 rows



In [29]:
finish_exercise()

This exercise took 290s
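
The group-count-sort pattern used here can be illustrated in plain Python with `collections.Counter`. This is a sketch on hypothetical hand-made pairs, not on the real data, and it only shows what the aggregation computes.

```python
from collections import Counter

# One (country, target) pair per game; hypothetical toy data.
pairs = (
    [("US", "French")] * 4
    + [("US", "German")] * 3
    + [("AU", "French")] * 2
    + [("GB", "Thai")]
)

# Counter groups equal pairs and counts them; most_common(3) sorts by
# descending count and keeps the first three groups.
top3 = Counter(pairs).most_common(3)
for (country, target), n in top3:
    print(country, target, n)
```

The grouping key is the whole pair, so `("US", "French")` and `("AU", "French")` are counted separately, just as with `groupBy('country', 'target')`.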


## Assignment 5
Find the fraction of games where the answer was correct and the correct guess was the first choice in the array of possible answers.

Please round the fraction to 4 decimals (e.g. 0.3323).

In [90]:
start_exercise()

In [29]:
dataset.filter((dataset['target']==dataset['guess']) & (dataset['guess']==dataset['choices'][0])).count() / dataset.count()
# By XYT
# round(dataset.filter((dataset["guess"]==dataset["target"]) & (dataset["guess"]==dataset["choices"].getItem(0))).count() / dataset.count(),4)

0.25603983084476356

In [30]:
finish_exercise()

This exercise took 1492s
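
The quantity asked for is a ratio of two counts, rounded to 4 decimals. The following plain-Python sketch spells this out on three hypothetical games. Only the structure of the computation is meant to carry over.

```python
# Hypothetical toy games, one per case.
games = [
    {"choices": ["Thai", "Lao"], "guess": "Thai", "target": "Thai"},  # correct and first choice
    {"choices": ["Lao", "Thai"], "guess": "Thai", "target": "Thai"},  # correct, but not first
    {"choices": ["Thai", "Lao"], "guess": "Thai", "target": "Lao"},   # first choice, but wrong
]

# Numerator: games that satisfy both conditions at once.
hits = sum(
    1 for g in games
    if g["guess"] == g["target"] and g["guess"] == g["choices"][0]
)
# Denominator: all games, not only the correctly guessed ones.
fraction = round(hits / len(games), 4)
print(fraction)
```

Note that the denominator is the total number of games. Dividing by the number of correct guesses instead would answer a different question.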


## Assignment 6
Sort the languages by decreasing overall percentage of correct guesses and return the first three languages.

In [17]:
start_exercise()

In [28]:
from pyspark.sql.functions import count, desc

# Total number of games per target language.
guess_count = dataset.groupBy('target').agg(count('*').alias('n'))
# Number of correctly guessed games per target language.
correct_guess_count = (dataset
    .filter(dataset.guess == dataset.target)
    .groupBy('target')
    .agg(count('*').alias('c')))
# Join both counts and rank the languages by their fraction of correct guesses.
(guess_count
    .join(correct_guess_count, guess_count.target == correct_guess_count.target)
    .select(guess_count.target, (correct_guess_count.c / guess_count.n).alias('p'))
    .orderBy(desc('p'))
    .show(3))
# By xyt
# from pyspark.sql.functions import count, col
# dataset.filter(dataset['target']==dataset['guess']).groupBy(dataset['target']).agg(count('target').alias('cnt_correct')).join(dataset.groupBy(dataset['target']).agg(count('target').alias('cnt_overall')),on='target',how='inner').orderBy(desc(col('cnt_correct')/col('cnt_overall'))).select('target').limit(3).show()


+-------+------------------+
| target|                 p|
+-------+------------------+
| French|0.9382414927447232|
| German|0.9197634593055483|
|Spanish|0.8956432115670598|
+-------+------------------+
only showing top 3 rows



In [27]:
finish_exercise()

This exercise took 1162s
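
The per-language fraction of correct guesses can also be computed in a single pass, without joining two aggregates. The sketch below shows the idea in plain Python on hypothetical `(target, guess)` pairs. It is an illustration of the arithmetic, not the reference solution.

```python
from collections import defaultdict

# (target, guess) per game; hypothetical toy data.
games = [
    ("French", "French"), ("French", "French"), ("French", "German"),
    ("Thai", "Thai"), ("Thai", "Lao"),
    ("Dinka", "Dinka"),
    ("Lao", "Thai"),
]

totals = defaultdict(int)   # games per target language
correct = defaultdict(int)  # correct guesses per target language
for target, guess in games:
    totals[target] += 1
    correct[target] += int(guess == target)

# Fraction of correct guesses per language, then the three best languages.
accuracy = {t: correct[t] / totals[t] for t in totals}
ranking = sorted(accuracy, key=accuracy.get, reverse=True)[:3]
print(ranking)
```

One difference from the join-based approach is worth noting: a language with no correct guess at all (`Lao` in the toy data) still gets a fraction of 0.0 here, whereas an inner join of the two aggregates would drop it. For a top-three ranking this does not change the result.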


## Assignment 7
Return the number of games played on the latest day.

In [30]:
start_exercise()

In [34]:
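from pyspark.sql.functions import desc

# Find the latest date, then count the games played on that date.
last_date = dataset.select('date').orderBy(desc('date')).limit(1).collect()[0][0]
dataset.filter(dataset['date'] == last_date).count()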

65653

In [35]:
finish_exercise()

This exercise took 264s


## <center>2. Spark SQL</center>

Write Spark SQL queries for the same questions as earlier.

### 2.0. Data preprocessing

In [None]:
!pip install sparksql-magic

In [8]:
%load_ext sparksql_magic

In [9]:
path = "confusion-2014-03-02.json"
dataset = spark.read.json(path).cache()
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it.
dataset.createOrReplaceTempView("dataset")

In [10]:
%%sparksql
-- test it out
SELECT *
FROM dataset
LIMIT 3

0,1,2,3,4,5
choices,country,date,guess,sample,target
"['Maori', 'Mandarin', 'Norwegian', 'Tongan']",AU,2013-08-19,Norwegian,48f9c924e0d98c959d8a6f1862b3ce9a,Norwegian
"['Danish', 'Dinka', 'Khmer', 'Lao']",AU,2013-08-19,Dinka,af5e8f27cef9e689a070b8814dcc02c3,Dinka
"['German', 'Hungarian', 'Samoan', 'Turkish']",AU,2013-08-19,Turkish,509c36eb58dbce009ccf93f375358d53,Samoan


## Assignment 1
Find the number of games where the guessed language is correct (meaning equal to the target one) and that language is Russian.

In [None]:
start_exercise()

In [42]:
%%sparksql
SELECT COUNT(*)
FROM dataset
WHERE guess == target AND target == 'Russian'

0
count(1)
290818


In [43]:
finish_exercise()

This exercise took 542s


## Assignment 2
Return the number of distinct "target" languages.

In [44]:
start_exercise()

In [49]:
%%sparksql
SELECT COUNT(*)
FROM (
    SELECT DISTINCT target
    FROM dataset
)

0
count(1)
78


In [50]:
finish_exercise()

This exercise took 160s


## Assignment 3
Return the sample IDs (i.e., the *sample* field) of the top three games where the guessed language is correct (equal to the target one) ordered by language (ascending), then by country (ascending), then by date (ascending).

In [52]:
start_exercise()

In [36]:
%%sparksql
SELECT sample
FROM dataset
WHERE guess==target
ORDER BY target ASC, country ASC, date ASC
LIMIT 3

0
sample
00b85faa8b878a14f8781be334deb137
efcd813daec1c836d9f030b30caa07ce
efcd813daec1c836d9f030b30caa07ce


In [54]:
finish_exercise()

This exercise took 65s


## Assignment 4
Aggregate all games by country and target language, count the number of guesses in each group, and return the frequencies of the three most frequent country/language combinations.

In [62]:
start_exercise()

In [65]:
%%sparksql
SELECT country, target, COUNT(*) as cnt
FROM dataset
GROUP BY country, target
ORDER BY cnt DESC
LIMIT 3

0,1,2
country,target,cnt
US,French,112934
US,German,112007
US,Spanish,110919


In [64]:
finish_exercise()

This exercise took 7s


## Assignment 5
Find the fraction of games where the answer was correct and the correct guess was the first choice in the array of possible answers.

Please round the fraction to 4 decimals (e.g. 0.3323).

In [6]:
start_exercise()

In [13]:
%%sparksql
SELECT round(COUNT(*) / (SELECT COUNT(*) FROM dataset), 4)
FROM dataset
WHERE target==guess AND guess==choices[0]

-- By XYT
-- SELECT round((SELECT COUNT(*) FROM dataset WHERE target=guess AND guess=choices[0]) / COUNT(*) ,4) FROM dataset

0
"round((CAST(count(1) AS DOUBLE) / CAST(scalarsubquery() AS DOUBLE)), 4)"
0.256


In [12]:
finish_exercise()

This exercise took 284s


## Assignment 6
Sort the languages by decreasing overall percentage of correct guesses and return the first three languages.

In [14]:
start_exercise()

In [31]:
%%sparksql
WITH guess_count AS (
    SELECT target, COUNT(*) AS num_guess
    FROM dataset
    GROUP BY target
),
correct_guess_count AS (
    SELECT target, COUNT(*) AS num_correct_guess
    FROM dataset
    WHERE target==guess
    GROUP BY target
)
SELECT target, num_correct_guess / num_guess as percentage
FROM guess_count NATURAL JOIN correct_guess_count
ORDER BY percentage DESC
LIMIT 3

-- By XYT
/*
SELECT overall.target
FROM (
    SELECT target, COUNT(*) AS cnt_correct
    FROM dataset
    WHERE target=guess
    GROUP BY target) AS correct 
    JOIN (
    SELECT target, COUNT(*) AS cnt_overall
    FROM dataset
    GROUP BY target) AS overall 
    ON correct.target = overall.target
ORDER BY cnt_correct/cnt_overall DESC
LIMIT 3
*/

0,1
target,percentage
French,0.9382414927447232
German,0.9197634593055483
Spanish,0.8956432115670598


In [16]:
finish_exercise()

This exercise took 222s


## Assignment 7
Return the number of games played on the latest day.

In [None]:
start_exercise()

In [66]:
%%sparksql
SELECT COUNT(*)
FROM dataset JOIN (
    SELECT max(date) AS ld
    FROM dataset
) AS last_date ON dataset.date== last_date.ld

0
count(1)
65653


In [67]:
finish_exercise()

This exercise took 300s
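
As a possible alternative to the join, the latest day can be selected with a scalar subquery in the `WHERE` clause. The sketch below runs this variant on SQLite through Python's built-in `sqlite3` module, purely as a stand-in so that it works without a Spark session. The table contents are hypothetical, and whether you prefer this form in Spark SQL is a matter of taste. Since `date` is a string column in `yyyy-mm-dd` form, `MAX(date)` compares strings, and for this format the string order matches the chronological order.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (sample TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [  # hypothetical rows
        ("s1", "2013-08-19"),
        ("s2", "2014-03-01"),
        ("s3", "2014-03-02"),
        ("s4", "2014-03-02"),
    ],
)

# The scalar subquery returns the latest date; the outer query counts its games.
(latest_count,) = conn.execute(
    "SELECT COUNT(*) FROM dataset "
    "WHERE date = (SELECT MAX(date) FROM dataset)"
).fetchone()
conn.close()
print(latest_count)
```

The join in the query above and this subquery form express the same condition: keep only the rows whose date equals the maximum date.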
