# <center>Big Data for Engineers – Exercises</center>
## <center>Spring 2022 – Week 8 – ETH Zurich</center>
## <center>Notebook for the Spark Moodle Quiz - Solution</center>

## Start docker

In your exercise 08 directory, start Docker:

```
docker compose up
```

After Docker finishes downloading the images, you should be able to open the Jupyter notebook by copying the following URL into your browser:

```
http://127.0.0.1:8888/lab
```

In [2]:
import time
from pyspark.context import SparkContext
# sc is the Spark Context object 
sc = SparkContext('local', 'test')

### The Moodle Quiz

For this quiz we will be using the [language confusion dataset](https://quietlyamused.org/blog/2014/03/12/language-confusion/).

As mentioned in the exercise, this quiz is a part of the small project you will be doing over the following 3 weeks to compare Spark, Spark with DataFrames/SQL, and JSONiq. You will hear more about it in the coming weeks.

For the Moodle quiz we ask you to submit the following things:
- The query you wrote (although it is not graded, having your query is helpful for arguing about points)
- Something related to its output (**the only part that is graded**)
- The time it took you to write it (thinking time)
- The time it took you to run it (execution time)

You don't need to submit this notebook, only the queries you wrote.
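
Execution time can be measured the same way for every query. The helper below is a minimal sketch and not part of the exercise material: `timed` is a hypothetical name, and the summation passed to it only stands in for a Spark query. Because Spark transformations are lazy, whatever is timed should end in an action such as `count()` or `collect()`; otherwise only the construction of the query plan is measured.

```python
import time

def timed(query):
    """Run a zero-argument callable and return (result, elapsed seconds).

    `timed` is a hypothetical helper, not part of the exercise material.
    """
    start = time.perf_counter()
    result = query()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; with Spark this could be e.g. lambda: some_rdd.count()
result, elapsed = timed(lambda: sum(range(1000)))
print('Result {}, time consumption {} sec'.format(result, elapsed))
```

Wrapping the query in a callable keeps the start and stop calls in one place, so every query is timed in the same way.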

On your own laptop, download and decompress the dataset into the ex08 folder using the commands below. You can also copy the URL into your browser to download the archive, then decompress it using the default decompression tools of Windows/Mac. Alternatively, you can run the commands in the Jupyter notebook, but decompression takes several minutes inside the Docker container.

```bash
wget https://cloud.inf.ethz.ch/s/a8FoHew6dHKGYKK/download/confusion20140302.tbz2; tar -jxvf confusion20140302.tbz2
```

In [3]:
data = sc.textFile('./confusion-2014-03-02/confusion-2014-03-02.json')

Since the entries are JSON records, you will need to parse them and use their respective object representations. You can use this mapping for all queries. Since some of the queries take a long time to execute on the dataset, you may want to answer these queries on the first `100000` entries. 

**For the quiz, fill in the results by running the queries on the 100000-entry subset (`test_entries` as defined in the following cell) instead of the entire dataset.**
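
To make the parsing step concrete, here is a small sketch of what `json.loads` does to a single line of the file. The line is written out by hand in the dataset's format, so this is an illustration and does not read the actual file.

```python
import json

# One line in the dataset's format (written out by hand for illustration).
line = ('{"guess": "Norwegian", "target": "Norwegian", "country": "AU", '
        '"choices": ["Maori", "Mandarin", "Norwegian", "Tongan"], '
        '"sample": "48f9c924e0d98c959d8a6f1862b3ce9a", "date": "2013-08-19"}')

# json.loads turns the JSON text into a Python dict with string keys.
entry = json.loads(line)

print(type(entry).__name__)
print(entry["guess"], entry["choices"][1])
```

After parsing, fields are accessed with ordinary dictionary and list indexing, which is what the lambdas in the queries rely on.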

In [4]:
import json

testset = sc.parallelize(data.take(100000))
test_entries = testset.map(json.loads)

print(test_entries.first())

{'guess': 'Norwegian', 'target': 'Norwegian', 'country': 'AU', 'choices': ['Maori', 'Mandarin', 'Norwegian', 'Tongan'], 'sample': '48f9c924e0d98c959d8a6f1862b3ce9a', 'date': '2013-08-19'}


Good! Let's get to work. A few last things:
- Take into account that some of the queries might have very large outputs, which Jupyter (or sometimes even Spark) won't be able to handle. It is normal for the queries to take some time, but if the notebook crashes or stops responding, try restarting the kernel. Avoid printing large outputs. You can print the first few entries to confirm the query has worked, as shown in query 1.
- Remember to delete the cluster if you want to stop working! You can recreate it using the same container name and your resources will still be there.
- Refer to the [documentation](http://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.RDD), as well as the programming guides on actions and transformations linked to above.

And now to the actual queries: *Please make sure that your queries use **only** the PySpark RDD API and avoid any DataFrames (they will be covered in next week's exercises).*

1\. Find all games such that the guessed language is correct (=target), and such that this language is Spanish. What is the length of the resulting sequence?

In [5]:
start = time.time()
# Query:
matching = test_entries.filter(lambda e: e["target"] == "Spanish").filter(lambda e: e["target"] == e["guess"]).collect()
end = time.time()
# Only print the first few entries
print(json.dumps(matching[:3], indent=4))
# print number of matching 
print('Number of matching games ', len(matching))
print('Time consumption {} sec'.format(end - start))

[
    {
        "guess": "Spanish",
        "target": "Spanish",
        "country": "AU",
        "choices": [
            "Amharic",
            "Czech",
            "Sinhalese",
            "Spanish"
        ],
        "sample": "dc3ace49393de518e87d4f8d3ae8d9db",
        "date": "2013-08-19"
    },
    {
        "guess": "Spanish",
        "target": "Spanish",
        "country": "AU",
        "choices": [
            "Albanian",
            "Dinka",
            "Spanish",
            "Thai"
        ],
        "sample": "c8f4f097079404bf9a0e94d604efd1d5",
        "date": "2013-08-19"
    },
    {
        "guess": "Spanish",
        "target": "Spanish",
        "country": "AU",
        "choices": [
            "Maori",
            "Dari",
            "Korean",
            "Spanish"
        ],
        "sample": "c8f4f097079404bf9a0e94d604efd1d5",
        "date": "2013-08-19"
    }
]
Number of matching games  2094
Time consumption 0.6488256454467773 sec


2\. Find all distinct values of the *guessed* languages (i.e. the *guess* field). What is the length of the resulting sequence?

In [20]:
start = time.time()
# Query:
count = test_entries.map(lambda e: e["guess"]).distinct().count()
end = time.time()
print('Time consumption {} sec'.format(end - start))
print('Number of distinct guesses ', count)

Time consumption 0.5597050189971924 sec
Number of distinct guesses  68


3\. Return the first three games where the guessed language is incorrect ($\ne$ target), ordered by country (ascending), then target language (ascending), then date (ascending). What is the sample id of the 3rd item in the list?

Enter it without quotes, for example 48f9c924e0d98c959d8a6f1862b3ce9a

In [23]:
start = time.time()
# Query:
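# Keep only the games where the guess was wrong
incorrect = test_entries.filter(lambda e: e["target"] != e["guess"])
# Tuples compare element by element: country first, then target, then date
first_three = incorrect.sortBy(lambda e: (e["country"], e["target"], e["date"])).take(3)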
end = time.time()
print('Time consumption {} sec'.format(end - start))
print(json.dumps(first_three, indent=4))
print('Sample id of the third item', first_three[-1]["sample"])

Time consumption 0.5318145751953125 sec
[
    {
        "guess": "Tagalog",
        "target": "Maori",
        "country": "A1",
        "choices": [
            "Greek",
            "Maori",
            "Tagalog"
        ],
        "sample": "bed42a067e94b5bdcf9a9190dc7acae3",
        "date": "2013-09-03"
    },
    {
        "guess": "Kannada",
        "target": "Somali",
        "country": "A1",
        "choices": [
            "Kannada",
            "Somali"
        ],
        "sample": "2591d3700224a4729cbe46e53fdbd707",
        "date": "2013-09-03"
    },
    {
        "guess": "Czech",
        "target": "Albanian",
        "country": "AE",
        "choices": [
            "Albanian",
            "Czech",
            "Somali"
        ],
        "sample": "00b85faa8b878a14f8781be334deb137",
        "date": "2013-09-03"
    }
]
Sample id of the third item 00b85faa8b878a14f8781be334deb137
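
The ordering in query 3 relies on how Python compares tuples: element by element, moving to the next element only on a tie. The sketch below shows this in plain Python on a few made-up records (the values and one-letter sample ids are hypothetical), using `sorted` in place of `sortBy`.

```python
# Hypothetical records; only the fields used for ordering are included.
games = [
    {"country": "CH", "target": "Thai", "date": "2013-09-01", "sample": "c"},
    {"country": "AU", "target": "Thai", "date": "2013-09-02", "sample": "b"},
    {"country": "AU", "target": "Dari", "date": "2013-09-03", "sample": "a"},
    {"country": "AU", "target": "Thai", "date": "2013-08-30", "sample": "d"},
]

# Same key function as in the query: country, then target, then date.
ordered = sorted(games, key=lambda e: (e["country"], e["target"], e["date"]))

print([e["sample"] for e in ordered])
```

The two `AU`/`Thai` games are separated only by their date, which shows that the third tuple element is consulted only when the first two are equal. When only the first few items are needed, the RDD method `takeOrdered(3, key=...)` is a possible alternative to sorting the whole RDD.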


4\. Aggregate all games by guessed and target language, counting the number of guessing games that were done for each pair (guess, target). How many times has Dutch been mistaken for Norwegian (i.e. Dutch was the true answer)?

In [30]:
start = time.time()
# Query:
counts = test_entries.map(lambda e: (e["guess"], e["target"])).countByValue()
end = time.time()
print('Time consumption {} sec'.format(end - start))
# Print the first few items 
for k in list(counts)[:3]:
    print((k, counts[k]))
    
print('number of games matching (Norwegian, Dutch) ', counts[('Norwegian', 'Dutch')])

Time consumption 0.41964077949523926 sec
(('Norwegian', 'Norwegian'), 1403)
(('Dinka', 'Dinka'), 268)
(('Turkish', 'Samoan'), 17)
number of games matching (Norwegian, Dutch)  19


In [31]:
# Alternative solution
start = time.time()
# Query:
counts = test_entries.map(lambda e: ((e["guess"], e["target"]), 1)).reduceByKey(lambda x, y: x+y).collect()
end = time.time()
print('Time consumption {} sec'.format(end - start))
# Print the first few items 
for k in counts[:3]:
    print((k[0], k[1]))
    
print('number of games matching (Norwegian, Dutch) ', dict(counts)[('Norwegian', 'Dutch')])

Time consumption 0.681814432144165 sec
(('Norwegian', 'Norwegian'), 1403)
(('Dinka', 'Dinka'), 268)
(('Turkish', 'Samoan'), 17)
number of games matching (Norwegian, Dutch)  19
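
Both solutions to query 4 compute the same table of counts. As a plain-Python sketch of why, the snippet below counts a handful of made-up `(guess, target)` pairs in two ways: `Counter` plays the role of `countByValue()`, and the explicit accumulation mirrors mapping each pair to `(pair, 1)` and summing per key as `reduceByKey` does.

```python
from collections import Counter

# Hypothetical (guess, target) pairs.
pairs = [
    ("Norwegian", "Dutch"),
    ("Dutch", "Dutch"),
    ("Norwegian", "Dutch"),
    ("Dutch", "Norwegian"),
]

# Counterpart of countByValue(): count how often each value occurs.
by_value = Counter(pairs)

# Counterpart of map to (pair, 1) followed by reduceByKey(lambda x, y: x + y).
by_key = {}
for key, one in ((p, 1) for p in pairs):
    by_key[key] = by_key.get(key, 0) + one

print(by_value[("Norwegian", "Dutch")], by_key[("Norwegian", "Dutch")])
```

In Spark the practical difference is where the result lives: `countByValue()` is an action that returns a dictionary to the driver, whereas `reduceByKey` yields another RDD that stays distributed until `collect()` is called.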


5\. Among all the games where the guess was correct (=target), in what fraction of cases was the second choice (in the array of possible answers) the target?

Please write the fraction rounded to 4 decimals (e.g. 0.3323)

In [38]:
start = time.time()
# Query:
correct = test_entries.filter(lambda e: e["target"] == e["guess"])
percent = correct.filter(lambda e: e["target"] == e["choices"][1]).count() / correct.count()
end = time.time()
print('Time consumption {} sec'.format(end - start))
print('percentage ', round(percent, 4))

Time consumption 0.7118470668792725 sec
percentage  0.3789


6\. For each target language, compute the percentage of successful guess games (i.e. *guess* == *target*) relative to all games for that target language, and display the pairs `(target_language, percentage)` in descending order of the percentage. What is the third language in this list? 

In [39]:
start = time.time()
# Query:
correct = test_entries.map(lambda e: (e["target"], e["target"] == e["guess"])).groupByKey()
outcomes = correct.mapValues(lambda e: float(list(e).count(True))/len(e)).sortBy(lambda e: e[1], ascending=False).collect()
end = time.time()
print('Time consumption {} sec'.format(end - start))
# Print the first few values
print(outcomes[:10])

print('The third language on the list ', outcomes[2][0])

Time consumption 0.7850742340087891 sec
[('French', 0.9597374179431072), ('German', 0.9455670103092784), ('Mandarin', 0.927967985771454), ('Spanish', 0.9277802392556491), ('Cantonese', 0.9201053555750659), ('Italian', 0.9194690265486726), ('Japanese', 0.9111008751727314), ('Korean', 0.899624765478424), ('Russian', 0.8850305021116847), ('Vietnamese', 0.8693492300049677)]
The third language on the list  Mandarin
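
The success rate per target language can also be computed without collecting all outcomes of a group first. The sketch below is plain Python on made-up data (not the solution above): it keeps a running `(hits, total)` pair per target and divides at the end. In Spark, the same shape could be expressed by mapping each game to `(target, (hit, 1))` and combining the pairs with `reduceByKey`, which avoids materializing every group as `groupByKey` does.

```python
# Hypothetical (target, guess_was_correct) pairs.
games = [
    ("French", True), ("French", True), ("Dinka", False),
    ("Dinka", True), ("French", False), ("Thai", False),
]

# Running (hits, total) per target language.
totals = {}
for target, ok in games:
    hits, n = totals.get(target, (0, 0))
    totals[target] = (hits + int(ok), n + 1)

# Success rate per target, in descending order of the rate.
rates = sorted(
    ((target, hits / n) for target, (hits, n) in totals.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(rates)
```

Note that the third entry of such a list sits at index 2, since Python lists are zero-based.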


7\. How many games in France (country=FR) were played on the last day? 

In [40]:
start = time.time()
# Query 
games_in_france = test_entries.filter(lambda e: e["country"] == "FR")
latest = games_in_france.map(lambda e: e["date"]).sortBy(lambda e: e, ascending=False).first()
count = games_in_france.filter(lambda e: e["date"]==latest).count()
end = time.time()
print('Time consumption {} sec'.format(end - start))
print('latest date ', latest)
print('games played on last date ', count)

Time consumption 1.0298762321472168 sec
latest date  2013-09-03
games played on last date  494
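
Query 7 sorts the dates as strings, which works because dates in `YYYY-MM-DD` form order lexicographically in the same way as chronologically. The sketch below checks this on a few made-up dates in plain Python and cross-checks the result against real `date` objects.

```python
from datetime import date

# Hypothetical dates in the dataset's YYYY-MM-DD format.
dates = ["2013-08-19", "2013-09-03", "2013-08-31", "2013-09-03"]

# String maximum; no sorting is needed to find the latest date.
latest = max(dates)

# Cross-check against proper date objects.
latest_as_date = max(date.fromisoformat(d) for d in dates)

# Number of entries that fall on the latest date.
games_on_latest = dates.count(latest)

print(latest, latest_as_date.isoformat(), games_on_latest)
```

Since only the maximum is needed, the RDD method `max()` is a possible alternative to sorting in descending order and taking the first element.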
