# <center>Big Data &ndash; Exercises </center>
## <center>Fall 2024 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and SparkSQL</center>

## Preparation for the exercise in Spark

1. Drag this notebook into the `notebooks` folder of your exam magic box

2. Start docker with ```docker-compose up -d```

3. Launch Jupyter from the docker container

## <center>1. Spark Dataframes</center>

Spark Dataframes allow the user to perform simple and efficient operations on data, as long as the data is structured and has a schema. Dataframes are similar to relational tables in relational databases but they allow for nestedness: conceptually a dataframe is a specialization of a Spark RDD with schema information attached. You can find more information in Karau, H. et al. (2015). Learning Spark, Chapter 9 (optional reading).

### 1.1. Data preprocessing

In [None]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

path = "orders.jsonl"
orders_df = spark.read.json(path).cache()

The type of our dataset object is DataFrame

In [None]:
type(orders_df)

Print the schema

In [None]:
orders_df.printSchema()

Print one row

In [None]:
orders_df.limit(1).collect()

You can access the underlying RDD object and use any functions you learned for Spark RDDs.

In [None]:
orders_df.rdd.filter(lambda ordr: ordr.customer.last_name == "Landry").count()

Check the Spark version

In [None]:
print(spark.version)

### 1.2. Dataframe Operations
We will perform some queries using operations on Dataframes. [Here](https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations) is a guide on DF operations, with a link to the [API Documentation](https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html).

We can select columns and show the result

In [None]:
orders_df.select("customer.first_name", "customer.last_name").limit(5).show()

As you can see we can navigate inside nested items with the dot notation

In [None]:
orders_df.filter(orders_df["customer.last_name"] == "Landry").count()

How about nested arrays?

In [None]:
orders_df.select("order_id", "items").orderBy("order_id").limit(5).show()

Let us try to find orders of a fan.

In [None]:
orders_df.filter(orders_df["items.product"] == "fan").count()

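The above code doesn't work: ```items.product``` is an array of product names (one per item in the order), so it cannot be compared directly to a single string. We can use ```array_contains()``` instead, which we have to import first.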

In [None]:
from pyspark.sql.functions import array_contains

orders_df.filter(array_contains("items.product", "fan")).count()

Let us try to unnest the data.

We can unnest the products with explode.

Explode will generate as many rows as there are elements in the array and match them to other attributes. You should name the newly generated exploded column in order to be able to refer to it. 

Here you can find the syntax.

In [None]:
from pyspark.sql.functions import explode

orders_df.select(explode("items").alias("i"), "order_id").select("i.product", "order_id").orderBy("order_id").limit(5).show()
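To make the row-multiplying behaviour of explode concrete, here is a minimal plain-Python sketch (no Spark involved) of what happens to each order. The three sample orders and the helper name `explode_items` are invented for illustration; only the field names `order_id`, `items` and `product` come from the dataset above.

```python
# Hypothetical sample orders shaped like rows of orders_df (values are invented).
orders = [
    {"order_id": 1, "items": [{"product": "fan"}, {"product": "lamp"}]},
    {"order_id": 2, "items": [{"product": "fan"}]},
    {"order_id": 3, "items": []},
]

def explode_items(rows):
    """Emit one output row per array element, repeating the other attributes."""
    return [
        {"order_id": row["order_id"], "i": item}
        for row in rows
        for item in row["items"]
    ]

exploded = explode_items(orders)
for row in exploded:
    print(row["order_id"], row["i"]["product"])
```

Note that the order with an empty `items` array contributes no rows at all. This mirrors how `explode` behaves in Spark; `explode_outer` is the variant that keeps such rows with a null element.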

Now we can use this table to filter.

In [None]:
exploded_df = orders_df.select(explode("items").alias("i"), "order_id").select("i.product", "order_id")
exploded_df.filter(exploded_df["product"] == "fan").count()

You might have tried to access the ```i.product``` column directly using a ```.filter``` right after the ```.select```. That, however, does not work, because the column is not available to ```orders_df``` when creating a clause like ```(orders_df["i.product"] == "fan")```. To filter on a freshly exploded column, it is best to proceed in steps and create an intermediate table.

A possible workaround when using Dataframe operations is to pass a string clause to ```.filter```, so that the product column is resolved after it has been added by the ```.select```.

In [None]:
orders_df.select(explode("items").alias("i"), "order_id").select("i.product", "order_id").filter("product = 'fan'").count()

Project the nested columns

In [None]:
orders_df.select(explode("items").alias("i"), "*").select(
    "order_id", "customer.*", "date", "i.*").limit(3).show()

Just like in SQL, ```*``` is a shortcut for selecting all of the fields involved.

### 1.3 Exercises

#### 1. Find the number of distinct products

#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by quantity. 
(Hint: you need to import the function ```desc``` from ```pyspark.sql.functions``` to define descending order)

#### 3. Find the most expensive order

## <center>2. Spark SQL</center>

Spark SQL enables users to write queries using an SQL-like dialect, but it requires DataFrames, since they closely resemble relational tables. In addition to providing a familiar interface, SQL queries can deliver better performance compared to RDDs, leveraging the efficiency of DataFrame operations and Spark's automatic query optimization.

First we need to install and load the sparksql magic command

In [None]:
!pip install sparksql-magic --quiet

In [None]:
%load_ext sparksql_magic

In order to use sql we need to create a temporary table.

This table only exists for the current session.

In [None]:
orders_df.createOrReplaceTempView("orders")

### 2.1 Queries

Finally, we can run SQL queries on the registered tables. We will run the same queries as in the previous section, but with SQL.

As you can see we can navigate to the nested items with the dot.

In [None]:
%%sparksql
SELECT count(*)
FROM orders
WHERE orders.customer.last_name = "Landry"

How about nested arrays?

In [None]:
%%sparksql
SELECT order_id, items
FROM orders AS o
ORDER BY order_id
LIMIT 5

Let us try to find orders of a fan.

In [None]:
%%sparksql 
SELECT count(*)
FROM orders
WHERE items.product = "fan"

The above code doesn't work! Once again, we need to use ```array_contains``` instead.

In [None]:
%%sparksql
SELECT count(*)
FROM orders
WHERE array_contains(items.product, "fan")

Let us try to unnest the data.

We can unnest the products with explode.

Explode will generate as many rows as there are elements in the array and match them to the other attributes.

In [None]:
%%sparksql
SELECT i.product, order_id
FROM (
    SELECT explode(items) AS i, order_id
    FROM orders
) exploded_items
ORDER BY order_id
LIMIT 5

Now we can use this table to filter. For example, we want to find out how many times "fan" appears.

In [None]:
%%sparksql
SELECT count(*)
FROM (
    SELECT explode(items) AS i
    FROM orders
    )
WHERE i.product = "fan"

You might have tried to access the i.product column directly in the same ```SELECT``` clause. That, however, just like before, does not work. This is because the column is not available to the ```WHERE``` clause right away. In order to access the built columns directly, we need to unnest the data and make it part of our ```FROM``` clause. ```LATERAL VIEW``` lets us do just that, matching each non-array attribute to an unnested row from the array.  

In [None]:
%%sparksql
SELECT *
FROM orders lateral view explode(items) AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

With this we can also project the nested columns

In [None]:
%%sparksql
SELECT order_id, customer.first_name, customer.last_name, date, flat_items.*
FROM orders lateral view explode(items) AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

Having built an unnested table, we can now easily aggregate over the previously nested columns

### 2.2 Exercises

#### 1. Find the number of distinct products

In [None]:
%%sparksql


#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by quantity. 

In [None]:
%%sparksql


#### 3. Find the most expensive order

In [None]:
%%sparksql


## <center>3. More queries</center>

We will now explore the dataset that is going to be used in the graded exercise of this week. It is the same language game dataset as in exercise08. If you already have it from last week, you just have to copy the `confusion-2014-03-02` folder into the `notebooks` folder of your exam magic box; otherwise, here are the instructions again.

1. Move to the `notebooks` folder in the terminal
2. Download the data: <br>
   ```wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2``` <br>
   __or__ <br>
   ```curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2```
3. Extract the data: <br>
   ```tar -jxvf confusion-2014-03-02.tbz2```
4. Change directory to ```confusion-2014-03-02```
5. Extract the part of the dataset that we will work with in this exercise: <br>
   ```head -n 3000000 confusion-2014-03-02.json > confusion-part.json```

### 3.1 Data processing

In [None]:
path = "confusion-2014-03-02/confusion-part.json"
dataset = spark.read.json(path).cache()

Have a look at the data

In [None]:
dataset.limit(3).show()

Print the schema

In [None]:
dataset.printSchema()

### 3.2 Spark Dataframe queries

#### 1. Find the number of games where the guessed language and the target language are both Maltese.

#### 2. Return the number of distinct "target" languages.

#### 3. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending). 
(Hint: passing `truncate=False` to `show()` allows you to see the full output, otherwise you can simply use `collect()` instead) 

#### 4. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

#### 5. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages. 
(Hint: `withColumnRenamed()` allows you to set the names of the generated columns and remember it is possible to `join()` Dataframes)

### 3.3 Spark SQL queries
We will now go over the same queries, but using Spark SQL instead.

In [None]:
dataset.createOrReplaceTempView("dataset")

In [None]:
%%sparksql
SELECT *
FROM dataset
LIMIT 3

#### 1. Find the number of games where the guessed language and the target language are both Maltese.

In [None]:
%%sparksql


#### 2. Return the number of distinct "target" languages.

In [None]:
%%sparksql


#### 3. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending).

In [None]:
%%sparksql


#### 4. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

In [None]:
%%sparksql


#### 5. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages.

In [None]:
%%sparksql


## <center>4. Optional Exercise: PageRank</center>

The PageRank algorithm, named after Google's Larry Page, assigns a measure of importance to each node (page) in a graph based on the importance of incoming edges (links). The importance of each edge is, in turn, derived from the importance of the source node and its out-degree. PageRank was designed to rank web pages based on hyperlinks between pages, but it can also be used to rank scientific articles or influential users in a social network.

The algorithm maintains two datasets: one collection of (*pageID*, *linkList*) elements containing the list of neighbors of each page, and one collection of (*pageID*, *rank*) elements containing the current rank for each page. The algorithm proceeds as follows:
1. Initialize each page's rank to $1.0$.
2. On each iteration, have page $x$ send a contribution of $\frac{rank(x)}{numNeighbors(x)}$ to its neighbors (the pages it has links to).
3. Set each page's rank to $0.15 + 0.85 \times contributionsReceived$.

The algorithm runs multiple iterations (of steps 2 and 3) until it converges.

Implement the PageRank algorithm in Spark for a simple dataset, running the loop for a fixed number of iterations.

For instance, you can use "parallelize" for that as follows: 
```
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
```
where 1, 2, 3 and 4 are the IDs of the pages, and each pair (*a*, *b*) is a link from page *a* to page *b*.
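Before moving to Spark, it can help to check the three steps of the algorithm by hand on this small graph. The following is a minimal plain-Python sketch, not a Spark solution; the function name `pagerank` and the decision to keep pages that receive no contributions (giving them rank $0.15$) are assumptions made for this illustration.

```python
from collections import defaultdict

# Edge list from the example: (source page, linked page).
edges = [(1, 2), (1, 4), (2, 1), (2, 3), (3, 2)]

def pagerank(edges, iterations):
    # Build the (pageID, linkList) collection: the pages each page links to.
    neighbors = defaultdict(list)
    pages = set()
    for src, dst in edges:
        neighbors[src].append(dst)
        pages.update((src, dst))

    # Step 1: every page starts with rank 1.0.
    ranks = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Step 2: each page splits its current rank evenly among its neighbors.
        received = defaultdict(float)
        for page, links in neighbors.items():
            for dst in links:
                received[dst] += ranks[page] / len(links)
        # Step 3: damped update (assumption: pages with no contributions get 0.15).
        ranks = {page: 0.15 + 0.85 * received[page] for page in pages}
    return ranks

ranks_after_one = pagerank(edges, 1)
print(sorted(ranks_after_one.items()))
```

After one iteration, page 2 receives $0.5 + 1.0 = 1.5$ and ends up with rank $0.15 + 0.85 \times 1.5 = 1.425$, while the other pages each receive $0.5$ and end up with $0.575$. Page 4 has no outgoing links, so it sends no contributions. In a Spark implementation you will have to decide how to treat pages that send or receive nothing, since a join or aggregation may silently drop them.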

### 4.1 Use Spark RDDs

In [None]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)]).groupByKey().cache()

#Your code here

ranks.collect()

### 4.2 Use Spark DataFrames

In [None]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id").cache()

In [None]:
#Your code here

ranks.show()

### 4.3 Use Spark SQL
Hint: you can use
```
new_df = spark.sql("... SQL query ...")
new_df.createOrReplaceTempView("new_table")
```
to perform a query inside a for loop and make the updated *new_table* available from SQL at every step.

In [None]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id").cache()

links_df.createOrReplaceTempView("links")

In [None]:
#Your code here

ranks.show()