# <center>Big Data &ndash; Exercises </center>
## <center>Fall 2025 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and SparkSQL</center>

## Preparation for the exercise in Spark

1. Drag this notebook into the `notebooks` folder of your exam magic box

2. Start Docker with ```docker-compose up -d```

3. Launch Jupyter from the Docker container

## <center>1. Spark Dataframes</center>

Spark Dataframes allow the user to perform simple and efficient operations on data, as long as the data is structured and has a schema. Dataframes are similar to tables in relational databases, but they also allow nested data. Conceptually, a dataframe is a specialization of a Spark RDD with schema information attached. You can find more information in Karau, H. et al. (2015), Learning Spark, Chapter 9 [link](https://learning.oreilly.com/library/view/learning-spark/9781449359034/?ar) (optional reading).

### 1.1. Data preprocessing

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
print("Spark Version", spark.version)

sc = spark.sparkContext

path = "orders.jsonl"
orders_df = spark.read.json(path).cache()

The type of our dataset object is DataFrame and we can check its schema with the `printSchema()` method.

In [None]:
print(type(orders_df))
orders_df.printSchema()

Print two sample rows (`vertical` and `truncate` are optional parameters).

In [None]:
orders_df.limit(2).show(vertical=True, truncate=False)

You can access the underlying RDD object and use any functions you learned for Spark RDDs.

In [None]:
orders_df.rdd.filter(lambda ordr: ordr.customer.last_name == "Landry").count()

### 1.2. Dataframe Operations
We will perform some queries using operations on Dataframes. [Here](https://spark.apache.org/docs/4.0.0/sql-getting-started.html) is a quick guide on DF operations, with links to the [Functions API Documentation](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/functions.html) and the [DataFrame API Documentation](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/dataframe.html).

We can select columns and show the result

In [None]:
orders_df.select("customer.first_name", "customer.last_name").limit(5).show()

As you can see, we can navigate inside nested items with the dot notation. We can also filter data based on column values and combine conditions with binary operators. Keep in mind that we use `|` instead of `or` and `&` instead of `and`.
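One detail that is easy to miss: in Python, `&` and `|` bind more tightly than comparison operators such as `==`, so each comparison in a filter has to be wrapped in its own parentheses. The following is a minimal plain-Python sketch of that precedence rule; the integers are only stand-ins for column values and no Spark is involved.

```python
# Plain integers stand in for column values (illustrative only).
a, b = 1, 2

# `&` binds more tightly than `==`, so without parentheses this parses as
# the chained comparison  a == (1 & b) == 2,  which is False here.
unparenthesized = a == 1 & b == 2

# Parenthesizing each comparison gives the intended "both conditions hold".
parenthesized = (a == 1) & (b == 2)

print(unparenthesized, parenthesized)
```

With Column objects the same parsing applies, which is why the filter in the next cell writes `(...) & (...)`.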

In [None]:
orders_df.filter((orders_df["customer.last_name"] == "Landry") & (orders_df["customer.first_name"] == "John")).count()

How about nested arrays?

In [None]:
orders_df.select("order_id", "items").orderBy("order_id").limit(5).show()

If we try to find all the orders that include a fan this way:

In [None]:
orders_df.filter(orders_df["items.product"] == "fan").count()

The above code doesn't work! The reason is that the left side of the `==` operator is an array of strings, while the right side is a single string. We need to check for inclusion instead. Luckily, Spark provides a function for that: ```array_contains()```, which we have to import from the ```pyspark.sql.functions``` module.
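The difference between equality and inclusion can be sketched in plain Python. The product list below is hypothetical and not taken from `orders.jsonl`; note also that plain Python quietly evaluates the mismatched comparison to `False`, whereas the Spark query above fails.

```python
# Hypothetical product list for one order (not taken from orders.jsonl).
products = ["fan", "lamp", "fan"]

# Comparing the whole list to a string asks "is the list equal to 'fan'?"
equals_fan = products == "fan"

# Membership asks "does the list contain 'fan'?", which is what we want.
contains_fan = "fan" in products

print(equals_fan, contains_fan)
```

`array_contains` plays the role of the `in` check here, applied to the array column of every row.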

In [None]:
from pyspark.sql.functions import array_contains

orders_df.filter(array_contains("items.product", "fan")).count()

Let us try to unnest the data. We can do so by using the ```explode``` function.

Explode will generate as many rows as there are elements in the array and match them to other attributes. You should name the newly generated exploded column in order to be able to refer to it.
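As a rough mental model, the effect of `explode` can be sketched in plain Python as a nested loop over orders and their items. The sample orders below are hypothetical and only mirror the `order_id` and `items` fields used in this notebook.

```python
# Hypothetical orders mirroring the notebook's schema (order_id, items[]).
orders = [
    {"order_id": 1, "items": [{"product": "fan", "quantity": 2},
                              {"product": "lamp", "quantity": 1}]},
    {"order_id": 2, "items": [{"product": "desk", "quantity": 1}]},
    {"order_id": 3, "items": []},
]

# One output row per array element; the scalar attribute is repeated.
exploded = [
    {"i": item, "order_id": order["order_id"]}
    for order in orders
    for item in order["items"]
]

for row in exploded:
    print(row["order_id"], row["i"]["product"], row["i"]["quantity"])
```

In this sketch the order with an empty `items` list produces no rows at all, which matches how `explode` treats empty arrays in Spark (`explode_outer` keeps such rows).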

In [None]:
from pyspark.sql.functions import explode

(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "i.quantity", "order_id")
    .orderBy("order_id", "product").limit(5).show()
)

Now we can use this table to further filter for the orders that include a fan. You might want to access the ```i.product``` column directly inside a ```.filter```, like so:

In [None]:
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "order_id")
    .filter(orders_df["i.product"] == "fan")
    .distinct()
    .count()
)

That, however, does not work, because the column ```i.product``` is not available on `orders_df`.

(Note the usage of `.distinct()`. Why would we want to use it? Hint: check out order with id=2)

In order to filter on a newly added column we have a few different options.

1. The most verbose version is to use an intermediate table:

In [None]:
exploded_df = (orders_df
                .select(explode("items").alias("i"), "order_id")
                .select("i.product", "order_id"))

(exploded_df
    .filter(exploded_df["product"] == "fan")
    .select('order_id')
    .distinct()
    .count()
)

2. We can use a SQL expression inside ```.filter```. This is done by providing one string argument to the ```.filter``` method, and the product column will be resolved inside of it:

In [None]:
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "order_id")
    .filter("i.product == 'fan'")
    .distinct()
    .count()
)

3. We can also use a helper function ```col``` from ```pyspark.sql.functions``` to create a [Column](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/api/pyspark.sql.Column.html#pyspark.sql.Column) expression on the fly:

In [None]:
from pyspark.sql.functions import col
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "order_id")
    .filter(col("i.product") == "fan")
    .distinct()
    .count()
)

We can also project the nested columns. Just like in SQL, ```*``` is a shortcut for selecting all of the fields.

In [None]:
(orders_df
    .select(explode("items").alias("i"), "*")
    .select("order_id", "customer.*", "date", "i.*")
    .limit(3).show())

Sorting can be done by using the ```orderBy``` method. There are several ways of specifying the sort order.

1. By column name(s). Note that a single boolean `ascending` applies the same sort order, either ascending or descending, to all of the columns; `ascending` also accepts a list of booleans, one per column.

In [None]:
orders_df.orderBy("customer.last_name", "customer.first_name", ascending=True).limit(2).show()

2. By using Column objects and their ```asc``` and ```desc``` methods:

In [None]:
(orders_df
    .orderBy(
        orders_df["customer.last_name"].asc(),
        orders_df["customer.first_name"].desc(),
        orders_df["order_id"].asc())
    .limit(2).show()
)

3. By using the `asc` and `desc` functions from `pyspark.sql.functions`:

In [None]:
from pyspark.sql.functions import asc, desc
(orders_df
    .orderBy(
        asc("customer.last_name"),
        desc("customer.first_name"),
        asc("order_id"))
    .limit(2).show()
)

Note: Only the last ```orderBy``` call determines the final sort order. Chaining multiple calls does not combine them, as you might have expected. You can verify this using an `.explain()` call.

In [None]:
(orders_df
    .orderBy("customer.last_name", ascending=True).orderBy("customer.first_name", ascending=False)
    .explain()
)
(orders_df
    .orderBy(orders_df["customer.last_name"].asc(), orders_df["customer.first_name"].desc())
    .explain()
)

### 1.3 Exercises

#### 1. Find the number of distinct products

#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by quantity.

#### 3. Find the most expensive order

## <center>2. Spark SQL</center>

Spark SQL enables users to write queries using an SQL-like dialect, but it requires DataFrames, since they closely resemble relational tables. In addition to providing a familiar interface, SQL queries can deliver better performance compared to RDDs, leveraging the efficiency of DataFrame operations and Spark's automatic query optimization.

The sparksql-magic should come preinstalled in the exam magic box. We just need to load it.

In [None]:
# !pip install sparksql-magic --quiet
%load_ext sparksql_magic

In order to use SQL, we need to register the DataFrame as a temporary view.

This view only exists for the current session.

In [None]:
orders_df.createOrReplaceTempView("orders")

### 2.1 Queries

Finally, run SQL queries on the registered view. We will run the same queries as in the previous section, but with SQL.

As you can see we can navigate to the nested items with the dot.

In [None]:
%%sparksql
SELECT count(*)
FROM orders
WHERE orders.customer.last_name = "Landry"

How about nested arrays?

In [None]:
%%sparksql
SELECT order_id, items
FROM orders AS o
ORDER BY order_id
LIMIT 5

Let us try to find orders of a fan.

In [None]:
%%sparksql 
SELECT count(*)
FROM orders
WHERE items.product = "fan"

The above code doesn't work! Once again, we need to use ```array_contains``` instead.

In [None]:
%%sparksql
SELECT count(*)
FROM orders
WHERE array_contains(items.product, "fan")

Let us try to unnest the data. We can do so by using the ```explode``` function.

Explode will generate as many rows as there are elements in the array and match them to other attributes.

In [None]:
%%sparksql
SELECT i.product, order_id
FROM (
    SELECT explode(items) AS i, order_id
    FROM orders
) exploded_items
ORDER BY order_id
LIMIT 5

Now we can use this table to filter. For example, we want to find out how many times "fan" appears.

In [None]:
%%sparksql
SELECT count(*)
FROM (
    SELECT explode(items) AS i
    FROM orders
)
WHERE i.product = "fan"

You might have tried to filter on the `i.product` column directly in the same ```SELECT``` clause:

In [None]:
%%sparksql
SELECT explode(items) AS i, i.product 
FROM orders
WHERE i.product = "fan"

That, however, just like before, does not work. This is because the column alias is not yet available to the ```WHERE``` clause of the same query. In order to access the generated columns directly, we need to unnest the data and make it part of our ```FROM``` clause. ```LATERAL VIEW``` lets us do just that, matching each non-array attribute to an unnested row from the array.

In [None]:
%%sparksql
SELECT *
FROM orders LATERAL VIEW explode(items) AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

With this we can also project the nested columns.

In [None]:
%%sparksql
SELECT order_id, customer.first_name, customer.last_name, date, flat_items.*
FROM orders LATERAL VIEW explode(items) AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

Having built an unnested table, we can now easily aggregate over the previously nested columns.

### 2.2 Exercises

#### 1. Find the number of distinct products

In [None]:
%%sparksql

#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by quantity. 

In [None]:
%%sparksql

#### 3. Find the most expensive order

In [None]:
%%sparksql
SELECT order_id, SUM(flat_items.quantity * flat_items.price) AS total
FROM orders lateral view explode(items) AS flat_items
GROUP BY order_id
ORDER BY total DESC
LIMIT 3

## <center>3. More queries</center>

We will now explore the dataset that is going to be used in the graded exercise of this week. It is the same language game dataset as in exercise08. If you already have it from last week, you just have to copy the `confusion-2014-03-02` folder into the `notebooks` folder of your exam magic box and run the commands from step 4 onwards to create a reduced version of it.

If you don't have the dataset at all, also perform the first 3 steps.

1. Move to the `notebooks` folder of the magicbox in the terminal
2. Download the data <br>
   ```bash
   wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2
   # or
   curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2
   ```
3. Extract the data <br>
   ```bash
   tar -jxvf confusion-2014-03-02.tbz2
   ```
4. Change directory to extracted folder <br> 
   ```bash
   cd confusion-2014-03-02/
   ```
5. Extract the part of the dataset that we will work with in this exercise <br>
   ```bash
   head -n 3000000 confusion-2014-03-02.json > confusion-part.json
   ```

## More Info about the data
You can find more information about the dataset (as well as the schema and examples) in the `README.md` file inside the data bundle.

### 3.1 Data processing

In [None]:
path = "confusion-2014-03-02/confusion-part.json"
dataset = spark.read.json(path).cache()

Have a look at the data

In [None]:
dataset.limit(3).show()

Print the schema

In [None]:
dataset.printSchema()

### 3.2 Spark Dataframe queries

#### 1. Find the number of games where the guessed language and the target language are both Maltese.

#### 2. Return the number of distinct "target" languages.

#### 3. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending). 
(Hint: passing `truncate=False` to `show()` allows you to see the full output, otherwise you can simply use `collect()` instead) 

#### 4. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

#### 5. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages.
(Hint: `withColumnRenamed()` allows you to set the names of the generated columns and remember it is possible to `join()` Dataframes)

### 3.3 Spark SQL queries
We will now go over the same queries, but using Spark SQL instead.

In [None]:
dataset.createOrReplaceTempView("dataset")

In [None]:
%%sparksql
SELECT *
FROM dataset
LIMIT 3

#### 1. Find the number of games where the guessed language and the target language are both Maltese.

In [None]:
%%sparksql

#### 2. Return the number of distinct "target" languages.

In [None]:
%%sparksql


#### 3. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending).

In [None]:
%%sparksql


#### 4. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

In [None]:
%%sparksql


#### 5. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages.

In [None]:
%%sparksql


## <center>4. Optional Exercise: PageRank</center>

The PageRank algorithm, named after Google's Larry Page, assigns a measure of importance to each node (page) in a graph based on the importance of its incoming edges (links). The importance of each edge is, in turn, derived from the importance of the source node and its out-degree. PageRank was designed to rank web pages based on the hyperlinks between them, but it can also be used to rank scientific articles or influential users in a social network.

The algorithm maintains two datasets: one collection of (*pageID*, *linkList*) elements containing the list of neighbors of each page, and one collection of (*pageID*, *rank*) elements containing the current rank for each page. The algorithm proceeds as follows:
1. Initialize each page's rank to $1.0$.
2. On each iteration, have page $x$ send a contribution of $\frac{rank(x)}{numNeighbors(x)}$ to its neighbors (the pages it has links to).
3. Set each page's rank to $0.15 + 0.85 \times contributionsReceived$.

The algorithm runs multiple iterations (of steps 2 and 3) until it converges.

Implement the PageRank algorithm in Spark for a simple dataset, running the loop for a fixed number of iterations.

For instance, you can create such a dataset with `parallelize` as follows:
```
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
```
where 1, 2, 3 and 4 are the ids of the pages and each pair (*a*, *b*) is a link from page *a* to page *b*.
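To have something to check your Spark results against, the three steps above can be sketched in plain Python. This is only a reference under stated assumptions, not a Spark solution: the function name `pagerank` and its parameters are hypothetical, every id that appears in an edge is treated as a page, and a page that receives no contributions is given the base rank $0.15$.

```python
from collections import defaultdict

def pagerank(edges, num_iterations=10, damping=0.85):
    """Plain-Python reference for the three steps above (not a Spark solution)."""
    # Build (pageID -> linkList), the first of the two collections.
    neighbors = defaultdict(list)
    for src, dst in edges:
        neighbors[src].append(dst)

    # Assumption: every id that appears as a source or a target is a page.
    pages = {p for edge in edges for p in edge}

    # Step 1: initialize each page's rank to 1.0.
    ranks = {page: 1.0 for page in pages}

    for _ in range(num_iterations):
        # Step 2: each page sends rank / numNeighbors to every neighbor.
        received = defaultdict(float)
        for page, links in neighbors.items():
            share = ranks[page] / len(links)
            for target in links:
                received[target] += share

        # Step 3: new rank = 0.15 + 0.85 * contributionsReceived.
        # Assumption: a page that receives nothing gets the base rank 0.15.
        ranks = {page: (1 - damping) + damping * received[page] for page in pages}

    return ranks

edges = [(1, 2), (1, 4), (2, 1), (2, 3), (3, 2)]
ranks_after_one = pagerank(edges, num_iterations=1)
for page in sorted(ranks_after_one):
    print(page, round(ranks_after_one[page], 3))
```

After a single iteration on this dataset, page 2 has rank 1.425 and pages 1, 3 and 4 each have rank 0.575. A Spark implementation that only tracks pages receiving contributions may treat pages without incoming links differently, so compare results with that assumption in mind.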

### 4.1 Use Spark RDDs

In [None]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)]).groupByKey().cache()

### 4.2 Use Spark DataFrames

In [None]:
from pyspark.sql.functions import collect_list
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id")
links_df = links_df.groupBy("page_id").agg(collect_list("linked_id").alias("neighbors")).cache()
links_df.show()

In [None]:
# Your code

### 4.3 Use Spark SQL
Hint: you can use
```
new_df = spark.sql("... SQL query ...")
new_df.createOrReplaceTempView("new_table")
```
to run a query inside a for loop and make the updated *new_table* available from SQL at every step.

In [None]:
from pyspark.sql.functions import collect_list

links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id")
links_df = links_df.groupBy("page_id").agg(collect_list("linked_id").alias("neighbors")).cache()
links_df.createOrReplaceTempView("links")
links_df.show()