# <center>Big Data for Engineers &ndash; Exercises &ndash; Solution</center>
## <center>Spring 2022 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and SparkSQL</center>

# Preparation for the exercise in Spark

1. Change to `exercise09` repository

2. Start docker <br>
```docker-compose up -d``` <br>
(This process can take up to 10 minutes.)

3. After docker finishes downloading the images, you should be able to open the Jupyter notebook by copying the following URL to your browser: <br>
```http://127.0.0.1:8888/``` 

4. Copy the data to docker <br> ```docker cp orders.jsonl jupyter:/home/jovyan/work``` <br>
(Copying the data to docker needs to be done only once and might take 1-2 minutes.)

## <center>1. Spark Dataframes</center>

Spark Dataframes allow the user to perform simple and efficient operations on data, as long as the data is structured and has a schema. Dataframes are similar to relational tables in relational databases: conceptually a dataframe is a specialization of a Spark RDD with schema information attached. You can find more information in Karau, H. et al. (2015). Learning Spark, Chapter 9 (optional reading).

Throughout the exercise, you can see the equivalency of Spark RDD, Spark Dataframes and SparkSQL. 

### 1.1. Data preprocessing

In [1]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

path = "orders.jsonl"
orders_df = spark.read.json(path).cache()

The type of our dataset object is DataFrame.

In [2]:
type(orders_df)

pyspark.sql.dataframe.DataFrame

Print the schema.

In [3]:
orders_df.printSchema()

root
 |-- customer: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- product: string (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- order_id: long (nullable = true)



Print one row.

In [4]:
orders_df.limit(1).collect()

[Row(customer=Row(first_name='Preston', last_name='Landry'), date='2018-2-4', items=[Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)], order_id=0)]

You can access the underlying RDD object and use any functions you learned for Spark RDDs.

In [5]:
orders_df.rdd.filter(lambda ordr: ordr.customer.last_name == "Landry").count()

1960

### 1.2. Dataframe Operations
We perform some queries using operations on Dataframes ([Here](https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations) is a guide on DF Operations with a link to the [API Documentation](https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html))

We can select columns and show the results.

In [7]:
orders_df.select("customer.first_name", "customer.last_name").limit(5).show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|   Preston|   Landry|
|    Jamari|Dominguez|
|   Brendon|  Sicilia|
|    Armani|   Ardeni|
|    Jamari|     Miao|
+----------+---------+



As you can see, nested fields can be reached with dot notation. The same notation works inside a filter.

In [8]:
orders_df.filter(orders_df["customer.last_name"] == "Landry").count()

1960

How about nested arrays?

In [9]:
orders_df.select("order_id", "items").orderBy("order_id").limit(5).show()

+--------+--------------------+
|order_id|               items|
+--------+--------------------+
|       0|[{1.53, fan, 5}, ...|
|       1|[{1.61, fan, 7}, ...|
|       2|[{1.41, the book,...|
|       3|[{1.05, computer ...|
|       4|[{1.92, headphone...|
+--------+--------------------+



Let us try to find orders of a fan.

In [None]:
orders_df.filter(orders_df["items.product"] == "fan").count()

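The above code doesn't work: `items.product` is an array of strings (one entry per item in the order), so it cannot be compared to the single string `"fan"`. Use [```array_contains```](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.array_contains.html) instead.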

In [11]:
from pyspark.sql.functions import array_contains

orders_df.filter(array_contains("items.product", "fan")).count()

32778

<b>Let us try to unnest the data.</b>

Unnest the products with [`explode`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.explode.html).

`explode` will generate as many rows as there are elements in the array and match them to other attributes via projection.

In [12]:
from pyspark.sql.functions import explode

orders_df.select(explode("items").alias("i"), "i.product", "order_id").orderBy("order_id").limit(5).show()

+--------------------+---------------+--------+
|                   i|        product|order_id|
+--------------------+---------------+--------+
|      {1.53, fan, 5}|            fan|       0|
|{1.33, computer s...|computer screen|       0|
|   {1.06, kettle, 6}|         kettle|       0|
|{1.96, stuffed an...| stuffed animal|       0|
| {1.09, the book, 7}|       the book|       0|
+--------------------+---------------+--------+
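
The row multiplication performed by `explode` can be sketched in plain Python. This is only an illustration of the semantics on hypothetical toy data, not how Spark implements it; the helper name `explode_rows` is an assumption made for this sketch.

```python
def explode_rows(rows, array_field, alias):
    """Yield one output row per array element, copying the other fields."""
    for row in rows:
        for element in row[array_field]:
            # Keep every non-array attribute and attach the single element.
            out = {k: v for k, v in row.items() if k != array_field}
            out[alias] = element
            yield out


# Hypothetical toy data with the same shape as the orders dataset.
orders = [
    {"order_id": 0, "items": [{"product": "fan"}, {"product": "kettle"}]},
    {"order_id": 1, "items": [{"product": "fan"}]},
]

exploded = list(explode_rows(orders, "items", "i"))
for row in exploded:
    print(row["order_id"], row["i"]["product"])
```

Two input rows become three output rows, one per item. As with `explode`, a row whose array is empty produces no output row in this sketch.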



Now we can use this table to filter.

In [13]:
exploded_df = orders_df.select(explode("items").alias("i"), "i.product", "order_id")
exploded_df.filter(exploded_df["product"] == "fan").count()

39922

You might have tried to access the `i.product` column directly using a ```.filter``` right after the ```.select```. That, however, does not work, because ```orders_df``` itself has no such column: it only exists in the dataframe returned by ```.select```, so a clause like ```(orders_df["i.product"] == "fan")``` cannot be resolved. A possible workaround when using Dataframe operations is to pass a string clause to ```.filter```, so that the `product` column is resolved after it has been added by the ```.select```.

In [14]:
orders_df.select(explode("items").alias("i"), "i.product", "order_id").filter("product = 'fan'").count()

39922

Any idea why the `explode` query finds more "fan" entries than the `array_contains` one?

An order can contain more than one "fan" item. `array_contains` counts such an order once, whereas `explode` produces one row per item. You can verify this by inspecting the `orders.jsonl` data, e.g.:
```json
{"order_id": 2, "date": "2016-6-6", "customer": {"first_name": "Brendon", "last_name": "Sicilia"}, "items": [..., {"product": "fan", "quantity": 7, "price": 1.1}, ..., {"product": "fan", "quantity": 8, "price": 1.15}]}
```
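
The difference between the two counts can be reproduced on a tiny, hypothetical dataset in plain Python. The sketch below only mimics the counting logic of the two queries; the variable names and the data are assumptions made for illustration.

```python
# Hypothetical toy data mirroring the structure of orders.jsonl.
orders = [
    {"order_id": 1, "items": [{"product": "fan"}, {"product": "kettle"}]},
    {"order_id": 2, "items": [{"product": "fan"}, {"product": "fan"}]},
    {"order_id": 3, "items": [{"product": "toaster"}]},
]

# array_contains-style count: one hit per order with at least one fan.
orders_with_fan = sum(
    1 for o in orders if any(i["product"] == "fan" for i in o["items"])
)

# explode-style count: one hit per fan item, across all orders.
fan_items = sum(
    1 for o in orders for i in o["items"] if i["product"] == "fan"
)

print(orders_with_fan, fan_items)  # 2 3
```

Order 2 contains two fans, so it contributes once to the first count and twice to the second.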

<b>Project the nested columns.</b>

In [15]:
orders_df.select(explode("items").alias("i"), "*").select(
    "order_id", "customer.*", "date", "i.*").limit(3).show()

+--------+----------+---------+--------+-----+---------------+--------+
|order_id|first_name|last_name|    date|price|        product|quantity|
+--------+----------+---------+--------+-----+---------------+--------+
|       0|   Preston|   Landry|2018-2-4| 1.53|            fan|       5|
|       0|   Preston|   Landry|2018-2-4| 1.33|computer screen|       6|
|       0|   Preston|   Landry|2018-2-4| 1.06|         kettle|       6|
+--------+----------+---------+--------+-----+---------------+--------+



### 1.3. Exercises

1) Find the average quantity at which each product is purchased. Only show the top 10 products by average quantity. <br> 
(Hint: You may need to import the function ```desc``` from ```pyspark.sql.functions``` to define descending order)

In [16]:
from pyspark.sql.functions import desc

orders_df.select(explode("items").alias("i"), "*").select(
    "i.product", "i.quantity"
).groupBy("product").avg("quantity").orderBy(desc("avg(quantity)")).limit(10).show()

+---------------+-----------------+
|        product|    avg(quantity)|
+---------------+-----------------+
|        toaster|5.515549016184942|
|       the book|5.514178678641427|
|         kettle|5.512053325314489|
|computer screen|5.504839685420448|
|     mouse trap|5.503895651308093|
|            fan|5.496342868593758|
|     headphones|5.485920795060985|
|       notebook|5.483182341458532|
| whiskey bottle|5.475555222463714|
| stuffed animal|5.470854598218753|
+---------------+-----------------+



2) Find the most expensive order. <br>
(Hint: First build a dataframe by applying `explode` to the items. Then calculate the total price of each item and aggregate per order.)

In [17]:
exploded_df = orders_df.select(explode("items").alias("i"), "*")
exploded_df.select(
    "order_id", (exploded_df["i.quantity"] * exploded_df["i.price"]).alias("total")
).groupBy("order_id").sum("total").orderBy(desc("sum(total)")).limit(1).show()

+--------+------------------+
|order_id|        sum(total)|
+--------+------------------+
|   99636|104.95999999999998|
+--------+------------------+
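
The three steps from the hint (unnest the items, compute a total per item, aggregate per order) can also be sketched in plain Python. The data below is hypothetical and the sketch is not meant to replace the Dataframe solution; it only makes the individual steps explicit.

```python
from collections import defaultdict

# Hypothetical toy data with the same shape as the orders dataset.
orders = [
    {"order_id": 0, "items": [{"price": 1.5, "quantity": 2},
                              {"price": 2.0, "quantity": 1}]},
    {"order_id": 1, "items": [{"price": 1.0, "quantity": 4},
                              {"price": 0.5, "quantity": 4}]},
]

totals = defaultdict(float)
for order in orders:
    for item in order["items"]:  # step 1: unnest (explode) the items
        line_total = item["quantity"] * item["price"]  # step 2: total per item
        totals[order["order_id"]] += line_total  # step 3: sum per order

# Equivalent of orderBy(desc(...)).limit(1): keep the largest total.
best_id, best_total = max(totals.items(), key=lambda kv: kv[1])
print(best_id, best_total)  # 1 6.0
```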



## <center>2. Spark SQL</center>

Spark SQL allows users to formulate their queries in SQL. It requires the use of Dataframes, which, as said before, are similar to relational tables. In addition to offering a familiar interface, SQL queries go through the same automatic query optimization as Dataframe operations, so they often perform better than equivalent hand-written RDD code.

First we need to install the `sparksql-magic` package, which provides the `%%sparksql` cell magic.

In [None]:
!pip install sparksql-magic

In [20]:
%load_ext sparksql_magic

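In order to use SQL we need to register the dataframe as a temporary view.

<b>Note that this view only exists for the current session.</b>

In [21]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
orders_df.createOrReplaceTempView("orders")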

### 2.1. Queries

Now we can run SQL queries on the registered table `orders`. We will run the same queries as in the previous section, but in SQL.

Nested fields can again be reached with dot notation.

In [22]:
%%sparksql
-- Navigate to the nested field with dot notation
SELECT count(*)
FROM orders
WHERE orders.customer.last_name = 'Landry'

count(1)
1960


How about nested arrays?

In [23]:
%%sparksql
-- How about nested arrays?
SELECT order_id, items
FROM orders AS o
ORDER BY order_id
LIMIT 5

order_id,items
0,"[Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)]"
1,"[Row(price=1.61, product='fan', quantity=7), Row(price=1.39, product='whiskey bottle', quantity=2)]"
2,"[Row(price=1.41, product='the book', quantity=6), Row(price=1.3, product='notebook', quantity=5), Row(price=1.1, product='fan', quantity=7), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.39, product='headphones', quantity=8), Row(price=1.78, product='whiskey bottle', quantity=3), Row(price=1.15, product='fan', quantity=8)]"
3,"[Row(price=1.05, product='computer screen', quantity=10), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.42, product='whiskey bottle', quantity=10)]"
4,"[Row(price=1.92, product='headphones', quantity=2), Row(price=1.44, product='fan', quantity=2), Row(price=1.84, product='kettle', quantity=4), Row(price=1.44, product='stuffed animal', quantity=5)]"


Let us try to find orders of a fan.

In [None]:
%%sparksql 
SELECT count(*)
FROM orders
WHERE items.product = "fan"

The above code doesn't work: `items.product` is an array of strings, so it cannot be compared to the single string "fan". Use [```array_contains```](https://spark.apache.org/docs/latest/api/sql/index.html#array_contains) instead.

In [25]:
%%sparksql

SELECT count(*)
FROM orders
WHERE array_contains(items.product, "fan")

count(1)
32778


Let us try to unnest the data.

Unnest the products with [`explode`](https://spark.apache.org/docs/latest/api/sql/index.html#explode).

`explode` will generate as many rows as there are elements in the array and match them to other attributes.

In [26]:
%%sparksql
SELECT explode(items) as i, i.product, order_id
FROM orders
ORDER BY order_id
limit 5

i,product,order_id
"Row(price=1.53, product='fan', quantity=5)",fan,0
"Row(price=1.33, product='computer screen', quantity=6)",computer screen,0
"Row(price=1.06, product='kettle', quantity=6)",kettle,0
"Row(price=1.96, product='stuffed animal', quantity=3)",stuffed animal,0
"Row(price=1.09, product='the book', quantity=7)",the book,0


Now we can use this table to filter.

In [27]:
%%sparksql
-- Filter on product
SELECT count(*)
    FROM (
    SELECT explode(items) as i, i.product, order_id
    FROM orders
    ORDER BY order_id
    )
WHERE product = "fan"

count(1)
39922


You might have tried to filter on the `i.product` column directly in the same query that defines it. That, however, does not work, because aliases defined in the ```SELECT``` list are not visible to the ```WHERE``` clause of the same query. In order to access the generated columns directly, we need to unnest the data and make it part of our ```FROM``` clause. [```LATERAL VIEW```](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-lateral-view.html) lets us do just that, matching each non-array attribute to an unnested row from the array.

In [48]:
%%sparksql
SELECT *
FROM orders LATERAL VIEW explode(items) item_table AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

customer,date,items,order_id,flat_items
"Row(first_name='Preston', last_name='Landry')",2018-2-4,"[Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)]",0,"Row(price=1.53, product='fan', quantity=5)"
"Row(first_name='Jamari', last_name='Dominguez')",2016-1-8,"[Row(price=1.61, product='fan', quantity=7), Row(price=1.39, product='whiskey bottle', quantity=2)]",1,"Row(price=1.61, product='fan', quantity=7)"
"Row(first_name='Brendon', last_name='Sicilia')",2016-6-6,"[Row(price=1.41, product='the book', quantity=6), Row(price=1.3, product='notebook', quantity=5), Row(price=1.1, product='fan', quantity=7), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.39, product='headphones', quantity=8), Row(price=1.78, product='whiskey bottle', quantity=3), Row(price=1.15, product='fan', quantity=8)]",2,"Row(price=1.1, product='fan', quantity=7)"


Project the nested columns.

In [49]:
%%sparksql
SELECT order_id, customer.first_name, customer.last_name, date, flat_items.*
FROM orders LATERAL VIEW explode(items) item_table as flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

order_id,first_name,last_name,date,price,product,quantity
0,Preston,Landry,2018-2-4,1.53,fan,5
1,Jamari,Dominguez,2016-1-8,1.61,fan,7
2,Brendon,Sicilia,2016-6-6,1.1,fan,7


Having built an unnested table, we can now easily aggregate over the previously nested columns.

### 2.2. Exercises

1) Find the average quantity at which each product is purchased. Only show the top 10 products by average quantity.

In [52]:
%%sparksql
SELECT flat_items.product, AVG(flat_items.quantity) as avg_quantity
FROM orders LATERAL VIEW explode(items) flat_table flat_items
GROUP BY flat_items.product
ORDER BY avg_quantity DESC
LIMIT 10

product,avg_quantity
toaster,5.515549016184942
the book,5.514178678641427
kettle,5.512053325314489
computer screen,5.504839685420448
mouse trap,5.503895651308093
fan,5.496342868593758
headphones,5.485920795060985
notebook,5.483182341458532
whiskey bottle,5.475555222463714
stuffed animal,5.470854598218753


2) Find the most expensive order.

In [53]:
%%sparksql
SELECT order_id, SUM(flat_items.quantity * flat_items.price) as total
FROM orders LATERAL VIEW explode(items) flat_table flat_items
GROUP BY order_id
ORDER BY total desc
LIMIT 1

order_id,total
99636,104.95999999999998


## <center>3. Create Nestedness (Optional)</center>

We have already seen how Spark Dataframes / Spark SQL <b>unnest</b> arrays with the `explode` function. For the other direction, Spark Dataframes / Spark SQL also provide ways to nest our data by creating arrays, especially after clauses like `groupBy`.

In classic SQL (e.g. PostgreSQL), a `GROUP BY` is typically followed by an aggregation function (`max`, `sum`, `count`, ...) that reduces each group to a single value. For example, for each customer (assuming no two customers share both first name and last name), we want to find the number of distinct dates on which they placed an order. The query can be written with both Spark Dataframes and Spark SQL. With Dataframes it could look like this, using [`countDistinct`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.countDistinct.html):

In [32]:
from pyspark.sql.functions import countDistinct
orders_df.groupBy("customer.first_name", "customer.last_name").agg(countDistinct("date")).show()

+-------------------+------------------+-----------+
|customer.first_name|customer.last_name|count(date)|
+-------------------+------------------+-----------+
|               Zane|              Dahl|          3|
|             Dorian|              Dahl|          2|
|               Rory|             Dower|          2|
|             Morgan|              Miao|          2|
|             Ashlyn|             Hatch|          1|
|             Landen|         Galatioto|          2|
|              Allan|                Po|          4|
|           Clarissa|           Sicilia|          2|
|              Annie|             Dower|          2|
|              Micah|                Mo|          4|
|             Morgan|           Poitras|          2|
|             Gordon|            Gruber|          2|
|         Alexandria|       Butterfield|          3|
|             Thomas|             Dower|          1|
|              Ariel|           Coulson|          3|
|            Xiomara|         Christofi|      

In [None]:
%%sparksql
select customer.first_name, customer.last_name, count(distinct date) from orders 
group by customer.first_name, customer.last_name

But what if we are interested not only in the count of distinct dates, but in the actual dates themselves? Luckily, Spark Dataframes / Spark SQL provide functions that preserve the original list of dates. If we would like to know, for each customer, on which dates they placed an order, we can use the [`collect_set`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_set.html) function:

In [35]:
from pyspark.sql.functions import collect_set
orders_df.groupBy("customer.first_name", "customer.last_name").agg(collect_set("date")).show()

+-------------------+--------------------+--------------------+
|customer.first_name|  customer.last_name|   collect_set(date)|
+-------------------+--------------------+--------------------+
|              Abbie|                Egan|[2018-4-8, 2016-3...|
|           Abigayle|           Mc namara|[2016-4-2, 2017-6-9]|
|            Adalynn|              Ardeni|[2018-2-2, 2018-6...|
|               Aden|          Rosenbloom|         [2016-3-10]|
|             Adonis|              Badash|          [2017-6-8]|
|            Agustin|          Srivastava|         [2018-4-10]|
|              Aiden|             Suchoff|          [2018-2-8]|
|             Aiyana|              Landry|          [2018-2-3]|
|             Alaina|              Gruber|[2016-3-1, 2018-5-1]|
|             Alayna|               Mayer|         [2016-3-10]|
|         Alexandria|         Butterfield|[2018-5-3, 2017-4...|
|         Alexzander|              Landry|[2017-1-3, 2017-6-3]|
|              Alice|             Balste

In [None]:
%%sparksql
select customer.first_name, customer.last_name, collect_set(date) from orders 
group by customer.first_name, customer.last_name

In other cases, where we want to preserve all entries rather than the de-duplicated ones, we can use the [`collect_list`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_list.html) function. For example, for each date we want to record the last names of the customers who ordered. Since two customers might share the same last name, we need to collect all of them. The query should look like this:

In [40]:
from pyspark.sql.functions import collect_list
orders_df.groupBy("date").agg(collect_list("customer.last_name")).show()

+---------+--------------------------------+
|     date|collect_list(customer.last_name)|
+---------+--------------------------------+
| 2017-4-8|            [Egan, Gruber, Le...|
| 2016-2-5|            [Dahl, Findley, M...|
| 2016-1-1|            [Suchoff, Lowe, D...|
| 2018-2-6|            [Gottardo, Po, Go...|
| 2018-1-7|            [Macmahon, Hirsch...|
| 2016-6-7|            [Horah, Whitla, H...|
|2018-5-10|            [Gruber, Drago, S...|
| 2017-5-4|            [Srivastava, Ho-s...|
| 2018-4-7|            [Drago, Mayer, La...|
|2018-6-10|            [Badash, Decaro, ...|
| 2017-3-3|            [Badash, Marinko,...|
| 2017-3-5|            [Rosenbloom, Bals...|
| 2017-6-4|            [Whitla, Egan, Lo...|
| 2018-6-3|            [Cerda, Berenguie...|
| 2018-1-6|            [Zapata, Miao, Ne...|
| 2016-4-9|            [Badash, Dahlsted...|
| 2016-1-5|            [Suchoff, Srivast...|
|2018-1-10|            [Srivastava, Domi...|
| 2017-1-5|            [Dower, Zapata, M...|
| 2018-1-4

In [None]:
%%sparksql
select date, collect_list(customer.last_name) from orders group by date
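
The relationship between `countDistinct`, `collect_set` and `collect_list` can be sketched in plain Python. The data and variable names below are hypothetical, and the sketch only mirrors the grouping semantics, not Spark's execution.

```python
from collections import defaultdict

# Hypothetical (last_name, date) pairs; Dahl ordered twice on the same day.
rows = [
    ("Dahl", "2016-1-1"),
    ("Dahl", "2016-1-1"),
    ("Dahl", "2017-3-5"),
    ("Miao", "2018-2-6"),
]

as_list = defaultdict(list)  # collect_list: keeps duplicates
as_set = defaultdict(set)    # collect_set: de-duplicates
for last_name, date in rows:
    as_list[last_name].append(date)
    as_set[last_name].add(date)

# countDistinct is the size of the de-duplicated collection.
count_distinct = {name: len(dates) for name, dates in as_set.items()}

print(len(as_list["Dahl"]), len(as_set["Dahl"]), count_distinct["Dahl"])  # 3 2 2
```

The list keeps the repeated date, while the set and the distinct count do not.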

Now try it yourself.

For every order on 2016-1-1, return the list of products that appeared in that order:

In [42]:
from pyspark.sql.functions import explode
exploded_df = orders_df.select(explode("items").alias("i"), "*")
exploded_df.filter(exploded_df["date"] == "2016-1-1").groupBy("order_id").agg(collect_list("i.product")).show()

+--------+-----------------------+
|order_id|collect_list(i.product)|
+--------+-----------------------+
|    8484|   [headphones, whis...|
|   33209|   [notebook, the bo...|
|   84024|   [computer screen,...|
|   91703|   [notebook, stuffe...|
|   28555|   [kettle, computer...|
|    3120|   [whiskey bottle, ...|
|   48533|               [kettle]|
|   97472|   [toaster, the boo...|
|    1280|   [whiskey bottle, ...|
|   74743|           [mouse trap]|
|   85793|   [whiskey bottle, ...|
|   23754|   [the book, the bo...|
|   24308|   [fan, fan, whiske...|
|    9418|           [mouse trap]|
|   98787|   [kettle, stuffed ...|
|   35723|   [computer screen,...|
|   47083|   [mouse trap, fan,...|
|   58037|   [fan, kettle, mou...|
|   66103|           [headphones]|
|   94704|   [notebook, notebo...|
+--------+-----------------------+
only showing top 20 rows



For every product, return the set of dates on which it was purchased:

In [43]:
from pyspark.sql.functions import collect_set
exploded_df.orderBy("date").groupBy("i.product").agg(collect_set("date")).show()

+---------------+--------------------+
|      i.product|   collect_set(date)|
+---------------+--------------------+
|       the book|[2018-4-6, 2018-4...|
|     mouse trap|[2018-4-6, 2018-4...|
|computer screen|[2018-4-6, 2018-4...|
| whiskey bottle|[2018-4-6, 2018-4...|
|        toaster|[2018-4-6, 2018-4...|
| stuffed animal|[2018-4-6, 2018-4...|
|         kettle|[2018-4-6, 2018-4...|
|            fan|[2018-4-6, 2018-4...|
|     headphones|[2018-4-6, 2018-4...|
|       notebook|[2018-4-6, 2018-4...|
+---------------+--------------------+



One drawback of `collect_set` and `collect_list` is that they accept only one column as their argument. Later we will see how we can create nestedness on pretty much everything once we get the hang of the mighty JSONiq.