# <center>Big Data &ndash; Exercises </center>
## <center>Fall 2025 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and SparkSQL</center>

## Preparation for the exercise in Spark

1. Drag this notebook into the `notebooks` folder of your exam magic box

2. Start Docker with ```docker-compose up -d```

3. Launch Jupyter from the Docker container

## <center>1. Spark Dataframes</center>

Spark Dataframes allow the user to perform simple and efficient operations on data, as long as the data is structured and has a schema. Dataframes are similar to tables in relational databases, but they allow for nesting. Conceptually, a dataframe is a specialization of a Spark RDD with schema information attached. You can find more information in Karau, H. et al. (2015). Learning Spark, Chapter 9 [link](https://learning.oreilly.com/library/view/learning-spark/9781449359034/?ar) (optional reading).

### 1.1. Data preprocessing

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()
print("Spark Version", spark.version)

sc = spark.sparkContext

path = "orders.jsonl"
orders_df = spark.read.json(path).cache()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/17 16:31:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark Version 4.0.0


                                                                                

The type of our dataset object is DataFrame and we can check its schema with the `printSchema()` method.

In [2]:
print(type(orders_df))
orders_df.printSchema()

<class 'pyspark.sql.classic.dataframe.DataFrame'>
root
 |-- customer: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- product: string (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- order_id: long (nullable = true)



Print two sample rows (`vertical` and `truncate` are optional parameters).

In [3]:
orders_df.limit(2).show(vertical=True, truncate=False)

-RECORD 0---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 customer | {Preston, Landry}                                                                                                                                                 
 date     | 2018-2-4                                                                                                                                                          
 items    | [{1.53, fan, 5}, {1.33, computer screen, 6}, {1.06, kettle, 6}, {1.96, stuffed animal, 3}, {1.09, the book, 7}, {1.42, headphones, 9}, {1.67, whiskey bottle, 3}] 
 order_id | 0                                                                                                                                                                 
-RECORD 1--------------------------------------------------------------------------------------------------------------------

You can access the underlying RDD object and use any functions you learned for Spark RDDs.

In [4]:
orders_df.rdd.filter(lambda ordr: ordr.customer.last_name == "Landry").count()

                                                                                

1960

### 1.2. Dataframe Operations
We will perform some queries using operations on Dataframes ([Here](https://spark.apache.org/docs/4.0.0/sql-getting-started.html) is a quick guide on DF Operations, with a link to the [Functions API Documentation](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/functions.html) and [DataFrame API Documentation](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/dataframe.html))

We can select columns and show the result

In [5]:
orders_df.select("customer.first_name", "customer.last_name").limit(5).show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|   Preston|   Landry|
|    Jamari|Dominguez|
|   Brendon|  Sicilia|
|    Armani|   Ardeni|
|    Jamari|     Miao|
+----------+---------+



As you can see, we can navigate inside nested items with the dot notation. We can also filter rows based on column values and combine conditions with boolean operators. Keep in mind that we use `|` instead of `or` and `&` instead of `and`, and that each comparison must be wrapped in parentheses because `&` and `|` bind more tightly than `==`.

In [6]:
orders_df.filter((orders_df["customer.last_name"] == "Landry") & (orders_df["customer.first_name"] == "John")).count()

2
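
The parentheses around each comparison in the cell above are not optional. As a minimal, Spark-free sketch, Python's standard `ast` module can show how the parser groups the expression with and without them. The helper `top_level_node` and the names `a` and `b` are made up for illustration.

```python
import ast


def top_level_node(expr: str) -> str:
    """Return the class name of the outermost AST node of a Python expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


unparenthesized = "a == 1 & b == 2"
parenthesized = "(a == 1) & (b == 2)"

# Without parentheses, `&` binds first and the two `==` form one chained comparison.
tree = ast.parse(unparenthesized, mode="eval").body
middle_operand = ast.unparse(tree.comparators[0])

print(top_level_node(unparenthesized))  # Compare
print(middle_operand)                   # 1 & b
print(top_level_node(parenthesized))    # BinOp
```

Without parentheses Python evaluates `1 & b` first and then chains the two `==` comparisons, which is not the condition we meant to express on the columns.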

How about nested arrays?

In [7]:
orders_df.select("order_id", "items").orderBy("order_id").limit(5).show()

+--------+--------------------+
|order_id|               items|
+--------+--------------------+
|       0|[{1.53, fan, 5}, ...|
|       1|[{1.61, fan, 7}, ...|
|       2|[{1.41, the book,...|
|       3|[{1.05, computer ...|
|       4|[{1.92, headphone...|
+--------+--------------------+



If we try to find all the orders that include a fan this way:

In [8]:
orders_df.filter(orders_df["items.product"] == "fan").count()


AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(items.product = fan)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("ARRAY<STRING>" and "STRING"). SQLSTATE: 42K09;
'Filter (items#8.product = fan)
+- Relation [customer#6,date#7,items#8,order_id#9L] json


Notice that the above code doesn't work! The reason is that the left side of the `==` operator is an array of strings, while the right side is a single string. We need to check for inclusion instead. Luckily, Spark provides a function for that: ```array_contains()```, which we have to import from the ```pyspark.sql.functions``` module.

In [9]:
from pyspark.sql.functions import array_contains

orders_df.filter(array_contains("items.product", "fan")).count()

32778

Let us try to unnest the data. We can do so by using the ```explode``` function.

Explode will generate as many rows as there are elements in the array and match them to other attributes. You should name the newly generated exploded column in order to be able to refer to it.

In [10]:
from pyspark.sql.functions import explode

(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "i.quantity", "order_id")
    .orderBy("order_id", "product").limit(5).show()
)

+---------------+--------+--------+
|        product|quantity|order_id|
+---------------+--------+--------+
|computer screen|       6|       0|
|            fan|       5|       0|
|     headphones|       9|       0|
|         kettle|       6|       0|
| stuffed animal|       3|       0|
+---------------+--------+--------+
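
To make the row semantics concrete, here is a plain-Python sketch (no Spark involved) of what `explode` does conceptually. The three toy orders and the helper `explode_items` are assumptions made for illustration; the real function operates on distributed Dataframe columns.

```python
# Toy stand-in for orders_df: each order carries a nested list of items.
orders = [
    {"order_id": 0, "items": [{"product": "fan", "quantity": 5},
                              {"product": "kettle", "quantity": 6}]},
    {"order_id": 1, "items": [{"product": "fan", "quantity": 7}]},
    {"order_id": 2, "items": []},  # hypothetical order without items
]


def explode_items(rows):
    """Yield one output row per array element, repeating the other attributes."""
    for row in rows:
        for item in row["items"]:
            yield {"i": item, "order_id": row["order_id"]}


exploded = list(explode_items(orders))
for r in exploded:
    print(r["order_id"], r["i"]["product"], r["i"]["quantity"])
```

The order with an empty `items` list contributes no rows at all. Spark's `explode` likewise produces no output row for an empty or null array.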



Now we can use this table to further filter for the orders that include a fan. You might want to access the ```i.product``` column directly inside a ```.filter```, like so:

In [11]:
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "order_id")
    .filter(orders_df["i.product"] == "fan")
    .distinct()
    .count()
)

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `i`.`product` cannot be resolved. Did you mean one of the following? [`customer`, `date`, `items`, `order_id`]. SQLSTATE: 42703

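That, however, does not work: ```orders_df["i.product"]``` is resolved against the schema of `orders_df`, which has no column `i`. The alias only exists in the Dataframe produced by the first ```select```.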

(Note the usage of `.distinct()`. Why would we want to use it? Hint: check out order with id=2)

In order to filter on a newly added column we have a few different options.

1. The most verbose version is to use an intermediate table:

In [12]:
exploded_df = (orders_df
                .select(explode("items").alias("i"), "order_id")
                .select("i.product", "order_id"))

(exploded_df
    .filter(exploded_df["product"] == "fan")
    .select('order_id')
    .distinct()
    .count()
)

32778

2. We can use a SQL expression inside ```.filter```. When ```.filter``` receives a single string argument, Spark parses it as a SQL condition and resolves the product column inside of it:

In [13]:
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "order_id")
    .filter("i.product == 'fan'")
    .distinct()
    .count()
)

32778

3. We can also use a helper function ```col``` from ```pyspark.sql.functions``` to create a [Column](https://spark.apache.org/docs/4.0.0/api/python/reference/pyspark.sql/api/pyspark.sql.Column.html#pyspark.sql.Column) expression on the fly:

In [14]:
from pyspark.sql.functions import col
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "order_id")
    .filter(col("i.product") == "fan")
    .distinct()
    .count()
)

32778
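
All three variants call `.distinct()` before counting. The following plain-Python sketch (no Spark involved) illustrates why: an order can list the same product more than once, so counting exploded rows is not the same as counting orders. The toy `(order_id, product)` pairs are an assumption modelled on order 2, which contains two fans.

```python
# Exploded (order_id, product) pairs; order 2 lists "fan" twice.
exploded = [
    (0, "fan"), (0, "kettle"),
    (1, "fan"),
    (2, "the book"), (2, "fan"), (2, "fan"),
]

fan_rows = [(order_id, product) for order_id, product in exploded if product == "fan"]

rows_with_fan = len(fan_rows)         # counts every fan item row
orders_with_fan = len(set(fan_rows))  # distinct pairs, so one per order

print(rows_with_fan)    # 4
print(orders_with_fan)  # 3
```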

We can also project the nested columns. Just like in SQL, ```*``` is a shortcut for selecting all of the fields involved.



In [15]:
(orders_df
    .select(explode("items").alias("i"), "*")
    .select("order_id", "customer.*", "date", "i.*")
    .limit(3).show())

+--------+----------+---------+--------+-----+---------------+--------+
|order_id|first_name|last_name|    date|price|        product|quantity|
+--------+----------+---------+--------+-----+---------------+--------+
|       0|   Preston|   Landry|2018-2-4| 1.53|            fan|       5|
|       0|   Preston|   Landry|2018-2-4| 1.33|computer screen|       6|
|       0|   Preston|   Landry|2018-2-4| 1.06|         kettle|       6|
+--------+----------+---------+--------+-----+---------------+--------+



Sorting can be done with the ```orderBy``` method. There are several ways of specifying the sort order.

1. By column names given as strings. Note that with a single boolean `ascending` flag, all of the columns share the same sort order: either ascending or descending.

In [16]:
orders_df.orderBy("customer.last_name", "customer.first_name", ascending=True).limit(2).show()

+---------------+--------+--------------------+--------+
|       customer|    date|               items|order_id|
+---------------+--------+--------------------+--------+
|{Aaden, Ardeni}|2017-5-6|[{1.94, stuffed a...|   91132|
|{Abbie, Ardeni}|2016-5-8|[{1.93, computer ...|   29081|
+---------------+--------+--------------------+--------+



2. By using Column objects and their ```asc``` and ```desc``` methods:

In [17]:
(orders_df
    .orderBy(
        orders_df["customer.last_name"].asc(),
        orders_df["customer.first_name"].desc(),
        orders_df["order_id"].asc())
    .limit(2).show()
)

+--------------+--------+--------------------+--------+
|      customer|    date|               items|order_id|
+--------------+--------+--------------------+--------+
|{Zion, Ardeni}|2017-6-9|[{1.79, computer ...|   19611|
|{Zion, Ardeni}|2018-1-3|[{1.59, fan, 8}, ...|   65150|
+--------------+--------+--------------------+--------+



3. By using the `asc` and `desc` functions from `pyspark.sql.functions`:

In [18]:
from pyspark.sql.functions import asc, desc
(orders_df
    .orderBy(
        asc("customer.last_name"),
        desc("customer.first_name"),
        asc("order_id"))
    .limit(2).show()
)

+--------------+--------+--------------------+--------+
|      customer|    date|               items|order_id|
+--------------+--------+--------------------+--------+
|{Zion, Ardeni}|2017-6-9|[{1.79, computer ...|   19611|
|{Zion, Ardeni}|2018-1-3|[{1.59, fan, 8}, ...|   65150|
+--------------+--------+--------------------+--------+



Note: Only the last ```orderBy``` call determines the final sort order. Chaining several calls does not combine them, as you might have expected. You can verify this using an `.explain()` call.

In [19]:
(orders_df
    .orderBy("customer.last_name", ascending=True).orderBy("customer.first_name", ascending=False)
    .explain()
)
(orders_df
    .orderBy(orders_df["customer.last_name"].asc(), orders_df["customer.first_name"].desc())
    .explain()
)

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [customer#6.first_name DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(customer#6.first_name DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [plan_id=756]
      +- InMemoryTableScan [customer#6, date#7, items#8, order_id#9L]
            +- InMemoryRelation [customer#6, date#7, items#8, order_id#9L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- FileScan json [customer#6,date#7,items#8,order_id#9L] Batched: false, DataFilters: [], Format: JSON, Location: InMemoryFileIndex(1 paths)[file:/home/jupyter/work/orders.jsonl], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<customer:struct<first_name:string,last_name:string>,date:string,items:array<struct<price:d...


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [customer#6.last_name ASC NULLS FIRST, customer#6.first_name DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(customer#6.last_name ASC NULLS FIRST,

### 1.3 Exercises

#### 1. Find the number of distinct products

In [20]:
# Solution
(orders_df
    .select(explode("items").alias("i"))
    .select("i.product")
    .distinct()
    .count()
)

10

#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by quantity.

In [21]:
# Solution
from pyspark.sql.functions import avg, desc, sum

# Explanation:
"""
We should not assume that every order lists each product only once.
Take a look at the order with id=2: it has two items with the same product, "fan".
"""
orders_df.filter(orders_df["order_id"] == 2).select(explode("items")).show()
"""
That is why we first group by order_id and product
to get the total quantity ("unique_quantity") of each product in each order,
and only then group by product to get the average quantity per product.
"""

(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select("i.product", "i.quantity", "order_id")

    .groupBy("order_id", "product")
    .agg(sum("quantity").alias("unique_quantity"))

    .groupBy("product")
    .agg(avg("unique_quantity").alias("avg_quantity"))

    .orderBy(desc("avg_quantity"))
    .limit(10).show()
)

+--------------------+
|                 col|
+--------------------+
| {1.41, the book, 6}|
|  {1.3, notebook, 5}|
|       {1.1, fan, 7}|
|{1.5, stuffed ani...|
|{1.39, headphones...|
|{1.78, whiskey bo...|
|      {1.15, fan, 8}|
+--------------------+

+---------------+------------------+
|        product|      avg_quantity|
+---------------+------------------+
|        toaster|6.7136569225632154|
|       the book| 6.699219338698496|
|            fan| 6.694276648971871|
|       notebook| 6.674699162223915|
|         kettle| 6.672450403445975|
|     mouse trap| 6.670729499788813|
|computer screen| 6.661826612165212|
|     headphones| 6.659314546839299|
| stuffed animal| 6.656550590527213|
| whiskey bottle| 6.645876288659794|
+---------------+------------------+
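
As a small, Spark-free sketch of why the two grouping steps matter, the snippet below compares a naive per-row average with the two-step average from the solution. The fan quantities are taken from the sample rows of orders 0, 1 and 2 shown earlier; the variable names are made up for illustration.

```python
from collections import defaultdict

# (order_id, product, quantity) rows after exploding; order 2 lists "fan" twice.
rows = [
    (0, "fan", 5),
    (1, "fan", 7),
    (2, "fan", 7),
    (2, "fan", 8),
]

# Naive: average over item rows, treating the two fan rows of order 2 separately.
fan_quantities = [q for _, product, q in rows if product == "fan"]
naive_avg = sum(fan_quantities) / len(fan_quantities)

# Two-step: first sum per (order_id, product), then average per product.
per_order = defaultdict(int)
for order_id, product, q in rows:
    per_order[(order_id, product)] += q
fan_totals = [q for (_, product), q in per_order.items() if product == "fan"]
two_step_avg = sum(fan_totals) / len(fan_totals)

print(naive_avg)     # 6.75
print(two_step_avg)  # 9.0
```

The naive version underestimates the quantity per order because it splits the 15 fans of order 2 into two smaller purchases.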



#### 3. Find the most expensive order

In [22]:
# Solution
from pyspark.sql.functions import col, sum
(orders_df
    .select(explode("items").alias("i"), "order_id")
    .select((col("i.quantity") * col("i.price")).alias("total_price"), "order_id")
    
    .groupBy("order_id")
    .agg(sum("total_price").alias("total_price"))
    .orderBy(desc("total_price"))
    .limit(3).show()
)


+--------+------------------+
|order_id|       total_price|
+--------+------------------+
|   99636|104.95999999999998|
|   43932|103.30000000000001|
|   90800|             103.0|
+--------+------------------+



## <center>2. Spark SQL</center>

Spark SQL enables users to write queries using an SQL-like dialect, but it requires DataFrames, since they closely resemble relational tables. In addition to providing a familiar interface, SQL queries can deliver better performance compared to RDDs, leveraging the efficiency of DataFrame operations and Spark's automatic query optimization.

The sparksql-magic should come preinstalled in the exam magic box. We just need to load it.

In [23]:
# !pip install sparksql-magic --quiet
%load_ext sparksql_magic

In order to use SQL, we need to register the Dataframe as a temporary view.

This view only exists for the current Spark session.

In [24]:
orders_df.createOrReplaceTempView("orders")

### 2.1 Queries

Finally, we can run SQL queries on the registered view. We will run the same queries as in the previous section, but with SQL.

As you can see we can navigate to the nested items with the dot.

In [25]:
%%sparksql
SELECT count(*)
FROM orders
WHERE orders.customer.last_name == "Landry"

| count(1) |
|----------|
| 1960     |


How about nested arrays?

In [26]:
%%sparksql
SELECT order_id, items
FROM orders AS o
ORDER BY order_id
LIMIT 5

| order_id | items |
|----------|-------|
| 0 | [Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)] |
| 1 | [Row(price=1.61, product='fan', quantity=7), Row(price=1.39, product='whiskey bottle', quantity=2)] |
| 2 | [Row(price=1.41, product='the book', quantity=6), Row(price=1.3, product='notebook', quantity=5), Row(price=1.1, product='fan', quantity=7), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.39, product='headphones', quantity=8), Row(price=1.78, product='whiskey bottle', quantity=3), Row(price=1.15, product='fan', quantity=8)] |
| 3 | [Row(price=1.05, product='computer screen', quantity=10), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.42, product='whiskey bottle', quantity=10)] |
| 4 | [Row(price=1.92, product='headphones', quantity=2), Row(price=1.44, product='fan', quantity=2), Row(price=1.84, product='kettle', quantity=4), Row(price=1.44, product='stuffed animal', quantity=5)] |


Let us try to find the orders that include a fan.

In [27]:
%%sparksql 
SELECT count(*)
FROM orders
WHERE items.product = "fan"


AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(items.product = fan)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("ARRAY<STRING>" and "STRING"). SQLSTATE: 42K09; line 3 pos 6;
'Aggregate [unresolvedalias(count(1))]
+- 'Filter (items#8.product = fan)
   +- SubqueryAlias orders
      +- View (`orders`, [customer#6, date#7, items#8, order_id#9L])
         +- Relation [customer#6,date#7,items#8,order_id#9L] json


The above code doesn't work! Once again, we need to use ```array_contains``` instead.

In [28]:
%%sparksql
SELECT count(*)
FROM orders
WHERE array_contains(items.product, "fan")

| count(1) |
|----------|
| 32778    |


Let us try to unnest the data. We can do so by using the ```explode``` function.

Explode will generate as many rows as there are elements in the array and match them to other attributes.

In [29]:
%%sparksql
SELECT i.product, order_id
FROM (
    SELECT explode(items) AS i, order_id
    FROM orders
) exploded_items
ORDER BY order_id
LIMIT 5

| product         | order_id |
|-----------------|----------|
| fan             | 0        |
| computer screen | 0        |
| kettle          | 0        |
| stuffed animal  | 0        |
| the book        | 0        |


Now we can use this table to filter. For example, we can find out how many times "fan" appears. Note that this counts item rows rather than orders, so an order that lists a fan twice is counted twice.

In [30]:
%%sparksql
SELECT count(*)
FROM (
    SELECT explode(items) AS i
    FROM orders
)
WHERE i.product = "fan"

| count(1) |
|----------|
| 39922    |


You might have tried to filter on the `i.product` column directly in the same ```SELECT``` statement:

In [31]:
%%sparksql
SELECT explode(items) AS i, i.product 
FROM orders
WHERE i.product = "fan"


AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `i`.`product` cannot be resolved. Did you mean one of the following? [`orders`.`date`, `orders`.`items`, `orders`.`customer`, `orders`.`order_id`]. SQLSTATE: 42703; line 3 pos 6;
'Project ['explode('items) AS i#2487, 'i.product]
+- 'Filter ('i.product = fan)
   +- SubqueryAlias orders
      +- View (`orders`, [customer#6, date#7, items#8, order_id#9L])
         +- Relation [customer#6,date#7,items#8,order_id#9L] json


That, however, just like before, does not work. The ```WHERE``` clause is evaluated before the ```SELECT``` list, so the alias `i` does not exist yet when the filter is resolved. In order to filter on the generated column directly, we need to unnest the data as part of our ```FROM``` clause. ```LATERAL VIEW``` lets us do just that: it pairs the non-array attributes of each row with every element unnested from the array.

In [32]:
%%sparksql
SELECT *
FROM orders LATERAL VIEW explode(items) AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

| customer | date | items | order_id | flat_items |
|----------|------|-------|----------|------------|
| Row(first_name='Preston', last_name='Landry') | 2018-2-4 | [Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)] | 0 | Row(price=1.53, product='fan', quantity=5) |
| Row(first_name='Jamari', last_name='Dominguez') | 2016-1-8 | [Row(price=1.61, product='fan', quantity=7), Row(price=1.39, product='whiskey bottle', quantity=2)] | 1 | Row(price=1.61, product='fan', quantity=7) |
| Row(first_name='Brendon', last_name='Sicilia') | 2016-6-6 | [Row(price=1.41, product='the book', quantity=6), Row(price=1.3, product='notebook', quantity=5), Row(price=1.1, product='fan', quantity=7), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.39, product='headphones', quantity=8), Row(price=1.78, product='whiskey bottle', quantity=3), Row(price=1.15, product='fan', quantity=8)] | 2 | Row(price=1.1, product='fan', quantity=7) |


With this, we can also project the nested columns:

In [33]:
%%sparksql
SELECT order_id, customer.first_name, customer.last_name, date, flat_items.*
FROM orders LATERAL VIEW explode(items) AS flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

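| order_id | first_name | last_name | date     | price | product | quantity |
|----------|------------|-----------|----------|-------|---------|----------|
| 0        | Preston    | Landry    | 2018-2-4 | 1.53  | fan     | 5        |
| 1        | Jamari     | Dominguez | 2016-1-8 | 1.61  | fan     | 7        |
| 2        | Brendon    | Sicilia   | 2016-6-6 | 1.1   | fan     | 7        |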


Having built an unnested table, we can now easily aggregate over the previously nested columns.

### 2.2 Exercises

#### 1. Find the number of distinct products

In [34]:
%%sparksql
-- Solution
SELECT COUNT(DISTINCT flat_items.product)
FROM orders LATERAL VIEW explode(items) AS flat_items

| count(DISTINCT flat_items.product) |
|------------------------------------|
| 10                                 |


#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by quantity. 

In [35]:
%%sparksql
-- Solution
-- check the explanation under Exercise 1.3.2
SELECT product, AVG(unique_quantity) AS av_quantity
FROM 
(
    SELECT order_id, flat_items.product, SUM(flat_items.quantity) AS unique_quantity
    FROM orders LATERAL VIEW explode(items) flat_table AS flat_items
    GROUP BY order_id, flat_items.product
)
GROUP BY product
ORDER BY av_quantity DESC
LIMIT 10

| product         | av_quantity        |
|-----------------|--------------------|
| toaster         | 6.7136569225632154 |
| the book        | 6.699219338698496  |
| fan             | 6.694276648971871  |
| notebook        | 6.674699162223915  |
| kettle          | 6.672450403445975  |
| mouse trap      | 6.670729499788813  |
| computer screen | 6.661826612165212  |
| headphones      | 6.659314546839299  |
| stuffed animal  | 6.656550590527213  |


#### 3. Find the most expensive order

In [36]:
%%sparksql
SELECT order_id, SUM(flat_items.quantity * flat_items.price) AS total
FROM orders LATERAL VIEW explode(items) AS flat_items
GROUP BY order_id
ORDER BY total DESC
LIMIT 3

| order_id | total              |
|----------|--------------------|
| 99636    | 104.95999999999998 |
| 43932    | 103.30000000000001 |
| 90800    | 103.0              |


## <center>3. More queries</center>

We will now explore the dataset that is going to be used in the graded exercise of this week. It is the same language game dataset as in exercise08. If you already have it from last week, you just have to copy the `confusion-2014-03-02` folder into the `notebooks` folder of your exam magic box and run the commands from step 4 onwards to create a reduced version of it.

If you don't have the dataset at all, perform the first 3 steps as well.

1. Move to the `notebooks` folder of the magic box in the terminal
2. Download the data <br>
   ```bash
   wget https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2
   # or
   curl -O https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2
   ```
3. Extract the data <br>
   ```bash
   tar -jxvf confusion-2014-03-02.tbz2
   ```
4. Change directory to extracted folder <br> 
   ```bash
   cd confusion-2014-03-02/
   ```
5. Extract the part of the dataset that we will work with in this exercise <br>
   ```bash
   head -n 3000000 confusion-2014-03-02.json > confusion-part.json
   ```
## More Info about the data
You can find more information about the dataset (as well as the schema and examples) in the `README.md` file inside the data bundle.

### 3.1 Data processing

In [37]:
path = "confusion-2014-03-02/confusion-part.json"
dataset = spark.read.json(path).cache()

                                                                                

Have a look at the data

In [38]:
dataset.limit(3).show()

+--------------------+-------+----------+---------+--------------------+---------+
|             choices|country|      date|    guess|              sample|   target|
+--------------------+-------+----------+---------+--------------------+---------+
|[Maori, Mandarin,...|     AU|2013-08-19|Norwegian|48f9c924e0d98c959...|Norwegian|
|[Danish, Dinka, K...|     AU|2013-08-19|    Dinka|af5e8f27cef9e689a...|    Dinka|
|[German, Hungaria...|     AU|2013-08-19|  Turkish|509c36eb58dbce009...|   Samoan|
+--------------------+-------+----------+---------+--------------------+---------+



                                                                                

Print the schema

In [39]:
dataset.printSchema()

root
 |-- choices: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- country: string (nullable = true)
 |-- date: string (nullable = true)
 |-- guess: string (nullable = true)
 |-- sample: string (nullable = true)
 |-- target: string (nullable = true)



### 3.2 Spark Dataframe queries

#### 1. Find the number of games where the guessed language and the target language are both Maltese.

In [40]:
# Solution
dataset.filter(dataset["target"] == "Maltese").filter(dataset["guess"] == "Maltese").count()

                                                                                

11258

#### 2. Return the number of distinct "target" languages.

In [41]:
# Solution
dataset.select("target").distinct().count()

78

#### 3. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending). 
(Hint: passing `truncate=False` to `show()` allows you to see the full output, otherwise you can simply use `collect()` instead) 

In [42]:
# Solution
(dataset
    .select("sample")
    .filter(dataset["target"] == dataset["guess"])
    .orderBy(
        dataset["date"].desc(),
        dataset["target"].asc(),
        dataset["country"].desc())
    .limit(3).show(truncate=False)
)

+--------------------------------+
|sample                          |
+--------------------------------+
|fdf23d0a7063ba2fcef4b18eb7d57ad8|
|00b85faa8b878a14f8781be334deb137|
|1dd8e1883037c6305b87afe382c4feba|
+--------------------------------+



#### 4. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

In [43]:
# Solution
(dataset
    .groupBy("country", "guess").count()
    .orderBy(desc("count"))
    .limit(2).show())

+-------+------+-----+
|country| guess|count|
+-------+------+-----+
|     US|German|20932|
|     US|French|20780|
+-------+------+-----+



                                                                                

#### 5. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages.
(Hint: `withColumnRenamed()` allows you to set the names of the generated columns and remember it is possible to `join()` Dataframes)

In [44]:
# Solution
from pyspark.sql.functions import col, avg, count
(dataset
    .groupBy("target")
    .agg(
        avg((col('target') == col('guess')).cast('int')).alias('correct_guesses'),
    )
    .orderBy(desc("correct_guesses"))
    .limit(4).show()
)

# Alternative solution with joins
correct_df = dataset.filter(dataset["target"] == dataset["guess"]).groupBy("target").agg(count('*').alias("correct_guesses"))
total_df = dataset.groupBy("target").agg(count('*').alias("total_guesses"))
joined_df = correct_df.join(total_df, "target")
(joined_df
    .select("target", (joined_df["correct_guesses"] / joined_df["total_guesses"]).alias("percentage"))
    .orderBy(desc("percentage"))
    .limit(4).show()
)

+-------+------------------+
| target|   correct_guesses|
+-------+------------------+
| French|0.9617235377572363|
| German|0.9482122107988057|
|Italian|0.9191241444382873|
|Russian|0.9079549864183158|
+-------+------------------+



+-------+------------------+
| target|        percentage|
+-------+------------------+
| French|0.9617235377572363|
| German|0.9482122107988057|
|Italian|0.9191241444382873|
|Russian|0.9079549864183158|
+-------+------------------+
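
Both solutions above compute the same quantity: the mean of a 0/1 "guess was correct" indicator equals the number of correct guesses divided by the total number of guesses. The plain-Python sketch below checks this on a few made-up (target, guess) pairs; the data and variable names are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical (target, guess) pairs, NOT taken from the real dataset
games = [
    ("French", "French"), ("French", "French"), ("French", "German"),
    ("German", "German"), ("German", "Dutch"),
]

# Variant 1: average a 0/1 indicator per target language
indicators = defaultdict(list)
for target, guess in games:
    indicators[target].append(int(target == guess))
avg_based = {t: sum(v) / len(v) for t, v in indicators.items()}

# Variant 2: count correct and total guesses separately, then divide
correct = Counter(t for t, g in games if t == g)
total = Counter(t for t, _ in games)
ratio_based = {t: correct[t] / total[t] for t in total}

# Languages ordered by decreasing fraction of correct guesses
ranking = sorted(avg_based, key=avg_based.get, reverse=True)
print(avg_based, ratio_based, ranking)
```

One difference worth keeping in mind: the join-based DataFrame variant uses an inner join, so a language without a single correct guess would be missing from its result, whereas the averaging variant would report it with a fraction of 0.0. For the four best languages this should not matter.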



### 3.2 Spark SQL queries
We will now repeat the same queries using Spark SQL.

In [45]:
dataset.createOrReplaceTempView("dataset")

In [46]:
%%sparksql
SELECT *
FROM dataset
LIMIT 3

0,1,2,3,4,5
choices,country,date,guess,sample,target
"['Maori', 'Mandarin', 'Norwegian', 'Tongan']",AU,2013-08-19,Norwegian,48f9c924e0d98c959d8a6f1862b3ce9a,Norwegian
"['Danish', 'Dinka', 'Khmer', 'Lao']",AU,2013-08-19,Dinka,af5e8f27cef9e689a070b8814dcc02c3,Dinka
"['German', 'Hungarian', 'Samoan', 'Turkish']",AU,2013-08-19,Turkish,509c36eb58dbce009ccf93f375358d53,Samoan


#### 1. Find the number of games where both the guessed language and the target language are Maltese.

In [47]:
%%sparksql
-- Solution
SELECT count(*) FROM dataset
WHERE target = 'Maltese' AND guess = 'Maltese'

0
count(1)
11258


#### 2. Return the number of distinct "target" languages.

In [48]:
%%sparksql
-- Solution
SELECT COUNT(DISTINCT target)
FROM dataset

0
count(DISTINCT target)
78


#### 3. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending).

In [49]:
%%sparksql
-- Solution
SELECT sample
FROM dataset
WHERE target = guess
ORDER BY date DESC, target ASC, country DESC
LIMIT 3

0
sample
fdf23d0a7063ba2fcef4b18eb7d57ad8
00b85faa8b878a14f8781be334deb137
1dd8e1883037c6305b87afe382c4feba


#### 4. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

In [50]:
%%sparksql
-- Solution
SELECT country, guess, count(guess) as num_guesses
FROM dataset 
GROUP BY country, guess
ORDER BY num_guesses desc
LIMIT 2

0,1,2
country,guess,num_guesses
US,German,20932
US,French,20780


#### 5. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages.

In [51]:
%%sparksql
-- Solution
SELECT target, AVG(CAST(target = guess AS INT)) AS fract
FROM dataset
GROUP BY target
ORDER BY fract DESC 
LIMIT 4

0,1
target,fract
French,0.9617235377572363
German,0.9482122107988057
Italian,0.9191241444382873
Russian,0.9079549864183158


In [52]:
%%sparksql
-- Alternative Solution
WITH correct AS (
    SELECT target, count(*) AS correct_guesses
    FROM dataset
    WHERE target = guess
    GROUP BY target
    ),
total AS (
    SELECT target, count(*) as total_guesses
    FROM dataset
    GROUP BY target
    )
SELECT target, correct_guesses/total_guesses AS fract
FROM correct JOIN total USING(target)
ORDER BY fract DESC 
LIMIT 4

0,1
target,fract
French,0.9617235377572363
German,0.9482122107988057
Italian,0.9191241444382873
Russian,0.9079549864183158


## <center>4. Optional Exercise: PageRank</center>

The PageRank algorithm, named after Google's Larry Page, assigns a measure of importance to each node (page) in a graph based on the importance of its incoming edges (links). The importance of each edge is, in turn, derived from the importance of its source node and that node's out-degree. PageRank was designed to rank web pages based on the hyperlinks between them, but it can also be used to rank scientific articles or influential users in a social network.

The algorithm maintains two datasets: one collection of (*pageID*, *linkList*) elements containing the list of neighbors of each page, and one collection of (*pageID*, *rank*) elements containing the current rank for each page. The algorithm proceeds as follows:
1. Initialize each page's rank to $1.0$.
2. On each iteration, have page $x$ send a contribution of $\frac{rank(x)}{numNeighbors(x)}$ to its neighbors (the pages it has links to).
3. Set each page's rank to $0.15 + 0.85 \times contributionsReceived$.

The algorithm repeats steps 2 and 3 until the ranks converge.

Implement the PageRank algorithm in Spark for a simple dataset, running the loop for a fixed number of iterations.

For instance, you can create the dataset with `parallelize` as follows:
```
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
```
where 1, 2, 3, and 4 are the IDs of the pages and each pair (*a*, *b*) is a link from page *a* to page *b*.
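
Before turning to Spark, the three steps of the algorithm can be sketched in plain Python on the same five edges. This is only an illustrative sketch: the function name `pagerank_sketch` and its parameters `max_iterations` and `tolerance` are assumptions introduced here, not part of the exercise. A page without outgoing links (page 4) receives rank but passes none on.

```python
def pagerank_sketch(edges, max_iterations=10, tolerance=None):
    """Return (ranks, iterations_run) for a list of (source, destination) edges.

    Hypothetical helper for illustration; not part of the exercise material.
    """
    # Build the (pageID, linkList) collection
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, []).append(dst)

    # Step 1: every page with outgoing links starts at rank 1.0
    ranks = {page: 1.0 for page in neighbors}

    iterations_run = 0
    for _ in range(max_iterations):
        # Step 2: each page splits its current rank evenly among its neighbors
        received = {}
        for src, dests in neighbors.items():
            share = ranks.get(src, 0.0) / len(dests)
            for dst in dests:
                received[dst] = received.get(dst, 0.0) + share

        # Step 3: damped update of every page that received a contribution
        new_ranks = {page: 0.15 + 0.85 * total for page, total in received.items()}
        iterations_run += 1

        # Optional early stop: no rank moved by more than the tolerance
        converged = tolerance is not None and all(
            abs(new_ranks[p] - ranks.get(p, 0.0)) <= tolerance for p in new_ranks
        )
        ranks = new_ranks
        if converged:
            break
    return ranks, iterations_run


edges = [(1, 2), (1, 4), (2, 1), (2, 3), (3, 2)]
fixed_ranks, _ = pagerank_sketch(edges, max_iterations=10)
converged_ranks, iterations_needed = pagerank_sketch(edges, max_iterations=100, tolerance=1e-3)
print(fixed_ranks)
print(iterations_needed)
```

On this graph, a tolerance of `1e-3` is only reached after more than ten iterations in the sketch, so a loop capped at ten iterations is likely to stop at the cap rather than at convergence.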

### 4.1 Use Spark RDDs

In [53]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)]).groupByKey().cache()

# Step 1. Pages without outgoing links (here: page 4) get their first rank in iteration 1.
ranks = links.mapValues(lambda x: 1.0)

tolerance = 1e-3

def contrib_calc(xs):
    # Split the rank of a source page evenly among its neighbors
    src_id, (neighbors, src_rank) = xs
    rank_per_neighbor = src_rank / len(neighbors)
    return [(dest, rank_per_neighbor) for dest in neighbors]

for i in range(10):
    # Step 2.
    contributions = links.join(ranks).flatMap(contrib_calc)

    # Step 3.
    new_ranks = contributions.reduceByKey(lambda x, y: x + y).mapValues(lambda v: 0.15 + 0.85*v)
    
    # Convergence check: does any page's rank still change by more than the tolerance?
    has_non_converged = new_ranks.join(ranks).mapValues(lambda x: abs(x[0]-x[1]) > tolerance).values().reduce(lambda x,y: x or y)
    # Keep the latest ranks even when stopping early
    ranks = new_ranks.cache()
    if not has_non_converged:
        break

ranks.collect()

[(1, 0.49149688216717613),
 (2, 0.7568028566886496),
 (3, 0.49149688216717613),
 (4, 0.3522676188962165)]

### 4.2 Use Spark DataFrames

In [54]:
from pyspark.sql.functions import collect_list
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id")
links_df = links_df.groupBy("page_id").agg(collect_list("linked_id").alias("neighbors")).cache()
links_df.show()

+-------+---------+
|page_id|neighbors|
+-------+---------+
|      1|   [2, 4]|
|      3|      [2]|
|      2|   [1, 3]|
+-------+---------+



In [55]:
# Solution
# Note: this import shadows Python's built-in sum and abs with Spark column functions
from pyspark.sql.functions import col, sum, abs, array_size, explode

# Step 1.
ranks = links_df.selectExpr("page_id", "1.0 as rank")

tolerance = 1e-3

for i in range(10):
    # Step 2. and 3.
    new_ranks = (ranks
        .join(links_df, "page_id")
        .select(explode("neighbors").alias("linked_id"), (col("rank") / array_size("neighbors")).alias("contrib"))
        .groupBy("linked_id").agg(sum("contrib").alias("sum_contrib"))
        .select(col("linked_id").alias("page_id"), (0.15 + 0.85*col("sum_contrib")).alias("rank"))
    )
    # Convergence check: no page's rank changed by more than the tolerance
    are_approx_equal = (ranks.withColumnRenamed("rank", "old_rank")
        .join(new_ranks, "page_id")
        .filter(abs(col("old_rank") - col("rank")) > tolerance).count()) == 0
    # Keep the latest ranks even when stopping early
    ranks = new_ranks
    if are_approx_equal:
        break

ranks.show()

+-------+-------------------+
|page_id|               rank|
+-------+-------------------+
|      1|0.49149688216717613|
|      3|0.49149688216717613|
|      2| 0.7568028566886496|
|      4| 0.3522676188962165|
+-------+-------------------+



### 4.3 Use Spark SQL
Hint: you can use
```
new_df = spark.sql("... SQL query ...")
new_df.createOrReplaceTempView("new_table")
```
to run a query inside a for loop and make the updated *new_table* available from SQL at every step.

In [56]:
from pyspark.sql.functions import collect_list

links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id")
links_df = links_df.groupBy("page_id").agg(collect_list("linked_id").alias("neighbors")).cache()
links_df.createOrReplaceTempView("links")
links_df.show()

+-------+---------+
|page_id|neighbors|
+-------+---------+
|      1|   [2, 4]|
|      3|      [2]|
|      2|   [1, 3]|
+-------+---------+



In [57]:
# Solution

# Step 1.
ranks = spark.sql("""
    SELECT page_id, 1.0 AS rank
    FROM links
    GROUP BY page_id
""")
ranks.createOrReplaceTempView("ranks")

for i in range(10):
    # Step 2. and 3.
    new_ranks = spark.sql("""
        SELECT linked_id AS page_id, (0.15 + 0.85 * SUM(r.rank / array_size(l.neighbors))) AS rank
        FROM ranks r
        JOIN links l ON r.page_id = l.page_id
        LATERAL VIEW explode(l.neighbors) AS linked_id
        GROUP BY linked_id
    """)
    new_ranks.createOrReplaceTempView("new_ranks")

    # Convergence check (same tolerance as in 4.1 and 4.2)
    exist_different_columns = spark.sql("""
        SELECT ANY(abs(r.rank - n.rank) > 0.001)
        FROM ranks r
        JOIN new_ranks n ON r.page_id = n.page_id
    """)
    exist_different_columns = exist_different_columns.take(1)[0][0]
    # Publish the latest ranks before possibly stopping early
    new_ranks.createOrReplaceTempView("ranks")
    if not exist_different_columns:
        break

spark.catalog.dropTempView("new_ranks")
# Show final ranks
ranks = spark.sql("SELECT * FROM ranks")

ranks.show()

+-------+---------+
|page_id|     rank|
+-------+---------+
|      1|0.4914969|
|      3|0.4914969|
|      2|0.7568029|
|      4|0.3522676|
+-------+---------+

