# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2021 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and SparkSQL</center>

# Preparation for the exercise in Spark

1. Change to the exercise09 repository

2. Start docker <br>
```docker-compose up -d```

3. Copy the data to the same directory as the `docker_compose.yml` in the exam magic box

## <center>1. Spark Dataframes</center>

Spark Dataframes allow the user to perform simple and efficient operations on data, as long as the data is structured and has a schema. Dataframes are similar to relational tables in relational databases: conceptually a dataframe is a specialization of a Spark RDD with schema information attached. You can find more information in Karau, H. et al. (2015). Learning Spark, Chapter 9 (optional reading).

### 1.1. Data preprocessing

In [2]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

path = "orders.jsonl"
orders_df = spark.read.json(path).cache()

The type of our dataset object is DataFrame

In [2]:
type(orders_df)

pyspark.sql.dataframe.DataFrame

Print the schema

In [3]:
orders_df.printSchema()

root
 |-- customer: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- product: string (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |-- order_id: long (nullable = true)



Print one row

In [4]:
orders_df.limit(1).collect()

                                                                                

[Row(customer=Row(first_name='Preston', last_name='Landry'), date='2018-2-4', items=[Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)], order_id=0)]

You can access the underlying RDD object and use any functions you learned for Spark RDDs.

In [5]:
orders_df.rdd.filter(lambda ordr: ordr.customer.last_name == "Landry").count()

                                                                                

1960

### 1.2. Dataframe Operations
We perform some queries using operations on Dataframes. See the [guide on Dataframe operations](https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations) and the [API documentation](https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html).

We can select columns and show the result

In [6]:
orders_df.select("customer.first_name", "customer.last_name").limit(5).show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|   Preston|   Landry|
|    Jamari|Dominguez|
|   Brendon|  Sicilia|
|    Armani|   Ardeni|
|    Jamari|     Miao|
+----------+---------+



As you can see, we can navigate to nested fields with dot notation. The same works in a filter.

In [7]:
orders_df.filter(orders_df["customer.last_name"] == "Landry").count()

1960

How about nested arrays?

In [8]:
orders_df.select("order_id", "items").orderBy("order_id").limit(5).show()

+--------+--------------------+
|order_id|               items|
+--------+--------------------+
|       0|[{1.53, fan, 5}, ...|
|       1|[{1.61, fan, 7}, ...|
|       2|[{1.41, the book,...|
|       3|[{1.05, computer ...|
|       4|[{1.92, headphone...|
+--------+--------------------+



Let us try to find the orders that contain a fan.

In [9]:
orders_df.filter(orders_df["items.product"] == "fan").count()

AnalysisException: cannot resolve '(items.`product` = 'fan')' due to data type mismatch: differing types in '(items.`product` = 'fan')' (array<string> and string).;
'Filter (items#9.product = fan)
+- Relation [customer#7,date#8,items#9,order_id#10L] json


The above code doesn't work, because ```items.product``` is an array of strings and cannot be compared with a single string. Use ```array_contains``` instead.

In [10]:
from pyspark.sql.functions import array_contains

orders_df.filter(array_contains("items.product", "fan")).count()

32778
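
To see why the equality test fails while `array_contains` works, it can help to mimic the semantics in plain Python, without Spark. The sketch below is only an analogy: the two sample rows are orders 2 and 3 of this dataset, reduced to their product arrays, and the helper name `array_contains_py` is made up for illustration.

```python
# Plain-Python analogy (no Spark needed). "items.product" evaluates to one
# array of product names per order, not to a single string.
orders = [
    {"order_id": 2, "products": ["the book", "notebook", "fan", "stuffed animal",
                                 "headphones", "whiskey bottle", "fan"]},
    {"order_id": 3, "products": ["computer screen", "stuffed animal", "whiskey bottle"]},
]


def array_contains_py(array, value):
    """Hypothetical stand-in for array_contains: a membership test on an array."""
    return value in array


# Comparing a whole array with a string is never what we want. Python quietly
# returns False here, whereas Spark rejects the comparison at analysis time.
equal_matches = [o["order_id"] for o in orders if o["products"] == "fan"]

# Membership is the right question: does the array contain "fan"?
contains_matches = [o["order_id"] for o in orders
                    if array_contains_py(o["products"], "fan")]

print(equal_matches)     # []
print(contains_matches)  # [2]
```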

Let us try to unnest the data.

Unnest the products with explode.

```explode``` generates one row per array element and pairs each element with the other attributes of its source row.

In [11]:
from pyspark.sql.functions import explode

orders_df.select(explode("items").alias("i"), "i.product", "order_id").orderBy("order_id").limit(5).show()

+--------------------+---------------+--------+
|                   i|        product|order_id|
+--------------------+---------------+--------+
|      {1.53, fan, 5}|            fan|       0|
|{1.33, computer s...|computer screen|       0|
|   {1.06, kettle, 6}|         kettle|       0|
|{1.96, stuffed an...| stuffed animal|       0|
| {1.09, the book, 7}|       the book|       0|
+--------------------+---------------+--------+



Now we can use this table to filter.

In [12]:
exploded_df = orders_df.select(explode("items").alias("i"), "i.product", "order_id")
exploded_df.filter(exploded_df["product"] == "fan").count()

39922
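
Note that this count (39922) is larger than the `array_contains` count above (32778). A likely explanation is that the two queries count different things: `array_contains` counts orders with at least one fan, while the exploded table has one row per item, so an order that lists a fan twice contributes two rows. The plain-Python sketch below mimics both counts on orders 1 to 3 of this dataset (product arrays only); `explode_products` is a made-up helper, not a Spark function.

```python
# Plain-Python analogy (no Spark needed) on orders 1-3, product arrays only.
orders = [
    {"order_id": 1, "products": ["fan", "whiskey bottle"]},
    {"order_id": 2, "products": ["the book", "notebook", "fan", "stuffed animal",
                                 "headphones", "whiskey bottle", "fan"]},
    {"order_id": 3, "products": ["computer screen", "stuffed animal", "whiskey bottle"]},
]


def explode_products(rows):
    """Hypothetical explode(): yield one (order_id, product) row per array element."""
    for row in rows:
        for product in row["products"]:
            yield row["order_id"], product


exploded = list(explode_products(orders))

# array_contains-style count: orders with at least one fan.
orders_with_fan = sum(1 for o in orders if "fan" in o["products"])

# explode-then-filter count: item rows whose product is a fan.
fan_rows = sum(1 for _, product in exploded if product == "fan")

print(len(exploded))    # 12 rows: 2 + 7 + 3
print(orders_with_fan)  # 2 (orders 1 and 2)
print(fan_rows)         # 3 (order 2 lists a fan twice)
```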

You might have tried to filter on the ```i.product``` column with a ```.filter``` right after the ```.select```. That does not work, because a clause like ```(orders_df["i.product"] == "fan")``` is resolved against ```orders_df```, which has no such column. One workaround with Dataframe operations is to pass a string clause to ```.filter```, so that the ```product``` column is resolved only after the ```.select``` has added it.

In [13]:
orders_df.select(explode("items").alias("i"), "i.product", "order_id").filter("product = 'fan'").count()

39922

Project the nested columns

In [14]:
orders_df.select(explode("items").alias("i"), "*").select(
    "order_id", "customer.*", "date", "i.*").limit(3).show()

+--------+----------+---------+--------+-----+---------------+--------+
|order_id|first_name|last_name|    date|price|        product|quantity|
+--------+----------+---------+--------+-----+---------------+--------+
|       0|   Preston|   Landry|2018-2-4| 1.53|            fan|       5|
|       0|   Preston|   Landry|2018-2-4| 1.33|computer screen|       6|
|       0|   Preston|   Landry|2018-2-4| 1.06|         kettle|       6|
+--------+----------+---------+--------+-----+---------------+--------+



### 1.3 Exercises

1) Find the average quantity at which each product is purchased. Only show the top 10 products by average quantity. (Hint: you may need to import the function ```desc``` from ```pyspark.sql.functions``` to define descending order.)

In [15]:
from pyspark.sql.functions import desc

orders_df.select(explode("items").alias("i"), "*").select(
    "i.product", "i.quantity"
).groupBy("product").avg("quantity").orderBy(desc("avg(quantity)")).limit(10).show()

+---------------+-----------------+
|        product|    avg(quantity)|
+---------------+-----------------+
|        toaster|5.515549016184942|
|       the book|5.514178678641427|
|         kettle|5.512053325314489|
|computer screen|5.504839685420448|
|     mouse trap|5.503895651308093|
|            fan|5.496342868593758|
|     headphones|5.485920795060985|
|       notebook|5.483182341458532|
| whiskey bottle|5.475555222463714|
| stuffed animal|5.470854598218753|
+---------------+-----------------+
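
The query above is an unnest, group, average, sort pipeline. As a cross-check of that logic, here is a plain-Python sketch on a tiny sample (orders 0 to 2 of this dataset, keeping only product and quantity). It illustrates the steps, not how Spark executes them.

```python
from collections import defaultdict

# (product, quantity) pairs per order, copied from orders 0-2 of the dataset.
orders = [
    [("fan", 5), ("computer screen", 6), ("kettle", 6), ("stuffed animal", 3),
     ("the book", 7), ("headphones", 9), ("whiskey bottle", 3)],
    [("fan", 7), ("whiskey bottle", 2)],
    [("the book", 6), ("notebook", 5), ("fan", 7), ("stuffed animal", 10),
     ("headphones", 8), ("whiskey bottle", 3), ("fan", 8)],
]

# explode + groupBy("product"): collect every quantity under its product.
quantities = defaultdict(list)
for items in orders:
    for product, quantity in items:
        quantities[product].append(quantity)

# avg("quantity"), then sort descending and keep the top 2.
averages = {p: sum(qs) / len(qs) for p, qs in quantities.items()}
top = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:2]

print(top)  # [('headphones', 8.5), ('fan', 6.75)]
```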



2) Find the most expensive order

In [16]:
exploded_df = orders_df.select(explode("items").alias("i"), "*")
exploded_df.select(
    "order_id", (exploded_df["i.quantity"] * exploded_df["i.price"]).alias("total")
).groupBy("order_id").sum("total").orderBy(desc("sum(total)")).limit(1).show()

+--------+------------------+
|order_id|        sum(total)|
+--------+------------------+
|   99636|104.95999999999998|
+--------+------------------+
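
The trailing digits in `104.95999999999998` come from binary floating-point arithmetic on `double` values, not from the data. A plain-Python sketch of the per-order total, using the items of order 0 printed in section 1.1, illustrates the computation. Rounding to two decimals for display is an assumption about presentation, not part of the exercise.

```python
# (price, quantity) pairs of order 0, copied from the row printed in section 1.1.
order_0 = [(1.53, 5), (1.33, 6), (1.06, 6), (1.96, 3), (1.09, 7), (1.42, 9), (1.67, 3)]

# sum(quantity * price) over the exploded items of one order.
total = sum(price * quantity for price, quantity in order_0)

# Doubles cannot represent most decimal fractions exactly, so the raw sum may
# carry a tiny error; round only when presenting the value.
print(round(total, 2))  # 53.29
```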



## <center>2. Spark SQL</center>

Spark SQL lets users formulate their queries in SQL. It operates on Dataframes, which, as said before, are similar to relational tables. Besides offering a familiar interface, SQL queries can perform better than hand-written RDD code: they inherit the efficiency of Dataframe operations and are optimized automatically.

First we need to install the sparksql magic command

In [17]:
!pip install sparksql-magic


In [18]:
%load_ext sparksql_magic

In order to use SQL, we need to register the Dataframe as a temporary view, which can then be queried like a table.

This view only exists for the current session.

In [20]:
orders_df.createOrReplaceTempView("orders")

### 2.1 Queries

Finally, run SQL queries on the registered view. We will run the same queries as in the previous section, but with SQL.

As before, we can navigate to nested fields with dot notation.

In [21]:
%%sparksql
-- Finally, run SQL queries on the registered tables
-- As you can see we can navigate to the nested items with the dot
SELECT count(*)
FROM orders
WHERE orders.customer.last_name == "Landry"

0
count(1)
1960


How about nested arrays?

In [22]:
%%sparksql
-- How about nested arrays?
SELECT order_id, items
FROM orders AS o
ORDER BY order_id
LIMIT 5

0,1
order_id,items
0,"[Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)]"
1,"[Row(price=1.61, product='fan', quantity=7), Row(price=1.39, product='whiskey bottle', quantity=2)]"
2,"[Row(price=1.41, product='the book', quantity=6), Row(price=1.3, product='notebook', quantity=5), Row(price=1.1, product='fan', quantity=7), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.39, product='headphones', quantity=8), Row(price=1.78, product='whiskey bottle', quantity=3), Row(price=1.15, product='fan', quantity=8)]"
3,"[Row(price=1.05, product='computer screen', quantity=10), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.42, product='whiskey bottle', quantity=10)]"
4,"[Row(price=1.92, product='headphones', quantity=2), Row(price=1.44, product='fan', quantity=2), Row(price=1.84, product='kettle', quantity=4), Row(price=1.44, product='stuffed animal', quantity=5)]"


Let us try to find the orders that contain a fan.

In [23]:
%%sparksql 
SELECT count(*)
FROM orders
WHERE items.product = "fan"

AnalysisException: cannot resolve '(orders.items.`product` = 'fan')' due to data type mismatch: differing types in '(orders.items.`product` = 'fan')' (array<string> and string).; line 3 pos 6;
'Aggregate [unresolvedalias(count(1), None)]
+- 'Filter (items#9.product = fan)
   +- SubqueryAlias orders
      +- View (`orders`, [customer#7,date#8,items#9,order_id#10L])
         +- Relation [customer#7,date#8,items#9,order_id#10L] json


The above code doesn't work, because ```items.product``` is an array of strings and cannot be compared with a single string. Use ```array_contains``` instead.

In [24]:
%%sparksql

SELECT count(*)
FROM orders
WHERE array_contains(items.product, "fan")

0
count(1)
32778


Let us try to unnest the data.

Unnest the products with explode.

```explode``` generates one row per array element and pairs each element with the other attributes of its source row.

In [25]:
%%sparksql
SELECT explode(items) as i, i.product, order_id
FROM orders
ORDER BY order_id
limit 5

0,1,2
i,product,order_id
"Row(price=1.53, product='fan', quantity=5)",fan,0
"Row(price=1.33, product='computer screen', quantity=6)",computer screen,0
"Row(price=1.06, product='kettle', quantity=6)",kettle,0
"Row(price=1.96, product='stuffed animal', quantity=3)",stuffed animal,0
"Row(price=1.09, product='the book', quantity=7)",the book,0


Now we can use this table to filter.

In [26]:
%%sparksql
-- Filter on product
SELECT count(*)
    FROM (
    SELECT explode(items) as i, i.product, order_id
    FROM orders
    ORDER BY order_id
    )
WHERE product = "fan"

0
count(1)
39922


You might have tried to filter on ```i.product``` in the same query that defines it. That does not work, because a column built in the ```SELECT``` clause is not available to the ```WHERE``` clause of the same query. To access the unnested columns directly, the unnesting has to become part of the ```FROM``` clause. ```LATERAL VIEW``` lets us do just that, pairing the non-array attributes of each row with every element unnested from its array.

In [27]:
%%sparksql
SELECT *
FROM orders lateral view explode(items) as flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

0,1,2,3,4
customer,date,items,order_id,flat_items
"Row(first_name='Preston', last_name='Landry')",2018-2-4,"[Row(price=1.53, product='fan', quantity=5), Row(price=1.33, product='computer screen', quantity=6), Row(price=1.06, product='kettle', quantity=6), Row(price=1.96, product='stuffed animal', quantity=3), Row(price=1.09, product='the book', quantity=7), Row(price=1.42, product='headphones', quantity=9), Row(price=1.67, product='whiskey bottle', quantity=3)]",0,"Row(price=1.53, product='fan', quantity=5)"
"Row(first_name='Jamari', last_name='Dominguez')",2016-1-8,"[Row(price=1.61, product='fan', quantity=7), Row(price=1.39, product='whiskey bottle', quantity=2)]",1,"Row(price=1.61, product='fan', quantity=7)"
"Row(first_name='Brendon', last_name='Sicilia')",2016-6-6,"[Row(price=1.41, product='the book', quantity=6), Row(price=1.3, product='notebook', quantity=5), Row(price=1.1, product='fan', quantity=7), Row(price=1.5, product='stuffed animal', quantity=10), Row(price=1.39, product='headphones', quantity=8), Row(price=1.78, product='whiskey bottle', quantity=3), Row(price=1.15, product='fan', quantity=8)]",2,"Row(price=1.1, product='fan', quantity=7)"


Project the nested columns

In [28]:
%%sparksql
SELECT order_id, customer.first_name, customer.last_name, date, flat_items.*
FROM orders lateral view explode(items) item_table as flat_items
WHERE flat_items.product = "fan"
ORDER BY order_id
LIMIT 3

0,1,2,3,4,5,6
order_id,first_name,last_name,date,price,product,quantity
0,Preston,Landry,2018-2-4,1.53,fan,5
1,Jamari,Dominguez,2016-1-8,1.61,fan,7
2,Brendon,Sicilia,2016-6-6,1.1,fan,7


Having built an unnested table, we can now easily aggregate over the previously nested columns.

### 2.2 Exercises

1) Find the average quantity at which each product is purchased. Only show the top 10 products by average quantity.

In [29]:
%%sparksql
SELECT flat_items.product, AVG(flat_items.quantity) as av_quantity
FROM orders lateral view explode(items) flat_table flat_items
GROUP BY flat_items.product
ORDER BY av_quantity DESC
LIMIT 10

0,1
product,av_quantity
toaster,5.515549016184942
the book,5.514178678641427
kettle,5.512053325314489
computer screen,5.504839685420448
mouse trap,5.503895651308093
fan,5.496342868593758
headphones,5.485920795060985
notebook,5.483182341458532
whiskey bottle,5.475555222463714


2) Find the most expensive order

In [30]:
%%sparksql
SELECT order_id, SUM(flat_items.quantity * flat_items.price) as total
FROM orders lateral view explode(items) flat_table flat_items
GROUP BY order_id
ORDER BY total desc
LIMIT 1

0,1
order_id,total
99636,104.95999999999998


## <center>3. Exercise: PageRank</center>

The PageRank algorithm, named after Google's Larry Page, assigns a measure of importance to each node (page) in a graph based on the importance of its incoming edges (links). The importance of each edge is, in turn, derived from the importance of the source node and its out-degree. PageRank was designed to rank web pages based on the hyperlinks between them, but it can also be used to rank scientific articles or influential users in a social network.

The algorithm maintains two datasets: one collection of (*pageID*, *linkList*) elements containing the list of neighbors of each page, and one collection of (*pageID*, *rank*) elements containing the current rank for each page. The algorithm proceeds as follows:
1. Initialize each page's rank to $1.0$.
2. On each iteration, have page $x$ send a contribution of $\frac{rank(x)}{numNeighbors(x)}$ to its neighbors (the pages it has links to).
3. Set each page's rank to $0.15 + 0.85 \times contributionsReceived$.

The algorithm repeats steps 2 and 3 until it converges.

Implement the PageRank algorithm in Spark for a simple dataset, running the loop for a fixed number of iterations.

For instance, you can create a small link dataset with ```parallelize``` as follows:
```
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
```
where 1, 2, 3 and 4 are the IDs of the pages, and each pair (*a*, *b*) is a link from page *a* to page *b*.
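
Before turning to Spark, it can help to have a plain-Python reference of the three steps to compare results against. The sketch below is one possible reading of the description, under two assumptions: pages without outgoing links (page 4 here) send no contributions, and only pages that receive a contribution get a rank in the next iteration. The function name `pagerank` and its parameters are illustrative.

```python
from collections import defaultdict


def pagerank(edges, iterations=10, base=0.15, damping=0.85):
    """Reference PageRank on a list of (source, destination) edges."""
    # (pageID, linkList): neighbors of each page that has outgoing links.
    links = defaultdict(list)
    for src, dst in edges:
        links[src].append(dst)

    # Step 1: every page with outgoing links starts at rank 1.0.
    ranks = {page: 1.0 for page in links}

    for _ in range(iterations):
        # Step 2: each ranked page splits its rank evenly among its neighbors.
        contributions = defaultdict(float)
        for page, neighbors in links.items():
            if page in ranks:
                for dst in neighbors:
                    contributions[dst] += ranks[page] / len(neighbors)
        # Step 3: new rank from the contributions received.
        ranks = {page: base + damping * c for page, c in contributions.items()}
    return ranks


ranks = pagerank([(1, 2), (1, 4), (2, 1), (2, 3), (3, 2)])
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

With ten iterations this gives roughly 0.4915 for pages 1 and 3, 0.7568 for page 2 and 0.3523 for page 4, which is a handy target for the Spark implementations.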

### 3.1 Use Spark RDDs

In [31]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)]).groupByKey().cache()

ranks = links.mapValues(lambda x: 1.0)

for i in range(10):
    contributions = links.join(ranks).flatMap(
        lambda xs: [(dest, xs[1][1]/len(list(xs[1][0]))) for dest in list(xs[1][0])]
    )
    ranks = contributions.reduceByKey(lambda x, y: x + y).mapValues(lambda v: 0.15 + 0.85*v)
    
ranks.collect()

[(2, 0.7568028566886496),
 (4, 0.3522676188962165),
 (1, 0.49149688216717613),
 (3, 0.49149688216717613)]

### 3.2 Use Spark DataFrames

In [32]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id").cache()

In [33]:
links_c = links_df.groupBy("page_id").count()
ranks = links_df.selectExpr("page_id", "1.0 as rank")

for i in range(10):
    ranks = ranks.join(links_df, "page_id").join(links_c, "page_id").selectExpr(
        "*", "rank / count as contrib"
    ).groupBy("linked_id").sum("contrib").withColumnRenamed("sum(contrib)", "sum_contrib").selectExpr(
        "linked_id as page_id", "(0.15 + 0.85*sum_contrib) as rank"
    ).distinct()
    
ranks.show()

+-------+---------+
|page_id|     rank|
+-------+---------+
|      1|0.5070700|
|      3|0.5070700|
|      2|0.8035221|
|      4|0.3678407|
+-------+---------+



### 3.3 Use Spark SQL
Hint: you can use
```
new_df = spark.sql("... SQL query ...")
new_df.createOrReplaceTempView("new_table")
```
to run a query inside a for loop and make the updated *new_table* available to SQL at every step.

In [3]:
links = sc.parallelize([(1, 2),(1, 4),(2, 1),(2, 3),(3, 2)])
links_df = spark.createDataFrame(links).toDF("page_id", "linked_id").cache()

links_df.createOrReplaceTempView("links")

In [5]:
# Calculate link counts
links_c = spark.sql("""
    SELECT page_id, COUNT(linked_id) AS count_links
    FROM links
    GROUP BY page_id
""")
links_c.createOrReplaceTempView("links_c")

# Initialize ranks
ranks = spark.sql("""
    SELECT page_id, 1.0 AS rank
    FROM links
    GROUP BY page_id
""")
ranks.createOrReplaceTempView("ranks")

# PageRank algorithm iterations
for i in range(15):
    new_ranks = spark.sql("""
        SELECT l.linked_id AS page_id, (0.15 + 0.85 * SUM(r.rank / c.count_links)) AS rank
        FROM links l
        JOIN ranks r ON l.page_id = r.page_id
        JOIN links_c c ON l.page_id = c.page_id
        GROUP BY l.linked_id
    """)

    # Drop the old 'ranks' view
    spark.catalog.dropTempView("ranks")

    # Create new 'ranks' view with updated data
    new_ranks.createOrReplaceTempView("ranks")

# Show final ranks
ranks = spark.sql("SELECT * FROM ranks")
ranks.show()

+-------+--------+
|page_id|    rank|
+-------+--------+
|      1|0.468064|
|      3|0.468064|
|      2|0.754215|
|      4|0.351405|
+-------+--------+

