# Spark Lab 6: Use Spark SQL to Join and Aggregate Data

In this lab, we will learn:

- how to read JSON data into Spark SQL (specifically, files in which each line is a JSON-encoded record)
- how to use SQL functions for more advanced data manipulation (e.g. `split`)
- how to do inner joins
- how to do some reporting using window functions (more advanced, optional)

## Dataset

We will use a few files distributed with `sparkdata.zip`. If you have not previously downloaded `sparkdata.zip`, you can download it from `http://idsdl.csom.umn.edu/c/share/sparkdata.zip` using `wget`. Alternatively, you can paste the URL into your browser and download it from there.

- `loudacre/device.json`: list of devices
- `loudacre/webpage.json`: inventory of webpages
- `loudacre/websitehit.json`: hits on webpages, each recorded with a `device_id`

## Step 1. Inspect the Data

First, you want to inspect the data so that you understand its format.

Use OS commands to view a sample of each JSON file.
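If shell access is awkward in your environment (for example, inside a notebook), the same inspection can be sketched in plain Python. The helper name `peek` is hypothetical, and the paths assume that `sparkdata.zip` was extracted into the current working directory; adjust them to your setup.

```python
import os
from itertools import islice


def peek(path, n=3):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Assumed location of the lab files; files that are missing are skipped.
for name in ("device", "webpage", "websitehit"):
    path = os.path.join("loudacre", name + ".json")
    if os.path.exists(path):
        print("==", path)
        for line in peek(path):
            print(line)
```

Looking at only the first few lines is usually enough to see whether each line is a complete record and which field names each file uses.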

**Question**:
- Given the format, what is the best way to read these files?
- Do we have consistent field names across tables?

**Answer**:


## Step 2. Read, Inspect, and Analyze `webpage`

First, load `webpage.json` into a DataFrame called `webpage`.

Inspect the schema and the first 20 rows. Fix any issues before you proceed.

**Tip**: use `show(truncate=False)` to show long fields completely.

Notice that the `associated_files` field lists multiple files separated by commas. Next, we want to list these files individually, for example:

```
+------------+-----------------+
|web_page_num|       assoc_file|
+------------+-----------------+
|           1|        theme.css|
|           1|          code.js|
|           1|sorrento_f00l.jpg|
|           2|        theme.css|
|           2|          code.js|
|           2| titanic_2100.jpg|
```

Achieve the above goal (this helps, for example, when you want to run queries on file hits).

- That is, create a DataFrame `page_files` with the columns `web_page_num` and `assoc_file`.

**Hint**: consider using Spark SQL functions to first split the field, then explode it.

Verify what you obtain.

To practice joins in Spark, join the `webpage` and `page_files` DataFrames.


Verify your results.

## Step 3. Find the Most-Used Devices for Each Page (Optional, More Challenging)

When a user visits a page using a device, the visit is saved to `websitehit`. We want to answer: **for each webpage, what are the top 2 devices used to visit it?**

This is most conveniently accomplished using Spark SQL's window functions, in particular `rank()`: if you can rank the records by number of hits per device, you can then filter the dataset by rank to keep just the first two. If you need a refresher on window functions, you can visit [https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html](https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html).


Before you can get the top ranks, you may first need to aggregate the data to get the number of hits.

In the end, your output should look something like this (some table joins are needed; think about when you should do them):

```
+-------------------------------+-------------+----+
|web_page_file_name             |device_name  |hits|
+-------------------------------+-------------+----+
|sorrento_f30l_sales.html       |Sorrento F41L|125 |
|sorrento_f30l_sales.html       |Titanic 1100 |68  |
|sorrento_f41l_sales.html       |Sorrento F41L|116 |
|sorrento_f41l_sales.html       |Titanic 1000 |64  |
|ronin_novelty_note_4_sales.html|Sorrento F41L|123 |
|ronin_novelty_note_4_sales.html|Titanic 1100 |63  |
|sorrento_f24l_sales.html       |Sorrento F41L|122 |
|sorrento_f24l_sales.html       |Titanic 1100 |63  |
```
