# ST446 Distributed Computing for Big Data


## Week 4 class: Spark SQL and DataFrames


### LT 2021

This notebook contains code samples from Chapter 4 of the book Learning PySpark.

Open Jupyter on your Dataproc cluster and create a new PySpark notebook.

## 1. Generate a DataFrame 

### a. Data from file
We again use the `author-large.txt` file from dblp from the previous exercises, and load it into a DataFrame using the function `spark.read.csv()`.

In [None]:
# remember to change the command to use your bucket name

filename = 'gs://jialin-bucket/data/author-large.txt'

author_large_no_schema = spark.read.csv(filename, 
                    header='false', sep='\t')
author_large_no_schema.createOrReplaceTempView("author_large_no_schema")

#### Inferring the Schema Using Reflection
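Spark can infer the schema using _reflection_, i.e. by automatically determining the schema from the data itself. Note that `spark.read.csv()` only scans the data to infer column types when the `inferSchema` option is enabled; without it, as in the call above, every column is read as a string.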

In [None]:
author_large_no_schema.printSchema()

#### Programmatically Specifying the Schema
Spark also lets you specify the schema programmatically.

In [None]:
from pyspark.sql.types import *

schema = StructType([
    StructField("author", StringType(), True),    
    StructField("journal", StringType(), True),
    StructField("title", StringType(), True),
    StructField("year", LongType(), True)
])

author_large = spark.read.csv(filename, 
                    header='false', schema=schema, sep='\t')
author_large.createOrReplaceTempView("author_large")

In [None]:
author_large.printSchema()

As you can see above, we can apply the `schema` programmatically instead of letting Spark infer it via reflection.

Additional resources:
* [PySpark API Reference](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html)
* [Spark SQL, DataFrames, and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema) (for programmatically specifying the schema from a `csv` file).

#### SparkSession

Notice that we're no longer using `sqlContext.read...` but instead `spark.read...`. This is because, as of Spark 2.0, `SQLContext` and `HiveContext` have been unified into the `SparkSession` (available in the notebook as `spark`), which also wraps the underlying `SparkContext`. It provides:
* entry point for reading data,
* working with metadata,
* configuration,
* cluster resource management.

For more information, see [How to use SparkSession in Apache Spark 2.0](http://bit.ly/2br0Fr1).

### b. Generate your own DataFrame
Instead of accessing the file system, let's create a DataFrame by generating the data. In this case, we first create the `stringJSONRDD` RDD and then convert it into a DataFrame by reading it with `spark.read.json`.

In [None]:
# Generate our own JSON data 
#   This way we don't have to access the file system yet.
stringJSONRDD = sc.parallelize((""" 
  { "id": "123",
    "name": "Katie",
    "age": 19,
    "eyeColor": "brown"
  }""",
   """{
    "id": "234",
    "name": "Michael",
    "age": 22,
    "eyeColor": "green"
  }""", 
  """{
    "id": "345",
    "name": "Simone",
    "age": 23,
    "eyeColor": "blue"
  }""")
) 
# Create DataFrame
swimmersJSON = spark.read.json(stringJSONRDD) 
# Create temporary table
swimmersJSON.createOrReplaceTempView("swimmersJSON") 
# DataFrame API
swimmersJSON.show()

## 2. Querying with the DataFrame API

Spark allows you to query a DataFrame using the DataFrame API.

### a. Show first few rows

In [None]:
author_large.show(10)

### b. Count rows

In [None]:
author_large.count()

### c. Filter 

Here we get the titles with year 2000:

In [None]:
# Filter on year first, while the column is still available, then keep distinct titles
author_large.filter(author_large.year == 2000).select(author_large.title).distinct().show(10)

In [None]:
# Same query, using a column name and a SQL expression string
author_large.filter("year = 2000").select("title").distinct().show(10)

## 3. Querying with Spark SQL
You can also write your queries in Spark SQL, a SQL dialect that is compatible with the Hive Query Language (HiveQL). The following code produces the same output as in the section "Querying with the DataFrame API".

### a. Show first few rows

In [None]:
spark.sql("select * from author_large limit 10 ").show()

### b. Count rows

In [None]:
spark.sql("select count(1) from author_large").show()

### c. Filter 

Here we get the titles with year 2000:

In [None]:
spark.sql("select DISTINCT title from author_large where year = 2000 limit 10").show()

Here we query publications with title starting with letter `b`: 

In [None]:
spark.sql("select title from author_large where title like 'b%' limit 10").show()

## 4. Query by joining 2 tables

Let's run a flight performance analysis using DataFrames.

Please follow these steps:

1) create a folder called `flight-data` in your bucket.

2) download the two files below and copy them into that folder in your bucket (remember to use your own bucket name).

```
curl -L -o airport-codes-na.txt 'https://www.dropbox.com/sh/89xbpcjl4oq0j4w/AACFbu4e8rkwGIIasVBKI08va/LearningPySpark/airport-codes-na.txt?dl=0'

curl -L -o departuredelays.csv 'https://www.dropbox.com/sh/89xbpcjl4oq0j4w/AADCRWn4tOKcp21WBPYtcFyna/LearningPySpark/departuredelays.csv?dl=0'
```

3) The result should look like this (remember to change the command to use your bucket name):

```
(base) LSE021353-2:~ st446$ gsutil ls gs://jialin-bucket/flight-data
gs://jialin-bucket/flight-data/airport-codes-na.txt
gs://jialin-bucket/flight-data/departuredelays.csv
```

In [None]:
# Let's first build the DataFrames from the source datasets.
# Remember to change the command to use your bucket name

# Set File Paths
flightPerfFilePath = "gs://jialin-bucket/flight-data/departuredelays.csv"
airportsFilePath = "gs://jialin-bucket/flight-data/airport-codes-na.txt"

# Obtain Airports dataset
airports = spark.read.csv(airportsFilePath, header='true', inferSchema='true', sep='\t')
airports.createOrReplaceTempView("airports")

# Obtain Departure Delays dataset
flightPerf = spark.read.csv(flightPerfFilePath, header='true')
flightPerf.createOrReplaceTempView("FlightPerformance")

# Cache the Departure Delays dataset 
flightPerf.cache()

In [None]:
airports.show(5)
flightPerf.show(5)

Now we query flight departure delays by city in Washington State (WA), joining the performance and airport tables on the airport code (to identify state and city):

In [None]:
# Query Sum of Flight Delays by City and Origin Code (for Washington State)
spark.sql("select a.City, f.origin, sum(f.delay) as Delays from FlightPerformance f join airports a on a.IATA = f.origin where a.State = 'WA' group by a.City, f.origin order by sum(f.delay) desc").show()

Here we query flight departure delays by state in the US:

In [None]:
# Query Sum of Flight Delays by State (for the US)
spark.sql("select a.State, sum(f.delay) as Delays from FlightPerformance f join airports a on a.IATA = f.origin where a.Country = 'USA' group by a.State ").show()

For more information, please refer to:
* [Spark SQL, DataFrames and Datasets Guide](http://spark.apache.org/docs/latest/sql-programming-guide.html#sql)
* [PySpark SQL Module: DataFrame](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame)
* [PySpark SQL Functions Module](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)