## Spark SQL and DataFrames

### University of California, Santa Barbara  
### PSTAT 135/235: Big Data Analytics
### Last Updated: January 29, 2019

---  

### Sources 

Learning Spark, Chapter 9: Spark SQL

https://spark.apache.org/docs/latest/sql-programming-guide.html

https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning

Demonstration of several useful DataFrame operations:  
https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html

### OBJECTIVES
- Introduce Spark SQL, the interface for working with structured and semistructured data
- Introduce DataFrames and show basic functionality

### CONCEPTS AND FUNCTIONS
- Schema
- SQL
- Dataset and DataFrame
- Partition
- Parquet files

---  

**NOTE**

These lecture notes give a quick outline of Spark SQL and DataFrames.  
Both provide a lot of functionality, and both are relatively new.  
The DataFrame has replaced the SchemaRDD.

**Spark SQL Basics**

A database *schema* is the structure that represents the logical view of the entire database.   
It defines how the data is organized and how the relations among data elements are associated.  
It also defines tables, views, and integrity constraints.

SQL (Structured Query Language) is the language used to communicate with relational databases.  
Commands include CREATE, SELECT, UPDATE, ALTER, INSERT INTO, DROP, and DELETE.
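
To make the schema and the commands above concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than Spark. The `people` table and its columns are assumptions, chosen to resemble the JSON example used later in these notes.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE defines the schema: table name, column names, types, constraints.
cur.execute("CREATE TABLE people (name TEXT NOT NULL, age INTEGER)")

# INSERT INTO adds rows; age may be NULL because the schema allows it.
cur.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("Michael", None), ("Andy", 30), ("Justin", 19)],
)

# UPDATE modifies existing rows.
cur.execute("UPDATE people SET age = 20 WHERE name = 'Justin'")

# SELECT queries the table.
adults = cur.execute(
    "SELECT name, age FROM people WHERE age > 21 ORDER BY name"
).fetchall()
print(adults)

# DELETE removes rows; DROP removes the table together with its schema.
cur.execute("DELETE FROM people WHERE age IS NULL")
remaining = cur.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(remaining)
cur.execute("DROP TABLE people")
conn.close()
```

Spark SQL accepts similar SELECT statements, but it runs them over distributed data rather than a single local database file.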

***Spark SQL Capabilities:***

1. Can load data from various structured formats, including JSON, Hive, and Parquet  
2. Can query data using SQL inside Spark or from external tools that connect to Spark (e.g., Tableau) 
3. Integrates SQL with Python/Java/Scala/R code, so you can do things like join RDDs and SQL tables.


**Dataset and DataFrame**

- A Dataset is a distributed collection of data.   
- A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).  
- A DataFrame is a Dataset organized into named columns.   
Think of a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.   

- DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.  

- The DataFrame API is available in Scala, Java, Python, and R. 
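
The functional transformations named above (map, flatMap, filter) can be illustrated on an ordinary Python list. This is a local analog only, not Spark code: Spark applies the same operations lazily across the partitions of a distributed collection. The sample `lines` are made up.

```python
from itertools import chain

lines = ["spark sql", "data frames", "rdd"]

# map: exactly one output element per input element.
lengths = list(map(len, lines))

# flatMap: each input yields zero or more outputs, flattened into one sequence.
words = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements that satisfy a predicate.
long_words = [w for w in words if len(w) > 3]

print(lengths)
print(words)
print(long_words)
```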

**DataFrames vs RDDs**  
Use RDDs to perform low-level transformations and actions on unstructured data. 

This means that you don’t care about imposing a schema while processing or accessing the attributes by name or column. 

Use RDDs when you want to manipulate the data with functional programming constructs rather than domain-specific expressions.

Use DataFrames to work with high-level expressions, to perform SQL queries to explore the data, and to gain columnar access.
 
**Creating a DataFrame from an RDD**  
The following example illustrates the conversion from an RDD to a DataFrame, where we impose a schema on the data.


In [None]:
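# Import modules
from pyspark.sql import Row

# Map the RDD to a DataFrame.
# Assumes `rdd` is an existing RDD whose elements are sequences of nine fields,
# and that a SparkSession has already been created (toDF() needs one).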
df = rdd.map(lambda line: Row(longitude=line[0], 
                              latitude=line[1], 
                              housingMedianAge=line[2],
                              totalRooms=line[3],
                              totalBedRooms=line[4],
                              population=line[5], 
                              households=line[6],
                              medianIncome=line[7],
                              medianHouseValue=line[8])).toDF()

Set up SparkSession

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

**Create a DataFrame from some JSON data**  
(For an example of JSON data see: http://json.org/example.html)


In [None]:
# spark is an existing SparkSession
df = spark.read.json("examples/src/main/resources/people.json")
# Displays the content of the DataFrame to stdout
df.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+
df.columns
# ['age', 'name']
df.count()
# 3
# Take first 2 rows
dfh = df.limit(2)
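
The `null` age for Michael in the output above follows from how the JSON source is read. By default `spark.read.json` expects one JSON object per line, and a field that is missing from a record becomes null. The sketch below mimics that behavior with the standard `json` module. The three records mirror the rows shown above, and the parsing logic is an illustration, not Spark's reader.

```python
import json

# One JSON object per line, mirroring the rows shown by df.show() above.
json_lines = """{"name": "Michael"}
{"name": "Andy", "age": 30}
{"name": "Justin", "age": 19}"""

records = [json.loads(line) for line in json_lines.splitlines()]

# Inferred columns: the sorted union of keys across all records.
columns = sorted({key for record in records for key in record})

# Missing fields become None, the Python counterpart of null.
table = [[record.get(col) for col in columns] for record in records]

print(columns)
print(table)
```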


Next, we turn to the documentation to explore more DataFrame functionality, including subsetting, filtering, and aggregation:  
https://spark.apache.org/docs/latest/sql-programming-guide.html


### Some Useful Operations

### Filtering

In [None]:
df.filter(df['age'] > 21).show()

In [None]:
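from pyspark.sql.functions import col, asc

# From here on, df is assumed to be a DataFrame with firstName, lastName, email, and salary columns
filterDF = df.filter((col("firstName") == "xiangrui") | (col("firstName") == "michael")).sort(asc("lastName"))
filterDF.show()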

### Fetch records with a null first name or a null last name

In [None]:
filterNullDF = df.filter(col("firstName").isNull() | col("lastName").isNull()).sort("email")

### where() is equivalent to filter()

In [None]:
whereDF = df.where((col("firstName") == "xiangrui") | (col("firstName") == "michael")) \
            .sort(asc("lastName"))

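### Replace missing values with 0

In [None]:
# fillna(0) replaces nulls in numeric columns only; string columns are left unchanged
nonNullDF = df.fillna(0)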

### Summarize the salary field

In [None]:
df.describe("salary").show()

### Read data from a registered table (e.g., Hive metastore) into DataFrame

In [None]:
df_2 = spark.sql("select * from sample_df")

### Aggregate on columns

In [None]:
import pyspark.sql.functions as F

# For each location, compute the min and count of id and the average of date_diff
agg_df = df.groupBy("location").agg(F.min("id"), F.count("id"), F.avg("date_diff"))

### Write DF to Parquet file, partitioning on a column

In [None]:
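# Derive the partition columns from end_date
df = df.withColumn('end_month', F.month('end_date'))
df = df.withColumn('end_year', F.year('end_date'))
df.write.partitionBy("end_year", "end_month").parquet("/tmp/sample_table")
# display() and dbutils are available only in Databricks notebooks
display(dbutils.fs.ls("/tmp/sample_table"))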

### Infer the schema when reading in file

In [None]:
# "csv" is the name of Spark's built-in CSV data source
adult_df = spark.read.\
    format("csv").\
    option("header", "false").\
    option("inferSchema", "true").load("dbfs:/databricks-datasets/adult/adult.data")

adult_df.printSchema()

### SQL Temporary View
It is possible to register a DataFrame as a SQL temporary view and then query the view with plain SQL.

In [None]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")


# Query the view; the SELECT list (* for all columns) is required
sqlDF = spark.sql("SELECT * FROM people WHERE name = 'Andy'")
sqlDF.show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+


### Saving and Loading Data

#### Save / Load using Generic Functions

In [None]:
df = spark.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

#### Save / Load using Manually Specified Formats

In [None]:
df = spark.read.load("examples/src/main/resources/people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")

### Parquet Files

- Parquet was initially developed at Twitter and Cloudera and is now an Apache Software Foundation project   
- Parquet is a columnar format that is supported by many other data processing systems  

- Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.


Key observation: It can be much more efficient to store data by column than by row.  
The values of a column are stored in contiguous blocks, so a query that needs only a few columns can skip the rest.
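
A small pure-Python sketch of the row-versus-column idea follows. It does not use Parquet; it only contrasts the two layouts on made-up records. In the columnar layout, an aggregate over one field reads a single list and never touches the other fields.

```python
# Row-oriented layout: one record per entry, all fields stored together.
rows = [
    {"name": "Ana", "age": 35, "city": "Goleta"},
    {"name": "Andy", "age": 30, "city": "Ventura"},
    {"name": "Justin", "age": 19, "city": "Lompoc"},
]

# Column-oriented layout: one list per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Row layout: every full record is visited, even though only 'age' is needed.
mean_age_rows = sum(row["age"] for row in rows) / len(rows)

# Column layout: only the 'age' column is read.
ages = columns["age"]
mean_age_cols = sum(ages) / len(ages)

print(mean_age_rows, mean_age_cols)
```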


#### Save / Load Operations using Parquet Files


In [None]:
# read in data in JSON format. This will produce a DataFrame.
peopleDF = spark.read.json("examples/src/main/resources/people.json")

# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.parquet("people.parquet")

# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# Loading parquet files produces a DataFrame.
parquetFile = spark.read.parquet("people.parquet")

### Partition Discovery

Database tables can be partitioned to make querying more efficient.  
For example, the data can be split by gender and country, producing smaller tables.  
If the analyst is only interested in a single country, the query will run faster.


In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory.  

All built-in file sources (including Text/CSV/JSON/ORC/Parquet) are able to discover and infer partitioning information automatically. 


path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...
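
The idea behind partition discovery can be sketched in plain Python: each `key=value` directory name in a file's path becomes a column value for every row in that file. This illustrates the directory convention only and is not Spark's implementation. The helper `partition_values` is hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(file_path, table_root):
    """Return {column: value} parsed from key=value directories under table_root."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    values = {}
    # relative.parent.parts holds the directories between the root and the file.
    for part in relative.parent.parts:
        key, sep, value = part.partition("=")
        if sep:  # skip directories that are not of the form key=value
            values[key] = value
    return values


files = [
    "path/to/table/gender=male/country=US/data.parquet",
    "path/to/table/gender=female/country=CN/data.parquet",
]

discovered = [partition_values(f, "path/to/table") for f in files]
print(discovered)

# A query for a single country only needs the files under matching directories.
us_files = [
    f for f in files
    if partition_values(f, "path/to/table").get("country") == "US"
]
print(us_files)
```

Skipping the non-matching directories is what makes a query on a single country faster on a partitioned table.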
