<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 400px">
</div>

# Spark SQL
1. Run a SQL query
1. Create a DataFrame
1. Write the same query using DataFrame transformations
1. Trigger computation with DataFrame actions
1. Convert between DataFrames and SQL

##### Methods
- SparkSession (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html" target="_blank">Scala</a>): `sql`, `table`
- DataFrame (<a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame" target="_blank">Python</a>/<a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Scala</a>):
  - Transformations:  `select`, `where`, `orderBy`
  - Actions: `show`, `count`, `take`
  - Other methods: `printSchema`, `schema`, `createOrReplaceTempView`

In [0]:
%run ./Includes/Classroom-Setup-SQL

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Run a SQL query
Use `SparkSession` to run SQL

In [0]:
budgetDF = spark.sql("""
SELECT name, price
FROM products
WHERE price < 200
ORDER BY price
""")

In [0]:
%sql
SELECT * FROM products ;

item_id,name,price
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
P_FOAM_S,Standard Foam Pillow,59.0


View results in the returned DataFrame

In [0]:
%sql
select "Hi"

Hi
Hi


In [0]:
budgetDF.show()

+--------------------+-----+
|                name|price|
+--------------------+-----+
|Standard Foam Pillow| 59.0|
|    King Foam Pillow| 79.0|
|Standard Down Pillow|119.0|
|    King Down Pillow|159.0|
+--------------------+-----+



In [0]:
display(budgetDF)

name,price
Standard Foam Pillow,59.0
King Foam Pillow,79.0
Standard Down Pillow,119.0
King Down Pillow,159.0


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create a DataFrame
Use `SparkSession` to create a DataFrame from a table

In [0]:
productsDF = spark.table("products")
display(productsDF)

item_id,name,price
M_PREM_Q,Premium Queen Mattress,1795.0
M_STAN_F,Standard Full Mattress,945.0
M_PREM_F,Premium Full Mattress,1695.0
M_PREM_T,Premium Twin Mattress,1095.0
M_PREM_K,Premium King Mattress,1995.0
P_DOWN_S,Standard Down Pillow,119.0
M_STAN_Q,Standard Queen Mattress,1045.0
M_STAN_K,Standard King Mattress,1195.0
M_STAN_T,Standard Twin Mattress,595.0
P_FOAM_S,Standard Foam Pillow,59.0


Access the schema of the DataFrame

In [0]:
productsDF.printSchema()

root
 |-- item_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- price: double (nullable = true)



In [0]:
productsDF.schema

Out[12]: StructType(List(StructField(item_id,StringType,true),StructField(name,StringType,true),StructField(price,DoubleType,true)))

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Write the same query using DataFrame transformations

In [0]:
budgetDF = (productsDF
  .select("name", "price")
  .where("price < 200")
  .orderBy("price")
)

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Trigger computation with DataFrame actions

In [0]:
budgetDF.count()

In [0]:
budgetDF.take(2)
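Transformations such as `select`, `where`, and `orderBy` only describe a computation; Spark does the work when an action such as `count` or `take` is called. The sketch below is a plain-Python analogy, not Spark code: a generator stands in for a transformation and consuming it stands in for an action. The sample rows and the names `scan`, `evaluated`, and `budget` are made up for illustration.

```python
# Plain-Python analogy (not Spark code): generators defer work the way
# DataFrame transformations do, and consuming them plays the role of an action.
products = [
    ("Standard Foam Pillow", 59.0),
    ("Standard Down Pillow", 119.0),
    ("Standard Twin Mattress", 595.0),
]

evaluated = []  # records which rows were actually read


def scan(rows):
    """Yield rows one at a time, logging each row as it is read."""
    for row in rows:
        evaluated.append(row[0])
        yield row


# "Transformation": builds a pipeline, but reads nothing yet.
budget = (row for row in scan(products) if row[1] < 200)
rows_read_before_action = len(evaluated)

# "Action": consuming the pipeline forces the rows to be read.
budget_count = sum(1 for _ in budget)
rows_read_after_action = len(evaluated)
```

No rows are read until the pipeline is consumed, which mirrors why `budgetDF.count()` and `budgetDF.take(2)` are the cells that trigger computation.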

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Convert between DataFrames and SQL

In [0]:
budgetDF.createOrReplaceTempView("budget")

In [0]:
spark.sql("SELECT name, price FROM budget WHERE price < 200 ORDER BY price DESC").show()

+--------------------+-----+
|                name|price|
+--------------------+-----+
|    King Down Pillow|159.0|
|Standard Down Pillow|119.0|
|    King Foam Pillow| 79.0|
|Standard Foam Pillow| 59.0|
+--------------------+-----+



In [0]:
display(spark.sql("SELECT * FROM budget"))

name,price
Standard Foam Pillow,59.0
King Foam Pillow,79.0
Standard Down Pillow,119.0
King Down Pillow,159.0


## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Lab

1. Create a DataFrame from the `events` table
1. Display DataFrame and inspect schema
1. Apply transformations to filter and sort `macOS` events
1. Count results and take first 5 rows
1. Create the same DataFrame using a SQL query

##### Methods
- <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html" target="_blank">SparkSession</a>: `sql`, `table`
- <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">DataFrame</a> transformations: `select`, `where`, `orderBy`
- <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">DataFrame</a> actions: `show`, `count`, `take`
- Other <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">DataFrame</a> methods: `printSchema`, `schema`, `createOrReplaceTempView`

### 1. Create a DataFrame from the `events` table
- Use SparkSession to create a DataFrame from the `events` table

In [0]:
# TODO
eventDF = FILL_IN

### 2. Display DataFrame and inspect schema
- Use methods above to inspect DataFrame contents and schema

In [0]:
# TODO

In [0]:
# TODO

### 3. Apply transformations to filter and sort `macOS` events
- Filter for rows where `device` is `macOS`
- Sort rows by `event_timestamp`

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Use single and double quotes in your filter SQL expression
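
To illustrate the quoting in the hint, here is a minimal sketch that uses the `name` column of the `products` table from earlier rather than the lab's `events` data. The variable names `filter_expr` and `escaped_expr` are hypothetical.

```python
# Outer double quotes delimit the Python string; inner single quotes
# delimit the SQL string literal inside the filter expression.
filter_expr = "name = 'Standard Foam Pillow'"

# The same expression written with single outer quotes, which requires
# escaping the inner quotes and is harder to read.
escaped_expr = 'name = \'Standard Foam Pillow\''
```

A string like `filter_expr` can be passed directly to `where`, for example `productsDF.where(filter_expr)`.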

In [0]:
# TODO
macDF = (eventDF
  .FILL_IN
)

### 4. Count results and take first 5 rows
- Use DataFrame actions to count and take rows

In [0]:
# TODO
numRows = macDF.FILL_IN
rows = macDF.FILL_IN

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work

In [0]:
from pyspark.sql import Row

assert numRows == 1938215
assert len(rows) == 5
assert isinstance(rows[0], Row)

### 5. Create the same DataFrame using a SQL query
- Use SparkSession to run a SQL query on the `events` table
- Use SQL commands above to write the same filter and sort query used earlier

In [0]:
# TODO
macSQLDF = spark.FILL_IN

display(macSQLDF)

##### <img alt="Best Practice" title="Best Practice" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-blue-ribbon.svg"/> Check your work
- You should only see `macOS` values in the `device` column  
- The fifth row should be an event with timestamp `1592539226602157`

### Classroom Cleanup

In [0]:
%run ./Includes/Classroom-Cleanup
