# Introduction to Spark SQL with Python

In [1]:
import findspark
findspark.init()

## PySpark SQL

In this chapter you will learn how to create and query a SQL table in Spark. Spark SQL brings the expressiveness of SQL to Spark. You will also learn how to use SQL window functions in Spark. Window functions perform a calculation across rows that are related to the current row. They greatly simplify achieving results that are difficult to express using only joins and traditional aggregations. We'll use window functions to perform running sums, running differences, and other operations that are challenging to perform in basic SQL.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

### Create a SQL table from a dataframe

A dataframe can be used to create a **temporary table**. A *temporary table* is one that will not exist after the session ends.

In [3]:
# Load trainsched.txt
df = spark.read.csv("../data/trainsched.txt", header = True)

# Create a temporary table called schedule
df.createOrReplaceTempView("schedule")

### Determine the column names of a table

After creating the temporary table, you can query the data using SQL statements:
> spark.sql("SELECT * FROM schedule WHERE station = 'San Jose'").show()

Any of the following statements can be used to inspect the columns of a table:
> result = spark.sql("SHOW COLUMNS FROM tablename")
<br>result = spark.sql("SELECT * FROM tablename LIMIT 0")
<br>result = spark.sql("DESCRIBE tablename")
<br>result.show()
<br>print(result.columns)

In [4]:
spark.sql("DESCRIBE schedule").show()

+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|train_id|   string|   null|
| station|   string|   null|
|    time|   string|   null|
+--------+---------+-------+



### What is window function SQL?
- It expresses some operations more simply than dot notation or queries built only from joins and aggregations
- Each row uses the values of other rows to calculate its value

### Running sums using window function SQL

A window function is like an aggregate function, except that it gives an output for every row in the dataset instead of a single row per group.
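To make the difference concrete, here is a minimal plain-Python sketch of the two behaviours. It does not use Spark. The `rows` list is a small subset of the schedule data shown later in this chapter, and the names `aggregate_result` and `window_result` are made up for illustration.

```python
from collections import Counter

# A few (train_id, station) rows taken from the schedule table
rows = [
    ("217", "Gilroy"),
    ("217", "San Martin"),
    ("324", "San Francisco"),
    ("324", "22nd Street"),
    ("324", "Millbrae"),
]

# Count the stops of each train once, so both results can reuse it
stops_per_train = Counter(train_id for train_id, _ in rows)

# Aggregate behaviour, comparable to:
#   SELECT train_id, COUNT(*) FROM schedule GROUP BY train_id
# One output row per group.
aggregate_result = sorted(stops_per_train.items())

# Window behaviour, comparable to:
#   SELECT train_id, station, COUNT(*) OVER (PARTITION BY train_id) FROM schedule
# One output row per input row, each carrying its group's count.
window_result = [(train_id, station, stops_per_train[train_id])
                 for train_id, station in rows]

print(aggregate_result)
for row in window_result:
    print(row)
```

The aggregate collapses five input rows into two, while the window version keeps all five and attaches the group-level value to each.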

In [6]:
df.printSchema()

root
 |-- train_id: string (nullable = true)
 |-- station: string (nullable = true)
 |-- time: string (nullable = true)



In [19]:
# Add col time_next holding the time of the next stop for each train
query = """
SELECT train_id, station, time,
LEAD(time, 1) OVER (PARTITION BY train_id ORDER BY time) AS time_next
FROM schedule"""

# Run the query and display the result
spark.sql(query).show()

+--------+-------------+-----+---------+
|train_id|      station| time|time_next|
+--------+-------------+-----+---------+
|     217|       Gilroy|6:06a|    6:15a|
|     217|   San Martin|6:15a|    6:21a|
|     217|  Morgan Hill|6:21a|    6:36a|
|     217| Blossom Hill|6:36a|    6:42a|
|     217|      Capitol|6:42a|    6:50a|
|     217|       Tamien|6:50a|    6:59a|
|     217|     San Jose|6:59a|     null|
|     324|San Francisco|7:59a|    8:03a|
|     324|  22nd Street|8:03a|    8:16a|
|     324|     Millbrae|8:16a|    8:24a|
|     324|    Hillsdale|8:24a|    8:31a|
|     324| Redwood City|8:31a|    8:37a|
|     324|    Palo Alto|8:37a|    9:05a|
|     324|     San Jose|9:05a|     null|
+--------+-------------+-----+---------+
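The `time_next` column is the starting point for running differences and running sums. The following plain-Python sketch mimics that logic without Spark, using the rows of train 217 from the output above. The helper names `to_minutes` and `with_diff_and_running_total` are hypothetical, and the sketch assumes the trailing `a` means a.m. and that a trailing `p` would mean p.m. Unlike the query above, which orders by the `time` string, the sketch orders by the parsed minutes.

```python
from collections import defaultdict

# (train_id, station, time) rows of train 217, copied from the output above
rows = [
    ("217", "Gilroy", "6:06a"),
    ("217", "San Martin", "6:15a"),
    ("217", "Morgan Hill", "6:21a"),
    ("217", "Blossom Hill", "6:36a"),
    ("217", "Capitol", "6:42a"),
    ("217", "Tamien", "6:50a"),
    ("217", "San Jose", "6:59a"),
]

def to_minutes(t):
    """Convert a string such as '6:06a' to minutes after midnight.
    Assumes a trailing 'a' is a.m. and a trailing 'p' is p.m."""
    hours, minutes = t[:-1].split(":")
    hours = int(hours) % 12            # 12 o'clock counts as hour 0
    if t[-1] == "p":
        hours += 12
    return hours * 60 + int(minutes)

def with_diff_and_running_total(rows):
    """Return (train_id, station, time, time_next, diff_min, running_total)."""
    partitions = defaultdict(list)     # plays the role of PARTITION BY train_id
    for row in rows:
        partitions[row[0]].append(row)

    result = []
    for train_id in sorted(partitions):
        # Plays the role of ORDER BY time (on parsed minutes, an assumption)
        group = sorted(partitions[train_id], key=lambda r: to_minutes(r[2]))
        running_total = 0
        for i, (_, station, time) in enumerate(group):
            # Plays the role of LEAD(time, 1): the next row's time, or None
            time_next = group[i + 1][2] if i + 1 < len(group) else None
            diff_min = (to_minutes(time_next) - to_minutes(time)
                        if time_next is not None else None)
            running_total += diff_min or 0   # a missing diff adds nothing
            result.append((train_id, station, time, time_next,
                           diff_min, running_total))
    return result

for row in with_diff_and_running_total(rows):
    print(row)
```

The last stop of a train has no next row, so its `time_next` and `diff_min` are `None`, matching the `null` in the Spark output.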



### Dot notation and SQL

There is a dot-notation equivalent for almost every SQL clause, including window functions. For example:
> from pyspark.sql import Window
<br>from pyspark.sql.functions import row_number
<br>
<br>df.withColumn("id", row_number().over(Window.partitionBy('train_id').orderBy('time')))

is the same as
> query = """
<br>SELECT *,
<br>ROW_NUMBER() OVER (PARTITION BY train_id ORDER BY time) AS id
<br>FROM schedule
<br>"""
<br>
<br>spark.sql(query).show(11)

In [20]:
# Stop the Spark session
spark.stop()