# Lab 3: Spark Structured APIs

In this lab, we will learn about the Spark Structured APIs, including the DataFrame API and basic Spark SQL operations.

## 1. About Jupyter Notebook

Open the server IP and your Jupyter port in a web browser, for example `172.18.30.207:11223`, then enter the default Jupyter password.

1. Generate the default Jupyter config file with `jupyter lab --generate-config`.

2. The default jupyter file path is `/data/lab`.

3. Select the right Jupyter kernel for your notebook. The links below give more background:

* Alternative method to install jupyter notebook in your conda env. [[link](https://zhuanlan.zhihu.com/p/107567637)]
* More about different jupyter kernels. [[link](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)]

## 2. About Spark installation

* Spark path `/opt/module/spark-3.5.0-bin-hadoop3/`
* List pyspark arguments with `pyspark --help`
* [Install spark from source](https://spark.apache.org/docs/3.5.1/building-spark.html)

Use pyspark from the command line:
```bash
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="--matplotlib"
/opt/module/spark-3.5.0-bin-hadoop3/bin/pyspark -h

```

To use pyspark from Jupyter Notebook, restore the original settings in your command line:
```bash
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'

```

## ^_^ Here is the [Spark documentation](https://spark.apache.org/docs/latest/)


In [12]:
import pyspark
pyspark.__version__

'3.5.0'

## 3. Install pyspark with different version

* Create a conda env `sp` with `conda create -n sp python=3.11.7`
* Initialize conda for your shell with `conda init`
* Activate the env with `conda activate sp` or `source activate sp`
* Install a specific pyspark version with `pip install pyspark==3.5.0`; add `-i <mirror URL>` to use a different PyPI mirror

---


## 4. Try databricks

* [databricks sign up](https://www.databricks.com/try-databricks#account), select Databricks Community Edition. 
* [databricks resources](https://www.databricks.com/resources)
* [databricks cases](https://github.com/orgs/databricks-industry-solutions/repositories)

---


## 5. Basic DataFrame Example

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StringType, StructType, StructField
import pyspark.sql.functions as F

In [2]:
## Change spark.ui.port to your own assigned port (it replaces Spark's default UI port, 4040).
spark = SparkSession.builder.config('spark.ui.port', 64050).appName("pyspark SQL basic example").getOrCreate()

df = spark.read.json("/shareddata/data/people.json")
# Displays the content of the DataFrame to stdout
df.show()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/06 10:15:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [3]:
df

DataFrame[age: bigint, name: string]

In [4]:
# Print the schema in a tree format
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [5]:
# Select only the "name" column
df.select("name").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [6]:
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()

+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     NULL|
|   Andy|       31|
| Justin|       20|
+-------+---------+



In [7]:
# Select people older than 21
df.filter(df['age'] > 21).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [8]:
# Count people by age
df.groupBy("age").count().show()

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|NULL|    1|
|  30|    1|
+----+-----+



In [9]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people_tmp")

sqlDF = spark.sql("SELECT * FROM people_tmp")
sqlDF.show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [10]:
# Register the DataFrame as a global temporary view
df.createOrReplaceGlobalTempView("people_global")

# Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people_global").show()

+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



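The first line of the cell below adds a new column named "This Long Column-Name" to the DataFrame `df`; its values are copied from the existing `name` column.

The last line selects the column named "This Long Column-Name" and shows its contents. Because the column name contains spaces and a hyphen, it must be wrapped in backticks (the `` ` `` character) when it is used inside `selectExpr`.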

In [12]:
df = df.withColumn("This Long Column-Name", F.col('name'))
df.show()
df.selectExpr("`This Long Column-Name`").show()  ## Note the ` symbol. 

+----+-------+---------------------+
| age|   name|This Long Column-Name|
+----+-------+---------------------+
|NULL|Michael|              Michael|
|  30|   Andy|                 Andy|
|  19| Justin|               Justin|
+----+-------+---------------------+

+---------------------+
|This Long Column-Name|
+---------------------+
|              Michael|
|                 Andy|
|               Justin|
+---------------------+



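`select` and `selectExpr` are both PySpark DataFrame methods for selecting columns, but they differ in usage and capability.

`select` takes column names or `Column` expressions as arguments and returns a new DataFrame containing only those columns. For example:

```python
df.select("name", "age").show()
```

This returns a new DataFrame containing only the "name" and "age" columns.

`selectExpr` instead takes a series of SQL expression strings, which can include SQL-style operations such as arithmetic and aggregate functions. For example:

```python
df.selectExpr("name", "age * 2").show()
```

This returns a new DataFrame containing the "name" column and twice the "age" column.

In the example above, ``selectExpr("`This Long Column-Name`")`` uses backticks to quote a column name that contains spaces, which is part of SQL syntax. `select` treats a plain string as a column name rather than parsing it as SQL, so `df.select("This Long Column-Name")` works without backticks.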

## 6. Case Studies



#### (1). Line count

Count the lines in `data/SPARK_README.md`. 

In [24]:
from pyspark.sql.functions import col, instr

spark = SparkSession.builder.appName("pyspark case study").getOrCreate()

df = spark.read.text("/shareddata/data/SPARK_README.md").toDF("line")
df.show(10, truncate=False)
print(f"the file has {df.count()} lines")

+------------------------------------------------------------------------------+
|line                                                                          |
+------------------------------------------------------------------------------+
|# Apache Spark                                                                |
|                                                                              |
|Spark is a fast and general cluster computing system for Big Data. It provides|
|high-level APIs in Scala, Java, Python, and R, and an optimized engine that   |
|supports general computation graphs for data analysis. It also supports a     |
|rich set of higher-level tools including Spark SQL for SQL and DataFrames,    |
|MLlib for machine learning, GraphX for graph processing,                      |
|and Spark Streaming for stream processing.                                    |
|                                                                              |
|<http://spark.apache.org/> 

In [15]:
filtered = df.withColumn("result", instr(col('line'), 'Spark')>=1).where('result')
print(f"filtered text has {filtered.count()} lines")
filtered.show(20)
df.show(5)

filtered text has 17 lines
+--------------------+------+
|                line|result|
+--------------------+------+
|      # Apache Spark|  true|
|Spark is a fast a...|  true|
|rich set of highe...|  true|
|and Spark Streami...|  true|
|You can find the ...|  true|
|   ## Building Spark|  true|
|Spark is built us...|  true|
|To build Spark an...|  true|
|["Building Spark"...|  true|
|The easiest way t...|  true|
|Spark also comes ...|  true|
|    ./bin/run-exa...|  true|
|    MASTER=spark:...|  true|
|Testing first req...|  true|
|Spark uses the Ha...|  true|
|Hadoop, you must ...|  true|
|in the online doc...|  true|
+--------------------+------+

+--------------------+
|                line|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
+--------------------+
only showing top 5 rows



In [25]:
containsubstr1 = instr(col("line"), "Spark") >= 1
containsubstr2 = instr(col("line"), "talk") >= 1
filtered2 = df.withColumn("results_and", containsubstr1 & containsubstr2).where('results_and')
filtered2.show()
df.show(5)

+--------------------+-----------+
|                line|results_and|
+--------------------+-----------+
|Spark uses the Ha...|       true|
+--------------------+-----------+

+--------------------+
|                line|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
+--------------------+
only showing top 5 rows



Question: if we want to filter lines that contain *any word* from a long list, what should we do?

For example, `candidates = ['mesos', 'guidance', 'particular', 'Hadoop', 'setup', 'project']`


In [None]:
## code here
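
One possible approach, sketched here in plain Python so it runs without a cluster, is to join the candidate words into a single regular-expression alternation. The `sample_lines` list is hypothetical and stands in for the `line` column. In PySpark the same pattern string could be passed to `col("line").rlike(pattern)`; this is one option among several, not the only answer to the question.

```python
import re

candidates = ['mesos', 'guidance', 'particular', 'Hadoop', 'setup', 'project']

# Join the escaped words into one alternation: mesos|guidance|...
pattern = "|".join(re.escape(word) for word in candidates)

# Hypothetical sample lines, standing in for the "line" column.
sample_lines = [
    "# Apache Spark",
    "Spark uses the Hadoop core library to talk to HDFS",
    "Please refer to the build documentation",
]

# Keep the lines that contain at least one candidate word (case-sensitive).
matched = [line for line in sample_lines if re.search(pattern, line)]
print(pattern)
print(matched)
```

An alternative that stays with `instr` is to build one condition per word and combine them with `|`, for example with `functools.reduce`.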

#### (2). mnm count

The data is in `data/mnm_dataset.csv`. 

In [27]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import count

spark = SparkSession.builder.appName("mnm_count").getOrCreate()

mnm_file = "/shareddata/data/mnm_dataset.csv"
mnm_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(mnm_file)

mnm_df.show(5)


24/03/06 12:05:12 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


+-----+------+-----+
|State| Color|Count|
+-----+------+-----+
|   TX|   Red|   20|
|   NV|  Blue|   66|
|   CO|  Blue|   79|
|   OR|  Blue|   71|
|   WA|Yellow|   93|
+-----+------+-----+
only showing top 5 rows



Group by state and color, count the records in each group, and order the result by that count in descending order.

In [28]:
count_mnm_df = mnm_df.select("State", "Color", "Count").groupBy("State", "Color").agg(count("Count").alias("Total")).orderBy("Total", ascending=False)
count_mnm_df.show(n=10, truncate=False)
print("Total Rows = %d" % (count_mnm_df.count()))

+-----+------+-----+
|State|Color |Total|
+-----+------+-----+
|CA   |Yellow|1807 |
|WA   |Green |1779 |
|OR   |Orange|1743 |
|TX   |Green |1737 |
|TX   |Red   |1725 |
|CA   |Green |1723 |
|CO   |Yellow|1721 |
|CA   |Brown |1718 |
|CO   |Green |1713 |
|NV   |Orange|1712 |
+-----+------+-----+
only showing top 10 rows

Total Rows = 60


Find the aggregate count for California by filtering on State.

In [45]:
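## code here
# Compare the State column with the string literal 'CA'.
# Note: in Spark SQL a double-quoted "State" is a string literal, not a column,
# so the condition '"State" == "CA"' compares two different strings and matches no rows.
count_mnm_df.where("State = 'CA'").show()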



#### (3) San Francisco Fire Calls

This case study shows how to use DataFrame and Spark SQL operations for common data analytics patterns on the [San Francisco Fire Department Calls](https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3) dataset.

In [37]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.appName("fire_calls").getOrCreate()
sf_fire_file = "/shareddata/data/sf-fire/sf-fire-calls.csv"

# Define our schema as the file has 4 million records. Inferring the schema is expensive for large files.

fire_schema = StructType([StructField('CallNumber', IntegerType(), True),
                     StructField('UnitID', StringType(), True),
                     StructField('IncidentNumber', IntegerType(), True),
                     StructField('CallType', StringType(), True),                  
                     StructField('CallDate', StringType(), True),      
                     StructField('WatchDate', StringType(), True),
                     StructField('CallFinalDisposition', StringType(), True),
                     StructField('AvailableDtTm', StringType(), True),
                     StructField('Address', StringType(), True),       
                     StructField('City', StringType(), True),       
                     StructField('Zipcode', IntegerType(), True),       
                     StructField('Battalion', StringType(), True),                 
                     StructField('StationArea', StringType(), True),       
                     StructField('Box', StringType(), True),       
                     StructField('OriginalPriority', StringType(), True),       
                     StructField('Priority', StringType(), True),       
                     StructField('FinalPriority', IntegerType(), True),       
                     StructField('ALSUnit', BooleanType(), True),       
                     StructField('CallTypeGroup', StringType(), True),
                     StructField('NumAlarms', IntegerType(), True),
                     StructField('UnitType', StringType(), True),
                     StructField('UnitSequenceInCallDispatch', IntegerType(), True),
                     StructField('FirePreventionDistrict', StringType(), True),
                     StructField('SupervisorDistrict', StringType(), True),
                     StructField('Neighborhood', StringType(), True),
                     StructField('Location', StringType(), True),
                     StructField('RowID', StringType(), True),
                     StructField('Delay', FloatType(), True)])


fire_df = spark.read.csv(sf_fire_file, header=True, schema=fire_schema)


24/03/06 12:40:52 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [39]:

# Cache the DataFrame since we will be performing some operations on it.
fire_df.cache()
# fire_df.count()
# fire_df.printSchema()
# fire_df.show(5)

24/03/06 12:41:11 WARN CacheManager: Asked to cache already cached data.


DataFrame[CallNumber: int, UnitID: string, IncidentNumber: int, CallType: string, CallDate: string, WatchDate: string, CallFinalDisposition: string, AvailableDtTm: string, Address: string, City: string, Zipcode: int, Battalion: string, StationArea: string, Box: string, OriginalPriority: string, Priority: string, FinalPriority: int, ALSUnit: boolean, CallTypeGroup: string, NumAlarms: int, UnitType: string, UnitSequenceInCallDispatch: int, FirePreventionDistrict: string, SupervisorDistrict: string, Neighborhood: string, Location: string, RowID: string, Delay: float]

In [46]:
# Filter out "Medical Incident" call types

few_fire_df = (fire_df.select("IncidentNumber", "AvailableDtTm", "CallType") 
              .where(col("CallType") != "Medical Incident"))

few_fire_df.show(5, truncate=False)

+--------------+----------------------+--------------+
|IncidentNumber|AvailableDtTm         |CallType      |
+--------------+----------------------+--------------+
|2003235       |01/11/2002 01:51:44 AM|Structure Fire|
|2003250       |01/11/2002 04:16:46 AM|Vehicle Fire  |
|2003259       |01/11/2002 06:01:58 AM|Alarms        |
|2003279       |01/11/2002 08:03:26 AM|Structure Fire|
|2003301       |01/11/2002 09:46:44 AM|Alarms        |
+--------------+----------------------+--------------+
only showing top 5 rows



**Q-1) How many distinct types of calls were made to the Fire Department? (exclude the null strings)**

In [None]:
## code here

**Q-2) What distinct types of calls were made to the Fire Department?**

These are all the distinct types of calls made to the SF Fire Department.

In [None]:
## code here

**Q-3) Find all response delays greater than 5 minutes.**

1. Rename the column `Delay` -> `ResponseDelayedinMins`
2. This returns a new DataFrame
3. Find all calls where the response to the fire site was delayed by more than 5 mins

In [None]:
## code here


Transform the string dates to Spark Timestamp data type

In [None]:

## code here

**Q-4) What were the most common call types?**

List them in descending order

In [None]:
## code here

**Q-4a) What zip codes accounted for most common calls?**

Let's investigate which zip codes in San Francisco accounted for the most fire calls and what types of calls they were.

1. Filter out by CallType
2. Group them by CallType and Zip code
3. Count them and display them in descending order

In [None]:
## code here

**Q-4b) What San Francisco neighborhoods are in the zip codes 94102 and 94103**

Let's find out the neighborhoods associated with these two zip codes. In all likelihood, these are some of the contested neighborhoods with high numbers of reported crimes.

In [None]:
## code here

**Q-5) What was the sum of all calls, average, min and max of the response times for calls?**

* Total number of alarms
* The average, min and max delay in response time before the Fire Dept arrived at the scene of the call

In [None]:
## code here

**Q-6a) How many distinct years of data are in the CSV file?**

We can use the Spark SQL `year()` function on the Timestamp column `IncidentDate`.



In [None]:
## code here

**Q-6b) What week of the year in 2018 had the most fire calls?**

In [None]:
## code here

**Q-7) What neighborhoods in San Francisco had the worst response time in 2018?**

In [None]:
## code here

**Q-8a) How can we use Parquet files or a SQL table to store data and read it back?**

In [None]:
## code here

**Q-8c) How can we read data from a Parquet file?**

Note we don't have to specify the schema here since it's stored as part of the Parquet metadata


In [None]:
## code here

#### (4) US Flights Dataset



Define a UDF to convert the date format into a legible format.

*Note*: the date is a string with year missing, so it might be difficult to do any queries using SQL `year()` function

In [47]:
def to_date_format_udf(d_str):
    # Slice the 'MMddHHmm' string into month, day, hour and minute parts
    return f"{d_str[0:2]}/{d_str[2:4]}  {d_str[4:6]}:{d_str[6:]}"
to_date_format_udf("02190925")

'02/19  09:25'
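
Spark passes SQL NULL values to a Python UDF as `None`, and the function above would raise an error on such input. A hedged sketch of a more defensive variant is shown below; the name `to_date_format_safe` and the choice to return `None` for malformed input are assumptions, not part of the lab.

```python
def to_date_format_safe(d_str):
    """Format 'MMddHHmm' as 'MM/dd  HH:mm'; return None for missing or malformed input."""
    # Reject None, wrong lengths, and non-digit strings instead of raising.
    if d_str is None or len(d_str) != 8 or not d_str.isdigit():
        return None
    return f"{d_str[0:2]}/{d_str[2:4]}  {d_str[4:6]}:{d_str[6:8]}"


print(to_date_format_safe("02190925"))
print(to_date_format_safe(None))
```

It could be registered with `spark.udf.register` in the same way as the original function.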

In [None]:
# Register the UDF
spark.udf.register("to_date_format_udf", to_date_format_udf, StringType())

In [None]:
df = (spark.read.format("csv")
      .schema("date STRING, delay INT, distance INT, origin STRING, destination STRING")
      .option("header", "true")
      .option("path", "/shareddata/data/flights/departuredelays.csv")
      .load())

df.show(5)

In [None]:
df.selectExpr("to_date_format_udf(date) as data_format").show(10, truncate=False)


Create a temporary view to which we can issue SQL queries. 

In [None]:
df.createOrReplaceTempView("us_delay_flights_tbl")

Convert every `date` to `date_fm` so it is more legible.
Note: we use the UDF to convert it on the fly.

In [None]:
# `*` already includes the date column, so it is not selected a second time
spark.sql("SELECT *, to_date_format_udf(date) AS date_fm FROM us_delay_flights_tbl").show(10, truncate=False)
spark.sql("SELECT COUNT(*) FROM us_delay_flights_tbl").show() 

Query 1:  Find out all flights whose distance between origin and destination is greater than 1000 

In [None]:
spark.sql("SELECT distance, origin, destination FROM us_delay_flights_tbl WHERE distance > 1000 ORDER BY distance DESC").show(10, truncate=False)

## or 
df.select("distance", "origin", "destination").where(col("distance") > 1000).orderBy(desc("distance")).show(10, truncate=False)


df.select("distance", "origin", "destination").where("distance > 1000").orderBy("distance", ascending=False).show(10)


df.select("distance", "origin", "destination").where("distance > 1000").orderBy(desc("distance")).show(10)

Query 2: Find all flights from San Francisco (SFO) to Chicago (ORD) that were delayed by more than 2 hours

In [None]:

spark.sql("""
    SELECT date, delay, origin, destination 
    FROM us_delay_flights_tbl 
    WHERE delay > 120 AND ORIGIN = 'SFO' AND DESTINATION = 'ORD' 
    ORDER by delay DESC
""").show(10, truncate=False)

In [None]:
df1 =  spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'SFO'")
df1.createOrReplaceGlobalTempView("us_origin_airport_SFO_tmp_view")
spark.catalog.listTables(dbName="global_temp")


#### (5) Max purchase quantity over all time

In [None]:
from pyspark.sql.functions import col, to_date
from pyspark.sql.window import Window
from pyspark.sql.functions import desc

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.read.format("csv") \
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/shareddata/data/retail-data/*.csv")\
  .coalesce(5)
df.cache()
df.createOrReplaceTempView("dfTable")

df.show(5)

In [None]:
dfWithDate = df.withColumn("date", to_date(col("InvoiceDate"), "MM/dd/yyyy HH:mm"))
dfWithDate.createOrReplaceTempView("dfWithDate")

In [None]:
windowSpec = Window\
  .partitionBy("CustomerId", "date")\
  .orderBy(desc("Quantity"))\
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

In [None]:
from pyspark.sql.functions import max
maxPurchaseQuantity = max(col("Quantity")).over(windowSpec)

In [None]:
from pyspark.sql.functions import dense_rank, rank
purchaseDenseRank = dense_rank().over(windowSpec)
purchaseRank = rank().over(windowSpec)

In [None]:
dfWithDate.where("CustomerId IS NOT NULL").orderBy("CustomerId")\
  .select(
    col("CustomerId"),
    col("date"),
    col("Quantity"),
    purchaseRank.alias("quantityRank"),
    purchaseDenseRank.alias("quantityDenseRank"),
    maxPurchaseQuantity.alias("maxPurchaseQuantity")).show()

# END

# Thank you 