# Spark SQL Engine

[ Ref _Learning Spark v2_ book, _Chapter 3_.]

At a programmatic level, Spark SQL allows developers to issue ANSI SQL:2003-compatible queries on structured data with a schema.

The diagram below summarizes the structure of Spark SQL and its usage:

<img src="https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/assets/lesp_0401.png" width="700">

Ref: [O'Reilly](https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/ch04.html)

Spark SQL can read data from structured file formats such as the ones at the bottom of the stack (JSON, CSV, etc.). Once the data is registered as a temporary table, it can be accessed through ODBC connectors or Spark SQL shells.

How can I actually run SQL commands in my notebook? We can define a `DataFrame` and call its `createOrReplaceTempView` method to register it as a temporary view. `spark.sql` then lets us run SQL queries against that view programmatically and returns the result as a `DataFrame`.

In [0]:
sf_fire_file = "/databricks-datasets/learning-spark-v2/sf-fire/sf-fire-calls.csv"
df = spark.read.csv(sf_fire_file, header=True)

df.createOrReplaceTempView("firecalls")

sql_result_df = spark.sql("""
    SELECT CallType, count(*) as count
    FROM firecalls
    GROUP BY CallType
    ORDER BY count(*) desc
""")
display(sql_result_df)

CallType,count
Medical Incident,2843475
Structure Fire,578998
Alarms,483518
Traffic Collision,175507
Citizen Assist / Service Call,65360
Other,56961
Outside Fire,51603
Vehicle Fire,20939
Water Rescue,20037
Gas Leak (Natural and LP Gases),17284


In Databricks, you can turn a cell into a SQL cell by starting it with `%sql`.

In [0]:
%sql
SELECT CallType, count(*) as count
FROM firecalls
GROUP BY CallType
ORDER BY count(*) desc

CallType,count
Medical Incident,2843475
Structure Fire,578998
Alarms,483518
Traffic Collision,175507
Citizen Assist / Service Call,65360
Other,56961
Outside Fire,51603
Vehicle Fire,20939
Water Rescue,20037
Gas Leak (Natural and LP Gases),17284


## Catalyst optimizer

The Catalyst optimizer sits at the core of the Spark SQL engine. It takes a query and converts it to an execution plan. The plan goes through four **transformational phases**:

1. **analysis**. The Spark SQL engine begins by generating an abstract syntax tree (AST) for the SQL or DataFrame query. In this initial phase, any column or table names are resolved by consulting an internal **Catalog**, a programmatic interface to Spark SQL that holds a list of column names, data types, functions, tables, databases, etc.
2. **logical optimization**. Applying a standard rule-based optimization approach, the Catalyst optimizer first constructs a set of multiple plans and then, using its cost-based optimizer (CBO), assigns a cost to each plan.
3. **physical planning**. Spark SQL generates an optimal physical plan for the selected logical plan, using physical operators that match those available in the Spark execution engine.
4. **code generation**. Spark generates efficient Java bytecode to run on each machine.

The image below summarizes these phases.

![Four phases of Spark plan](https://www.databricks.com/wp-content/uploads/2018/05/Catalyst-Optimizer-diagram.png)

Ref: [Databricks](https://www.databricks.com/glossary/catalyst-optimizer)
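
As a toy illustration of the rule-based step in logical optimization, the sketch below prunes columns that no downstream operator needs. The `Node` class and the `prune_columns` function are hypothetical names invented for this example. This is not how Catalyst is implemented, only a minimal model of one kind of rule applied to a plan tree shaped like a group-by-and-count over the fire-calls data.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """One operator in a toy logical plan (hypothetical, not a Spark class)."""
    op: str                          # e.g. "Sort", "Aggregate", "Project", "Relation"
    columns: List[str]               # columns the operator outputs, groups by, or sorts on
    child: Optional["Node"] = None   # single child: this toy model has no joins


def prune_columns(node: Optional[Node], required: Optional[Set[str]] = None) -> Optional[Node]:
    """Return a copy of the plan in which each Project keeps only the columns its parent needs."""
    if node is None:
        return None
    if node.op == "Project" and required is not None:
        kept = [c for c in node.columns if c in required]
        return Node("Project", kept, prune_columns(node.child, set(kept)))
    if node.op == "Aggregate":
        # Assumption: the aggregate is a plain count, so only the grouping columns are needed below.
        return Node(node.op, node.columns, prune_columns(node.child, set(node.columns)))
    # Other operators (Sort, Relation) pass the requirement through unchanged.
    return Node(node.op, node.columns, prune_columns(node.child, required))


plan = Node("Sort", ["count"],
            Node("Aggregate", ["CallType"],
                 Node("Project", ["CallType", "Call Number"],
                      Node("Relation", ["Call Number", "Unit ID", "CallType"]))))

optimized = prune_columns(plan)
print(optimized.child.child.columns)  # the Project no longer carries "Call Number"
```

The rule rewrites the tree without changing the result of the query, which is the general shape of a rule-based optimization: match a pattern in the plan and replace it with a cheaper equivalent.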

Can we see the plan of our transformations? Yes, we can print it from any `DataFrame` via the `explain` method.

In [0]:
count_df = (
    df.select("CallType", "Call Number")
    .groupBy("CallType")
    .count()
    .orderBy("count", ascending=False)
)
display(count_df)

CallType,count
Medical Incident,2843475
Structure Fire,578998
Alarms,483518
Traffic Collision,175507
Citizen Assist / Service Call,65360
Other,56961
Outside Fire,51603
Vehicle Fire,20939
Water Rescue,20037
Gas Leak (Natural and LP Gases),17284


In [0]:
count_df.explain(extended=True)

== Parsed Logical Plan ==
'Sort ['count DESC NULLS LAST], true
+- Aggregate [CallType#623], [CallType#623, count(1) AS count#682L]
   +- Project [CallType#623, Call Number#620]
      +- Relation [Call Number#620,Unit ID#621,Incident Number#622,CallType#623,Call Date#624,Watch Date#625,Call Final Disposition#626,Available DtTm#627,Address#628,City#629,Zipcode of Incident#630,Battalion#631,Station Area#632,Box#633,OrigPriority#634,Priority#635,Final Priority#636,ALS Unit#637,Call Type Group#638,NumAlarms#639,UnitType#640,Unit sequence in call dispatch#641,Fire Prevention District#642,Supervisor District#643,... 4 more fields] csv

== Analyzed Logical Plan ==
CallType: string, count: bigint
Sort [count#682L DESC NULLS LAST], true
+- Aggregate [CallType#623], [CallType#623, count(1) AS count#682L]
   +- Project [CallType#623, Call Number#620]
      +- Relation [Call Number#620,Unit ID#621,Incident Number#622,CallType#623,Call Date#624,Watch Date#625,Call Final Disposition#626,Available DtT

How do we read this? Plans are read bottom-up. Looking at the _Parsed Logical Plan_:

1. `Relation [ ... ] csv`: read the CSV data source.
2. `Project`: select `CallType` and `Call Number` (`Call Number` disappears from the _Optimized Logical Plan_, because `count(1)` does not need it).
3. `Aggregate`: group by `CallType` and count.
4. `Sort`: order by `count`, descending.
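
The bottom-up reading can be sketched with a small text helper. `operators_bottom_up` is a hypothetical function written for this example, not part of PySpark. It assumes a linear plan where every operator has a single child, as in the plan above; for plans with joins, simply reversing the lines would not give a strict leaf-to-root order. The plan text is copied from the `explain` output, with the `Relation` column list shortened.

```python
def operators_bottom_up(plan_text: str) -> list:
    """List operator names from the leaf (data source) up to the root of a linear plan."""
    names = []
    for line in plan_text.strip().splitlines():
        # Drop the tree-drawing prefix ("+- ", spaces) and the leading quote
        # that marks an operator as not yet resolved.
        stripped = line.lstrip(" +-:|'")
        if stripped:
            names.append(stripped.split(" ", 1)[0])
    return names[::-1]


# Parsed Logical Plan from above (Relation column list shortened for readability).
parsed_plan = """
'Sort ['count DESC NULLS LAST], true
+- Aggregate [CallType#623], [CallType#623, count(1) AS count#682L]
   +- Project [CallType#623, Call Number#620]
      +- Relation [Call Number#620,Unit ID#621,Incident Number#622,CallType#623] csv
"""

for step, name in enumerate(operators_bottom_up(parsed_plan), start=1):
    print(step, name)
```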

Is the plan behind the PySpark transformations of `count_df` the same as the plan behind the SQL query of `sql_result_df`?

In [0]:
sql_result_df.explain(extended=True)

== Parsed Logical Plan ==
'Sort ['count(1) DESC NULLS LAST], true
+- 'Aggregate ['CallType], ['CallType, 'count(1) AS count#1225]
   +- 'UnresolvedRelation [firecalls], [], false

== Analyzed Logical Plan ==
CallType: string, count: bigint
Sort [count#1225L DESC NULLS LAST], true
+- Aggregate [CallType#1172], [CallType#1172, count(1) AS count#1225L]
   +- SubqueryAlias firecalls
      +- View (`firecalls`, [Call Number#1169,Unit ID#1170,Incident Number#1171,CallType#1172,Call Date#1173,Watch Date#1174,Call Final Disposition#1175,Available DtTm#1176,Address#1177,City#1178,Zipcode of Incident#1179,Battalion#1180,Station Area#1181,Box#1182,OrigPriority#1183,Priority#1184,Final Priority#1185,ALS Unit#1186,Call Type Group#1187,NumAlarms#1188,UnitType#1189,Unit sequence in call dispatch#1190,Fire Prevention District#1191,Supervisor District#1192,Neighborhood#1193,Location#1194,RowID#1195,Delay#1196])
         +- Relation [Call Number#1169,Unit ID#1170,Incident Number#1171,CallType#1172,Call 

The _Optimized Logical Plan_ actually looks the same for the transformation in SQL and in PySpark! That is, regardless of the language you use, your computation goes through the same phases, and the resulting bytecode is likely the same.
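
One practical wrinkle when comparing plans as text: the expression IDs differ between the two DataFrames (`CallType#623` versus `CallType#1172`), so identical operators do not match character for character. A minimal sketch for normalizing them, using a hypothetical helper `strip_expr_ids` and the two `Aggregate` lines from the _Analyzed Logical Plan_ outputs above:

```python
import re


def strip_expr_ids(plan_text: str) -> str:
    """Remove expression IDs such as #623 or #1225L from plan text."""
    return re.sub(r"#\d+L?", "", plan_text)


# Aggregate lines copied from the two Analyzed Logical Plans above.
pyspark_line = "Aggregate [CallType#623], [CallType#623, count(1) AS count#682L]"
sql_line = "Aggregate [CallType#1172], [CallType#1172, count(1) AS count#1225L]"

print(strip_expr_ids(pyspark_line))
print(strip_expr_ids(pyspark_line) == strip_expr_ids(sql_line))
```

After normalization the two lines are identical, which shows at the level of a single operator that both queries arrive at the same aggregate.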