# Spark SQL

(No video yet)

Remember that Spark is a data processing engine, not a database.

See https://spark.apache.org/docs/latest/sql-programming-guide.html 

Most of the text here is taken from [SDG], chapter 10, "Spark SQL".

Spark SQL is a Spark module for structured data processing.

Do not confuse Spark SQL with reading from or writing to an RDBMS. 
You can run SQL queries on a DataFrame created from any data source.

In a nutshell, with Spark SQL you can run SQL queries against views or tables organized into
databases. You also can use system functions or define user functions and analyze query plans in
order to optimize their workloads. This integrates directly into the DataFrame and Dataset API,
and as we saw in previous chapters, you can choose to express some of your data manipulations
in SQL and others in DataFrames and they will **compile to the same underlying code**. [SDG]

## What is Apache Hive?
Before Spark’s rise, Hive was the de facto big data SQL access layer. Originally developed at Facebook, Hive became an incredibly popular tool across industry for *performing SQL operations on big data*. In many ways it helped propel Hadoop into different industries because analysts could run SQL queries. [SDG]


## NOTE
Spark SQL is intended to operate as an online **analytic** processing (OLAP) database, not an online transaction processing (OLTP) database. [SDG]


You can completely interoperate between SQL and DataFrames, as you see
fit. For instance, you can create a DataFrame, manipulate it with SQL, and then manipulate it
again as a DataFrame:

```
spark.read.json("/data/flight-data/json/2015-summary.json")\
.createOrReplaceTempView("some_sql_view") # DF => SQL
spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count)
FROM some_sql_view GROUP BY DEST_COUNTRY_NAME
""")\
.where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10")\
.count() # SQL => DF
```


# Views

To an end user, views are displayed as tables, except rather than rewriting all of the data to a new
location, they simply perform a transformation on the source data at query time.

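Unless you select another database, persistent views are created in the `default` database. Temporary views, like the ones used in this notebook, are not stored in any database; they belong to the current SparkSession.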

A view is effectively a **transformation** and Spark will perform it only at query time. This means
that any filter or other logic in the view's definition is applied only when you actually query the view (and not earlier).
Effectively, views are equivalent to creating a new DataFrame from an existing DataFrame.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
datapath = "../data/sdg/"


In [None]:
df = spark.read.format("csv")\
.option("header", "true")\
.option("inferSchema", "true")\
.load(datapath + "retail-data/by-day/2010*.csv")

df.createOrReplaceTempView("retail_data")
schema = df.schema
df.limit(2).toPandas()

`retail_data` is a temporary view. It lives only as long as the current SparkSession. <br>
This view cannot be shared with other Spark applications or databases. There are ways to do that, but they are not covered here.

# Let's run some code!

DataFrame operation:

In [None]:
from pyspark.sql.functions import window, column, desc, col
df.selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")\
.groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
.sum("total_cost")\
.sort(desc("CustomerId"))\
.show(5)

Spark SQL operation:

In [None]:
spark.sql("""
SELECT CustomerId, date(InvoiceDate) AS Created, SUM(UnitPrice * Quantity) AS total_cost
FROM retail_data
GROUP BY CustomerId, date(InvoiceDate)
""").sort(desc("CustomerId"))\
.show(5)

# Complex Types
Complex types are a departure from standard SQL: they are an incredibly powerful feature that
standard SQL does not have. Understanding how to manipulate them appropriately in SQL is
essential. There are three core complex types in Spark SQL: **structs, lists (arrays), and maps**.

This is an advanced topic.<br>
For examples, check the book.

# Indexing

When performing queries such as `groupby("column").sum()`, all the data has to be scanned, using sequential reads.

What if we have `select a,b where b="wine"` and there are few matching rows? 

Spark does not support indexing of its data (not to be confused with indexes in the source database that we read to create the DataFrame!).

Instead, you should rely on *partitioning* the data by the columns you plan to filter on. Spark can then skip the partitions that cannot contain matching rows (partition pruning) instead of scanning everything.

* Can a partition be read in parallel by several threads?

See an interesting discussion on StackOverflow:
https://stackoverflow.com/questions/36938976/why-spark-sql-considers-the-support-of-indexes-unimportant

Microsoft implemented a prototype indexer, but I don't know if it was integrated into Spark: https://www.databricks.com/session_na20/hyperspace-an-indexing-subsystem-for-apache-spark

# Check yourself
* Where are database files stored?
* Can Spark do UPDATE TABLE? Why?
* What is the role of a view?
* Is indexing used? If not, what should be used instead?

Answer [here](https://forms.gle/KKVZRgboNBDFKE7K6) and see your results

## Additional Exercises 

* Rename the "window" column in the DataFrame operation to 'Created' within the same 'selectExpr' query, aligning it with the SQL operation.

* Create two separate operations, one using DataFrame operations and the other using SQL, to generate a table showing the count of invoices per customer, remove null values, and arrange the results in ascending order.




