# Spark SQL

Remember that Spark is a data processing engine, not a database.

See https://spark.apache.org/docs/latest/sql-programming-guide.html 

Most of the text here is taken from [SDG], chapter 10, "Spark SQL".

Spark SQL is a Spark module for structured data processing.

Do not confuse this with reading from or writing to an RDBMS.
You can run a SQL query on a DataFrame that you created from any data source.

In a nutshell, with Spark SQL you can run SQL queries against views or tables organized into
databases. You also can use system functions or define user functions and analyze query plans in
order to optimize their workloads. This integrates directly into the DataFrame and Dataset API,
and as we saw in previous chapters, you can choose to express some of your data manipulations
in SQL and others in DataFrames and they will **compile to the same underlying code**. [SDG]

## What is Apache Hive?
Before Spark’s rise, Hive was the de facto big data SQL access layer. Originally developed at Facebook, Hive became an incredibly popular tool across industry for *performing SQL operations on big data*. In many ways it helped propel Hadoop into different industries because analysts could run SQL queries. [SDG]


## NOTE
Spark SQL is intended to operate as an online **analytic** processing (OLAP) database, not an online transaction processing (OLTP) database. This means that it is not intended to perform extremely low-latency queries. [SDG]


TODO: take from "streaming_book.ipynb"

You can completely interoperate between SQL and DataFrames, as you see
fit. For instance, you can create a DataFrame, manipulate it with SQL, and then manipulate it
again as a DataFrame.

# Views

To an end user, views are displayed as tables, except rather than rewriting all of the data to a new
location, they simply perform a transformation on the source data at query time.

Views are created in the current database, which is `default` unless you change it.

A view is effectively a **transformation**, and Spark performs it only at query time. This means
that the transformation (for example, a filter) is applied only when you actually query the view, not earlier.
Effectively, views are equivalent to creating a new DataFrame from an existing DataFrame.

In [10]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
datapath = "../data/sdg/"




In [37]:
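# datapath already ends with "/", so the relative path must not start with one
df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load(datapath + "retail-data/by-day/2010*.csv")
# Register the DataFrame as a temporary view so that SQL can query it by name
df.createOrReplaceTempView("retail_data")
schema = df.schema
df.limit(2).toPandas()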

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 537226 | 22811 | SET OF 6 T-LIGHTS CACTI | 6 | 2010-12-06 08:34:00 | 2.95 | 15987.0 | United Kingdom |
| 1 | 537226 | 21713 | CITRONELLA CANDLE FLOWERPOT | 8 | 2010-12-06 08:34:00 | 2.1 | 15987.0 | United Kingdom |


`retail_data` is a temporary view. It lives only as long as the current SparkSession. <br>
This view cannot be shared with other Spark applications or databases. There are ways to share data with them, but they are not covered here.

# Let's run some code!

In [38]:
from pyspark.sql.functions import window, column, desc, col
df\
    .selectExpr(
        "CustomerId",
        "(UnitPrice * Quantity) as total_cost",
        "InvoiceDate")\
    .groupBy(
        col("CustomerId"), window(col("InvoiceDate"), "1 day"))\
    .sum("total_cost")\
    .show(5)

+----------+--------------------+------------------+
|CustomerId|              window|   sum(total_cost)|
+----------+--------------------+------------------+
|   13408.0|{2010-12-01 00:00...|1024.6800000000003|
|   17460.0|{2010-12-01 00:00...|              19.9|
|   15235.0|{2010-12-01 00:00...|              79.5|
|   13495.0|{2010-12-06 00:00...|510.94999999999993|
|   14769.0|{2010-12-17 00:00...|            347.01|
+----------+--------------------+------------------+
only showing top 5 rows



In [40]:
# spark.sql("select 1+1").show()

# TODO: fix the SQL syntax to represent the same query as above
spark.sql("""
    SELECT CustomerId, UnitPrice * Quantity AS total_cost, InvoiceDate
    FROM retail_data
""").show(5)

+----------+------------------+-------------------+
|CustomerId|        total_cost|        InvoiceDate|
+----------+------------------+-------------------+
|   15987.0|17.700000000000003|2010-12-06 08:34:00|
|   15987.0|              16.8|2010-12-06 08:34:00|
|   15987.0|              11.9|2010-12-06 08:34:00|
|   15987.0| 9.899999999999999|2010-12-06 08:34:00|
|   15987.0|              10.5|2010-12-06 08:34:00|
+----------+------------------+-------------------+
only showing top 5 rows



# Complex Types
Complex types are a departure from standard SQL and an incredibly powerful feature.
Understanding how to manipulate them appropriately in SQL is
essential. There are three core complex types in Spark SQL: **structs, lists, and maps**.

This is an advanced topic.<br>
For examples, check the book.

# Indexing

When performing queries such as `groupBy("column").sum()`, all the data has to be scanned with a sequential read.

What if we have `SELECT a, b FROM some_table WHERE b = "wine"` and there are few matching rows?

Spark does not support indexing of the data (not to be confused with indexing of the database that we read to create the dataframe!).

Instead, you should rely on *partitioning* the data by the columns you plan to filter on. Spark can then skip the partitions that cannot contain matching rows instead of scanning everything.

* Can a partition be read in parallel by several threads?

See the interesting discussion on Stack Overflow:
https://stackoverflow.com/questions/36938976/why-spark-sql-considers-the-support-of-indexes-unimportant

Microsoft implemented a prototype indexing subsystem (Hyperspace), but I don't know whether it was integrated into Spark: https://www.databricks.com/session_na20/hyperspace-an-indexing-subsystem-for-apache-spark

# Check yourself
* Where are database files stored?
* Can Spark do UPDATE TABLE? Why?
* What is the role of a view?
* Is indexing needed? If not, what should be used instead?