# Working with SQL in PySpark

- Spark offers a SQL API that supports a large subset of ANSI SQL for data manipulation.
- The pyspark.sql module borrows much of its vocabulary from SQL.
- Before you can query a PySpark data frame with Spark SQL, it needs to be registered as a view or table.
- PySpark’s own data frame methods and functions borrow heavily from SQL functionality.
- You can run Spark SQL queries in a PySpark program through the spark.sql function.
- Spark SQL table references are kept in a catalog, which contains the metadata for all tables accessible to Spark SQL.
- PySpark accepts SQL-style clauses in where(), expr(), and selectExpr().

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException 
import pyspark.sql.functions as F
import pyspark.sql.types as T
spark = SparkSession.builder.getOrCreate()

## PySpark vs SQL

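PySpark and SQL share many keywords, but the order of operations differs: PySpark chains methods in the order the operations run, while SQL fixes the order of its clauses. The cells below perform the same transformation both ways.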

In [2]:
# Reading the data with PySpark
elements = spark.read.csv(
    "./data/Periodic_Table_Of_Elements.csv",
    header=True,
    inferSchema=True,
)

In [3]:
# Data transformation with PySpark
elements.where(F.col("phase") == "liq").groupby("period").count().show()

+------+-----+
|period|count|
+------+-----+
|     6|    1|
|     4|    1|
+------+-----+



In [4]:
# Data transformation with SQL
"""
SELECT
    period,
    count(*)
FROM elements
WHERE phase = 'liq'
GROUP BY period;
"""

"\nSELECT\n    period,\n    count(*)\nFROM elements\nWHERE phase = 'liq'\nGROUP BY period;\n"

## Creating a Spark data frame using SQL

To create a table/view that Spark SQL can query, use the createOrReplaceTempView method. It takes a single string parameter, the name under which the data frame is registered as a temporary view.

In [5]:
elements.createOrReplaceTempView("elements")
spark.sql(
    "select period, count(*) from elements where phase='liq' group by period"
).show(5)

+------+--------+
|period|count(1)|
+------+--------+
|     6|       1|
|     4|       1|
+------+--------+



### Using the Spark catalog

The Spark catalog is an object for working with Spark SQL tables and views. Many of its methods manage the metadata of those tables, such as their names and their level of caching.

In [6]:
# Reaching the catalog through the catalog property of our SparkSession.
spark.catalog 

<pyspark.sql.catalog.Catalog at 0x1bb68ed50c0>

In [7]:
# Getting a list of Table objects
spark.catalog.listTables() 

[Table(name='elements', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [8]:
# Deleting a view
spark.catalog.dropTempView("elements") 

In [9]:
# Listing the tables again: the view is gone
spark.catalog.listTables() 

[]