In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql").master('local[*]').getOrCreate()

# Spark Catalog

The Spark Catalog is a component within Spark SQL that acts as a central repository for metadata about structured data. Think of it as a system that organizes and manages information about your databases, tables, views, functions, and other data assets within your Spark environment. 

Spark supports different catalog implementations: 

in-memory: A simple, ephemeral catalog whose metadata exists only for the current SparkSession. This is the default for a session built in code, as above, unless enableHiveSupport() is called.

hive: Uses the Hive Metastore to persist metadata, so it can be shared across Spark sessions and with Hive. It requires Spark built with Hive support and is enabled with enableHiveSupport(); the interactive shells (spark-shell, pyspark) select it automatically when the Hive classes are available.

In [3]:
spark.conf.get('spark.sql.catalogImplementation')

'in-memory'
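To see what the catalog currently holds, the session exposes it as `spark.catalog`. The following is a minimal sketch, assuming the `spark` session created at the top of this notebook; the helper name `catalog_summary` is made up for illustration.

```python
def catalog_summary(spark):
    """Collect a few facts about the catalog behind a SparkSession."""
    catalog = spark.catalog
    return {
        # 'in-memory' or 'hive'
        "implementation": spark.conf.get("spark.sql.catalogImplementation"),
        "current_database": catalog.currentDatabase(),
        "databases": [db.name for db in catalog.listDatabases()],
        # Tables and temporary views visible in the current database
        "tables": [(t.name, t.tableType) for t in catalog.listTables()],
    }
```

Calling `catalog_summary(spark)` on a fresh in-memory session should list only the `default` database and no tables.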

# Spark warehouse directory 

The Spark warehouse directory is the default location on the file system where Spark SQL stores the data files of managed tables (also known as internal tables); the table metadata itself lives in the catalog. When you create a managed table in Spark SQL without specifying a LOCATION, Spark stores the table's data files within this directory.
I am currently implementing Delta Lake with Spark, so spark.sql.warehouse.dir is set to an S3 location. Otherwise the default is a spark-warehouse directory under the current working directory of the Spark application.

In [4]:
spark.conf.get('spark.sql.warehouse.dir')

's3a://datalakeshreyas/pyspark_practise/warehouse'
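The warehouse directory is a static setting, so it has to be supplied before the first session starts; if a session already exists, `getOrCreate()` returns it and the new value is ignored. A rough sketch of setting it on the builder follows. The helper name `build_session` and its parameters are hypothetical; `config`, `enableHiveSupport`, and `getOrCreate` are the regular builder methods.

```python
def build_session(builder, warehouse_dir, use_hive=False):
    """Apply catalog-related settings to a SparkSession builder and start the session.

    `builder` is expected to be something like SparkSession.builder.appName(...).
    """
    # Static setting: only takes effect if no session is running yet.
    builder = builder.config("spark.sql.warehouse.dir", warehouse_dir)
    if use_hive:
        # Switches spark.sql.catalogImplementation to 'hive'
        builder = builder.enableHiveSupport()
    return builder.getOrCreate()
```

In this notebook the call could look like `build_session(SparkSession.builder.appName("sparksql").master("local[*]"), "s3a://datalakeshreyas/pyspark_practise/warehouse")`.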

# Tables
There are two types:
1. Managed Tables: Spark manages both the metadata and the data for these tables. The data is typically stored in the Spark warehouse directory. If you drop a managed table, both the metadata and the underlying data are deleted.
2. External Tables: Spark only manages the metadata, and the data resides in a location you specify. If you drop an external table, only the metadata is removed, and the data remains untouched in the specified LOCATION.
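The difference between the two types can be checked from the catalog. Below is a small sketch, assuming a `spark` session and a writable directory; the table names, columns, and the helper `create_demo_tables` are placeholders chosen for illustration.

```python
def create_demo_tables(spark, external_path):
    """Create one managed and one external Parquet table and report their types.

    Table names and columns are placeholders; `external_path` must be a
    directory the Spark application can read and write.
    """
    # No LOCATION: data goes under spark.sql.warehouse.dir, table is managed.
    spark.sql(
        "CREATE TABLE IF NOT EXISTS demo_managed (id INT, name STRING) USING parquet"
    )
    # Explicit LOCATION: Spark records the path and treats the table as external.
    spark.sql(
        "CREATE TABLE IF NOT EXISTS demo_external (id INT, name STRING) "
        f"USING parquet LOCATION '{external_path}'"
    )
    wanted = {"demo_managed", "demo_external"}
    return {t.name: t.tableType for t in spark.catalog.listTables() if t.name in wanted}
```

The returned mapping is expected to show `MANAGED` and `EXTERNAL`. Dropping `demo_managed` removes its files, while dropping `demo_external` leaves the files under `external_path` in place.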

In [None]:
CREATE TABLE [IF NOT EXISTS] table_name
  (column1 data_type [COMMENT column_comment],
   column2 data_type [COMMENT column_comment],
   ...)
[USING format]
[OPTIONS (key=value, ...)]
[PARTITIONED BY (column_name, ...)]
[CLUSTERED BY (column_name, ...) [SORTED BY (column_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[LOCATION 'path']
[COMMENT 'table_comment']
[TBLPROPERTIES (key=value, ...)]
[AS select_statement];

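External tables use the same clauses, with one caveat. In the USING (data source) syntax shown above, a table becomes external simply by specifying LOCATION, and Spark rejects the EXTERNAL keyword. CREATE EXTERNAL TABLE is accepted only in the Hive-format syntax (with STORED AS or ROW FORMAT instead of USING), which requires Hive support.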

[USING format] (Optional for managed tables, often used for external tables): Specifies the data source format for the table. Common formats include:
1. parquet (default in many Spark configurations)
2. csv
3. json
4. orc
5. avro
6. text
7. delta (for Delta Lake tables)
8. jdbc (for connecting to relational databases)
9. hive (if Spark is built with Hive support)

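If USING is omitted, Spark falls back to a default: usually the data source named by spark.sql.sources.default (Parquet unless changed), although some versions and configurations create a Hive-format table instead. Specifying the format explicitly avoids the ambiguity and is good practice for external tables in particular.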

[OPTIONS]: Format-specific settings, such as header = true for the csv format or the connection details (url, dbtable, user, ...) when the jdbc format is selected.

[PARTITIONED BY (column_name, ...)]: Creates a partitioned table. Data is divided into directories based on the values of the specified columns. This can significantly improve query performance for queries that filter on these columns.
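To make the directory layout concrete, here is a plain-Python illustration (not Spark code) of how Hive-style partition directories are named and why a filter on a partition column lets the engine skip directories. The sample rows and both helper functions are made up, and details such as escaping of special characters and null values are left out.

```python
def partition_dir(row, partition_cols):
    """Return the Hive-style relative directory for one row, e.g. 'country=IN/year=2024'."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)


def prune(directories, column, value):
    """Keep only directories whose `column=value` segment matches the filter."""
    segment = f"{column}={value}"
    return [d for d in directories if segment in d.split("/")]


# Made-up sample rows, partitioned by country and year
rows = [
    {"id": 1, "country": "IN", "year": 2024},
    {"id": 2, "country": "US", "year": 2024},
    {"id": 3, "country": "IN", "year": 2023},
]
dirs = sorted({partition_dir(r, ["country", "year"]) for r in rows})
print(dirs)
# ['country=IN/year=2023', 'country=IN/year=2024', 'country=US/year=2024']
print(prune(dirs, "country", "IN"))
# ['country=IN/year=2023', 'country=IN/year=2024']
```

A query with `WHERE country = 'IN'` only needs to read the two matching directories; the `country=US` directory is never opened.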

[LOCATION 'path']:
For managed tables: Omit it, and Spark chooses a default location within the warehouse. If a LOCATION is specified in the USING syntax, the data is stored at that path, which need not be inside the warehouse, and Spark treats the table as external rather than managed.

For external tables: This is mandatory. It specifies the directory where the data files for the table are located. The path must be accessible to the Spark cluster.
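The clauses described above can be combined into a concrete statement. The helper below is a sketch that assembles a `CREATE TABLE ... USING` string in the same clause order as the template; `build_create_table`, the table name, the columns, and the path are all hypothetical, and the inputs are assumed to be trusted since nothing is escaped.

```python
def build_create_table(name, columns, fmt="parquet", options=None,
                       partitioned_by=None, location=None, comment=None,
                       if_not_exists=True):
    """Assemble a CREATE TABLE ... USING statement from its optional clauses.

    `columns` maps column names to Spark SQL types. Inputs are assumed to be
    trusted; no identifier or string escaping is attempted here.
    """
    cols = ", ".join(f"{col} {dtype}" for col, dtype in columns.items())
    head = "CREATE TABLE IF NOT EXISTS" if if_not_exists else "CREATE TABLE"
    parts = [f"{head} {name} ({cols})", f"USING {fmt}"]
    if options:
        opts = ", ".join(f"'{k}' = '{v}'" for k, v in options.items())
        parts.append(f"OPTIONS ({opts})")
    if partitioned_by:
        parts.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")
    if location:
        parts.append(f"LOCATION '{location}'")
    if comment:
        parts.append(f"COMMENT '{comment}'")
    return "\n".join(parts)


ddl = build_create_table(
    "sales_csv",  # placeholder table name
    {"id": "INT", "amount": "DOUBLE", "country": "STRING"},
    fmt="csv",
    options={"header": "true"},
    partitioned_by=["country"],
    location="/tmp/sales_csv",  # placeholder path
)
print(ddl)
```

The resulting string can be passed to `spark.sql(ddl)` to create the table.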

# Views

Besides permanent views created with CREATE VIEW, which are stored in the catalog, Spark SQL supports two types of temporary views:

Temporary Views (Session-Scoped):

1. These views exist only for the duration of the current SparkSession.
2. They are not persisted in the underlying catalog (like the Hive Metastore).
3. Once the SparkSession ends, the temporary view is automatically dropped.
4. They are useful for intermediate results within a specific analysis or application.
5. They are created using the CREATE TEMPORARY VIEW or CREATE OR REPLACE TEMPORARY VIEW statement.


Global Temporary Views (Application-Scoped):

1. These views are also temporary but are visible across all SparkSessions within the same Spark application.
2. They are stored in a system database called global_temp.
3. To access a global temporary view, you need to qualify its name with the global_temp database (e.g., SELECT * FROM global_temp.my_global_view).
4. They are useful when you want to share intermediate results or common query logic across multiple parts of the same Spark application.
5. They are created using the CREATE GLOBAL TEMPORARY VIEW or CREATE OR REPLACE GLOBAL TEMPORARY VIEW statement.

In [None]:
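# Temporary and global temporary views can also be created from a DataFrame.
# Illustrative DataFrame (made-up rows); replace with your own data.
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
df.createOrReplaceTempView("people_temp_view")

# Query the temporary view with spark.sql()
spark.sql("SELECT * FROM people_temp_view").show()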

In [None]:
df.createOrReplaceGlobalTempView("people_global_temp_view")

# You can access this global temporary view using spark.sql() with the global_temp prefix
spark.sql("SELECT * FROM global_temp.people_global_temp_view").show()
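The scoping difference between the two kinds of view can be observed by opening a second session in the same application. This is a sketch under the assumption of a classic (non-Connect) PySpark session, where `session.table(name)` fails immediately for an unknown name; the helper names are hypothetical.

```python
def is_queryable(session, name):
    """Return True if `name` resolves to a table or view in `session`."""
    try:
        session.table(name)
        return True
    except Exception:  # PySpark raises AnalysisException for unknown names
        return False


def compare_view_scopes(spark, temp_view, global_view):
    """Check which of the two views a second session of the same application can see."""
    other = spark.newSession()  # shares the SparkContext, has its own temp views
    return {
        "temp_view_visible": is_queryable(other, temp_view),
        "global_view_visible": is_queryable(other, f"global_temp.{global_view}"),
    }
```

With the views created above, `compare_view_scopes(spark, "people_temp_view", "people_global_temp_view")` is expected to report the temporary view as not visible and the global temporary view as visible.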

# Spark SQL, Hive and Snowflake
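Much of Spark SQL and Hive resembles Snowflake. External tables exist in both, and they can be partitioned. Snowflake external tables are read-only; in Spark, INSERT works on external tables, but UPDATE and DELETE need a format that supports them, such as Delta. When new files arrive in the location from outside Spark, the metadata has to be refreshed: REFRESH TABLE invalidates cached file listings and MSCK REPAIR TABLE (or ALTER TABLE ... RECOVER PARTITIONS) registers new partitions, which plays the role of Snowflake's ALTER EXTERNAL TABLE ... REFRESH.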

Even syntax such as USE DATABASE is the same. Both also offer CREATE ... IF NOT EXISTS and CREATE OR REPLACE, along with largely the same file formats and format options.
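For the refresh step mentioned above, PySpark exposes the operations on `spark.catalog`; the SQL counterparts are `MSCK REPAIR TABLE` and `REFRESH TABLE`. A short sketch, where `pick_up_new_files` and its parameters are hypothetical names:

```python
def pick_up_new_files(spark, table_name, partitioned=False):
    """Make files added outside Spark visible to queries on `table_name`."""
    if partitioned:
        # Scan the table location and register partition directories
        # that the catalog does not know about yet.
        spark.catalog.recoverPartitions(table_name)
    # Drop cached metadata and file listings so the next query re-reads them.
    spark.catalog.refreshTable(table_name)
```

For a partitioned external table this would be called as `pick_up_new_files(spark, "my_table", partitioned=True)` after new partition directories have been written.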