# Module 5 - Spark SQL

## Introduction

This module covers Spark SQL: using SQL syntax to work with PySpark DataFrames. DataFrames and Spark SQL tables are interconvertible, so you can work with either the DataFrame API or SQL syntax. This is especially useful if you come from a SQL background.

## What You'll Learn

- DataFrames and Spark SQL tables are interconvertible
- Creating Spark SQL table views from DataFrames
- Converting Spark SQL tables back to DataFrames
- All view creation methods (createTempView, createOrReplaceTempView, createGlobalTempView, createOrReplaceGlobalTempView)
- Using SQL queries with PySpark
- Best practices for interoperability


## DataFrames and Spark SQL Tables - Interconvertible

**Key Concept**: DataFrames and Spark SQL tables are **interconvertible**. This means you can:
- Create a Spark SQL table view from a DataFrame (to use SQL syntax)
- Convert a Spark SQL table back to a DataFrame (to use DataFrame API)

This interoperability lets you switch between the two styles within the same program, which is especially convenient for SQL developers who prefer SQL syntax over the DataFrame API.

### Why Create Spark SQL Table Views from DataFrames?

When you create a Spark SQL table view from a DataFrame, you can run ordinary SQL queries against it with `spark.sql()`. The view is a name that points at the DataFrame, so nothing is copied. It acts as a bridge between the DataFrame and SQL worlds.


In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create SparkSession
spark = SparkSession.builder \
    .appName("Spark SQL") \
    .master("local[*]") \
    .getOrCreate()

# Create sample DataFrame
data = [
    ("Alice", 25, "Sales", 50000, "New York"),
    ("Bob", 30, "IT", 60000, "London"),
    ("Charlie", 35, "Sales", 70000, "Tokyo"),
    ("Diana", 28, "IT", 55000, "Paris"),
    ("Eve", 32, "HR", 65000, "Sydney"),
    ("Frank", 27, "Sales", 52000, "New York"),
    ("Grace", 29, "HR", 58000, "London")
]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("City", StringType(), True)
])

df = spark.createDataFrame(data, schema)
print("Sample DataFrame:")
df.show()

# Create a temporary view from the DataFrame (this is what we'll use in SQL queries)
df.createOrReplaceTempView("employees")
print("\nCreated temporary view 'employees' from DataFrame")


Sample DataFrame:
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Diana| 28|        IT| 55000|   Paris|
|    Eve| 32|        HR| 65000|  Sydney|
|  Frank| 27|     Sales| 52000|New York|
|  Grace| 29|        HR| 58000|  London|
+-------+---+----------+------+--------+


Created temporary view 'employees' from DataFrame


## Converting DataFrame to Spark SQL Table View

### Method 1: createOrReplaceTempView()

Creates a temporary view, or replaces it if a view with that name already exists. The view is session-scoped: it is only visible in the SparkSession that created it and disappears when that session ends.

```python
df.createOrReplaceTempView("view_name")
```


In [3]:
# Example: Creating a temporary view from DataFrame
# Note: We already created the 'employees' view in the setup cell above
# This is just to demonstrate the method

# Create another DataFrame for demonstration
sample_df = spark.createDataFrame(
    [("John", 30, "IT"), ("Jane", 25, "HR")],
    ["Name", "Age", "Department"]
)

# Create a temporary view
sample_df.createOrReplaceTempView("sample_employees")

# Now we can query it using SQL
print("Querying the temporary view:")
spark.sql("SELECT * FROM sample_employees").show()


Querying the temporary view:
+----+---+----------+
|Name|Age|Department|
+----+---+----------+
|John| 30|        IT|
|Jane| 25|        HR|
+----+---+----------+



In [4]:
# Query using SQL
result_sql = spark.sql("""
    SELECT Department, AVG(Salary) as AvgSalary, COUNT(*) as EmployeeCount
    FROM employees
    GROUP BY Department
    ORDER BY AvgSalary DESC
""")

print("SQL Query Result:")
result_sql.show()


SQL Query Result:
+----------+------------------+-------------+
|Department|         AvgSalary|EmployeeCount|
+----------+------------------+-------------+
|        HR|           61500.0|            2|
|        IT|           57500.0|            2|
|     Sales|57333.333333333336|            3|
+----------+------------------+-------------+



In [5]:
# Correlated subquery: each employee's salary minus the average salary of their department
result_complex = spark.sql("""
    SELECT Name, Department, Salary,
           Salary - (SELECT AVG(Salary) FROM employees e2 WHERE e2.Department = e1.Department) as DiffFromAvg
    FROM employees e1
    WHERE Salary > 55000
    ORDER BY Salary DESC
""")

print("Complex SQL Query:")
result_complex.show()


Complex SQL Query:
+-------+----------+------+------------------+
|   Name|Department|Salary|       DiffFromAvg|
+-------+----------+------+------------------+
|Charlie|     Sales| 70000|12666.666666666664|
|    Eve|        HR| 65000|            3500.0|
|    Bob|        IT| 60000|            2500.0|
|  Grace|        HR| 58000|           -3500.0|
+-------+----------+------+------------------+



### Converting Spark SQL Table View Back to DataFrame

Once you have a Spark SQL table view, you can get a DataFrame back in two ways: `spark.table()` looks the view up by name, and `spark.sql()` returns the result of any query against it. Both return a regular DataFrame.


In [6]:
# Method 1: Convert Spark SQL table view to DataFrame using spark.table()
df_from_view = spark.table("employees")

print("Converted Spark SQL table view back to DataFrame:")
print(f"Type: {type(df_from_view)}")
df_from_view.show()

# Method 2: Convert using spark.sql() - returns a DataFrame
df_from_sql = spark.sql("SELECT * FROM employees")

print("\nConverted using spark.sql() - also returns a DataFrame:")
print(f"Type: {type(df_from_sql)}")
df_from_sql.show()

# Now you can use DataFrame API operations
print("\nUsing DataFrame API on the converted DataFrame:")
df_from_view.filter(df_from_view.Salary > 60000).show()


Converted Spark SQL table view back to DataFrame:
Type: <class 'pyspark.sql.dataframe.DataFrame'>
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Diana| 28|        IT| 55000|   Paris|
|    Eve| 32|        HR| 65000|  Sydney|
|  Frank| 27|     Sales| 52000|New York|
|  Grace| 29|        HR| 58000|  London|
+-------+---+----------+------+--------+


Converted using spark.sql() - also returns a DataFrame:
Type: <class 'pyspark.sql.dataframe.DataFrame'>
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
|  Diana| 28|        IT| 55000|   Paris|
|    Eve| 32|        HR| 65000|  Sydney|
|  Frank| 27|     Sales| 52000|New York|
|  Grace| 29|        HR| 58000|  London|
+-------+---+----------+------+--------+

### Alternative View Creation Methods

Spark provides several methods to create views from DataFrames, each with different scoping and behavior:

1. **createTempView()**: Creates a session-scoped temporary view; fails if a view with that name already exists
2. **createOrReplaceTempView()**: Creates a session-scoped temporary view, replacing any existing view with that name
3. **createGlobalTempView()**: Creates an application-scoped global temporary view; fails if a view with that name already exists
4. **createOrReplaceGlobalTempView()**: Creates an application-scoped global temporary view, replacing any existing view with that name

**Key Differences:**
- **Temp views**: Session-scoped (only available in the current SparkSession)
- **Global temp views**: Application-scoped (available across all SparkSessions in the same Spark application)
- **create vs createOrReplace**: the `create...` methods raise an error if a view with that name already exists; the `createOrReplace...` methods overwrite the existing view


In [7]:
# Example: createTempView() - fails if view already exists
try:
    df.createTempView("employees_temp")
    print("Created temporary view 'employees_temp' using createTempView()")
except Exception as e:
    print(f"Error (view might already exist): {e}")

# Example: createOrReplaceTempView() - replaces if exists
df.createOrReplaceTempView("employees_replace")
print("Created/replaced temporary view 'employees_replace' using createOrReplaceTempView()")

# Query the temp view
spark.sql("SELECT * FROM employees_replace LIMIT 3").show()


Created temporary view 'employees_temp' using createTempView()
Created/replaced temporary view 'employees_replace' using createOrReplaceTempView()
+-------+---+----------+------+--------+
|   Name|Age|Department|Salary|    City|
+-------+---+----------+------+--------+
|  Alice| 25|     Sales| 50000|New York|
|    Bob| 30|        IT| 60000|  London|
|Charlie| 35|     Sales| 70000|   Tokyo|
+-------+---+----------+------+--------+



In [8]:
# Example: createGlobalTempView() - creates global view (fails if exists)
try:
    df.createGlobalTempView("global_employees_create")
    print("Created global temporary view using createGlobalTempView()")
    # Query global view (note: must use global_temp database prefix)
    spark.sql("SELECT * FROM global_temp.global_employees_create LIMIT 2").show()
except Exception as e:
    print(f"Error (view might already exist): {e}")

# Example: createOrReplaceGlobalTempView() - replaces if exists
df.createOrReplaceGlobalTempView("global_employees_replace")
print("\nCreated/replaced global temporary view using createOrReplaceGlobalTempView()")
spark.sql("SELECT * FROM global_temp.global_employees_replace LIMIT 2").show()


Created global temporary view using createGlobalTempView()
+-----+---+----------+------+--------+
| Name|Age|Department|Salary|    City|
+-----+---+----------+------+--------+
|Alice| 25|     Sales| 50000|New York|
|  Bob| 30|        IT| 60000|  London|
+-----+---+----------+------+--------+


Created/replaced global temporary view using createOrReplaceGlobalTempView()
+-----+---+----------+------+--------+
| Name|Age|Department|Salary|    City|
+-----+---+----------+------+--------+
|Alice| 25|     Sales| 50000|New York|
|  Bob| 30|        IT| 60000|  London|
+-----+---+----------+------+--------+



### Summary: View Creation Methods Comparison

| Method | Scope | Behavior if Exists |
|--------|-------|-------------------|
| `createTempView()` | Session | Fails with error |
| `createOrReplaceTempView()` | Session | Replaces existing view |
| `createGlobalTempView()` | Application | Fails with error |
| `createOrReplaceGlobalTempView()` | Application | Replaces existing view |

**When to use each:**
- **createOrReplaceTempView()**: The most common choice, especially in notebooks where cells are re-run, because it succeeds whether or not the view already exists
- **createTempView()**: Use when silently overwriting an existing view would be a bug and you would rather get an error
- **Global views**: Use when you need to share views across multiple SparkSessions in the same application


## Global Temporary Views

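Global temporary views are accessible from every SparkSession in the same Spark application. Spark registers them in the reserved `global_temp` database, so queries must qualify the view name with the `global_temp.` prefix. They are dropped when the application ends.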


In [9]:
# Create global temporary view
df.createOrReplaceGlobalTempView("global_employees")

# Query global view (note the global_temp prefix)
result_global = spark.sql("SELECT * FROM global_temp.global_employees WHERE Age > 30")
print("Querying global temporary view:")
result_global.show()


Querying global temporary view:
+-------+---+----------+------+------+
|   Name|Age|Department|Salary|  City|
+-------+---+----------+------+------+
|Charlie| 35|     Sales| 70000| Tokyo|
|    Eve| 32|        HR| 65000|Sydney|
+-------+---+----------+------+------+



## Summary

In this module, you learned:

1. **DataFrames and Spark SQL Tables are Interconvertible**: You can create views from DataFrames and query them with SQL
2. **Creating Views**: Using `createOrReplaceTempView()`, `createTempView()`, `createGlobalTempView()`, and `createOrReplaceGlobalTempView()`
3. **SQL Queries**: Using `spark.sql()` to execute SQL queries against views created from DataFrames
4. **Converting Views to DataFrames**: Using `spark.table()` or `spark.sql()` to convert views back to DataFrames

**Key Takeaways**: 
- Spark SQL allows SQL developers to use familiar SQL syntax with PySpark DataFrames
- DataFrames and SQL tables are interconvertible - use the syntax you prefer
- Creating views from DataFrames enables powerful SQL-based data processing
- As a rule of thumb, use the DataFrame API for most operations and switch to SQL where a query is easier to express or read in SQL

**Next Steps**: In Module 6, we'll learn about joins - combining data from multiple DataFrames.
