# Module 06: Spark SQL and Temporary Views

**Difficulty**: ⭐⭐

**Estimated Time**: 70-85 minutes

**Prerequisites**: 
- Module 00: Introduction to Big Data and Spark Ecosystem
- Module 01: PySpark Setup and SparkSession
- Module 03: DataFrames and Datasets
- Module 05: DataFrame Operations

## Learning Objectives

By the end of this notebook, you will be able to:
1. Create and use temporary views for SQL queries
2. Write SQL queries on DataFrames using spark.sql()
3. Understand the difference between temporary and global temporary views
4. Use the Spark SQL catalog to manage tables and views
5. Mix DataFrame API and SQL seamlessly in the same application

## Setup

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
# Note: sum and round imported here shadow Python's built-in sum() and round()
from pyspark.sql.functions import col, sum, avg, count, round
from datetime import date
import random

# Create SparkSession
spark = SparkSession.builder \
    .appName("Module 06: Spark SQL and Temporary Views") \
    .master("local[*]") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

print(f"✓ SparkSession created: {spark.sparkContext.appName}")
print(f"  Spark version: {spark.version}")
print(f"  Spark UI: {spark.sparkContext.uiWebUrl}")

### Create Sample DataFrames

In [None]:
# Sample employees data
employees_data = [
    (1, "Alice Johnson", "Engineering", 75000, date(2020, 1, 15)),
    (2, "Bob Smith", "Sales", 65000, date(2019, 6, 10)),
    (3, "Charlie Brown", "Engineering", 80000, date(2021, 3, 22)),
    (4, "Diana Prince", "Marketing", 70000, date(2020, 8, 5)),
    (5, "Eve Davis", "Engineering", 85000, date(2018, 11, 30)),
    (6, "Frank Miller", "Sales", 68000, date(2022, 2, 14)),
    (7, "Grace Lee", "Marketing", 72000, date(2021, 9, 1)),
    (8, "Henry Wilson", "Engineering", 78000, date(2019, 4, 20)),
    (9, "Ivy Chen", "Sales", 71000, date(2020, 12, 10)),
    (10, "Jack Thompson", "Engineering", 82000, date(2021, 7, 5))
]

df_employees = spark.createDataFrame(
    employees_data,
    ["emp_id", "name", "department", "salary", "hire_date"]
)

print("Employees DataFrame:")
df_employees.show()

# Sample projects data
projects_data = [
    (101, "Project Alpha", 1, date(2024, 1, 1), date(2024, 6, 30)),
    (102, "Project Beta", 3, date(2024, 2, 1), date(2024, 8, 31)),
    (103, "Project Gamma", 5, date(2024, 3, 1), date(2024, 9, 30)),
    (104, "Project Delta", 2, date(2024, 1, 15), date(2024, 5, 15)),
    (105, "Project Epsilon", 8, date(2024, 4, 1), date(2024, 10, 31)),
]

df_projects = spark.createDataFrame(
    projects_data,
    ["project_id", "project_name", "lead_emp_id", "start_date", "end_date"]
)

print("\nProjects DataFrame:")
df_projects.show()

# Sample sales data
sales_data = []
products = ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard']
regions = ['North', 'South', 'East', 'West']

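import builtins  # pyspark.sql.functions.round (imported above) shadows Python's round

for i in range(50):
    sales_data.append((
        i + 1,
        date(2024, random.randint(1, 3), random.randint(1, 28)),
        random.choice(products),
        random.choice(regions),
        random.randint(1, 10),
        builtins.round(random.uniform(100, 2000), 2)  # plain Python float, not a Column
    ))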

df_sales = spark.createDataFrame(
    sales_data,
    ["sale_id", "sale_date", "product", "region", "quantity", "revenue"]
)

print("\nSales DataFrame (first 10 rows):")
df_sales.show(10)

## 1. Understanding Spark SQL

### What is Spark SQL?

**Spark SQL** is Spark's module for structured data processing. It provides:
- SQL interface to query data
- Integration with DataFrames
- Optimized execution through the Catalyst optimizer
- Support for various data sources

### Why Use SQL with Spark?

✅ **Familiarity**: Most data analysts know SQL

✅ **Readability**: Multi-step joins and aggregations are often clearer in SQL

✅ **Integration**: Mix SQL and DataFrame API

✅ **Performance**: Same Catalyst optimizer as DataFrame API

### DataFrame API vs SQL

```python
# DataFrame API
df.filter(col("salary") > 70000) \
  .groupBy("department") \
  .agg(avg("salary"))

# SQL (equivalent)
spark.sql("""
    SELECT department, AVG(salary)
    FROM employees
    WHERE salary > 70000
    GROUP BY department
""")
```

**Both are planned by the Catalyst optimizer, so equivalent queries like these produce the same optimized execution plan.**

## 2. Creating Temporary Views

To query a DataFrame with SQL, we first register it as a **temporary view**.

### What is a Temporary View?

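- A named reference to a DataFrame
- Exists only for the lifetime of the SparkSession that created it
- Stores no data: the name is bound to the DataFrame's query plan, which runs when the view is queried
- Can be queried with SQL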

### Creating a Temporary View

In [None]:
# Create temporary view
df_employees.createOrReplaceTempView("employees")
df_projects.createOrReplaceTempView("projects")
df_sales.createOrReplaceTempView("sales")

print("✓ Temporary views created:")
print("  - employees")
print("  - projects")
print("  - sales")

# Note: createOrReplaceTempView() silently replaces an existing view with the same name
# Use createTempView() instead to raise an error if the view already exists

## 3. Running SQL Queries

### Basic SELECT Queries

In [None]:
# Simple SELECT query
result = spark.sql("""
    SELECT * 
    FROM employees
    LIMIT 5
""")

print("Simple SELECT:")
result.show()

# Select specific columns
result2 = spark.sql("""
    SELECT name, department, salary
    FROM employees
""")

print("\nSelect specific columns:")
result2.show(5)

### WHERE Clause

In [None]:
# Filter with WHERE
result = spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE salary > 75000
""")

print("Employees with salary > 75000:")
result.show()

# Multiple conditions
result2 = spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE department = 'Engineering' 
      AND salary > 75000
""")

print("\nEngineers with salary > 75000:")
result2.show()

### ORDER BY

In [None]:
# Sort by salary
result = spark.sql("""
    SELECT name, department, salary
    FROM employees
    ORDER BY salary DESC
    LIMIT 5
""")

print("Top 5 highest paid employees:")
result.show()

# Sort by multiple columns
result2 = spark.sql("""
    SELECT name, department, salary
    FROM employees
    ORDER BY department ASC, salary DESC
""")

print("\nSorted by department, then salary:")
result2.show()

## 4. Aggregation in SQL

### GROUP BY

In [None]:
# Count employees by department
result = spark.sql("""
    SELECT 
        department,
        COUNT(*) as num_employees
    FROM employees
    GROUP BY department
    ORDER BY num_employees DESC
""")

print("Employees by department:")
result.show()

# Multiple aggregations
result2 = spark.sql("""
    SELECT 
        department,
        COUNT(*) as num_employees,
        ROUND(AVG(salary), 2) as avg_salary,
        MIN(salary) as min_salary,
        MAX(salary) as max_salary,
        SUM(salary) as total_payroll
    FROM employees
    GROUP BY department
    ORDER BY avg_salary DESC
""")

print("\nDepartment statistics:")
result2.show()

### HAVING Clause

In [None]:
# Filter aggregated results with HAVING
result = spark.sql("""
    SELECT 
        department,
        COUNT(*) as num_employees,
        ROUND(AVG(salary), 2) as avg_salary
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 70000
    ORDER BY avg_salary DESC
""")

print("Departments with avg salary > 70000:")
result.show()

## 5. JOINs in SQL

In [None]:
# INNER JOIN
result = spark.sql("""
    SELECT 
        p.project_name,
        e.name as project_lead,
        e.department,
        p.start_date,
        p.end_date
    FROM projects p
    INNER JOIN employees e
        ON p.lead_emp_id = e.emp_id
    ORDER BY p.start_date
""")

print("Projects with their leads:")
result.show(truncate=False)

# LEFT JOIN
result2 = spark.sql("""
    SELECT 
        e.name,
        e.department,
        p.project_name
    FROM employees e
    LEFT JOIN projects p
        ON e.emp_id = p.lead_emp_id
    ORDER BY e.name
""")

print("\nEmployees and their projects (including non-leads):")
result2.show()

## 6. Subqueries

In [None]:
# Subquery in WHERE clause
result = spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE salary > (
        SELECT AVG(salary)
        FROM employees
    )
    ORDER BY salary DESC
""")

print("Employees earning above average:")
result.show()

# Subquery with IN
result2 = spark.sql("""
    SELECT name, department, salary
    FROM employees
    WHERE emp_id IN (
        SELECT lead_emp_id
        FROM projects
    )
    ORDER BY name
""")

print("\nEmployees who are project leads:")
result2.show()

## 7. Common Table Expressions (CTEs)

In [None]:
# Using WITH clause (CTE)
result = spark.sql("""
    WITH dept_stats AS (
        SELECT 
            department,
            COUNT(*) as num_employees,
            AVG(salary) as avg_salary
        FROM employees
        GROUP BY department
    )
    SELECT 
        department,
        num_employees,
        ROUND(avg_salary, 2) as avg_salary
    FROM dept_stats
    WHERE num_employees >= 3
    ORDER BY avg_salary DESC
""")

print("Departments with 3+ employees:")
result.show()

# Multiple CTEs
result2 = spark.sql("""
    WITH 
    high_earners AS (
        SELECT * 
        FROM employees 
        WHERE salary > 75000
    ),
    dept_counts AS (
        SELECT department, COUNT(*) as count
        FROM high_earners
        GROUP BY department
    )
    SELECT *
    FROM dept_counts
    ORDER BY count DESC
""")

print("\nHigh earners by department:")
result2.show()

## 8. Date and String Functions in SQL

In [None]:
# Date functions
result = spark.sql("""
    SELECT 
        name,
        hire_date,
        YEAR(hire_date) as hire_year,
        MONTH(hire_date) as hire_month,
        DATEDIFF(CURRENT_DATE(), hire_date) as days_employed
    FROM employees
    ORDER BY hire_date DESC
    LIMIT 5
""")

print("Date functions:")
result.show()

# String functions
result2 = spark.sql("""
    SELECT 
        name,
        UPPER(name) as upper_name,
        LOWER(name) as lower_name,
        SUBSTRING(name, 1, 5) as first_5_chars,
        CONCAT(name, ' (', department, ')') as name_with_dept
    FROM employees
    LIMIT 5
""")

print("\nString functions:")
result2.show(truncate=False)

## 9. Global Temporary Views

### Temporary View vs Global Temporary View

| Feature | Temporary View | Global Temporary View |
|---------|----------------|----------------------|
| **Scope** | Current session | All sessions in the same Spark application |
| **Prefix** | None | `global_temp.` |
| **Lifecycle** | Dropped when the session ends | Dropped when the application ends |
| **Use case** | Single session | Sharing across sessions in one application |

In [None]:
# Create global temporary view
df_employees.createOrReplaceGlobalTempView("employees_global")

print("✓ Global temporary view created")

# Query global temporary view
result = spark.sql("""
    SELECT department, COUNT(*) as count
    FROM global_temp.employees_global
    GROUP BY department
""")

print("\nQuerying global temporary view:")
result.show()

# Note: global temporary views live in the reserved global_temp database,
# so the global_temp. prefix is required in every query

## 10. Using the Spark Catalog

The **catalog** manages all tables, views, and databases in Spark.

In [None]:
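# List all tables/views (printed explicitly; a bare expression mid-cell shows nothing)
print("All tables and views:")
for table in spark.catalog.listTables():
    print(f"  {table.name} (temporary: {table.isTemporary})")

# Show as DataFrame
tables_df = spark.sql("SHOW TABLES")
tables_df.show(truncate=False)

# List databases
print("\nDatabases:")
for database in spark.catalog.listDatabases():
    print(f"  {database.name}")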

In [None]:
# Check if table exists
print(f"Does 'employees' exist? {spark.catalog.tableExists('employees')}")
print(f"Does 'nonexistent' exist? {spark.catalog.tableExists('nonexistent')}")

# Get table columns
print("\nColumns in 'employees' table:")
# Use a loop variable other than col so the imported col() function is not overwritten
for column in spark.catalog.listColumns("employees"):
    print(f"  {column.name}: {column.dataType}")

## 11. Mixing DataFrame API and SQL

You can seamlessly mix SQL and DataFrame operations!

In [None]:
# Start with SQL
sql_result = spark.sql("""
    SELECT department, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
""")

print("SQL result:")
sql_result.show()

# Continue with DataFrame API
final_result = sql_result \
    .filter(col("avg_salary") > 70000) \
    .withColumn("avg_salary_rounded", round(col("avg_salary"), 2)) \
    .drop("avg_salary") \
    .orderBy(col("avg_salary_rounded").desc())

print("\nAfter DataFrame operations:")
final_result.show()

In [None]:
# Start with DataFrame API
df_filtered = df_employees.filter(col("salary") > 70000)

# Register as view
df_filtered.createOrReplaceTempView("high_earners")

# Continue with SQL
result = spark.sql("""
    SELECT 
        department,
        COUNT(*) as num_high_earners,
        ROUND(AVG(salary), 2) as avg_high_earner_salary
    FROM high_earners
    GROUP BY department
    ORDER BY avg_high_earner_salary DESC
""")

print("DataFrame → SQL:")
result.show()

## Exercises

### Exercise 1: Sales Analysis with SQL

Using the `sales` table, write SQL queries to:
1. Find total revenue by product
2. Find average revenue per sale by region
3. Find products that generated > $10,000 total revenue
4. Show results sorted by total revenue (descending)

In [None]:
# Exercise 1: Your code here

# Your SQL query here

### Exercise 2: Complex JOIN Query

Write a SQL query that:
1. Joins employees and projects
2. Shows project name, project lead name, and lead's department
3. Includes only projects led by Engineering department
4. Sorts by project start date

In [None]:
# Exercise 2: Your code here

# Your SQL query here

### Exercise 3: CTE for Multi-Step Analysis

Use a CTE to:
1. Calculate each employee's years of service (use DATEDIFF and hire_date)
2. Categorize employees: "Senior" (>4 years), "Mid" (2-4 years), "Junior" (<2 years)
3. Count employees in each category by department
4. Show only departments with at least 2 employees

In [None]:
# Exercise 3: Your code here

# Your SQL query here

### Exercise 4: Mixing SQL and DataFrame API

1. Use SQL to find departments with average salary > $70,000
2. Use DataFrame API to add a column "salary_tier" based on avg_salary:
   - "High" if >= 80000
   - "Medium" if >= 70000
   - "Low" otherwise
3. Sort by avg_salary descending
4. Display the result

In [None]:
# Exercise 4: Your code here

# Your code here

## Solutions

### Exercise 1 Solution

In [None]:
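# Solution 1: Sales Analysis with SQL

# Parts 1, 3 and 4: total revenue by product, filtered and sorted
result = spark.sql("""
    SELECT 
        product,
        COUNT(*) as num_sales,
        ROUND(SUM(revenue), 2) as total_revenue,
        ROUND(AVG(revenue), 2) as avg_revenue
    FROM sales
    GROUP BY product
    HAVING SUM(revenue) > 10000
    ORDER BY total_revenue DESC
""")

print("Products with >$10,000 revenue:")
result.show()

# Part 2: average revenue per sale by region
result2 = spark.sql("""
    SELECT 
        region,
        ROUND(AVG(revenue), 2) as avg_revenue_per_sale
    FROM sales
    GROUP BY region
    ORDER BY avg_revenue_per_sale DESC
""")

print("\nAverage revenue per sale by region:")
result2.show()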

### Exercise 2 Solution

In [None]:
# Solution 2: Complex JOIN Query

result = spark.sql("""
    SELECT 
        p.project_name,
        e.name as project_lead,
        e.department,
        p.start_date
    FROM projects p
    INNER JOIN employees e
        ON p.lead_emp_id = e.emp_id
    WHERE e.department = 'Engineering'
    ORDER BY p.start_date
""")

print("Engineering-led projects:")
result.show(truncate=False)

### Exercise 3 Solution

In [None]:
# Solution 3: CTE for Multi-Step Analysis

result = spark.sql("""
    WITH employee_tenure AS (
        SELECT 
            name,
            department,
            hire_date,
            DATEDIFF(CURRENT_DATE(), hire_date) / 365 as years_service
        FROM employees
    ),
    categorized AS (
        SELECT 
            department,
            CASE 
                WHEN years_service > 4 THEN 'Senior'
                WHEN years_service >= 2 THEN 'Mid'
                ELSE 'Junior'
            END as tenure_category
        FROM employee_tenure
    ),
    dept_sizes AS (
        -- Step 4: keep only departments with at least 2 employees
        SELECT department
        FROM employees
        GROUP BY department
        HAVING COUNT(*) >= 2
    )
    SELECT 
        c.department,
        c.tenure_category,
        COUNT(*) as num_employees
    FROM categorized c
    INNER JOIN dept_sizes d
        ON c.department = d.department
    GROUP BY c.department, c.tenure_category
    ORDER BY c.department, c.tenure_category
""")

print("Employee tenure analysis:")
result.show()

### Exercise 4 Solution

In [None]:
# Solution 4: Mixing SQL and DataFrame API

from pyspark.sql.functions import when, col, round

# Step 1: SQL query
sql_result = spark.sql("""
    SELECT 
        department,
        AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 70000
""")

# Step 2-3: DataFrame API
final_result = sql_result \
    .withColumn(
        "salary_tier",
        when(col("avg_salary") >= 80000, "High")
            .when(col("avg_salary") >= 70000, "Medium")
            .otherwise("Low")
    ) \
    .withColumn("avg_salary", round(col("avg_salary"), 2)) \
    .orderBy(col("avg_salary").desc())

# Step 4: Display
print("Department salary analysis:")
final_result.show()

## Summary

### Key Concepts Covered

✅ **Temporary Views**: Register DataFrames for SQL queries

✅ **SQL Queries**: Full SQL support with spark.sql()

✅ **Global Views**: Share views across sessions within one Spark application

✅ **Catalog**: Manage tables, views, and databases

✅ **API Mixing**: Seamlessly combine SQL and DataFrame operations

### Creating Views

**Temporary View** (session-scoped):
```python
df.createOrReplaceTempView("table_name")
spark.sql("SELECT * FROM table_name")
```

**Global Temporary View** (application-scoped):
```python
df.createOrReplaceGlobalTempView("table_name")
spark.sql("SELECT * FROM global_temp.table_name")
```

### SQL Features Supported

✅ **SELECT, WHERE, ORDER BY, LIMIT**

✅ **GROUP BY, HAVING**

✅ **JOINs** (INNER, LEFT, RIGHT, FULL, CROSS)

✅ **Subqueries**

✅ **CTEs** (WITH clause)

✅ **Aggregate functions** (COUNT, SUM, AVG, MIN, MAX)

✅ **String functions** (UPPER, LOWER, CONCAT, SUBSTRING)

✅ **Date functions** (YEAR, MONTH, DAY, DATEDIFF, DATE_ADD)

### Catalog Operations

```python
spark.catalog.listTables()           # List all tables/views
spark.catalog.listDatabases()        # List databases
spark.catalog.tableExists("name")    # Check if exists
spark.catalog.listColumns("name")    # Get columns
```

### When to Use SQL vs DataFrame API?

**Use SQL when**:
- Team is familiar with SQL
- Complex queries easier to express in SQL
- Migrating from traditional databases
- Ad-hoc analysis and exploration

**Use DataFrame API when**:
- Building programmatic data pipelines
- Need type safety (in Scala/Java)
- Dynamic query construction
- Composing reusable transformations as Python functions (window functions and UDFs are available from both SQL and the DataFrame API)

**Best practice**: Mix both! Use whichever is clearer for each operation.

### Performance Considerations

- **Both SQL and the DataFrame API use the same Catalyst optimizer**
- **No inherent performance difference** between equivalent SQL and DataFrame API queries
- Choose based on readability and team preference
- Spark optimizes the logical plan before execution, regardless of how the query was written

### What's Next?

In **Module 07: Window Functions and Advanced Transformations**, you will:
- Use window functions for advanced analytics
- Create and use User Defined Functions (UDFs)
- Perform ranking, running totals, and moving averages
- Handle complex data transformations

### Additional Resources

- [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [SQL Reference](https://spark.apache.org/docs/latest/sql-ref.html)
- [Catalog API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Catalog.html)

In [None]:
# Cleanup
spark.stop()
print("SparkSession stopped. ✓")