# Spark SQL - Transformations and Data Analysis

**Training Objective:** Understanding Spark SQL as an alternative to the PySpark DataFrame API

**Topics Covered:**
- Spark SQL basics and view registration
- Syntax comparison: SQL vs DataFrame API
- Window Functions in SQL
- CTE (Common Table Expressions) and subqueries
- DDL operations (CREATE TABLE AS SELECT)

## Theoretical Introduction

**Spark SQL vs DataFrame API**

Spark offers two equivalent approaches to data processing:

| Aspect | DataFrame API | Spark SQL |
|--------|---------------|------------|
| Syntax | Python/Scala | Standard SQL |
| Optimization | Catalyst Optimizer | Catalyst Optimizer |
| Performance | Identical | Identical |
| Type Safety | Runtime in PySpark (compile-time only for typed Scala/Java Datasets) | Runtime (query analysis) |
| Integration | Programmatic | BI Tools, Analysts |

**When to use Spark SQL:**
- Analysts familiar with SQL
- Integration with BI tools
- Quick ad-hoc explorations
- Complex queries with CTE

**When to use DataFrame API:**
- Complex programmatic logic
- Dynamic query generation
- Reusable components
- Unit testing

## User Isolation

Run the initialization script:

In [0]:
%run ../00_setup

## Environment Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.window import Window

spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

## Spark SQL Basics

### spark.sql() - executing SQL queries

The `spark.sql()` function executes a SQL query and returns a DataFrame.

**Key features:**
- Returns a DataFrame (can be combined with DataFrame API)
- Supports most standard SQL syntax plus Spark-specific functions
- Uses Catalyst Optimizer

In [0]:
# Example: Simple SQL query
result = spark.sql("""
    SELECT 
        'Hello Spark SQL' as message,
        current_date() as today,
        current_timestamp() as now
""")

In [0]:
display(result)

### Creating Test Data

Prepare sample data for the Spark SQL demonstrations:

In [0]:
# Orders data
orders_data = [
    (1, 101, "2024-01-15", 250.00, "completed"),
    (2, 102, "2024-01-16", 150.00, "completed"),
    (3, 101, "2024-01-20", 320.00, "completed"),
    (4, 103, "2024-02-01", 180.00, "pending"),
    (5, 101, "2024-02-10", 420.00, "completed"),
    (6, 102, "2024-02-15", 90.00, "cancelled"),
    (7, 103, "2024-03-01", 550.00, "completed"),
    (8, 104, "2024-03-05", 280.00, "completed"),
    (9, 101, "2024-03-10", 175.00, "completed"),
    (10, 102, "2024-03-15", 340.00, "completed"),
]

orders_schema = StructType([
    StructField("order_id", IntegerType(), False),
    StructField("customer_id", IntegerType(), False),
    StructField("order_date", StringType(), False),
    StructField("amount", DoubleType(), False),
    StructField("status", StringType(), False)
])

orders_df = spark.createDataFrame(orders_data, orders_schema) \
    .withColumn("order_date", F.to_date("order_date"))

In [0]:
# Customers data
customers_data = [
    (101, "Jan", "Kowalski", "Premium", "Warszawa"),
    (102, "Anna", "Nowak", "Standard", "Krakow"),
    (103, "Piotr", "Wisniewski", "Premium", "Gdansk"),
    (104, "Maria", "Wojcik", "Standard", "Poznan"),
]

customers_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("first_name", StringType(), False),
    StructField("last_name", StringType(), False),
    StructField("tier", StringType(), False),
    StructField("city", StringType(), False)
])

customers_df = spark.createDataFrame(customers_data, customers_schema)

In [0]:
display(orders_df.limit(5))

In [0]:
display(customers_df)

### Registering Temp Views

To query a DataFrame from SQL, register it as a temporary view first.

**View types:**
- `createOrReplaceTempView()` - view visible only in the current Spark session
- `createOrReplaceGlobalTempView()` - view shared across sessions of the same Spark application (accessible via `global_temp.name`)

In [0]:
# Registering temporary views
orders_df.createOrReplaceTempView("orders")
customers_df.createOrReplaceTempView("customers")

In [0]:
# Now we can use SQL
spark.sql("SELECT * FROM orders LIMIT 5").display()

## SQL vs DataFrame API Comparison

### Example: Filtering and Aggregation

We will perform the same operation using both approaches.

**Task:** Find the order count, total value, and average value of completed orders per customer

In [0]:
# DataFrame API Approach
result_df = orders_df \
    .filter(F.col("status") == "completed") \
    .groupBy("customer_id") \
    .agg(
        F.count("*").alias("orders_count"),
        F.sum("amount").alias("total_amount"),
        F.round(F.avg("amount"), 2).alias("avg_amount")
    ) \
    .orderBy(F.col("total_amount").desc())

In [0]:
display(result_df)

In [0]:
# Spark SQL Approach
result_sql = spark.sql("""
    SELECT 
        customer_id,
        COUNT(*) as orders_count,
        SUM(amount) as total_amount,
        ROUND(AVG(amount), 2) as avg_amount
    FROM orders
    WHERE status = 'completed'
    GROUP BY customer_id
    ORDER BY total_amount DESC
""")

In [0]:
display(result_sql)

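**Comparison:** Both approaches return the same result. Because both go through the Catalyst Optimizer, equivalent queries such as these should also produce equivalent physical plans, which can be checked by calling `.explain()` on each DataFrame.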

### Example: JOIN with multiple tables

In [0]:
# DataFrame API - JOIN
joined_df = orders_df \
    .join(customers_df, "customer_id", "inner") \
    .select(
        "order_id",
        F.concat_ws(" ", "first_name", "last_name").alias("customer_name"),
        "tier",
        "order_date",
        "amount",
        "status"
    )

In [0]:
display(joined_df.limit(5))

In [0]:
# Spark SQL - JOIN
joined_sql = spark.sql("""
    SELECT 
        o.order_id,
        CONCAT_WS(' ', c.first_name, c.last_name) as customer_name,
        c.tier,
        o.order_date,
        o.amount,
        o.status
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
""")

In [0]:
display(joined_sql.limit(5))

## Window Functions in SQL

### Window Functions Syntax

```sql
function() OVER (
    PARTITION BY column
    ORDER BY column
    ROWS BETWEEN ... AND ...
)
```

**Ranking Functions:**
- `ROW_NUMBER()` - unique row number
- `RANK()` - rank with gaps
- `DENSE_RANK()` - rank without gaps

**Analytic Functions:**
- `LAG()` - value from previous row
- `LEAD()` - value from next row
- `FIRST_VALUE()` / `LAST_VALUE()`

In [0]:
# Order ranking per customer
ranking_sql = spark.sql("""
    SELECT 
        order_id,
        customer_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_sequence,
        RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) as amount_rank,
        DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) as amount_dense_rank
    FROM orders
    WHERE status = 'completed'
    ORDER BY customer_id, order_date
""")

In [0]:
display(ranking_sql)
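
In the sample data no customer has two completed orders with the same amount, so `RANK()` and `DENSE_RANK()` coincide in the output above. The following plain-Python sketch (not Spark code; `rank_values` is a hypothetical helper written only for this illustration) models how the three ranking functions differ once ties appear:

```python
def rank_values(values):
    """Return (value, row_number, rank, dense_rank) for values sorted descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(ordered, start=1):
        if value != previous:
            rank = row_number   # RANK jumps to the current row position after ties
            dense_rank += 1     # DENSE_RANK only ever increases by one
            previous = value
        result.append((value, row_number, rank, dense_rank))
    return result


# Hypothetical amounts with one tie (320.0 appears twice)
ranked = rank_values([320.0, 420.0, 320.0, 175.0])
for row in ranked:
    print(row)
# (420.0, 1, 1, 1)
# (320.0, 2, 2, 2)
# (320.0, 3, 2, 2)
# (175.0, 4, 4, 3)
```

The sketch assigns row numbers to tied rows in sort order; in Spark the order of `ROW_NUMBER()` among tied rows is not guaranteed unless the `ORDER BY` includes a tie-breaking column.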

### LAG and LEAD - Change Analysis

In [0]:
# Comparison with previous order
lag_lead_sql = spark.sql("""
    SELECT 
        order_id,
        customer_id,
        order_date,
        amount,
        LAG(amount, 1) OVER (PARTITION BY customer_id ORDER BY order_date) as prev_amount,
        LEAD(amount, 1) OVER (PARTITION BY customer_id ORDER BY order_date) as next_amount,
        amount - LAG(amount, 1) OVER (PARTITION BY customer_id ORDER BY order_date) as amount_change
    FROM orders
    WHERE status = 'completed'
    ORDER BY customer_id, order_date
""")

In [0]:
display(lag_lead_sql)
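
To make the `NULL` values at the partition edges easier to anticipate, here is a minimal plain-Python model of `LAG` and `LEAD` (not Spark code; `lag` and `lead` are hypothetical helpers), applied to the completed orders of customer 101 in date order:

```python
def lag(values, offset=1):
    """Value from `offset` rows earlier, or None when no such row exists."""
    return [values[i - offset] if i - offset >= 0 else None
            for i in range(len(values))]


def lead(values, offset=1):
    """Value from `offset` rows later, or None when no such row exists."""
    return [values[i + offset] if i + offset < len(values) else None
            for i in range(len(values))]


# Completed orders of customer 101, ordered by order_date
amounts = [250.0, 320.0, 420.0, 175.0]

prev_amount = lag(amounts)
next_amount = lead(amounts)
# Like SQL arithmetic, a missing (None/NULL) operand yields a missing result
amount_change = [None if p is None else a - p
                 for a, p in zip(amounts, prev_amount)]

print(prev_amount)    # [None, 250.0, 320.0, 420.0]
print(next_amount)    # [320.0, 420.0, 175.0, None]
print(amount_change)  # [None, 70.0, 100.0, -245.0]
```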

### Running Totals and Moving Averages

In [0]:
# Running total and moving average
running_sql = spark.sql("""
    SELECT 
        order_id,
        customer_id,
        order_date,
        amount,
        SUM(amount) OVER (
            PARTITION BY customer_id 
            ORDER BY order_date 
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) as cumulative_amount,
        ROUND(AVG(amount) OVER (
            PARTITION BY customer_id 
            ORDER BY order_date 
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ), 2) as moving_avg_3
    FROM orders
    WHERE status = 'completed'
    ORDER BY customer_id, order_date
""")

In [0]:
display(running_sql)
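
The two `ROWS BETWEEN` frames used above can be modelled in plain Python as list slices that end at the current row. This is only a sketch of the frame semantics (the `frames` helper is hypothetical, not a Spark API), again using customer 101's completed orders:

```python
def frames(values, preceding=None):
    """Return the frame of rows visible at each position.

    preceding=None models UNBOUNDED PRECEDING AND CURRENT ROW;
    preceding=n models n PRECEDING AND CURRENT ROW.
    """
    result = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        result.append(values[start:i + 1])
    return result


# Completed orders of customer 101, ordered by order_date
amounts = [250.0, 320.0, 420.0, 175.0]

cumulative_amount = [sum(f) for f in frames(amounts)]
moving_avg_3 = [round(sum(f) / len(f), 2) for f in frames(amounts, preceding=2)]

print(cumulative_amount)  # [250.0, 570.0, 990.0, 1165.0]
print(moving_avg_3)       # [250.0, 285.0, 330.0, 305.0]
```

Note that the first two moving-average values are computed over fewer than three rows, because the frame is truncated at the start of the partition.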

## CTE (Common Table Expressions)

### WITH clause

A CTE defines a named subquery that later parts of the same statement can reference, more than once if needed.

**CTE Advantages:**
- Code readability
- Logic reuse
- Easier debugging
- Recursive queries (support depends on the Spark / Databricks Runtime version)

In [0]:
# CTE - Customer Analysis
cte_analysis = spark.sql("""
    WITH customer_orders AS (
        SELECT 
            customer_id,
            COUNT(*) as orders_count,
            SUM(amount) as total_spent,
            AVG(amount) as avg_order_value
        FROM orders
        WHERE status = 'completed'
        GROUP BY customer_id
    ),
    customer_ranking AS (
        SELECT 
            *,
            RANK() OVER (ORDER BY total_spent DESC) as spending_rank,
            CASE 
                WHEN total_spent >= 500 THEN 'High Value'
                WHEN total_spent >= 300 THEN 'Medium Value'
                ELSE 'Low Value'
            END as value_segment
        FROM customer_orders
    )
    SELECT 
        cr.*,
        c.first_name,
        c.last_name,
        c.tier,
        c.city
    FROM customer_ranking cr
    JOIN customers c ON cr.customer_id = c.customer_id
    ORDER BY spending_rank
""")

In [0]:
display(cte_analysis)

### Reusing CTEs

In [0]:
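# Two-level aggregation: the CTE computes monthly totals per customer,
# the outer query then summarizes those totals per month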
multi_cte = spark.sql("""
    WITH monthly_stats AS (
        SELECT 
            DATE_TRUNC('month', order_date) as month,
            customer_id,
            SUM(amount) as monthly_spent
        FROM orders
        WHERE status = 'completed'
        GROUP BY DATE_TRUNC('month', order_date), customer_id
    )
    SELECT 
        month,
        COUNT(DISTINCT customer_id) as active_customers,
        SUM(monthly_spent) as total_revenue,
        ROUND(AVG(monthly_spent), 2) as avg_customer_spend,
        MAX(monthly_spent) as max_customer_spend
    FROM monthly_stats
    GROUP BY month
    ORDER BY month
""")

In [0]:
display(multi_cte)
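
A CTE can also be referenced more than once within a single statement. The sketch below shows this with one CTE used both as the main source and inside a scalar subquery. It runs on SQLite through Python's standard library purely as a stand-in so that it works without a cluster; the table contents are a subset of the sample orders, and `customer_totals` and `pct_of_revenue` are names chosen for this illustration. The query only uses constructs that also appear elsewhere in this notebook, so it should carry over to `spark.sql()`, but that is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, 250.0, "completed"),
        (2, 102, 150.0, "completed"),
        (3, 101, 320.0, "completed"),
        (4, 103, 180.0, "pending"),
        (7, 103, 550.0, "completed"),
    ],
)

rows = conn.execute("""
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total_spent
        FROM orders
        WHERE status = 'completed'
        GROUP BY customer_id
    )
    SELECT
        t.customer_id,
        t.total_spent,
        -- second reference to the same CTE
        ROUND(100.0 * t.total_spent /
              (SELECT SUM(total_spent) FROM customer_totals), 1) AS pct_of_revenue
    FROM customer_totals t
    ORDER BY t.total_spent DESC
""").fetchall()

for row in rows:
    print(row)
```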

## Subqueries

### Scalar Subqueries

Subqueries returning a single value:

In [0]:
# Orders above average
scalar_subquery = spark.sql("""
    SELECT 
        order_id,
        customer_id,
        amount,
        (SELECT ROUND(AVG(amount), 2) FROM orders WHERE status = 'completed') as avg_amount,
        amount - (SELECT AVG(amount) FROM orders WHERE status = 'completed') as diff_from_avg
    FROM orders
    WHERE status = 'completed'
      AND amount > (SELECT AVG(amount) FROM orders WHERE status = 'completed')
    ORDER BY amount DESC
""")

In [0]:
display(scalar_subquery)

### Correlated Subqueries

Subqueries referencing the outer query:

In [0]:
# Orders above the customer's own average order amount
correlated_subquery = spark.sql("""
    SELECT 
        o.order_id,
        o.customer_id,
        o.amount,
        (SELECT ROUND(AVG(o2.amount), 2) 
         FROM orders o2 
         WHERE o2.customer_id = o.customer_id 
           AND o2.status = 'completed') as customer_avg
    FROM orders o
    WHERE o.status = 'completed'
      AND o.amount > (
          SELECT AVG(o2.amount) 
          FROM orders o2 
          WHERE o2.customer_id = o.customer_id 
            AND o2.status = 'completed'
      )
    ORDER BY o.customer_id, o.amount DESC
""")

In [0]:
display(correlated_subquery)

### EXISTS and IN

In [0]:
# Customers with at least one completed order above 400
exists_query = spark.sql("""
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name,
        c.tier
    FROM customers c
    WHERE EXISTS (
        SELECT 1 
        FROM orders o 
        WHERE o.customer_id = c.customer_id 
          AND o.amount > 400
          AND o.status = 'completed'
    )
""")

In [0]:
display(exists_query)
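
The same question can be phrased with `IN` instead of `EXISTS`. The sketch below runs both forms and compares the results. It uses SQLite from Python's standard library as a stand-in so it is runnable without a Spark session, with a reduced copy of the sample data; treat it as an illustration of the two query shapes rather than as Spark output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, status TEXT);
    INSERT INTO customers VALUES (101, 'Jan'), (102, 'Anna'), (103, 'Piotr'), (104, 'Maria');
    INSERT INTO orders VALUES
        (5, 101, 420.0, 'completed'),
        (6, 102, 90.0, 'cancelled'),
        (7, 103, 550.0, 'completed'),
        (8, 104, 280.0, 'completed');
""")

# Correlated form: the subquery is evaluated per customer row
exists_ids = [row[0] for row in conn.execute("""
    SELECT c.customer_id
    FROM customers c
    WHERE EXISTS (
        SELECT 1
        FROM orders o
        WHERE o.customer_id = c.customer_id
          AND o.amount > 400
          AND o.status = 'completed'
    )
    ORDER BY c.customer_id
""")]

# Uncorrelated form: the subquery produces a set of customer ids
in_ids = [row[0] for row in conn.execute("""
    SELECT c.customer_id
    FROM customers c
    WHERE c.customer_id IN (
        SELECT o.customer_id
        FROM orders o
        WHERE o.amount > 400
          AND o.status = 'completed'
    )
    ORDER BY c.customer_id
""")]

print(exists_ids)  # [101, 103]
print(in_ids)      # [101, 103]
```

The two forms agree here. They can diverge for the negated variants: `NOT IN` returns no rows when the subquery yields a `NULL`, whereas `NOT EXISTS` is unaffected.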

## CASE WHEN and Advanced Expressions

### Conditional Logic

In [0]:
# Order segmentation
case_when_sql = spark.sql("""
    SELECT 
        order_id,
        customer_id,
        amount,
        CASE 
            WHEN amount >= 500 THEN 'Large'
            WHEN amount >= 200 THEN 'Medium'
            ELSE 'Small'
        END as order_size,
        CASE status
            WHEN 'completed' THEN 1
            WHEN 'pending' THEN 0
            ELSE -1
        END as status_code,
        COALESCE(amount, 0) as amount_safe
    FROM orders
    ORDER BY amount DESC
""")

In [0]:
display(case_when_sql)

### NULLIF, COALESCE, NVL

In [0]:
# Handling NULLs
null_handling = spark.sql("""
    SELECT 
        order_id,
        amount,
        status,
        NULLIF(status, 'cancelled') as status_or_null,
        COALESCE(NULLIF(status, 'cancelled'), 'N/A') as status_clean,
        NVL(amount, 0) as amount_nvl
    FROM orders
""")

In [0]:
display(null_handling)
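
As a mental model for the functions above, the following plain-Python sketch treats `None` as SQL `NULL`. The helpers `nullif`, `coalesce`, and `nvl` are hypothetical stand-ins written for this illustration, not Spark functions:

```python
def nullif(value, other):
    """NULLIF(a, b): NULL when a equals b, otherwise a."""
    return None if value == other else value


def coalesce(*values):
    """COALESCE(a, b, ...): first argument that is not NULL, or NULL if all are."""
    return next((v for v in values if v is not None), None)


def nvl(value, default):
    """NVL(a, b): two-argument form of COALESCE."""
    return coalesce(value, default)


statuses = ["completed", "pending", "cancelled"]

status_or_null = [nullif(s, "cancelled") for s in statuses]
status_clean = [coalesce(nullif(s, "cancelled"), "N/A") for s in statuses]

print(status_or_null)  # ['completed', 'pending', None]
print(status_clean)    # ['completed', 'pending', 'N/A']
print(nvl(None, 0))    # 0
print(nvl(90.0, 0))    # 90.0
```

Because `amount` is declared non-nullable in `orders_schema`, `NVL(amount, 0)` never substitutes the default in the query above; the column is included only to show the syntax.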

## DDL Operations in Spark SQL

### CREATE TABLE AS SELECT (CTAS)

In [0]:
# Creating table with aggregation results
spark.sql(f"""
    CREATE OR REPLACE TABLE {CATALOG}.{GOLD_SCHEMA}.customer_summary AS
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name,
        c.tier,
        c.city,
        COUNT(o.order_id) as total_orders,
        COALESCE(SUM(CASE WHEN o.status = 'completed' THEN o.amount END), 0) as total_spent,
        ROUND(COALESCE(AVG(CASE WHEN o.status = 'completed' THEN o.amount END), 0), 2) as avg_order_value,
        MAX(o.order_date) as last_order_date
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name, c.tier, c.city
""")

In [0]:
# Verification
spark.sql(f"SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.customer_summary").display()

### CREATE VIEW

In [0]:
# A permanent view cannot reference a temp view, so persist the orders as a table first
spark.sql(f"""
    CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.orders_demo AS
    SELECT * FROM orders
""")

In [0]:
# Creating a view on top of the persisted table
spark.sql(f"""
    CREATE OR REPLACE VIEW {CATALOG}.{GOLD_SCHEMA}.v_monthly_revenue AS
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        COUNT(*) as orders_count,
        SUM(amount) as total_revenue,
        ROUND(AVG(amount), 2) as avg_order_value
    FROM {CATALOG}.{BRONZE_SCHEMA}.orders_demo
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', order_date)
""")

In [0]:
spark.sql(f"SELECT * FROM {CATALOG}.{GOLD_SCHEMA}.v_monthly_revenue ORDER BY month").display()

## Summary

### Covered Topics

1. **Spark SQL Basics**
   - `spark.sql()` query execution
   - `createOrReplaceTempView()` view registration

2. **SQL vs DataFrame API Comparison**
   - Equivalent performance (both compile through the Catalyst Optimizer)
   - Different use cases

3. **Window Functions in SQL**
   - ROW_NUMBER, RANK, DENSE_RANK
   - LAG, LEAD
   - Running totals, moving averages

4. **CTE and Subqueries**
   - WITH clause for readability
   - Scalar and correlated subqueries
   - EXISTS, IN

5. **DDL Operations**
   - CREATE TABLE AS SELECT
   - CREATE VIEW

### Quick Reference

| Operation | Spark SQL | DataFrame API |
|-----------|-----------|---------------|
| Filtering | `WHERE col = 'x'` | `.filter(F.col("col") == "x")` |
| Aggregation | `GROUP BY col` | `.groupBy("col").agg(...)` |
| Ranking | `ROW_NUMBER() OVER (...)` | `F.row_number().over(window)` |
| CTE | `WITH cte AS (...)` | Intermediate DataFrame assigned to a variable |
| CASE WHEN | `CASE WHEN ... END` | `F.when(...).otherwise(...)` |

### Next Steps

- **Next Notebook**: 07_streaming_incremental.ipynb
- **Workshop**: 01_advanced_transformations_workshop.ipynb

## Clean up resources

In [0]:
# Removing temp views
spark.catalog.dropTempView("orders")
spark.catalog.dropTempView("customers")

# Optional: removing created tables
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{GOLD_SCHEMA}.customer_summary")
# spark.sql(f"DROP VIEW IF EXISTS {CATALOG}.{GOLD_SCHEMA}.v_monthly_revenue")