# Spark SQL Practice - Revenue Analysis

## Introduction

This notebook contains one practice question: computing monthly revenue by region with Spark SQL. The exercise gives you practice with:

- Date filtering and window functions
- Handling latest records per group
- Joins across multiple tables
- Monthly aggregation with date functions
- Translating a business rule (latest payment per order) into a query

## Instructions

1. **In Databricks**: SparkSession is automatically available as `spark`
2. **For local testing**: Uncomment the SparkSession creation code in the setup cell
3. Run the data setup cells first to create sample data
4. Complete the exercise in the provided code cell
5. Test your solution and verify the results

## Data Setup

Run the cells below to set up all the sample data needed for the exercise.

In [0]:
# In Databricks, SparkSession is already available
# For local testing, uncomment the following:

# from pyspark.sql import SparkSession
# spark = SparkSession.builder \
#     .appName("Spark SQL Practice") \
#     .master("local[*]") \
#     .getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
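from pyspark.sql.functions import col, to_timestamp, current_timestamp, expr, date_sub, date_format, sum as spark_sum, max as spark_max, row_number, window
# Window is needed to build the window spec that row_number() runs over
from pyspark.sql.window import Window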
from datetime import datetime, timedelta

print("Setup complete! SparkSession ready.")

Setup complete! SparkSession ready.


In [0]:
# Create customers table
# Schema: customer_id, region

customers_data = [
    (1, "North"),
    (2, "South"),
    (3, "East"),
    (4, "West"),
    (5, "North"),
    (6, "South"),
    (7, "East"),
    (8, "West"),
    (9, "North"),
    (10, "South")
]

customers_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("region", StringType(), True)
])

df_customers = spark.createDataFrame(customers_data, customers_schema)
df_customers.createOrReplaceTempView("customers")

print("Customers table created:")
df_customers.show()

Customers table created:
+-----------+------+
|customer_id|region|
+-----------+------+
|          1| North|
|          2| South|
|          3|  East|
|          4|  West|
|          5| North|
|          6| South|
|          7|  East|
|          8|  West|
|          9| North|
|         10| South|
+-----------+------+



In [0]:
# Create orders table
# Schema: order_id, customer_id, order_ts, amount
# We'll create orders spanning the last 120 days to have data beyond the 90-day window

from datetime import datetime, timedelta
import random

# Get current timestamp
current_ts = datetime.now()

# Generate orders over the last 120 days
orders_data = []
order_id = 1

# Create orders for each customer across different dates
for customer_id in range(1, 11):
    # Create 2-4 orders per customer at different dates
    num_orders = random.randint(2, 4)
    for _ in range(num_orders):
        # Random date within last 120 days
        days_ago = random.randint(0, 120)
        order_date = current_ts - timedelta(days=days_ago)
        order_ts = order_date.strftime("%Y-%m-%d %H:%M:%S")
        amount = round(random.uniform(100, 2000), 2)
        orders_data.append((order_id, customer_id, order_ts, amount))
        order_id += 1

orders_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("order_ts", StringType(), True),
    StructField("amount", DoubleType(), True)
])

df_orders = spark.createDataFrame(orders_data, orders_schema)
# Convert order_ts to timestamp type
df_orders = df_orders.withColumn("order_ts", to_timestamp(col("order_ts"), "yyyy-MM-dd HH:mm:ss"))
df_orders.createOrReplaceTempView("orders")

print("Orders table created:")
df_orders.orderBy("order_ts").show(50, truncate=False)
print(f"\nTotal orders: {df_orders.count()}")

Orders table created:
+--------+-----------+-------------------+-------+
|order_id|customer_id|order_ts           |amount |
+--------+-----------+-------------------+-------+
|2       |1          |2025-09-11 03:44:29|1062.88|
|16      |7          |2025-09-21 03:44:29|965.27 |
|9       |4          |2025-09-23 03:44:29|1589.54|
|14      |6          |2025-10-01 03:44:29|1795.87|
|12      |5          |2025-10-04 03:44:29|1293.37|
|10      |4          |2025-10-17 03:44:29|1008.83|
|11      |5          |2025-10-19 03:44:29|305.81 |
|18      |8          |2025-10-19 03:44:29|1354.15|
|24      |10         |2025-10-21 03:44:29|586.77 |
|4       |2          |2025-10-22 03:44:29|1955.52|
|7       |3          |2025-10-22 03:44:29|1956.53|
|17      |7          |2025-10-28 03:44:29|1807.68|
|1       |1          |2025-11-06 03:44:29|370.41 |
|13      |6          |2025-11-08 03:44:29|225.65 |
|26      |10         |2025-11-16 03:44:29|1876.35|
|3       |1          |2025-12-08 03:44:29|988.53 |
|20      

In [0]:
# Create payments table
# Schema: payment_id, order_id, amount, paid_ts
# Note: An order can have multiple payments (partial payments, refunds, etc.)
# We need to identify the LATEST payment per order

from datetime import datetime, timedelta
import random

# Get current timestamp
current_ts = datetime.now()

payments_data = []
payment_id = 1

# For each order, create 1-3 payments at different times
for order_row in orders_data:
    order_id = order_row[0]
    order_date_str = order_row[2]
    order_date = datetime.strptime(order_date_str, "%Y-%m-%d %H:%M:%S")
    order_amount = order_row[3]
    
    # Create 1-3 payments per order
    num_payments = random.randint(1, 3)
    remaining_amount = order_amount
    
    for i in range(num_payments):
        # Payment date is on or after the order date: 0-30 days plus 0-23 hours later
        days_after_order = random.randint(0, 30)
        payment_date = order_date + timedelta(days=days_after_order, hours=random.randint(0, 23))
        paid_ts = payment_date.strftime("%Y-%m-%d %H:%M:%S")
        
        # The last payment generated (by loop index, not by paid_ts) gets the remaining amount; the others are partial
        if i == num_payments - 1:
            payment_amount = round(remaining_amount, 2)
        else:
            payment_amount = round(random.uniform(0.1, remaining_amount * 0.8), 2)
            remaining_amount -= payment_amount
        
        payments_data.append((payment_id, order_id, payment_amount, paid_ts))
        payment_id += 1

payments_schema = StructType([
    StructField("payment_id", IntegerType(), True),
    StructField("order_id", IntegerType(), True),
    StructField("amount", DoubleType(), True),
    StructField("paid_ts", StringType(), True)
])

df_payments = spark.createDataFrame(payments_data, payments_schema)
# Convert paid_ts to timestamp type
df_payments = df_payments.withColumn("paid_ts", to_timestamp(col("paid_ts"), "yyyy-MM-dd HH:mm:ss"))
df_payments.createOrReplaceTempView("payments")

print("Payments table created:")
df_payments.orderBy("order_id", "paid_ts").show(50, truncate=False)
print(f"\nTotal payments: {df_payments.count()}")

# Show example: multiple payments for same order
print("\nExample: Multiple payments for order_id = 1:")
df_payments.filter(col("order_id") == 1).orderBy("paid_ts").show(truncate=False)

Payments table created:
+----------+--------+-------+-------------------+
|payment_id|order_id|amount |paid_ts            |
+----------+--------+-------+-------------------+
|2         |1       |327.64 |2025-11-11 14:44:29|
|1         |1       |42.77  |2025-11-27 03:44:29|
|3         |2       |1062.88|2025-09-27 09:44:29|
|4         |3       |988.53 |2026-01-04 14:44:29|
|5         |4       |1955.52|2025-11-01 07:44:29|
|6         |5       |645.93 |2025-12-27 08:44:29|
|7         |5       |216.37 |2026-01-11 04:44:29|
|8         |6       |738.81 |2026-01-11 12:44:29|
|9         |6       |1017.34|2026-01-15 14:44:29|
|10        |7       |1956.53|2025-10-26 15:44:29|
|11        |8       |1166.53|2026-01-12 21:44:29|
|12        |9       |1589.54|2025-09-24 03:44:29|
|13        |10      |1008.83|2025-11-11 02:44:29|
|14        |11      |61.61  |2025-10-28 07:44:29|
|16        |11      |216.84 |2025-11-05 14:44:29|
|15        |11      |27.36  |2025-11-06 00:44:29|
|17        |12      |1293.

---

## Practice Question

### Task 1: Monthly Revenue by Region (Last 90 Days)

**Requirement**: For the last 90 days, compute monthly revenue by region based on the **latest payment per order**.

**Key Points to Consider:**
1. Filter to last 90 days based on payment date (`paid_ts`)
2. For each order, use only the **latest payment** (most recent `paid_ts`)
3. Join with customers to get the region
4. Group by month and region
5. Sum the payment amounts
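
The five steps above can be sketched in plain Python on a handful of rows, without Spark, to make the expected logic concrete. This is only an illustration under stated assumptions: `now` is a fixed, assumed "current time", the payment rows mirror a few values from the sample output, and `order_region` is a hypothetical stand-in for the result of joining orders to customers.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed fixed "current time" so the example is reproducible.
now = datetime(2026, 1, 10)
cutoff = now - timedelta(days=90)

# Toy rows: (payment_id, order_id, amount, paid_ts)
payments = [
    (1, 1, 42.77, datetime(2025, 11, 27, 3, 44, 29)),
    (2, 1, 327.64, datetime(2025, 11, 11, 14, 44, 29)),
    (3, 2, 1062.88, datetime(2025, 9, 27, 9, 44, 29)),  # before the cutoff
    (4, 3, 988.53, datetime(2026, 1, 4, 14, 44, 29)),
]
# Hypothetical stand-in for the orders-to-customers join: order_id -> region.
order_region = {1: "North", 2: "North", 3: "North"}

# Step 1: keep payments inside the 90-day window.
recent = [p for p in payments if p[3] >= cutoff]

# Step 2: keep only the latest payment per order.
latest = {}
for payment in recent:
    order_id, paid_ts = payment[1], payment[3]
    if order_id not in latest or paid_ts > latest[order_id][3]:
        latest[order_id] = payment

# Steps 3-5: attach the region, group by (month, region), sum the amounts.
revenue = defaultdict(float)
for _, order_id, amount, paid_ts in latest.values():
    revenue[(paid_ts.strftime("%Y-%m"), order_region[order_id])] += amount

for (month, region), total in sorted(revenue.items()):
    print(month, region, total)
```

Note that order 1 contributes only its latest payment (42.77), not the sum of its payments; that is the behavior the requirement asks for.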

**Tables:**
- `customers(customer_id, region)`
- `orders(order_id, customer_id, order_ts, amount)`
- `payments(payment_id, order_id, amount, paid_ts)`

**Expected Output Columns:**
- `month` (e.g., "2024-01", "2024-02")
- `region`
- `revenue` (sum of latest payment amounts)
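
As a small illustration of the `month` format, the plain-Python snippet below (not Spark) turns a timestamp into a year-month string. It has the same shape as the value produced by `DATE_FORMAT(paid_ts, 'yyyy-MM')`, which the verification queries in this notebook use; the sample timestamp is taken from the payments output.

```python
from datetime import datetime

# One paid_ts value from the sample payments output
paid_ts = datetime(2025, 11, 27, 3, 44, 29)

# Year-month key used for grouping
month = paid_ts.strftime("%Y-%m")
print(month)
```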

**Hints:**
- Use window functions to identify the latest payment per order
- Consider using CTEs (Common Table Expressions) to break down the problem
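- Filter by date before applying window functions: the window then ranks fewer rows, and "latest" is evaluated among the payments inside the 90-day window (matching the order of the Key Points above)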
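
The window-function and CTE hints can be tried out on an unrelated toy table before attempting the exercise. The sketch below uses Python's built-in `sqlite3` module rather than Spark (it assumes the bundled SQLite is version 3.25 or newer, which added window functions). The table `readings` and its columns `sensor`, `taken_at`, and `value` are hypothetical names invented for this illustration. The `WITH` and `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` constructs are also available in Spark SQL, although date handling differs between the two engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, taken_at TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("a", "2025-01-01", 1.0),
        ("a", "2025-03-01", 2.0),
        ("b", "2025-02-01", 5.0),
        ("b", "2025-02-15", 7.0),
    ],
)

query = """
WITH recent AS (            -- filter first, so the window sees fewer rows
    SELECT * FROM readings WHERE taken_at >= '2025-01-15'
),
ranked AS (                 -- rank rows within each sensor, newest first
    SELECT sensor, taken_at, value,
           ROW_NUMBER() OVER (PARTITION BY sensor ORDER BY taken_at DESC) AS rn
    FROM recent
)
SELECT sensor, taken_at, value FROM ranked WHERE rn = 1 ORDER BY sensor
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Each sensor keeps exactly one row, the newest one that survives the date filter. The same pattern, partitioned by `order_id` and ordered by `paid_ts`, is one way to approach the latest-payment step.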

In [0]:
# Your solution here
# Write your Spark SQL query or PySpark code to solve the problem

---

## Verification Queries

Use these queries to verify your understanding and check intermediate results.

In [0]:
# Check: How many payments per order (to verify multiple payments exist)
print("Payments per order (sample):")
spark.sql("""
    SELECT 
        order_id,
        COUNT(*) as payment_count,
        MIN(paid_ts) as first_payment,
        MAX(paid_ts) as latest_payment,
        SUM(amount) as total_paid
    FROM payments
    GROUP BY order_id
    HAVING COUNT(*) > 1
    ORDER BY payment_count DESC
    LIMIT 10
""").show(truncate=False)

In [0]:
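# Check: Date range of payments in last 90 days
# Note: the filter has no upper bound, and the generated data can include payments
# dated after the current time (an order placed today may be paid up to ~30 days later)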
print("Payment date range (last 90 days):")
spark.sql("""
    SELECT 
        MIN(paid_ts) as earliest_payment,
        MAX(paid_ts) as latest_payment,
        COUNT(DISTINCT DATE_FORMAT(paid_ts, 'yyyy-MM')) as distinct_months,
        COUNT(*) as total_payments
    FROM payments
    WHERE paid_ts >= current_timestamp() - INTERVAL 90 DAYS
""").show(truncate=False)

In [0]:
# Check: Sample data exploration
print("Sample customers:")
df_customers.show()

print("\nSample orders:")
df_orders.show(10)

print("\nSample payments:")
df_payments.show(10)