# Interview Challenge 11: Window Functions & Advanced Analytics

## Problem Statement

You are building an analytics dashboard for an e-commerce platform. Use PySpark window functions to calculate customer rankings, running totals, moving averages, and trend analysis across different business dimensions.

## Dataset Description

**Sales Data Schema:**
- `customer_id` (string)
- `order_date` (date)
- `product_category` (string)
- `sales_amount` (double)
- `quantity` (integer)
- `region` (string)

## Tasks

1. **Customer Rankings & Percentiles**
   - Rank customers by total sales within each region
   - Calculate percentile rankings for customer spending
   - Find top 3 customers per region using different ranking methods

2. **Running Totals & Moving Aggregates**
   - Calculate running total of sales by customer over time
   - Compute 7-day moving average of daily sales by category
   - Calculate cumulative sales percentage by region

3. **Time-Based Comparisons**
   - Compare current month sales vs previous month
   - Calculate month-over-month growth rates
   - Identify sales trends and seasonality patterns

4. **Advanced Window Operations**
   - Use lag/lead functions to compare consecutive periods
   - Calculate rolling statistics with custom window frames
   - Implement complex partitioning schemes

## Technical Requirements
- Use multiple window specifications with different partitioning and ordering
- Handle edge cases (first/last rows, null values)
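- Optimize window function performance (for example, reuse window specifications and avoid windows without `partitionBy`, which move all rows into a single partition)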
- Include proper column naming and documentation

## 🚀 Try It Yourself

Implement comprehensive window function analytics. Start with basic rankings, then move to running totals and time-based comparisons.

**Steps to follow:**
1. Set up the sales data and explore it
2. Implement customer rankings and percentiles
3. Add running totals and moving averages
4. Create time-based comparisons and trends
5. Use advanced window operations with lag/lead functions

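**Tip:** Pay attention to window partitioning and ordering: `partitionBy` decides which rows are compared with each other, while `orderBy` decides their sequence and changes the default frame, so both significantly impact results.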

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
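# Explicit imports: a wildcard import would shadow Python built-ins such as sum, min, max, and round
from pyspark.sql import functions as F  # use F.rank(), F.sum(), F.lag(), ... in your solution
from pyspark.sql.functions import to_date, date_format
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType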
from pyspark.sql.window import Window

# Create Spark session
spark = SparkSession.builder \
    .appName("WindowFunctionsChallenge") \
    .getOrCreate()

# Sample sales data
sales_data = [
    ('CUST001', '2023-01-15', 'Electronics', 299.99, 1, 'North'),
    ('CUST002', '2023-01-16', 'Books', 45.50, 2, 'South'),
    ('CUST001', '2023-01-20', 'Clothing', 89.99, 1, 'North'),
    ('CUST003', '2023-01-22', 'Electronics', 599.99, 1, 'East'),
    ('CUST002', '2023-01-25', 'Electronics', 149.99, 1, 'South'),
    ('CUST001', '2023-02-01', 'Books', 25.99, 3, 'North'),
    ('CUST004', '2023-02-05', 'Clothing', 129.99, 1, 'West'),
    ('CUST003', '2023-02-10', 'Electronics', 349.99, 1, 'East'),
    ('CUST002', '2023-02-15', 'Books', 67.99, 1, 'South'),
    ('CUST001', '2023-02-20', 'Electronics', 199.99, 1, 'North'),
    ('CUST005', '2023-02-25', 'Clothing', 79.99, 2, 'North'),
    ('CUST004', '2023-03-01', 'Electronics', 449.99, 1, 'West'),
    ('CUST003', '2023-03-05', 'Books', 32.50, 1, 'East'),
    ('CUST002', '2023-03-10', 'Electronics', 299.99, 1, 'South'),
    ('CUST001', '2023-03-15', 'Clothing', 159.99, 1, 'North')
]

# Define schema
sales_schema = StructType([
    StructField('customer_id', StringType(), True),
    StructField('order_date', StringType(), True),
    StructField('product_category', StringType(), True),
    StructField('sales_amount', DoubleType(), True),
    StructField('quantity', IntegerType(), True),
    StructField('region', StringType(), True)
])

# Create DataFrame
sales_df = spark.createDataFrame(sales_data, sales_schema)
sales_df = sales_df.withColumn('order_date', to_date('order_date'))
sales_df = sales_df.withColumn('month', date_format('order_date', 'yyyy-MM'))

print("Sales data overview:")
sales_df.show()
print(f"Total records: {sales_df.count()}")

# === YOUR SOLUTION GOES HERE ===
# Implement window function analytics

# Task 1: Customer Rankings & Percentiles
# 1a. Rank customers by total sales within each region
# 1b. Calculate percentile rankings
# 1c. Get top 3 customers per region

# Task 2: Running Totals & Moving Aggregates
# 2a. Running total by customer over time
# 2b. 7-day moving average by category
# 2c. Cumulative sales percentage by region

# Task 3: Time-Based Comparisons
# 3a. Month-over-month comparisons
# 3b. Growth rate calculations
# 3c. Seasonal trend analysis

# Task 4: Advanced Window Operations
# 4a. Lag/lead functions for period comparisons
# 4b. Custom window frames
# 4c. Complex partitioning strategies

print("Implement your window function analytics above!")
