# Interview Challenge 8: Advanced DataFrame Operations & Window Functions

## Problem Statement

You are working with a large e-commerce dataset and need to analyze customer behavior, product performance, and sales trends. Use the PySpark DataFrame API and window functions to answer the business questions below.

## Dataset Description

**Orders Table:**
- `order_id` (string) - Unique order identifier
- `customer_id` (string) - Customer who placed the order
- `order_date` (date) - Date when order was placed
- `product_id` (string) - Product that was ordered
- `quantity` (integer) - Number of items ordered
- `unit_price` (double) - Price per unit
- `total_amount` (double) - Total order amount (`quantity` × `unit_price` in the sample data)

**Customers Table:**
- `customer_id` (string)
- `registration_date` (date)
- `country` (string)
- `customer_segment` (string)

**Products Table:**
- `product_id` (string)
- `category` (string)
- `subcategory` (string)
- `brand` (string)

## Tasks

1. **Customer Analytics**
   - Calculate customer lifetime value (LTV) using window functions
   - Identify top 10 customers by total spending in each country
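   - Calculate customer retention rate (the share of customers who ordered in one month and ordered again in the following month)
   - Find customers who haven't ordered in the last 30 days, measured from a reference date such as the latest `order_date` in the data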

2. **Product Performance**
   - Calculate product sales ranking within each category using window functions
   - Identify products with declining sales (compare current month vs previous month)
   - Calculate product category performance trends over time
   - Find best-selling products by customer segment

3. **Time Series Analysis**
   - Calculate 7-day and 30-day moving averages for daily sales
   - Identify sales seasonality patterns (monthly trends)
   - Calculate year-over-year growth rates
   - Find peak sales periods

4. **Advanced Aggregations**
   - Calculate percentile rankings for customer spending
   - Implement custom aggregation functions
   - Handle complex grouping and pivoting operations
   - Calculate basket analysis metrics

## Technical Requirements
- Use the PySpark DataFrame API extensively
- Implement window functions for ranking, running totals, and moving averages
- Use appropriate join strategies (for example, broadcast joins for small dimension tables)
- Optimize for performance with proper partitioning
- Handle null values and edge cases
- Include comments explaining your approach

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
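# Namespace the functions module: a wildcard import would shadow Python
# built-ins such as sum, min, max, and round
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date
from pyspark.sql.types import (
    StructType, StructField, StringType, DateType, IntegerType, DoubleType
)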
from pyspark.sql.window import Window

# Create Spark session
spark = SparkSession.builder \
    .appName("AdvancedDataFrameChallenge") \
    .getOrCreate()

# Sample data
orders_data = [
    ('ORD001', 'CUST001', '2023-01-15', 'PROD001', 2, 50.0, 100.0),
    ('ORD002', 'CUST001', '2023-01-20', 'PROD002', 1, 75.0, 75.0),
    ('ORD003', 'CUST002', '2023-01-18', 'PROD001', 3, 50.0, 150.0),
    ('ORD004', 'CUST003', '2023-02-10', 'PROD003', 1, 200.0, 200.0),
    ('ORD005', 'CUST001', '2023-02-15', 'PROD001', 1, 50.0, 50.0),
    ('ORD006', 'CUST004', '2023-02-20', 'PROD002', 2, 75.0, 150.0),
    ('ORD007', 'CUST002', '2023-03-05', 'PROD003', 1, 200.0, 200.0),
    ('ORD008', 'CUST005', '2023-03-12', 'PROD001', 4, 50.0, 200.0)
]

customers_data = [
    ('CUST001', '2020-01-15', 'US', 'Premium'),
    ('CUST002', '2020-03-20', 'UK', 'Standard'),
    ('CUST003', '2020-02-10', 'US', 'Premium'),
    ('CUST004', '2020-04-05', 'DE', 'Standard'),
    ('CUST005', '2020-01-30', 'US', 'Basic')
]

products_data = [
    ('PROD001', 'Electronics', 'Laptops', 'BrandA'),
    ('PROD002', 'Electronics', 'Phones', 'BrandB'),
    ('PROD003', 'Home', 'Furniture', 'BrandC')
]

# Define schemas
orders_schema = StructType([
    StructField('order_id', StringType(), True),
    StructField('customer_id', StringType(), True),
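    # Loaded as a string because the sample rows hold ISO date strings;
    # converted to a date with to_date() after the DataFrame is created
    StructField('order_date', StringType(), True),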
    StructField('product_id', StringType(), True),
    StructField('quantity', IntegerType(), True),
    StructField('unit_price', DoubleType(), True),
    StructField('total_amount', DoubleType(), True)
])

customers_schema = StructType([
    StructField('customer_id', StringType(), True),
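    # Loaded as a string because the sample rows hold ISO date strings;
    # converted to a date with to_date() after the DataFrame is created
    StructField('registration_date', StringType(), True),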
    StructField('country', StringType(), True),
    StructField('customer_segment', StringType(), True)
])

products_schema = StructType([
    StructField('product_id', StringType(), True),
    StructField('category', StringType(), True),
    StructField('subcategory', StringType(), True),
    StructField('brand', StringType(), True)
])

# Create DataFrames
orders_df = spark.createDataFrame(orders_data, orders_schema)
customers_df = spark.createDataFrame(customers_data, customers_schema)
products_df = spark.createDataFrame(products_data, products_schema)

# Convert string dates to date type
orders_df = orders_df.withColumn('order_date', to_date('order_date'))
customers_df = customers_df.withColumn('registration_date', to_date('registration_date'))

print("Data loaded successfully!")
orders_df.show(5)

# === YOUR SOLUTION GOES HERE ===
# Implement the advanced DataFrame operations

# Task 1: Customer Analytics with Window Functions
# 1a. Calculate customer lifetime value (running total)
# 1b. Top 10 customers by spending per country
# 1c. Customer retention (consecutive month orders)
# 1d. Inactive customers (no orders in last 30 days)

# Task 2: Product Performance Analysis
# 2a. Product ranking within categories
# 2b. Declining sales detection
# 2c. Category performance trends
# 2d. Segment-based product performance

# Task 3: Time Series Analytics
# 3a. Moving averages (7-day, 30-day)
# 3b. Monthly seasonality patterns
# 3c. Year-over-year growth
# 3d. Peak sales periods

# Task 4: Advanced Aggregations
# 4a. Percentile rankings
# 4b. Custom aggregation functions
# 4c. Complex grouping operations
# 4d. Basket analysis metrics

print("Implement your advanced DataFrame operations above!")
