# Interview Challenge 12: SQL & Temp Views Integration

## Problem Statement

Build a reporting dashboard that combines DataFrame operations with SQL queries using temporary views. Create complex analytics that leverage both the DataFrame API and Spark SQL for optimal performance and readability.

## Dataset Description

**Customer Orders Dataset** (four related tables):
- `orders`: `order_id`, `customer_id`, `order_date`, `total_amount`, `status`
- `customers`: `customer_id`, `name`, `email`, `signup_date`, `country`
- `products`: `product_id`, `product_name`, `category`, `price`
- `order_items`: `order_id`, `product_id`, `quantity`, `unit_price`

## Tasks

1. **Create Temporary Views**
   - Register DataFrames as temporary views
   - Create global temporary views, which are shared across sessions of the same Spark application and queried through the `global_temp` database
   - Understand view lifecycle and scoping

2. **SQL Query Integration**
   - Write complex SQL queries with JOINs, CTEs, and window functions
   - Combine DataFrame transformations with SQL queries
   - Use subqueries and advanced SQL features

3. **Hybrid Processing**
   - Mix DataFrame API and SQL for optimal performance
   - Create views from DataFrame operations for SQL access
   - Implement complex business logic using both approaches

4. **Advanced Analytics**
   - Customer lifetime value calculations
   - Product performance analysis
   - Geographic sales analysis
   - Time-based trend analysis

## Technical Requirements
- Use `createOrReplaceTempView()` and `createOrReplaceGlobalTempView()`
- Implement complex SQL with CTEs, window functions, and subqueries
- Combine DataFrame operations with SQL queries effectively
- Compare the performance of equivalent DataFrame API and SQL queries, for example by inspecting their plans with `explain()`

## 🚀 Try It Yourself

Build a comprehensive reporting solution using both the DataFrame API and Spark SQL. Start by creating temp views, then implement complex analytics using SQL queries.

**Steps to follow:**
1. Load and prepare the data using DataFrames
2. Create temporary views for SQL access
3. Write complex SQL queries with JOINs and aggregations
4. Implement advanced analytics using CTEs and window functions
5. Combine DataFrame operations with SQL results

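**Tip:** Consider when to use the DataFrame API and when to use SQL. Both are planned by the same Catalyst optimizer, so equivalent queries usually perform alike; the choice mostly affects readability and maintainability.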

In [None]:
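# Import required libraries
# Explicit imports avoid shadowing Python built-ins such as sum, min, and max.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # available for your solution (F.col, F.sum, ...)
from pyspark.sql.functions import to_date
from pyspark.sql.types import (
    DoubleType, IntegerType, StringType, StructField, StructType
)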

# Create Spark session (temp views and global temp views do not require Hive support)
spark = SparkSession.builder \
    .appName("SQLTempViewsChallenge") \
    .getOrCreate()

# Sample data for orders, customers, products, and order items
orders_data = [
    ('ORD001', 'CUST001', '2023-01-15', 299.99, 'completed'),
    ('ORD002', 'CUST002', '2023-01-16', 145.50, 'completed'),
    ('ORD003', 'CUST001', '2023-01-20', 89.99, 'completed'),
    ('ORD004', 'CUST003', '2023-01-22', 599.99, 'pending'),
    ('ORD005', 'CUST002', '2023-01-25', 149.99, 'completed'),
    ('ORD006', 'CUST001', '2023-02-01', 77.97, 'completed'),
    ('ORD007', 'CUST004', '2023-02-05', 129.99, 'completed'),
    ('ORD008', 'CUST003', '2023-02-10', 349.99, 'completed'),
    ('ORD009', 'CUST002', '2023-02-15', 67.99, 'cancelled'),
    ('ORD010', 'CUST001', '2023-02-20', 199.99, 'completed')
]

customers_data = [
    ('CUST001', 'John Doe', 'john@email.com', '2020-01-15', 'US'),
    ('CUST002', 'Jane Smith', 'jane@email.com', '2020-03-20', 'UK'),
    ('CUST003', 'Bob Johnson', 'bob@email.com', '2020-02-10', 'US'),
    ('CUST004', 'Alice Brown', 'alice@email.com', '2020-04-05', 'DE')
]

products_data = [
    ('PROD001', 'Laptop Pro', 'Electronics', 299.99),
    ('PROD002', 'Wireless Headphones', 'Electronics', 149.99),
    ('PROD003', 'Office Chair', 'Furniture', 349.99),
    ('PROD004', 'Coffee Mug', 'Kitchen', 12.99),
    ('PROD005', 'Running Shoes', 'Sports', 89.99)
]

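# Note: in this sample, the unit_price values of each order sum to
# orders.total_amount, so rows with quantity > 1 (ORD006, ORD007) behave like
# line totals rather than per-unit prices. ORD009 (cancelled) has no items.
order_items_data = [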
    ('ORD001', 'PROD001', 1, 299.99),
    ('ORD002', 'PROD002', 1, 145.50),
    ('ORD003', 'PROD005', 1, 89.99),
    ('ORD004', 'PROD003', 1, 349.99),
    ('ORD004', 'PROD004', 1, 250.00),
    ('ORD005', 'PROD002', 1, 149.99),
    ('ORD006', 'PROD004', 6, 77.97),
    ('ORD007', 'PROD005', 1, 89.99),
    ('ORD007', 'PROD004', 3, 40.00),
    ('ORD008', 'PROD003', 1, 349.99),
    ('ORD010', 'PROD002', 1, 199.99)
]

# Define schemas
orders_schema = StructType([
    StructField('order_id', StringType(), True),
    StructField('customer_id', StringType(), True),
    StructField('order_date', StringType(), True),
    StructField('total_amount', DoubleType(), True),
    StructField('status', StringType(), True)
])

customers_schema = StructType([
    StructField('customer_id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('email', StringType(), True),
    StructField('signup_date', StringType(), True),
    StructField('country', StringType(), True)
])

products_schema = StructType([
    StructField('product_id', StringType(), True),
    StructField('product_name', StringType(), True),
    StructField('category', StringType(), True),
    StructField('price', DoubleType(), True)
])

order_items_schema = StructType([
    StructField('order_id', StringType(), True),
    StructField('product_id', StringType(), True),
    StructField('quantity', IntegerType(), True),
    StructField('unit_price', DoubleType(), True)
])

# Create DataFrames
orders_df = spark.createDataFrame(orders_data, orders_schema)
customers_df = spark.createDataFrame(customers_data, customers_schema)
products_df = spark.createDataFrame(products_data, products_schema)
order_items_df = spark.createDataFrame(order_items_data, order_items_schema)

# Convert dates
orders_df = orders_df.withColumn('order_date', to_date('order_date'))
customers_df = customers_df.withColumn('signup_date', to_date('signup_date'))

print("Data loaded successfully!")
print(f"Orders: {orders_df.count()}, Customers: {customers_df.count()}, Products: {products_df.count()}, Order Items: {order_items_df.count()}")

# === YOUR SOLUTION GOES HERE ===
# Implement SQL & Temp Views integration

# Task 1: Create Temporary Views
# Create temp views for SQL access

# Task 2: SQL Query Integration
# Write complex SQL queries using the temp views

# Task 3: Hybrid Processing
# Combine DataFrame operations with SQL queries

# Task 4: Advanced Analytics
# Implement complex business analytics using SQL

print("Implement your SQL & Temp Views solution above!")
