# Instacart Market Basket Analysis

This notebook explores the Instacart dataset with PySpark and Delta Lake and summarizes the main findings.

## Dataset Overview
- **Source**: Kaggle Instacart Market Basket Analysis
- **Size**: 3+ million orders from 200,000+ users
- **Goal**: Analyze shopping patterns, product affinity, and customer behavior

## Lakehouse Architecture
- **Bronze**: Raw CSV → Delta Lake (minimal transformation)
- **Silver**: Cleaned, joined, enriched tables
- **Gold**: Aggregated business metrics and ML features
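
As a rough illustration of how the three layers relate, the sketch below walks a few toy rows through Bronze, Silver, and Gold in plain Python, so it runs without a Spark session. The column names (`order_id`, `product_id`, `reordered`, `product_name`) follow the Kaggle CSVs, but the cleaning rule and the Gold metric are assumptions for illustration, not this project's actual pipeline.

```python
from collections import defaultdict

# Bronze: raw rows as they might arrive from CSV (all strings, one malformed row).
bronze_order_products = [
    {"order_id": "1", "product_id": "10", "reordered": "1"},
    {"order_id": "1", "product_id": "11", "reordered": "0"},
    {"order_id": "2", "product_id": "10", "reordered": "1"},
    {"order_id": "3", "product_id": "", "reordered": "0"},  # missing product_id
]
bronze_products = [
    {"product_id": "10", "product_name": "Banana"},
    {"product_id": "11", "product_name": "Whole Milk"},
]

# Silver: cast types, drop malformed rows, join in the product name.
names = {int(p["product_id"]): p["product_name"] for p in bronze_products}
silver = [
    {
        "order_id": int(r["order_id"]),
        "product_id": int(r["product_id"]),
        "product_name": names[int(r["product_id"])],
        "reordered": int(r["reordered"]),
    }
    for r in bronze_order_products
    if r["product_id"]  # assumed cleaning rule: skip rows without a product
]

# Gold: one aggregated row per product.
totals = defaultdict(lambda: {"total_orders": 0, "reorders": 0})
for r in silver:
    totals[r["product_name"]]["total_orders"] += 1
    totals[r["product_name"]]["reorders"] += r["reordered"]

gold = {
    name: {
        "total_orders": t["total_orders"],
        "reorder_rate": t["reorders"] / t["total_orders"],
    }
    for name, t in totals.items()
}
```

In the real pipeline each layer would be a Delta table rather than a Python list, but the division of responsibilities is the same.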

## 1. Setup - Create Spark Session

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, sum as spark_sum, avg, desc, round as spark_round
from delta import configure_spark_with_delta_pip
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Create Spark session with Delta Lake
builder = SparkSession.builder \
    .appName("InstacartAnalysis") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.driver.memory", "4g")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

print(f"Spark version: {spark.version}")
print("✓ Spark session created successfully")

## 2. Load Gold Layer Tables

We'll work with pre-aggregated Gold tables for fast analytics.

In [None]:
# Load Gold layer tables
product_metrics = spark.read.format("delta").load("../data/gold/product_metrics")
department_metrics = spark.read.format("delta").load("../data/gold/department_metrics")
user_features = spark.read.format("delta").load("../data/gold/user_purchase_features")
product_pairs = spark.read.format("delta").load("../data/gold/product_pairs_affinity")

print(f"✓ Loaded {product_metrics.count():,} products")
print(f"✓ Loaded {department_metrics.count():,} departments")
print(f"✓ Loaded {user_features.count():,} users")
print(f"✓ Loaded {product_pairs.count():,} product pairs")

## 3. Top Products Analysis

Which products are most frequently ordered?

In [None]:
# Get top 20 most ordered products
top_products = product_metrics \
    .select("product_name", "total_orders", "reorder_rate", "unique_customers") \
    .orderBy(desc("total_orders")) \
    .limit(20)

top_products.show(20, truncate=False)

In [None]:
# Visualize top 10 products
top_10_pdf = top_products.limit(10).toPandas()

plt.figure(figsize=(12, 6))
sns.barplot(data=top_10_pdf, y='product_name', x='total_orders', palette='viridis')
plt.title('Top 10 Most Ordered Products', fontsize=16, fontweight='bold')
plt.xlabel('Total Orders', fontsize=12)
plt.ylabel('Product', fontsize=12)
plt.tight_layout()
plt.show()

## 4. Reorder Rate Analysis

Which products have the highest reorder rates?
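
Reorder rate is treated here as the share of a product's order lines that are flagged as reorders. That definition is an assumption, since the Gold table's exact formula is not shown in this notebook. The toy example below, in plain Python with made-up counts, suggests why a minimum-order threshold matters: without it, a rarely ordered product with a perfect rate tops the ranking.

```python
# Hypothetical per-product counts: (product_name, total_orders, reorders).
toy_products = [
    ("Banana", 5000, 4200),
    ("Organic Spinach", 2000, 1500),
    ("Niche Hot Sauce", 3, 3),  # tiny sample, perfect rate
]

def rank_by_reorder_rate(rows, min_orders=0):
    """Return (name, rate) pairs sorted by reorder rate, highest first."""
    eligible = [
        (name, reorders / total)
        for name, total, reorders in rows
        if total >= min_orders and total > 0
    ]
    return sorted(eligible, key=lambda pair: pair[1], reverse=True)

unfiltered = rank_by_reorder_rate(toy_products)
filtered = rank_by_reorder_rate(toy_products, min_orders=1000)
```

With no threshold the three-order product ranks first. With `min_orders=1000` it drops out and the ranking reflects products with enough volume for the rate to be stable.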

In [None]:
# Products with highest reorder rates (min 1,000 orders so small samples don't skew the ranking)
high_reorder = product_metrics \
    .filter(col("total_orders") >= 1000) \
    .select("product_name", "total_orders", "reorder_rate", "aisle", "department") \
    .orderBy(desc("reorder_rate")) \
    .limit(15)

high_reorder.show(15, truncate=False)

## 5. Department Performance

How do different departments perform?

In [None]:
# Department metrics
dept_pdf = department_metrics \
    .orderBy(desc("total_orders")) \
    .toPandas()

# Visualize department performance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Total orders by department
sns.barplot(data=dept_pdf.head(10), y='department', x='total_orders', ax=axes[0], palette='coolwarm')
axes[0].set_title('Top 10 Departments by Orders', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Total Orders')

# Reorder rate by department
sns.barplot(data=dept_pdf.sort_values('avg_reorder_rate', ascending=False).head(10), 
            y='department', x='avg_reorder_rate', ax=axes[1], palette='plasma')
axes[1].set_title('Top 10 Departments by Reorder Rate', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Avg Reorder Rate')

plt.tight_layout()
plt.show()

## 6. Basket Analysis - Product Affinity

Which products are frequently bought together?
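
The `pair_count` column presumably holds the number of orders that contain both products. The sketch below shows one way such counts could be derived from baskets in plain Python; the baskets are made up, and the actual Gold-table logic may differ. It also computes lift, a common affinity measure that is not part of this notebook's tables, to show how raw counts can be adjusted for product popularity.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: order_id -> products in that order.
toy_baskets = {
    1: ["Banana", "Organic Strawberries", "Whole Milk"],
    2: ["Banana", "Organic Strawberries"],
    3: ["Banana", "Whole Milk"],
    4: ["Organic Strawberries"],
}

toy_pair_counts = Counter()
toy_item_counts = Counter()
for items in toy_baskets.values():
    unique_items = sorted(set(items))  # dedupe and fix a canonical order
    toy_item_counts.update(unique_items)
    toy_pair_counts.update(combinations(unique_items, 2))

def lift(a, b):
    """Lift > 1 means a and b co-occur more often than independence predicts."""
    n = len(toy_baskets)
    pair = tuple(sorted((a, b)))
    support_pair = toy_pair_counts[pair] / n
    return support_pair / ((toy_item_counts[a] / n) * (toy_item_counts[b] / n))

most_common_pairs = toy_pair_counts.most_common(3)
```

Sorting each basket before generating combinations ensures that (A, B) and (B, A) are counted as the same pair.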

In [None]:
# Top product pairs
top_pairs = product_pairs \
    .select("product_a_name", "product_b_name", "pair_count") \
    .orderBy(desc("pair_count")) \
    .limit(20)

print("Top 20 Product Pairs (Frequently Bought Together):")
top_pairs.show(20, truncate=False)

## 7. Customer Segmentation

Analyze customer purchase behavior patterns.
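
The per-user columns used below (`total_orders`, `avg_basket_size`, `reorder_propensity`, `departments_shopped`) come from the Gold layer, and their exact definitions are not shown here. The following plain-Python sketch shows one plausible way to derive them from order lines; the definitions and the toy data are assumptions.

```python
from collections import defaultdict

# Hypothetical order lines: (user_id, order_id, department, reordered flag).
toy_order_lines = [
    (1, 100, "produce", 0),
    (1, 100, "dairy eggs", 0),
    (1, 101, "produce", 1),
    (1, 101, "produce", 1),
    (2, 200, "snacks", 0),
]

acc = defaultdict(
    lambda: {"orders": set(), "items": 0, "reorders": 0, "depts": set()}
)
for user_id, order_id, department, reordered in toy_order_lines:
    u = acc[user_id]
    u["orders"].add(order_id)
    u["items"] += 1
    u["reorders"] += reordered
    u["depts"].add(department)

toy_user_features = {
    user_id: {
        "total_orders": len(u["orders"]),                   # distinct orders
        "avg_basket_size": u["items"] / len(u["orders"]),   # items per order
        "reorder_propensity": u["reorders"] / u["items"],   # share of reordered items
        "departments_shopped": len(u["depts"]),             # distinct departments
    }
    for user_id, u in acc.items()
}
```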

In [None]:
# Summary statistics for customer features
user_stats = user_features.select(
    avg("total_orders").alias("avg_orders"),
    avg("avg_basket_size").alias("avg_basket_size"),
    avg("reorder_propensity").alias("avg_reorder_propensity"),
    avg("departments_shopped").alias("avg_departments")
).collect()[0]

print("Customer Statistics:")
print(f"  Avg Orders per Customer: {user_stats['avg_orders']:.2f}")
print(f"  Avg Basket Size: {user_stats['avg_basket_size']:.2f}")
print(f"  Avg Reorder Propensity: {user_stats['avg_reorder_propensity']:.4f}")
print(f"  Avg Departments Shopped: {user_stats['avg_departments']:.2f}")

In [None]:
# Sample 10% of users and convert to pandas for visualization
# (a fixed seed keeps the sample reproducible across runs)
user_pdf = user_features.sample(fraction=0.1, seed=42).toPandas()

# Visualize customer distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Basket size distribution
axes[0, 0].hist(user_pdf['avg_basket_size'], bins=50, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Distribution of Avg Basket Size', fontweight='bold')
axes[0, 0].set_xlabel('Avg Basket Size')
axes[0, 0].set_ylabel('Frequency')

# Reorder propensity distribution
axes[0, 1].hist(user_pdf['reorder_propensity'], bins=50, color='lightcoral', edgecolor='black')
axes[0, 1].set_title('Distribution of Reorder Propensity', fontweight='bold')
axes[0, 1].set_xlabel('Reorder Propensity')
axes[0, 1].set_ylabel('Frequency')

# Total orders distribution
axes[1, 0].hist(user_pdf['total_orders'], bins=50, color='lightgreen', edgecolor='black')
axes[1, 0].set_title('Distribution of Total Orders', fontweight='bold')
axes[1, 0].set_xlabel('Total Orders')
axes[1, 0].set_ylabel('Frequency')

# Departments shopped distribution
axes[1, 1].hist(user_pdf['departments_shopped'], bins=30, color='plum', edgecolor='black')
axes[1, 1].set_title('Distribution of Departments Shopped', fontweight='bold')
axes[1, 1].set_xlabel('Number of Departments')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## 8. Key Insights & Conclusions

### Product Insights
- Fresh produce (bananas, organic items) dominates orders
- High reorder rates indicate strong customer loyalty for certain products

### Customer Behavior
- Most customers have consistent basket sizes
- Reorder propensity varies widely across customers, which creates an opportunity for personalization

### Business Recommendations
1. **Cross-sell opportunities**: Use product affinity data for recommendations
2. **Inventory optimization**: Focus on high-reorder products
3. **Marketing**: Target low-reorder customers with promotions
4. **Personalization**: Segment customers by basket size and shopping patterns
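
Recommendations 3 and 4 could start from a simple rule-based segmentation like the sketch below. The function name, the segment labels, and both cutoffs are hypothetical; in practice the cutoffs would be chosen from the distributions plotted in section 7.

```python
def segment_customer(avg_basket_size, reorder_propensity,
                     basket_cutoff=10.0, reorder_cutoff=0.5):
    """Label a customer on two axes; both cutoffs are illustrative assumptions."""
    size = "large-basket" if avg_basket_size >= basket_cutoff else "small-basket"
    loyalty = "high-reorder" if reorder_propensity >= reorder_cutoff else "low-reorder"
    return f"{size}/{loyalty}"

# Hypothetical customers: user_id -> (avg_basket_size, reorder_propensity).
toy_customers = {1: (14.2, 0.71), 2: (6.5, 0.18), 3: (11.0, 0.35)}

segments = {uid: segment_customer(*feats) for uid, feats in toy_customers.items()}

# Recommendation 3: promotion targets are the low-reorder customers.
promo_targets = sorted(
    uid for uid, label in segments.items() if label.endswith("low-reorder")
)
```

A rule-based split like this is easy to explain to stakeholders; the K-means clustering listed under Next Steps would be the data-driven alternative.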

## Next Steps

1. **Build ML models**: Predict next purchase, reorder probability
2. **Time-series analysis**: Shopping patterns by day/hour
3. **Advanced segmentation**: K-means clustering on user features
4. **Recommendation engine**: Collaborative filtering with product pairs

In [None]:
# Stop Spark session (optional - uncomment to stop)
# spark.stop()
print("Analysis complete!")