# Instacart Market Basket Analysis - Databricks

Interactive exploratory data analysis and insights from the Instacart dataset.

**Prerequisites:**
- Gold layer aggregation completed
- Tables exist in `/FileStore/instacart/gold/`

**Visualizations:**
- Top products analysis
- Reorder rate analysis
- Department performance
- Customer behavior and segmentation
- Basket analysis (product affinity)
- Shopping time patterns

In [None]:
# Configuration
GOLD_PATH = "/FileStore/instacart/gold"

print(f"Gold data path: {GOLD_PATH}")
print(f"Spark version: {spark.version}")

## Load Gold Tables

Read pre-aggregated Gold tables for fast analytics.

In [None]:
# Load Gold tables
product_metrics = spark.read.format("delta").load(f"{GOLD_PATH}/product_metrics")
department_metrics = spark.read.format("delta").load(f"{GOLD_PATH}/department_metrics")
user_features = spark.read.format("delta").load(f"{GOLD_PATH}/user_purchase_features")
product_pairs = spark.read.format("delta").load(f"{GOLD_PATH}/product_pairs_affinity")

print(f"✓ Loaded {product_metrics.count():,} products")
print(f"✓ Loaded {department_metrics.count():,} departments")
print(f"✓ Loaded {user_features.count():,} users")
print(f"✓ Loaded {product_pairs.count():,} product pairs")

## 1. Top Products Analysis

Which products are most frequently ordered?

In [None]:
-- Top 20 products by order count
SELECT 
  product_name,
  total_orders,
  unique_customers,
  ROUND(reorder_rate * 100, 2) as reorder_rate_pct
FROM delta.`/FileStore/instacart/gold/product_metrics`
ORDER BY total_orders DESC
LIMIT 20

In [None]:
# Visualize top 10 products
from pyspark.sql.functions import col

top_10 = product_metrics \
    .select("product_name", "total_orders") \
    .orderBy(col("total_orders").desc()) \
    .limit(10)

display(top_10)

## 2. Reorder Rate Analysis

Products with highest customer loyalty (reorder rates).

In [None]:
-- Products with highest reorder rates (min 1000 orders)
SELECT 
  product_name,
  total_orders,
  ROUND(reorder_rate * 100, 2) as reorder_rate_pct,
  unique_customers,
  aisle,
  department
FROM delta.`/FileStore/instacart/gold/product_metrics`
WHERE total_orders >= 1000
ORDER BY reorder_rate DESC
LIMIT 20
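
The `total_orders >= 1000` filter matters because a rate computed from a handful of orders is noisy: a product ordered once and reordered once would top the ranking at 100%. The plain-Python sketch below illustrates the effect on toy data. It assumes `reorder_rate` is the share of a product's order lines flagged as reordered, which the Gold table definition (not shown in this notebook) may or may not match; `reorder_rates` and `min_orders` are hypothetical names.

```python
from collections import defaultdict

def reorder_rates(order_lines, min_orders=1):
    """Return {product: reorder rate} for products with at least min_orders lines.

    order_lines: iterable of (product_name, reordered), with reordered in {0, 1}.
    Assumption: reorder rate = reordered lines / total lines for the product.
    """
    totals = defaultdict(int)
    reorders = defaultdict(int)
    for product, reordered in order_lines:
        totals[product] += 1
        reorders[product] += reordered
    return {
        product: reorders[product] / totals[product]
        for product in totals
        if totals[product] >= min_orders
    }

# Toy data: one rarely ordered product with a "perfect" reorder rate
lines = [("Banana", 1), ("Banana", 1), ("Banana", 0), ("Banana", 1),
         ("Saffron", 1)]

print(reorder_rates(lines))                # {'Banana': 0.75, 'Saffron': 1.0}
print(reorder_rates(lines, min_orders=3))  # {'Banana': 0.75}
```

The right threshold depends on the dataset size; 1000 is the value used in the query above, not a general rule.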

## 3. Department Performance

How do different departments perform?

In [None]:
-- Department metrics
SELECT 
  department,
  total_orders,
  unique_products,
  unique_customers,
  ROUND(avg_reorder_rate * 100, 2) as avg_reorder_rate_pct
FROM delta.`/FileStore/instacart/gold/department_metrics`
ORDER BY total_orders DESC

In [None]:
# Visualize department performance
from pyspark.sql.functions import col

display(department_metrics.orderBy(col("total_orders").desc()))

## 4. Customer Behavior Statistics

Summary statistics for customer purchase patterns.

In [None]:
from pyspark.sql.functions import avg, min as spark_min, max as spark_max

# Calculate customer statistics
user_stats = user_features.select(
    avg("total_orders").alias("avg_orders"),
    avg("avg_basket_size").alias("avg_basket_size"),
    avg("reorder_propensity").alias("avg_reorder_propensity"),
    avg("departments_shopped").alias("avg_departments"),
    spark_min("total_orders").alias("min_orders"),
    spark_max("total_orders").alias("max_orders")
).collect()[0]

print("=" * 60)
print("CUSTOMER STATISTICS")
print("=" * 60)
print(f"Avg Orders per Customer: {user_stats['avg_orders']:.2f}")
print(f"Avg Basket Size: {user_stats['avg_basket_size']:.2f}")
print(f"Avg Reorder Propensity: {user_stats['avg_reorder_propensity']:.4f}")
print(f"Avg Departments Shopped: {user_stats['avg_departments']:.2f}")
print(f"Order Range: {user_stats['min_orders']} - {user_stats['max_orders']}")

## 5. Customer Segmentation Distribution

Visualize customer purchase behavior distributions.

In [None]:
# Sample users for visualization (roughly 10%; the fixed seed makes the sample reproducible)
user_sample = user_features.sample(fraction=0.1, seed=42)

print(f"Analyzing {user_sample.count():,} sampled users")
display(user_sample)

In [None]:
-- Customer segmentation by basket size
SELECT 
  CASE 
    WHEN avg_basket_size < 5 THEN 'Small (<5)'
    WHEN avg_basket_size < 10 THEN 'Medium (5-10)'
    WHEN avg_basket_size < 20 THEN 'Large (10-20)'
    ELSE 'Very Large (20+)'
  END as basket_size_segment,
  COUNT(*) as customer_count,
  AVG(reorder_propensity) as avg_reorder_rate
FROM delta.`/FileStore/instacart/gold/user_purchase_features`
GROUP BY basket_size_segment
ORDER BY customer_count DESC
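
The `CASE` expression above evaluates its branches in order, so each boundary value lands in the higher segment (exactly 5 is "Medium", exactly 10 is "Large"). One subtlety: in SQL, a `NULL` basket size fails every `WHEN` comparison and falls through to `ELSE`, so it would be counted as "Very Large (20+)". The sketch below mirrors the same bucketing in plain Python and routes missing values to a separate "Unknown" label; that label and the `basket_segment` helper are illustrative additions, not part of the notebook.

```python
from collections import Counter

def basket_segment(avg_basket_size):
    """Mirror the SQL CASE bucketing, but keep missing values separate."""
    if avg_basket_size is None:
        return "Unknown"  # the SQL version would fall through to ELSE instead
    if avg_basket_size < 5:
        return "Small (<5)"
    if avg_basket_size < 10:
        return "Medium (5-10)"
    if avg_basket_size < 20:
        return "Large (10-20)"
    return "Very Large (20+)"

# Toy values chosen to hit each boundary
sizes = [4.9, 5.0, 9.99, 10.0, 20.0, None]
segment_counts = Counter(basket_segment(s) for s in sizes)
print(segment_counts)
```

If `avg_basket_size` is never `NULL` in the Gold table, the two versions agree on every row.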

## 6. Basket Analysis - Frequently Bought Together

Which products are commonly purchased in the same order?

In [None]:
-- Top 20 product pairs
SELECT 
  product_a_name,
  product_b_name,
  pair_count
FROM delta.`/FileStore/instacart/gold/product_pairs_affinity`
ORDER BY pair_count DESC
LIMIT 20

In [None]:
# Visualize top product pairs
from pyspark.sql.functions import col

top_pairs = product_pairs \
    .select("product_a_name", "product_b_name", "pair_count") \
    .orderBy(col("pair_count").desc()) \
    .limit(15)

display(top_pairs)
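
Raw `pair_count` tends to favor products that are popular on their own, since a best-seller co-occurs with almost everything. A common complement is lift, which divides the observed co-occurrence rate by the rate expected if the two products were bought independently. The sketch below shows pair counting and lift on toy baskets in plain Python. It is not how the `product_pairs_affinity` table was built (that logic is not shown here), and `pair_stats` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

def pair_stats(baskets):
    """Return {(a, b): (pair_count, lift)} for every product pair seen together.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the pair occurs
    together more often than independence would predict.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # dedupe and fix the order within a pair
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = (count, lift)
    return stats

baskets = [
    ["Banana", "Strawberries"],
    ["Banana", "Avocado"],
    ["Banana", "Strawberries", "Avocado"],
    ["Limes", "Cilantro"],
]

stats = pair_stats(baskets)
for pair, (count, lift) in sorted(stats.items(), key=lambda kv: -kv[1][1]):
    print(pair, count, round(lift, 2))
```

In this toy data the Banana pairs have the highest counts, but Cilantro and Limes have the highest lift because they only ever appear together. On real data, lift is usually combined with a minimum pair count so that rare coincidences do not dominate.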

## 7. Shopping Time Patterns (if available)

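Analyze when customers prefer to shop. Both queries bucket each customer by their `avg_order_hour` and `avg_order_dow` values truncated to an integer, so they count customers per typical hour or day rather than individual orders.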

In [None]:
-- Most popular shopping hours (from user features)
SELECT 
  CAST(avg_order_hour as INT) as hour_of_day,
  COUNT(*) as customer_count
FROM delta.`/FileStore/instacart/gold/user_purchase_features`
WHERE avg_order_hour IS NOT NULL
GROUP BY hour_of_day
ORDER BY hour_of_day

In [None]:
-- Most popular shopping days (0=Sunday, 6=Saturday)
SELECT 
  CAST(avg_order_dow as INT) as day_of_week,
  CASE CAST(avg_order_dow as INT)
    WHEN 0 THEN 'Sunday'
    WHEN 1 THEN 'Monday'
    WHEN 2 THEN 'Tuesday'
    WHEN 3 THEN 'Wednesday'
    WHEN 4 THEN 'Thursday'
    WHEN 5 THEN 'Friday'
    WHEN 6 THEN 'Saturday'
  END as day_name,
  COUNT(*) as customer_count
FROM delta.`/FileStore/instacart/gold/user_purchase_features`
WHERE avg_order_dow IS NOT NULL
GROUP BY day_of_week, day_name
ORDER BY day_of_week
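
One caveat when reading the day-of-week result: if `avg_order_dow` is an arithmetic mean of each customer's order days (the column name suggests this, but the Gold logic is not shown here), the average can land on a day the customer never shops. The plain-Python sketch below compares the truncated mean with the most frequent day for a weekend-only shopper; `mean_dow` and `modal_dow` are hypothetical helpers.

```python
from collections import Counter
from statistics import mean

def mean_dow(days):
    """Truncated arithmetic mean, like CAST(avg_order_dow AS INT)."""
    return int(mean(days))

def modal_dow(days):
    """Most frequent order day (ties resolved by first occurrence)."""
    return Counter(days).most_common(1)[0][0]

# A customer who only orders on Sunday (0) and Saturday (6)
weekend_shopper = [0, 6, 0, 6, 0]

print(mean_dow(weekend_shopper))   # 2 -> reported as Tuesday
print(modal_dow(weekend_shopper))  # 0 -> Sunday
```

If the Silver order tables are available, counting orders per `order_dow` directly, or storing each customer's modal day, would sidestep this effect.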

## 8. Key Insights & Business Recommendations

### Product Insights
- **Fresh produce dominates**: Bananas and organic items have the highest order counts
- **High reorder rates**: Indicate strong customer loyalty for staple items
- **Opportunity**: Low reorder products may benefit from promotions

### Customer Behavior
- **Consistent basket sizes**: Most customers have predictable shopping patterns
- **Reorder propensity varies**: Opportunity for personalized marketing
- **Department diversity**: Some customers shop across many categories

### Business Recommendations

1. **Cross-sell opportunities**
   - Use product affinity data (frequently bought together)
   - Recommend complementary items at checkout
   - Bundle popular pairs for promotions

2. **Inventory optimization**
   - Stock high-reorder products prominently
   - Predict demand based on reorder patterns
   - Adjust inventory by department performance

3. **Marketing campaigns**
   - Target low-reorder customers with discounts
   - Offer loyalty programs to high-frequency shoppers
   - Time promotions based on shopping hour/day patterns

4. **Personalization**
   - Segment customers by basket size and shopping frequency
   - Tailor recommendations to department preferences
   - Predict next purchase with ML models
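
As a rough illustration of the cross-sell idea in recommendation 1, the sketch below turns rows shaped like the `product_pairs_affinity` output (`product_a_name`, `product_b_name`, `pair_count`) into a "frequently bought together" lookup. It assumes each unordered pair is stored once in the table, which this notebook does not confirm; `build_recommendations` and `top_n` are hypothetical names, and the counts are made up.

```python
from collections import defaultdict

def build_recommendations(pairs, top_n=3):
    """Map each product to its top_n co-purchased products by pair count.

    pairs: iterable of (product_a_name, product_b_name, pair_count).
    """
    neighbours = defaultdict(list)
    for a, b, count in pairs:
        # A pair is symmetric, so register it under both products
        neighbours[a].append((count, b))
        neighbours[b].append((count, a))
    return {
        product: [
            name
            for _, name in sorted(items, key=lambda t: (-t[0], t[1]))[:top_n]
        ]
        for product, items in neighbours.items()
    }

# Illustrative rows, not real Instacart counts
pairs = [
    ("Banana", "Avocado", 50),
    ("Banana", "Strawberries", 80),
    ("Avocado", "Limes", 20),
]

recs = build_recommendations(pairs, top_n=2)
print(recs["Banana"])   # ['Strawberries', 'Avocado']
print(recs["Avocado"])  # ['Banana', 'Limes']
```

At Gold-table scale the same lookup would more naturally be a window function over the Delta table; the in-memory version is only meant to show the logic.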

## 9. Next Steps - Advanced Analytics

### Machine Learning Opportunities

1. **Reorder Prediction Model**
   - Features: `user_purchase_features` table
   - Target: `reordered` flag from Silver tables
   - Algorithm: XGBoost, LightGBM
   - Use case: Predict if customer will reorder a product

2. **Recommendation Engine**
   - Item-to-item collaborative filtering on `product_pairs_affinity`
   - Matrix factorization (ALS) on user-product purchase counts
   - Use case: "Customers who bought X also bought Y"

3. **Customer Segmentation (Clustering)**
   - Features: basket size, reorder propensity, departments shopped
   - Algorithm: K-means, DBSCAN
   - Use case: Identify customer personas for targeted marketing

4. **Churn Prediction**
   - Feature: `avg_days_between_orders`
   - Identify at-risk customers (increasing time between orders)
   - Trigger re-engagement campaigns
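
For the churn idea in item 4, a single `avg_days_between_orders` value cannot show whether gaps are growing, so a trend check needs the per-order gaps (for example, the `days_since_prior_order` column in the Instacart orders data). The sketch below flags a customer whose most recent gaps are much longer than their earlier ones. The `is_at_risk` helper and its `recent` and `ratio` thresholds are hypothetical and would need tuning against real data.

```python
from statistics import mean

def is_at_risk(days_between_orders, recent=3, ratio=1.5):
    """Flag a customer whose recent order gaps are much longer than earlier ones.

    days_between_orders: gaps in days, oldest first.
    recent: number of latest gaps to compare against the earlier history.
    ratio: how many times longer the recent mean gap must be to raise the flag.
    """
    if len(days_between_orders) < 2 * recent:
        return False  # not enough history to compare two windows
    earlier = mean(days_between_orders[:-recent])
    latest = mean(days_between_orders[-recent:])
    return earlier > 0 and latest / earlier >= ratio

steady = [7, 7, 8, 7, 7, 8, 7]
drifting = [7, 7, 8, 7, 14, 21, 30]

print(is_at_risk(steady))    # False
print(is_at_risk(drifting))  # True
```

A production churn model would likely use more features than this single ratio; the rule is only a starting point for picking customers to re-engage.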

### How to Build ML Models in Databricks

```python
# Example: Train a reorder prediction model
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Load training data
train_df = spark.read.format("delta").load("/FileStore/instacart/silver/order_products_train_enriched")

# Feature engineering: combine the numeric columns into a single vector column
assembler = VectorAssembler(
    inputCols=["order_number", "order_dow", "order_hour_of_day", "add_to_cart_order"],
    outputCol="features"
)
train_vec = assembler.transform(train_df)

# Train model on the assembled features
rf = RandomForestClassifier(labelCol="reordered", featuresCol="features")
model = rf.fit(train_vec)

# Use MLflow for experiment tracking (built-in to Databricks)
```

### Dashboarding with Databricks SQL

1. Switch to the **SQL** persona and open **SQL Editor**
2. Create queries from Gold tables
3. Add visualizations (bar charts, line graphs)
4. Combine into a dashboard for stakeholders

## Summary

✅ **Analysis complete!**

**Key metrics explored:**
- Product performance (order frequency, reorder rates)
- Department analytics
- Customer segmentation
- Basket analysis (product affinity)
- Shopping time patterns

**Business value delivered:**
- Actionable insights for marketing and inventory
- ML-ready feature tables
- Foundation for recommendation engine
- Customer segmentation for personalization

**What you can do now:**
- Export insights to stakeholders
- Build ML models with MLflow
- Create Databricks SQL dashboards
- Schedule notebooks as jobs (paid tier)