# Workshop: Customer Segmentation for RetailMax

## Business Context

**Company:** RetailMax - a multi-category retail chain operating in the USA  
**Role:** Data Scientist in the Customer Analytics team  
**Scenario:** January 2025. The company has completed a record sales year, but leadership has identified an issue with marketing efficiency.

---

## Business Problem

RetailMax has 10,000 customers and 100,000 transactions, but currently treats all customers identically. The same promotions are sent to high-value customers and one-time buyers, resulting in inefficient marketing spend.

**Objective:** Build an ML model that automatically classifies customers into segments (Basic, Standard, Premium) to enable personalized marketing campaigns.

---

## Available Data

| Source | Description | Size | Known Issues |
|--------|-------------|------|--------------|
| `customers.csv` | Customer data (name, email, location, segment) | 10,000 | Missing values, duplicate IDs |
| `products.csv` | Product catalog (name, brand, price) | 2,000 | Negative prices, missing names |
| `orders_batch.json` | Order history | 100,000 | Null IDs, negative quantities, future dates |

**Data Quality Note:** The data comes directly from transactional systems and contains quality issues typical of real-world data. Returns are recorded as negative quantities, and some orders lack customer IDs.
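The issues listed above can be expressed as simple record-level checks. The following is a minimal sketch in plain Python, not the workshop's actual cleaning logic: the `flag_order` helper and the sample records are invented for illustration, and the field names `customer_id`, `quantity`, and `order_datetime` are assumed to match the order data, with `order_datetime` assumed to be an ISO 8601 string.

```python
from datetime import datetime

def flag_order(order, now):
    """Return a list of data quality flags for one order record (a dict)."""
    flags = []
    if order.get("customer_id") is None:
        flags.append("missing_customer_id")
    quantity = order.get("quantity")
    if quantity is not None and quantity < 0:
        # Negative quantities represent returns, not necessarily errors
        flags.append("return")
    raw_ts = order.get("order_datetime")
    if raw_ts is not None and datetime.fromisoformat(raw_ts) > now:
        flags.append("future_date")
    return flags

# Invented sample records for illustration only
sample_orders = [
    {"order_id": "A1", "customer_id": "C1", "quantity": 2, "order_datetime": "2024-11-03T10:15:00"},
    {"order_id": "A2", "customer_id": None, "quantity": 1, "order_datetime": "2024-12-01T09:00:00"},
    {"order_id": "A3", "customer_id": "C2", "quantity": -1, "order_datetime": "2026-01-01T00:00:00"},
]

now = datetime(2025, 1, 15)  # fixed reference date so the result is reproducible
for order in sample_orders:
    print(order["order_id"], flag_order(order, now))
```

Flagging records instead of dropping them keeps the decision about returns and unattributed orders open for the cleaning notebook.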

---

## Workshop Roadmap

| # | Notebook | Objective | Deliverable |
|---|----------|-----------|-------------|
| **0** | Setup | Environment and data preparation | Bronze Tables |
| **1** | Data Exploration | Data understanding, issue identification | EDA Report |
| **2** | Cleaning & Features | Data cleaning + Customer 360 | Feature Table |
| **3** | ML Pipeline | Classification model + MLflow | Deployed Model |

---

## Learning Objectives

By the end of this workshop, participants will be able to:
- Perform exploratory data analysis (EDA) on real-world datasets
- Identify and resolve data quality issues
- Build customer-level features (RFM analysis)
- Create reproducible ML pipelines in Spark
- Track experiments using MLflow
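
As a preview of the RFM objective, the sketch below computes recency, frequency, and monetary value per customer in plain Python. It is an illustration under assumptions, not the feature logic used later in the workshop: the `rfm_features` helper, the sample orders, and the reference date are invented, and returns are not treated specially.

```python
from collections import defaultdict
from datetime import date

def rfm_features(orders, reference_date):
    """Compute recency (days), frequency (orders), and monetary (total spend) per customer."""
    last_order = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, order_date, amount in orders:
        if customer_id is None:
            continue  # orders without a customer cannot be attributed
        frequency[customer_id] += 1
        monetary[customer_id] += amount
        if customer_id not in last_order or order_date > last_order[customer_id]:
            last_order[customer_id] = order_date
    return {
        cid: {
            "recency_days": (reference_date - last_order[cid]).days,
            "frequency": frequency[cid],
            "monetary": round(monetary[cid], 2),
        }
        for cid in last_order
    }

# Invented sample orders: (customer_id, order date, total_amount)
orders = [
    ("C1", date(2024, 12, 20), 120.0),
    ("C1", date(2024, 10, 5), 80.0),
    ("C2", date(2024, 3, 14), 35.5),
    (None, date(2024, 6, 1), 10.0),
]

features = rfm_features(orders, reference_date=date(2025, 1, 1))
print(features)
```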

## Context and Requirements

- **Workshop:** Customer Segmentation for RetailMax
- **Notebook type:** Setup (run first!)
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Catalog `data_ml_preparation` must exist (created by instructor)
  - Permissions: CREATE SCHEMA, CREATE VOLUME, CREATE TABLE
- **Execution time:** ~3 minutes (includes data download)

---

## Section 1: User Detection & Environment Setup

Detect the current user automatically and create a unique per-user schema so that each participant's work stays isolated:

In [None]:
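import re

# Get current user and create isolated schema
current_user_email = spark.sql("SELECT current_user()").collect()[0][0]
# Replace every character that is not a letter, digit, or underscore so the
# result is safe to use in an unquoted schema name (covers ".", "-", "+", etc.)
username = re.sub(r"[^A-Za-z0-9_]", "_", current_user_email.split("@")[0])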

# Configuration - SAME as Demo notebooks for consistency
catalog_name = "data_ml_preparation"  # Must exist (created by instructor)
schema_name = f"ml_dp_{username}"

print(f"Detected user: {current_user_email}")
print(f"Username for schema: {username}")
print(f"Target: {catalog_name}.{schema_name}")

In [None]:
# Set catalog and create schema (catalog must exist - created by instructor)
try:
    spark.sql(f"USE CATALOG {catalog_name}")
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
    spark.sql(f"USE SCHEMA {schema_name}")
    print(f"Environment configured: {catalog_name}.{schema_name}")
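except Exception as e:
    print(f"Error: {e}")
    # Note: Volumes are a Unity Catalog feature, so the CREATE VOLUME step in
    # Section 2 is not expected to work after this fallback.
    print("Fallback to hive_metastore...")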
    catalog_name = "hive_metastore"
    spark.sql(f"USE CATALOG {catalog_name}")
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {schema_name}")
    spark.sql(f"USE SCHEMA {schema_name}")
    print(f"Using: {catalog_name}.{schema_name}")
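
Because the cell above may switch catalogs, later code may find it useful to refer to objects by their full three-level name (`catalog.schema.table`) instead of relying on the session defaults. A minimal sketch, assuming backtick quoting as used by Spark SQL; the `quote_identifier` and `qualified_name` helpers and the example schema name are hypothetical and not part of the workshop code.

```python
def quote_identifier(name):
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"

def qualified_name(catalog, schema, table):
    """Build a three-level name: catalog.schema.table."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))

# "ml_dp_jane_doe" is an invented example schema
print(qualified_name("data_ml_preparation", "ml_dp_jane_doe", "customers_bronze"))
```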

## Section 2: Create Volume and Download Data

Create a Unity Catalog Volume to store the raw data files, then download the dataset from the course GitHub repository.

**Volume:** Managed storage location in Unity Catalog for files (CSV, JSON, Parquet, etc.)

In [None]:
# Create Volume for raw data storage
volume_name = "raw_data"

spark.sql(f"CREATE VOLUME IF NOT EXISTS {volume_name}")
volume_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"

print(f"Volume created: {volume_path}")

### 2.1 Download Data from GitHub Repository

Download the training dataset files from the course repository:

In [None]:
import urllib.request

# GitHub raw content URLs
repo_base = "https://raw.githubusercontent.com/Bureyz/DataPreparation4MachineLearning/main/dataset"

files_to_download = [
    ("customers/customers.csv", "customers.csv"),
    ("products/csv/products.csv", "products.csv"),
    ("orders/orders_batch.json", "orders_batch.json")
]

# Download files to Volume
for remote_path, local_name in files_to_download:
    url = f"{repo_base}/{remote_path}"
    local_path = f"{volume_path}/{local_name}"
    
    try:
        # Download the file contents over HTTPS with urllib
        with urllib.request.urlopen(url) as response:
            content = response.read()

        # Write the decoded text to the Volume using dbutils
        dbutils.fs.put(local_path, content.decode('utf-8'), overwrite=True)
        print(f"Downloaded: {local_name}")
    except Exception as e:
        print(f"Error downloading {local_name}: {e}")

print("\nFiles in volume:")
display(dbutils.fs.ls(volume_path))
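
The cell above reads each file fully into memory and decodes it as UTF-8 before writing. An alternative is to stream the bytes straight to the destination path. The sketch below is one way to do that, assuming the Volume path under `/Volumes/...` is writable with standard Python file APIs on your cluster; the `download_to` helper is hypothetical, and the demo uses a local `file://` URL so it runs without network access.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_to(url, destination, timeout=30):
    """Stream the response body for `url` into `destination` and return the byte count."""
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, no decoding
    return destination.stat().st_size

# Offline demo: "download" a local file through a file:// URL
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_bytes(b"customer_id,email\nC1,a@example.com\n")
    target = Path(tmp) / "volume" / "customers.csv"
    size = download_to(source.as_uri(), target)
    print(f"Wrote {size} bytes to {target.name}")
```

In the notebook, the call would take the form `download_to(url, local_path)` inside the existing loop, with the same `try`/`except` around it.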

## Section 3: Load Raw Data (Bronze Layer)

Load the data from the Volume into Bronze tables.
The Bronze layer contains raw data exactly as received from the source systems.

**Note:** The data intentionally contains quality issues that will be identified and resolved in subsequent notebooks.

In [None]:
# Load Customers from Volume (CRM System - 10,000 records)
customers_path = f"{volume_path}/customers.csv"

df_customers = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(customers_path)

print(f"Loaded {df_customers.count()} customers")
display(df_customers.limit(5))

In [None]:
# Load Products from Volume (Product Catalog - 2,000 SKUs)
products_path = f"{volume_path}/products.csv"

df_products = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(products_path)

print(f"Loaded {df_products.count()} products")
display(df_products.limit(5))

In [None]:
# Load Orders from Volume (POS/E-commerce - 100,000 transactions)
orders_path = f"{volume_path}/orders_batch.json"

# The JSON reader infers the schema by default, so no inferSchema option is needed
df_orders = spark.read.format("json") \
    .load(orders_path)

print(f"Loaded {df_orders.count()} orders")
display(df_orders.limit(5))

In [None]:
# Save Bronze Tables
df_customers.write.mode("overwrite").saveAsTable("customers_bronze")
df_products.write.mode("overwrite").saveAsTable("products_bronze")
df_orders.write.mode("overwrite").saveAsTable("orders_bronze")

print("Bronze tables created: customers_bronze, products_bronze, orders_bronze")
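
A quick sanity check after loading is to compare the row counts with the sizes documented in the Available Data table. The sketch below is a plain-Python illustration: the `count_mismatches` helper and the observed counts are invented, and a mismatch is not necessarily an error, since the actual files may differ from the documented sizes.

```python
# Expected sizes taken from the Available Data table
EXPECTED_ROWS = {
    "customers_bronze": 10_000,
    "products_bronze": 2_000,
    "orders_bronze": 100_000,
}

def count_mismatches(actual_rows, expected_rows=EXPECTED_ROWS):
    """Return {table: (expected, actual)} for tables whose row count differs or is missing."""
    return {
        table: (expected, actual_rows.get(table))
        for table, expected in expected_rows.items()
        if actual_rows.get(table) != expected
    }

# Invented counts for illustration; in the notebook they could come from
# spark.table(name).count() for each Bronze table.
observed = {"customers_bronze": 10_000, "products_bronze": 1_998, "orders_bronze": 100_000}
print(count_mismatches(observed))
```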

## Section 4: Create Unified Dataset

Join Orders with Customers and Products to create a unified sales dataset.
This dataset will be the starting point for EDA and cleaning in subsequent notebooks.

In [None]:
from pyspark.sql.functions import col

# Left-join Orders with Customers and Products so that every order row is kept,
# even when its customer_id or product_id has no match
# Note: This dataset contains data quality issues (nulls, negatives) to be cleaned in the workshop

df_joined = df_orders.alias("o") \
    .join(df_customers.alias("c"), col("o.customer_id") == col("c.customer_id"), "left") \
    .join(df_products.alias("p"), col("o.product_id") == col("p.product_id"), "left") \
    .select(
        col("o.order_id"),
        col("o.order_datetime"),
        col("o.customer_id"),
        col("c.first_name"),
        col("c.last_name"),
        col("c.email"),
        col("c.country"),
        col("c.registration_date"),
        col("c.customer_segment"),
        col("o.product_id"),
        col("p.product_name"),
        col("p.brand"),
        col("p.unit_cost"),
        col("o.unit_price").alias("sales_price"),
        col("o.quantity"),
        col("o.total_amount"),
        col("o.payment_method")
    )

# Save as the starting point for the workshop
df_joined.write.mode("overwrite").saveAsTable("workshop_sales_data")

print(f"Unified dataset created: workshop_sales_data ({df_joined.count()} rows)")
display(df_joined.limit(5))
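
Two properties of the left join matter for the later notebooks: orders with a null or unmatched `customer_id` are kept with empty customer fields, and duplicate customer IDs multiply the matching order rows. The plain-Python sketch below illustrates both effects on invented data; the `left_join` helper is hypothetical and only mimics the join semantics, it is not how Spark executes the join.

```python
def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key`, keeping every left row."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        # A None key never matches, mirroring SQL NULL semantics
        matches = index.get(left[key], []) if left[key] is not None else []
        if not matches:
            joined.append({**left, "matched": False})
        for right in matches:
            joined.append({**right, **left, "matched": True})
    return joined

# Invented sample data
orders = [
    {"order_id": "A1", "customer_id": "C1"},
    {"order_id": "A2", "customer_id": None},
    {"order_id": "A3", "customer_id": "C9"},
]
customers = [
    {"customer_id": "C1", "email": "old@example.com"},
    {"customer_id": "C1", "email": "new@example.com"},  # duplicate ID
]

result = left_join(orders, customers, "customer_id")
print(len(orders), "orders ->", len(result), "joined rows")
```

Comparing the joined row count with the order count is therefore a cheap way to detect duplicate keys on the customer or product side.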

## Summary

Setup complete. The following resources have been created:

| Resource | Name | Description |
|----------|------|-------------|
| **Schema** | `ml_dp_{username}` | Per-user isolated schema |
| **Volume** | `raw_data` | Storage for raw data files |
| **Table** | `customers_bronze` | Raw customer data |
| **Table** | `products_bronze` | Raw product catalog |
| **Table** | `orders_bronze` | Raw order transactions |
| **Table** | `workshop_sales_data` | Unified sales dataset |

**Volume Path:** `/Volumes/{catalog}/{schema}/raw_data/`

**Next Step:** Proceed to `01_Workshop_Data_Exploration.ipynb` to analyze data quality.