# Workshop 1: Exploratory Data Analysis (EDA)

## Business Context: The Marketing Challenge

RetailMax's CMO has a problem: **marketing campaigns have a 2% conversion rate**, far below the industry average of 5%. The reason? All 10,000 customers receive the same generic promotions.

**Your mission:** Build an ML model that automatically classifies customers into segments (Basic, Standard, Premium) so marketing can send targeted campaigns.

But before we can build any model, we need to **understand our data**. This is where EDA comes in.

---

## What We're Doing and Why

| Step | Business Question | Technical Task |
|------|------------------|----------------|
| **1. Load Data** | What data do we have? | Load and preview tables |
| **2. Profile Data** | Is the data usable? | Check types, stats, distributions |
| **3. Find Issues** | What's broken? | Identify nulls, errors, inconsistencies |
| **4. Analyze Distributions** | Are customer segments balanced? | Check class distribution |

**Why EDA matters:**
- A model trained on dirty data will make dirty predictions
- "Garbage In, Garbage Out" - even the best algorithm can't fix bad data
- Issues found now save weeks of debugging later

---

## Context and Requirements

- **Workshop:** Customer Segmentation for RetailMax
- **Notebook type:** Hands-on Exercise
- **Prerequisites:** `00_Workshop_Setup.ipynb` completed
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
- **Execution time:** ~30 minutes

---

## Theoretical Background

**EDA helps answer:**

| Aspect | Question | Impact on ML |
|--------|----------|--------------|
| **Data Quality** | Missing values? Duplicates? Invalid values? | Requires imputation or removal |
| **Distribution** | Normal or skewed? | Affects model choice and scaling |
| **Outliers** | Extreme values present? | Can distort linear models |
| **Correlations** | Which features are related? | Informs feature selection |
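
To make the Outliers row concrete, here is a minimal plain-Python sketch on toy numbers (not RetailMax data). It shows how a single extreme value drags the mean far from the median, and how Tukey's 1.5 × IQR rule flags it. The helper `iqr_outliers` and the `order_totals` list are illustrative assumptions, not part of the workshop dataset or the Spark API.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order totals; the last value is an obvious extreme.
order_totals = [20.0, 22.5, 19.0, 25.0, 21.0, 23.5, 24.0, 5000.0]

mean_total = statistics.mean(order_totals)
median_total = statistics.median(order_totals)
flagged = iqr_outliers(order_totals)

print("mean:    ", mean_total)    # pulled upward by the extreme value
print("median:  ", median_total)  # barely affected by it
print("outliers:", flagged)
```

A large gap between mean and median in the summary statistics is often the first hint that a column deserves this kind of check.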

---

In [0]:
%run ../demo/00_Setup

## Section 1: Load Data

**Business context:** We have one year of RetailMax sales transactions. Each row represents a single purchase - who bought what, when, and for how much.

In [0]:
# Exercise 1: Load Data
# Load the 'workshop_sales_data' table into a DataFrame named 'df'.
# Display the row count and first 5 records to verify the data loaded correctly.

df = # TODO: Load table
# print(...)
# display(...)

## Section 2: Data Profiling

**Business context:** Before marketing trusts our model, we need to trust our data. Profiling answers: "Can we use this data, or is it full of errors?"

**What to check:**
1. **Data types:** Are dates stored as dates? Are numbers numeric?
2. **Statistics:** Do mean values make sense? Are there outliers?

In [0]:
# Exercise 2: Display the data schema
# Pay attention to columns: 'order_datetime', 'quantity', 'total_amount'. Are their types correct?

# TODO: Print schema

In [0]:
# Exercise 3: Display summary statistics for numeric columns
# Columns: 'quantity', 'unit_cost', 'sales_price', 'total_amount'
# Check MIN and MAX values. Do you see anything concerning (negative prices, extreme quantities)?

# TODO: Display summary statistics

In [0]:
# Exercise 3b: Check skewness of numeric columns
# Skewness above 1 or below -1 indicates a highly skewed distribution
# Highly skewed features may need a log transformation before modeling

from pyspark.sql.functions import skewness

# TODO: Calculate skewness for 'total_amount' and 'quantity'
# Hint: df.select(skewness("column_name"))

## Section 3: Data Quality Issue Identification

**Business context:** Real-world data is messy. Orders get duplicated, systems crash mid-transaction, users enter invalid data. We need to find and document every issue.

**Common retail data problems:**
- Customer didn't provide email → NULL
- Return processed incorrectly → negative quantity  
- POS system error → quantity * price ≠ total

**Why it matters:** If 5% of transactions have errors, our model learns from lies.
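
The "quantity * price ≠ total" check needs a tolerance, because floating-point arithmetic rarely produces exact decimal results. The sketch below illustrates this in plain Python with made-up numbers; `is_consistent` and its `tolerance` parameter are hypothetical names chosen for this example, and the 0.01 threshold is an assumption (one cent).

```python
def is_consistent(quantity, sales_price, total_amount, tolerance=0.01):
    """True when quantity * sales_price matches total_amount within tolerance."""
    expected = quantity * sales_price
    return abs(expected - total_amount) <= tolerance


# Exact equality fails even for a "correct" record because of binary rounding.
exact_match = (3 * 0.1 == 0.3)
print("exact equality:", exact_match)

# With a tolerance, the correct record passes and a genuine error is still caught.
print("3 x 0.10 vs 0.30:  ", is_consistent(3, 0.1, 0.3))
print("2 x 19.99 vs 49.98:", is_consistent(2, 19.99, 49.98))
```

Comparing the absolute difference against a small threshold, rather than testing equality, keeps rounding noise from being reported as a POS error.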

In [0]:
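# Note: importing round and abs from pyspark.sql.functions shadows Python's built-in round() and abs() in this notebook
from pyspark.sql.functions import col, count, when, isnan, round, abs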

# Exercise 4: Count NULL values in each column
# Hint: Use a loop over df.columns or list comprehension

null_counts = # TODO: Count nulls per column
display(null_counts)

In [0]:
# Exercise 5: Check for logical errors
# Find rows where:
# a) 'quantity' is less than or equal to 0
# b) 'total_amount' is less than 0
# Display the invalid records

invalid_orders = # TODO: Filter invalid orders
# print(...)
display(invalid_orders)

In [0]:
# Exercise 6 (Challenge): Check mathematical consistency
# Does 'total_amount' equal 'quantity' * 'sales_price'?
# Allow for floating-point rounding: flag a record only when the absolute difference exceeds 0.01
# Find records where the calculation does not match

inconsistent_prices = # TODO: Find inconsistent records

# print(...)
# display(...)

## Section 4: Distribution Analysis

**Business context:** Marketing wants to know the customer mix. Are most customers Basic, or is Premium the majority? This affects:
- **Class imbalance:** If 90% are Basic, model might just predict "Basic" for everyone
- **Business strategy:** If Premium is rare, acquiring one is very valuable
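
The class-imbalance risk can be quantified with a majority-class baseline: the accuracy a "model" gets by always predicting the most common segment. The sketch below uses a made-up 90/6/4 split in plain Python; the function name `majority_baseline_accuracy` and the `segments` list are assumptions for illustration only.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Hypothetical segment labels: 90% Basic, 6% Standard, 4% Premium.
segments = ["Basic"] * 90 + ["Standard"] * 6 + ["Premium"] * 4

label, accuracy = majority_baseline_accuracy(segments)
print(f"Always predicting '{label}' scores {accuracy:.0%} accuracy")

# The same "model" never identifies a single Premium customer.
premium_found = sum(1 for s in segments if s == "Premium" and label == "Premium")
print("Premium customers correctly identified:", premium_found)
```

Any real model should be judged against this baseline, and per-class metrics matter more than overall accuracy when one segment dominates.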

**Key questions:**
1. Payment method distribution → Do Premium customers use credit cards?
2. Customer segment distribution → How balanced are our classes?

In [0]:
# Exercise 7: Visualize 'payment_method' distribution
# Use display() on grouped data. Which payment method is most common?

# TODO: Group and display payment methods

In [0]:
# Exercise 8: Visualize 'customer_segment' distribution
# Are the classes balanced or does one segment dominate? This is important for the ML model.

# TODO: Group and display customer segments

## Checkpoint: What Did We Discover?

**At this point, you should have identified:**

| Finding | Example | Impact on ML |
|---------|---------|--------------|
| Dataset shape | 10,000 rows, 15 columns | Enough data for training |
| Null values | 5% missing emails | Need imputation strategy |
| Invalid records | Negative quantities | Must filter before modeling |
| Skewed distributions | total_amount skew > 2 | May need log transform |
| Class imbalance | 60% Basic, 25% Standard, 15% Premium | Consider stratified sampling |

**Business translation for stakeholders:**
> "We analyzed the sales data and found data quality issues that would corrupt the ML model. About 5% of transactions have errors (negative values, missing IDs). Once cleaned, we'll have reliable data for customer segmentation."

---

**Key findings to address in Workshop 2:**
1. Remove invalid records (negative quantities, amounts)
2. Handle missing values (drop rows that are missing key IDs, impute descriptive fields)
3. Consider log transformation for skewed features
4. Plan stratified sampling for imbalanced classes
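
As a preview of item 4, stratified sampling splits each class separately so that the test set keeps the same class mix as the full data. The following is a minimal plain-Python sketch under stated assumptions: `stratified_split`, the toy `rows`, the 20% test fraction, and the seed are all hypothetical, and the 60/25/15 mix mirrors the example figures in the checkpoint table rather than measured results. In PySpark, `DataFrame.sampleBy` accepts per-class fractions and may be a natural starting point for the same idea.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, label_key, test_fraction=0.2, seed=42):
    """Split rows into (train, test), sampling each class separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for group in by_label.values():
        shuffled = group[:]          # copy so the caller's data is untouched
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy data: 60 Basic, 25 Standard, 15 Premium customers.
rows = (
    [{"customer_id": i, "customer_segment": "Basic"} for i in range(60)]
    + [{"customer_id": i, "customer_segment": "Standard"} for i in range(60, 85)]
    + [{"customer_id": i, "customer_segment": "Premium"} for i in range(85, 100)]
)

train, test = stratified_split(rows, "customer_segment")
test_mix = Counter(r["customer_segment"] for r in test)
print("test set mix:", dict(test_mix))
```

A purely random 20% sample could, by chance, contain very few Premium rows; splitting per class removes that source of variance.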

---

## Best Practices: EDA

| Practice | Description |
|----------|-------------|
| **Start with shape** | `df.count()`, `len(df.columns)` before anything else |
| **Check types first** | `printSchema()` - wrong types cause silent errors |
| **Use summary()** | Quick view of min/max reveals impossible values |
| **Check skewness** | Skewness above 1 or below -1 suggests considering a transformation (log, sqrt) |
| **Document findings** | Write down issues for the cleaning phase |
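
To illustrate the skewness practice, the sketch below computes moment-based (population) skewness by hand in plain Python and compares a right-skewed toy column before and after a `log1p` transform. The function `moment_skewness` and the `amounts` values are assumptions made for this example; they are not the workshop data, and Spark's `skewness` function should be used on real DataFrames.

```python
import math


def moment_skewness(values):
    """Population skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical order amounts with a long right tail.
amounts = [5, 8, 12, 20, 35, 60, 100, 180, 320, 600]

raw_skew = moment_skewness(amounts)
log_skew = moment_skewness([math.log1p(v) for v in amounts])

print(f"skewness before log1p: {raw_skew:.2f}")
print(f"skewness after  log1p: {log_skew:.2f}")
```

`log1p` (log of 1 + x) is used here instead of a plain log so that zero amounts would not raise an error; negative values would still need to be handled first.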

---

# Solutions

Reference solutions for the exercises above. Compare your results with these after attempting each exercise.

In [0]:
# 1. Load Data
df = spark.table("workshop_sales_data")
print(f"Rows: {df.count()}")
display(df.limit(5))

# 2. Schema
df.printSchema()

# 3. Stats
display(df.select("quantity", "unit_cost", "sales_price", "total_amount").summary())

# 3b. Skewness
from pyspark.sql.functions import skewness
display(df.select(
    skewness("total_amount").alias("total_amount_skew"),
    skewness("quantity").alias("quantity_skew")
))

# 4. Nulls
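# Note: round and abs below are the Spark column functions; they shadow Python's built-ins for the rest of this cell
from pyspark.sql.functions import col, count, when, abs, round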
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
display(null_counts)

# 5. Invalid Logic
invalid_orders = df.filter((col("quantity") <= 0) | (col("total_amount") < 0))
display(invalid_orders)

# 6. Math Consistency
inconsistent = df.withColumn("calc_total", round(col("quantity") * col("sales_price"), 2)) \
                 .filter(abs(col("calc_total") - col("total_amount")) > 0.01)
display(inconsistent)

# 7 & 8. Visualizations
display(df.groupBy("payment_method").count())
display(df.groupBy("customer_segment").count())