# Workshop 1: Exploratory Data Analysis (EDA)

## Objective

Perform exploratory data analysis to understand the RetailMax sales dataset and identify data quality issues before building an ML model.

## Context and Requirements

- **Workshop:** Customer Segmentation for RetailMax
- **Notebook type:** Hands-on Exercise
- **Prerequisites:** `00_Workshop_Setup.ipynb` completed
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
- **Execution time:** ~30 minutes

---

## Theoretical Background

**Why is EDA important for ML?**

Exploratory Data Analysis is the first step in any ML project. The "Garbage In, Garbage Out" principle means that even the best model cannot compensate for poor data quality.

**EDA helps answer:**

| Aspect | Question | Impact on ML |
|--------|----------|--------------|
| **Data Quality** | Missing values? Duplicates? Invalid values? | Requires imputation or removal |
| **Distribution** | Normal or skewed? | Affects model choice and scaling |
| **Outliers** | Extreme values present? | Can distort linear models |
| **Correlations** | Which features are related? | Informs feature selection |

---

In [0]:
%run ../demo/00_Setup

## Section 1: Load Data

In [0]:
# Exercise 1: Load Data
# Load the 'workshop_sales_data' table into a DataFrame named 'df'.
# Display the row count and first 5 records to verify the data loaded correctly.

df = None  # TODO: Load table
# print(...)
# display(...)

## Section 2: Data Profiling

Before cleaning, understand the nature of the data.

**Check:**
1. **Data types:** Are dates stored as dates? Are numbers numeric?
2. **Statistics:** Do mean values make sense? Are there outliers?

In [0]:
# Exercise 2: Display the data schema
# Pay attention to columns: 'order_datetime', 'quantity', 'total_amount'. Are their types correct?

# TODO: Print schema

In [0]:
# Exercise 3: Display summary statistics for numeric columns
# Columns: 'quantity', 'unit_cost', 'sales_price', 'total_amount'
# Check MIN and MAX values. Do you see anything concerning (negative prices, extreme quantities)?

# TODO: Display summary statistics

In [0]:
# Exercise 3b: Check skewness of numeric columns
# Skewness above 1 or below -1 indicates highly skewed data.
# Highly skewed features may need a log transformation before modeling.

from pyspark.sql.functions import skewness

# TODO: Calculate skewness for 'total_amount' and 'quantity'
# Hint: df.select(skewness("column_name"))
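
To make the threshold concrete, the sketch below computes moment-based (population) skewness in plain Python and shows how a log transform can reduce it. The sample values and the helper name `population_skewness` are invented for illustration; this is not the workshop dataset, and Spark's `skewness` may differ in estimator details.

```python
import math

def population_skewness(values):
    """Third standardized moment: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed order amounts (each value doubles the previous one).
amounts = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

raw_skew = population_skewness(amounts)

# log1p(x) = log(1 + x); it accepts x = 0 but not x <= -1,
# so negative amounts must be cleaned before applying it.
log_skew = population_skewness([math.log1p(v) for v in amounts])

print(f"raw skewness:   {raw_skew:.2f}")
print(f"log1p skewness: {log_skew:.2f}")
```

On this sample the raw values exceed the |skewness| > 1 threshold, while the transformed values fall well below it. Real sales data will rarely be this regular, so treat the transform as a candidate to evaluate rather than a guaranteed fix.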

## Section 3: Data Quality Issue Identification

Real-world data is rarely perfect. Identify specific issues to plan remediation.

**Look for:**
1. **Missing values (Nulls):** Which columns are incomplete?
2. **Logical errors:** Negative quantities, negative prices
3. **Inconsistencies:** Does `quantity * sales_price` equal `total_amount`?

In [0]:
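# Note: importing round and abs from pyspark.sql.functions shadows Python's built-ins in this notebook.
from pyspark.sql.functions import col, count, when, isnan, round, abs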

# Exercise 4: Count NULL values in each column
# Hint: Use a loop over df.columns or list comprehension

null_counts = None  # TODO: Count nulls per column
display(null_counts)
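
The per-column counting logic can be tried in plain Python before writing the Spark version. The rows below are invented and `count_missing` is a hypothetical helper. The point is that a missing value (`None`, standing in for a Spark NULL) and a floating-point NaN are different things, so a null check alone does not catch NaN.

```python
import math

# Invented sample rows; None plays the role of a Spark NULL.
rows = [
    {"quantity": 2, "total_amount": 19.98, "payment_method": "card"},
    {"quantity": None, "total_amount": float("nan"), "payment_method": "cash"},
    {"quantity": 1, "total_amount": None, "payment_method": None},
]

def count_missing(rows, columns):
    """Return {column: (null_count, nan_count)} for the given rows."""
    result = {}
    for c in columns:
        nulls = sum(1 for r in rows if r[c] is None)
        nans = sum(
            1 for r in rows
            if isinstance(r[c], float) and math.isnan(r[c])
        )
        result[c] = (nulls, nans)
    return result

missing = count_missing(rows, ["quantity", "total_amount", "payment_method"])
for column, (nulls, nans) in missing.items():
    print(f"{column}: {nulls} null, {nans} NaN")
```

In PySpark the two checks correspond to `Column.isNull()` and `isnan()`; the latter is meant for floating-point columns.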

In [0]:
# Exercise 5: Check for logical errors
# Find rows where:
# a) 'quantity' is less than or equal to 0
# b) 'total_amount' is less than 0
# Display the invalid records

invalid_orders = None  # TODO: Filter invalid orders
# print(...)
display(invalid_orders)

In [0]:
# Exercise 6 (Challenge): Check mathematical consistency
# Does 'total_amount' equal 'quantity' * 'sales_price'?
# Allow for floating-point precision: flag a record only when the absolute difference exceeds 0.01
# Find records where the calculation does not match

inconsistent_prices = None  # TODO: Find inconsistent records

# print(...)
# display(...)
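
Why a tolerance is needed can be seen in plain Python, independent of Spark. The rows and the helper `is_inconsistent` below are invented for illustration; only the 0.01 threshold comes from the exercise.

```python
# Invented rows: (quantity, sales_price, total_amount)
orders = [
    (3, 0.10, 0.30),   # 3 * 0.10 is not exactly 0.30 in binary floating point
    (2, 5.00, 10.00),  # exact match
    (2, 5.00, 12.00),  # genuinely inconsistent
]

TOLERANCE = 0.01  # same threshold as in the exercise

def is_inconsistent(quantity, sales_price, total_amount, tolerance=TOLERANCE):
    """True when the recomputed total differs from the stored one by more than tolerance."""
    return abs(quantity * sales_price - total_amount) > tolerance

# Exact comparison flags the rounding artifact as well as the real error.
exact_mismatches = [o for o in orders if o[0] * o[1] != o[2]]

# Tolerant comparison keeps only the real error.
tolerant_mismatches = [o for o in orders if is_inconsistent(*o)]

print(f"exact: {len(exact_mismatches)}, tolerant: {len(tolerant_mismatches)}")
```

An exact equality test would report the first row as an error even though it only differs by floating-point rounding, which is why the exercise compares the absolute difference against a small threshold.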

## Section 4: Distribution Analysis

Visualizations help identify trends and anomalies quickly.

**Focus on:**
1. Payment method distribution
2. Customer segment distribution
3. Sales over time
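
For the third item, one possible approach is to aggregate `total_amount` per day and inspect the resulting series. The plain-Python sketch below shows only the grouping logic on invented rows; the ISO timestamp format is an assumption about how `order_datetime` is stored.

```python
from collections import defaultdict
from datetime import datetime

# Invented rows: (order_datetime, total_amount)
orders = [
    ("2024-01-01T09:15:00", 20.0),
    ("2024-01-01T17:40:00", 35.5),
    ("2024-01-02T11:05:00", 12.0),
]

# Sum the order totals for each calendar day.
daily_sales = defaultdict(float)
for order_datetime, total_amount in orders:
    day = datetime.fromisoformat(order_datetime).date()
    daily_sales[day] += total_amount

for day in sorted(daily_sales):
    print(day, daily_sales[day])
```

If the timestamps turn out to be stored as strings rather than a date type, they need parsing first, which is one reason the schema check in Section 2 comes before any time-based analysis.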

In [0]:
# Exercise 7: Visualize 'payment_method' distribution
# Use display() on grouped data. Which payment method is most common?

# TODO: Group and display payment methods

In [0]:
# Exercise 8: Visualize 'customer_segment' distribution
# Are the classes balanced or does one segment dominate? This is important for the ML model.

# TODO: Group and display customer segments
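
Once the counts are available, class balance can be summarized with shares and a simple largest-to-smallest ratio. The segment labels and counts below are invented, not taken from the workshop data, and the ratio is just one convenient indicator.

```python
from collections import Counter

# Invented labels standing in for the 'customer_segment' column.
segments = ["Regular"] * 70 + ["Premium"] * 25 + ["VIP"] * 5

counts = Counter(segments)
total = sum(counts.values())
shares = {segment: n / total for segment, n in counts.items()}

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for segment, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{segment}: {share:.0%}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio suggests that plain accuracy may be a misleading metric later on, because a model could score well by favoring the dominant segment.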

## Checkpoint

At this point, you should have:
- Loaded and examined the dataset structure
- Identified null values and their distribution
- Found logical errors (zero or negative quantities, negative amounts)
- Checked skewness of numeric features
- Analyzed class balance in customer segments

**Key findings to address in the next notebook:**
- Missing values in key columns
- Invalid records (zero or negative quantities, negative amounts)
- Mathematical inconsistencies
- Skewed distributions (may need log transform)

---

## Best Practices: EDA

| Practice | Description |
|----------|-------------|
| **Start with shape** | `df.count()`, `len(df.columns)` before anything else |
| **Check types first** | `printSchema()` - wrong types cause silent errors |
| **Use summary()** | Quick view of min/max reveals impossible values |
| **Check skewness** | Skewness above 1 or below -1 may call for a transformation (log, sqrt) |
| **Document findings** | Write down issues for the cleaning phase |

---


# Solutions

Reference solutions for the exercises above. Compare your results with these.

In [0]:
# 1. Load Data
df = spark.table("workshop_sales_data")
print(f"Rows: {df.count()}")
display(df.limit(5))

# 2. Schema
df.printSchema()

# 3. Stats
display(df.select("quantity", "unit_cost", "sales_price", "total_amount").summary())

# 3b. Skewness
from pyspark.sql.functions import skewness
display(df.select(
    skewness("total_amount").alias("total_amount_skew"),
    skewness("quantity").alias("quantity_skew")
))

# 4. Nulls
from pyspark.sql.functions import col, count, when, abs, round
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
display(null_counts)

# 5. Invalid Logic
invalid_orders = df.filter((col("quantity") <= 0) | (col("total_amount") < 0))
display(invalid_orders)

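# 6. Math Consistency
# Recompute the total and keep rows that differ from 'total_amount' by more than 0.01.
inconsistent_prices = df.withColumn("calc_total", round(col("quantity") * col("sales_price"), 2)) \
                        .filter(abs(col("calc_total") - col("total_amount")) > 0.01)
display(inconsistent_prices)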

# 7 & 8. Visualizations
display(df.groupBy("payment_method").count())
display(df.groupBy("customer_segment").count())