# Module 1: Exploratory Data Analysis (EDA) and Validation

## Business Context: TechCorp HR Analytics

**The Challenge:**
TechCorp, a global technology company, wants to build a **Salary Prediction Model** to help HR:
1. **Fair compensation:** Ensure new hires receive competitive, unbiased salaries
2. **Budget planning:** Predict salary costs for new positions
3. **Market benchmarking:** Compare internal salaries with market rates

**The Data:**
HR has collected employee records with demographics and employment details. Before building any model, we must understand and validate this data.

**Our Goal in This Module:**
Perform EDA to discover data quality issues (missing values, outliers) that could corrupt the ML model.

---

**Training Objective:** Master exploratory data analysis (EDA) techniques and data quality validation before starting the ML process.

**Scope:**
- Generating synthetic employee data with realistic patterns and defects
- Data profiling: schema, descriptive statistics
- Visualizations: histograms, box plots
- Outlier detection: IQR (Interquartile Range) method

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** None (first notebook)
- **Execution time:** ~30 minutes

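> **Note:** This notebook regenerates the synthetic data and overwrites the table on every run. The numeric fields are seeded with `random.seed(42)`, while the Faker-generated names and dates are not, so those change between runs.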

## Theoretical Introduction

**Why is EDA crucial in ML?**

Exploratory Data Analysis (EDA) is the first and most important step in any ML project. The **"Garbage In, Garbage Out"** principle means that even the best model won't help if the data quality is poor.

**What does EDA give us?**

| Aspect | Question | Consequence for ML |
|--------|----------|-------------------|
| **Data Quality** | Missing values? Duplicates? Impossible values? | Requires imputation or removal |
| **Distribution** | Normal or skewed data? | Affects model choice and scaling |
| **Outliers** | Extreme values? | Can "break" linear models |
| **Correlations** | Which features are correlated? | Helps with feature selection |

**IQR Method for Outliers:**

We use the **Interquartile Range (IQR)** because it is robust against outliers themselves (unlike standard deviation).

Rule: A point is an outlier if it is above $Q3 + 1.5 \times IQR$ or below $Q1 - 1.5 \times IQR$.
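
To make the rule concrete, here is a minimal sketch in plain Python. The helper name `iqr_bounds` and the sample salaries are illustrative assumptions, not part of the notebook, and the quartiles come from the standard library rather than from Spark, so they can differ slightly from `approxQuantile`.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) outlier fences for the IQR rule."""
    # method="inclusive" treats the sample as the whole population
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical salaries: seven ordinary values and one executive-level value
salaries = [40_000, 45_000, 50_000, 55_000, 60_000, 65_000, 70_000, 500_000]

lower, upper = iqr_bounds(salaries)
outliers = [s for s in salaries if s < lower or s > upper]
print(f"lower={lower}, upper={upper}, outliers={outliers}")
```

On this sample only the 500,000 salary falls outside the fences. The extreme value barely moves the quartiles, which is why the method stays usable even when outliers are present.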

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

## Section 1: Data Generation (Bronze Layer)

We will generate a dataset representing **TechCorp employee records**. The data has realistic patterns:
- **Salary depends on:** Age (experience), Country (market rates), Source (recruitment channel)
- **Intentional defects:** Outliers (executive salaries), Missing values (incomplete records)

**Realistic Salary Logic:**
```
base_salary = 35,000
+ age_bonus = (age - 22) × 800     # Experience premium
× country_multiplier              # USA: 1.5, UK: 1.3, DE: 1.2, FR: 1.1, PL: 0.7
× source_multiplier               # PARTNER: 1.15, CRM: 1.0, WEB: 0.95, MOBILE: 0.9
+ random_noise                    # Individual variation
```

This ensures our ML model will find real patterns to learn!
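
As a worked example of the salary logic, the sketch below evaluates the formula without the random noise term. The function name `expected_salary` is a hypothetical helper introduced only for illustration; the multipliers are the ones listed above.

```python
COUNTRY_MULTIPLIER = {"USA": 1.5, "UK": 1.3, "DE": 1.2, "FR": 1.1, "PL": 0.7}
SOURCE_MULTIPLIER = {"PARTNER": 1.15, "CRM": 1.0, "WEB": 0.95, "MOBILE": 0.9}

def expected_salary(age, country, source):
    """Noise-free salary: (base + experience premium) scaled by both multipliers."""
    base = 35_000 + (age - 22) * 800
    return int(base * COUNTRY_MULTIPLIER[country] * SOURCE_MULTIPLIER[source])

# Two hypothetical USA hires from the CRM channel, eight years apart in age
junior = expected_salary(22, "USA", "CRM")
senior = expected_salary(30, "USA", "CRM")
print(junior, senior, senior - junior)
```

The eight-year age gap adds 8 × 800 = 6,400 to the base, which the USA multiplier scales to a 9,600 difference.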

In [0]:
# Install Faker for synthetic data generation
%pip install faker

In [0]:
from faker import Faker
import random
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

fake = Faker()
random.seed(42)

# Salary multipliers (realistic market differences)
COUNTRY_MULTIPLIER = {"USA": 1.5, "UK": 1.3, "DE": 1.2, "FR": 1.1, "PL": 0.7}
SOURCE_MULTIPLIER = {"PARTNER": 1.15, "CRM": 1.0, "WEB": 0.95, "MOBILE": 0.9}

# Configuration
n_rows = 2000
data = []

for i in range(n_rows):
    # Generate demographics first
    country = random.choice(["USA", "UK", "DE", "FR", "PL"])
    source = random.choice(["CRM", "WEB", "MOBILE", "PARTNER"])
    
    # Simulate Missing Age (5% of data)
    age = None if random.random() < 0.05 else random.randint(22, 65)
    
    # Calculate REALISTIC salary based on age, country, source
    if random.random() < 0.01:
        # 1% Outliers: C-level executives
        salary = random.randint(400000, 800000)
    elif age is None:
        # If age is missing, fall back to a fixed reference age of 40 for the salary calculation
        base_age = 40
        base_salary = 35000 + (base_age - 22) * 800
        salary = int(base_salary * COUNTRY_MULTIPLIER[country] * SOURCE_MULTIPLIER[source])
        salary += random.randint(-8000, 8000)  # noise
    else:
        # Normal case: salary depends on age, country, source
        base_salary = 35000 + (age - 22) * 800  # Experience bonus
        salary = int(base_salary * COUNTRY_MULTIPLIER[country] * SOURCE_MULTIPLIER[source])
        salary += random.randint(-8000, 8000)  # Individual variation
    
    row = (
        i, 
        fake.name(),
        age,
        max(salary, 25000),  # Minimum salary floor
        fake.date_between(start_date="-2y", end_date="today").strftime("%Y-%m-%d"),
        source,
        country
    )
    data.append(row)

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True),
    StructField("registration_date", StringType(), True),  # hire_date
    StructField("source", StringType(), True),             # recruitment channel
    StructField("country", StringType(), True)
])

df_raw = spark.createDataFrame(data, schema)
df_raw.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.customer_bronze")

print("✅ Generated 'customer_bronze' table with realistic salary patterns.")

## Section 2: Data Profiling & Summary

**Why do we need EDA?**
Before training any model, we must understand our data. "Garbage In, Garbage Out" is the golden rule of ML.
EDA helps us answer:
1.  **Data Quality:** Are there missing values? Duplicates? Impossible values (e.g., Age = -5)?
2.  **Data Distribution:** Is the data normal or skewed? (Affects model choice).
3.  **Correlations:** Which features might be useful for prediction?

The first step is to look at the raw numbers.

In [0]:
# Load data
df = spark.table("customer_bronze")

# 1. Basic Display - Use the 'Data Profile' tab in the output below!
display(df)

In [0]:
%sql 
SELECT * FROM customer_bronze


In [0]:
dbutils.data.summarize(df)

**Data Profiling: TechCorp Employee Data**

The cells above load the 'customer_bronze' table (our HR employee records) and display its contents for initial data profiling.
Notice how salary varies by country and age - these are the patterns our ML model will learn!

In [0]:
display(df.describe())

In [0]:
display(df.summary())

### Example 2.1: Statistical Summary
The `summary()` command provides count, mean, stddev, min, max, and quartiles.

In [0]:
# Check statistics for numerical columns
display(df.select("age", "salary").summary())

In [0]:
# Check Skewness
# |skewness| > 1 indicates highly skewed data (a long tail on one side).
# A large positive value suggests we might need a log transformation later.
from pyspark.sql.functions import skewness

display(df.select(skewness("age").alias("age_skewness"), skewness("salary").alias("salary_skewness")))
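
To show what the skewness number measures, here is a small plain-Python sketch. It assumes the population (biased) definition, the third central moment divided by the variance raised to the power 1.5, which is what Spark's `skewness` is understood to use; the function name and sample lists are illustrative.

```python
def population_skewness(values):
    """Third central moment divided by the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]   # balanced around the mean
long_tail = [1, 2, 3, 4, 50]  # one large value stretches the right tail

print(population_skewness(symmetric))
print(population_skewness(long_tail))
```

The symmetric list yields zero, while the single large value pushes the second result above 1, the threshold used in this notebook for "highly skewed".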


### Example 2.2: Delta Lake Fundamentals
**Delta Lake** is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It's the foundation of the Databricks Lakehouse.

In [0]:
# Create Delta table from Bronze data
df.write.format("delta").mode("overwrite").saveAsTable("customers_delta")

In [0]:
%sql
SELECT * FROM customers_delta

In [0]:
# Read from Delta table
df_delta = spark.table("customers_delta")
display(df_delta)

In [0]:
%sql
-- Append 10 existing rows (creates duplicates and a new table version)
INSERT INTO customers_delta
SELECT * FROM customers_delta LIMIT 10

In [0]:
%sql
-- Demo only: without a WHERE clause this overwrites age in every row
UPDATE customers_delta
SET age = 0

In [0]:
%sql
-- Demo only: without a WHERE clause this deletes every row
DELETE FROM customers_delta

In [0]:
%sql
-- Delta table history - essential for ML reproducibility
DESCRIBE HISTORY customers_delta

In [0]:
%sql
SELECT * FROM customers_delta

In [0]:
%sql
-- Time travel example: query the customers_delta table as of version 1
SELECT * FROM customers_delta VERSION AS OF 1

In [0]:
%sql
-- Rollback the Delta table 'customers_delta' to version 1
RESTORE TABLE customers_delta TO VERSION AS OF 1

## Section 3: Visualizations

We can use the built-in plotting tool in `display()` to visualize distributions.

**Task:**
1.  Click the **+** icon in the result of the cell below.
2.  Select **Visualization**.
3.  Choose **Histogram** to see the distribution of `age`.
4.  Choose **Box Plot** to see the distribution of `salary` (and spot outliers!).

In [0]:
display(df)


## Advanced Visualizations with Python Libraries

For more sophisticated ML visualizations, we can integrate matplotlib and seaborn with Databricks.

In [0]:
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In [0]:
pandas_df = df.select("age", "salary").toPandas()

In [0]:
display(pandas_df)

In [0]:
# Create correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(pandas_df.corr(), annot=True, cmap='coolwarm', center=0, fmt='.3f')
plt.title('Feature Correlation Matrix for ML')
plt.tight_layout()
plt.show()

In [0]:
# Distribution analysis for ML feature engineering
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Age distribution (drop missing ages first: NaN values break the histogram binning)
axes[0].hist(pandas_df['age'].dropna(), bins=10, alpha=0.7, color='skyblue', edgecolor='black')
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')  

# Salary distribution (50 bins keep the shape readable for 2,000 rows)
axes[1].hist(pandas_df['salary'], bins=50, alpha=0.7, color='lightgreen', edgecolor='black')
axes[1].set_title('Salary Distribution')
axes[1].set_xlabel('Salary')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

The two plotting cells above serve different purposes. The first draws a correlation heatmap: matplotlib sets the figure size and seaborn renders the correlation matrix of `age` and `salary`, annotated with the coefficients on a diverging color scale. The second plots histograms of `age` and `salary` side by side. Together they show how the features relate and how each one is distributed, which informs feature engineering.
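
For intuition about the numbers inside the heatmap, the sketch below computes a Pearson correlation by hand. The function `pearson_r` and the sample values are illustrative assumptions; rows with a missing value are skipped, mirroring how pandas `corr()` ignores incomplete pairs.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rows following salary = 35,000 + (age - 22) * 800, one age missing
ages = [25, 30, None, 40, 50]
salaries = [37_400, 41_400, 49_400, 49_400, 57_400]

r = pearson_r(ages, salaries)
print(round(r, 3))
```

Because these salaries are an exact linear function of age, the coefficient comes out at 1. In the generated data the country and source multipliers, the noise term, and the executive outliers pull it well below that.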

## Section 4: Identifying Outliers

**Why are outliers a problem?**
Outliers are extreme values that deviate significantly from other observations.
- **Impact on Mean:** A single billionaire in a neighborhood can skew the "average income" massively, making it unrepresentative.
- **Impact on Models:** Linear models (Regression) minimize "squared error". An outlier has a huge error, so the model will try too hard to fit it, ruining the fit for the rest of the data.
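
The effect on the mean is easy to reproduce. The following sketch uses made-up neighborhood incomes (an assumption for illustration) and compares the mean with the median before and after one extreme value is added.

```python
import statistics

# Hypothetical neighborhood incomes
incomes = [48_000, 52_000, 55_000, 60_000, 65_000]
with_outlier = incomes + [1_000_000_000]  # one billionaire moves in

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

The mean jumps from 56,000 to over 166 million, while the median only moves from 55,000 to 57,500. Quantile-based statistics such as the median and the IQR are far less sensitive to a single extreme value.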

**The IQR Method:**
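We use the **Interquartile Range (IQR)** because it is robust: unlike the standard deviation, it is barely affected by the outliers themselves.
Rule: A point is an outlier if it is above $Q3 + 1.5 \times IQR$ or below $Q1 - 1.5 \times IQR$. The cell below checks only the upper bound, because the injected defects are unusually high salaries.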

In [0]:
# Calculate Q1 and Q3 for Salary
quantiles = df.approxQuantile("salary", [0.25, 0.75], 0.01)
q1, q3 = quantiles[0], quantiles[1]
iqr = q3 - q1

upper_bound = q3 + 1.5 * iqr

print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
print(f"Upper Bound for Outliers: {upper_bound}")

# Filter outliers
outliers = df.filter(df.salary > upper_bound)
print(f"Number of outliers found: {outliers.count()}")
display(outliers)

## Section 5: Data Validation

Data validation ensures that the data meets defined quality standards before it is used for analysis or modeling. This process checks for issues such as missing values, invalid data types, out-of-range values, and inconsistencies. By validating data early, we can prevent downstream errors, improve model reliability, and maintain trust in analytics results.

Typical data validation steps include:
- Verifying data types and formats
- Checking for missing or null values
- Ensuring values fall within expected ranges (e.g., age > 0)
- Detecting duplicates
- Validating categorical values against allowed lists

Implementing robust data validation is a best practice for maintaining high data quality in any data pipeline.
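
The validation steps listed above can be sketched on plain Python dictionaries before moving to Spark. Everything here is illustrative: `validate_records`, the sample rows, and the choice of checks are assumptions, with the 18 to 100 age range and the country list taken from elsewhere in this notebook.

```python
ALLOWED_COUNTRIES = {"USA", "UK", "DE", "FR", "PL"}

def validate_records(records):
    """Return (record id, problem) tuples for every rule a row breaks."""
    problems = []
    seen_ids = set()
    for rec in records:
        rec_id = rec.get("id")
        # Duplicate detection
        if rec_id in seen_ids:
            problems.append((rec_id, "duplicate id"))
        seen_ids.add(rec_id)
        # Missing value, data type, and range checks for age
        age = rec.get("age")
        if age is None:
            problems.append((rec_id, "missing age"))
        elif not isinstance(age, int):
            problems.append((rec_id, "age is not an integer"))
        elif not 18 <= age <= 100:
            problems.append((rec_id, "age out of range"))
        # Categorical value must come from the allowed list
        if rec.get("country") not in ALLOWED_COUNTRIES:
            problems.append((rec_id, "unknown country"))
    return problems

sample = [
    {"id": 1, "age": 34, "country": "USA"},
    {"id": 2, "age": None, "country": "PL"},
    {"id": 2, "age": 17, "country": "XX"},
    {"id": 3, "age": "41", "country": "DE"},
]
for problem in validate_records(sample):
    print(problem)
```

Collecting problems instead of raising on the first one gives a full picture of data quality in a single pass, which is usually what an EDA report needs.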




Schema validation is the process of ensuring that your data conforms to a predefined structure before it is used for analysis or modeling. This includes checking that each column has the correct data type, required columns are present, and constraints (such as NOT NULL or value ranges) are enforced.


In [0]:
# Basic schema validation
expected_columns = ["id", "name", "age", "email", "salary"]
actual_columns = df_raw.columns

print(actual_columns)

In [0]:

missing_columns = set(expected_columns) - set(actual_columns)
extra_columns = set(actual_columns) - set(expected_columns)

if missing_columns:
    print(f"❌ Missing columns: {missing_columns}")
if extra_columns:
    print(f"⚠️ Extra columns: {extra_columns}")
if not missing_columns and not extra_columns:
    print("✅ All expected columns present")

# Check data types
print("📊 Current data types:")
for col_name, col_type in df_raw.dtypes:
    print(f"   {col_name}: {col_type}")

In [0]:
display(df_raw)

In [0]:
# Data type validation for each column
expected_types = {
    "id": "int",
    "name": "string",
    "age": "int",
    "salary": "int",
    "registration_date": "string"
}

actual_types = dict(df.dtypes)

type_mismatches = []
for col_name, expected_type in expected_types.items():
    actual_type = actual_types.get(col_name)
    if actual_type != expected_type:
        type_mismatches.append((col_name, expected_type, actual_type))

if type_mismatches:
    print("❌ Data type mismatches found:")
    for col_name, expected, actual in type_mismatches:
        print(f"   {col_name}: expected {expected}, found {actual}")
else:
    print("✅ All column data types are correct")


### Business Rule Validation Example: Filter/Where Validation

Business rule validation using filter/where applies logical conditions to your data to ensure it meets specific requirements. For example, you can filter out records where `age < 0` or `salary > 1_000_000` to enforce domain rules and maintain data quality.

In [0]:
# Business rule validation example: filter/where validation
# Example: Age must be between 18 and 100 (inclusive), Salary must be >0 and <=500000

from pyspark.sql.functions import col

# Age validation
invalid_age_df = df_raw.filter(
    col("age").isNotNull() & ((col("age") < 18) | (col("age") > 100))
)
display(invalid_age_df)

# Salary validation
invalid_salary_df = df_raw.filter(
    col("salary").isNotNull() & ((col("salary") <= 0) | (col("salary") > 500000))
)
display(invalid_salary_df)

## Best Practices

### 🎯 EDA Strategy (in order of priority):

**1. Start with the big picture:**
- Check shape: `df.count()`, `len(df.columns)`
- Use Data Profile tab in Databricks
- Look at `df.printSchema()` for data types

**2. Identify quality issues:**
- Missing values: `df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])`
- Duplicates: `df.count() - df.dropDuplicates().count()`
- Invalid values: domain-specific checks (e.g., negative ages)

**3. Understand distributions:**
- Use `summary()` for numerical columns
- Check `skewness()` - values > 1 indicate strong skew
- Visualize with histograms and box plots

**4. Document findings:**
- Keep notes on data quality issues
- Document assumptions made
- Track decisions for handling outliers/missing values

| Metric | Good Value | Action if Exceeded |
|--------|------------|-------------------|
| Missing rate | < 5% | Consider imputation |
| Outlier rate | < 1% | Investigate and decide |
| Skewness | -1 to 1 | Consider log transform |
| Duplicate rate | 0% | Remove or investigate |
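
The thresholds in the table can be turned into a simple automated check. This is a minimal sketch: `assess_quality` and the input values are hypothetical, and rates are expressed as fractions (0.05 means 5%).

```python
def assess_quality(missing_rate, outlier_rate, skewness, duplicate_rate):
    """Map the four metrics to the suggested actions from the table above."""
    actions = []
    if missing_rate >= 0.05:       # good value: < 5%
        actions.append("Consider imputation")
    if outlier_rate >= 0.01:       # good value: < 1%
        actions.append("Investigate outliers and decide")
    if not -1 <= skewness <= 1:    # good value: -1 to 1
        actions.append("Consider log transform")
    if duplicate_rate > 0:         # good value: 0%
        actions.append("Remove or investigate duplicates")
    return actions

# Hypothetical metrics for a dataset with missing ages and a long salary tail
report = assess_quality(missing_rate=0.05, outlier_rate=0.01, skewness=3.2, duplicate_rate=0.0)
print(report)
```

A check like this could run at the start of a pipeline so that threshold breaches are surfaced before any model training begins.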

## Summary

### What we achieved:

- **Data Generation**: Created synthetic `customer_bronze` table with realistic defects
- **Data Profiling**: Used `summary()`, `skewness()`, and Data Profile tab
- **Visualizations**: Histograms and Box Plots for distribution analysis
- **Outlier Detection**: Applied IQR method to identify extreme values

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Always start with EDA** - understand your data before modeling |
| 2 | **"Garbage In, Garbage Out"** - data quality determines model quality |
| 3 | **IQR is robust** - use it instead of standard deviation for outliers |
| 4 | **Skewness matters** - affects model choice and feature transformation |
| 5 | **Document everything** - track data quality issues and decisions |

### Metrics to Monitor:

| Metric | Our Data | Status |
|--------|----------|--------|
| Missing values (age) | ~5% | ⚠️ Needs imputation |
| Outliers (salary) | ~1% | ⚠️ Needs handling |
| Skewness (salary) | High | ⚠️ Consider log transform |

### Next Steps:

📚 **Next Module:** Module 2 - Data Splitting (train/test split strategies)

## Cleanup

Optionally remove demo tables created during exercises:

In [0]:
# Cleanup - remove demo tables created in this notebook

# Uncomment the lines below to remove demo tables:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_bronze")
# spark.sql("DROP TABLE IF EXISTS customers_delta")

# print("✅ All demo tables removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo tables)")