# Module 03: Data Transformation and Cleaning

**Estimated Time:** 60-75 minutes

## Learning Objectives

By the end of this module, you will be able to:
- Clean messy data and handle missing values
- Transform data types and formats
- Perform string manipulation and regex operations
- Work with dates and times effectively
- Merge, join, and aggregate datasets
- Apply data normalization techniques

---

## 1. The Importance of Data Transformation

Raw data is rarely ready for analysis. Transformation turns it into a consistent, correctly typed form that downstream steps can rely on.

### Common Transformation Tasks
- **Cleaning**: Remove duplicates, handle nulls, fix errors
- **Type Conversion**: Ensure correct data types
- **Normalization**: Standardize formats and values
- **Enrichment**: Add derived columns
- **Aggregation**: Summarize data
- **Joining**: Combine multiple datasets

### Why Transformation Matters
- Garbage in, garbage out - clean data is critical
- Consistent formats enable reliable analysis
- Proper types prevent errors downstream
- Derived metrics add business value
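
As a small illustration of why proper types matter, the sketch below uses a hypothetical quantity column that arrived as text. Adding the text column to itself concatenates the strings, while the converted column performs arithmetic. The names `raw`, `doubled_text`, and `doubled_num` are made up for this example.

```python
import pandas as pd

# Hypothetical quantities that arrived as text, e.g. from a CSV export
raw = pd.Series(["10", "20", "15"])

# Adding string columns concatenates the text instead of adding numbers
doubled_text = (raw + raw).tolist()

# After conversion to integers, the same operation is real arithmetic
as_int = raw.astype(int)
doubled_num = (as_int + as_int).tolist()

print(doubled_text)
print(doubled_num)
```

No error is raised in the first case, which is why type problems often go unnoticed until the numbers look wrong.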

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import re

print("[OK] Libraries loaded")

---

## 2. Handling Missing Values

Missing data is one of the most common data quality issues.

In [None]:
# Create sample data with missing values
data = {
    "customer_id": [1, 2, 3, 4, 5, 6],
    "name": ["Alice", "Bob", None, "David", "Eve", "Frank"],
    "email": ["alice@ex.com", None, "carol@ex.com", "david@ex.com", None, "frank@ex.com"],
    "age": [25, 30, np.nan, 40, 35, 28],
    "revenue": [1000.0, 1500.0, 2000.0, np.nan, 3000.0, 1200.0],
    "country": ["USA", "UK", "USA", "Canada", None, "USA"],
}

df = pd.DataFrame(data)
print("Original Data:")
print(df)
print("\nMissing Values Count:")
print(df.isnull().sum())

In [None]:
# Strategy 1: Drop rows with any missing values
df_dropped_rows = df.dropna()
print(f"After dropping rows with NaN: {len(df_dropped_rows)} rows remain (from {len(df)})")
df_dropped_rows

In [None]:
# Strategy 2: Drop columns with missing values
df_dropped_cols = df.dropna(axis=1)
print(f"After dropping columns with NaN: {len(df_dropped_cols.columns)} columns remain")
df_dropped_cols
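
Dropping every row or column that contains any missing value is often too aggressive. As a sketch, `dropna` also accepts `subset` (only look at certain columns) and `thresh` (minimum number of non-null values a row needs to be kept). The `people` frame below is a hypothetical example, not the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: email is required, age is optional
people = pd.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "email": ["a@ex.com", None, None],
        "age": [25.0, np.nan, 31.0],
    }
)

# Drop a row only when the required column is missing
with_email = people.dropna(subset=["email"])

# Keep rows that have at least 2 non-null values
mostly_complete = people.dropna(thresh=2)

print(with_email)
print(mostly_complete)
```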

In [None]:
# Strategy 3: Fill missing values (keeps every row, but each fill value is an assumption)
df_filled = df.copy()

# Fill numeric with mean/median
df_filled["age"] = df_filled["age"].fillna(df_filled["age"].median())
df_filled["revenue"] = df_filled["revenue"].fillna(df_filled["revenue"].mean())

# Fill text columns with an explicit placeholder (the column mode is another option)
df_filled["name"] = df_filled["name"].fillna("Unknown")
df_filled["email"] = df_filled["email"].fillna("no-email@example.com")
df_filled["country"] = df_filled["country"].fillna("Unknown")

print("After filling missing values:")
print(df_filled)
print("\nRemaining NaN count:", df_filled.isnull().sum().sum())
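
For ordered data such as a time series, interpolation is a further option alongside dropping and filling with a constant. The sketch below uses hypothetical daily readings and compares linear interpolation with a forward fill. With `method="linear"`, pandas treats the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps
readings = pd.Series(
    [10.0, np.nan, 14.0, np.nan, np.nan, 20.0],
    index=pd.date_range("2024-01-01", periods=6),
)

# Linear interpolation fills each gap from its neighbours
filled = readings.interpolate(method="linear")

# Forward fill repeats the last known value instead
carried = readings.ffill()

print(pd.DataFrame({"raw": readings, "interpolated": filled, "ffill": carried}))
```

Which of the two is appropriate depends on the data: interpolation suits smoothly changing measurements, while forward fill suits values that stay constant until the next update.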

---

## 3. Data Type Conversions

In [None]:
# Sample data with type issues
messy_data = {
    "id": ["1", "2", "3", "4"],
    "price": ["$100.50", "$200.00", "$150.75", "$300.00"],
    "quantity": ["10", "20", "15", "25"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
    "active": ["yes", "no", "yes", "yes"],
}

df_messy = pd.DataFrame(messy_data)
print("Original Types:")
print(df_messy.dtypes)
print("\nData:")
df_messy

In [None]:
# Transform data types
df_cleaned = df_messy.copy()

# Convert ID to integer
df_cleaned["id"] = df_cleaned["id"].astype(int)

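# Remove $ and convert to float
# regex=False treats "$" as a literal character (in a regex it is the end-of-string anchor)
df_cleaned["price"] = df_cleaned["price"].str.replace("$", "", regex=False).astype(float)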

# Convert quantity to integer
df_cleaned["quantity"] = df_cleaned["quantity"].astype(int)

# Convert to datetime
df_cleaned["date"] = pd.to_datetime(df_cleaned["date"])

# Convert yes/no to boolean
df_cleaned["active"] = df_cleaned["active"].map({"yes": True, "no": False})

print("Cleaned Types:")
print(df_cleaned.dtypes)
print("\nCleaned Data:")
df_cleaned
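
`astype` raises an error as soon as one value cannot be converted. A more forgiving sketch, assuming a hypothetical price column with one unparseable entry, is to use `pd.to_numeric` with `errors="coerce"`, which turns bad values into NaN so they can be reviewed afterwards.

```python
import pandas as pd

# Hypothetical price column containing one unparseable entry
prices = pd.Series(["$100.50", "$200.00", "N/A", "$300.00"])

# Strip the currency symbol as a literal character, then coerce to numbers
stripped = prices.str.replace("$", "", regex=False)
numeric = pd.to_numeric(stripped, errors="coerce")

# Rows that failed conversion are NaN; look up their original text
bad_rows = prices[numeric.isna()]

print(numeric)
print("Could not convert:", bad_rows.tolist())
```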

---

## 4. String Manipulation

In [None]:
# Sample data with messy strings
text_data = {
    "name": ["  alice smith  ", "BOB JONES", "carol DAVIS", "david-wilson"],
    "email": ["ALICE@EXAMPLE.COM", "bob@Example.com", "Carol@example.COM", "david@EXAMPLE.com"],
    "phone": ["(555) 123-4567", "555-234-5678", "5552345678", "+1-555-345-6789"],
}

df_text = pd.DataFrame(text_data)
print("Original Text Data:")
df_text

In [None]:
# Clean and standardize strings
df_text_clean = df_text.copy()

# Strip whitespace, title-case names, and replace hyphens with spaces
df_text_clean["name"] = (
    df_text_clean["name"].str.strip().str.title().str.replace("-", " ", regex=False)
)

# Lowercase emails
df_text_clean["email"] = df_text_clean["email"].str.lower()

# Strip phone formatting (keep digits and +); country-code prefixes are not unified here
df_text_clean["phone"] = df_text_clean["phone"].str.replace(r"[^0-9+]", "", regex=True)

print("Cleaned Text Data:")
df_text_clean

In [None]:
# Advanced string operations with regex
sample_text = pd.Series(
    ["Order #12345 total: $500.00", "Order #67890 total: $1,234.56", "Order #11111 total: $99.99"]
)

# Extract order numbers
order_numbers = sample_text.str.extract(r"#(\d+)")
print("Extracted Order Numbers:")
print(order_numbers)

# Extract amounts
amounts = sample_text.str.extract(r"\$([\d,]+\.\d{2})")
amounts = amounts[0].str.replace(",", "").astype(float)
print("\nExtracted Amounts:")
print(amounts)
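
When several fields come out of the same string, named capture groups let `str.extract` return one labelled column per group. This is a sketch; the column names `order_number` and `amount` are chosen for this example, and the third row is a hypothetical non-matching string that shows how misses appear as NaN.

```python
import pandas as pd

orders_text = pd.Series(
    ["Order #12345 total: $500.00", "Order #67890 total: $1,234.56", "no order here"]
)

# Each (?P<name>...) group becomes a column with that name
pattern = r"#(?P<order_number>\d+) total: \$(?P<amount>[\d,]+\.\d{2})"
parsed = orders_text.str.extract(pattern)

# Remove thousands separators and convert the amount to float
parsed["amount"] = parsed["amount"].str.replace(",", "", regex=False).astype(float)

print(parsed)
```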

---

## 5. Date and Time Operations

In [None]:
# Create sample data with dates
date_data = {
    "transaction_id": range(1, 6),
    "date": ["2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05", "2024-05-25"],
    "timestamp": [
        "2024-01-15 10:30:00",
        "2024-02-20 14:45:00",
        "2024-03-10 09:15:00",
        "2024-04-05 16:20:00",
        "2024-05-25 11:00:00",
    ],
}

df_dates = pd.DataFrame(date_data)
df_dates["date"] = pd.to_datetime(df_dates["date"])
df_dates["timestamp"] = pd.to_datetime(df_dates["timestamp"])

print("Original Date Data:")
df_dates

In [None]:
# Extract date components
df_dates["year"] = df_dates["date"].dt.year
df_dates["month"] = df_dates["date"].dt.month
df_dates["month_name"] = df_dates["date"].dt.month_name()
df_dates["day"] = df_dates["date"].dt.day
df_dates["day_of_week"] = df_dates["date"].dt.day_name()
df_dates["quarter"] = df_dates["date"].dt.quarter

# Extract time components
df_dates["hour"] = df_dates["timestamp"].dt.hour
df_dates["minute"] = df_dates["timestamp"].dt.minute

# Calculate days since first transaction
df_dates["days_since_first"] = (df_dates["date"] - df_dates["date"].min()).dt.days

print("Date Data with Extracted Components:")
df_dates
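
Beyond extracting components, datetime columns support arithmetic and formatting. The sketch below assumes a hypothetical 30-day payment term to derive a due date, and builds a `year_month` label that is convenient for grouping.

```python
import pandas as pd

tx = pd.DataFrame({"date": pd.to_datetime(["2024-01-15", "2024-02-20"])})

# Hypothetical 30-day payment term added to every transaction date
tx["due_date"] = tx["date"] + pd.Timedelta(days=30)

# Year-month label, useful as a grouping key
tx["year_month"] = tx["date"].dt.strftime("%Y-%m")

print(tx)
```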

---

## 6. Merging and Joining Datasets

In [None]:
# Create sample datasets to merge
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "name": ["Alice", "Bob", "Carol", "David"],
        "country": ["USA", "UK", "Canada", "Australia"],
    }
)

orders = pd.DataFrame(
    {
        "order_id": [101, 102, 103, 104, 105],
        "customer_id": [1, 2, 1, 3, 5],  # Note: customer 5 doesn't exist
        "amount": [100, 200, 150, 300, 250],
        "date": pd.date_range("2024-01-01", periods=5),
    }
)

print("Customers:")
print(customers)
print("\nOrders:")
print(orders)

In [None]:
# Inner join (only matching records)
inner_merged = pd.merge(orders, customers, on="customer_id", how="inner")
print("Inner Join (only matching customers):")
print(inner_merged)
print(f"\nResult: {len(inner_merged)} rows")

In [None]:
# Left join (all orders, matched customers)
left_merged = pd.merge(orders, customers, on="customer_id", how="left")
print("Left Join (all orders):")
print(left_merged)
print(f"\nResult: {len(left_merged)} rows (NaN for unmatched customer)")

In [None]:
# Outer join (all records from both)
outer_merged = pd.merge(orders, customers, on="customer_id", how="outer")
print("Outer Join (all orders and customers):")
print(outer_merged)
print(f"\nResult: {len(outer_merged)} rows")
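
To see which side each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The `validate` argument additionally checks the expected key relationship. The sketch below uses small stand-in frames named `cust` and `ords` so it does not overwrite the notebook's `customers` and `orders`.

```python
import pandas as pd

cust = pd.DataFrame({"customer_id": [1, 2, 3, 4], "name": ["Alice", "Bob", "Carol", "David"]})
ords = pd.DataFrame({"order_id": [101, 102, 103, 104, 105], "customer_id": [1, 2, 1, 3, 5]})

# indicator=True adds a _merge column: left_only, right_only, or both
audit = pd.merge(ords, cust, on="customer_id", how="outer", indicator=True)

orphan_orders = audit[audit["_merge"] == "left_only"]
customers_without_orders = audit[audit["_merge"] == "right_only"]

# validate raises MergeError if customer_id is not unique on the right side
checked = pd.merge(ords, cust, on="customer_id", how="left", validate="many_to_one")

print(audit)
```

Checking `_merge` before choosing a join type makes it explicit which records an inner join would silently discard.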

---

## 7. Aggregation and Grouping

In [None]:
# Create sales data
np.random.seed(42)  # fixed seed so the random sample is the same on every run
sales = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=20),
        "product": np.random.choice(["A", "B", "C"], 20),
        "region": np.random.choice(["North", "South", "East", "West"], 20),
        "quantity": np.random.randint(1, 100, 20),
        "revenue": np.random.uniform(100, 1000, 20).round(2),
    }
)

print("Sales Data (first 10 rows):")
sales.head(10)

In [None]:
# Group by product and aggregate
product_summary = (
    sales.groupby("product")
    .agg({"quantity": ["sum", "mean", "count"], "revenue": ["sum", "mean", "max"]})
    .round(2)
)

print("Product Summary:")
product_summary

In [None]:
# Group by multiple columns
region_product_summary = (
    sales.groupby(["region", "product"])
    .agg({"revenue": "sum", "quantity": "sum"})
    .round(2)
    .sort_values("revenue", ascending=False)
)

print("Region & Product Summary:")
region_product_summary

In [None]:
# Pivot table for cross-tabulation
pivot = sales.pivot_table(
    values="revenue", index="product", columns="region", aggfunc="sum", fill_value=0
).round(2)

print("Revenue Pivot Table (Product x Region):")
pivot
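
Passing a dictionary of lists to `agg` produces two-level column names. Named aggregation is an alternative that yields flat, descriptive columns. This is a sketch on a small hypothetical frame; the output column names are chosen for this example.

```python
import pandas as pd

sales_small = pd.DataFrame(
    {
        "product": ["A", "A", "B", "B", "B"],
        "quantity": [10, 20, 5, 15, 10],
        "revenue": [100.0, 300.0, 50.0, 150.0, 100.0],
    }
)

# Each keyword is an output column: (input column, aggregation function)
summary = (
    sales_small.groupby("product")
    .agg(
        total_quantity=("quantity", "sum"),
        avg_revenue=("revenue", "mean"),
        order_count=("revenue", "count"),
    )
    .reset_index()
)

print(summary)
```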

---

## 8. Data Normalization

In [None]:
# Create sample data for normalization
scores = pd.DataFrame(
    {
        "student": ["Alice", "Bob", "Carol", "David", "Eve"],
        "math_score": [95, 80, 70, 85, 90],
        "english_score": [88, 92, 78, 85, 95],
    }
)

print("Original Scores:")
print(scores)

In [None]:
# Min-Max Normalization (scale to 0-1)
# Note: a constant column has zero range, so the division yields NaN
def min_max_normalize(series):
    return (series - series.min()) / (series.max() - series.min())


scores["math_normalized"] = min_max_normalize(scores["math_score"])
scores["english_normalized"] = min_max_normalize(scores["english_score"])

print("Min-Max Normalized Scores:")
print(scores)

In [None]:
# Z-score Normalization (standardization)
# Note: pandas .std() is the sample standard deviation (ddof=1)
def z_score_normalize(series):
    return (series - series.mean()) / series.std()


scores["math_zscore"] = z_score_normalize(scores["math_score"])
scores["english_zscore"] = z_score_normalize(scores["english_score"])

print("Z-Score Normalized Scores:")
print(scores[["student", "math_zscore", "english_zscore"]].round(2))
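
Two details of these formulas are easy to miss: min-max scaling divides by zero when every value is equal, and pandas computes the sample standard deviation by default. The sketch below shows one possible guard, a hypothetical helper `safe_min_max` that returns 0.0 for constant columns (a design choice, not the only reasonable one), and compares the two standard deviations.

```python
import pandas as pd


def safe_min_max(series: pd.Series) -> pd.Series:
    """Min-max scale to 0-1, returning 0.0 everywhere for a constant series."""
    value_range = series.max() - series.min()
    if value_range == 0:
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range


constant = safe_min_max(pd.Series([70, 70, 70]))
varied = safe_min_max(pd.Series([70, 85, 100]))

# Sample (ddof=1, the pandas default) vs population (ddof=0) standard deviation
example = pd.Series([70, 85, 100])
sample_std = example.std()
population_std = example.std(ddof=0)

print(constant.tolist(), varied.tolist())
print(sample_std, population_std)
```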

---

## 9. Complete Transformation Pipeline Example

In [None]:
# Create messy realistic dataset
messy_sales = pd.DataFrame(
    {
        "order_id": ["ORD-001", "ORD-002", "ORD-003", "ORD-004", "ORD-005"],
        "customer_name": ["  alice SMITH  ", "bob jones", None, "CAROL davis", "david-wilson"],
        "order_date": ["2024-01-15", "2024/02/20", "2024-03-10", "2024-04-05", "2024-05-25"],
        "total": ["$1,234.56", "$567.89", "$890.12", None, "$2,345.67"],
        "status": ["delivered", "PENDING", "delivered", "cancelled", "delivered"],
        "country": ["USA", "uk", "USA", "canada", None],
    }
)

print("Messy Sales Data:")
print(messy_sales)
print("\nData Types:")
print(messy_sales.dtypes)

In [None]:
# Complete transformation pipeline
def transform_sales_data(df):
    """
    Complete transformation pipeline for sales data
    """
    df_clean = df.copy()

    # 1. Handle missing values
    df_clean["customer_name"] = df_clean["customer_name"].fillna("Unknown Customer")
    df_clean["country"] = df_clean["country"].fillna("Unknown")
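    # Assumption: a missing total is treated as $0.00 so the conversion below succeeds;
    # leaving it as NaN is the safer choice when averages matter
    df_clean["total"] = df_clean["total"].fillna("$0.00")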

    # 2. Clean and standardize strings
    df_clean["customer_name"] = (
        df_clean["customer_name"].str.strip().str.title().str.replace("-", " ")
    )

    df_clean["status"] = df_clean["status"].str.lower().str.strip()
    df_clean["country"] = df_clean["country"].str.upper().str.strip()

    # 3. Convert data types
    # Clean currency and convert to float
    # regex=False treats "$" and "," as literal characters
    df_clean["total"] = (
        df_clean["total"]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(float)
    )

    # Standardize date format and convert to datetime
    df_clean["order_date"] = pd.to_datetime(df_clean["order_date"].str.replace("/", "-"))

    # 4. Add derived columns
    df_clean["year"] = df_clean["order_date"].dt.year
    df_clean["month"] = df_clean["order_date"].dt.month
    df_clean["quarter"] = df_clean["order_date"].dt.quarter
    df_clean["is_delivered"] = df_clean["status"] == "delivered"

    # 5. Sort by date
    df_clean = df_clean.sort_values("order_date").reset_index(drop=True)

    print("[OK] Transformation complete!")
    print(f"   Records processed: {len(df_clean)}")
    print(f"   Missing values remaining: {df_clean.isnull().sum().sum()}")

    return df_clean


# Transform the data
clean_sales = transform_sales_data(messy_sales)
print("\nCleaned Sales Data:")
print(clean_sales)
print("\nNew Data Types:")
print(clean_sales.dtypes)
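
A pipeline like this is usually followed by a few sanity checks on its output. The sketch below is one possible approach, using a hypothetical `validate_sales` helper and illustrative rules (unique order IDs, non-negative totals, datetime order dates). The actual rules depend on the dataset.

```python
import pandas as pd


def validate_sales(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems found in cleaned sales data."""
    problems = []
    if df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if (df["total"] < 0).any():
        problems.append("negative totals")
    if not pd.api.types.is_datetime64_any_dtype(df["order_date"]):
        problems.append("order_date is not datetime")
    return problems


# Hypothetical frames: one that passes and one that violates every rule
good = pd.DataFrame(
    {
        "order_id": ["ORD-001", "ORD-002"],
        "total": [10.0, 20.0],
        "order_date": pd.to_datetime(["2024-01-15", "2024-02-20"]),
    }
)
bad = pd.DataFrame(
    {
        "order_id": ["ORD-001", "ORD-001"],
        "total": [10.0, -5.0],
        "order_date": ["2024-01-15", "2024-02-20"],
    }
)

print(validate_sales(good))
print(validate_sales(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to log all issues in a single run.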

---

## 10. Key Takeaways

[OK] **Missing Values**: Fill, drop, or interpolate based on context

[OK] **Type Conversion**: Always ensure correct data types

[OK] **String Cleaning**: Standardize formats (case, whitespace, special chars)

[OK] **Date Operations**: Extract components for analysis

[OK] **Merging**: Understand different join types

[OK] **Aggregation**: Group and summarize for insights

[OK] **Normalization**: Scale data when needed

### Next Steps

In **Module 04: Data Loading and Storage**, we'll:
- Load transformed data to various destinations
- Work with different file formats
- Understand batch vs incremental loading
- Optimize for performance

---

**Ready to load data?** Open `04_data_loading_storage.ipynb`!