# Chapter 15: Data Cleaning, Transformation, and Preprocessing

---

## Learning Objectives

By the end of this chapter, you will be able to:

- Understand **why data cleaning is essential** for accurate analysis
- **Detect and handle missing values** using different strategies (drop, impute)
- **Identify and treat outliers** using statistical methods (IQR, Z-score)
- Perform **data consistency checks** (duplicates, standardization, range validation)
- Apply **normalization and scaling** techniques to numeric features
- **Encode categorical variables** for machine learning models
- **Create new features** through transformation and feature engineering
- Build **reusable preprocessing pipelines** with scikit-learn

---

## Introduction

**Data cleaning** improves correctness and consistency (e.g., missing values, duplicates, wrong types).

**Transformation & preprocessing** reshape your data so it works well with analysis or machine learning (e.g., scaling numbers, encoding categories).

A simple mental model:
1. **Understand the data** (columns, types, ranges, meaning)
2. **Fix quality issues** (missing, invalid, inconsistent)
3. **Prepare features** (scaling, encoding, transformations)
4. **Validate** (re-check summaries and sanity checks)

We’ll do all of this step-by-step.

## Setup

We’ll use these libraries:
- `pandas`, `numpy` for data work
- `matplotlib`, `seaborn` for quick visuals
- `scikit-learn` for scaling and encoding examples

If you don’t have them installed, run (in a terminal):

```bash
pip install pandas numpy matplotlib seaborn scikit-learn
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
np.random.seed(42)

## Loading a practice dataset with real-world issues

We'll use the **penguins** dataset from seaborn — a real dataset about penguin measurements that naturally contains:
- Missing values (real measurement gaps)
- Categorical variables that need encoding
- Numeric features that need scaling

We'll also add some typical data quality issues for practice:
- Duplicate rows
- Inconsistent categories (different spellings/case)
- Some outliers

This lets us practice cleaning techniques on real data.

In [None]:
# Load the penguins dataset - it has natural missing values!
penguins = sns.load_dataset("penguins")

# Create our practice dataframe with some additional issues
df = penguins.copy()

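# Rename columns to mimic a business context.
# Note: the values are still penguin measurements (e.g., "age" holds bill depth in mm),
# so their ranges will not look like real customer data.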
df = df.rename(columns={
    "species": "segment",
    "island": "city", 
    "bill_length_mm": "income",
    "bill_depth_mm": "age",
    "flipper_length_mm": "spent_last_30d",
    "body_mass_g": "customer_id"
})

# Add signup_date
df["signup_date"] = pd.to_datetime("2024-01-01") + pd.to_timedelta(
    np.random.randint(0, 365, size=len(df)), unit="D"
)

# Introduce additional issues for practice:
# 1) Inconsistent city names (mixed case and stray whitespace)
inconsistent_idx = np.random.choice(df.index, size=15, replace=False)
df.loc[inconsistent_idx[:5], "city"] = "torgersen"  # lowercase
df.loc[inconsistent_idx[5:10], "city"] = "BISCOE"   # uppercase  
df.loc[inconsistent_idx[10:], "city"] = "Dream "    # extra space

# 2) Introduce a few more outliers
outlier_idx = np.random.choice(df.index[df["income"].notna()], size=3, replace=False)
df.loc[outlier_idx, "income"] = df["income"].max(skipna=True) * 3

outlier_idx_spend = np.random.choice(df.index[df["spent_last_30d"].notna()], size=2, replace=False)
df.loc[outlier_idx_spend, "spent_last_30d"] = df["spent_last_30d"].max() * 2

# 3) Introduce duplicates
df = pd.concat([df, df.sample(5, random_state=7)], ignore_index=True)

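# 4) Store some ages as strings (wrong type)
# Cast to object first so the column can hold a mix of floats and strings
df["age"] = df["age"].astype(object)
bad_type_idx = np.random.choice(df.index[df["age"].notna()], size=6, replace=False)
df.loc[bad_type_idx, "age"] = df.loc[bad_type_idx, "age"].astype(str)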

print(f"Dataset shape: {df.shape}")
print("Missing values: natural gaps from the penguins dataset, plus the issues added above")
df.head()

## First look: inspect & summarize

Before you clean anything, always do a quick inspection:
- **Shape**: how many rows/columns?
- **Types**: numbers vs text vs dates
- **Summary stats**: min/max/mean to spot impossible values
- **Missing values**: which columns are affected?

This step prevents “blind cleaning” that can accidentally remove useful information.

In [None]:
print("Shape:", df.shape)
display(df.dtypes)
display(df.describe(include="all").transpose())

missing_counts = df.isna().sum().sort_values(ascending=False)
missing_counts

### Quick visual: missing values

Tables are useful, but visuals make patterns obvious. A missingness heatmap can show:
- If missing values are random
- Or if they happen in blocks (e.g., a system failure on certain days)

For small datasets, this is a quick and helpful check.

In [None]:
plt.figure(figsize=(10, 4))
sns.heatmap(df.isna(), cbar=False)
plt.title("Missing values (True = missing)")
plt.xlabel("Columns")
plt.ylabel("Rows")
plt.show()

## 15.1 Importance of data cleaning

Data cleaning matters because:
- **Wrong inputs → wrong outputs** (often called *garbage in, garbage out*)
- Missing values can bias averages and models
- Outliers can distort statistics and charts
- Inconsistent categories create duplicate groups (e.g., “NYC” vs “New York”)
- Duplicates can over-count customers or transactions

A good goal is not “perfect” data — it’s **data that is accurate enough for the decision you want to make**, with clear assumptions.

**Tip (beginner-friendly rule):** Always document what you changed and why.

## 15.2 Missing data handling strategies

Missing data is common (forms left blank, sensor failures, system bugs). Before choosing a strategy, ask:
- **How much is missing?** 1% vs 40% changes the decision.
- **Why is it missing?** (random vs systematic)
- **Is the column important?**
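
One way to probe the "random vs systematic" question is to compare the missing rate across groups. The sketch below uses a small made-up table (the `channel` column and its values are hypothetical, not part of the practice dataset) and checks whether `income` is missing more often in one group than another.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): income is missing only for the "web" channel
toy = pd.DataFrame({
    "channel": ["web", "web", "web", "store", "store", "store"],
    "income": [np.nan, np.nan, 52.0, 48.0, 50.0, 47.0],
})

# Share of rows with missing income overall
overall_missing = toy["income"].isna().mean()

# Share of rows with missing income within each channel
missing_by_channel = toy["income"].isna().groupby(toy["channel"]).mean()

print(f"Overall missing: {overall_missing:.1%}")
print(missing_by_channel)
```

A large gap between groups suggests the missingness is systematic, in which case dropping rows would remove one group more than the other.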

### Common strategies
1. **Drop rows** (only if few rows are missing, and missingness seems random)
2. **Drop columns** (if the column is mostly missing and not critical)
3. **Impute** (fill missing values):
   - Numeric: mean/median
   - Categorical: most frequent
   - Time series: forward-fill/back-fill

**Warning:** Imputation can hide real problems. Don’t fill blindly—understand the story behind the missing values.
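
The numeric case is demonstrated later in this section. For the other two imputation options in the list, here is a minimal sketch on made-up data (the series names and values are illustrative assumptions): most-frequent filling for a categorical column, and forward-fill followed by back-fill for a time series.

```python
import numpy as np
import pandas as pd

# Categorical: fill with the most frequent value (the mode)
segments = pd.Series(["A", "B", "A", np.nan, "A", np.nan])
most_frequent = segments.mode().iloc[0]
segments_filled = segments.fillna(most_frequent)

# Time series: carry the last known value forward, then back-fill
# anything still missing at the very start of the series
readings = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 13.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
readings_filled = readings.ffill().bfill()

print(segments_filled.tolist())
print(readings_filled.tolist())
```

Forward-fill assumes the last observed value is still valid, which is reasonable for slowly changing quantities but can be misleading across long gaps.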

In [None]:
# Convert 'age' to numeric safely (coerce invalid values to NaN)
df_clean = df.copy()
df_clean["age"] = pd.to_numeric(df_clean["age"], errors="coerce")

df_clean[["age", "income"]].isna().sum()

### Strategy A: drop rows with missing values (simple, but can remove data)

We’ll drop rows where **age or income** is missing. This is okay *only* if:
- The number of missing rows is small
- You believe missingness is not systematic

Let’s measure how many rows we’d lose.

In [None]:
rows_before = len(df_clean)
df_dropna = df_clean.dropna(subset=["age", "income"])
rows_after = len(df_dropna)

print(f"Rows before: {rows_before}")
print(f"Rows after dropna: {rows_after}")
print(f"Rows removed: {rows_before - rows_after} ({(rows_before - rows_after)/rows_before:.1%})")

### Strategy B: impute (fill) missing numeric values

A common approach is to fill numeric missing values with the **median**.
Why median?
- It’s less sensitive to outliers than the mean.

We’ll fill missing `age` and `income` with their medians.

In [None]:
df_imputed = df_clean.copy()

for col in ["age", "income"]:
    median_value = df_imputed[col].median()
    df_imputed[col] = df_imputed[col].fillna(median_value)

df_imputed[["age", "income"]].isna().sum()

### Exercise 1 — Missing data
1. Compute the % missing for each column.
2. Try mean-imputation for `income` and compare mean vs median result.
3. (Thinking) When would dropping rows be a bad idea?

Write code in the next cell.

In [None]:
# 1) % missing per column
missing_pct = df_clean.isna().mean().sort_values(ascending=False) * 100
display(missing_pct.to_frame(name="missing_%"))

# 2) Compare mean vs median imputation for income
income_mean = df_clean["income"].mean()
income_median = df_clean["income"].median()

df_income_mean = df_clean.copy()
df_income_median = df_clean.copy()

df_income_mean["income"] = df_income_mean["income"].fillna(income_mean)
df_income_median["income"] = df_income_median["income"].fillna(income_median)

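# Report the same statistics for both versions so the difference is visible
for label, frame in [("mean-filled", df_income_mean), ("median-filled", df_income_median)]:
    print(f"{label}: mean={frame['income'].mean():.2f}, median={frame['income'].median():.2f}")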

## 15.3 Outlier detection and treatment

An **outlier** is a value that is unusually far from most other values.

Outliers happen for many reasons:
- Real rare events (a very high-spending customer)
- Data entry errors (extra zero: 5000 instead of 500)
- Measurement problems

### Why outliers matter
- They can distort averages (mean) and correlation
- They can stretch chart axes so normal values look “flat”
- Some models are sensitive to them

### Common detection methods
- **Box plot / IQR rule** (good general method)
- **Z-score** (works best if data is roughly normal)
- **Domain rules** (e.g., age cannot be negative)

We’ll use the IQR method for `income` and `spent_last_30d`.

In [None]:
def iqr_bounds(series, k=1.5):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return lower, upper

for col in ["income", "spent_last_30d"]:
    s = df_imputed[col]
    lower, upper = iqr_bounds(s)
    outliers = (s < lower) | (s > upper)
    print(f"{col}: {outliers.sum()} outliers (IQR rule). Bounds: [{lower:,.2f}, {upper:,.2f}]")
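
The Z-score method from the list of detection methods is not used in this chapter's walkthrough, so here is a minimal sketch of it on made-up values. The helper name `zscore_outliers` and the threshold of 3 are assumptions chosen for illustration; 3 is a common convention, not a fixed rule.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return a boolean mask of values whose |z-score| exceeds the threshold."""
    mean = series.mean()
    std = series.std(ddof=0)  # population standard deviation
    z = (series - mean) / std
    return z.abs() > threshold

# Toy data (hypothetical): 20 typical values plus one extreme value
values = pd.Series([50.0] * 10 + [52.0] * 10 + [500.0])
mask = zscore_outliers(values)

print("Flagged values:", values[mask].tolist())
```

Because the mean and standard deviation are themselves pulled by extreme values, this method can miss outliers in small or heavily skewed samples, which is one reason the IQR rule is often preferred as a general default.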

### Visualize outliers with box plots

A box plot quickly shows median, quartiles, and points outside the typical range.
We’ll plot `income` and `spent_last_30d`.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(y=df_imputed["income"], ax=axes[0])
axes[0].set_title("Income (box plot)")
axes[0].set_ylabel("income")

sns.boxplot(y=df_imputed["spent_last_30d"], ax=axes[1])
axes[1].set_title("Spent last 30d (box plot)")
axes[1].set_ylabel("spent_last_30d")

plt.tight_layout()
plt.show()

### Treatment options

What should you do with outliers? It depends. Common options:
1. **Fix obvious errors** (best if you can verify)
2. **Remove outliers** (risky if they are real important cases)
3. **Cap/Winsorize** (limit extreme values to a threshold)
4. **Transform** (e.g., log transform income)

We’ll demonstrate **capping** (winsorization-like) using IQR bounds.

**Common mistake:** Automatically deleting outliers can remove exactly the customers you care about (e.g., high spenders).

In [None]:
df_outlier_capped = df_imputed.copy()

for col in ["income", "spent_last_30d"]:
    lower, upper = iqr_bounds(df_outlier_capped[col])
    df_outlier_capped[col] = df_outlier_capped[col].clip(lower=lower, upper=upper)

df_outlier_capped[["income", "spent_last_30d"]].describe().transpose()
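
Capping overwrites the original values, so it can help to record which rows were changed. The following sketch, on made-up data with hypothetical column names, keeps a boolean flag next to the capped column so that extreme cases remain easy to find later.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"spend": [10.0, 12.0, 11.0, 13.0, 12.0, 300.0]})

# IQR bounds, computed the same way as in iqr_bounds with k=1.5
q1, q3 = toy["spend"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Record which rows will change before capping them
toy["spend_was_capped"] = (toy["spend"] < lower) | (toy["spend"] > upper)
toy["spend_capped"] = toy["spend"].clip(lower=lower, upper=upper)

print(toy)
```

Keeping the flag (and the original column) makes the treatment reversible and lets you check whether the capped rows are errors or genuinely important cases.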

### Exercise 2 — Outliers
1. Use the IQR rule to flag outliers in `age`.
2. Try a log transform on `income` and re-plot a histogram (before vs after).

Write code in the next cell.

In [None]:
# 1) Outliers in age
lower_age, upper_age = iqr_bounds(df_outlier_capped["age"])
age_outliers = (df_outlier_capped["age"] < lower_age) | (df_outlier_capped["age"] > upper_age)
print("Age outliers (IQR):", age_outliers.sum())

# 2) Log transform income and compare histograms
income_raw = df_outlier_capped["income"]
income_log = np.log1p(income_raw)  # log(1 + x) avoids issues with zeros

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(income_raw, bins=30, ax=axes[0])
axes[0].set_title("Income (raw)")

sns.histplot(income_log, bins=30, ax=axes[1])
axes[1].set_title("Income (log1p)")

plt.tight_layout()
plt.show()

## 15.4 Data consistency checks

Data consistency means the data follows expected rules. Examples:
- No duplicate records for the same event
- Categories are standardized (same spelling/case)
- Numeric ranges make sense (age ≥ 0)
- Dates are valid and in the expected range

### Common checks in practice
- **Duplicate rows**: `duplicated()`
- **Invalid ranges**: filter with logical conditions
- **Inconsistent categories**: normalize text (strip, lower) and map to standard values

We’ll fix duplicates and standardize the `city` column.

In [None]:
# Start from the dataframe in which:
# - numeric types are fixed (age coerced),
# - missing values are handled (imputed),
# - outliers are treated (capped)
df_consistent = df_outlier_capped.copy()

# 1) Duplicates
dup_count = df_consistent.duplicated().sum()
print("Duplicate rows:", dup_count)

df_consistent = df_consistent.drop_duplicates()
print("Rows after dropping duplicates:", len(df_consistent))

# 2) Standardize city names
def standardize_city(city):
    if pd.isna(city):
        return city
    city_clean = str(city).strip().lower()
    # Mapping for the penguins dataset island names (used as cities)
    mapping = {
        "torgersen": "Torgersen",
        "biscoe": "Biscoe",
        "dream": "Dream",
    }
    return mapping.get(city_clean, city_clean.title())

df_consistent["city"] = df_consistent["city"].apply(standardize_city)

df_consistent["city"].value_counts()
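
`duplicated()` with no arguments only finds rows that match in every column. To check for "duplicate records for the same event", you usually compare on a key instead. The sketch below uses a made-up orders table (the `order_id` and `amount` columns are hypothetical) to show the difference.

```python
import pandas as pd

# Toy data (hypothetical): order 102 was recorded twice with different amounts
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [20.0, 35.0, 36.0, 15.0],
})

# Exact duplicates: every column must match
exact_dups = orders.duplicated().sum()

# Key duplicates: same order_id, even if other columns differ.
# keep=False marks every row in a duplicated group, not just the repeats.
key_dups = orders.duplicated(subset=["order_id"], keep=False)

print("Exact duplicate rows:", exact_dups)
print(orders[key_dups])
```

Key duplicates with conflicting values usually need a decision (keep the latest, keep the first, or investigate) rather than a blind `drop_duplicates()`.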

### Range and logic checks (examples)

Let’s add a couple of simple rules:
- `age` should be between 0 and 110
- `spent_last_30d` should be ≥ 0

In real datasets, these rules come from domain knowledge.

In [None]:
invalid_age = (df_consistent["age"] < 0) | (df_consistent["age"] > 110)
invalid_spend = df_consistent["spent_last_30d"] < 0

print("Invalid ages:", invalid_age.sum())
print("Invalid spend values:", invalid_spend.sum())

# If any existed, you might correct them (if possible) or remove those rows
df_consistent = df_consistent.loc[~(invalid_age | invalid_spend)].copy()
df_consistent.shape
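
The same idea applies to the date rule listed at the start of this section (dates should be valid and in the expected range). Here is a minimal sketch on made-up strings; the window boundaries are assumptions for illustration and would come from domain knowledge in practice.

```python
import pandas as pd

# Toy data (hypothetical): one malformed date, one date outside the expected window
raw_dates = pd.Series(["2024-03-15", "not a date", "2031-01-01", "2024-11-30"])

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, errors="coerce")

window_start = pd.Timestamp("2024-01-01")
window_end = pd.Timestamp("2024-12-31")

unparseable = parsed.isna()
out_of_range = parsed.notna() & ~parsed.between(window_start, window_end)

print("Unparseable dates:", int(unparseable.sum()))
print("Out-of-range dates:", int(out_of_range.sum()))
```

Separating "could not be parsed" from "parsed but implausible" is useful because the two problems tend to have different causes and different fixes.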

### Exercise 3 — Consistency checks
1. Standardize `segment` to uppercase (just in case).
2. Create a check that finds customers with `income` below 0 (should be none).
3. Print a small “data quality report” with counts of: missing values, duplicates, invalid ages.

Write code in the next cell.

In [None]:
df_ex3 = df_consistent.copy()

# 1) Segment to uppercase
df_ex3["segment"] = df_ex3["segment"].astype(str).str.strip().str.upper()

# 2) Income below 0 check
income_below_zero = (df_ex3["income"] < 0).sum()
print("Income below 0 count:", income_below_zero)

# 3) Simple data quality report
quality_report = {
    "rows": len(df_ex3),
    "missing_total": int(df_ex3.isna().sum().sum()),
    "duplicates": int(df_ex3.duplicated().sum()),
    "invalid_age": int(((df_ex3["age"] < 0) | (df_ex3["age"] > 110)).sum()),
}
quality_report

## 15.5 Data normalization and scaling

Many algorithms (and even some charts) work better when numeric features are on similar scales.

Example: in a real customer dataset, `income` might be in the tens of thousands while `age` is around 30–50. Some models would then give income more weight simply because its numbers are larger.

### Common scaling methods
- **Standardization (Z-score)**: mean 0, standard deviation 1
- **Min-Max scaling**: scales to [0, 1]
- **Robust scaling**: uses median and IQR (good when outliers exist)

We’ll demonstrate scaling with scikit-learn.

**Tip:** Scale only the columns that need it (numeric features), and usually *after* handling missing values.
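
To make the three definitions concrete, the sketch below computes each one by hand with NumPy on a small made-up array. It is meant to illustrate the formulas; to my understanding they correspond to the default behavior of the scikit-learn scalers used in this section, but treat that as an assumption and check the library docs if exact agreement matters.

```python
import numpy as np

# Toy feature (hypothetical values) with one large value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standardization: subtract the mean, divide by the standard deviation
standardized = (x - x.mean()) / x.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print("standardized:", np.round(standardized, 2))
print("minmax:      ", np.round(minmax, 2))
print("robust:      ", np.round(robust, 2))
```

Notice how the single large value squeezes the other min-max results close to 0, while robust scaling keeps the typical values spread out between -1 and 0.5.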

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

numeric_cols = ["age", "income", "spent_last_30d"]
X_num = df_consistent[numeric_cols].copy()

scalers = {
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "robust": RobustScaler(),
}

scaled_examples = {}
for name, scaler in scalers.items():
    scaled = scaler.fit_transform(X_num)
    scaled_examples[name] = pd.DataFrame(scaled, columns=numeric_cols)

# Compare summary stats
display(pd.concat({k: v.describe().loc[["mean", "std", "min", "max"]] for k, v in scaled_examples.items()}))

### Exercise 4 — Scaling
1. Apply `RobustScaler` and make a scatter plot of scaled `income` vs scaled `spent_last_30d`.
2. Why might robust scaling be a good choice when income has outliers?

Write code in the next cell.

In [None]:
robust = RobustScaler()
X_robust = pd.DataFrame(robust.fit_transform(X_num), columns=numeric_cols)

plt.figure(figsize=(6, 5))
sns.scatterplot(data=X_robust, x="income", y="spent_last_30d")
plt.title("Robust-scaled: income vs spent_last_30d")
plt.show()

## 15.6 Encoding categorical variables

Many machine learning models require numbers. But real datasets often contain **categories** like city, segment, product type.

### Common encoding methods
- **One-Hot Encoding**: creates a 0/1 column for each category (good for nominal categories like city)
- **Ordinal Encoding**: converts ordered categories to integers (only if order makes sense: low < medium < high)

**Warning:** Don’t assign numbers like A=1, B=2, C=3 unless there’s a real order. That can accidentally tell the model that C > B > A.

We’ll use one-hot encoding for `city` and `segment`.

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_cols = ["city", "segment"]
X_cat = df_consistent[cat_cols].copy()

# Note: sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_cat_ohe = ohe.fit_transform(X_cat)
ohe_feature_names = ohe.get_feature_names_out(cat_cols)

df_cat_ohe = pd.DataFrame(X_cat_ohe, columns=ohe_feature_names, index=df_consistent.index)
df_cat_ohe.head()
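
Ordinal encoding, the other method listed above, is not needed for `city` or `segment` because neither has a natural order. For a column that does, one possible approach is an ordered pandas categorical. The `tiers` column below is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): a column with a genuine low < medium < high order
tiers = pd.Series(["medium", "low", "high", "low", "medium"])

# Declare the order explicitly so it does not depend on alphabetical sorting
tier_order = ["low", "medium", "high"]
tiers_cat = pd.Categorical(tiers, categories=tier_order, ordered=True)

# Integer codes follow the declared order: low=0, medium=1, high=2
tier_codes = pd.Series(tiers_cat.codes, index=tiers.index)

print(tier_codes.tolist())
```

Values that are not in the declared category list receive the code -1, so it is worth checking for -1 after encoding.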

### Exercise 5 — Encoding
1. Use `pd.get_dummies()` to one-hot encode the same columns.
2. Compare the columns created by pandas vs scikit-learn.

Write code in the next cell.

In [None]:
df_dummies = pd.get_dummies(df_consistent[cat_cols], drop_first=False)
print("Pandas dummy columns (first 10):")
print(list(df_dummies.columns)[:10])

print("\nscikit-learn OHE columns (first 10):")
print(list(ohe_feature_names)[:10])

## 15.7 Feature creation and transformation

A **feature** is an input column used for analysis or modeling. Feature engineering means creating better features from existing data.

### Why create/transform features?
- Some relationships are easier to capture after transformation (e.g., log income)
- Models can improve with meaningful derived variables (e.g., days since signup)
- Dates and text often need extraction to become useful

We’ll create a few simple features:
- `days_since_signup` (from `signup_date`)
- `income_log1p` (transform)
- `spend_per_income` (ratio)
- `signup_month` (from date)

**Tip:** Start simple. Fancy features don’t help if the data is still messy.

In [None]:
df_features = df_consistent.copy()

# Reference date for calculation — in real work, use a real "today" date
as_of_date = pd.to_datetime("2025-01-01")

df_features["days_since_signup"] = (as_of_date - df_features["signup_date"]).dt.days
df_features["income_log1p"] = np.log1p(df_features["income"])
df_features["spend_per_income"] = df_features["spent_last_30d"] / df_features["income"]
df_features["signup_month"] = df_features["signup_date"].dt.month

df_features[["signup_date", "days_since_signup", "income", "income_log1p", "spent_last_30d", "spend_per_income", "signup_month"]].head()
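
Ratio features such as `spend_per_income` become infinite when the denominator is zero. That does not happen with the practice data here, but it is common in real datasets. A minimal sketch of one way to guard against it, on made-up values:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): the second row has zero income
toy = pd.DataFrame({
    "spent": [20.0, 5.0, 8.0],
    "income": [100.0, 0.0, 40.0],
})

# Replace a zero denominator with NaN so the ratio is missing rather than infinite
safe_income = toy["income"].replace(0, np.nan)
toy["spend_per_income"] = toy["spent"] / safe_income

print(toy)
```

Treating the ratio as missing keeps the problem visible, and the value can then be handled with the same missing-data strategies used earlier in the chapter.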

### Visual check: transformed vs original income

A log transform often makes a skewed distribution more “balanced”, which can help both visualization and modeling.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.histplot(df_features["income"], bins=30, ax=axes[0])
axes[0].set_title("Income (capped)")

sns.histplot(df_features["income_log1p"], bins=30, ax=axes[1])
axes[1].set_title("Income log1p")

plt.tight_layout()
plt.show()

### Exercise 6 — Feature engineering
1. Create a new feature `is_high_spender` where 1 means `spent_last_30d` is above the 75th percentile, else 0.
2. Create `age_group` by binning age into groups (e.g., 0–24, 25–34, 35–44, 45–54, 55+).
3. Plot the average spend by `age_group`.

Write code in the next cell.

In [None]:
df_ex6 = df_features.copy()

# 1) High spender label
threshold = df_ex6["spent_last_30d"].quantile(0.75)
df_ex6["is_high_spender"] = (df_ex6["spent_last_30d"] > threshold).astype(int)

# 2) Age groups (binning)
bins = [0, 24, 34, 44, 54, 200]
labels = ["0-24", "25-34", "35-44", "45-54", "55+"]
df_ex6["age_group"] = pd.cut(df_ex6["age"], bins=bins, labels=labels, right=True, include_lowest=True)

# 3) Average spend by age group
avg_spend_by_age = df_ex6.groupby("age_group", observed=True)["spent_last_30d"].mean().sort_index()
display(avg_spend_by_age)

plt.figure(figsize=(7, 4))
sns.barplot(x=avg_spend_by_age.index, y=avg_spend_by_age.values)
plt.title("Average spend (last 30d) by age group")
plt.xlabel("Age group")
plt.ylabel("Average spend")
plt.show()

## Mini-project — Build a preprocessing pipeline (clean → transform → model-ready)

Goal: create a clean, model-ready dataset using a *repeatable* preprocessing pipeline.

Why pipelines are useful:
- You apply the **same steps** every time (fewer mistakes)
- It helps with reproducibility
- It prevents some common errors when training/testing models

We will:
1. Select numeric and categorical columns
2. Impute missing values
3. Scale numeric columns
4. One-hot encode categorical columns

This is a common “preprocessing template” used in real projects.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# We'll build from df_clean (after type fixing) to show imputation inside the pipeline
df_project = df_clean.copy()

# Standardize city early (string clean) — in real systems this could be a custom transformer
df_project["city"] = df_project["city"].apply(standardize_city)

numeric_features = ["age", "income", "spent_last_30d"]
categorical_features = ["city", "segment"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop",
)

X_ready = preprocessor.fit_transform(df_project)
X_ready.shape

### Inspect the generated feature matrix

After preprocessing, we have a numeric matrix with:
- Scaled numeric columns
- One-hot encoded categorical columns

Let’s view the feature names so we can interpret what the pipeline produced.

In [None]:
# Get output feature names (scikit-learn >= 1.0 supports this for many transformers)
feature_names = preprocessor.get_feature_names_out()

df_ready = pd.DataFrame(X_ready, columns=feature_names)
df_ready.head()
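
One of the training/testing errors that pipelines help avoid is learning preprocessing statistics (medians, means, scales) from the test rows. The sketch below uses a made-up single-column dataset and a simplified pipeline, not the chapter's `preprocessor`, to show the usual pattern: fit on the training split only, then reuse the fitted pipeline on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one numeric feature with a missing value
rng = np.random.default_rng(0)
X = pd.DataFrame({"income": rng.normal(50, 10, size=40)})
X.loc[3, "income"] = np.nan

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Learn the median, mean, and standard deviation from the training rows only
X_train_ready = numeric_pipeline.fit_transform(X_train)

# Reuse those learned values on the test rows, without refitting
X_test_ready = numeric_pipeline.transform(X_test)

print(X_train_ready.shape, X_test_ready.shape)
```

Calling `fit_transform` on the full dataset before splitting would let information from the test rows leak into the preprocessing, which tends to make evaluation results look better than they should.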

### Mini-project exercise
1. Add a new feature to the raw data: `days_since_signup` (from `signup_date`).
2. Update the pipeline to include it as a numeric feature.
3. Confirm the output matrix has one extra numeric column.

Hint: you can create the column in pandas first, then include it in `numeric_features`.

In [None]:
df_project2 = df_project.copy()
as_of_date = pd.to_datetime("2025-01-01")
df_project2["days_since_signup"] = (as_of_date - df_project2["signup_date"]).dt.days

numeric_features2 = numeric_features + ["days_since_signup"]

preprocessor2 = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features2),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="drop",
)

X_ready2 = preprocessor2.fit_transform(df_project2)
print("Old shape:", X_ready.shape)
print("New shape:", X_ready2.shape)

## Tips, warnings, and common beginner mistakes

- **Don’t “fix” data without understanding it.** Always inspect first (`info`, `describe`, missing counts).
- **Avoid deleting too much.** Dropping rows/columns is easy but can bias your results.
- **Be careful with outliers.** An outlier might be a valuable rare case, not an error.
- **Don’t treat categories as numbers** unless the category truly has an order.
- **Document your steps.** Future-you (and teammates) will thank you.
- **Re-check after cleaning.** After each major step, re-run summaries to verify the change.
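
The "re-check after cleaning" habit can be turned into a small reusable function. This is only a sketch: the helper name `run_sanity_checks` and the three rules it applies are assumptions for illustration, and real checks should come from your own domain rules.

```python
import numpy as np
import pandas as pd

def run_sanity_checks(frame, numeric_cols):
    """Return a dict of simple post-cleaning checks (True means the check passed)."""
    numeric = frame[numeric_cols]
    return {
        "no_missing_numeric": bool(numeric.notna().all().all()),
        "no_duplicate_rows": not bool(frame.duplicated().any()),
        "no_negative_numeric": not bool((numeric < 0).any().any()),
    }

# Toy data (hypothetical): one clean table and one with obvious problems
cleaned = pd.DataFrame({"age": [30.0, 41.0, 25.0], "spent": [10.0, 0.0, 99.5]})
messy = pd.DataFrame({"age": [30.0, np.nan, 30.0], "spent": [10.0, -5.0, 10.0]})

cleaned_report = run_sanity_checks(cleaned, ["age", "spent"])
messy_report = run_sanity_checks(messy, ["age", "spent"])

print("cleaned:", cleaned_report)
print("messy:  ", messy_report)
```

Running a function like this after each major step gives a quick, repeatable signal that a cleaning step did what you expected.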

## Additional resources (optional)

- pandas user guide (missing data): https://pandas.pydata.org/docs/user_guide/missing_data.html
- scikit-learn preprocessing docs: https://scikit-learn.org/stable/modules/preprocessing.html
- scikit-learn pipelines: https://scikit-learn.org/stable/modules/compose.html
- Seaborn categorical plots: https://seaborn.pydata.org/tutorial/categorical.html

## Summary / Key takeaways

- Cleaning is essential: it improves accuracy, consistency, and trust in results.
- Handle missing values thoughtfully: drop only when appropriate, otherwise impute.
- Detect outliers (IQR/box plot) and choose a treatment that matches your business context.
- Use consistency checks (duplicates, category standardization, range rules) to prevent silent errors.
- Scale numeric data when needed, especially before many ML algorithms.
- Encode categorical variables (one-hot for non-ordered categories).
- Create simple, meaningful features (dates → durations, log transforms, ratios).
- Pipelines make preprocessing repeatable and less error-prone.