# Theory — Data Preprocessing Pipeline

Why preprocessing matters
- Real-world data is messy: missing values, inconsistent categories, different scales, outliers, skewed distributions.
- Good models rely on consistent inputs. Preprocessing turns raw tables into clean numeric matrices that models can learn from.
- Key principle: avoid data leakage. Compute any fitted statistics (means, medians, scaling parameters, category lists) on the training data only, then apply the fitted transforms to validation/test data.
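
The leakage principle can be illustrated with a minimal sketch. The data here is synthetic and the variable names are illustrative assumptions; the point is only that the scaler's statistics come from the training rows and are reused unchanged on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

# Split first, so the test rows never influence any fitted statistic.
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit on the training rows only: the mean and std come from X_train.
scaler = StandardScaler().fit(X_train)

# Apply the same fitted statistics to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("mean learned from train:", scaler.mean_[0])
print("test mean after scaling (not exactly 0):", X_test_scaled.mean())
```

The scaled test set will generally not have exactly zero mean, which is expected: it was transformed with the training statistics, as it would be at prediction time.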

Typical pipeline (plain-English and order of operations)
1) Define the problem and the `target` (y). Is it regression (numeric y) or classification (categorical y)?
2) Split early: create train/test (and optionally validation) splits BEFORE computing any statistics.
3) Handle missing values:
   - Numeric: mean/median imputation; optionally add a “was_missing” indicator column.
   - Categorical: most frequent category (mode); optionally add an explicit “Missing” category.
   - Advanced: KNNImputer or IterativeImputer (a MICE-style imputer, still experimental in scikit-learn) for richer patterns.
4) Encode categorical features:
   - One-Hot Encoding (OHE): safe default for unordered categories (country, color).
   - Ordinal Encoding: only if categories have an inherent order (low < medium < high).
   - High cardinality: consider hashing or target encoding (with care to avoid leakage).
5) Scale/normalize numeric features:
   - StandardScaler (zero mean, unit variance): default for many models (SVM, KNN, linear/logistic regression).
   - MinMaxScaler (rescales each feature to [0, 1]): keeps the shape of the distribution but is sensitive to outliers; often used for neural nets or bounded features.
   - RobustScaler: resilient to outliers by using median/IQR.
   - Trees/forests/boosting usually do NOT need scaling.
6) Optional feature engineering:
   - Date/time decomposition (year, month, day of week), interactions, binning, log/Box–Cox/Yeo–Johnson transformations.
   - Text: Bag-of-Words/TF‑IDF; Images: normalization/augmentation (outside this course’s scope).
7) Fit the model on the preprocessed training matrix. To choose hyperparameters, cross-validate the entire pipeline (preprocessing plus estimator) so the transforms are refit on each training fold.
8) Evaluate on held‑out test data. Inspect residuals (regression), confusion matrices/ROC/PR (classification).
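
Steps 3 and 5 can be sketched on a single numeric column. The four values below are made up for illustration (one missing entry and one outlier); the sketch shows median imputation with a missing-value indicator, then the three scalers side by side.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical training column: one missing value and one outlier (100.0).
X_train = np.array([[1.0], [np.nan], [3.0], [100.0]])

# Median imputation; add_indicator=True appends a 0/1 "was missing" column.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X_train)
print(X_imputed)  # column 0: imputed values, column 1: missing indicator

# Compare the scalers on the imputed values only (keep the column 2-D).
values = X_imputed[:, [0]]
scaled = {
    "standard": StandardScaler().fit_transform(values),
    "minmax": MinMaxScaler().fit_transform(values),
    "robust": RobustScaler().fit_transform(values),
}
for name, arr in scaled.items():
    print(name, np.round(arr.ravel(), 3))
```

With MinMaxScaler the outlier squeezes the remaining values close to 0, while RobustScaler centers on the median and scales by the IQR, so the bulk of the data keeps a usable spread.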

Essential scikit‑learn tools
- SimpleImputer, OneHotEncoder, OrdinalEncoder, StandardScaler, MinMaxScaler, RobustScaler
- ColumnTransformer: apply different transforms to numeric vs categorical columns.
- Pipeline: chain transforms and estimator as a single object.
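- train_test_split, cross_val_score/GridSearchCV: split and evaluate. Leakage is avoided only when the whole Pipeline, rather than pre-transformed data, is passed to cross_val_score/GridSearchCV.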
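
A minimal sketch of how these tools fit together, assuming a small synthetic DataFrame with hypothetical columns `country`, `age`, and `salary`, and a target derived from `age` purely for illustration. It is one reasonable arrangement, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (assumption): 60 rows, three feature columns.
age = np.tile(np.arange(20, 50), 2).astype(float)
df = pd.DataFrame({
    "country": np.tile(["France", "Germany", "Spain"], 20),
    "age": age,
    "salary": 1000.0 * age + 20000.0,
})
df.loc[df.index % 7 == 0, "salary"] = np.nan   # inject some missing values
y = (df["age"] >= 35).astype(int).to_numpy()   # illustrative binary target

# Numeric columns: median imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: mode imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "salary"]),
    ("cat", categorical, ["country"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each CV fold refits the imputers, scaler, and encoder on its own training part.
scores = cross_val_score(model, df, y, cv=5)
print("fold accuracies:", np.round(scores, 3))
```

Because the preprocessing lives inside the Pipeline, cross_val_score never lets a fold's held-out rows influence the fitted transforms.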

Tiny worked example (paper‑and‑pencil)
Dataset (Country, Age, Salary, Purchased):
- Row1: France, 44, 72000, Yes
- Row2: Spain, 27, 48000, No
- Row3: Germany, 30, NaN, Yes
- Row4: Spain, 38, 61000, No
- Row5: France, 40, NaN, Yes

Step‑by‑step:
- Train/test split first (e.g., 80/20; with 5 rows that is 4 for training and 1 for testing).
- Impute missing Salary (numeric): median of train salaries; add indicator if desired.
- Encode Country with OHE: France→[1,0,0], Germany→[0,1,0], Spain→[0,0,1].
- Scale numeric features (Age, Salary) with StandardScaler fitted on train only; transform test with the same scaler.
- Target Purchased (Yes/No) becomes y (e.g., 1/0).
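
The same steps can be reproduced in code as a rough check on the paper-and-pencil version. One assumption is made loudly here: instead of a random split, rows 1 to 4 are used for training and row 5 for testing, so the numbers are deterministic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France"],
    "Age": [44, 27, 30, 38, 40],
    "Salary": [72000, 48000, np.nan, 61000, np.nan],
    "Purchased": ["Yes", "No", "Yes", "No", "Yes"],
})

# Assumption: fixed split (rows 1-4 train, row 5 test) instead of a random one.
train, test = df.iloc[:4], df.iloc[4:]

# Impute Salary with the median of the training salaries (61000 here).
imputer = SimpleImputer(strategy="median").fit(train[["Salary"]])
salary_train = imputer.transform(train[["Salary"]])
salary_test = imputer.transform(test[["Salary"]])

# One-hot encode Country; categories are learned from the training rows.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["Country"]])
country_train = encoder.transform(train[["Country"]]).toarray()
country_test = encoder.transform(test[["Country"]]).toarray()

# Scale Age and imputed Salary with statistics from the training rows only.
numeric_train = np.hstack([train[["Age"]].to_numpy(dtype=float), salary_train])
numeric_test = np.hstack([test[["Age"]].to_numpy(dtype=float), salary_test])
scaler = StandardScaler().fit(numeric_train)

X_train = np.hstack([country_train, scaler.transform(numeric_train)])
X_test = np.hstack([country_test, scaler.transform(numeric_test)])

# Target: Yes/No becomes 1/0.
y_train = (train["Purchased"] == "Yes").astype(int).to_numpy()
y_test = (test["Purchased"] == "Yes").astype(int).to_numpy()

print("imputed salary value:", imputer.statistics_[0])
print("encoder categories:", encoder.categories_[0])
print("X_train shape:", X_train.shape, "X_test shape:", X_test.shape)
```

In practice these separate steps would usually be wrapped in a ColumnTransformer and Pipeline; they are spelled out here only to mirror the manual walk-through.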

Leakage watchlist
- Never fit imputer/scaler/encoder on full data before splitting.
- Do not target‑encode using the entire y; use CV folds or leave‑one‑out schemes.
- Avoid peeking at test data during feature selection.
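
One way to target-encode with CV folds is sketched below. The helper `out_of_fold_target_encode` is hypothetical (written for this illustration, not a scikit-learn function), and the toy data is made up. Each row is encoded with the mean target of its category computed on the other folds, so a row never sees its own label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, y, n_splits=5, random_state=0):
    """Hypothetical helper: encode each row with the mean target of its
    category, computed only on the folds that do not contain that row."""
    categories = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kfold.split(categories):
        # Category means and a fallback prior from the "fit" folds only.
        fold_prior = y.iloc[fit_idx].mean()
        means = y.iloc[fit_idx].groupby(categories.iloc[fit_idx]).mean()
        # Categories unseen in the fit folds fall back to the prior.
        encoded.iloc[enc_idx] = (
            categories.iloc[enc_idx].map(means).fillna(fold_prior).to_numpy()
        )
    return encoded.to_numpy()


# Toy training data (assumption for illustration).
cities = ["A", "A", "A", "B", "B", "B", "A", "B", "A", "B"]
target = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
encoded = out_of_fold_target_encode(cities, target, n_splits=5)
print(np.round(encoded, 3))
```

This scheme applies to the training set only; test rows would be encoded with category means computed from the full training set.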

Pros, cons, pitfalls
- Pros: Cleaner, more stable models; reproducibility; fewer surprises at deployment.
- Cons: More steps; need careful orchestration.
- Pitfalls: Fitting transforms on full data; mismatched columns between train/test; unseen categories at prediction time (set OneHotEncoder(handle_unknown="ignore")).
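
The unseen-category pitfall can be seen in a short sketch. The country values are illustrative; the encoder is fitted on training categories that do not include "Germany".

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data contains only France and Spain (illustrative values).
train_countries = np.array([["France"], ["Spain"], ["France"]], dtype=object)
# "Germany" was never seen during fit.
new_countries = np.array([["Spain"], ["Germany"]], dtype=object)

# With handle_unknown="ignore", an unseen category becomes an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_countries)
encoded = encoder.transform(new_countries).toarray()
print(encoded)

# With the default handle_unknown="error", transform raises a ValueError.
strict_encoder = OneHotEncoder().fit(train_countries)
try:
    strict_encoder.transform(new_countries)
    raised = False
except ValueError:
    raised = True
print("default encoder raised ValueError:", raised)
```

An all-zero row keeps prediction from failing, but the model then has no information about that category, so frequent unseen values are still worth investigating.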

Quick checklist
- Identify target and task type.
- Split early (train/test).
- Impute → Encode → Scale in a ColumnTransformer + Pipeline.
- Validate via CV; keep the pipeline intact.
- Export the fitted pipeline for production use.
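
For the last checklist item, one common option is joblib, sketched below under assumptions: the tiny dataset, the file name `pipeline.joblib`, and the temporary directory are all illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny illustrative dataset (assumption): one feature, binary target.
X = np.array([[20.0], [25.0], [30.0], [45.0], [50.0], [55.0]])
y = np.array([0, 0, 0, 1, 1, 1])

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

# Save and reload the fitted pipeline; a temp directory keeps the sketch tidy.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored object carries both the fitted scaler and the fitted model.
X_new = np.array([[22.0], [52.0]])
same = np.array_equal(pipeline.predict(X_new), restored.predict(X_new))
print("predictions match after reload:", same)
```

Saved pipelines are generally tied to the library versions that produced them, so it is worth recording the scikit-learn version alongside the file.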


# Data Preprocessing Template

## Importing the libraries

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [0]:
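# Load the data; this template treats the last column as the target
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column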

## Splitting the dataset into the Training set and Test set

In [0]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)