# Master Machine Learning Preprocessing Template
This notebook contains all the core techniques learned in Days 1-30.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

print("Toolkit Ready!")

Toolkit Ready!


### 1. The Data Check-up
Before applying any preprocessing, you must understand your data's health.
*   **Missing Values:** Identify columns with null entries (`NaN`). Most scikit-learn estimators raise an error on `NaN`, so these must be handled before training.
*   **Duplicates:** Repeated rows can bias your model.
*   **Data Types:** Ensure numeric columns are `int`/`float` and categorical columns are `object` (or `category`).
*   **Skewness:** Check whether each numerical distribution is roughly Gaussian (bell curve) or skewed.

Use the commands below to diagnose your dataset.

In [None]:
# df = pd.read_csv('your_file.csv')

# 1. Structure & Integrity
# df.info() # Check data types & non-null counts
# print(f"Duplicate rows: {df.duplicated().sum()}") # Check duplicates

# 2. Missing Values Analysis
# print(df.isnull().mean() * 100) # Percentage of missing data

# 3. Numerical Distribution (Check for skewness/outliers)
# print(df.describe())

# 4. Categorical Distribution (Check for cardinality/rare labels)
# print(df['category_col'].value_counts())

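# 5. Correlation Check
# numeric_only=True skips text columns, which otherwise raise an error in pandas 2.0+
# print(df.corr(numeric_only=True)['target_col'].sort_values()) # Correlation with target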

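As a quick illustration, the same checks can be run on a small made-up DataFrame. The column names (`age`, `fare`, `city`) and all values are hypothetical and exist only for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one missing city, one duplicated row,
# and one extreme fare that skews the distribution.
df = pd.DataFrame({
    "age":  [22.0, 35.0, np.nan, 35.0, 41.0, 29.0],
    "fare": [7.5, 8.0, 9.0, 8.0, 500.0, 7.0],
    "city": ["Pune", "Delhi", "Delhi", "Delhi", None, "Pune"],
})

duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row
missing_pct = df.isnull().mean() * 100        # percentage of missing values per column
fare_skew = df["fare"].skew()                 # a positive value means a long right tail
city_counts = df["city"].value_counts(dropna=False)

print(f"Duplicate rows: {duplicate_rows}")
print(missing_pct.round(1))
print(f"Fare skewness: {fare_skew:.2f}")
print(city_counts)
```

A strongly positive skewness for `fare` is a hint that the median is likely a safer imputation choice than the mean for that column.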
### 2. Imputation Strategies (Handling Missing Data)
Imputation is the process of filling missing values with statistical estimates.

**Strategies for Numerical Data:**
1.  **Mean:** Use when the data is roughly normally distributed and has no strong outliers.
2.  **Median:** Use when the data is skewed (the median is robust to outliers).
3.  **Arbitrary value:** Replace with a value outside the normal range, such as -1 or 99 (use when data is not missing at random).

**Strategies for Categorical Data:**
1.  **Most Frequent (Mode):** Replaces missing values with the most common category.
2.  **Constant:** Fills missing values with a new category like "Missing".

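To make the difference between these strategies concrete, here is a minimal sketch on made-up data. The `salary` and `city` arrays are hypothetical; only `SimpleImputer` itself comes from the toolkit above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salary column with one extreme value and one missing entry.
salary = np.array([[30.0], [32.0], [35.0], [1000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(salary)
median_filled = SimpleImputer(strategy="median").fit_transform(salary)

# The mean is pulled toward the outlier; the median is not.
print("mean fill:  ", mean_filled[-1, 0])
print("median fill:", median_filled[-1, 0])

# Categorical column: fill with the mode, or with an explicit "Missing" label.
city = np.array([["Delhi"], ["Pune"], ["Delhi"], [np.nan]], dtype=object)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(city)
flag_filled = SimpleImputer(strategy="constant", fill_value="Missing").fit_transform(city)
print(mode_filled.ravel(), flag_filled.ravel())
```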
In [None]:
# 1. Numerical Imputation
si_mean = SimpleImputer(strategy='mean')     # Use for Normal distribution
si_median = SimpleImputer(strategy='median') # Use for Skewed data (Example: Age, Salary)
si_arb = SimpleImputer(strategy='constant', fill_value=-1) # Flag missing values explicitly

# 2. Categorical Imputation
si_mode = SimpleImputer(strategy='most_frequent') # Replace with Mode (safe default)
si_miss = SimpleImputer(strategy='constant', fill_value='Missing') # Create specific "Missing" category

# 3. Advanced: KNN Imputation (Multivariate)
# Uses other features to guess the missing value by finding 'neighbors'
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)

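The idea behind KNN imputation can be sketched on a tiny made-up table. The height/weight columns and `n_neighbors=2` are assumptions chosen so the result is easy to follow by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical rows of [height_cm, weight_kg]; the last weight is missing.
X = np.array([
    [150.0, 50.0],
    [152.0, 52.0],
    [180.0, 80.0],
    [182.0, 82.0],
    [151.0, np.nan],
])

# The two rows closest in height to 151 are the ones at 150 and 152,
# so the missing weight becomes the average of their weights.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[-1])
```

Because the neighbors are found with a distance measure, features on very different scales can distort which rows count as "close".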
### 3. Encoding (Converting Words to Numbers)
Most machine learning models accept only numerical input. Encoding converts categorical text into a numerical format.

**Key Techniques:**
1.  **One-Hot Encoding (Nominal Data):**
    *   **What it is:** Creates a new binary column for each category (e.g., Color -> Red, Green, Blue).
    *   **When to use:** When categories have **no inherent order** (Gender, City, Brand).
    *   **Note:** Use `drop='first'` to avoid the "Dummy Variable Trap" (perfect multicollinearity), which mainly affects linear models.

2.  **Ordinal Encoding (Ordinal Data):**
    *   **What it is:** Assigns an integer rank to each category (e.g., Low=0, Medium=1, High=2).
    *   **When to use:** When categories have a clear **rank or order** (Education Level, Satisfaction Rating).

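A minimal sketch of both encoders on made-up values follows. The `city` and `size` arrays are hypothetical, and `.toarray()` is used instead of a dense-output flag so the snippet does not depend on a specific scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature: there is no order between cities (hypothetical values).
city = np.array([["Delhi"], ["Mumbai"], ["Pune"], ["Delhi"]], dtype=object)
ohe = OneHotEncoder(drop="first")
city_encoded = ohe.fit_transform(city).toarray()  # the default output is sparse
print(ohe.categories_[0])  # learned categories, sorted alphabetically
print(city_encoded)        # 2 columns: "Delhi" was dropped and is encoded as [0, 0]

# Ordinal feature: the order is supplied explicitly, lowest to highest.
size = np.array([["Medium"], ["Low"], ["High"]], dtype=object)
oe = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_encoded = oe.fit_transform(size)
print(size_encoded.ravel())
```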
In [None]:
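# 1. One-Hot Encoding (Nominal: Gender, City)
# drop='first': Removes 1 dummy col to prevent multicollinearity (dummy variable trap)
# handle_unknown='ignore': Encodes categories unseen during fit as all zeros.
#   With drop='first' this is the same encoding as the dropped category (scikit-learn warns about it).
# sparse_output=False: Returns a dense array (parameter name in scikit-learn >= 1.2)
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')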

# 2. Ordinal Encoding (Ordinal: Education, Satisfaction)
# Define order manually: [Col_1_Order, Col_2_Order]
oe = OrdinalEncoder(categories=[
    ['School', 'UG', 'PG'],                # e.g., Education
    ['Low', 'Medium', 'High']              # e.g., Satisfaction
])

# 3. Label Encoding (Target Variable ONLY)
# Encodes y (Target) into 0, 1, 2... Do NOT use for X (features)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# y_train = le.fit_transform(y_train)

### 4. Scaling (Feature Scaling)
Scaling brings all features to a similar range so that no feature dominates purely because of its units. Without scaling, a feature like "Salary" (range 20k-100k) will dominate "Age" (range 18-90) in distance-based or gradient-based models.

**Key Scalers:**
1.  **StandardScaler (Z-Score Normalization):**
    *   **How it works:** Centers data around 0 with a standard deviation of 1. Formula: $z = \frac{x - \mu}{\sigma}$
    *   **When to use:** Default choice for most algorithms (Logistic Regression, SVM, KNN). Preserves outliers.

2.  **MinMaxScaler (Normalization):**
    *   **How it works:** Squeezes data between 0 and 1. Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
    *   **When to use:** Deep Learning (CNNs, ANNs) or when you know the distribution is not Gaussian. Sensitive to outliers.

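Both formulas can be checked by hand against scikit-learn. This is a small sketch on made-up ages; note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical ages as a single feature column.
x = np.array([[18.0], [30.0], [42.0], [90.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)

# x' = (x - min) / (max - min)
mm_manual = (x - x.min()) / (x.max() - x.min())
mm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```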
In [None]:
# 1. StandardScaler (Z-Score)
# Mean=0, Std=1. Does NOT handle outliers (they stay outliers).
# Use for: Linear Regression, Logistic Regression, KNN, SVM, PCA
scaler = StandardScaler()

# 2. MinMaxScaler
# Rescales each feature to the range [0, 1]. Sensitive to outliers.
# Use for: Neural Networks (CNN/ANN), distance-based algorithms when features are bounded
minmax = MinMaxScaler()

# 3. RobustScaler
# Scales using the median and IQR (Interquartile Range).
# Use for: Data with heavy outliers
from sklearn.preprocessing import RobustScaler
robust = RobustScaler()

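To compare how the three scalers react to an outlier, here is a minimal sketch on five made-up numbers, four ordinary values and one extreme one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Four ordinary values and one extreme outlier (made-up numbers).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(x).ravel(),
    "minmax": MinMaxScaler().fit_transform(x).ravel(),
    "robust": RobustScaler().fit_transform(x).ravel(),
}
for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")
```

With `MinMaxScaler` the four ordinary values end up squeezed into roughly the bottom 3% of the range, while `RobustScaler` keeps them spread around 0 and simply leaves the outlier far away.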
### 5. The Ultimate Workflow (ColumnTransformer + Pipeline)
Instead of applying steps manually one by one, we bundle them.

*   **Pipeline:** Chains sequential steps together (e.g., Impute -> Scale -> Model).
*   **ColumnTransformer:** Applies different transformations to different columns **in parallel** (e.g., Scale numerical columns vs Encode categorical columns).

**Why use this?**
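1.  **Prevents Data Leakage:** When you call `fit` on `X_train` and only `transform` on `X_test`, statistics such as the mean and variance are learned from the training data alone and then reused on the test data.
2.  **Production Ready:** You can save the entire fitted object as a `.pkl` file (for example with `pickle` or `joblib`) and deploy it easily.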

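The leakage point can be verified directly. In this sketch on a made-up single-feature array, the median stored inside the fitted pipeline matches the median of the training rows only, not of the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset; the values are arbitrary.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0], [7.0], [80.0]])
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

X_train_t = pipe.fit_transform(X_train)  # statistics are learned here, from training rows only
X_test_t = pipe.transform(X_test)        # test rows reuse those statistics; nothing is refit

train_median = np.nanmedian(X_train)
learned_median = pipe.named_steps["imputer"].statistics_[0]
print(train_median, learned_median)
```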
In [7]:
# Step 1: Define which columns get which treatment
num_cols = ['age', 'fare'] # example columns
cat_cols = ['embarked', 'sex'] # example columns

# Step 2: Create sub-pipelines
num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

# Step 3: Combine into ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num_section', num_pipe, num_cols),
    ('cat_section', cat_pipe, cat_cols)
])

# Now you can just use:
# X_train_transformed = preprocessor.fit_transform(X_train)
# X_test_transformed = preprocessor.transform(X_test)  # transform only; never refit on test data

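Here is a self-contained sketch of this workflow on made-up Titanic-style rows, including a `pickle` round trip. The data values are hypothetical. Two deliberate deviations from the cell above: `drop='first'` is left out so unseen categories stay distinguishable, and `sparse_threshold=0` is set so the result is always a dense array.

```python
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Hypothetical Titanic-style rows; the values are made up for illustration.
df = pd.DataFrame({
    "age":      [22, 38, np.nan, 35, 54, 2, 27, 14],
    "fare":     [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, np.nan],
    "embarked": ["S", "C", "S", "S", np.nan, "Q", "S", "C"],
    "sex":      ["male", "female", "female", "female", "male", "male", "female", "female"],
})
num_cols, cat_cols = ["age", "fare"], ["embarked", "sex"]
X_train, X_test = train_test_split(df, test_size=0.25, random_state=42)

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("encoder", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer(
    transformers=[("num_section", num_pipe, num_cols),
                  ("cat_section", cat_pipe, cat_cols)],
    sparse_threshold=0,  # always return a dense array, which is easier to inspect
)

X_train_t = preprocessor.fit_transform(X_train)  # fit on training rows only
X_test_t = preprocessor.transform(X_test)        # reuse the learned statistics

# Round-trip through pickle, as one might do before deployment.
restored = pickle.loads(pickle.dumps(preprocessor))
print(X_train_t.shape, X_test_t.shape)
print(np.allclose(restored.transform(X_test), X_test_t))
```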
### 6. Alternative Workflow (Sequential Pipeline)
This method stacks several `ColumnTransformer` objects sequentially inside a `Pipeline`.

**Flow:**
Step 1 (Impute) $\rightarrow$ Output $\rightarrow$ Step 2 (Encode) $\rightarrow$ Output $\rightarrow$ Step 3 (Scale)

**Critical Warning:**
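In this approach, the output of each step is a NumPy array without column names, and the column positions change between steps. A `ColumnTransformer` places the columns it transforms first and appends the `remainder='passthrough'` columns after them, so indices shift even when no columns are added. If a step (like One-Hot Encoding) also adds new columns, the indices shift again. You must manually work out the new column indices for every subsequent step.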

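The reordering is easiest to see on a tiny numeric array. In this sketch the values are made up so that each number reveals its original column (10s for column 0, 20s for column 1, and so on).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Four hypothetical numeric columns; the values encode the original column index.
X = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [11.0, 21.0, np.nan, 41.0],
    [12.0, 22.0, 32.0, 42.0],
])

trnf = ColumnTransformer(
    transformers=[("impute_col2", SimpleImputer(strategy="mean"), [2])],
    remainder="passthrough",
)
out = trnf.fit_transform(X)

# The transformed column (original index 2) now comes first,
# followed by the untouched columns 0, 1 and 3 in their original order.
print(out)
```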
**When to use:**
Use this only if one transformer needs the output of another (e.g., missing values must be filled before feature extraction can run). Otherwise, use the parallel approach from Section 5.

In [None]:
# Step 1: Impute Missing Values (ColumnTransformer 1)
trnf1 = ColumnTransformer(transformers=[
    ('impute_age', SimpleImputer(), [2]), # default strategy='mean'; [2] is the input column index
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6])
], remainder='passthrough')

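# Step 2: Encoding (ColumnTransformer 2)
# Note: Step 1 reorders columns: its outputs come first (Age -> 0, Embarked -> 1),
# then the passthrough columns in their original order.
# Assuming a Titanic-style input (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked),
# Sex moves from index 1 to index 3 and Embarked from index 6 to index 1.
trnf2 = ColumnTransformer(transformers=[
    ('ohe_sex_embarked', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), [1, 3])
], remainder='passthrough')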

# Step 3: Scaling (ColumnTransformer 3)
trnf3 = ColumnTransformer(transformers=[
    ('scale', MinMaxScaler(), slice(0, 10)) # Scaling all columns
])

# Step 4: Create the Sequential Pipeline
pipe = Pipeline([
    ('step1', trnf1),
    ('step2', trnf2),
    ('step3', trnf3),
    # ('model', DecisionTreeClassifier())
])

# pipe.fit(X_train, y_train)
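
The sequential flow can be exercised end to end on made-up data. This sketch assumes a Titanic-style column order (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked); the values are hypothetical. The encoder indices are `[1, 3]`, which is where Embarked and Sex sit after step 1 under that assumed layout, and `sparse_threshold=0` is used to force dense output without relying on a version-specific encoder flag.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical frame with the assumed column order
# 0 Pclass, 1 Sex, 2 Age, 3 SibSp, 4 Parch, 5 Fare, 6 Embarked.
X = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Sex":      ["male", "female", "female", "female", "male", "male"],
    "Age":      [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "SibSp":    [1, 1, 0, 1, 0, 3],
    "Parch":    [0, 0, 0, 0, 0, 1],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    "Embarked": ["S", "C", "Q", "S", np.nan, "S"],
})

trnf1 = ColumnTransformer([
    ("impute_age", SimpleImputer(), [2]),
    ("impute_embarked", SimpleImputer(strategy="most_frequent"), [6]),
], remainder="passthrough")
# After trnf1 the order is: Age, Embarked, Pclass, Sex, SibSp, Parch, Fare.

trnf2 = ColumnTransformer([
    ("ohe_sex_embarked", OneHotEncoder(handle_unknown="ignore"), [1, 3]),
], remainder="passthrough", sparse_threshold=0)
# After trnf2: 3 Embarked dummies + 2 Sex dummies + 5 numeric columns = 10.

trnf3 = ColumnTransformer([("scale", MinMaxScaler(), slice(0, 10))])

pipe = Pipeline([("step1", trnf1), ("step2", trnf2), ("step3", trnf3)])
out = pipe.fit_transform(X)
print(out.shape)
```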