# Theory — Preprocessing Tools and When to Use Them

Goal
- Understand the main preprocessing operators in scikit‑learn, what they do, when to use them, and typical settings.

Core transformers (with practical guidance)
1) Imputation (handle missing values)
   - SimpleImputer(strategy="mean"/"median"/"most_frequent"/"constant")
     • Numeric: median is robust to outliers; mean is fine when outliers are mild.
     • Categorical: most_frequent; or constant like "Missing".
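     • SimpleImputer(add_indicator=True) or a separate MissingIndicator adds binary "was missing" flags, which can help models when missingness is informative.
   - KNNImputer: fills gaps from the nearest neighbors; slower than SimpleImputer and distance-based, so feature scale affects which rows count as neighbors.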
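
The two imputers above can be compared on a toy matrix. This is a minimal sketch, assuming made-up data (`X_num` is hypothetical and not part of the tutorial dataset); it only shows the shape of the API, not a recommended configuration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical numeric matrix with one missing value in each of the first two columns.
X_num = np.array([[1.0, 10.0, 100.0],
                  [2.0, np.nan, 110.0],
                  [np.nan, 30.0, 120.0],
                  [100.0, 40.0, 500.0]])

# Median imputation; add_indicator=True appends one 0/1 "was missing"
# column for every feature that had missing values during fit.
median_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_median = median_imputer.fit_transform(X_num)

# KNN imputation: each gap is filled from the 2 nearest rows,
# with distances computed on the columns both rows have observed.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X_num)

print(X_median)
print(X_knn)
```

Because column 2 has a much larger range than the others, it dominates the KNN distances here, which is the scaling sensitivity mentioned above.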

2) Encoding categoricals
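   - OneHotEncoder(sparse_output=False, handle_unknown="ignore")
     • Default for nominal features; avoids implying order.
     • handle_unknown="ignore" encodes categories unseen during fit as all zeros instead of raising an error at inference.
     • sparse_output requires scikit-learn 1.2 or later; older releases call the parameter sparse.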
   - OrdinalEncoder: only for truly ordered categories (e.g., education levels).
   - High cardinality: Hashing trick (FeatureHasher) or target encoding (with strict CV to avoid leakage).
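
A short sketch of the two encoders, assuming small hand-made arrays (the education level names are hypothetical). It leaves the one-hot output sparse and calls `.toarray()` so it does not depend on the `sparse_output` parameter name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

train_country = np.array([["France"], ["Spain"], ["Germany"]])
new_country = np.array([["Spain"], ["Italy"]])  # "Italy" was never seen during fit

# Nominal feature: one column per category, unseen categories become all zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_country)
encoded = ohe.transform(new_country).toarray()
print(ohe.categories_[0])  # categories are stored in sorted order
print(encoded)

# Ordered feature: pass the order explicitly so the integer codes respect it.
levels = [["HighSchool", "Bachelor", "Master"]]  # hypothetical ordering
ordinal = OrdinalEncoder(categories=levels)
codes = ordinal.fit_transform(np.array([["Master"], ["HighSchool"], ["Bachelor"]]))
print(codes.ravel())
```

Without the explicit `categories` argument, OrdinalEncoder would order the levels alphabetically, which rarely matches the real ranking.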

3) Scaling/normalization (numeric only)
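   - StandardScaler: default for linear models, SVMs, and KNN; centers each feature to zero mean and scales it to unit variance.
   - MinMaxScaler: rescales each feature to [0, 1] by default; good when features are bounded or for neural nets, but sensitive to outliers.
   - RobustScaler: centers on the median and scales by the IQR; robust to outliers.
   - Normalizer (L2): rescales each sample (row) to unit norm rather than each feature; mostly for text/embeddings.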
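
To make the differences concrete, the sketch below runs the four scalers on a tiny hand-made column with one outlier (the values are illustrative only).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# One feature with a single large outlier (hypothetical values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(x)      # min -> 0, max -> 1
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR

# Normalizer works per row, not per column: each sample gets unit L2 norm.
rows = np.array([[3.0, 4.0], [1.0, 0.0]])
unit_rows = Normalizer(norm="l2").fit_transform(rows)

print(minmax.ravel())
print(robust.ravel())
print(unit_rows)
```

With MinMaxScaler the outlier pushes the four ordinary values into a narrow band near 0, while RobustScaler keeps them spread around the median.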

4) Column selection/routing
   - ColumnTransformer([("num", num_pipeline, num_cols), ("cat", cat_pipeline, cat_cols)])
     • Apply numeric pipeline (impute→scale) to numeric columns, categorical pipeline (impute→OHE) to categorical columns.
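     • Output columns follow the order of the transformers list (here numeric first, then categorical), not the original column order; columns not listed are dropped unless remainder="passthrough".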
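
A minimal sketch of this routing, assuming a small DataFrame with hypothetical column names (`country`, `age`, `salary`); the step names inside the pipelines are also illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data with gaps in the numeric columns.
df = pd.DataFrame({
    "country": ["France", "Spain", "Germany", "Spain"],
    "age": [44.0, 27.0, np.nan, 38.0],
    "salary": [72000.0, 48000.0, 54000.0, np.nan],
})
num_cols = ["age", "salary"]
cat_cols = ["country"]

# Numeric branch: impute, then scale.
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: impute, then one-hot encode.
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])
X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
```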

5) Orchestration and evaluation
   - Pipeline(steps=[("prep", preprocessor), ("model", estimator)])
     • Encapsulates all preprocessing + model in one object.
     • Safe with cross_val_score/GridSearchCV: the transformers are refit on each training fold, which prevents leakage.
   - Splitting: train_test_split for a hold-out set, StratifiedKFold for classification, GroupKFold for grouped data.
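
The sketch below shows one way the pieces fit together. It assumes synthetic data from `make_classification` and a `LogisticRegression` model, neither of which comes from the tutorial; any estimator could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Preprocessing and model live in one object, so each CV fold
# refits the scaler on that fold's training rows only.
pipe = Pipeline(steps=[
    ("prep", StandardScaler()),
    ("model", LogisticRegression()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)
print(scores.mean())
```

Scaling `X_demo` once up front and then cross-validating only the model would let every fold see statistics from its own validation rows, which is the leakage the Pipeline avoids.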

6) Targets and labels
   - LabelEncoder: for single target labels (y) only; not for feature columns.
   - One-vs-Rest strategies and probability calibration happen after preprocessing.
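
As a small illustration of the target-only use of LabelEncoder, here is a sketch on a hand-made label list; the variable name `label_enc` is arbitrary.

```python
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
y_encoded = label_enc.fit_transform(["No", "Yes", "No", "Yes", "Yes"])

print(label_enc.classes_)                   # classes are stored in sorted order
print(y_encoded)                            # integer code = position in classes_
print(label_enc.inverse_transform([0, 1]))  # map predictions back to the labels
```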

Common patterns (paper‑and‑pencil)
- Numeric only: SimpleImputer(median) → StandardScaler.
- Mixed data: ColumnTransformer(num=[imputer+scaler], cat=[imputer+OHE]) → model.
- Trees/forests/boosting: often skip scaling; still impute and OHE for categoricals (or consider tree libs with native categoricals).
- Time series: split by time; fit transforms only on past data; avoid shuffling.
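
For the time-series pattern, one option is scikit-learn's `TimeSeriesSplit`, which the notes above do not name; the sketch assumes ten rows already sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations, assumed to be in chronological order.
X_time = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_time))
for train_idx, test_idx in splits:
    # Every training index precedes every test index: no look-ahead.
    print("train:", train_idx, "test:", test_idx)
```

Any scaler or imputer would then be fit on the `train_idx` rows of each split only, for example by passing a Pipeline and `cv=tscv` to `cross_val_score`.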

Pitfalls and fixes
- Leakage: always fit transforms inside the CV loop via Pipeline.
- Unseen categories: OneHotEncoder(handle_unknown="ignore").
- Mismatched columns/order: use ColumnTransformer; avoid manually concatenating arrays that were encoded separately for train and test.
- Rare categories: group into "Other" before OHE to reduce sparsity.
- Skewed numerics: apply a log, Box-Cox (strictly positive values only), or Yeo-Johnson transform before scaling; PowerTransformer covers the last two.
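
Two of these fixes can be sketched in a few lines. The data, the rarity threshold of 2, and the variable names are all assumptions made for illustration; in practice the list of rare categories should be derived from the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Rare categories: replace levels seen fewer than 2 times with "Other".
cities = pd.Series(["Paris", "Paris", "Paris", "Madrid", "Madrid", "Oslo", "Bern"])
counts = cities.value_counts()
rare = counts[counts < 2].index
cities_grouped = cities.where(~cities.isin(rare), "Other")
print(cities_grouped.value_counts())

# Skewed numeric feature: Yeo-Johnson also accepts zero and negative values.
income = np.array([[20.0], [25.0], [30.0], [35.0], [1000.0]])
pt = PowerTransformer(method="yeo-johnson")  # standardizes the output by default
income_t = pt.fit_transform(income)
print(income_t.ravel())
```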

Quick checklist
- Choose the imputation strategy by feature type and outlier presence (median for numeric is a safe default).
- OHE nominal categoricals; Ordinal encode truly ordered ones.
- Scale features for distance-based and margin-based linear models.
- Wrap everything in ColumnTransformer + Pipeline; validate with CV.


# Data Preprocessing Tools

## Importing the libraries

In [0]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [0]:
dataset = pd.read_csv('Data.csv')
# Features: every column except the last; target: the last column.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [0]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [0]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

In [0]:
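from sklearn.impute import SimpleImputer
# Replace NaNs in the numeric columns (indices 1 and 2) with the column mean.
# The imputer is fit on the full dataset here for simplicity; to avoid leakage,
# fit it on the training split only (for example inside a Pipeline).
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])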

In [0]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
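
The two filled-in values can be checked by hand: each is the mean of the nine observed entries in its column. A small sketch, with the numbers copied from the printed `X` above:

```python
import numpy as np

# Columns 1 and 2 of X as printed before imputation.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# nanmean ignores the missing entry, like SimpleImputer(strategy='mean').
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9
print(age_mean, salary_mean)
```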


## Encoding categorical data

### Encoding the Independent Variable

In [0]:
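from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (the country names); pass the other columns through unchanged.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# The encoded columns come first in the output, followed by the passthrough columns.
X = np.array(ct.fit_transform(X))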

In [0]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [0]:
from sklearn.preprocessing import LabelEncoder
# Map the string labels to integers in sorted order: 'No' -> 0, 'Yes' -> 1.
le = LabelEncoder()
y = le.fit_transform(y)

In [0]:
print(y)

[0 1 0 0 1 1 0 1 0 1]


## Splitting the dataset into the Training set and Test set

In [0]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [0]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [0]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [0]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [0]:
print(y_test)

[0 1]


## Feature Scaling

In [0]:
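from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Scale only the numeric columns (index 3 onward); the one-hot columns stay as 0/1.
# Fit on the training set, then reuse the same mean and std on the test set.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])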

In [0]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [0]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
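
The scaled test values can be reproduced by hand, which confirms that the test set is standardized with the training mean and standard deviation. A minimal sketch for the age column only, with the values copied from the splits printed earlier:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column of the training and test splits before scaling.
age_train = np.array([349 / 9, 40.0, 44.0, 38.0, 27.0, 48.0, 50.0, 35.0]).reshape(-1, 1)
age_test = np.array([30.0, 37.0]).reshape(-1, 1)

scaler = StandardScaler().fit(age_train)

# Manual standardization: z = (x - mean_train) / std_train.
mu = age_train.mean()
sigma = age_train.std()  # population std (ddof=0), the convention StandardScaler uses
manual_test = (age_test - mu) / sigma

print(scaler.transform(age_test).ravel())
print(manual_test.ravel())
```

Both lines print the same two numbers that appear in the age column of the scaled `X_test` above.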
