# Applied Machine Learning: Pipelines in scikit-learn

## 1. Why Use Pipelines?

In real-world machine learning, we rarely train a model on raw data directly.  
We often need to:
- Handle missing values
- Transform numeric variables (scaling, log transform, etc.)
- Encode categorical variables
- Balance imbalanced classes
- Train a model

If we do these steps **separately**, we run into two main problems:

### 1.1 Data Leakage

**Data leakage** happens when information from outside the training dataset is accidentally used to create the model.  
This can lead to overly optimistic results during training but poor performance on new data.

Example:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import pandas as pd

# Example data
df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6],
    "target": [0, 0, 0, 1, 1, 1]
})

X = df[["feature"]]
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# BAD: Fitting the scaler on ALL data
scaler = StandardScaler()
scaler.fit(X)   # <-- Leakage: test data influences scaling
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Here, the scaling parameters (mean and std) are computed using **both training and test data**, which "leaks" information from the test set into the training process.

Pipelines solve this by **fitting transformations only on the training set**, and then applying the same transformation to the test set.
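
For comparison, here is a minimal sketch of the leak-free manual version of the same example. It reuses the toy data from above; the only change is that the scaler is fit on `X_train` instead of `X`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5, 6],
    "target": [0, 0, 0, 1, 1, 1],
})
X = df[["feature"]]
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# GOOD: fit the scaler on the training split only
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training mean and std

# The learned mean comes from the training rows alone
print(scaler.mean_[0], X_train["feature"].mean())
```

This is the bookkeeping a pipeline does for you when you call `fit` on the training set.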

---

### 1.2 Maintainability and Reuse

If preprocessing, training, and prediction are written by hand as separate, loosely connected steps, it's easy to:

* Forget steps when making predictions on new data
* Apply steps in the wrong order
* Duplicate code in multiple places

Pipelines make the sequence of steps explicit and reusable.

---

## 2. The Basics of Pipelines

A pipeline is a sequence of steps, each of which is either:

* A **transformer** (preprocessing, imputation, scaling, encoding, etc.)
* A **final estimator** (classifier, regressor, etc.)

Example: scaling → logistic regression

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

**Key points:**

* The steps are applied in order.
* All steps except the last must be transformers (with `.fit` and `.transform` methods).
* The last step is the model.
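
The key points can be checked on a fitted pipeline. The sketch below uses a tiny made-up dataset (`X_demo`, `y_demo` are illustrative names, not part of the examples above) and looks at the step order and at one fitted step by name.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny illustrative dataset (made up for this sketch)
X_demo = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y_demo = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])
pipe.fit(X_demo, y_demo)

# Steps keep the order in which they were declared
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Each fitted step can be looked up by its name
fitted_scaler = pipe.named_steps["scaler"]
print(fitted_scaler.mean_)  # mean learned from X_demo during pipe.fit
```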

---


## 3. Why Pipelines Prevent Leakage

When you call `.fit(X_train, y_train)`:
1. Each transformer step is **fit only on `X_train`**.
2. The transformed data is passed to the next step.
3. The final estimator is fit on the fully transformed training data.

When you call `.predict(X_test)`:
1. Transformers use the parameters learned from the training set.
2. The transformed test data is passed to the final estimator.
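
To make these steps concrete, the following sketch compares a pipeline with a hand-written equivalent. The arrays are made up for illustration; the point is only that both routes should give the same predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for this sketch
X_train = np.array([[1.0], [2.0], [5.0], [6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[3.0], [4.0]])

# Pipeline version: fit and predict in one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipe_preds = pipe.predict(X_test)

# Manual version of the same steps
scaler = StandardScaler().fit(X_train)                 # fit on training data only
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_preds = model.predict(scaler.transform(X_test)) # reuse training parameters

print(np.array_equal(pipe_preds, manual_preds))
```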

---

### Visualizing the Flow
_Note: View this on GitHub for the diagram to render properly._

```mermaid
flowchart TB
    A[Raw Data] --> B[Step 1: Transformer 1 <br/> #0040; e.g., Imputer #0041;]
    B --> C[Step 2: Transformer 2 <br/> #0040; e.g., Scaler or Encoder #0041;]
    C --> D[Final Estimator <br/> #0040; e.g., Random Forest #0041;]
    D --> E[Predictions]
```

**How to read this:**

* Each transformation step takes the output from the previous step.
* Transformers are *fit* only on the training set.
* The final estimator produces predictions based on the transformed data.
* At prediction time, the same fitted transformers are applied in the same order to new data.

---

**Why this matters for leakage:**

* If we fit a scaler or encoder *before* splitting the data, we "peek" at the test set.
* Pipelines make sure that each transformation is **fit only on training folds** during cross-validation and only on the training set during final training.
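
As a rough illustration of the cross-validation case, the sketch below passes a whole pipeline to `cross_val_score`. The data is synthetic and exists only for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for this sketch
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])

# For each of the 5 splits, a fresh copy of the pipeline is fit on the
# training folds (scaler included) and scored on the held-out fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because the scaler is part of the estimator being cross-validated, it never sees the held-out fold while fitting.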

---

## 4. Preprocessing Multiple Column Types

Real datasets have:

* Numeric columns (need scaling, log transform, imputation)
* Categorical columns (need one-hot encoding, imputation)

We can process them separately and then combine them with **`ColumnTransformer`**.

---

### 4.1 Introduction to `ColumnTransformer`

`ColumnTransformer` applies different preprocessing to different columns in one step.

Example:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Column names are illustrative: 'color' stands in for any categorical
# column and is not part of the one-column frame from Section 1.1
numeric_features = ['feature']
categorical_features = ['color']

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
```

---

### Visualizing ColumnTransformer Flow
_Note: View this on GitHub for the diagram to render properly._

```mermaid
flowchart TB
    A[Raw Data] --> B[Numeric Columns]
    A --> C[Categorical Columns]

    B --> B1[Numeric Imputer]
    B1 --> B2[Scaler]

    C --> C1[Categorical Imputer]
    C1 --> C2[One-Hot Encoder]

    B2 --> D[Combined Features]
    C2 --> D

    D --> E[Model]
```

**How to read this:**

* The data is split into **numeric** and **categorical** subsets.
* Each subset goes through its own sequence of transformations.
* The processed numeric and categorical features are recombined into a single dataset.
* The combined dataset is fed into the model.

---

This mental model helps explain:

* Why preprocessing can happen in parallel for different column types.
* How missing data and scaling/encoding are handled independently for each type.
* That the order **within each branch** still matters, but branches don’t affect each other.
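
The two branches can be seen in action on a small frame. This is a sketch on hypothetical toy data (the `toy` frame and its values are invented here), showing that the branch outputs are stacked side by side.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "feature": [1.0, 2.0, np.nan, 4.0],
    "color": ["red", "blue", np.nan, "red"],
})

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["feature"]),
    ("cat", categorical_transformer, ["color"]),
])

Xt = preprocessor.fit_transform(toy)

# 4 rows; 1 scaled numeric column plus one one-hot column per color seen
print(Xt.shape)
```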


---

## 5. Combining `ColumnTransformer` with a Model

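```python
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])

# X_train must contain every column named in the preprocessor
# (here 'feature' and 'color'); the one-column frame from Section 1.1 would not work
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```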

This structure:

* Cleans and transforms each column type
* Passes processed data to the model
* Keeps everything together in one object

---

## 6. Imputation and Pipelines

Missing values are common.
We can impute (fill in) missing data **inside** the pipeline so it’s done correctly for both training and new data.

Types of imputation:

* Mean/median for numeric
* Most frequent for categorical
* Constant value

Pipelines ensure imputation is **fit only on training data**.
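
A minimal sketch of what "fit only on training data" means for an imputer, using made-up arrays: the fill value is learned from the training rows and then reused unchanged on new rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up arrays for this sketch
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# Mean of the observed training values: (1 + 3 + 5) / 3 = 3.0
print(imputer.statistics_)

# The missing entry in X_new is filled with the training mean,
# not with anything computed from X_new itself
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```

Inside a pipeline the same fitted state is stored on the imputer step and applied at prediction time.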

---

## 7. Handling Imbalanced Data with `imblearn` Pipelines

When classes are imbalanced (e.g., 95% class A, 5% class B), we may want to use **oversampling** or **undersampling**.

The `imblearn.pipeline.Pipeline` works like sklearn’s Pipeline but supports samplers like SMOTE.

Example:

```python
# Alias the import so it does not shadow sklearn's Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE

imb_pipe = ImbPipeline([
    ('preprocess', preprocessor),
    ('smote', SMOTE()),
    ('model', RandomForestClassifier())
])

imb_pipe.fit(X_train, y_train)
```

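**Important:** Sampling must be inside the pipeline to avoid leakage. The sampler runs only during `fit`, so during cross-validation it resamples the training folds and leaves validation and test data untouched.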

---

## 8. Grid Search with Pipelines

We can tune hyperparameters without worrying about leakage:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 5]
}

grid = GridSearchCV(clf, param_grid, cv=5)
grid.fit(X_train, y_train)
```

The CV process will:

* Fit the preprocessors on the training folds only
* Apply them to validation folds without refitting

---

## 9. Summary

* **Pipelines** combine preprocessing and modeling steps into one object.
* They **prevent data leakage** by fitting transformations only on training data.
* **ColumnTransformer** lets you preprocess numeric and categorical features differently.
* **imblearn Pipelines** allow resampling inside the pipeline.
* Pipelines make your code **cleaner, safer, and reusable**.

---

### Practice Ideas

1. Build a pipeline for a dataset with both numeric and categorical columns.
2. Add missing values and handle them with imputation in the pipeline.
3. Try an imbalanced dataset and use SMOTE in the pipeline.
4. Tune hyperparameters using GridSearchCV on the pipeline.

---

## 10. End-to-End Example: From Messy CSV to Predictions

Let's put it all together with a small, realistic dataset.

### Step 1: Load the Data

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Example messy dataset; np.nan marks missing values, which is what
# SimpleImputer looks for by default (missing_values=np.nan)
data = pd.DataFrame({
    "age": [25, 30, np.nan, 45, 22, np.nan, 37, 29],
    "income": [50000, 60000, 55000, np.nan, 42000, 52000, np.nan, 58000],
    "job_type": ["office", "manual", "manual", np.nan, "office", "office", "manual", "manual"],
    "owns_car": ["yes", "no", "yes", "yes", "no", "no", np.nan, "yes"],
    "bought_insurance": [0, 1, 0, 1, 0, 0, 1, 1]
})

X = data.drop("bought_insurance", axis=1)
y = data["bought_insurance"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```

We have:

* Missing values in both numeric and categorical features
* A mix of numeric and categorical columns

---

### Step 2: Define Preprocessing

We’ll use:

* **Mean imputation + scaling** for numeric features
* **Most frequent imputation + one-hot encoding** for categorical features

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

numeric_features = ["age", "income"]
categorical_features = ["job_type", "owns_car"]

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
```

---

### Step 3: Create the Full Pipeline

We'll use a **Random Forest** as the model.

```python
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])
```

---

### Step 4: Train and Evaluate

```python
from sklearn.metrics import classification_report

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

print(classification_report(y_test, predictions))
```

At this point:

* Missing values were handled **inside** the pipeline
* Scaling and encoding were done correctly without leakage
* The model received clean, numeric input automatically

---

### Step 5: Hyperparameter Tuning with GridSearchCV

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'model__n_estimators': [50, 100],
    'model__max_depth': [None, 5, 10]
}

grid = GridSearchCV(clf, param_grid, cv=3)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best score:", grid.best_score_)
```

**Note:** The parameters are referenced using the syntax `stepname__parameter`.
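
The same double-underscore syntax reaches into nested steps. The sketch below rebuilds a reduced version of the pipeline (numeric branch only, to keep it short) and uses `get_params()` to list the available names; the extended `param_grid` at the end is an illustrative assumption, not a tuned recommendation.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["age", "income"]),
])
clf = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(random_state=42)),
])

# get_params() lists every tunable name, including nested ones
param_names = clf.get_params().keys()
print("model__n_estimators" in param_names)
print("preprocess__num__imputer__strategy" in param_names)

# Illustrative grid that tunes a preprocessing choice alongside the model
param_grid = {
    "preprocess__num__imputer__strategy": ["mean", "median"],
    "model__max_depth": [None, 5],
}
```

Each segment of the name walks one level down: pipeline step, then `ColumnTransformer` entry, then inner pipeline step, then the parameter itself.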

---

### Step 6: Predict on New Data

```python
import numpy as np

# np.nan marks the missing entries, matching SimpleImputer's default
new_customers = pd.DataFrame({
    "age": [40, np.nan],
    "income": [62000, 48000],
    "job_type": ["manual", np.nan],
    "owns_car": ["no", "yes"]
})

preds = grid.predict(new_customers)
print("Predictions for new customers:", preds)
```

Even though the new data has missing values, the pipeline handles them automatically, with no extra preprocessing code needed.

---

## Key Takeaways from the Example

* All preprocessing steps are **fit on training data only** — no leakage.
* The pipeline can be applied directly to new, messy data.
* Different feature types can have **different preprocessing** in the same pipeline.
* Hyperparameter tuning with GridSearchCV works naturally with pipelines.