# 21. Pipelines

**Purpose:** Learn and revise **Pipelines** in scikit-learn.

---

## Why Pipelines?

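Pipelines bundle **preprocessing + model** into a single estimator, so training and inference apply exactly the same steps. Each transformer is fit only on the data passed to `fit`, so statistics such as scaling means are never computed from test rows, which avoids data leakage.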
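
The leakage point can be checked directly. The sketch below is illustrative only: the synthetic data and the step names `scale` and `clf` are assumptions, not part of the examples in this notebook. It fits a small pipeline on a training split and confirms that the scaler's stored means come from the training rows rather than the full dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 20.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X_train, y_train)

# The scaler's statistics were learned from the training rows only.
scaler = pipe.named_steps["scale"]
fitted_on_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))
print(fitted_on_train_only, matches_full_data)

# score() reuses the stored statistics; nothing is refit on X_test.
test_accuracy = pipe.score(X_test, y_test)
```

Scaling `X` before splitting would instead fold test-set information into the means, which is the leak the pipeline avoids.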

## Concepts to Remember

| Concept | Description |
|--------|-------------|
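| **Pipeline** | Chains transformers and a final estimator into one object. |
| **ColumnTransformer** | Applies different transformers to different columns (e.g. numeric vs. categorical). |
| **OneHotEncoder** | Encodes categorical features as one-hot columns; `handle_unknown="ignore"` avoids errors on unseen categories. |
| **GridSearchCV** | Tunes the parameters of pipeline steps with cross-validation. |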


In [1]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


In [None]:
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=n),
        "income": rng.normal(50000, 15000, size=n).round(0),
        "segment": rng.choice(["A", "B", "C"], size=n, p=[0.4, 0.4, 0.2]),
    }
)

# Label is 1 when the customer is over 40 or belongs to segment "C".
y = ((df["age"] > 40) | (df["segment"] == "C")).astype(int)

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, stratify=y, random_state=42)


In [None]:
numeric_features = ["age", "income"]
categorical_features = ["segment"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

model = Pipeline(
    [
        ("prep", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

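# fit() learns the scaler statistics and encoder categories from X_train only.
model.fit(X_train, y_train)
# score() applies those fitted transforms to X_test before predicting.
acc = model.score(X_test, y_test)
print(f"Pipeline accuracy: {acc:.3f}")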


Pipeline accuracy: 0.963


## Key Takeaways

- **Pipeline** keeps preprocessing and modeling consistent across train/test/inference.
- **ColumnTransformer** lets you mix numeric scaling with categorical encoding cleanly.
- Pipelines integrate directly with **GridSearchCV** for end-to-end tuning.
- Use **handle_unknown='ignore'** so that categories unseen during fitting are encoded as all zeros at inference instead of raising an error.
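
The GridSearchCV point can be sketched as follows. The data mirrors the classification example in this notebook; the parameter grid itself (values of `C`, toggling `with_mean`) is an assumption chosen for illustration, not a recommended search space. Parameters are addressed as `<step name>__<parameter>`, and nested steps chain with further `__`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same synthetic data as the classification example.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=n),
        "income": rng.normal(50000, 15000, size=n).round(0),
        "segment": rng.choice(["A", "B", "C"], size=n, p=[0.4, 0.4, 0.2]),
    }
)
y = ((df["age"] > 40) | (df["segment"] == "C")).astype(int)

pipe = Pipeline(
    [
        (
            "prep",
            ColumnTransformer(
                [
                    ("num", StandardScaler(), ["age", "income"]),
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
                ]
            ),
        ),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Illustrative grid: one model parameter and one preprocessing parameter.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "prep__num__with_mean": [True, False],
}

# Each CV fold refits the whole pipeline, so preprocessing never sees held-out rows.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(df, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

After fitting, `search.best_estimator_` is a complete pipeline refit on all the data passed to `fit`, so it can be used for prediction directly.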


## Regression Pipeline Example

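Mix numeric + categorical features in a **regression** pipeline. The preprocessing is the same as before; only the final estimator (`Ridge`) and the metric (mean absolute error) change.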

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(7)
n = 500
df_reg = pd.DataFrame(
    {
        "size_sqft": rng.normal(1500, 350, size=n).round(0),
        "bedrooms": rng.integers(1, 5, size=n),
        "neighborhood": rng.choice(["North", "East", "South", "West"], size=n),
    }
)

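# Price is a linear function of the features plus Gaussian noise.
neighborhood_offsets = {"North": 20000, "East": 10000, "South": -5000, "West": 5000}
price = (
    120 * df_reg["size_sqft"]
    + 5000 * df_reg["bedrooms"]
    + df_reg["neighborhood"].map(neighborhood_offsets)
    + rng.normal(0, 15000, size=n)
)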

X_train, X_test, y_train, y_test = train_test_split(df_reg, price, test_size=0.2, random_state=42)

num_features = ["size_sqft", "bedrooms"]
cat_features = ["neighborhood"]

preprocess_reg = ColumnTransformer(
    [
        ("num", StandardScaler(), num_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_features),
    ]
)

reg_model = Pipeline(
    [
        ("prep", preprocess_reg),
        ("reg", Ridge(alpha=1.0)),
    ]
)

reg_model.fit(X_train, y_train)
preds = reg_model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
print(f"Regression MAE: {mae:,.0f}")


Regression MAE: 13,643
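
A single train/test split gives only one MAE estimate. As a hedged sketch, the regression pipeline can also be cross-validated as a whole, which refits the scaler and encoder inside every fold. The data generation repeats the cell above. The choice of 5 folds is an assumption for illustration, and the per-fold values will differ from the single-split number printed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same synthetic housing data as the regression example.
rng = np.random.default_rng(7)
n = 500
df_reg = pd.DataFrame(
    {
        "size_sqft": rng.normal(1500, 350, size=n).round(0),
        "bedrooms": rng.integers(1, 5, size=n),
        "neighborhood": rng.choice(["North", "East", "South", "West"], size=n),
    }
)
offsets = {"North": 20000, "East": 10000, "South": -5000, "West": 5000}
price = (
    120 * df_reg["size_sqft"]
    + 5000 * df_reg["bedrooms"]
    + df_reg["neighborhood"].map(offsets)
    + rng.normal(0, 15000, size=n)
)

reg_model = Pipeline(
    [
        (
            "prep",
            ColumnTransformer(
                [
                    ("num", StandardScaler(), ["size_sqft", "bedrooms"]),
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
                ]
            ),
        ),
        ("reg", Ridge(alpha=1.0)),
    ]
)

# scikit-learn maximizes scores, so MAE is reported negated; flip the sign back.
scores = cross_val_score(
    reg_model, df_reg, price, cv=5, scoring="neg_mean_absolute_error"
)
fold_mae = -scores
print("MAE per fold:", np.round(fold_mae))
print(f"Mean MAE: {fold_mae.mean():,.0f}")
```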
