
# 🧱 Sklearn Feature Engineering Pipeline + Grid Search (Ames Housing)

This notebook builds a full **feature-engineering pipeline** for the **Ames Housing** dataset using **scikit-learn**.
It includes:

- Custom outlier cleaning that **calls the provided** `clean_outliers(df_in, method="cap", k=1.5)`
- Missing-value handling for **numeric** and **categorical** columns
- Encoding and optional scaling
- **`Pipeline` + `ColumnTransformer`** integration
- **`GridSearchCV`** over outlier parameters, imputation, scaling, and model hyperparameters
- Cross-validated RMSE plus hold-out test RMSE and \(R^2\)

> **Note on `method="remove"`**: The original `clean_outliers` function **drops rows** when `method="remove"`.
> Standard scikit-learn `Pipeline` objects expect transformers to **preserve the number of samples** (so that `y` stays aligned with `X`).
> To keep everything pipeline-safe, the wrapper only uses **`cap`** and **`median`** during grid search.
> If you request `"remove"`, the wrapper **maps it to `"median"`** internally and emits a warning.


In [None]:

# --- Imports ---
import warnings
from typing import Optional

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.compose import make_column_selector as selector

RANDOM_STATE = 42


## Provided function: `clean_outliers`

In [None]:

# --- Provided function (as requested) ---
def clean_outliers(df_in: pd.DataFrame, method: str = "cap", k: float = 1.5):
    df_clean = df_in.copy()
    for col in df_clean.select_dtypes(include="number").columns:
        s = df_clean[col]
        if s.notna().sum() == 0:
            continue
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        low, up = q1 - k * iqr, q3 + k * iqr
        if method == "cap":
            df_clean[col] = s.clip(lower=low, upper=up)
        elif method == "median":
            mask = (s < low) | (s > up)
            df_clean.loc[mask, col] = s.median()
        elif method == "remove":
            # WARNING: This drops rows -> not safe inside a sklearn Pipeline that expects X,y alignment.
            mask = (s < low) | (s > up)
            df_clean = df_clean.loc[~mask]
    return df_clean
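
As a quick check of the arithmetic, the sketch below applies the same IQR fences by hand to a single toy column. The values and variable names are made up for illustration and are not part of the Ames data; only pandas is assumed.

```python
import pandas as pd

# Toy column (made-up values) with one obvious outlier.
s = pd.Series([10, 11, 12, 12, 13, 100], name="toy")

k = 1.5
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
low, up = q1 - k * iqr, q3 + k * iqr

outlier_mask = (s < low) | (s > up)

capped = s.clip(lower=low, upper=up)         # "cap": pull outliers back to the fence
medianed = s.mask(outlier_mask, s.median())  # "median": replace outliers with the median
removed = s.loc[~outlier_mask]               # "remove": drop the outlier rows

print("fences:", low, up)
print("lengths:", len(capped), len(medianed), len(removed))
```

Only the `"remove"` variant changes the length of the column, which is the reason it cannot be used as-is inside a `Pipeline`.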



## Pipeline-safe wrapper: `OutlierCleaner`

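This wrapper **calls `clean_outliers`** but **never changes** the number of rows, so the pipeline stays valid.
Its `fit` learns nothing: the IQR bounds are recomputed from whatever data reaches `transform`, so at prediction time they come from the data being predicted on rather than from the training set.
- Supports `method in {"cap", "median"}` directly.
- If `method == "remove"`, it **falls back to `"median"` and warns** (to keep the sample count fixed).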


In [None]:

class OutlierCleaner(BaseEstimator, TransformerMixin):
    def __init__(self, method: str = "cap", k: float = 1.5, numeric_only: bool = True):
        self.method = method
        self.k = k
        self.numeric_only = numeric_only

    def fit(self, X, y=None):
        # nothing to learn
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            # Make sure downstream transformers (like ColumnTransformer) still get a DataFrame
            X = pd.DataFrame(X)

        method = self.method
        if method not in {"cap", "median", "remove"}:
            raise ValueError(f"Unsupported method: {method}. Use 'cap', 'median', or 'remove'.")

        # If 'remove' is requested, map to 'median' to avoid changing n_samples
        if method == "remove":
            warnings.warn("OutlierCleaner: 'remove' would drop samples; mapping to 'median' for pipeline safety.")
            method = "median"

        # Optionally restrict to numeric columns only (recommended)
        if self.numeric_only:
            num_cols = X.select_dtypes(include="number").columns
            X_num = X[num_cols]
            X_num_clean = clean_outliers(X_num, method=method, k=self.k)
            X_clean = X.copy()
            X_clean[num_cols] = X_num_clean[num_cols]
            return X_clean
        else:
            return clean_outliers(X, method=method, k=self.k)
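
One possible alternative, sketched here under the assumption that only capping is needed, is to learn the IQR fences once in `fit` and reuse them in `transform`. The class name `FittedOutlierCapper` and the attributes `lower_` / `upper_` are hypothetical and not part of the original notebook; the sketch does not call `clean_outliers`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedOutlierCapper(BaseEstimator, TransformerMixin):
    """Hypothetical variant: learn IQR fences in fit(), clip with them in transform()."""

    def __init__(self, k: float = 1.5):
        self.k = k

    def fit(self, X, y=None):
        num = pd.DataFrame(X).select_dtypes(include="number")
        q1 = num.quantile(0.25)
        q3 = num.quantile(0.75)
        iqr = q3 - q1
        # One lower/upper fence per numeric column, learned from the training data only.
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        cols = self.lower_.index
        # axis=1 aligns the per-column fences with the DataFrame columns.
        X[cols] = X[cols].clip(lower=self.lower_, upper=self.upper_, axis=1)
        return X


# Toy data (made-up values): fences come from `train`, then get applied to `test`.
train = pd.DataFrame({"a": [10, 11, 12, 12, 13, 100], "c": list("uvwxyz")})
test = pd.DataFrame({"a": [0, 12, 500], "c": ["p", "q", "r"]})

capper = FittedOutlierCapper(k=1.5).fit(train)
out = capper.transform(test)
print(out)
```

With this design the test rows are clipped to the training fences, and the row count and non-numeric columns are left untouched.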


## Load data & define columns

In [None]:

# --- Load the CSV (path is relative to the notebook's working directory) ---
csv_path = "Ames_Housing_Data.csv"
df       = pd.read_csv(csv_path)

# Identify numeric and textual columns:
numeric_columns = df.select_dtypes(include=["number"]).columns.tolist()
text_columns    = df.select_dtypes(include=["object"]).columns.tolist()

# Target
TARGET = "SalePrice"
assert TARGET in df.columns, f"{TARGET} not found in DataFrame!"

# Separate features/target
X = df.drop(columns=[TARGET])
y = df[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

X_train.head()



## Preprocessing blocks

- **OutlierCleaner** (custom) → numeric columns only
- **Numeric pipeline** → imputer (`mean`/`median`) + optional scaler
- **Categorical pipeline** → imputer (`most_frequent`/`constant`) + one-hot encoding
- Combined via **ColumnTransformer**


In [None]:

# Numeric pipeline: impute -> (optional) scale
numeric_pre = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # grid will try mean/median
    ("scaler",  StandardScaler(with_mean=True, with_std=True))  # can be toggled via grid
])

# Categorical pipeline: impute -> one-hot
categorical_pre = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent", fill_value="missing")),
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# Full preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pre, selector(dtype_include=np.number)),
        ("cat", categorical_pre, selector(dtype_include=object)),
    ],
    remainder="drop",
    verbose_feature_names_out=True
)

# Final pipeline: OutlierCleaner -> Preprocessor -> Model
pipe = Pipeline(steps=[
    ("outliers", OutlierCleaner(method="cap", k=1.5, numeric_only=True)),
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(random_state=RANDOM_STATE))
])
pipe



## Grid Search space

We search across:
- Outlier cleaning: `method ∈ {cap, median}`, `k ∈ {1.0, 1.5, 2.0, 3.0}`
- Numeric imputer: `mean` vs `median`
- Scaling: **enabled** vs **disabled** (by swapping scaler with `passthrough`)
- Categorical imputer: `most_frequent` vs `constant`
- Model family & hyperparameters:
  - **RandomForestRegressor** (n_estimators, max_depth, max_features)
  - **Ridge** (alpha)
  
> Tip: You can expand or reduce the grid to fit your compute budget.
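
Before committing to a run, it can help to count how many candidates a grid contains. The sketch below uses `ParameterGrid` on a stand-in grid with the same number of options per parameter as the `param_grid` cell in this notebook. The short keys and string values are placeholders, since only the option counts matter for the total.

```python
from sklearn.model_selection import ParameterGrid

# Options shared by both model branches (placeholder keys and values).
shared = {
    "outlier_method": ["cap", "median"],
    "outlier_k": [1.0, 1.5, 2.0, 3.0],
    "num_imputer": ["mean", "median"],
    "scaler": ["standard", "passthrough"],
    "cat_imputer": ["most_frequent", "constant"],
}

rf_branch = {
    **shared,
    "n_estimators": [300, 600],
    "max_depth": [None, 12, 20],
    "max_features": ["sqrt", "log2", 0.6, 1.0],
}
ridge_branch = {**shared, "alpha": [0.1, 1.0, 10.0, 100.0]}

# A list of dicts is treated as a union of independent grids.
n_candidates = len(ParameterGrid([rf_branch, ridge_branch]))
n_fits = n_candidates * 5  # 5-fold CV, matching cv=5 in the GridSearchCV call

print("candidates:", n_candidates)
print("fits:", n_fits)
```

Dropping a single option from a shared parameter shrinks both branches at once, so those are usually the cheapest places to trim.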


In [None]:

# Return DataFrames from transformers so intermediate outputs keep their column names
from sklearn import set_config
set_config(transform_output="pandas")

param_grid = [
    # --- RandomForest branch ---
    {
        "outliers__method": ["cap", "median"],
        "outliers__k": [1.0, 1.5, 2.0, 3.0],

        "preprocess__num__imputer__strategy": ["mean", "median"],
        # Toggle scaler: either actual StandardScaler or no scaling
        "preprocess__num__scaler": [StandardScaler(with_mean=True, with_std=True), "passthrough"],

        "preprocess__cat__imputer__strategy": ["most_frequent", "constant"],
        "preprocess__cat__imputer__fill_value": ["missing"],  # used when strategy='constant'

        "model": [RandomForestRegressor(random_state=RANDOM_STATE)],
        "model__n_estimators": [300, 600],
        "model__max_depth": [None, 12, 20],
        "model__max_features": ["sqrt", "log2", 0.6, 1.0],
    },
    # --- Ridge branch (linear baseline) ---
    {
        "outliers__method": ["cap", "median"],
        "outliers__k": [1.0, 1.5, 2.0, 3.0],

        "preprocess__num__imputer__strategy": ["mean", "median"],
        "preprocess__num__scaler": [StandardScaler(with_mean=True, with_std=True), "passthrough"],

        "preprocess__cat__imputer__strategy": ["most_frequent", "constant"],
        "preprocess__cat__imputer__fill_value": ["missing"],

        "model": [Ridge(random_state=RANDOM_STATE)],  # Ridge accepts random_state, so no hasattr check is needed
        "model__alpha": [0.1, 1.0, 10.0, 100.0],
    },
]

search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1,
    return_train_score=True,
    verbose=1
)
search



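> **Optional:** The full grid has 1,792 candidates (1,536 random forest and 256 ridge configurations), i.e. 8,960 fits with 5-fold CV, so it can take a long time. For a quick smoke test, reduce the grid sizes before running.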
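
One way to run a smoke test without hand-editing the grid is to sample a few candidates with `RandomizedSearchCV`. The sketch below is a minimal illustration on synthetic data from `make_regression`, with a deliberately small two-step pipeline. The names `demo_pipe`, `demo_space`, and `quick_search` are hypothetical and stand in for the notebook's `pipe`, `param_grid`, and `search`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the example runs in a few seconds.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

demo_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge())])

# 2 x 4 = 8 possible candidates; only n_iter of them are evaluated.
demo_space = {
    "scaler": [StandardScaler(), "passthrough"],
    "model__alpha": [0.1, 1.0, 10.0, 100.0],
}

quick_search = RandomizedSearchCV(
    estimator=demo_pipe,
    param_distributions=demo_space,
    n_iter=4,
    scoring="neg_root_mean_squared_error",
    cv=3,
    random_state=0,
)
quick_search.fit(X_demo, y_demo)

print("candidates evaluated:", len(quick_search.cv_results_["params"]))
print("best params:", quick_search.best_params_)
```

`param_distributions` also accepts a list of dicts, so a two-branch grid like the one above should be usable with only `n_iter` added.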


## Fit Grid Search & Evaluate on Holdout Test

In [None]:

# --- Run the search (may take several minutes depending on CPU/RAM) ---
search.fit(X_train, y_train)

print("Best Params:")
print(search.best_params_)
print("\nCV best score (neg RMSE):", search.best_score_)

# --- Evaluate on test set ---
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)

# squared=False was deprecated in scikit-learn 1.4 and later removed; np.sqrt works on all versions
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print("\nTest RMSE:", rmse)
print("Test R^2:", r2)


## Inspect engineered feature names

In [None]:

# After fitting, you can inspect the output feature names from the preprocessor
final_pre = best_model.named_steps["preprocess"]
feature_names = []
# numeric
num_features = final_pre.named_transformers_["num"].get_feature_names_out()
feature_names.extend(num_features.tolist())
# categorical
cat_features = final_pre.named_transformers_["cat"].named_steps["ohe"].get_feature_names_out()
feature_names.extend(cat_features.tolist())

print(f"Total engineered features: {len(feature_names)}")
pd.Series(feature_names).head(30)


## Quick report dataframe

In [None]:

results = pd.DataFrame(search.cv_results_)
cols = [
    'rank_test_score','mean_test_score','std_test_score',
    'mean_train_score','std_train_score','param_outliers__method','param_outliers__k',
    'param_preprocess__num__imputer__strategy','param_preprocess__num__scaler',
    'param_preprocess__cat__imputer__strategy','param_model'
]
display(results[cols].sort_values('rank_test_score').head(10))



## Tips to extend

- Add more models (e.g., `GradientBoostingRegressor`, `HistGradientBoostingRegressor`).
- Add a **PowerTransformer** for skewed numeric features.
- Replace the `"remove"` behavior by *masking outliers as `NaN`* and letting the imputer fill them. This preserves the sample count and stays pipeline-safe.
- Use **`HalvingGridSearchCV`** or **`RandomizedSearchCV`** for speed.
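
The `NaN`-masking idea from the list above could look roughly like the following. This is a sketch, assuming all input columns are numeric; the class name `OutlierToNaN` and its attributes are hypothetical, and the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class OutlierToNaN(BaseEstimator, TransformerMixin):
    """Hypothetical helper: replace values outside the IQR fences with NaN."""

    def __init__(self, k: float = 1.5):
        self.k = k

    def fit(self, X, y=None):
        X = pd.DataFrame(X)  # assumes numeric columns only
        q1, q3 = X.quantile(0.25), X.quantile(0.75)
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        inside = X.ge(self.lower_, axis=1) & X.le(self.upper_, axis=1)
        # Keep in-range values; everything else (including existing NaN) becomes NaN.
        return X.where(inside)


toy = pd.DataFrame({"a": [10, 11, 12, 12, 13, 100]})

demo = Pipeline(steps=[
    ("mask", OutlierToNaN(k=1.5)),
    ("imputer", SimpleImputer(strategy="median")),
])

# np.asarray keeps the result uniform whether or not pandas output is enabled.
filled = np.asarray(demo.fit_transform(toy))
print(filled.ravel())
```

The row count is unchanged, and the imputer's median is computed after the outlier has been masked, so the extreme value no longer influences the fill value.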
