# 🏡 Ames Housing — Final Pipeline & Model

**TL;DR:** This notebook contains the final scikit-learn pipeline and the production-ready model used in the project.  
It is intentionally compact: all EDA, exploratory plots, PCA, and modeling experiments live in `exploration.ipynb`. This notebook is the **deployment-oriented** artifact and focuses on the final pipeline, minimal validation, inference, and artifact export.



**Author:** Praanshull Verma  
**Date:** 2025-10-25  
**Dataset:** Ames Housing (original EDA & exploration in `exploration.ipynb`)  
**Artifacts produced:** `final_pipeline.pkl`, `feature_names.pkl`


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import joblib


In [2]:
# Load the base dataset (Ames Housing) from OpenML
from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.frame
df.head()

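|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |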


In [3]:

# Drop unhelpful or redundant columns (based on EDA decisions)
drop_cols = [
    "Id",            # unique identifier, not predictive
    "Street",        # more than 98% of rows share one value
    "Condition2",    # more than 98% of rows share one value
    "Utilities",     # more than 98% of rows share one value
    "RoofMatl",      # more than 98% of rows share one value
    "LowQualFinSF",  # more than 98% of rows share one value
    "3SsnPorch",     # more than 98% of rows share one value
    "PoolArea",      # more than 98% of rows share one value
]
df.drop(columns=drop_cols, inplace=True, errors="ignore")
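
The "more than 98%" rule in the comments above comes from the EDA notebook. A minimal sketch of how such near-constant columns could be flagged is shown below; the helper name `dominant_share` and the toy frame are hypothetical and are not taken from `exploration.ipynb`.

```python
import pandas as pd

def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Share of rows taken by the most frequent value in each column (NaN counts as a value)."""
    return frame.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )

# Toy data (hypothetical): "Street" is almost constant, "LotArea" is not.
toy = pd.DataFrame({
    "Street": ["Pave"] * 99 + ["Grvl"],
    "LotArea": list(range(100)),
})

shares = dominant_share(toy)
near_constant = shares[shares > 0.98].index.tolist()
print(near_constant)
```

Counting `NaN` as a value (`dropna=False`) is a design choice in this sketch: a column that is almost entirely missing is then flagged as near-constant too.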

In [4]:

X = df.drop("SalePrice", axis=1)
y = np.log1p(df["SalePrice"])  # log transform target


In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Apply log1p to the given columns of a DataFrame, or to every column of a NumPy array."""
    def __init__(self, columns=None):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Arrays carry no column names (e.g. after SimpleImputer), so every column is transformed
        if isinstance(X, np.ndarray):
            return np.log1p(X)
        X_copy = X.copy()
        # Fall back to all columns when none were specified
        for col in (self.columns if self.columns is not None else X_copy.columns):
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy
    def get_feature_names_out(self, input_features=None):
        return np.array(input_features)  # names are unchanged by the transform

In [7]:
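# Skewed features identified as candidates for a log transform.
# The target (SalePrice) is log-transformed separately with np.log1p, so it is not listed here.
# This list is redefined with the final subset when the preprocessor is built.
log_features = ['LotArea', 'BsmtFinSF2', 'ScreenPorch', 'EnclosedPorch',
                'MasVnrArea', 'LotFrontage', 'OpenPorchSF',
                'BsmtFinSF1', 'WoodDeckSF', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea']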


In [8]:
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Add TotalBath, TotalSF and TotalPorchSF, then drop columns ruled out in EDA."""
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self
    def get_feature_names_out(self, input_features=None):
        # Returns the input names as-is; columns added or dropped in transform() are not reflected
        return np.array(input_features)

    def transform(self, X):
        X_copy = X.copy()

        # --- Example engineered features ---
        X_copy["TotalBath"] = (
            X_copy["FullBath"].fillna(0)
            + 0.5 * X_copy["HalfBath"].fillna(0)
            + X_copy["BsmtFullBath"].fillna(0)
            + 0.5 * X_copy["BsmtHalfBath"].fillna(0)
        )
        X_copy["TotalSF"] = (
            X_copy["1stFlrSF"].fillna(0)
            + X_copy["2ndFlrSF"].fillna(0)
            + X_copy["TotalBsmtSF"].fillna(0)
        )

        X_copy["TotalPorchSF"] = (
            X_copy["OpenPorchSF"].fillna(0)
            + X_copy["EnclosedPorch"].fillna(0)
            + X_copy["ScreenPorch"].fillna(0)
            + X_copy["WoodDeckSF"].fillna(0)
        )

        # --- Drop old irrelevant features (as decided in EDA) ---
        drop_cols = ['BsmtFinSF2','BsmtHalfBath','MiscVal','KitchenAbvGr','OverallCond','YrSold']
        X_copy.drop(columns=drop_cols, inplace=True, errors="ignore")

        return X_copy
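
The arithmetic behind the engineered columns can be checked on a single made-up house. This sketch repeats the `TotalBath` and `TotalSF` formulas from `FeatureEngineer.transform` in plain pandas; the row values are invented for illustration.

```python
import pandas as pd

# One made-up house; the NaN mimics a missing basement half-bath count.
row = pd.DataFrame({
    "FullBath": [2], "HalfBath": [1],
    "BsmtFullBath": [1], "BsmtHalfBath": [float("nan")],
    "1stFlrSF": [1000], "2ndFlrSF": [800], "TotalBsmtSF": [900],
})

# Half baths count as 0.5; missing counts are treated as 0.
total_bath = (
    row["FullBath"].fillna(0)
    + 0.5 * row["HalfBath"].fillna(0)
    + row["BsmtFullBath"].fillna(0)
    + 0.5 * row["BsmtHalfBath"].fillna(0)
)

# Above-ground floors plus basement area.
total_sf = (
    row["1stFlrSF"].fillna(0)
    + row["2ndFlrSF"].fillna(0)
    + row["TotalBsmtSF"].fillna(0)
)

print(total_bath.iloc[0], total_sf.iloc[0])
```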


## Pipeline Overview

The final artifact is a scikit-learn `Pipeline` whose steps run in this order:
1. Feature engineering (custom transformer: `FeatureEngineer`)
2. Preprocessing with a `ColumnTransformer`: imputation, log transform and scaling for numeric features, ordinal and one-hot encoding for categorical features
3. Final estimator: **Gradient Boosting Regressor** (tuned)


In [9]:
numeric_features = ['LotArea','GrLivArea','BsmtFinSF1','1stFlrSF','GarageCars','TotalBsmtSF']
log_features = ['LotArea','GrLivArea','BsmtFinSF1','1stFlrSF']

ordinal_features = ['LotShape', 'LandContour', 'LandSlope', 'ExterQual', 'ExterCond', 'BsmtQual','BsmtCond','BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC','KitchenQual', 'Functional', 'FireplaceQu', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence']
onehot_features = ['MSSubClass','MSZoning','Alley','LotConfig','Neighborhood','Condition1','BldgType','HouseStyle','RoofStyle','Exterior1st','Exterior2nd','MasVnrType','Foundation','Heating','CentralAir','Electrical','GarageType','MiscFeature','SaleType','SaleCondition']

ordinal_mapping = [
    ['IR3', 'IR2', 'IR1', 'Reg'],
    ['Low', 'HLS', 'Bnk', 'Lvl'],
    ['Sev', 'Mod', 'Gtl'],
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # ExterQual
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # ExterCond
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # BsmtQual
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # BsmtCond
    ['None', 'No', 'Mn', 'Av', 'Gd'],         # BsmtExposure
    ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # BsmtFinType1
    ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # BsmtFinType2
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # HeatingQC
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # KitchenQual
    ['None', 'Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'], # Functional
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # FireplaceQu
    ['None', 'Unf', 'RFn', 'Fin'],            # GarageFinish
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # GarageQual
    ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],   # GarageCond
    ['N', 'P', 'Y'],                          # PavedDrive
    ['None', 'Fa', 'TA', 'Gd', 'Ex'],         # PoolQC
    ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv'] # Fence
]
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    # The imputer outputs a NumPy array, so LogTransformer applies log1p to all numeric_features
    ("log", LogTransformer(columns=log_features)),
    ("scaler", StandardScaler())
])

ordinal_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=ordinal_mapping))
])

onehot_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

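# remainder="drop" discards every column not named in the three feature lists above
# (the engineered Total* columns included)
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("ord", ordinal_transformer, ordinal_features),
    ("nom", onehot_transformer, onehot_features)
], remainder="drop")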
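
Each inner list of `ordinal_mapping` fixes the integer code a category receives, with position 0 as the lowest level. A small sketch with a single quality column (the input values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

quality_order = ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']  # worst to best

enc = OrdinalEncoder(categories=[quality_order])

# Object dtype keeps the strings at full length regardless of NumPy's fixed-width string types
values = np.array([['TA'], ['Ex'], ['Po'], ['Gd']], dtype=object)
codes = enc.fit_transform(values)

print(codes.ravel())  # each code is the category's index in quality_order
```

With the default `handle_unknown="error"`, a value missing from the list raises an error at transform time, so every level that can occur in the data has to appear in the mapping.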


In [10]:
pipe = Pipeline([
    ("feature_engineering", FeatureEngineer()),
    ("preprocessor", preprocessor),
    ("model", GradientBoostingRegressor(
        n_estimators=850,
        learning_rate=0.03,
        subsample=0.8,
        min_samples_leaf=9,
        max_features=0.3,
        max_depth=3,
        random_state=42
    ))
])

In [11]:
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)


## Model Evaluation — Summary

The metric used to choose this model is RMSE on log-transformed `SalePrice`. The cells below report it, together with MAE and R², on the 20% hold-out test set; no cross-validation is run in this notebook.


In [12]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Metrics on the log scale (y is log1p(SalePrice)).
# np.sqrt(MSE) is used instead of squared=False, which is deprecated in newer scikit-learn releases.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")


RMSE: 0.1505
MAE: 0.0973
R²: 0.8786




In [13]:
# Convert back from log scale to actual dollars (expm1 is the exact inverse of log1p)
y_test_exp = np.expm1(y_test)
y_pred_exp = np.expm1(y_pred)

rmse_dollar = np.sqrt(mean_squared_error(y_test_exp, y_pred_exp))
mae_dollar = mean_absolute_error(y_test_exp, y_pred_exp)

print(f"RMSE ($): {rmse_dollar:,.2f}")
print(f"MAE ($): {mae_dollar:,.2f}")


RMSE ($): 29,221.21
MAE ($): 16,638.82
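
Because the target was transformed with `np.log1p`, its exact inverse is `np.expm1`. The toy numbers below (invented, not taken from the dataset) show the round trip and how the same errors look on each scale.

```python
import numpy as np

prices = np.array([100_000.0, 200_000.0, 300_000.0])  # made-up sale prices
preds = np.array([110_000.0, 190_000.0, 330_000.0])   # made-up predictions

y_log = np.log1p(prices)
pred_log = np.log1p(preds)

# expm1 undoes log1p (up to floating-point error)
round_trip_ok = np.allclose(np.expm1(y_log), prices)

# Same errors, measured on the log scale and in dollars
rmse_log = np.sqrt(np.mean((y_log - pred_log) ** 2))
rmse_dollar = np.sqrt(np.mean((np.expm1(y_log) - np.expm1(pred_log)) ** 2))

print(f"round trip ok: {round_trip_ok}")
print(f"RMSE (log): {rmse_log:.4f}")
print(f"RMSE ($): {rmse_dollar:,.2f}")
```

As a rough reading aid, a small RMSE on the log scale behaves like a relative error, so a value near 0.15 suggests typical errors on the order of 15% of the sale price.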




## Exported Artifacts

- `final_pipeline.pkl`: full sklearn pipeline (feature engineering + preprocessing + estimator)  
- `feature_names.pkl`: ordered list of raw input columns expected by the pipeline (the columns of `X_train`)  



In [14]:
import joblib
joblib.dump(pipe, "final_pipeline.pkl")


['final_pipeline.pkl']

In [15]:
joblib.dump(X_train.columns.tolist(), "feature_names.pkl")


['feature_names.pkl']
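
Loading the artifacts for inference might look like the sketch below. To stay self-contained it first trains and saves a tiny stand-in pipeline; the toy data, the `LinearRegression` model, and the temporary directory are hypothetical. With the real `final_pipeline.pkl`, note that pickle stores custom classes by reference, so `LogTransformer` and `FeatureEngineer` must be defined or importable in the session that loads it.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Stand-in for the real artifacts: a tiny pipeline fitted on made-up data.
train = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0, 2500.0],
    "GarageCars": [1.0, 2.0, 2.0, 3.0],
})
target = np.log1p(np.array([120_000.0, 180_000.0, 230_000.0, 300_000.0]))
toy_pipe = Pipeline([("model", LinearRegression())]).fit(train, target)

tmp_dir = tempfile.mkdtemp()
pipe_path = os.path.join(tmp_dir, "final_pipeline.pkl")
names_path = os.path.join(tmp_dir, "feature_names.pkl")
joblib.dump(toy_pipe, pipe_path)
joblib.dump(train.columns.tolist(), names_path)

# --- Inference side ---
loaded_pipe = joblib.load(pipe_path)
feature_names = joblib.load(names_path)

# New rows may arrive with columns in a different order; reindex to the saved order.
new_rows = pd.DataFrame({"GarageCars": [2.0], "GrLivArea": [1800.0]})
new_rows = new_rows.reindex(columns=feature_names)

pred_log = loaded_pipe.predict(new_rows)
pred_price = np.expm1(pred_log)  # undo the log1p applied to the target
print(pred_price)
```

`reindex` also creates any missing column filled with `NaN`, which the real pipeline's imputers would then handle for the columns they cover.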