# **Regularization and Ensemble Modeling**

---

**Objectives**:

* Extend the baseline Linear Regression model by applying regularization techniques (Ridge and Lasso Regression).

* Compare how these regularized models perform against the baseline and feature-engineered linear models.

* Identify important features through Lasso’s feature selection behavior.

* Train ensemble models (Random Forest, Gradient Boosting, HistGradientBoosting) to capture nonlinear relationships.

* Use GridSearchCV to tune key hyperparameters and determine the best-performing model.

* Save the top model pipeline for use in Notebook 05 (testing on unseen data).


**Files to Be Used**:

**Input**:

* `housing_train_featured.csv`

* `linear_model_pipeline.joblib`

**Output**:

* `best_model_pipeline.joblib`



## **1. Import Dependencies and Load Data**

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

Load housing_train_featured.csv and the saved preprocessor/pipeline:

In [3]:
pd.set_option('display.max_columns', None)
df_train = pd.read_csv('housing_train_featured.csv')
df_train.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,TotalSF,HouseAge,RemodAge,TotalBath,HasPool,HasGarage,HasFireplace,OverallQual_Cond,TotalPorchSF
0,60,RL,65.0,8450,1,Reg,Lvl,1,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,2.0,4.0,PConc,2.0,4.0,No,GLQ,706,Unf,0,150,856,GasA,0.0,1,SBrkr,856,854,0,1710,1,0,2,1,3,1,2.0,8,Typ,0,3.0,Attchd,2003.0,RFn,2,548,5.0,5.0,Y,0,61,0,0,0,0,3.0,0,2,2008,WD,Normal,208500,2566,5,5,3.5,0,1,0,35,61
1,20,RL,80.0,9600,1,Reg,Lvl,1,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,3.0,4.0,CBlock,2.0,4.0,Gd,ALQ,978,Unf,0,284,1262,GasA,0.0,1,SBrkr,1262,0,0,1262,0,1,2,0,3,1,3.0,6,Typ,1,5.0,Attchd,1976.0,RFn,2,460,5.0,5.0,Y,298,0,0,0,0,0,3.0,0,5,2007,WD,Normal,181500,2524,31,31,2.5,0,1,1,48,0
2,60,RL,68.0,11250,1,IR1,Lvl,1,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,2.0,4.0,PConc,2.0,4.0,Mn,GLQ,486,Unf,0,434,920,GasA,0.0,1,SBrkr,920,866,0,1786,1,0,2,1,3,1,2.0,6,Typ,1,5.0,Attchd,2001.0,RFn,2,608,5.0,5.0,Y,0,42,0,0,0,0,3.0,0,9,2008,WD,Normal,223500,2706,7,6,3.5,0,1,1,35,42
3,70,RL,60.0,9550,1,IR1,Lvl,1,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,3.0,4.0,BrkTil,4.0,1.0,No,ALQ,216,Unf,0,540,756,GasA,2.0,1,SBrkr,961,756,0,1717,1,0,1,0,3,1,2.0,7,Typ,1,2.0,Detchd,1998.0,Unf,3,642,5.0,5.0,Y,0,35,272,0,0,0,3.0,0,2,2006,WD,Abnorml,140000,2473,91,36,2.0,0,1,1,35,307
4,60,RL,84.0,14260,1,IR1,Lvl,1,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,2.0,4.0,PConc,2.0,4.0,Av,GLQ,655,Unf,0,490,1145,GasA,0.0,1,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,2.0,9,Typ,1,5.0,Attchd,2000.0,RFn,3,836,5.0,5.0,Y,192,84,0,0,0,0,3.0,0,12,2008,WD,Normal,250000,3343,8,8,3.5,0,1,1,40,84
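The `Unnamed: 0` column in the preview is what pandas typically produces when a frame is written with its default index and read back without `index_col`. A minimal sketch, assuming that is what happened here; the toy frame only reuses two columns from the preview and is not the real dataset:

```python
import io

import pandas as pd

# Toy frame standing in for the housing data (two rows from the preview above)
toy = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})

# Saving with the default index=True writes an unnamed first column
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading back without index_col turns that column into 'Unnamed: 0'
buffer.seek(0)
naive = pd.read_csv(buffer)

# Treating the first column as the index keeps it out of the feature set
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

Whether the extra column actually reaches the model depends on how the saved preprocessor treats columns it was not configured for, so this is worth checking rather than assuming.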


Load pipeline from notebook 3 and extract preprocessors:

In [9]:
import joblib

pipeline_loaded = joblib.load('linear_model_pipeline.joblib')
print('done')

done


In [10]:
preprocessor = pipeline_loaded.named_steps['preprocessor']
print("Preprocessor extracted from linear_model_pipeline.joblib")

Preprocessor extracted from linear_model_pipeline.joblib
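The extracted `preprocessor` is the same Python object that sits inside `pipeline_loaded`, so fitting it again elsewhere also changes the loaded pipeline. A hedged sketch of working on a copy with `sklearn.base.clone`; the `StandardScaler` stand-in and the random data are assumptions for illustration, not the notebook's actual preprocessor:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_a = rng.normal(loc=0.0, size=(50, 2))
X_b = rng.normal(loc=10.0, size=(50, 2))
y = rng.normal(size=50)

# Stand-in for the loaded pipeline (step names mirror the notebook)
loaded = Pipeline([
    ("preprocessor", StandardScaler()),
    ("regressor", LinearRegression()),
])
loaded.fit(X_a, y)

original = loaded.named_steps["preprocessor"]
mean_before = original.mean_.copy()

# clone() returns an unfitted copy with the same constructor parameters
fresh = clone(original)
fresh.fit(X_b)

# The fitted state inside the loaded pipeline is unchanged
print(np.allclose(original.mean_, mean_before))
```

`GridSearchCV` clones its estimator internally, so this mainly matters when a pipeline containing the shared object is fitted directly.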



## **2. Regularization**

Apply Ridge Regression with different alpha values.

Apply Lasso Regression and note which coefficients shrink to zero.

Compare R2, MAE, MSE, and RMSE across Linear, Ridge, and Lasso models.
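For reference, the four metrics can be written out directly. This is a small sketch with a hypothetical helper name (`regression_report`) and made-up numbers; the notebook itself uses the `sklearn.metrics` functions imported above:

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation

    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": float(np.mean(np.abs(residuals))),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }


report = regression_report([3, 5, 7], [2, 5, 9])
print(report)
```

RMSE is in the same units as the target, which is why it is the easiest of the four to read as a dollar error.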


Define data and split into training and testing sets:

In [7]:
X = df_train.drop('SalePrice', axis=1)
y = np.log1p(df_train['SalePrice'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Train: (1168, 85), Test: (292, 85)
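The target is trained on a `log1p` scale, so predictions have to be mapped back with `expm1` before dollar metrics are computed, as the evaluation cell below does. A quick sketch of the round trip, using four `SalePrice` values from the preview above:

```python
import numpy as np

# SalePrice values taken from the data preview
prices = np.array([140000.0, 181500.0, 208500.0, 250000.0])

log_prices = np.log1p(prices)     # scale the model is trained on
recovered = np.expm1(log_prices)  # scale the metrics are reported on

print(log_prices.round(4))
print(np.allclose(recovered, prices))
```

On the log scale the spread between the cheapest and most expensive of these houses is well under 1, which is worth keeping in mind when choosing penalty strengths.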


Define Ridge and Lasso:

In [12]:
ridge = Ridge()
lasso = Lasso()

# Create pipelines using preprocessor extracted from joblib file:

ridge_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', ridge)
])

lasso_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', lasso)
])

Evaluate Ridge and Lasso:

In [15]:
# Grid search for alpha
param_grid = {'regressor__alpha': [0.1, 1.0, 10.0, 100.0]}

ridge_search = GridSearchCV(ridge_pipe, param_grid, cv=5, scoring='r2', n_jobs=-1)
lasso_search = GridSearchCV(lasso_pipe, param_grid, cv=5, scoring='r2', n_jobs=-1)

ridge_search.fit(X_train, y_train)
lasso_search.fit(X_train, y_train)

# Evaluate best models
for name, model in [('Ridge', ridge_search), ('Lasso', lasso_search)]:
    y_pred = np.expm1(model.predict(X_test))
    y_true = np.expm1(y_test)
    print(f"\n{name} Regression Results:")
    print(f"Best alpha: {model.best_params_['regressor__alpha']}")
    print(f"R2: {r2_score(y_true, y_pred):.4f}")
    print(f"MAE: {mean_absolute_error(y_true, y_pred):,.2f}")
    print(f"MSE: {mean_squared_error(y_true, y_pred):,.2f}")
    print(f"RMSE: {np.sqrt(mean_squared_error(y_true, y_pred)):,.2f}")


Ridge Regression Results:
Best alpha: 100.0
R2: 0.9087
MAE: 17,018.43
MSE: 699,944,870.87
RMSE: 26,456.47

Lasso Regression Results:
Best alpha: 0.1
R2: 0.6780
MAE: 29,634.49
MSE: 2,470,202,689.28
RMSE: 49,701.13


**Interpretation**:

The feature-engineered linear regression model from Notebook 03, which serves as the baseline for this notebook, has the following scores:

* R2 score: 0.9245
* MAE: 16,232.36
* MSE: 579,303,802.40
* RMSE: 24,068.73

After running Ridge and Lasso regression using the same preprocessing pipeline, the results show that neither regularized model outperformed the baseline Linear Regression.

The Ridge model achieved an **R2 score** of **0.9087** and an **RMSE** of about **26,456**, slightly worse than the Linear Regression’s **R2** of **0.9245** and **RMSE** of **24,069**.

Lasso performed much worse, with **R2** dropping to **0.678**, suggesting that its coefficient shrinkage removed too much predictive power. Because the target is log-transformed, even `alpha` = 0.1 is a strong penalty. Both selected values also sit at the edge of the grid (Ridge at the largest value, Lasso at the smallest), so the grid may not bracket the best `alpha` for either model.

This suggests that overfitting was not a significant issue in our baseline model, and that the added regularization from Ridge and Lasso only reduced accuracy on this grid. In other words, the **baseline Linear Regression remains our best linear model so far**.
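The objectives mention noting which coefficients Lasso shrinks to zero. A minimal sketch of how that could be counted, shown on synthetic data from `make_regression` rather than the housing set, together with a wider log-spaced grid as one possible follow-up (the grid bounds are an assumption, not a tuned choice):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Lasso sets coefficients exactly to zero, so an equality check is meaningful
zeroed = np.flatnonzero(lasso_demo.coef_ == 0)
kept = np.flatnonzero(lasso_demo.coef_ != 0)
print(f"{len(zeroed)} coefficients shrunk to zero, {len(kept)} kept")

# A log-spaced grid that reaches well below 0.1 (bounds are illustrative)
alpha_grid = {"regressor__alpha": np.logspace(-4, 0, 9).tolist()}
print(alpha_grid["regressor__alpha"])
```

In the notebook, the same check would apply to `lasso_search.best_estimator_.named_steps['regressor'].coef_`, paired with the preprocessor's output feature names.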

## **3. Ensemble Models (Base Versions)**

Train initial Random Forest, Gradient Boosting, and HistGradientBoosting models using default parameters.

Compare their results with the linear and regularized models.

In [21]:
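from sklearn.compose import ColumnTransformer

# Ensemble models keyed by the names used in the results below.
# The defining cell is missing from this export; default parameters are used
# as stated above, and random_state=42 is an assumption added for reproducibility.
models = {
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42),
    'HistGradientBoosting': HistGradientBoostingRegressor(random_state=42),
}

# The ensemble models are trained on the raw (untransformed) SalePrice
X = df_train.drop('SalePrice', axis=1)
y = df_train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}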

for name, model in models.items():
    # If model cannot handle sparse data, convert output to dense
    if name == 'HistGradientBoosting':
        preprocessor_dense = ColumnTransformer(
            transformers=preprocessor.transformers,
            sparse_threshold=0  # force dense output
        )
        pipe = Pipeline([
            ('preprocessor', preprocessor_dense),
            ('regressor', model)
        ])
    else:
        pipe = Pipeline([
            ('preprocessor', preprocessor),
            ('regressor', model)
        ])

    pipe.fit(X_train, y_train)

    y_pred = pipe.predict(X_test)


    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)

    results[name] = {'R2': r2, 'MAE': mae, 'MSE': mse, 'RMSE': rmse}
    print(f"\n{name} Results:")
    print(f"R2: {r2:.4f}")
    print(f"MAE: {mae:,.2f}")
    print(f"RMSE: {rmse:,.2f}")



RandomForest Results:
R2: 0.8873
MAE: 17,326.55
RMSE: 29,396.84

GradientBoosting Results:
R2: 0.9066
MAE: 15,781.37
RMSE: 26,765.99

HistGradientBoosting Results:
R2: 0.8908
MAE: 16,339.45
RMSE: 28,944.98


After introducing ensemble methods, the **Gradient Boosting Regressor** achieved an **R2 score** of **0.9066**, **MAE** of **15,781**, and **RMSE** of **26,766**. While this shows solid predictive performance, it’s still slightly below the feature-engineered linear regression model (**R2** = **0.9245**, **RMSE** = **24,069**).
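One detail of the dense-output workaround in the loop above: rebuilding a `ColumnTransformer` from `preprocessor.transformers` carries over the transformer list, but other constructor settings such as `remainder` fall back to their defaults. A hedged alternative is to clone the transformer and override only `sparse_threshold`. The toy preprocessor below is an assumption for illustration and only borrows two column names from the dataset:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the notebook's preprocessor
toy = pd.DataFrame({
    "Neighborhood": ["CollgCr", "Veenker", "Crawfor", "NoRidge"],
    "GrLivArea": [1710.0, 1262.0, 1717.0, 2198.0],
})
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
        ("num", StandardScaler(), ["GrLivArea"]),
    ],
    remainder="drop",
)

# Copy every setting, then override only the sparsity threshold
dense_preprocessor = clone(toy_preprocessor).set_params(sparse_threshold=0)

dense_out = dense_preprocessor.fit_transform(toy)
print(type(dense_out).__name__, dense_out.shape)
```

Because `clone` returns a new unfitted object, the original preprocessor keeps its own `sparse_threshold` and fitted state.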

Normally I would stop here, but let's see whether tuning hyperparameters can close the gap.

## **4. Hyperparameter Tuning**

Use GridSearchCV for each ensemble model (start with a small grid to keep runtime manageable).

Record the best parameters and metrics.
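To judge whether a grid is small enough before launching it, the number of fits can be counted up front with `ParameterGrid`. The sketch below copies the three grids from the tuning cell that follows; `param_grids_demo` and `fit_counts` are names introduced here for illustration:

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the tuning cell
param_grids_demo = {
    "RandomForest": {
        "regressor__n_estimators": [100, 200],
        "regressor__max_depth": [None, 10, 20],
        "regressor__min_samples_split": [2, 5],
    },
    "GradientBoosting": {
        "regressor__n_estimators": [100, 200],
        "regressor__learning_rate": [0.05, 0.1, 0.2],
        "regressor__max_depth": [3, 5],
    },
    "HistGradientBoosting": {
        "regressor__max_iter": [100, 200],
        "regressor__learning_rate": [0.05, 0.1],
        "regressor__max_depth": [None, 10, 20],
    },
}

cv_folds = 5
fit_counts = {}
for name, grid in param_grids_demo.items():
    n_candidates = len(ParameterGrid(grid))  # product of the list lengths
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates x {cv_folds} folds = {fit_counts[name]} fits")
```

Each added value multiplies the candidate count, so widening one parameter from two to four values would double the runtime for that model.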


In [22]:
# Define parameter grids
param_grids = {
    'RandomForest': {
        'regressor__n_estimators': [100, 200],
        'regressor__max_depth': [None, 10, 20],
        'regressor__min_samples_split': [2, 5],
    },
    'GradientBoosting': {
        'regressor__n_estimators': [100, 200],
        'regressor__learning_rate': [0.05, 0.1, 0.2],
        'regressor__max_depth': [3, 5],
    },
    'HistGradientBoosting': {
        'regressor__max_iter': [100, 200],
        'regressor__learning_rate': [0.05, 0.1],
        'regressor__max_depth': [None, 10, 20],
    }
}

best_models = {}

for name, model in models.items():
    print(f"\n Tuning {name}...")

    if name == 'HistGradientBoosting':
        preprocessor_dense = ColumnTransformer(
            transformers=preprocessor.transformers,
            sparse_threshold=0
        )
        pipe = Pipeline([
            ('preprocessor', preprocessor_dense),
            ('regressor', model)
        ])
    else:
        pipe = Pipeline([
            ('preprocessor', preprocessor),
            ('regressor', model)
        ])

    grid_search = GridSearchCV(
        pipe,
        param_grid=param_grids[name],
        scoring='r2',
        cv=5,
        n_jobs=-1,
        verbose=1
    )

    grid_search.fit(X_train, y_train)

    best_models[name] = grid_search.best_estimator_
    best_params = grid_search.best_params_
    best_score = grid_search.best_score_

    print(f"Best params for {name}: {best_params}")
    print(f"R2 (CV mean): {best_score:.4f}")



 Tuning RandomForest...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params for RandomForest: {'regressor__max_depth': 10, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 100}
R2 (CV mean): 0.8549

 Tuning GradientBoosting...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params for GradientBoosting: {'regressor__learning_rate': 0.1, 'regressor__max_depth': 3, 'regressor__n_estimators': 200}
R2 (CV mean): 0.8738

 Tuning HistGradientBoosting...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params for HistGradientBoosting: {'regressor__learning_rate': 0.1, 'regressor__max_depth': 10, 'regressor__max_iter': 200}
R2 (CV mean): 0.8593


**Interpretation**:

After running hyperparameter tuning using `GridSearchCV`, we found the following best configurations for each model:

* **Random Forest** performed best with `max_depth` = **10**, `min_samples_split` = 2, and `n_estimators` = 100, for a mean cross-validated **R2** of **0.8549**. Capping the depth limits how closely each tree can fit the training data.

* **Gradient Boosting** achieved the best cross-validated **R2 score** of **0.8738**, with `n_estimators` = 200, `learning_rate` = 0.1, and `max_depth` = 3: many shallow trees combined with a moderate learning rate.

* **HistGradientBoosting** reached a mean cross-validated **R2** of **0.8593**, with its best configuration at `max_depth` = 10, `learning_rate` = 0.1, and `max_iter` = 200.

Overall, Gradient Boosting slightly outperformed the others across folds, with Random Forest and HistGradientBoosting within about 0.02 R2 of it. These are cross-validated means on the training split, so they are not directly comparable to the hold-out scores in Section 3. The **close R2 scores among all three suggest diminishing returns from adding complexity**, which is consistent with the linear model already capturing most of the signal in this dataset.
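The tuning cell reports cross-validated scores only. A sketch of the two remaining steps from the objectives, scoring the refit best estimator on the hold-out split and saving it with `joblib`, is shown here on synthetic data; the `StandardScaler` preprocessor, the tiny grid, and the temp-directory path are assumptions made to keep the example fast and self-contained:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=0)),
])
search = GridSearchCV(pipe, {"regressor__n_estimators": [20, 40]}, scoring="r2", cv=3)
search.fit(X_tr, y_tr)

# best_score_ is a cross-validated mean on the training split;
# the hold-out score is a separate number and should be reported as such
holdout_r2 = r2_score(y_te, search.best_estimator_.predict(X_te))
print(f"CV R2: {search.best_score_:.4f}  hold-out R2: {holdout_r2:.4f}")

# Persist the refit best pipeline (written to a temp directory in this sketch)
model_path = os.path.join(tempfile.mkdtemp(), "best_model_pipeline.joblib")
joblib.dump(search.best_estimator_, model_path)
reloaded = joblib.load(model_path)
```

In the notebook, the same pattern would loop over `best_models` with the real `X_test` and `y_test`.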

## **5. Feature Importance and Model Comparison**


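Below is a summary table of all models (Linear, Ridge, Lasso, RF, GB, HGB), evaluated on the hold-out split. The ensemble rows report the default-parameter results from Section 3.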


| Model / Stage              | R2 Score   | MAE       | RMSE      |
| -------------------------- | ---------- | --------- | --------- |
| Baseline Linear Regression | 0.8590     | 19,871.62 | 32,883.12 |
| Log-Scaled Regression      | 0.9061     | 16,833.96 | 26,126.70 |
| After Feature Engineering  | **0.9245** | 16,232.36 | 24,068.73 |
| Ridge Regression           | 0.9087     | 17,018.43 | 26,456.47 |
| Lasso Regression           | 0.6780     | 29,634.49 | 49,701.13 |
| Gradient Boosting          | 0.9066     | 15,781.37 | 26,765.99 |
| Random Forest              | 0.8873     | 17,326.55 | 29,396.84 |
| HistGradientBoosting       | 0.8908     | 16,339.45 | 28,944.98 |
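The table can also be held in a DataFrame so the ranking per metric is computed rather than read off by eye. A small sketch using the numbers from the table above (the `summary` name and column labels are introduced here for illustration):

```python
import pandas as pd

# Hold-out metrics copied from the summary table
summary = pd.DataFrame(
    [
        ("Baseline Linear Regression", 0.8590, 19871.62, 32883.12),
        ("Log-Scaled Regression", 0.9061, 16833.96, 26126.70),
        ("After Feature Engineering", 0.9245, 16232.36, 24068.73),
        ("Ridge Regression", 0.9087, 17018.43, 26456.47),
        ("Lasso Regression", 0.6780, 29634.49, 49701.13),
        ("Gradient Boosting", 0.9066, 15781.37, 26765.99),
        ("Random Forest", 0.8873, 17326.55, 29396.84),
        ("HistGradientBoosting", 0.8908, 16339.45, 28944.98),
    ],
    columns=["model", "R2", "MAE", "RMSE"],
).set_index("model")

print(summary.sort_values("RMSE"))
print("Highest R2: ", summary["R2"].idxmax())
print("Lowest RMSE:", summary["RMSE"].idxmin())
print("Lowest MAE: ", summary["MAE"].idxmin())
```

The best model is not the same on every metric: the feature-engineered linear model leads on R2 and RMSE, while Gradient Boosting has the lowest MAE.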


In [27]:
coefs = pipeline_loaded.named_steps['regressor'].coef_
feature_names = pipeline_loaded.named_steps['preprocessor'].get_feature_names_out()

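# Names and coefficients must line up one-to-one; truncating either array
# would silently attach coefficients to the wrong features
if len(coefs) != len(feature_names):
    raise ValueError(f"{len(coefs)} coefficients vs {len(feature_names)} feature names")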

# create grouped feature importance
importance_df = (
    pd.DataFrame({'feature': feature_names, 'importance': np.abs(coefs)})
    .assign(base_feature=lambda df: df['feature']
            .str.replace(r'cat__|num__', '', regex=True)   # remove pipeline prefixes
            .str.split('_').str[0])                        # take base feature name
    .groupby('base_feature', as_index=False)['importance']
    .mean()
    .sort_values(by='importance', ascending=False)
)

print(importance_df.head(15))

     base_feature  importance
68       RoofMatl    0.588058
17     Condition2    0.227673
55       MSZoning    0.124580
18     Electrical    0.106990
40        HasPool    0.104371
34     GarageType    0.098927
65       PoolArea    0.079026
64     PavedDrive    0.071202
70  SaleCondition    0.071023
28     Functional    0.067766
69      RoofStyle    0.057268
32   GarageFinish    0.056279
43       HouseAge    0.052334
36      GrLivArea    0.049695
22    Exterior1st    0.048416
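Mean absolute coefficients depend on each feature's scale and on how its dummy columns are distributed, so a rare category can rank highly without affecting many predictions. Permutation importance is one model-agnostic cross-check. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data where only the first column matters; it is an illustration of the technique, not a rerun of the notebook's analysis:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
# Only the first column drives the target in this toy setup
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X_demo, y_demo)

# Shuffle one column at a time and measure how much the score drops
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=0)

ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
print(result.importances_mean.round(3))
```

Applied to a full pipeline and the raw hold-out frame, this ranks the original columns directly, which avoids regrouping one-hot columns by name.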


## **6. Conclusion**


So far, the results show a clear improvement as I cleaned the data, log-scaled the target, and engineered new features. The baseline linear regression started with an **R2-score** of **0.8590**, and after **log-scaling** the target variable it went up to **0.9061**. Then after **feature engineering**, the model reached its best performance yet with an **R2-score** of **0.9245** and the **lowest RMSE** among all the models. Gradient Boosting had a slightly lower **MAE** (15,781 vs. 16,232).

**Ridge** and **gradient boosting** came close but didn’t beat the feature-engineered linear regression, while **lasso** didn't perform well on the alpha grid tested.

The feature importance results show that `RoofMatl`, `Condition2`, and `MSZoning` have the largest average absolute coefficients in the linear model, followed by features like `GarageType`, `PoolArea`, and `GrLivArea`. These are coefficient magnitudes on a log-price target, so they describe influence within this model rather than a definitive ranking of what drives house prices.

Overall, the best gains came not from fancy models but from cleaning the data properly, scaling it, and adding smarter features. The next step will be to test everything on unseen data in Notebook 05 and see how well it generalizes.

-end-