# Model Complexity and Regularization

In this notebook, we evaluate how increasing model complexity affects
performance and generalization, using the baseline Linear Regression
as a reference.

In [2]:
import numpy as np
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge, Lasso

In [3]:
data = fetch_california_housing(as_frame=True)
df = data.frame

X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## Decision Tree Regressor

Decision Trees can capture complex non-linear relationships but are prone to overfitting if not properly constrained.

In [4]:
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

rmse_train, rmse_test, r2_train, r2_test

(np.float64(3.218325866275131e-16),
 np.float64(0.7037294974840077),
 1.0,
 0.622075845135081)

### Fully Grown Decision Tree

The unrestricted Decision Tree fits the training set essentially perfectly,
with an RMSE of about 3e-16 (floating-point noise) and an R² score of 1.0.

However, its performance drops substantially on the test set (RMSE ≈ 0.70,
R² ≈ 0.62), indicating severe overfitting. The model memorizes the training
data instead of learning generalizable patterns.
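
One way to make the memorization visible is to look at the size of the fitted tree. The sketch below is a minimal illustration on synthetic stand-in data (an assumption, used so the snippet runs without downloading the housing set); `X_demo` and `y_demo` are hypothetical names. With default settings, the number of leaves should come out close to the number of training rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the housing set):
# a noisy non-linear target whose values are all distinct.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (
    np.sin(X_demo[:, 0])
    + 0.5 * X_demo[:, 1] ** 2
    + rng.normal(scale=0.3, size=500)
)

# Same configuration as the unrestricted tree: no depth or leaf-size limit.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)

n_samples = X_demo.shape[0]
n_leaves = full_tree.get_n_leaves()
depth = full_tree.get_depth()

# An unconstrained tree keeps splitting until its leaves are pure,
# so the leaf count approaches the number of training rows.
print(f"samples={n_samples}, leaves={n_leaves}, depth={depth}")
```

Running `tree.get_n_leaves()` and `tree.get_depth()` on the `tree` fitted above gives the same diagnostic for the housing data.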

## Overfitting Analysis

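The Decision Tree achieves near-zero training error but significantly worse performance on the test set. This gap indicates overfitting: the model memorizes the training data instead of generalizing. The next cell limits tree depth and requires a minimum number of samples per leaf to see how much of the gap these constraints close.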

In [5]:
tree_limited = DecisionTreeRegressor(
    max_depth=5,
    min_samples_leaf=20,
    random_state=42
)

tree_limited.fit(X_train, y_train)

y_train_pred = tree_limited.predict(X_train)
y_test_pred = tree_limited.predict(X_test)

rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

rmse_train, rmse_test, r2_train, r2_test

(np.float64(0.696225521098421),
 np.float64(0.7245672649492821),
 0.637389379963111,
 0.599363458161861)

### Regularized Decision Tree

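Constraining tree depth and minimum leaf size reduces variance: training
and test scores are now close (RMSE 0.696 vs. 0.725, R² 0.637 vs. 0.599).

The smaller gap indicates a more stable model than the unrestricted tree,
but the gain comes from lower training performance, not better test
performance. Test R² is slightly lower (0.599 vs. 0.622), which suggests
that these particular constraints introduce some bias.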
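
The values `max_depth=5` and `min_samples_leaf=20` are a single point on a spectrum of constraint strengths. The sketch below sweeps `max_depth` and reports training and test RMSE for each setting. The helper `depth_sweep` and the synthetic stand-in data are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def depth_sweep(X_train, X_test, y_train, y_test, depths):
    """Return (max_depth, train RMSE, test RMSE) for each depth in `depths`."""
    rows = []
    for depth in depths:
        model = DecisionTreeRegressor(max_depth=depth, random_state=42)
        model.fit(X_train, y_train)
        rmse_tr = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
        rmse_te = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        rows.append((depth, rmse_tr, rmse_te))
    return rows


# Synthetic stand-in data (an assumption, not the housing set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (
    np.sin(X_demo[:, 0])
    + 0.5 * X_demo[:, 1] ** 2
    + rng.normal(scale=0.3, size=1000)
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# max_depth=None means the tree is grown without a depth limit.
sweep = depth_sweep(Xtr, Xte, ytr, yte, depths=[2, 4, 6, 8, 12, None])
for depth, rmse_tr, rmse_te in sweep:
    gap = rmse_te - rmse_tr
    print(f"max_depth={depth}: train={rmse_tr:.3f}, test={rmse_te:.3f}, gap={gap:.3f}")
```

Calling `depth_sweep(X_train, X_test, y_train, y_test, depths=[...])` with the notebook's own split would show where test RMSE bottoms out for the housing data.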

## Ridge Regression

Ridge regression introduces L2 regularization, penalizing large coefficients and reducing model variance.
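
For reference, the objective that scikit-learn's `Ridge` minimizes can be written as follows, where $w$ is the coefficient vector and $\alpha$ is the regularization strength (`alpha=1.0` in the cell below). The intercept is fitted separately and is not penalized.

$$
\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \, \lVert w \rVert_2^2
$$

Setting $\alpha = 0$ recovers ordinary least squares; larger values of $\alpha$ shrink the coefficients toward zero.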

In [6]:
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

y_train_pred = ridge.predict(X_train)
y_test_pred = ridge.predict(X_test)

rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

rmse_train, rmse_test, r2_train, r2_test

(np.float64(0.7196757706930821),
 np.float64(0.7455222779992702),
 0.6125511245209703,
 0.5758549611440126)

### Ridge Regression Results

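With alpha=1.0, Ridge shows a small gap between training and test
performance (R² 0.613 vs. 0.576, RMSE 0.720 vs. 0.746).

Its test performance is slightly lower than that of both trees (test R² 0.599
and 0.622), but it generalizes consistently. Note that the features are
not standardized here, so the strength of the L2 penalty on each
coefficient depends on the units of its feature.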
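
To see the effect of the penalty directly, the sketch below fits Ridge for several values of `alpha` and prints the L2 norm of the coefficients, which shrinks as `alpha` grows. Two things here are assumptions and not part of the cells above: the synthetic stand-in data, and the `StandardScaler` step, which is added so that the penalty treats features on different scales evenly.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(size=300),
    rng.normal(scale=1000.0, size=300),
])
y_demo = 2.0 * X_demo[:, 0] + 0.003 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    # Standardize first so the L2 penalty is comparable across features.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_demo, y_demo)
    coef = model.named_steps["ridge"].coef_
    coef_norms.append(float(np.linalg.norm(coef)))
    print(f"alpha={alpha}: ||w||_2 = {coef_norms[-1]:.4f}")
```

The same loop can be pointed at `X_train` and `y_train` to check how sensitive the housing model is to the choice of `alpha`.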

## Generalization Trade-off

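The experiments show that increasing model complexity improves training
performance but widens the gap between training and test performance: the
unrestricted tree reaches a training R² of 1.0 but only 0.622 on the test set.

Regularization techniques, such as limiting tree depth or applying L2
penalties, narrow this gap by trading some variance for bias. In these runs
the regularized models did not beat the unrestricted tree's test score, so
the constraint strength should be treated as a hyperparameter to tune
rather than a fixed choice.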
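
The comparison is easier to read when all models sit in one table. The sketch below is one possible way to do that: `summarize` is a hypothetical helper, and the data is a synthetic stand-in (an assumption), so the numbers it prints are not the housing results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def summarize(models, X_train, X_test, y_train, y_test):
    """Fit each model and tabulate train/test RMSE, their gap, and test R²."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        test_pred = model.predict(X_test)
        rmse_tr = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
        rmse_te = np.sqrt(mean_squared_error(y_test, test_pred))
        rows.append({
            "model": name,
            "rmse_train": rmse_tr,
            "rmse_test": rmse_te,
            "rmse_gap": rmse_te - rmse_tr,
            "r2_test": r2_score(y_test, test_pred),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic stand-in data (an assumption, not the housing set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (
    np.sin(X_demo[:, 0])
    + 0.5 * X_demo[:, 1] ** 2
    + rng.normal(scale=0.3, size=1000)
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same three configurations as in the notebook.
models = {
    "tree_full": DecisionTreeRegressor(random_state=42),
    "tree_limited": DecisionTreeRegressor(
        max_depth=5, min_samples_leaf=20, random_state=42
    ),
    "ridge": Ridge(alpha=1.0),
}
summary = summarize(models, Xtr, Xte, ytr, yte)
print(summary.round(3))
```

Passing the notebook's `X_train, X_test, y_train, y_test` to `summarize` would rebuild the housing comparison in a single table, with `rmse_gap` as a direct measure of overfitting.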