## California Housing

~ Fit a baseline linear regression (LR)

~ Check whether scaling helps (in theory it shouldn't)

~ Check whether polynomial features help

~ Add an irrelevant feature (noise)

~ Check multicollinearity

In [1]:
from sklearn.datasets import fetch_california_housing

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error

In [3]:
data = fetch_california_housing(as_frame=True)
df = data.frame

X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Baseline Linear Regression

model = LinearRegression()
model.fit(X_train, y_train)

train_preds = model.predict(X_train)
test_preds = model.predict(X_test)

train_r2 = r2_score(y_train, train_preds)
test_r2 = r2_score(y_test, test_preds)

train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)

print("Train R2:", train_r2)
print("Test R2:", test_r2)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)

Train R2: 0.6125511913966952
Test R2: 0.575787706032451
Train MSE: 0.5179331255246699
Test MSE: 0.5558915986952442


In [6]:
# LR with scaling 

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model_scaled = LinearRegression()
model_scaled.fit(X_train_scaled, y_train)

train_preds_scaled = model_scaled.predict(X_train_scaled)
test_preds_scaled = model_scaled.predict(X_test_scaled)

print("Train R2 (scaled):", r2_score(y_train, train_preds_scaled))
print("Test R2 (scaled):", r2_score(y_test, test_preds_scaled))

Train R2 (scaled): 0.6125511913966952
Test R2 (scaled): 0.5757877060324508


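Observation: as expected, scaling had no effect on the R2 score. Ordinary least squares with an intercept is invariant to standardization: the coefficients rescale to compensate, so the predictions are identical. The test R2 differs only in the last digit, which is floating-point noise.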
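A minimal sketch of why this holds, using synthetic data rather than the housing set (the data, seed, and variable names are illustrative assumptions). The fit on standardized features should make the same predictions, and each scaled coefficient should equal the raw coefficient times that feature's standard deviation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([2.0, -0.03, 50.0]) + rng.normal(scale=0.5, size=200)

scaler_demo = StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

raw = LinearRegression().fit(X_demo, y_demo)
scaled = LinearRegression().fit(X_demo_scaled, y_demo)

# Predictions agree up to floating-point error
same_preds = np.allclose(raw.predict(X_demo), scaled.predict(X_demo_scaled))

# Scaled coefficients are the raw ones multiplied by each feature's std
same_coefs = np.allclose(scaled.coef_, raw.coef_ * scaler_demo.scale_)

print("Same predictions:", same_preds)
print("Coefficients rescaled by std:", same_coefs)
```

This invariance is specific to unregularized least squares. With a penalty on the coefficients (Ridge, Lasso), feature scale does change the fit.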

In [7]:
## Using polynomial features

poly = PolynomialFeatures(degree=2, include_bias=False)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model_poly = LinearRegression()
model_poly.fit(X_train_poly, y_train)

train_preds_poly = model_poly.predict(X_train_poly)
test_preds_poly = model_poly.predict(X_test_poly)

print("Train R2 (poly):", r2_score(y_train, train_preds_poly))
print("Test R2 (poly):", r2_score(y_test, test_preds_poly))

Train R2 (poly): 0.6852681982344955
Test R2 (poly): 0.6456819728455382


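Observation:

Train R2 rose from 0.613 to 0.685, and test R2 rose from 0.576 to 0.646.

The gap between train and test R2 widened slightly, from about 0.037 to about 0.040.

This matches the theory: increasing model complexity lowers bias and raises variance. Here the drop in bias dominates, so test performance still improves.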
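To put numbers on the complexity jump, the sketch below counts the columns that a degree-2 expansion produces from 8 inputs and recomputes the train/test gaps from the scores printed above (rounded to four decimals). The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# The housing data has 8 input features; only the shape matters here
n_features = 8
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
poly_demo.fit(np.zeros((1, n_features)))
n_poly = poly_demo.n_output_features_
print("Features after degree-2 expansion:", n_poly)

# Train/test R2 gap, using the scores reported in the outputs above (rounded)
gap_baseline = 0.6126 - 0.5758
gap_poly = 0.6853 - 0.6457
print("Gap (baseline):", round(gap_baseline, 4))
print("Gap (poly):", round(gap_poly, 4))
```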


In [8]:
## adding irrelevant features

np.random.seed(42)

X_train_noise = X_train.copy()
X_test_noise = X_test.copy()

X_train_noise["random_noise"] = np.random.randn(len(X_train))
X_test_noise["random_noise"] = np.random.randn(len(X_test))

model_noise = LinearRegression()
model_noise.fit(X_train_noise, y_train)

test_preds_noise = model_noise.predict(X_test_noise)

print("Test R2 (with noise):", r2_score(y_test, test_preds_noise))

Test R2 (with noise): 0.5757174560575071


In [9]:
coef_df = pd.DataFrame({
    "feature": X_train_noise.columns,
    "coefficient": model_noise.coef_
})

print(coef_df.sort_values(by="coefficient", key=abs, ascending=False))

        feature  coefficient
3     AveBedrms     0.783328
0        MedInc     0.448615
7     Longitude    -0.433671
6      Latitude    -0.419773
2      AveRooms    -0.123351
1      HouseAge     0.009727
8  random_noise     0.007913
5      AveOccup    -0.003517
4    Population    -0.000002


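Observation:

The coefficient of random_noise is nonzero (0.0079), so the model fit a small weight to a feature that carries no signal.

That weight is estimation noise. It adds variance, and the test R2 dropped slightly, from 0.57579 to 0.57572. The features here are unscaled, so coefficient magnitudes are not directly comparable across features.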
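A small sketch of the same effect on synthetic data (the data, seeds, and names are assumptions, not part of the housing experiment). It refits with several different noise columns: the noise coefficient is typically small but not exactly zero and changes with each draw, while training R2 cannot go down when a column is added. That is why any cost shows up only on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic signal: two informative features plus unit-variance noise in y
rng = np.random.default_rng(42)
n = 500
X_sig = rng.normal(size=(n, 2))
y_sig = 3.0 * X_sig[:, 0] - 1.0 * X_sig[:, 1] + rng.normal(size=n)

base_r2 = LinearRegression().fit(X_sig, y_sig).score(X_sig, y_sig)

noise_coefs = []
train_r2_with_noise = []
for seed in range(5):
    # A fresh irrelevant column for each seed
    noise = np.random.default_rng(seed).normal(size=(n, 1))
    X_aug = np.hstack([X_sig, noise])
    m = LinearRegression().fit(X_aug, y_sig)
    noise_coefs.append(m.coef_[-1])
    train_r2_with_noise.append(m.score(X_aug, y_sig))

print("Noise coefficients:", np.round(noise_coefs, 4))
print("Train R2 without noise:", round(base_r2, 4))
print("Train R2 with noise:", np.round(train_r2_with_noise, 4))
```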

In [13]:
## testing multicollinearity (duplicating a feature)

X_train_multi = X_train.copy()
X_test_multi = X_test.copy()

X_train_multi["duplicate_income"] = X_train_multi["MedInc"]
X_test_multi["duplicate_income"] = X_test_multi["MedInc"]

model_multi = LinearRegression()
model_multi.fit(X_train_multi, y_train)

test_preds_multi = model_multi.predict(X_test_multi)

print("Test R2 (multicollinearity):", r2_score(y_test, test_preds_multi))

Test R2 (multicollinearity): 0.5757877060324506


In [14]:
coef_multi = pd.DataFrame({
    "feature": X_train_multi.columns,
    "coefficient": model_multi.coef_
})

print(coef_multi.sort_values(by="coefficient", key=abs, ascending=False))

            feature  coefficient
3         AveBedrms     0.783145
7         Longitude    -0.433708
6          Latitude    -0.419792
0            MedInc     0.224337
8  duplicate_income     0.224337
2          AveRooms    -0.123323
1          HouseAge     0.009724
5          AveOccup    -0.003526
4        Population    -0.000002


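Observation:

The coefficients of MedInc and duplicate_income are no longer uniquely determined. The solver split the weight evenly (0.224 each), and the two sum to about 0.449, close to the single MedInc coefficient in the noise-feature fit above.

When two features are perfectly correlated, infinitely many coefficient pairs give the same predictions. Only their sum is pinned down.

Test R2 is unchanged because multicollinearity inflates the variance of the coefficient estimates, not necessarily the prediction error.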
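The "infinitely many solutions" point can be checked by hand. The sketch below uses synthetic data (an assumption, chosen to keep the example small). It fits a model on two features, duplicates the first column, and then divides the first coefficient between the original and the duplicate in three different ways. Every split should reproduce the same predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_one = rng.normal(size=(100, 2))
y_one = 1.5 * X_one[:, 0] + 0.5 * X_one[:, 1] + rng.normal(scale=0.1, size=100)

fit = LinearRegression().fit(X_one, y_one)
w0, w1 = fit.coef_

# Design matrix with column 0 duplicated as a third column
X_dup = np.column_stack([X_one, X_one[:, 0]])

# Three different ways to divide w0 between the original and the duplicate
splits = [(w0, 0.0), (w0 / 2, w0 / 2), (2 * w0, -w0)]
preds = [X_dup @ np.array([a, w1, b]) + fit.intercept_ for a, b in splits]

reference = fit.predict(X_one)
all_match = all(np.allclose(p, reference) for p in preds)
print("All splits give the same predictions:", all_match)
```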

In [2]:
# Questions:

# Why did polynomial features increase variance?
# The degree-2 expansion turns the 8 original features into 44, so the model has many more
# parameters to estimate from the same data. More parameters leave more room to fit noise,
# especially when some of the new features are uninformative.

# Why didn’t scaling change performance?
# Ordinary least squares with an intercept is invariant to rescaling the features:
# the coefficients rescale to compensate, so predictions and R2 stay the same.

# Why were predictions stable but coefficients unstable?
# Because of multicollinearity. When two or more features are highly correlated, the data
# pin down their combined effect but not how it is divided between them. Small changes in
# the data can move the individual coefficients a lot while the predictions barely change.

# What would Ridge regression fix here?
# Ridge adds an L2 penalty that shrinks the coefficients. This makes the solution unique
# even with duplicated features, gives more stable coefficients, and can improve
# generalization on unseen data.
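A hedged sketch of the Ridge answer, again on synthetic data with a duplicated column. The value `alpha=1.0` is an arbitrary illustrative choice, not a tuned setting. Because the L2 penalty makes the objective strictly convex, the solution is unique, and by symmetry the two identical columns should receive the same coefficient.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_r = rng.normal(size=(200, 2))
y_r = 1.5 * X_r[:, 0] + 0.5 * X_r[:, 1] + rng.normal(scale=0.1, size=200)

# Duplicate column 0 to create perfect multicollinearity
X_r_dup = np.column_stack([X_r, X_r[:, 0]])

# alpha=1.0 is an arbitrary illustrative choice, not a tuned value
ridge = Ridge(alpha=1.0).fit(X_r_dup, y_r)
coef_orig, coef_dup = ridge.coef_[0], ridge.coef_[2]

print("Original column coefficient:", coef_orig)
print("Duplicate column coefficient:", coef_dup)
print("Equal split:", np.isclose(coef_orig, coef_dup))
```

On the housing data, the features would normally be standardized before fitting Ridge, since the penalty depends on feature scale.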
