# OLS Linear Regression 

The California Housing dataset is loaded and missing values are removed.
To keep the model simpler and easier to interpret, I exclude latitude and longitude.

I engineer the feature avg_rooms as total rooms per household, since this gives a more meaningful measure than using total_rooms alone.

The categorical variable ocean_proximity is included using one-hot encoding (drop_first=True). Dropping one dummy avoids perfect multicollinearity with the intercept term (the dummy variable trap). The dropped category becomes the baseline, and each remaining dummy coefficient is interpreted relative to it.
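
To make the baseline idea concrete, here is a minimal sketch on a toy column. The four rows are made up for illustration and are not taken from the dataset. As far as I understand pandas, drop_first=True drops the first category in sorted order, so in the housing data the baseline is presumably `<1H OCEAN`, the one category with no dummy column in the coefficient table.

```python
import pandas as pd

# Toy data: illustrative labels only (hypothetical rows, not the real dataset).
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "<1H OCEAN", "INLAND"]})

dummies = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True).astype(float)

# "<1H OCEAN" sorts first, so it becomes the baseline and gets no column.
print(dummies.columns.tolist())

# A row from the baseline category is encoded as zeros in every remaining dummy.
print(dummies.iloc[2].tolist())
```

A baseline row contributes only the intercept (plus the continuous features) to the prediction, which is why the other dummy coefficients read as differences from the baseline.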

In [2]:
import pandas as pd
import numpy as np
from linear_regression import LinearRegression

# Load the data and drop rows with missing values
df = pd.read_csv("data/housing.csv")
df = df.dropna()

cols = ["median_income", "housing_median_age", "total_rooms", "households",
        "ocean_proximity", "median_house_value"]

# Replace the raw counts with average rooms per household
df_model = df[cols].copy()
df_model["avg_rooms"] = df_model["total_rooms"] / df_model["households"]
df_model = df_model.drop(columns=["total_rooms", "households"])

# One-hot encode ocean_proximity; drop_first=True leaves one category out as the baseline
X = pd.get_dummies(df_model.drop(columns=["median_house_value"]),
                   columns=["ocean_proximity"],
                   drop_first=True).astype(np.float64)

# Target variable
y = df_model["median_house_value"].astype(np.float64)


# Fitting the model

The model is fitted using Ordinary Least Squares, meaning it finds the coefficients that make the predicted house values as close as possible to the real values by minimizing the squared prediction errors.

The resulting coefficients (beta) represent the estimated effect of each feature on the house value, assuming other variables are held constant.
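
The internals of the `LinearRegression` class are not shown in this notebook, so the following is only a sketch of what an OLS fit typically computes, run on synthetic data rather than the housing data. It minimizes the squared error $\lVert y - X\beta \rVert^2$ in two equivalent ways: with `np.linalg.lstsq`, and by solving the normal equations $X^\top X \beta = X^\top y$. All variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption for illustration): y = 2 + 3*x1 - 1*x2 + small noise
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution; lstsq avoids explicitly forming and inverting X'X.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Same solution from the normal equations (X'X) beta = X'y.
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

print(beta)
print(np.allclose(beta, beta_normal))
```

Both routes give the same coefficients when the design matrix has full column rank, which is the condition that dropping one dummy category is meant to protect.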

In [3]:
lr = LinearRegression(X.values, y.values, fit_intercept=True)
lr.fit()

pd.DataFrame({"feature": ["Intercept"] + list(X.columns), "beta": lr.beta})

Unnamed: 0,feature,beta
0,Intercept,49698.759907
1,median_income,37953.411761
2,housing_median_age,936.384135
3,avg_rooms,498.666089
4,ocean_proximity_INLAND,-72347.790082
5,ocean_proximity_ISLAND,184057.707245
6,ocean_proximity_NEAR BAY,13141.510143
7,ocean_proximity_NEAR OCEAN,17241.066446


# Core regression statistics

This section reports error measures, R² and the F-test for overall model significance.

RMSE is around 73,000 USD, meaning the model typically misses the true house value by roughly 73k dollars.
The model achieves R² = 0.597, which means it explains about 60% of the variation in house values.
The F-test gives F ≈ 4,326 with a p-value that is numerically 0, so the null hypothesis that all slope coefficients are zero is rejected and the regression is statistically significant overall.
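
The numbers in the output table are consistent with the standard definitions sketched below. Whether the class computes them exactly this way is an assumption on my part; the sketch only recomputes them from the rounded values of n, d, SSE and R² shown in the table, so small rounding differences are expected.

```python
import math

# Values copied from the output table in this section (SSE and R² are rounded).
n = 20433          # number of observations
d = 8              # number of coefficients, including the intercept
sse = 1.096735e14  # sum of squared residuals
r2 = 0.59718

# Unbiased residual variance uses n - d degrees of freedom.
variance = sse / (n - d)
std_dev = math.sqrt(variance)

# RMSE divides by n instead, so it comes out slightly smaller.
rmse = math.sqrt(sse / n)

# F-statistic for H0: all slope coefficients are zero, written in terms of R².
f_stat = (r2 / (1 - r2)) * ((n - d) / (d - 1))

print(round(std_dev), round(rmse), round(f_stat))
```

The p-value then comes from the upper tail of an F-distribution with d - 1 = 7 and n - d = 20425 degrees of freedom. For a statistic this large the tail probability underflows, which would explain why the table shows 0.0.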

In [4]:
n = lr.n
d = lr.d
sse = lr.sse()
variance = lr.sample_variance()
std_dev = lr.standard_deviation()
rmse = lr.rmse()
r2 = lr.r_squared()
F_stat, p_value = lr.regression_significance()

pd.DataFrame({
    "n samples": [n],
    "d (features incl. intercept)": [d],
    "SSE": [sse],
    "Sample variance": [variance],
    "Standard deviation": [std_dev],
    "rmse": [rmse],
    "r2": [r2],
    "F-statistics": [F_stat],
    "F p-value": [p_value]
})

Unnamed: 0,n samples,d (features incl. intercept),SSE,Sample variance,Standard deviation,rmse,r2,F-statistics,F p-value
0,20433,8,109673500000000.0,5369570000.0,73277.348882,73263.002575,0.59718,4325.724793,0.0


# Coefficient estimates and significance tests

Here we look at the regression coefficients and test which variables actually matter in the model.

For each feature we calculate standard errors, t-values, p-values and confidence intervals.
A low p-value means the feature is statistically significant: an estimate this far from zero would be unlikely if the true coefficient were zero.

Most variables in this model are extremely significant. The engineered feature avg_rooms is significant at the 5% level (p ≈ 0.027) but not at the 1% level, since its 99% confidence interval includes zero.

Each coefficient can be interpreted as the expected change in house value when that feature increases by 1 unit, assuming the other variables stay the same. For the ocean_proximity dummies, the coefficient is the expected difference from the baseline category.
Confidence intervals are shown for both the 95% and 99% confidence levels, where 99% gives wider intervals as expected.
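
As a worked example, the sketch below recomputes the t-value, p-value and confidence intervals for avg_rooms from its reported beta and standard error. One assumption to flag: it uses the standard normal distribution from Python's standard library in place of the t-distribution. With n - d = 20425 degrees of freedom the two are nearly identical, so the results should match the table only to about two or three significant digits.

```python
from statistics import NormalDist

# Reported estimate and standard error for avg_rooms (from the results table).
beta = 498.666089
se = 225.920922

# Stand-in for the t-distribution (assumption: degrees of freedom are large).
std_normal = NormalDist()

# Test statistic and two-sided p-value for H0: beta = 0.
t_value = beta / se
p_value = 2 * (1 - std_normal.cdf(abs(t_value)))


def confidence_interval(alpha):
    """Return (low, high) for a two-sided (1 - alpha) confidence interval."""
    z = std_normal.inv_cdf(1 - alpha / 2)
    return beta - z * se, beta + z * se


ci_95 = confidence_interval(0.05)
ci_99 = confidence_interval(0.01)

print(t_value, p_value)
print(ci_95)
print(ci_99)
```

The 95% interval stays above zero while the 99% interval crosses it, which is the same conclusion as comparing the p-value against 0.05 and 0.01.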

In [None]:
feature_names = ["Intercept"] + list(X.columns)

beta = lr.beta
se = lr.standard_errors()
tvals = lr.t_values()
pvals = lr.p_values()
ci_low_95, ci_high_95 = lr.confidence_intervals(alpha=0.05)
ci_low_99, ci_high_99 = lr.confidence_intervals(alpha=0.01)

# Column names follow the header of the output table below
results = pd.DataFrame({
    "feature": feature_names,
    "beta": beta,
    "SE": se,
    "t": tvals,
    "p": pvals,
    "CI_low_95": ci_low_95,
    "CI_high_95": ci_high_95,
    "CI_low_99": ci_low_99,
    "CI_high_99": ci_high_99
})

results.sort_values("p")

Unnamed: 0,feature,beta,SE,t,p,CI_low_95,CI_high_95,CI_low_99,CI_high_99
1,median_income,37953.411761,303.842629,124.911412,0.0,37357.855859,38548.967663,37170.691868,38736.131654
4,ocean_proximity_INLAND,-72347.790082,1287.754265,-56.181363,0.0,-74871.891637,-69823.688528,-75665.135258,-69030.444907
0,Intercept,49698.759907,2201.350778,22.576484,1.705216e-111,45383.935973,54013.583842,44027.926126,55369.593688
2,housing_median_age,936.384135,43.745649,21.405195,1.519766e-100,850.639158,1022.129113,823.69228,1049.075991
7,ocean_proximity_NEAR OCEAN,17241.066446,1625.728649,10.605132,3.300142e-26,14054.508013,20427.624878,13053.075585,21429.057306
6,ocean_proximity_NEAR BAY,13141.510143,1759.806195,7.467589,8.495496e-14,9692.148975,16590.871311,8608.126132,17674.894154
5,ocean_proximity_ISLAND,184057.707245,32787.5015,5.613655,2.006703e-08,119791.57682,248323.837671,99594.807043,268520.607447
3,avg_rooms,498.666089,225.920922,2.207259,0.02730702,55.842977,941.489202,-83.32203,1080.654208


# Pearson correlation 

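The Pearson correlation matrix shows no strong linear dependencies between the continuous features: the largest is 0.33, between median_income and avg_rooms. The negative correlations between the dummy variables are expected, since the categories are mutually exclusive (a district in one category has 0 in all the others). Overall, multicollinearity does not appear to be a major issue in this model.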
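
To illustrate why dummy columns from the same categorical variable correlate negatively, here is a small sketch on a toy column using `np.corrcoef`. This is not how `lr.pearson_matrix()` is implemented (that code is not shown here), and the `region` column and its levels are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy categorical column with three equally frequent levels (illustrative only).
toy = pd.DataFrame({"region": ["a", "b", "c", "a", "b", "c"]})
dummies = pd.get_dummies(toy, columns=["region"], drop_first=True).astype(float)

# rowvar=False treats each column as a variable, giving a columns-by-columns matrix.
R = np.corrcoef(dummies.to_numpy(), rowvar=False)
corr = pd.DataFrame(R, index=dummies.columns, columns=dummies.columns)
print(corr)

# Correlation between the two remaining dummies.
r_bc = corr.loc["region_b", "region_c"]
```

Whenever one dummy is 1 the other must be 0, so the two columns can never be high together and their correlation is negative. The effect is weaker in the housing data because the categories there are far from equally frequent.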

In [6]:
R = lr.pearson_matrix()

corr_df = pd.DataFrame(R, index=X.columns, columns=X.columns)
corr_df

Unnamed: 0,median_income,housing_median_age,avg_rooms,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
median_income,1.0,-0.118278,0.325307,-0.237536,-0.009281,0.056677,0.027351
housing_median_age,-0.118278,1.0,-0.153031,-0.236968,0.017105,0.256149,0.020797
avg_rooms,0.325307,-0.153031,1.0,0.151231,0.001419,-0.029676,-0.034532
ocean_proximity_INLAND,-0.237536,-0.236968,0.151231,1.0,-0.010681,-0.241356,-0.262289
ocean_proximity_ISLAND,-0.009281,0.017105,0.001419,-0.010681,1.0,-0.005531,-0.006011
ocean_proximity_NEAR BAY,0.056677,0.256149,-0.029676,-0.241356,-0.005531,1.0,-0.135819
ocean_proximity_NEAR OCEAN,0.027351,0.020797,-0.034532,-0.262289,-0.006011,-0.135819,1.0


# Conclusion
The regression model is statistically significant and achieves an R² around 0.6, meaning it explains a meaningful part of the variation in house values.
median_income is by far the strongest predictor (it has the largest t-value), while avg_rooms contributes less and is statistically significant only at the 5% level.
The model is not perfect, but it provides an interpretable baseline with reasonable explanatory power.