OLS Linear Regression with 

In [8]:
import pandas as pd
import numpy as np
from linear_regression import LinearRegression

df = pd.read_csv("data/housing.csv")
df = df.dropna()

cols = ["median_income","housing_median_age","total_rooms","households","ocean_proximity","median_house_value"]

df_model = df[cols].copy()
df_model["avg_rooms"] = df_model["total_rooms"] / df_model["households"]
df_model = df_model.drop(columns=["total_rooms","households"])


X = pd.get_dummies(df_model.drop(columns=["median_house_value"]),
                   columns=["ocean_proximity"],
                   drop_first=True).astype(np.float64)

y = df_model["median_house_value"].astype(np.float64)

X.head()
y.head()


0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
Name: median_house_value, dtype: float64

Here we fit the model using Ordinary Least Squares (OLS).

In [6]:
lr = LinearRegression(X.values, y.values, fit_intercept=True)
lr.fit()

lr.beta

array([ 49698.75990713,  37953.41176124,    936.3841351 ,    498.66608943,
       -72347.79008241, 184057.70724514,  13141.51014279,  17241.06644557])

The following table shows core regression statistics including error measures, RÂ² and the F-test for overall model significance.

In [7]:
n = lr.n
d = lr.d
sse = lr.sse()
variance = lr.sample_variance()
std_dev = lr.standard_deviation()
rmse = lr.rmse()
r2 = lr.r_squared()
F_stat, p_value = lr.regression_significance()

pd.DataFrame({
    "n samples": [n],
    "d (features incl. intercept)": [d],
    "SSE": [sse],
    "Sample variance": [variance],
    "Standard deviation": [std_dev],
    "rmse": [rmse],
    "r2": [r2],
    "F-statistics": [F_stat],
    "F p-value": [p_value]
})

Unnamed: 0,n samples,d (features incl. intercept),SSE,Sample variance,Standard deviation,rmse,r2,F-statistics,F p-value
0,20433,8,109673500000000.0,5369570000.0,73277.348882,73263.002575,0.59718,4325.724793,0.0


Below we compute standard errors, t-tests, p-values and confidence intervals for each regression coefficient. Alpha can be edited on confidence intervals.

In [9]:
feature_names = ["Intercept"] + list(X.columns)

beta = lr.beta
se = lr.standard_errors()
tvals = lr.t_values()
pvals = lr.p_values()
ci_low, ci_high = lr.confidence_intervals(alpha=0.05)

results = pd.DataFrame({
    "feature": feature_names,
    "beta": beta,
    "SE": se,
    "t": tvals,
    "p": p_value,
    "CI_low": ci_low,
    "CI_high": ci_high
})

results.sort_values("p")

Unnamed: 0,feature,beta,SE,t,p,CI_low,CI_high
0,Intercept,49698.759907,2201.350778,22.576484,0.0,45383.935973,54013.583842
1,median_income,37953.411761,303.842629,124.911412,0.0,37357.855859,38548.967663
2,housing_median_age,936.384135,43.745649,21.405195,0.0,850.639158,1022.129113
3,avg_rooms,498.666089,225.920922,2.207259,0.0,55.842977,941.489202
4,ocean_proximity_INLAND,-72347.790082,1287.754265,-56.181363,0.0,-74871.891637,-69823.688528
5,ocean_proximity_ISLAND,184057.707245,32787.5015,5.613655,0.0,119791.57682,248323.837671
6,ocean_proximity_NEAR BAY,13141.510143,1759.806195,7.467589,0.0,9692.148975,16590.871311
7,ocean_proximity_NEAR OCEAN,17241.066446,1625.728649,10.605132,0.0,14054.508013,20427.624878



Pearson correlation is used to detect linear dependencies (multicollinearity) between features.

In [10]:
R = lr.pearson_matrix()

corr_df = pd.DataFrame(R, index=X.columns, columns=X.columns)
corr_df

Unnamed: 0,median_income,housing_median_age,avg_rooms,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
median_income,1.0,-0.118278,0.325307,-0.237536,-0.009281,0.056677,0.027351
housing_median_age,-0.118278,1.0,-0.153031,-0.236968,0.017105,0.256149,0.020797
avg_rooms,0.325307,-0.153031,1.0,0.151231,0.001419,-0.029676,-0.034532
ocean_proximity_INLAND,-0.237536,-0.236968,0.151231,1.0,-0.010681,-0.241356,-0.262289
ocean_proximity_ISLAND,-0.009281,0.017105,0.001419,-0.010681,1.0,-0.005531,-0.006011
ocean_proximity_NEAR BAY,0.056677,0.256149,-0.029676,-0.241356,-0.005531,1.0,-0.135819
ocean_proximity_NEAR OCEAN,0.027351,0.020797,-0.034532,-0.262289,-0.006011,-0.135819,1.0
