In [1]:
import pandas as pd
from linear_regression import LinearRegression1

df = pd.read_csv("housing.csv")

matrix = df.copy()
# fill missing values in total_bedrooms with median
matrix["total_bedrooms"] = matrix["total_bedrooms"].fillna(matrix["total_bedrooms"].median())

# set y to be the target variable (house values) and X to be the predictors (features)
y = matrix["median_house_value"].to_numpy(dtype=float)

numeric_cols = [
    "longitude", 
    "latitude", 
    "housing_median_age", 
    "total_rooms", 
    "total_bedrooms", 
    "population", 
    "households", 
    "median_income"   
]

X_num = matrix[numeric_cols]
X_categorical = pd.get_dummies(matrix["ocean_proximity"], prefix="ocean", drop_first=True, dtype=float)
X_df = pd.concat([X_num, X_categorical], axis=1)

X = X_df.to_numpy(dtype=float)
feature_names = X_df.columns.tolist()

Let's create an instance of the class and then call its fit method. Fitting is the regression step itself: it estimates the model coefficients from X and y.

In [2]:
model = LinearRegression1()
model.fit(X, y)

<linear_regression.LinearRegression1 at 0x1b690ca34d0>
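The source of LinearRegression1 is not shown here, so the following is only a minimal sketch of what an ordinary least squares fit typically does, not the class's actual implementation. The function name `fit_ols` and the toy data are assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return OLS coefficients [intercept, b1, ..., bk] (hypothetical helper)."""
    n = X.shape[0]
    # Prepend a column of ones so the first coefficient is the intercept.
    X_design = np.column_stack([np.ones(n), X])
    # Least-squares solution of X_design @ beta = y.
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta

# Toy data (assumed): y = 2 + 3*x1 - 1*x2 with no noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2.0 + 3.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
beta_toy = fit_ols(X_toy, y_toy)
print(beta_toy)
```

Because the toy data has no noise, the recovered coefficients should match the ones used to generate it up to floating-point error.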

Let's run some tests on the fitted model and its parameters. First, a significance test for the regression as a whole:

In [3]:
F, p = model.significance_regression()
print("F-statistic:", F)
print("p-value:", p)

F-statistic: 3129.2889229152033
p-value: 0.0


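This is extremely strong evidence against the null hypothesis that all slope coefficients are zero. The p-value prints as 0.0 because it is too small to represent in floating point, not because it is exactly zero. The result means that the predictors, taken together, explain significantly more of the variation in house values than an intercept-only model would. It does not by itself tell us that every predictor matters.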
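To make the overall F-test concrete, here is a rough sketch of how the statistic is usually computed, F = (SSR / k) / (SSE / (n - k - 1)), with k predictors and n observations. This is the textbook formula, not necessarily what `significance_regression` does internally; the function name `f_statistic` and the toy data are assumptions. The p-value would then come from the upper tail of an F distribution with (k, n - k - 1) degrees of freedom, for instance via `scipy.stats.f.sf`.

```python
import numpy as np

def f_statistic(X, y):
    """Overall F-statistic for an OLS fit with intercept (hypothetical helper)."""
    n, k = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    sse = float(residuals @ residuals)         # unexplained variation
    sst = float(((y - y.mean()) ** 2).sum())   # total variation
    ssr = sst - sse                            # variation explained by the model
    return (ssr / k) / (sse / (n - k - 1))

# Toy data (assumed): one response that depends on X, one that is pure noise.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_signal = 1.0 + X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)
y_noise = rng.normal(size=200)

F_signal = f_statistic(X_toy, y_signal)
F_noise = f_statistic(X_toy, y_noise)
print("F with real signal:", F_signal)
print("F with pure noise: ", F_noise)
```

The response that really depends on the predictors should give a far larger F than the pure-noise response, which is the contrast the test is built to detect.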

In [4]:
R2 = model.r_squares()
print("R-squared:", R2)

R-squared: 0.6454530166046624


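R^2 (the coefficient of determination) is a goodness-of-fit measure rather than a statistical test. It lies between 0 and 1, and a value close to 1 means that most of the variation in the response variable is explained by the model. We got around 0.65, so the model explains roughly 65% of the variance in median house value.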
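As a small illustration of the definition, R^2 = 1 - SSE / SST, the sketch below evaluates it for two extreme cases. The helper `r_squared` and the toy values are assumptions for illustration and are not taken from LinearRegression1.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination, 1 - SSE / SST (hypothetical helper)."""
    sse = np.sum((y - y_pred) ** 2)        # residual sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return 1.0 - sse / sst

y_toy = np.array([1.0, 2.0, 3.0, 4.0])

# Predictions that equal the observations explain all of the variation.
r2_perfect = r_squared(y_toy, y_toy)
# Always predicting the mean explains none of it.
r2_mean = r_squared(y_toy, np.full(4, y_toy.mean()))

print(r2_perfect, r2_mean)
```

The two cases mark the ends of the scale: a perfect fit gives 1, and a model no better than the mean gives 0.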


Now let's test the significance of each individual coefficient:

In [6]:
t_test, p_values = model.t_test_coefficiants()

all_feature_names = ["Intercept"] + feature_names

# print t-values and p-values for each coefficient
print(f"{'Feature':<20} {'t-value':>10} {'p-value':>10}")
print("-" * 40)
for name, t, p in zip(all_feature_names, t_test, p_values):
    print(f"{name:<20} {t:>10.4f} {p:>10.4e}")
    


Feature                 t-value    p-value
----------------------------------------
Intercept              -25.5272 1.5442e-141
longitude              -26.0683 2.0260e-147
latitude               -25.1777 8.4456e-138
housing_median_age      24.2031 1.2406e-127
total_rooms             -6.1386 8.4792e-10
total_bedrooms          12.0274 3.2953e-33
population             -36.9277 4.0872e-289
households              11.6848 1.9142e-31
median_income          116.6703 0.0000e+00
ocean_INLAND           -22.9036 1.1506e-114
ocean_ISLAND             5.0716 3.9790e-07
ocean_NEAR BAY          -1.9398 5.2414e-02
ocean_NEAR OCEAN         3.0452 2.3284e-03


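So median income is clearly and significantly associated with y, the median house value: it has by far the largest t-value (116.67). Every other coefficient is also significant at the 5% level except ocean_NEAR BAY (p ≈ 0.052), which narrowly misses it.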
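For reference, a t-value like the ones in the table is usually the estimated coefficient divided by its standard error, where the standard errors come from the diagonal of sigma^2 (X'X)^-1. The sketch below shows that textbook calculation; it is an assumption about what `t_test_coefficiants` does, and the helper `t_statistics` and the toy data are made up for illustration. Two-sided p-values would then follow from a t distribution with n - k - 1 degrees of freedom, for instance via `scipy.stats.t.sf`.

```python
import numpy as np

def t_statistics(X, y):
    """t-values for [intercept, b1, ..., bk] of an OLS fit (hypothetical helper)."""
    n, k = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    # Unbiased estimate of the error variance.
    sigma2 = residuals @ residuals / (n - k - 1)
    # Covariance matrix of the coefficient estimates.
    cov_beta = sigma2 * np.linalg.inv(X_design.T @ X_design)
    std_err = np.sqrt(np.diag(cov_beta))
    return beta / std_err

# Toy data (assumed): only the first feature influences y.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(500, 2))
y_toy = 1.0 + 4.0 * X_toy[:, 0] + rng.normal(size=500)

t_toy = t_statistics(X_toy, y_toy)
for name, t in zip(["Intercept", "x1", "x2"], t_toy):
    print(f"{name:<10} {t:>10.4f}")
```

In this toy setup the feature that drives y should get a very large t-value, while the irrelevant feature should get a small one, mirroring the contrast between median_income and ocean_NEAR BAY above.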