# California Housing Prices: Linear Model

## Problem Statement

Predicting house prices in California is an important task for understanding the real estate market. It helps buyers, sellers, and policymakers make better decisions. This project develops a predictive model that estimates California house prices from features such as housing age, income, and room counts.

The project uses two approaches:
- **Using scikit-learn's Linear Regression**: Train a machine learning model to learn patterns from the data and predict house prices.
- **Using the Regression Formula**: Manually compute the predictions using the normal equation for linear regression to verify the results.

Finally, we will compare and evaluate the performance of both methods to ensure accuracy and consistency.

## Goal

The objective of this project is to provide an understanding and practical implementation of linear regression in machine learning. It will cover key concepts such as matrix multiplication, the least squares method, and more. By completing this project, you will gain hands-on experience with using scikit-learn's linear regression module, as well as writing custom code to calculate linear regression manually.

## Mathematical Background

**Linear Regression Overview**

Multiple linear regression models the relationship between a dependent variable and several relevant independent variables (factors) that jointly affect it. It provides valuable insights for prediction and data analysis. The objective is to find the best-fitting line (or hyperplane), that is, the one that minimizes the error made when predicting the dependent variable from the independent variables.

**Equation of Linear Regression**

The linear regression model can be written as:

$$y = X\beta + \epsilon \tag{1}$$

The model's prediction drops the error term: $\hat{y} = X\beta$.

Where:
- $y$: Observed values of the dependent variable
- $\hat{y}$: Predicted values
- $X$: Matrix of input features
- $\beta$: Coefficients (weights)
- $\epsilon$: Error term

Any equation of the above form is referred to as a linear model.
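As a small illustration of the linear model above, the sketch below prepends a column of ones to a toy feature matrix (so the first coefficient acts as the intercept) and computes the prediction $\hat{y} = X\beta$. The feature values and coefficients are made up for illustration and are not taken from the housing dataset.

```python
import numpy as np

# Toy data (hypothetical values, not from the housing dataset):
# 3 samples with 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Prepend a column of ones so the first coefficient acts as the intercept.
X_b = np.c_[np.ones(X.shape[0]), X]

# Assumed coefficients: intercept 10, then one weight per feature.
beta = np.array([10.0, 2.0, -1.0])

# The prediction is a matrix-vector product: y_hat = X_b @ beta
y_hat = X_b @ beta
print(y_hat)
```

Each prediction is the intercept plus a weighted sum of that row's features, which is why the whole computation collapses into one matrix product.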

**Normal Equation (Least Squares Method)**

The optimal coefficients $\beta$ are computed by minimizing the sum of squared errors. Using the normal equation, $\beta$ is calculated as:

$$\beta = (X^T X)^{-1} X^T y$$

Where:
- $X^T$: Transpose of $X$
- $(X^T X)^{-1}$: Inverse of $X^T X$
- $y$: Vector of observed target values
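For readers who want to see where the normal equation comes from, here is a short derivation sketch. It assumes that $X^T X$ is invertible, which holds when the columns of $X$ are linearly independent. The symbol $S(\beta)$ for the sum of squared errors is introduced here only for this derivation.

$$
\begin{aligned}
S(\beta) &= (y - X\beta)^T (y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta \\
\nabla_\beta S(\beta) &= -2 X^T y + 2 X^T X \beta = 0 \\
X^T X \beta &= X^T y \\
\beta &= (X^T X)^{-1} X^T y
\end{aligned}
$$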




## Basic imports

In [1]:
# Basic imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

## Load Dataset

In [2]:
file_path = 'dataset/housing.csv'
df = pd.read_csv(file_path)
print(df.head())

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [4]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

## Data Preprocessing

In [5]:
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())
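The median imputation used above can be checked on a tiny example. The sketch below uses a made-up column with one missing value; the name `s` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry (made-up values).
s = pd.Series([100.0, 200.0, np.nan, 400.0])

# Series.median() skips NaN by default, so the median is taken over 100, 200, 400.
median_value = s.median()

# Replace the missing entry with that median.
filled = s.fillna(median_value)
print(filled.tolist())
```

Note that the cell above computes the median over the full dataset before the train/test split. Computing it on the training split only would avoid using test rows during preprocessing, although with 207 missing values out of 20640 the effect is likely small.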

In [6]:
feature_cols = ["housing_median_age", "total_rooms", "total_bedrooms", "median_income"]
target_col = "median_house_value"

X = df[feature_cols]
y = df[target_col]

print("Rows in X: ", X.shape[0])
print("Rows in y: ", y.shape[0])

Rows in X:  20640
Rows in y:  20640


In [7]:
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]:
# Scale the features with StandardScaler (fit on the training split only)
print("Scaling Features...")
scaler = StandardScaler()
X_trained_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print('Scaling Complete.')

Scaling Features...
Scaling Complete.
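To make the scaling step concrete, the sketch below reproduces `StandardScaler` by hand on toy data: the mean and standard deviation are learned from the training rows only and then reused for the test rows. The arrays `X_train_toy` and `X_test_toy` are made-up examples, not the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test matrices (made-up values): 4 training rows, 2 test rows.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_toy = np.array([[2.5, 25.0], [5.0, 50.0]])

scaler_toy = StandardScaler()
X_train_toy_scaled = scaler_toy.fit_transform(X_train_toy)  # learns mean/std from train
X_test_toy_scaled = scaler_toy.transform(X_test_toy)        # reuses the train statistics

# Manual z-score using the training mean and population standard deviation.
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # ddof=0, which matches StandardScaler
manual_test = (X_test_toy - mu) / sigma

print(np.allclose(X_test_toy_scaled, manual_test))
print(X_train_toy_scaled.mean(axis=0), X_train_toy_scaled.std(axis=0))
```

Fitting the scaler on the training split alone keeps the test rows out of the statistics, so the test set remains a fair stand-in for unseen data.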


## Implementing Formula Based Linear Regression

In [9]:
# Prepend a column of ones so the first coefficient acts as the intercept (bias)
def add_bias(data):
    return np.c_[np.ones(data.shape[0]), data]

# Add the bias column to both the training and test features
X_train_bias = add_bias(X_trained_scaled)
X_test_bias = add_bias(X_test_scaled)

# Solve the normal equation: beta = (X^T X)^-1 X^T y
def find_beta(A, Y):
    A_t = A.T
    return np.linalg.inv(A_t @ A) @ A_t @ Y    # @ is matrix multiplication

beta = find_beta(X_train_bias, y_train)

# Predict on the test set with the manually computed coefficients
y_pred_manual = X_test_bias @ beta

# Evaluate
mse_manual = mean_squared_error(y_test, y_pred_manual)
r2_manual = r2_score(y_test, y_pred_manual)

print(f"Manual MSE: {mse_manual}")
print(f"Manual R²: {r2_manual}")

Manual MSE: 6356567901.161346
Manual R²: 0.5149172505140771
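The cell above forms the inverse of $X^T X$ explicitly, which mirrors the formula. A common alternative, sketched here on synthetic data, is to solve the linear system $(X^T X)\beta = X^T y$ directly or to call a least-squares solver, both of which avoid computing the inverse. The data and the names `X_toy`, `y_toy`, and `true_beta` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 200 samples, a bias column plus 3 features.
X_toy = np.c_[np.ones(200), rng.normal(size=(200, 3))]
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y_toy = X_toy @ true_beta + rng.normal(scale=0.1, size=200)

# Explicit inverse, as in the formula-based cell.
beta_inv = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Solve the normal equations (X^T X) beta = X^T y without forming the inverse.
beta_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Least-squares solver; it also copes with a rank-deficient X.
beta_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(np.allclose(beta_inv, beta_solve), np.allclose(beta_inv, beta_lstsq))
```

On well-conditioned data like this, all three routes should agree closely. The solver-based variants tend to matter more when features are nearly collinear.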


## Training scikit-learn Linear Regression Model 

In [10]:
# init
model = LinearRegression()

# fit the model
model.fit(X_trained_scaled, y_train)

# predict
y_pred_sklearn = model.predict(X_test_scaled)

# eval
mse_sklearn = mean_squared_error(y_test, y_pred_sklearn)
r2_sklearn = r2_score(y_test, y_pred_sklearn)

print(f"Sklearn MSE: {mse_sklearn}")
print(f"Sklearn R²: {r2_sklearn}")

Sklearn MSE: 6356567901.161352
Sklearn R²: 0.5149172505140767 
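Both evaluation metrics are simple enough to compute by hand, which can help when interpreting the numbers above. The sketch below uses made-up targets and predictions; `y_true` and `y_hat` are illustrative names only.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: the average of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot, the share of variance explained by the model.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

Because MSE is in squared units, its square root (RMSE) is often easier to read. For the MSE reported above, that works out to roughly 79,700 in the units of `median_house_value`.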


## Comparing Formula-Based and scikit-learn Models

In [11]:
# Compare the results from both approaches
print("\nComparison between Manual and Sklearn:")
print(f"Manual vs Sklearn MSE: {mse_manual} vs {mse_sklearn}")
print(f"Manual vs Sklearn R²: {r2_manual} vs {r2_sklearn}")


Comparison between Manual and Sklearn:
Manual vs Sklearn MSE: 6356567901.161346 vs 6356567901.161352
Manual vs Sklearn R²: 0.5149172505140771 vs 0.5149172505140767
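The two MSE values above differ only in their last printed digits, which is typical floating-point round-off rather than a real disagreement. A tolerance-based check such as `np.allclose` makes this comparison explicit. The sketch below runs it on synthetic data, so all variable names ending in `_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic regression data (illustrative only).
X_toy = rng.normal(size=(100, 4))
y_toy = X_toy @ np.array([3.0, -1.0, 0.5, 2.0]) + 7.0 + rng.normal(scale=0.5, size=100)

# Normal equation with an explicit bias column.
X_toy_bias = np.c_[np.ones(X_toy.shape[0]), X_toy]
beta_toy = np.linalg.inv(X_toy_bias.T @ X_toy_bias) @ X_toy_bias.T @ y_toy

# scikit-learn stores the intercept separately from the coefficients.
model_toy = LinearRegression().fit(X_toy, y_toy)

# Compare with a tolerance instead of exact equality.
same_intercept = np.isclose(beta_toy[0], model_toy.intercept_)
same_coefs = np.allclose(beta_toy[1:], model_toy.coef_)
same_preds = np.allclose(X_toy_bias @ beta_toy, model_toy.predict(X_toy))
print(same_intercept, same_coefs, same_preds)
```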


In [12]:
print("Feature Importance from scikit-learn:")
coefficients_sklearn = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient (scikit-learn)': model.coef_
})
print(coefficients_sklearn.sort_values(by='Coefficient (scikit-learn)', ascending=False))

Feature Importance from scikit-learn:
              Feature  Coefficient (scikit-learn)
3       median_income                94930.983691
2      total_bedrooms                68121.514977
0  housing_median_age                24497.741358
1         total_rooms               -57874.639635


In [13]:
print("Feature Importance from manual:")
coefficients_manual = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient (manual)': beta[1:]
})
print(coefficients_manual.sort_values(by='Coefficient (manual)', ascending=False))

Feature Importance from manual:
              Feature  Coefficient (manual)
3       median_income          94930.983691
2      total_bedrooms          68121.514977
0  housing_median_age          24497.741358
1         total_rooms         -57874.639635
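Because the models were fitted on standardized features, each coefficient above is the change in the prediction per one standard deviation of that feature, not per original unit. The sketch below shows one way to convert standardized coefficients back to original units using the scaler's stored `mean_` and `scale_`. It runs on synthetic data, and the names `X_raw`, `model_std`, and `model_raw` are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic features on very different scales (illustrative only).
X_raw = np.c_[rng.normal(50, 10, size=300), rng.normal(5000, 2000, size=300)]
y_toy = 4.0 * X_raw[:, 0] + 0.01 * X_raw[:, 1] + 20.0 + rng.normal(scale=1.0, size=300)

scaler_toy = StandardScaler()
X_std = scaler_toy.fit_transform(X_raw)
model_std = LinearRegression().fit(X_std, y_toy)

# Coefficients per original unit: divide by each feature's standard deviation.
coef_raw = model_std.coef_ / scaler_toy.scale_
# The intercept absorbs the shift introduced by subtracting the means.
intercept_raw = model_std.intercept_ - np.sum(
    model_std.coef_ * scaler_toy.mean_ / scaler_toy.scale_
)

# Fitting on the unscaled features directly should give the same numbers.
model_raw = LinearRegression().fit(X_raw, y_toy)
print(np.allclose(coef_raw, model_raw.coef_),
      np.isclose(intercept_raw, model_raw.intercept_))
```

Standardized coefficients are convenient for comparing features measured on different scales, while the converted ones answer "how much does the prediction change per unit of this feature".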


## Conclusion


In this project, we:
- Discussed the mathematical foundation of linear regression, including the normal equation and least squares method.
- Preprocessed the California Housing dataset by imputing missing `total_bedrooms` values with the median and standardizing the features.
- Implemented **formula-based linear regression** to calculate coefficients manually.
- Trained a **scikit-learn Linear Regression model** for comparison.
- Evaluated both models using MSE and R² metrics, confirming that their results agree to within floating-point precision.
- Compared the fitted coefficients on the standardized features to understand the impact of each variable.

This project bridges the gap between theory and practice in linear regression, providing a solid foundation for understanding more advanced machine learning models.