<b>Implementing Linear Regression from Scratch with the California Housing Dataset</b>

Objective:
This exercise provides hands-on experience implementing linear regression from scratch on the California housing dataset. You will gain a deeper understanding of how linear regression works, including the cost function and gradient descent optimization.

<b>Steps:</b>

1- Load the California Housing Dataset:

- Use the fetch_california_housing function from scikit-learn to load the dataset.

In [54]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [55]:
# Load the California housing dataset
housing = fetch_california_housing()
data, target = housing.data, housing.target

In [56]:
# explore the data
print(data.shape)
print(target.shape)
print(housing.DESCR)

(20640, 8)
(20640,)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This

2- Data Preprocessing:

- Add a bias term to the input features.
- Split the dataset into training and testing sets.

In [57]:
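# Add a bias term (a column of ones) to the input features.
# Note: StandardScaler in step 3 maps this constant column to zeros,
# so the intercept is effectively learned by the model's separate bias b.
data_bias = np.c_[np.ones((data.shape[0], 1)), data]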
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data_bias, target, test_size=0.2, random_state=42)

3- Standardization:

- Standardize the input features using StandardScaler from scikit-learn.

In [58]:
# Standardize the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
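To make the standardization step concrete, here is a minimal NumPy sketch of the same idea: subtract the training mean and divide by the training standard deviation, then reuse those statistics on the test set. The `standardize` helper and the toy data are hypothetical and only for illustration. The sketch assumes the population standard deviation and leaves constant columns unscaled, which is how `StandardScaler` is understood to behave.

```python
import numpy as np

def standardize(train, test):
    """Scale columns using statistics computed on the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)             # population standard deviation (ddof=0)
    std = np.where(std == 0, 1.0, std)  # constant columns: avoid dividing by zero
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
# Toy data: a constant bias column followed by two features on different scales
train = np.c_[np.ones((100, 1)), rng.normal(50, 10, 100), rng.normal(0, 0.1, 100)]
test = np.c_[np.ones((20, 1)), rng.normal(50, 10, 20), rng.normal(0, 0.1, 20)]

train_s, test_s = standardize(train, test)
print(train_s.mean(axis=0))         # approximately 0 for every column
print(train_s[:, 1:].std(axis=0))   # approximately 1 for the non-constant columns
print(np.all(train_s[:, 0] == 0))   # the constant bias column collapses to zeros
```

The last line shows why a column of ones added before scaling carries no information afterwards: the intercept then has to come from a separate bias parameter.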

4- Linear Regression Implementation:

- Implement a simple linear regression class with methods for fitting the model and making predictions.
- Use mean squared error as the cost function.
- Utilize gradient descent for optimization.
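The cost function and the gradient descent updates used in this step can be written out as follows. The notation is an assumption introduced here: n is the number of samples, α the learning rate, and J the cost. The factor 1/2 is a common convention that makes the gradients come out with a 1/n factor, matching the code in this step. With the plain mean squared error the gradients would simply be twice as large.

```latex
\begin{aligned}
\hat{y}_i &= \mathbf{w}^\top \mathbf{x}_i + b \\
J(\mathbf{w}, b) &= \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \\
\frac{\partial J}{\partial \mathbf{w}} &= \frac{1}{n} X^\top (\hat{\mathbf{y}} - \mathbf{y}) \\
\frac{\partial J}{\partial b} &= \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \\
\mathbf{w} &\leftarrow \mathbf{w} - \alpha \frac{\partial J}{\partial \mathbf{w}}, \qquad
b \leftarrow b - \alpha \frac{\partial J}{\partial b}
\end{aligned}
```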

In [59]:
# Linear regression implementation from scratch

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        # Hyperparameters and model parameters
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.w = None  # weight vector, one entry per feature
        self.b = None  # bias (intercept)

    def fit(self, X, y):
        num_samples, num_features = X.shape
        self.w = np.zeros(num_features)  # one weight per feature (column of X), initialized to zero
        self.b = 0  # the bias is a single scalar, initialized to zero

        for _ in range(self.n_iterations):
            # Current predictions with the current weights and bias
            y_pred = np.dot(X, self.w) + self.b

            # Gradients of the cost (1 / (2 * num_samples)) * sum((y_pred - y)**2) with respect to w and b
            dw = (1/num_samples) * np.dot(X.T, (y_pred - y))
            db = (1/num_samples) * np.sum(y_pred - y)

            # Gradient descent step: move w and b against the gradient to reduce the cost
            self.w -= self.learning_rate * dw
            self.b -= self.learning_rate * db

        return self

    def predict(self, X):
        # Predict y values for the rows of X
        return np.dot(X, self.w) + self.b
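One way to check an implementation like this is to record the cost at every iteration and compare the final parameters with a closed-form least-squares solution. The sketch below does this on synthetic data. The `gradient_descent` function, the `history` list, and the data are hypothetical and stand in for the class above, using the same update rule.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=500):
    """Same update rule as the class above, but also records the MSE per iteration."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    history = []
    for _ in range(n_iterations):
        residual = X @ w + b - y
        history.append(np.mean(residual ** 2))   # cost before this update
        w -= learning_rate * (X.T @ residual) / n
        b -= learning_rate * residual.mean()
    return w, b, history

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

w, b, history = gradient_descent(X, y)

# Closed-form least-squares reference (last coefficient is the intercept)
A = np.c_[X, np.ones(len(X))]
solution, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"cost: {history[0]:.4f} -> {history[-1]:.4f}")
print("matches least squares:", np.allclose(np.r_[w, b], solution, atol=1e-3))
```

If the recorded cost grows instead of shrinking, the learning rate is too large for the data.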

5- Training the Model:

- Instantiate the linear regression model.
- Train the model on the training set using the implemented gradient descent algorithm.

In [60]:
# Instantiate and train the model
model = LinearRegression(learning_rate=1, n_iterations=1000)
model.fit(X_train_scaled, y_train)

<__main__.LinearRegression at 0x16e170ec3d0>
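The learning rate passed to the model largely determines whether gradient descent converges. For a quadratic cost such as the one used here, the iterations are stable only when the learning rate is below 2 / λ_max, where λ_max is the largest eigenvalue of the cost's Hessian. The sketch below illustrates this on synthetic, strongly correlated features. The `final_cost` helper and the data are hypothetical and are not the housing data.

```python
import numpy as np

def final_cost(X, y, learning_rate, n_iterations=200):
    """Run gradient descent from zero and return the final MSE."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iterations):
        residual = X @ w + b - y
        w -= learning_rate * (X.T @ residual) / n
        b -= learning_rate * residual.mean()
    return np.mean((X @ w + b - y) ** 2)

rng = np.random.default_rng(0)
# Two strongly correlated features, standardized column by column
z = rng.normal(size=500)
X = np.c_[z, z + 0.3 * rng.normal(size=500)]
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Hessian of the half-MSE cost, with a column of ones standing in for the bias
A = np.c_[X, np.ones(len(X))]
lam_max = np.linalg.eigvalsh(A.T @ A / len(A)).max()
threshold = 2 / lam_max
print(f"stability threshold for the learning rate: {threshold:.3f}")

stable_cost = final_cost(X, y, 0.5 * threshold)    # below the threshold
unstable_cost = final_cost(X, y, 1.5 * threshold)  # above the threshold
print(f"below threshold: {stable_cost:.4f}")
print(f"above threshold: {unstable_cost:.3e}")
```

Correlated features raise λ_max and therefore lower the largest usable learning rate, even after standardization.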

6- Prediction and Evaluation:

- Make predictions on the test set.
- Evaluate the model's performance using mean squared error.

In [64]:
# Make predictions on the test set
predictions = model.predict(X_test_scaled)

# Evaluate the model
mse = np.mean((predictions - y_test)**2)
print(f"Mean Squared Error on Test Set: {mse}")

803163672.2826726
428836664.45644003
Mean Squared Error on Test Set: 8.580387824473165e+18


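The reported Mean Squared Error (MSE) is about 8.58 × 10^18, not 8. The MSE is the average of the squared differences between the model's predictions and the actual values on the test set, so lower is better. Because the target is expressed in units of $100,000 and takes values of at most a few units, an MSE of this magnitude indicates that gradient descent diverged rather than converged, most likely because learning_rate=1 is too large for this data. Retraining with a smaller learning rate, such as the class default of 0.01, should give a meaningful result.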
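An MSE value is easier to judge next to a baseline and in the units of the target. The sketch below computes the MSE, its square root (RMSE), the MSE of always predicting the mean, and the resulting R² score. The `regression_report` helper and the four toy values are hypothetical and are not results from the housing model.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MSE, RMSE, the mean-predictor baseline MSE, and R^2."""
    mse = np.mean((y_pred - y_true) ** 2)
    baseline_mse = np.mean((y_true - y_true.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),          # same units as the target
        "baseline_mse": baseline_mse,  # error of always predicting the mean
        "r2": 1 - mse / baseline_mse,  # 1 is perfect, 0 matches the baseline
    }

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
report = regression_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

An MSE far above the baseline, or a strongly negative R², suggests that training went wrong rather than that the model is merely imprecise.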