<b> Implementing Linear Regression from Scratch with California Housing Dataset </b>

Objective:
This exercise provides hands-on experience implementing linear regression from scratch on the California housing dataset. You will gain a deeper understanding of how linear regression works, including the cost function and gradient descent optimization.

<b>Steps:</b>

1- Load the California Housing Dataset:

- Use the fetch_california_housing function from scikit-learn to load the dataset.

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the California housing dataset
housing = fetch_california_housing()
data, target = housing.data, housing.target

In [3]:
# Explore the data: feature matrix shape, target shape, and dataset description
print(data.shape)
print(target.shape)
print(housing.DESCR)

(20640, 8)
(20640,)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

2- Data Preprocessing:

- Add a bias term to the input features.
- Split the dataset into training and testing sets.

In [4]:
# Add a bias term to the input features
data_bias = np.c_[np.ones((data.shape[0], 1)), data]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data_bias, target, test_size=0.2, random_state=42)
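
In the from-scratch spirit of this exercise, the split itself can be sketched with NumPy alone. This is a minimal illustration on toy data, assuming a plain shuffle-and-slice strategy; `manual_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the exact partition that `train_test_split(..., random_state=42)` produces.

```python
import numpy as np

def manual_train_test_split(X, y, test_size=0.2, seed=42):
    """Hypothetical helper: shuffle row indices, then slice off the test fraction."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    n_test = int(round(test_size * X.shape[0]))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data standing in for the housing features (10 samples, 2 features)
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

# Prepend a column of ones as the bias term, as in the cell above
X_toy_bias = np.c_[np.ones((X_toy.shape[0], 1)), X_toy]

X_tr, X_te, y_tr, y_te = manual_train_test_split(X_toy_bias, y_toy)
print(X_tr.shape, X_te.shape)  # (8, 3) (2, 3)
```

Rows of the features and the target are indexed with the same shuffled indices, so each sample stays paired with its label.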

3- Standardization:

- Standardize the input features using StandardScaler from scikit-learn.

In [5]:
# Standardize the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
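
One detail worth checking here: `StandardScaler` centers every column, so the column of ones added in step 2 becomes all zeros after scaling, and the model then effectively trains without an intercept. This may be part of why the MSE reported later stays high. The sketch below illustrates the effect on synthetic data and shows one possible alternative order (scale first, then add the bias). `standardize` is a hypothetical NumPy stand-in that assumes zero-variance columns are left with a scale of 1; it is not scikit-learn's code.

```python
import numpy as np

def standardize(X_fit, X_apply):
    """Hypothetical stand-in: column-wise (x - mean) / std using X_fit's statistics.
    Zero-variance columns keep a scale of 1 to avoid division by zero (assumption)."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    std[std == 0.0] = 1.0
    return (X_apply - mean) / std

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
ones = np.ones((X.shape[0], 1))

# Order used above: bias column first, then scaling.
# The constant column is centered to zero, so it no longer acts as a bias.
X_bias_then_scale = standardize(np.c_[ones, X], np.c_[ones, X])
bias_col_after_scaling = X_bias_then_scale[:, 0]

# Alternative order: scale the raw features, then prepend the bias column.
X_scale_then_bias = np.c_[ones, standardize(X, X)]
bias_col_preserved = X_scale_then_bias[:, 0]

print(bias_col_after_scaling[:3], bias_col_preserved[:3])
```

With the second ordering, the first weight can learn the mean of the target, which the zeroed column cannot do.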

4- Linear Regression Implementation:

- Implement a simple linear regression class with methods for fitting the model and making predictions.
- Use mean squared error as the cost function.
- Utilize gradient descent for optimization.
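
The cost function and update rule named in these bullets can be written out as follows. The notation is an assumption of this sketch rather than the notebook's own: $X$ is the $n \times d$ design matrix (including the bias column), $\mathbf{y}$ the target vector, $\mathbf{w}$ the weight vector, and $\alpha$ the learning rate.

```latex
J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( \mathbf{x}_i^{\top} \mathbf{w} - y_i \right)^2
              = \frac{1}{n} \lVert X\mathbf{w} - \mathbf{y} \rVert^2

\nabla_{\mathbf{w}} J(\mathbf{w}) = \frac{2}{n} X^{\top} \left( X\mathbf{w} - \mathbf{y} \right)

\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \nabla_{\mathbf{w}} J(\mathbf{w})
```

The implementation in this step uses $\frac{1}{n} X^{\top}(X\mathbf{w} - \mathbf{y})$ in the update, which is the gradient of $J/2$. This is equivalent to running the rule above with a learning rate of $\alpha/2$.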

In [6]:
# Linear regression implementation from scratch
class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        # Model weights (parameters); None until fit() is called
        self.weights = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)

        # Perform batch gradient descent
        for i in range(1, self.n_iterations + 1):
            # Calculate predictions as the dot product of X and weights
            y_predicted = np.dot(X, self.weights)

            # Calculate and display the MSE at each iteration
            mse = (1/n_samples) * np.sum((y_predicted - y)**2)
            print(f"Iteration {i}: MSE = {mse}")

            # Gradient of half the MSE; the factor of 2 from the full
            # MSE gradient is absorbed into the learning rate
            gradients = (1/n_samples) * np.dot(X.T, (y_predicted - y))

            # Update weights by stepping against the gradient
            self.weights -= self.learning_rate * gradients

    def predict(self, X):
        return np.dot(X, self.weights)
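
A quick way to sanity-check a gradient descent implementation like this one is to compare its result with a direct least-squares solution on data where the answer is known. The sketch below is a self-contained illustration on synthetic data. `gradient_descent` is a hypothetical function that mirrors the update rule of the class above but records the MSE in a list instead of printing it, and the learning rate, iteration count, and true weights are arbitrary choices for the example.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_iterations=2000):
    """Batch gradient descent using the same update rule as the class above."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    history = []  # MSE before each update
    for _ in range(n_iterations):
        residuals = X @ weights - y
        history.append(np.mean(residuals ** 2))
        weights -= learning_rate * (X.T @ residuals) / n_samples
    return weights, history

# Synthetic regression problem with a bias column and known weights
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
X_syn = np.c_[np.ones((200, 1)), features]
true_weights = np.array([2.0, 1.0, -0.5, 0.3])
y_syn = X_syn @ true_weights + rng.normal(scale=0.1, size=200)

w_gd, mse_history = gradient_descent(X_syn, y_syn)

# Reference solution from NumPy's least-squares solver
w_ls, *_ = np.linalg.lstsq(X_syn, y_syn, rcond=None)

print("max |w_gd - w_ls| =", np.max(np.abs(w_gd - w_ls)))
print("first / last MSE  =", mse_history[0], mse_history[-1])
```

If the two weight vectors disagree noticeably, the usual suspects are a learning rate that is too large or too small, too few iterations, or features on very different scales.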


5- Training the Model:

- Instantiate the linear regression model.
- Train the model on the training set using the implemented gradient descent algorithm.

In [7]:
# Instantiate and train the model
model = LinearRegression(learning_rate=0.01, n_iterations=1000)
model.fit(X_train_scaled, y_train)

Iteration 1: MSE = 5.629742323103131
Iteration 2: MSE = 5.61540311841656
Iteration 3: MSE = 5.6013728770162485
Iteration 4: MSE = 5.587644433125449
Iteration 5: MSE = 5.574210800439958
Iteration 6: MSE = 5.561065167221957
Iteration 7: MSE = 5.54820089154027
Iteration 8: MSE = 5.535611496652303
Iteration 9: MSE = 5.523290666523121
Iteration 10: MSE = 5.511232241477293
Iteration 11: MSE = 5.49943021397921
Iteration 12: MSE = 5.487878724537815
Iteration 13: MSE = 5.476572057731778
Iteration 14: MSE = 5.465504638351247
Iteration 15: MSE = 5.454671027652534
Iteration 16: MSE = 5.444065919722109
Iteration 17: MSE = 5.433684137946476
Iteration 18: MSE = 5.423520631584588
Iteration 19: MSE = 5.413570472439559
Iteration 20: MSE = 5.403828851626576
Iteration 21: MSE = 5.39429107643398
Iteration 22: MSE = 5.384952567274606
Iteration 23: MSE = 5.3758088547245695
Iteration 24: MSE = 5.36685557664678
Iteration 25: MSE = 5.358088475396529
Iteration 26: MSE = 5.349503395106637
Iteration 27: MSE = 5.34

6- Prediction and Evaluation:

- Make predictions on the test set.
- Evaluate the model's performance using mean squared error.

In [8]:
# Make predictions on the test set
predictions = model.predict(X_test_scaled)

# Evaluate the model
mse = np.mean((predictions - y_test)**2)
print(f"Mean Squared Error on Test Set: {mse}")

Mean Squared Error on Test Set: 4.8678876270625615
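
A raw MSE is easier to interpret next to a baseline. One common choice is a model that always predicts the mean of the training targets, reported with the RMSE and the coefficient of determination R². The sketch below uses small toy arrays rather than the notebook's variables; `mse` and `r2` are hypothetical helpers, and `y_train_mean_toy` is a placeholder for `y_train.mean()`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_pred - y_true) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy values standing in for y_test and predictions
y_true_toy = np.array([1.0, 2.0, 3.0])
y_pred_toy = np.array([1.0, 2.0, 4.0])
y_train_mean_toy = 2.5  # placeholder for y_train.mean()

model_mse = mse(y_true_toy, y_pred_toy)
baseline_mse = mse(y_true_toy, np.full_like(y_true_toy, y_train_mean_toy))
model_rmse = float(np.sqrt(model_mse))
model_r2 = r2(y_true_toy, y_pred_toy)

print(f"model MSE={model_mse:.4f}, baseline MSE={baseline_mse:.4f}")
print(f"model RMSE={model_rmse:.4f}, R^2={model_r2:.4f}")
```

In the notebook, the same helpers would be called with `y_test` and `predictions`. A model MSE close to or above the baseline suggests the fit adds little over predicting a constant.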
