<a href="https://colab.research.google.com/github/ExplorerGumel/Linear-Regression-from-scratch/blob/main/Linear_Regression_from_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LINEAR REGRESSION ALGORITHM FROM SCRATCH




Welcome to this notebook where we delve into the intricacies of the **Linear Regression** model without relying on external libraries like Scikit-learn. While Scikit-learn is undoubtedly a powerful tool for quick implementation, understanding the inner workings of the models we use is crucial. It allows us to grasp the underlying assumptions, limitations, and nuances of the algorithm, enabling us to make informed decisions during the modeling process. By comprehensively understanding the core principles of Linear Regression, including the concepts of slope, intercept, and the impact of outliers, we can fine-tune its parameters and identify scenarios where it excels or requires adjustments.

Linear Regression has a rich history in statistics, tracing its roots back to the early 19th century. Its versatility and simplicity have contributed to its widespread adoption in a multitude of fields, ranging from social sciences to business analytics. This model serves as the cornerstone of predictive analytics, providing a solid foundation for more complex machine learning techniques. In this notebook, we'll unravel the fundamental concepts behind this powerful tool, demystifying its implementation step by step.

Linear Regression operates on the principle of fitting a straight line to a set of data points, allowing us to understand the impact of one variable on another. By estimating the coefficients of the line, namely the slope and the intercept, it enables us to make predictions and draw insights from data.

<img src="/content/linear-regression.png" style="width:400px;">



The task of the linear regression model is to fit a line to the data points as closely as possible: it adjusts the weights and bias so that the differences between the data points and the line of best fit are as small as possible.

You can describe a simple linear regression model as

$$\hat{y} = wx + b,\tag{1}$$

where $\hat{y}$ is a prediction of dependent variable $y$ based on independent variable $x$ using a line equation with the slope $w$ and intercept $b$.

Given a set of training data points $(x_1, y_1)$, ..., $(x_m, y_m)$, you will find the "best" fitting line: the parameters $w$ and $b$ for which the differences between the original values $y_i$ and the predicted values $\hat{y}_i = wx_i + b$ are as small as possible.
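As a quick illustration of equation (1), the sketch below evaluates $\hat{y}_i = wx_i + b$ on a few made-up points and prints the differences that the fit tries to shrink. The arrays and the values of `w` and `b` are invented for illustration and are not taken from the dataset used later.

```python
import numpy as np

# Hypothetical toy data: four (x, y) points lying close to y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# A candidate line; the slope and intercept are arbitrary guesses.
w, b = 2.0, 1.0

y_hat = w * x + b          # predictions from equation (1)
residuals = y - y_hat      # differences the fit tries to make small

print(y_hat)
print(residuals)
```

A different choice of `w` and `b` gives different residuals; training is the search for the pair that makes them smallest overall.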

Linear Regression achieves this through the following steps:

1. Initialize the Model's Parameters:
In the initial stage, the model randomly assigns values to the weights and bias, setting the foundation for the subsequent computations.

2. Implement Forward Propagation (Calculate the Perceptron Output):
Building upon the initialized parameters, the model engages in forward propagation, estimating the predicted output $\hat{Y}$ using the equation:

$$\hat{Y} = WX + b,$$

where $W$ represents the weights, $X$ denotes the input, and $b$ signifies the bias.

3. Implement Backward Propagation (Calculate Required Corrections for the Parameters):
This phase involves computing the gradients and derivatives to determine the necessary adjustments for the parameters. It aids in understanding how changes in the input variables influence the output and assists in refining the model's predictive accuracy.
To correct the parameters $W$ and $b$ from their randomly initialized values, we measure the difference between the predictions obtained from **forward propagation** and the dataset's $Y$ values:

$$\mathcal{L}\left(w, b\right)  = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$

This function is called the sum of squares **cost function**. The aim is to optimize the cost function during the training, which will minimize the differences between original values $y^{(i)}$ and predicted values $\hat{y}^{(i)}$.

When your weights were just initialized with some random values, and no training was done yet, you can't expect good results. You need to calculate the adjustments for the weight and bias, minimizing the cost function. This process is called **backward propagation**.

According to the gradient descent algorithm, you can calculate partial derivatives as:

\begin{align}
\frac{\partial \mathcal{L} }{ \partial w } &=
\frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)x^{(i)},\\
\frac{\partial \mathcal{L} }{ \partial b } &=
\frac{1}{m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).
\end{align}


4. Update Parameters:
Based on the corrections obtained from the backward propagation, the model updates the weights and bias, fine-tuning the relationship between the input and output. This iterative process continues until the model converges to an optimal solution.

\begin{align}
w &= w - \alpha \frac{\partial \mathcal{L} }{ \partial w },\\
b &= b - \alpha \frac{\partial \mathcal{L} }{ \partial b },
\end{align}

where $\alpha$ is the learning rate. Then repeat the process until the cost function stops decreasing.


5. Make Predictions:
With the refined parameters, the model becomes adept at making predictions on new data points, enabling us to infer outcomes based on the learned relationships between the variables.
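The five steps above can be sketched end to end for a single feature. This is a minimal illustration under stated assumptions: the data are synthetic and noise-free, the parameters start at zero rather than at random values, and the learning rate and iteration count are picked by hand for this toy problem.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 3x - 2.
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x - 2.0

w, b = 0.0, 0.0            # step 1: initialize (zeros here instead of random values)
alpha = 0.5                # learning rate, chosen by hand for this toy problem
m = len(x)

for _ in range(200):
    y_hat = w * x + b                  # step 2: forward propagation
    error = y_hat - y
    dw = np.sum(error * x) / m         # step 3: dL/dw
    db = np.sum(error) / m             #         dL/db
    w -= alpha * dw                    # step 4: update the parameters
    b -= alpha * db

# Step 5: the learned w and b can now be used to predict new points.
print(w, b)
print(w * 0.25 + b)
```

Because the toy data lie exactly on a line, the learned parameters end up very close to the slope and intercept used to generate them.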



By comprehending these fundamental steps, we gain deeper insight into how Linear Regression functions, and we are going to explore each of these steps in turn.

Alright, let's get going

In [None]:
#importing the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r"D:\ML DATASETS\MY2022 Fuel Consumption Ratings.csv")
data.head()

Unnamed: 0,Model Year,Make,Model,Vehicle Class,Engine Size(L),Cylinders,Transmission,Fuel Type,Fuel Consumption (City (L/100 km),Fuel Consumption(Hwy (L/100 km)),Fuel Consumption(Comb (L/100 km)),Fuel Consumption(Comb (mpg)),CO2 Emissions(g/km),CO2 Rating,Smog Rating
0,2022,Acura,ILX,Compact,2.4,4,AM8,Z,9.9,7.0,8.6,33,200,6,3
1,2022,Acura,MDX SH-AWD,SUV: Small,3.5,6,AS10,Z,12.6,9.4,11.2,25,263,4,5
2,2022,Acura,RDX SH-AWD,SUV: Small,2.0,4,AS10,Z,11.0,8.6,9.9,29,232,5,6
3,2022,Acura,RDX SH-AWD A-SPEC,SUV: Small,2.0,4,AS10,Z,11.3,9.1,10.3,27,242,5,6
4,2022,Acura,TLX SH-AWD,Compact,2.0,4,AS10,Z,11.2,8.0,9.8,29,230,5,7


In [None]:
data.shape

(946, 15)

As observed, the dataset comprises 946 samples with 15 columns. Since this notebook primarily aims to showcase the step-by-step implementation, we will predict **Fuel Consumption(Comb (L/100 km))** from three features that correlate strongly with it: **Engine Size(L)**, **Cylinders**, and **CO2 Emissions(g/km)**.

In [None]:
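# numeric_only=True skips text columns such as Make and Model
# (in pandas 2.0 and later, corr() raises an error on them otherwise)
data.corr(numeric_only=True)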

Unnamed: 0,Model Year,Engine Size(L),Cylinders,Fuel Consumption (City (L/100 km),Fuel Consumption(Hwy (L/100 km)),Fuel Consumption(Comb (L/100 km)),Fuel Consumption(Comb (mpg)),CO2 Emissions(g/km),CO2 Rating,Smog Rating
Model Year,,,,,,,,,,
Engine Size(L),,1.0,0.920698,0.834925,0.749374,0.818694,-0.704163,0.824188,-0.766333,-0.448239
Cylinders,,0.920698,1.0,0.845688,0.737652,0.821718,-0.693594,0.833241,-0.762157,-0.502149
Fuel Consumption (City (L/100 km),,0.834925,0.845688,1.0,0.92285,0.990321,-0.909477,0.965632,-0.920524,-0.523928
Fuel Consumption(Hwy (L/100 km)),,0.749374,0.737652,0.92285,1.0,0.967138,-0.877531,0.933991,-0.894668,-0.402099
Fuel Consumption(Comb (L/100 km)),,0.818694,0.821718,0.990321,0.967138,1.0,-0.914305,0.971671,-0.927705,-0.490473
Fuel Consumption(Comb (mpg)),,-0.704163,-0.693594,-0.909477,-0.877531,-0.914305,1.0,-0.913019,0.949561,0.47399
CO2 Emissions(g/km),,0.824188,0.833241,0.965632,0.933991,0.971671,-0.913019,1.0,-0.954593,-0.520437
CO2 Rating,,-0.766333,-0.762157,-0.920524,-0.894668,-0.927705,0.949561,-0.954593,1.0,0.502625
Smog Rating,,-0.448239,-0.502149,-0.523928,-0.402099,-0.490473,0.47399,-0.520437,0.502625,1.0
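A correlation table like the one above is easier to read when the column for the target is sorted by absolute value. The sketch below shows one way to do that on a made-up miniature table; the column names `a`, `b`, `c` and `target` are placeholders, not columns of the fuel consumption dataset.

```python
import pandas as pd

# Hypothetical miniature table; the column names are placeholders.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [5.0, 3.0, 4.0, 1.0, 2.0],
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],
    "target": [1.1, 2.0, 2.9, 4.2, 5.0],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = df.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not overlooked.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Taking the absolute value matters here because column `b` is negatively correlated with the target yet still ranks above `c`.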


In [None]:
train = data.copy()
train = train[['Engine Size(L)','Cylinders','CO2 Emissions(g/km)','Fuel Consumption(Comb (L/100 km))']]
train.head()

Unnamed: 0,Engine Size(L),Cylinders,CO2 Emissions(g/km),Fuel Consumption(Comb (L/100 km))
0,2.4,4,200,8.6
1,3.5,6,263,11.2
2,2.0,4,232,9.9
3,2.0,4,242,10.3
4,2.0,4,230,9.8


In [None]:
# Let's explore the descriptive statistics

train.describe()

Unnamed: 0,Engine Size(L),Cylinders,CO2 Emissions(g/km),Fuel Consumption(Comb (L/100 km))
count,946.0,946.0,946.0,946.0
mean,3.198732,5.668076,259.172304,11.092072
std,1.374814,1.93267,64.443149,2.876276
min,1.2,3.0,94.0,4.0
25%,2.0,4.0,213.25,9.1
50%,3.0,6.0,257.0,10.8
75%,3.8,6.0,300.75,12.9
max,8.0,16.0,608.0,26.1


Based on the information presented in the descriptive statistics table, it is apparent that while the minimum and maximum values of **Engine Size (L), Cylinders, and Fuel Consumption (Comb (L/100 km))** are within a similar range, the values of **CO2 Emissions (g/km)** deviate significantly from them. This discrepancy might have an adverse impact on the model, as there is a possibility that **CO2 Emissions (g/km)** could disproportionately influence the model, overshadowing the contribution of other features. Therefore, it is advisable to standardize the features, and for this purpose, Z-score scaling will be employed.

\begin{align}
Z &= \frac{X - \mu}{\sigma}
\end{align}

where $X$ is a variable, $\mu$ is the mean of $X$, and $\sigma$ is the standard deviation of $X$.

In [None]:
train_scaled = (train - train.mean())/train.std()
train_scaled.corr()

Unnamed: 0,Engine Size(L),Cylinders,CO2 Emissions(g/km),Fuel Consumption(Comb (L/100 km))
Engine Size(L),1.0,0.920698,0.824188,0.818694
Cylinders,0.920698,1.0,0.833241,0.821718
CO2 Emissions(g/km),0.824188,0.833241,1.0,0.971671
Fuel Consumption(Comb (L/100 km)),0.818694,0.821718,0.971671,1.0
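The correlations above are identical to those of the unscaled columns, which is expected: Z-score scaling shifts and rescales each column but does not change how the columns move together. The sketch below checks this, along with the zero mean and unit standard deviation, on made-up data with two very different scales. The column names and distributions are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical two-column frame with very different scales.
df = pd.DataFrame({
    "small": rng.normal(3.0, 1.5, size=200),
    "large": rng.normal(250.0, 60.0, size=200),
})

scaled = (df - df.mean()) / df.std()   # same expression as in the cell above

print(scaled.mean().round(6))          # approximately 0 for each column
print(scaled.std())                    # 1 for each column (pandas uses ddof=1)
print(np.allclose(df.corr(), scaled.corr()))
```

Note that `DataFrame.std` divides by $m - 1$ by default, so the scaled columns have a sample standard deviation of exactly 1 rather than a population standard deviation of 1.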


In [None]:
X = train_scaled.drop('Fuel Consumption(Comb (L/100 km))', axis=1)
Y = train_scaled['Fuel Consumption(Comb (L/100 km))']
#Y = np.expand_dims(Y, axis=1)
print(X.shape, Y.shape)

(946, 3) (946,)


Having standardized our features and assigned the variables X and Y, it is clear that we can now delve into the primary focus of the session, which is the implementation of Linear Regression from scratch.

<a name='3.1'></a>
### 1. Initialize the Model's Parameters:



In [None]:
def initialize_parameters(n_features):
        """
        Initialize model parameters.

        Parameters:
            n_features (int): The number of features in the input data.
        """
        W = np.random.randn(n_features) * 0.01
        b = 0
        parameters = {"W": W,
                  "b": b}

        return parameters

parameters = initialize_parameters(X.shape[1])
parameters

{'W': array([-0.00703544,  0.0001437 ,  0.00154541]), 'b': 0}
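Because the weights are drawn at random, rerunning the cell gives different starting values. If reproducible runs are wanted, one option is to seed the generator. The function below is a hypothetical variant written for illustration (the name `initialize_parameters_seeded` and the `seed` argument are not part of the notebook), using NumPy's `default_rng`.

```python
import numpy as np

def initialize_parameters_seeded(n_features, seed=0):
    """Hypothetical variant of initialize_parameters with a fixed seed."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal(n_features) * 0.01   # small random weights
    b = 0.0                                      # bias starts at zero
    return {"W": W, "b": b}

first = initialize_parameters_seeded(3, seed=42)
second = initialize_parameters_seeded(3, seed=42)

# Two calls with the same seed produce identical weights.
print(np.array_equal(first["W"], second["W"]))
```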

### 2.  Implement Forward Propagation

In [None]:
def forward_propagation(features,params):
        """
        Compute the forward pass of the linear regression model.

        Parameters:
            features (numpy.ndarray): Input data of shape (m, n_features).
            params (dict): Parameters "W" (weights) and "b" (bias).

        Returns:
            numpy.ndarray: Predictions of shape (m,).
        """
        W = params['W']
        b = params['b']
        return np.dot(features, W) + b

Y_hat = forward_propagation(X,parameters)
Y_hat[0:10]

array([ 0.00254437, -0.00142523,  0.00535872,  0.00559853,  0.00531075,
        0.00533474,  0.00096559,  0.00108549,  0.00471123,  0.004999  ])
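The single `np.dot` call hides a sum over features for every sample. To make that explicit, the sketch below compares the vectorized expression with the same computation written out one sample at a time. The inputs are small made-up arrays, not the notebook's data.

```python
import numpy as np

# Hypothetical inputs: 4 samples, 3 features.
features = np.array([[1.0, 2.0, 3.0],
                     [0.0, 1.0, 0.0],
                     [2.0, 0.0, 1.0],
                     [1.0, 1.0, 1.0]])
W = np.array([0.5, -1.0, 2.0])
b = 0.1

# Vectorized form, as used in forward_propagation.
vectorized = np.dot(features, W) + b

# Same computation written out one sample and one feature at a time.
looped = np.array([sum(w_j * x_j for w_j, x_j in zip(W, row)) + b
                   for row in features])

print(vectorized)
print(np.allclose(vectorized, looped))
```

The two results agree; the vectorized form is simply shorter and much faster on large inputs.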

In [None]:
def compute_cost(predictions, true):
        """
        Compute the mean squared error cost.

        Parameters:
            predictions (numpy.ndarray): Predictions of shape (m,).
            true (numpy.ndarray): Target values of shape (m,).

        Returns:
            float: Mean squared error cost.
        """
        m = len(predictions)
        cost = np.sum(np.square(predictions - true)) / (2 * m)
        return cost

cost = compute_cost(Y_hat, Y)

cost

0.5036234716903147
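An initial cost of roughly 0.5 is what one would expect here. Assuming the target was standardized with the pandas default sample standard deviation (`ddof=1`), predictions of exactly zero give a cost of $(m-1)/(2m)$, which is about 0.4995 for $m = 946$; the small random weights move the value slightly away from that. The sketch below checks the identity on made-up target values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
m = 946
# Hypothetical target values, standardized the same way as in the notebook.
y_raw = pd.Series(rng.normal(11.0, 3.0, size=m))
y = (y_raw - y_raw.mean()) / y_raw.std()      # pandas std defaults to ddof=1

# Cost of a model that always predicts zero.
zero_predictions = np.zeros(m)
baseline_cost = np.sum(np.square(zero_predictions - y)) / (2 * m)

print(baseline_cost, (m - 1) / (2 * m))
```

This baseline is a useful reference point: a trained model should end up well below it.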

### 3. Implement Backward Propagation

In [None]:
def backward_propagation(predictions,true,features):
        """
        Compute gradients for model parameters.

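        Parameters:
            predictions (numpy.ndarray): Predictions of shape (m,).
            true (numpy.ndarray): Target values of shape (m,).
            features (numpy.ndarray): Input data of shape (m, n_features).

        Returns:
            dict: Gradients "dW" (numpy.ndarray) and "db" (float).
        """
        m = len(predictions)
        dZ = predictions - true
        dW = 1/m * np.dot(dZ, features)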
        db = 1/m * np.sum(dZ)

        grads = {"dW": dW,
             "db": db}

        return grads

grads = backward_propagation(Y_hat, Y, X)

grads['db']

4.867142629730919e-15

In [None]:
grads['dW']

array([-0.82345159, -0.82589024, -0.97477322])
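A common sanity check for hand-written gradients is to compare them with finite differences of the cost. The sketch below does this on random made-up data. It re-implements the cost and gradient formulas inline so that it runs on its own, and the helper `cost_fn` is a hypothetical name introduced for illustration.

```python
import numpy as np

def cost_fn(W, b, features, true):
    """Same cost as compute_cost, written in terms of W and b."""
    predictions = np.dot(features, W) + b
    return np.sum(np.square(predictions - true)) / (2 * len(true))

rng = np.random.default_rng(2)
features = rng.normal(size=(50, 3))        # hypothetical inputs
true = rng.normal(size=50)                 # hypothetical targets
W = rng.normal(size=3) * 0.01
b = 0.0

# Analytic gradients, following the formulas used in backward_propagation.
dZ = np.dot(features, W) + b - true
dW = np.dot(dZ, features) / len(true)
db = np.sum(dZ) / len(true)

# Central finite differences as an independent estimate.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(len(W)):
    step = np.zeros_like(W)
    step[j] = eps
    dW_numeric[j] = (cost_fn(W + step, b, features, true)
                     - cost_fn(W - step, b, features, true)) / (2 * eps)
db_numeric = (cost_fn(W, b + eps, features, true)
              - cost_fn(W, b - eps, features, true)) / (2 * eps)

print(np.allclose(dW, dW_numeric, atol=1e-6), np.isclose(db, db_numeric, atol=1e-6))
```

As a side note, the near-zero `db` in the notebook's output follows from the standardization: with zero-mean features, a zero-mean target and `b = 0`, the mean residual is zero up to floating-point error.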

### 4. Update Parameters

In [None]:
def update_parameters(params, grads, learning_rate=0.2):
    """
    Updates parameters using the gradient descent update rule

    Arguments:
    params -- python dictionary containing parameters
    grads -- python dictionary containing gradients
    learning_rate -- learning rate parameter for gradient descent

    Returns:
    parameters -- python dictionary containing updated parameters
    """
    # Retrieve each parameter from the dictionary "parameters".
    W = params["W"]
    b = params["b"]

    # Retrieve each gradient from the dictionary "grads".
    dW = grads["dW"]
    db = grads["db"]

    # Update rule for each parameter.
    W = W - (learning_rate * dW)
    b = b - (learning_rate * db)

    parameters = {"W": W,
                  "b": b}

    return parameters

parameters_updated = update_parameters(parameters, grads)

print("W updated = " + str(parameters_updated["W"]))
print("b updated = " + str(parameters_updated["b"]))

W updated = [0.15765488 0.16532174 0.19650006]
b updated = -9.734285259461838e-16
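The update can be verified by hand from the numbers already printed. The sketch below applies $w = w - \alpha \, \partial\mathcal{L}/\partial w$ to the initial weights and gradient shown in the earlier outputs, with the default learning rate of 0.2.

```python
import numpy as np

# Values copied from the outputs printed above.
W = np.array([-0.00703544, 0.0001437, 0.00154541])
dW = np.array([-0.82345159, -0.82589024, -0.97477322])
learning_rate = 0.2

W_new = W - learning_rate * dW    # gradient descent update for the weights
print(W_new)
```

Up to rounding in the printed digits, the result matches the updated weights reported by `update_parameters`. Since every gradient component is negative, every weight increases.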


In [None]:
print(f"Following these steps, the weights and bias randomly initialized as:\n{parameters}\nhave now been updated to:\n{parameters_updated}")

Following these steps, the weights and bias randomly initialized as:
{'W': array([-0.00703544,  0.0001437 ,  0.00154541]), 'b': 0}
have now been updated to:
{'W': array([0.15765488, 0.16532174, 0.19650006]), 'b': -9.734285259461838e-16}


Now let's compare the **cost function** obtained using the randomly initialized weights and bias with the one obtained using the updated weights and bias

In [None]:
updated_Y_hat = forward_propagation(X,parameters_updated)
updated_cost = compute_cost(updated_Y_hat, Y)

print(f"The cost obtained with the randomly initialized parameters is:\n{cost}\nWith the updated parameters it is reduced to:\n{updated_cost}")

The cost obtained with the randomly initialized parameters is:
0.5036234716903147
With the updated parameters it is reduced to:
0.1659692623357513


So far, our implementation performs a single iteration of these operations. A single iteration is rarely enough to reach the minimum of the cost function, so we now combine the functions above into one routine that repeats them for a chosen number of iterations.

In [None]:
def linear_regression(features, true, num_iterations=100, learning_rate=0.2, print_cost=False):
    """
    Train a linear regression model with gradient descent.

    Arguments:
    features -- input features (independent variables)
    true -- target values (dependent variable)
    num_iterations -- number of iterations in the loop
    learning_rate -- learning rate parameter for gradient descent
    print_cost -- if True, print the cost every iteration

    Returns:
    parameters -- parameters learnt by the model. They can then be used to make predictions.
    """

    # Initialize the parameters from the number of input features
    # (uses the `features` argument rather than the global X).
    parameters = initialize_parameters(features.shape[1])

    for i in range(num_iterations):

        Y_hat = forward_propagation(features,parameters)
        cost = compute_cost(Y_hat, true)
        grads = backward_propagation(Y_hat, true, features)
        parameters = update_parameters(parameters, grads, learning_rate)

        if print_cost:
            print ("Cost after iteration %i: %f" %(i, cost))

    return parameters

In [None]:
parameters_simple = linear_regression(X, Y, num_iterations=100, learning_rate=0.2, print_cost=True)
print("W = " + str(parameters_simple["W"]))
print("b = " + str(parameters_simple["b"]))

# Read the trained values from parameters_simple, not from the initial parameters
W_simple = parameters_simple["W"]
b_simple = parameters_simple["b"]

Cost after iteration 0: 0.534885
Cost after iteration 1: 0.171106
Cost after iteration 2: 0.092260
Cost after iteration 3: 0.073072
Cost after iteration 4: 0.066543
Cost after iteration 5: 0.062852
Cost after iteration 6: 0.059936
Cost after iteration 7: 0.057352
Cost after iteration 8: 0.054992
Cost after iteration 9: 0.052822
Cost after iteration 10: 0.050824
Cost after iteration 11: 0.048984
Cost after iteration 12: 0.047289
Cost after iteration 13: 0.045727
Cost after iteration 14: 0.044288
Cost after iteration 15: 0.042962
Cost after iteration 16: 0.041741
Cost after iteration 17: 0.040616
Cost after iteration 18: 0.039579
Cost after iteration 19: 0.038624
Cost after iteration 20: 0.037744
Cost after iteration 21: 0.036933
Cost after iteration 22: 0.036187
Cost after iteration 23: 0.035498
Cost after iteration 24: 0.034864
Cost after iteration 25: 0.034280
Cost after iteration 26: 0.033742
Cost after iteration 27: 0.033246
Cost after iteration 28: 0.032789
Cost after iteration 29:

Within 100 iterations, gradient descent steadily reduced the cost: the log above shows it falling from 0.534885 at iteration 0 to 0.032789 at iteration 28. The gap between the predicted and actual $y$ values shrinks accordingly, and the whole implementation relies only on Pandas and NumPy.
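One way to judge whether gradient descent has run long enough is to compare its result with the closed-form least-squares solution. The sketch below does this with `np.linalg.lstsq` on synthetic data; the arrays, the true coefficients and the iteration count are all made up for illustration. It re-implements the update loop inline rather than calling the notebook's functions so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200
# Hypothetical standardized-looking features and a noisy linear target.
features = rng.normal(size=(m, 3))
true = features @ np.array([0.2, -0.1, 0.9]) + 0.05 * rng.normal(size=m)

# Gradient descent, using the same update rule as the notebook.
W, b = np.zeros(3), 0.0
for _ in range(2000):
    dZ = features @ W + b - true
    W -= 0.2 * (dZ @ features) / m
    b -= 0.2 * np.sum(dZ) / m

# Closed-form least squares: append a column of ones to model the intercept.
design = np.column_stack([features, np.ones(m)])
solution, *_ = np.linalg.lstsq(design, true, rcond=None)

print(np.allclose(np.append(W, b), solution, atol=1e-6))
```

If the two sets of parameters differ noticeably on real data, more iterations or a different learning rate may be needed. Strongly correlated features, such as those used in this notebook, tend to slow convergence.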