<a href="https://colab.research.google.com/github/AksShri2004/Linear-Regression-from-scratch/blob/main/Linear_Regression_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Linear Regression from Scratch by [aksmelittle](https://aksmelittle.vercel.app)


This notebook demonstrates a linear regression model built from scratch. It covers data loading, preprocessing, model implementation, training, and evaluation.

#### Acknowledgment and Self-Application

This notebook's structure and logic are inspired by the Kaggle notebook by Fares Elmenshawii: [Linear Regression from Scratch](https://www.kaggle.com/code/fareselmenshawii/linear-regression-from-scratch/notebook). I used that resource to understand the underlying concepts and then implemented the logic here myself.

### Step 1: Download the Dataset

The dataset for this linear regression example is downloaded from Kaggle using the `kagglehub` package.

In [4]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("andonians/random-linear-regression")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'random-linear-regression' dataset.
Path to dataset files: /kaggle/input/random-linear-regression



### Step 2: Import Necessary Libraries

We import `numpy` for numerical operations, `pandas` for data manipulation, `plotly.express` for visualization, and `pickle` for model serialization. `math` is imported as well, although the code shown in this notebook does not use it.

In [5]:
import math
import numpy as np
import pandas as pd
import plotly.express as px
import pickle


### Step 3: Load and Prepare Data

Load the training and testing datasets from the downloaded CSV files. We also handle missing values by dropping rows that contain `NaN`.

In [6]:
# Load the training and test datasets
train_data = pd.read_csv('/kaggle/input/random-linear-regression/train.csv')
test_data = pd.read_csv('/kaggle/input/random-linear-regression/test.csv')

# Remove rows with missing values
train_data = train_data.dropna()
test_data = test_data.dropna()
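
To see what `dropna()` does before applying it to the real files, here is a minimal sketch on a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [1.1, np.nan, 2.9]})

print(toy.isna().sum())   # number of missing values per column
cleaned = toy.dropna()    # drops every row that contains a NaN
print(len(toy), "->", len(cleaned))
```

Counting missing values first shows how many rows the cleanup step is about to discard.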

#### Inspect Training Data

Display the first few rows of the training dataset to understand its structure.

In [7]:
train_data.head()


|   | x    | y         |
|---|------|-----------|
| 0 | 24.0 | 21.549452 |
| 1 | 50.0 | 47.464463 |
| 2 | 15.0 | 17.218656 |
| 3 | 38.0 | 36.586398 |
| 4 | 87.0 | 87.288984 |


#### Visualize Training Data

A scatter plot helps visualize the relationship between 'x' and 'y' in the training data.

In [8]:
px.scatter(x=train_data['x'], y=train_data['y'], template='seaborn')


#### Separate Features and Target Variables

Define the features (X) and target (y) for both training and testing datasets.

In [9]:
# Set training data and target
X_train = train_data['x'].values
y_train = train_data['y'].values

# Set testing data and target
X_test = test_data['x'].values
y_test = test_data['y'].values

#### Standardize Data

Standardization rescales the feature to have a mean of 0 and a standard deviation of 1, which typically helps gradient descent converge faster.

The `standardize_data` function computes the mean and standard deviation from the training data only and applies the same transformation to both sets, so no information from the test set leaks into preprocessing.

In [10]:
def standardize_data(X_train, X_test):
    """
    Standardizes the input data using mean and standard deviation.

    Parameters:
        X_train (numpy.ndarray): Training data.
        X_test (numpy.ndarray): Testing data.

    Returns:
        Tuple of standardized training and testing data.
    """
    # Calculate the mean and standard deviation using the training data
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)

    # Standardize the data
    X_train = (X_train - mean) / std
    X_test = (X_test - mean) / std

    return X_train, X_test

X_train, X_test = standardize_data(X_train, X_test)
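
A quick way to check the transformation is to look at the statistics afterwards. The sketch below uses small hypothetical arrays (`train` and `test` are stand-ins, not the real columns) and repeats the same arithmetic as `standardize_data`.

```python
import numpy as np

# Hypothetical 1-D feature arrays standing in for the real columns
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([15.0, 50.0])

# Statistics come from the training data only
mean, std = train.mean(), train.std()
train_std = (train - mean) / std
test_std = (test - mean) / std

print(train_std.mean(), train_std.std())  # approximately 0 and 1
print(test_std.mean(), test_std.std())    # generally not 0 and 1
```

The test set is not expected to come out with mean 0 and standard deviation 1, because it is scaled with the training statistics rather than its own.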

#### Reshape Data

Add a trailing axis to each feature array so that it has shape `(n_samples, n_features)`. The model reads the number of features from `X.shape[1]` and computes predictions with `np.dot(X, self.W)`, so a 2-D input is required.

In [11]:
X_train = np.expand_dims(X_train, axis=-1)
X_test = np.expand_dims(X_test, axis=-1)
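
The effect of the extra axis can be seen on a tiny example. This is a sketch with made-up numbers; `x`, `X`, and `W` here are hypothetical and unrelated to the trained model.

```python
import numpy as np

x = np.array([0.5, -1.2, 0.7])   # 1-D array, shape (3,)
X = np.expand_dims(x, axis=-1)   # 2-D array, shape (3, 1)
W = np.array([2.0])              # one weight per feature

print(x.shape, X.shape)
preds = np.dot(X, W)             # (3, 1) dot (1,) gives shape (3,)
print(preds.shape)
```

With one weight per feature, the dot product returns one prediction per sample, which matches the shape of the target vector.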

#### Check Feature Dimensions

Check the shape after reshaping: the first entry is the number of samples and the second is the number of features.

In [20]:
X_train.shape

(699, 1)

### Step 4: Implement Linear Regression Model

Define a `LinearRegression` class with methods for initialization, forward pass (prediction), cost calculation, backward pass (gradient calculation), training (`fit`), and model saving/loading.

In [13]:
class LinearRegression:
  def __init__(self, learning_rate, convergence_tol=1e-6):
    self.learning_rate = learning_rate
    self.convergence_tol = convergence_tol
    self.W = None
    self.b = None
    self.m = None  # total number of training examples

  def initialize_parameters(self, n_examples, n_features):
    self.W = np.random.randn(n_features)*0.01
    self.b = 0
    self.m = n_examples

  def forward_pass(self, X):
    return np.dot(X, self.W) + self.b  # predictions, shape (n_examples,)

  def cost(self, prediction, y):
    # Half mean squared error: 1/(2m) * sum((prediction - y)^2)
    return np.sum(np.square(prediction - y)) / (2 * self.m)

  def backward_pass(self, prediction, X, y):
    # Gradients of the cost with respect to W and b
    self.dW = np.dot(prediction - y, X) / self.m
    self.db = np.sum(prediction - y) / self.m

  def fit(self, X, y, iterations, plot_cost=True):
    assert isinstance(X, np.ndarray), "X must be a NumPy array"
    assert isinstance(y, np.ndarray), "y must be a NumPy array"
    assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
    assert iterations > 0, "Iterations must be greater than 0"

    self.X = X
    self.y = y
    self.initialize_parameters(self.X.shape[0], self.X.shape[1])
    costs = []

    for i in range(iterations):
      predictions = self.forward_pass(self.X)
      cost = self.cost(predictions, self.y)
      self.backward_pass(predictions, self.X, self.y)
      self.W -= self.learning_rate * self.dW
      self.b -= self.learning_rate * self.db
      costs.append(cost)

      if i % 100 == 0:  # print the cost every 100 iterations
        print(f'Iteration: {i}, Cost: {cost}')

      # Stop once the cost changes by less than the tolerance
      if i > 0 and abs(costs[-1] - costs[-2]) < self.convergence_tol:
        print(f"converged in {i}th iteration")
        break

    if plot_cost:
      fig = px.line(y=costs, title="Cost vs Iteration", template="plotly_dark")
      fig.update_layout(
        title_font_color="#41BEE9",
        xaxis=dict(color="#41BEE9", title="Iterations"),
        yaxis=dict(color="#41BEE9", title="Cost")
      )

      fig.show()

  def save_model(self, filename):
    # filename is required: open(None, 'wb') would raise a TypeError
    model_data = {
        'learning_rate': self.learning_rate,
        'convergence_tol': self.convergence_tol,
        'W': self.W,
        'b': self.b
    }

    with open(filename, 'wb') as file:
      pickle.dump(model_data, file)

  @classmethod
  def load_model(cls, filename):
    with open(filename, 'rb') as file:
        model_data = pickle.load(file)

    # Create a new instance of the class and initialize it with the loaded parameters
    loaded_model = cls(model_data['learning_rate'], model_data['convergence_tol'])
    loaded_model.W = model_data['W']
    loaded_model.b = model_data['b']

    return loaded_model
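
In equation form, the `forward_pass`, `cost`, and `backward_pass` methods of the class correspond to the expressions below. The notation is an assumption introduced here for illustration: $m$ is the number of training examples, $\mathbf{x}^{(i)}$ and $y^{(i)}$ are the $i$-th example and its target, and $\alpha$ is the learning rate.

$$
\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b, \qquad
J(\mathbf{w}, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2
$$

$$
\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) \mathbf{x}^{(i)}, \qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)
$$

$$
\mathbf{w} \leftarrow \mathbf{w} - \alpha \, \frac{\partial J}{\partial \mathbf{w}}, \qquad
b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
$$

The factor of $\frac{1}{2}$ in the cost cancels the 2 that appears when differentiating the square, which is why the gradients carry $\frac{1}{m}$ rather than $\frac{2}{m}$.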


### Step 5: Train the Model

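Initialize the `LinearRegression` model with a learning rate of 0.01 and train it with the `fit` method on the standardized training data for up to 10,000 iterations. The cost is printed every 100 iterations, training stops early once the change in cost between consecutive iterations falls below `convergence_tol`, and a plot of cost vs. iteration is generated at the end.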

In [14]:
lr = LinearRegression(0.01)
lr.fit(X_train, y_train, 10000)

Iteration: 0, Cost: 1669.977488114356
Iteration: 100, Cost: 227.14985776787353
Iteration: 200, Cost: 33.84028097796812
Iteration: 300, Cost: 7.940726732726334
Iteration: 400, Cost: 4.4707128759826915
Iteration: 500, Cost: 4.005801547703548
Iteration: 600, Cost: 3.943512879102922
Iteration: 700, Cost: 3.9351674635364766
Iteration: 800, Cost: 3.93404934747233
converged in 863th iteration
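
One way to sanity-check a gradient descent fit is to compare it with the closed-form least-squares solution. The sketch below does this on synthetic data, not on the Kaggle dataset; the true slope of 3.0, the intercept of 1.0, the learning rate, and the iteration count are all assumptions chosen for illustration, and the loop is a simplified single-feature version of the update rule rather than the class itself.

```python
import numpy as np

# Synthetic data: y = 3x + 1 plus a little noise (hypothetical values)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

# Plain gradient descent on J = 1/(2m) * sum((w*x + b - y)^2)
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    err = w * x + b - y
    w -= lr * np.mean(err * x)
    b -= lr * np.mean(err)

# Closed-form least-squares line for comparison
w_ls, b_ls = np.polyfit(x, y, deg=1)
print(w, b)
print(w_ls, b_ls)
```

If the two pairs of numbers agree closely, the gradient and update steps are very likely implemented correctly.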


### Step 6: Save and Load the Model

Save the trained model parameters to a file using `pickle` and then demonstrate how to load the model back. This allows for persistent storage and reuse of trained models.

In [15]:
lr.save_model('model.pkl')



In [16]:
model = LinearRegression.load_model("model.pkl")
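
A round trip can be verified by comparing the parameters before and after loading. This is a minimal sketch of the same `pickle` pattern on a hypothetical parameter dictionary; the values of `W` and `b` and the file name `params.pkl` are made up, and the file is written to a temporary directory.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical parameters standing in for a trained model's W and b
params = {"W": np.array([2.0]), "b": 0.5}

path = os.path.join(tempfile.mkdtemp(), "params.pkl")
with open(path, "wb") as f:
    pickle.dump(params, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

same_W = np.array_equal(restored["W"], params["W"])
same_b = restored["b"] == params["b"]
print(same_W, same_b)
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.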

### Step 7: Define Regression Metrics

Implement a `RegressionMetrics` class to calculate common evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²).

In [17]:
class RegressionMetrics:
    @staticmethod
    def mean_squared_error(y_true, y_pred):
        """
        Calculate the Mean Squared Error (MSE).

        Args:
            y_true (numpy.ndarray): The true target values.
            y_pred (numpy.ndarray): The predicted target values.

        Returns:
            float: The Mean Squared Error.
        """
        assert len(y_true) == len(y_pred), "Input arrays must have the same length."
        mse = np.mean((y_true - y_pred) ** 2)
        return mse

    @staticmethod
    def root_mean_squared_error(y_true, y_pred):
        """
        Calculate the Root Mean Squared Error (RMSE).

        Args:
            y_true (numpy.ndarray): The true target values.
            y_pred (numpy.ndarray): The predicted target values.

        Returns:
            float: The Root Mean Squared Error.
        """
        assert len(y_true) == len(y_pred), "Input arrays must have the same length."
        mse = RegressionMetrics.mean_squared_error(y_true, y_pred)
        rmse = np.sqrt(mse)
        return rmse

    @staticmethod
    def r_squared(y_true, y_pred):
        """
        Calculate the R-squared (R^2) coefficient of determination.

        Args:
            y_true (numpy.ndarray): The true target values.
            y_pred (numpy.ndarray): The predicted target values.

        Returns:
            float: The R-squared (R^2) value.
        """
        assert len(y_true) == len(y_pred), "Input arrays must have the same length."
        mean_y = np.mean(y_true)
        ss_total = np.sum((y_true - mean_y) ** 2)
        ss_residual = np.sum((y_true - y_pred) ** 2)
        r2 = 1 - (ss_residual / ss_total)
        return r2
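
The three formulas can be checked by hand on a tiny example. The arrays below are hypothetical and chosen so the arithmetic is easy to follow; the code repeats the same expressions as the class methods instead of calling them, so it runs on its own.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])   # only the last prediction is off, by 1

mse = np.mean((y_true - y_pred) ** 2)            # (0 + 0 + 1) / 3
rmse = np.sqrt(mse)
ss_res = np.sum((y_true - y_pred) ** 2)          # 1
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # (1 + 0 + 1) = 2
r2 = 1 - ss_res / ss_tot                         # 1 - 1/2

print(mse, rmse, r2)
```

Here the residual sum of squares is half the total sum of squares, so R² comes out to 0.5.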

### Step 8: Evaluate Model Performance

Make predictions on the test set using the trained model and calculate the MSE, RMSE, and R-squared values to assess its performance.

#### Display Evaluation Metrics

Print the calculated evaluation metrics for the linear regression model on the test data.

In [19]:
y_pred = model.forward_pass(X_test)
mse_value = RegressionMetrics.mean_squared_error(y_test, y_pred)
rmse_value = RegressionMetrics.root_mean_squared_error(y_test, y_pred)
r_squared_value = RegressionMetrics.r_squared(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse_value}")
print(f"Root Mean Squared Error (RMSE): {rmse_value}")
print(f"R-squared (Coefficient of Determination): {r_squared_value}")

Mean Squared Error (MSE): 9.442669458970064
Root Mean Squared Error (RMSE): 3.072892685885738
R-squared (Coefficient of Determination): 0.988789872694102
