<h1><i> CREATING A MULTIPLE LINEAR REGRESSION MODEL FROM SCRATCH </i></h1>

The steps below walk through the formulas behind multiple linear regression with two independent variables:

Given:
- X1 = [x1_1, x1_2, ..., x1_n] (array of values for independent variable X1)
- X2 = [x2_1, x2_2, ..., x2_n] (array of values for independent variable X2)
- y = [y_1, y_2, ..., y_n] (array of values for dependent variable y)
- n (number of data points)

1. Mean Calculation:
   - Calculate the mean of X1:
     - mean_X1 = sum(X1) / n
   - Calculate the mean of X2:
     - mean_X2 = sum(X2) / n
   - Calculate the mean of y:
     - mean_y = sum(y) / n

2. Variance Calculation:
   - Calculate the variance of X1:
     - variance_X1 = sum((x1_i - mean_X1)^2) / n
   - Calculate the variance of X2:
     - variance_X2 = sum((x2_i - mean_X2)^2) / n
   - Calculate the variance of y:
     - variance_y = sum((y_i - mean_y)^2) / n

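3. Covariance Calculation:
   - Calculate the covariance between X1 and y:
     - covariance_X1_y = sum((x1_i - mean_X1) * (y_i - mean_y)) / n
   - Calculate the covariance between X2 and y:
     - covariance_X2_y = sum((x2_i - mean_X2) * (y_i - mean_y)) / n
   - Calculate the covariance between X1 and X2:
     - covariance_X1_X2 = sum((x1_i - mean_X1) * (x2_i - mean_X2)) / n

4. Slope Calculation:
   - Calculate the shared denominator:
     - denominator = (variance_X1 * variance_X2) - covariance_X1_X2^2
   - Calculate the slope for X1:
     - slope_X1 = ((covariance_X1_y * variance_X2) - (covariance_X2_y * covariance_X1_X2)) / denominator
   - Calculate the slope for X2:
     - slope_X2 = ((covariance_X2_y * variance_X1) - (covariance_X1_y * covariance_X1_X2)) / denominator
   - When X1 and X2 are uncorrelated (covariance_X1_X2 = 0), these reduce to covariance_X1_y / variance_X1 and covariance_X2_y / variance_X2.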

5. Intercept Calculation:
   - Calculate the intercept:
     - intercept = mean_y - (slope_X1 * mean_X1) - (slope_X2 * mean_X2)

The final linear regression equation for the given data is:
   - y = intercept + (slope_X1 * X1) + (slope_X2 * X2)

This equation represents the relationship between the dependent variable (y) and the independent variables (X1 and X2) based on the given data.
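
As a rough check on the steps above, the sketch below computes the two slopes and the intercept from means, variances, and covariances, then compares them with a least-squares fit from `np.linalg.lstsq`. The data is synthetic and the helper `cov` is a hypothetical name introduced only for this illustration. X2 is deliberately correlated with X1, so the slopes must include the covariance between X1 and X2; the simple ratio covariance_X1_y / variance_X1 matches the multiple-regression slope only when the two predictors are uncorrelated.

```python
import numpy as np

# Synthetic data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(40, 10, n)
X2 = 0.5 * X1 + rng.normal(0, 3, n)  # deliberately correlated with X1
y = 1000 + 50 * X1 + 200 * X2 + rng.normal(0, 5, n)


def cov(a, b):
    """Population covariance (divide by n); cov(a, a) is the variance."""
    return np.mean((a - a.mean()) * (b - b.mean()))


var_x1, var_x2 = cov(X1, X1), cov(X2, X2)
cov_x1_x2 = cov(X1, X2)
cov_x1_y, cov_x2_y = cov(X1, y), cov(X2, y)

# Two-predictor closed form, including the X1-X2 covariance term.
denominator = var_x1 * var_x2 - cov_x1_x2 ** 2
slope_x1 = (cov_x1_y * var_x2 - cov_x2_y * cov_x1_x2) / denominator
slope_x2 = (cov_x2_y * var_x1 - cov_x1_y * cov_x1_x2) / denominator
intercept = y.mean() - slope_x1 * X1.mean() - slope_x2 * X2.mean()
closed_form = np.array([intercept, slope_x1, slope_x2])

# Reference fit: least squares on [1, X1, X2].
A = np.column_stack((np.ones(n), X1, X2))
reference, *_ = np.linalg.lstsq(A, y, rcond=None)

# Simple-regression slope that ignores X2; it absorbs part of X2's effect.
naive_slope_x1 = cov_x1_y / var_x1

print("closed form:", closed_form)
print("lstsq      :", reference)
print("naive slope for X1:", naive_slope_x1)
```

The closed-form coefficients and the `lstsq` result should agree to numerical precision, while the naive slope for X1 comes out far from the value used to generate the data.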

<h1> Importing Libraries </h1>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

<h1> Helper Function </h1>

In [2]:
class LinearRegressionModel:
    def __init__(self, data_file):
        """
        Initialize the LinearRegressionModel object.

        Args:
            data_file (str): Path to the CSV data file.
        """
        self.data_file = data_file
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.coefficients = None

    def read_data(self):
        """
        Read the data from the CSV file.
        """
        self.data = pd.read_csv(self.data_file)
        self.X = self.data[['age', 'experience']]
        self.y = self.data['income']

    def split_data(self, test_size=0.2, random_state=42):
        """
        Split the data into training and testing sets.

        Args:
            test_size (float, optional): The proportion of the data to include in the test split.
            random_state (int, optional): Seed used by the random number generator.

        Returns:
            None
        """
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=test_size, random_state=random_state)

    def train_model(self):
        """
        Train the linear regression model using the provided training data.
        """
        # Add a column of ones to X_train for the intercept term
        X_train_extended = np.column_stack((np.ones(len(self.X_train)), self.X_train))
        
        # Compute the coefficients using the normal equation
        XtX = X_train_extended.T @ X_train_extended
        XtX_inverse = np.linalg.inv(XtX)
        XtX_inverse_Xt = XtX_inverse @ X_train_extended.T
        self.coefficients = XtX_inverse_Xt @ self.y_train

    def predict(self, X):
        """
        Predict the target variable using the learned coefficients.

        Args:
            X (pandas.DataFrame): Input features.

        Returns:
            numpy.ndarray: Predicted values.
        """
        # Add a column of ones to X for the intercept term
        X_extended = np.column_stack((np.ones(len(X)), X))
        return X_extended @ self.coefficients

    def evaluate_model(self):
        """
        Evaluate the trained linear regression model using the testing data.

        Returns:
            tuple: Mean Squared Error (MSE) and R-squared (coefficient of determination).
        """
        predictions = self.predict(self.X_test)
        mse = self.calculate_mse(predictions, self.y_test)
        r2 = self.calculate_r2(predictions, self.y_test)
        return mse, r2

    def calculate_mse(self, predictions, y):
        """
        Calculate the Mean Squared Error (MSE) between the predicted and actual values.

        Args:
            predictions (array-like): Predicted values.
            y (array-like): Actual values.

        Returns:
            float: Mean Squared Error (MSE).
        """
        residuals = predictions - y
        squared_residuals = residuals ** 2
        mse = np.mean(squared_residuals)
        return mse

    def calculate_r2(self, predictions, y):
        """
        Calculate the R-squared (coefficient of determination) between the predicted and actual values.

        Args:
            predictions (array-like): Predicted values.
            y (array-like): Actual values.

        Returns:
            float: R-squared (coefficient of determination).
        """
        tss = np.sum((y - np.mean(y)) ** 2)
        rss = np.sum((y - predictions) ** 2)
        r2 = 1 - (rss / tss)
        return r2

    def plot_data(self):
        """
        Plot actual and predicted income against age.
        """
        plt.scatter(self.X['age'], self.y, color='blue', label='Actual')
        plt.scatter(self.X['age'], self.predict(self.X), color='red', label='Predicted')
        plt.xlabel('Age')
        plt.ylabel('Income')
        plt.legend()
        plt.show()
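
The `train_model` method above solves the normal equation, coefficients = (XᵀX)⁻¹ Xᵀy, by forming the inverse explicitly. A minimal alternative sketch follows, assuming the same design-matrix layout (a leading column of ones for the intercept). It hands the linear system to `np.linalg.solve` instead, which avoids computing the inverse and is generally the better-conditioned choice. The function name `fit_normal_equation` and the synthetic data are hypothetical and not part of the notebook.

```python
import numpy as np


def fit_normal_equation(X, y):
    """Return [intercept, coef_1, ..., coef_p] for features X and target y."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X_ext = np.column_stack((np.ones(len(X)), X))
    # Solve (X^T X) beta = X^T y without forming the inverse explicitly.
    return np.linalg.solve(X_ext.T @ X_ext, X_ext.T @ y)


# Noise-free synthetic data built from known coefficients (illustration only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
true_beta = np.array([3.0, 2.0, -1.5])  # intercept, coef for X1, coef for X2
y = true_beta[0] + X @ true_beta[1:]

beta = fit_normal_equation(X, y)
print(beta)
```

Because the data contains no noise, the recovered coefficients should match `true_beta` up to floating-point error.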

<h1> Main Function </h1>

In [3]:
# Create an instance of the LinearRegressionModel class
model = LinearRegressionModel('multi-dataset.csv')

# Read the data from the dataset file
model.read_data()

# Split the data into training and testing sets
model.split_data()

# Train the linear regression model using the training set
model.train_model()

# Evaluate the model using the testing set
mse, r2 = model.evaluate_model()
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot the data
# model.plot_data()


Mean Squared Error: 753796.7693735545
R-squared: 0.9387098237077806
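
To make the two reported metrics concrete, here is a small sketch that repeats the same arithmetic as `calculate_mse` and `calculate_r2` on four hand-picked values. The numbers are made up for illustration and are unrelated to `multi-dataset.csv`.

```python
import numpy as np

# Tiny made-up example: four actual values and four predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# MSE: mean of the squared residuals.
residuals = y_pred - y_true          # [-0.5, 0.5, 0.0, 0.5]
mse = np.mean(residuals ** 2)        # (0.25 + 0.25 + 0 + 0.25) / 4

# R-squared: 1 - RSS / TSS.
rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares = 0.75
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares = 20
r2 = 1 - rss / tss

print(f"MSE: {mse}")
print(f"R-squared: {r2}")
```

MSE is expressed in squared units of the target, so its size depends on the scale of `income`. R-squared is unitless: a value of 1 means the predictions explain all of the variance in the test targets, and 0 means they do no better than predicting the mean.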


<h1> Comparing the Output with scikit-learn's Built-In Multiple Linear Regression </h1>

<h1> Importing Libraries </h1>

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

<h1> Helper Function </h1>

In [5]:
class LinearRegressionModel:
    def __init__(self, data_file):
        """
        Initialize the LinearRegressionModel object.

        Args:
            data_file (str): Path to the CSV data file.
        """
        self.data_file = data_file
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.model = None

    def read_data(self):
        """
        Read the data from the CSV file.
        """
        self.data = pd.read_csv(self.data_file)
        self.X = self.data[['age', 'experience']]
        self.y = self.data['income']

    def split_data(self, test_size=0.2, random_state=42):
        """
        Split the data into training and testing sets.

        Args:
            test_size (float, optional): The proportion of the data to include in the test split.
            random_state (int, optional): Seed used by the random number generator.

        Returns:
            None
        """
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X, self.y, test_size=test_size, random_state=random_state
        )

    def train_model(self):
        """
        Train the linear regression model using the provided training data.
        """
        self.model = LinearRegression()
        self.model.fit(self.X_train, self.y_train)

    def evaluate_model(self):
        """
        Evaluate the trained linear regression model using the testing data.

        Returns:
            tuple: Mean Squared Error (MSE) and R-squared (coefficient of determination).
        """
        predictions = self.model.predict(self.X_test)
        mse = self.calculate_mse(predictions, self.y_test)
        r2 = self.calculate_r2(predictions, self.y_test)
        return mse, r2

    def calculate_mse(self, prediction, y):
        """
        Calculate the Mean Squared Error (MSE) between the predicted and actual values.

        Args:
            prediction (array-like): Predicted values.
            y (array-like): Actual values.

        Returns:
            float: Mean Squared Error (MSE).
        """
        residuals = prediction - y
        squared_residuals = residuals ** 2
        mse = np.mean(squared_residuals)
        return mse

    def calculate_r2(self, prediction, y):
        """
        Calculate the R-squared (coefficient of determination) between the predicted and actual values.

        Args:
            prediction (array-like): Predicted values.
            y (array-like): Actual values.

        Returns:
            float: R-squared (coefficient of determination).
        """
        tss = np.sum((y - np.mean(y)) ** 2)
        rss = np.sum((y - prediction) ** 2)
        r2 = 1 - (rss / tss)
        return r2

    def plot_data(self):
        """
        Plot actual and predicted income against age.
        """
        plt.scatter(self.X['age'], self.y, color='blue', label='Actual')
        plt.scatter(self.X['age'], self.model.predict(self.X), color='red', label='Predicted')
        plt.xlabel('Age')
        plt.ylabel('Income')
        plt.legend()
        plt.show()


<h1> Main Function </h1>

In [6]:
# Create an instance of the LinearRegressionModel class
model = LinearRegressionModel('multi-dataset.csv')

# Read the data from the dataset file
model.read_data()

# Split the data into training and testing sets
model.split_data()

# Train the linear regression model using the training set
model.train_model()

# Evaluate the model using the testing set
mse, r2 = model.evaluate_model()
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot the data
# model.plot_data()


Mean Squared Error: 753796.7693735545
R-squared: 0.9387098237077806
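
The matching numbers are what one would expect: scikit-learn's `LinearRegression` also fits ordinary least squares, so with the same split it should find the same coefficients as the normal equation. The sketch below checks this directly on synthetic data, which is a stand-in and not the notebook's dataset, by comparing the coefficient vectors rather than only the metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic two-feature data (hypothetical values, for illustration only).
rng = np.random.default_rng(7)
X = rng.uniform(20, 60, size=(150, 2))
y = 500 + 30 * X[:, 0] + 120 * X[:, 1] + rng.normal(0, 10, 150)

# From-scratch fit via the normal equation, intercept column first.
X_ext = np.column_stack((np.ones(len(X)), X))
beta = np.linalg.solve(X_ext.T @ X_ext, X_ext.T @ y)

# scikit-learn fit; it stores the intercept separately from the slopes.
sk = LinearRegression().fit(X, y)
sk_beta = np.concatenate(([sk.intercept_], sk.coef_))

same_fit = np.allclose(beta, sk_beta)
print("from scratch:", beta)
print("scikit-learn:", sk_beta)
print("coefficients match:", same_fit)
```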
