<h1><i> CREATING MULTI LINEAR REGRESSION MODEL FROM SCRATCH </i></h1>

Here are the step-by-step explanations for the mathematical formulas used in multi-linear regression:

Certainly! Here's the formulaic representation of the steps involved in linear regression:

Given:
- X1 = [x1_1, x1_2, ..., x1_n] (array of values for independent variable X1)
- X2 = [x2_1, x2_2, ..., x2_n] (array of values for independent variable X2)
- y = [y_1, y_2, ..., y_n] (array of values for dependent variable y)
- n (number of data points)

1. Mean Calculation:
   - Calculate the mean of X1:
     - mean_X1 = sum(X1) / n
   - Calculate the mean of X2:
     - mean_X2 = sum(X2) / n
   - Calculate the mean of y:
     - mean_y = sum(y) / n

2. Variance Calculation:
   - Calculate the variance of X1:
     - variance_X1 = sum((x1_i - mean_X1)^2) / n
   - Calculate the variance of X2:
     - variance_X2 = sum((x2_i - mean_X2)^2) / n
   - Calculate the variance of y:
     - variance_y = sum((y_i - mean_y)^2) / n

3. Covariance Calculation:
   - Calculate the covariance between X1 and y:
     - covariance_X1_y = sum((x1_i - mean_X1) * (y_i - mean_y)) / n
   - Calculate the covariance between X2 and y:
     - covariance_X2_y = sum((x2_i - mean_X2) * (y_i - mean_y)) / n

4. Slope Calculation:
   - Calculate the slope for X1:
     - slope_X1 = covariance_X1_y / variance_X1
   - Calculate the slope for X2:
     - slope_X2 = covariance_X2_y / variance_X2

5. Intercept Calculation:
   - Calculate the intercept:
     - intercept = mean_y - (slope_X1 * mean_X1) - (slope_X2 * mean_X2)

The final linear regression equation for the given data is:
   - y = intercept + (slope_X1 * X1) + (slope_X2 * X2)

This equation represents the relationship between the dependent variable (y) and the independent variables (X1 and X2) based on the given data.

<h1> Importing Libraries </h1>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

<h1> Helper Functions </h1>

In [2]:
class LinearRegression:
    """
    Linear regression model for predicting a target variable based on a single feature.

    Args:
        dataset_file (str): File path of the dataset.

    Attributes:
        dataset_file (str): File path of the dataset.
        df (pd.DataFrame): Dataframe containing the dataset.
        x (ndarray): Array of feature values.
        y (ndarray): Array of target values.
    """

    def __init__(self, dataset_file):
        self.dataset_file = dataset_file
        self.df = None
        self.x = None
        self.y = None

    def read_data(self):
        """
        Read the dataset from a CSV file and assign feature and target values.
        """
        self.df = pd.read_csv(self.dataset_file).reset_index(drop=True)

        self.X = self.df.loc[:,self.df.columns[:-1]].values
        self.y = self.df['income'].values
#         print('Shape of X and y: ' ,self.X.shape, self.y.shape)       


    def get_mean(self, arr):
        """
        Calculate the mean of an array.

        Args:
            arr (ndarray): Input array.

        Returns:
            float: Mean value.
        """
#         return arr.mean(axis=0)
        total = sum(arr)
        mean = total / len(arr)
        return mean
    
    def get_variance(self, arr, mean):
        """
        Calculate the variance of an array.

        Args:
            arr (ndarray): Input array.
            mean (float): Mean value of the array.

        Returns:
            float: Variance value.
        """
        variance = np.sum((arr - mean) ** 2, axis=0) / arr.shape[0]
        return variance

    def get_covariance(self, arr_x, mean_x, arr_y, mean_y):
        """
        Calculate the covariance between two arrays.

        Args:
            arr_x (ndarray): First input array.
            mean_x (float): Mean value of the first array.
            arr_y (ndarray): Second input array.
            mean_y (float): Mean value of the second array.

        Returns:
            float: Covariance value.
        """

        covariance = sum((x1 - mean_x) * (x2 - mean_y) for x1, x2 in zip(arr_x, arr_y)) / len(arr_x)
        return covariance

    def get_coefficients(self, x, y):
        """
        Calculate the slope and intercept coefficients for linear regression.

        Args:
            x (ndarray): Array of feature values.
            y (ndarray): Array of target values.

        Returns:
            tuple: Slope and intercept coefficients (m, c).
        """
        x_mean = self.get_mean(x)
        print(x_mean)
        y_mean = self.get_mean(y)
        print(y_mean)
        variance = self.get_variance(x, x_mean)
        print(variance)
        covariance = self.get_covariance(x, x_mean, y, y_mean)
        print(covariance)
        m = covariance / variance
        c = y_mean - x_mean * m
        print('m and c:' ,m,c)
        return m, c

    def linear_regression(self, x_train, y_train, x_test, y_test):
        """
        Perform linear regression on the training data and evaluate the model using the test data.

        Args:
            x_train (ndarray): Array of feature values for training.
            y_train (ndarray): Array of target values for training.
            x_test (ndarray): Array of feature values for testing.
            y_test (ndarray): Array of target values for testing.

        Returns:
            list: Predicted target values for the test data.
        """
        prediction = []
        m, c = self.get_coefficients(x_train, y_train)

#         print(m.shape,x_test.shape,c.shape)
        prediction = m@x_test.T
        
        mse = self.calculate_mse(prediction, y_test)
        r2 = self.calculate_r2(prediction, y_test)
        print("MSE SCORE:", mse)
        print("R2 SCORE:", r2)

        return prediction

    def calculate_mse(self, prediction, y_test):
        """
        Calculate the mean squared error (MSE).

        Args:
            prediction (ndarray): Predicted target values.
            y_test (ndarray): Actual target values.

        Returns:
            float: Mean squared error.
        """
        residuals = prediction - y_test
        squared_residuals = residuals ** 2
        mse = np.mean(squared_residuals)
        return mse

    def calculate_r2(self, prediction, y_test):
        """
        Calculate the R-squared (R2) score.

        Args:
            prediction (ndarray): Predicted target values.
            y_test (ndarray): Actual target values.

        Returns:
            float: R-squared score.
        """
        tss = np.sum((y_test - np.mean(y_test)) ** 2)
        rss = np.sum((y_test - prediction) ** 2)
        r2 = 1 - (rss / tss)
        return r2

    def plot_reg_line(self, x, y):
        """
        Plot the scatter plot of the data points and the regression line.

        Args:
            x (ndarray): Array of feature values.
            y (ndarray): Array of target values.
        """
        prediction = []
        m, c = self.get_coefficients(x, y)
        for x0 in range(1, 100):
            yhat = m * x0 + c
            prediction.append(yhat)

        sns.scatterplot(x=x, y=y, color='blue')
        sns.lineplot(x=[i for i in range(1, 100)], y=prediction, color='red')
        plt.xlabel('X')
        plt.ylabel('Y')
        plt.title('Regression Plot')
        plt.show()


<h1> Main Function </h1>

In [3]:
# Create an instance of the LinearRegression class
regression = LinearRegression('multi-dataset.csv')
# Call the necessary methods
regression.read_data()

# # Split the data into training and testing sets (to change -> scikit train_test_split)
# x_train = regression.x[:80]
# y_train = regression.y[:80]
# x_test = regression.x[80:]
# y_test = regression.y[80:]

# Perform linear regression using the training and testing data
# regression = regression.linear_regression(regression.x[:80], regression.y[:80], regression.x[80:], regression.y[80:])
regression = regression.linear_regression(regression.X[:2],regression.y[:2],regression.X[2:],regression.y[2:])

regression.plot_reg_line(regression.x, regression.y)


[27.5  2. ]
33060.0
[6.25 1.  ]
[6525. 2610.]
m and c: [1044. 2610.] [ 4350. 27840.]
MSE SCORE: 485796054.8888889
R2 SCORE: -6.233707667205879


AttributeError: 'numpy.ndarray' object has no attribute 'plot_reg_line'

<h1> Compares the output with SKLEARN Built-In Linear Regression Library</h1>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression

<h1> Helper Functions </h1>

In [30]:
class LinearRegressionModel:
    def __init__(self, dataset_file):
        """
        Initializes the LinearRegressionModel class.

        Args:
            dataset_file (str): The path to the dataset file.
        """
        self.dataset_file = dataset_file
        self.df = None
        self.x = None
        self.y = None
        self.reg = LinearRegression()

    def read_data(self):
        """
        Reads the dataset file and assigns the values of X and Y to self.x and self.y respectively.
        """
        self.df = pd.read_csv(self.dataset_file)
        self.x = self.df[['age','experience']].values
        self.y = self.df['income'].values
        
    def train_model(self):
        """
        Trains the linear regression model using the training data.
        """
        
        self.reg.fit(self.x[:80].reshape(-1, 1), self.y[:80])

    def calculate_mse(self, prediction, y):
        """
        Calculate the mean squared error (MSE) manually.

        Args:
            prediction (ndarray): Predicted target values.
            y (ndarray): Actual target values.

        Returns:
            float: Mean squared error.
        """
        residuals = prediction - y
        squared_residuals = residuals ** 2
        mse = np.mean(squared_residuals)
        return mse

    def calculate_r2(self, prediction, y):
        """
        Calculate the R-squared (R2) score manually.

        Args:
            prediction (ndarray): Predicted target values.
            y (ndarray): Actual target values.

        Returns:
            float: R-squared score.
        """
        tss = np.sum((y - np.mean(y)) ** 2)
        rss = np.sum((y - prediction) ** 2)
        r2 = 1 - (rss / tss)
        return r2

    def evaluate_model(self):
        """
        Evaluates the trained model using the test data and prints the R2 score and MSE score.
        """
        prediction = self.reg.predict(self.x[80:].reshape(-1, 1))
        r2 = self.calculate_r2(prediction, self.y[80:])
        mse = self.calculate_mse(prediction, self.y[80:])
        print("R2 SCORE:", r2)
        print("MSE SCORE:", mse)

    def predict_values(self):
        """
        Predicts the target values for a range of input values.

        Returns:
            numpy.ndarray: The predicted target values.
        """
        prediction = self.reg.predict(np.array([i for i in range(1, 100)]).reshape(-1, 1))
        return prediction

    def plot_data(self):
        """
        Plots the scatter plot of X and Y and the regression line.
        """
        prediction = self.predict_values()

        sns.scatterplot(x=self.x, y=self.y, color='green')
        sns.lineplot(x=[i for i in range(1, 100)], y=prediction, color='red')
        plt.xlabel('X')
        plt.ylabel('Y')
        plt.title('Regression Plot')
        plt.show()

<h1> Main Function </h1>

In [29]:
# Create an instance of the LinearRegression class
model = LinearRegressionModel('multi-dataset.csv')

# Read the data from the dataset file
model.read_data()

# Train the linear regression model
model.train_model()

# Evaluate the model
model.evaluate_model()

# Plot the data
model.plot_data()

[[25  1]
 [30  3]
 [47  2]
 [32  5]
 [43 10]
 [51  7]
 [28  5]
 [33  4]
 [37  5]
 [39  8]
 [29  1]
 [47  9]
 [54  5]
 [51  4]
 [44 12]
 [41  6]
 [58 17]
 [23  1]
 [44  9]
 [37 10]]
[30450 35670 31580 40130 47830 41630 41340 37650 40250 45150 27840 46110
 36720 34800 51300 38900 63600 30870 44190 48700]


ValueError: Found input variables with inconsistent numbers of samples: [40, 20]