<h1><i> CREATING A MULTIPLE LINEAR REGRESSION MODEL FROM SCRATCH </i></h1>

The steps below give the formulas used in this from-scratch multiple linear regression with two independent variables:

Given:
- X1 = [x1_1, x1_2, ..., x1_n] (array of values for independent variable X1)
- X2 = [x2_1, x2_2, ..., x2_n] (array of values for independent variable X2)
- y = [y_1, y_2, ..., y_n] (array of values for dependent variable y)
- n (number of data points)

1. Mean Calculation:
   - Calculate the mean of X1:
     - mean_X1 = sum(X1) / n
   - Calculate the mean of X2:
     - mean_X2 = sum(X2) / n
   - Calculate the mean of y:
     - mean_y = sum(y) / n

2. Variance Calculation:
   - Calculate the variance of X1:
     - variance_X1 = sum((x1_i - mean_X1)^2) / n
   - Calculate the variance of X2:
     - variance_X2 = sum((x2_i - mean_X2)^2) / n
   - Calculate the variance of y:
     - variance_y = sum((y_i - mean_y)^2) / n

3. Covariance Calculation:
   - Calculate the covariance between X1 and y:
     - covariance_X1_y = sum((x1_i - mean_X1) * (y_i - mean_y)) / n
   - Calculate the covariance between X2 and y:
     - covariance_X2_y = sum((x2_i - mean_X2) * (y_i - mean_y)) / n

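4. Slope Calculation:
   - Calculate the slope for X1:
     - slope_X1 = covariance_X1_y / variance_X1
   - Calculate the slope for X2:
     - slope_X2 = covariance_X2_y / variance_X2
   - Note: these are the simple (one-predictor) regression slopes. They equal the multiple-regression coefficients only when X1 and X2 are uncorrelated. If the two variables are correlated, the coefficients have to be solved jointly, because each slope must account for the other variable.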

5. Intercept Calculation:
   - Calculate the intercept:
     - intercept = mean_y - (slope_X1 * mean_X1) - (slope_X2 * mean_X2)
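
As a minimal sketch of steps 1 to 5, the snippet below applies the formulas to a small made-up dataset. The values are hypothetical and chosen so that X1 and X2 are uncorrelated, which is the case where the per-variable slope formula is exact; y is generated as 3 + 2*X1 + 5*X2, so the steps should recover a slope of 2 for X1, a slope of 5 for X2, and an intercept of 3.

```python
import numpy as np

# Hypothetical toy data (not from the dataset used later): X1 and X2 are
# constructed to be uncorrelated, and y is generated as 3 + 2*X1 + 5*X2.
X1 = np.array([1.0, 2.0, 3.0, 4.0])
X2 = np.array([1.0, -1.0, -1.0, 1.0])
y = 3.0 + 2.0 * X1 + 5.0 * X2
n = len(y)

# Step 1: means
mean_X1 = X1.sum() / n
mean_X2 = X2.sum() / n
mean_y = y.sum() / n

# Step 2: variances (population form, dividing by n)
variance_X1 = ((X1 - mean_X1) ** 2).sum() / n
variance_X2 = ((X2 - mean_X2) ** 2).sum() / n

# Step 3: covariances with y
covariance_X1_y = ((X1 - mean_X1) * (y - mean_y)).sum() / n
covariance_X2_y = ((X2 - mean_X2) * (y - mean_y)).sum() / n

# Step 4: slopes
slope_X1 = covariance_X1_y / variance_X1
slope_X2 = covariance_X2_y / variance_X2

# Step 5: intercept
intercept = mean_y - slope_X1 * mean_X1 - slope_X2 * mean_X2

print(slope_X1, slope_X2, intercept)
```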

The final linear regression equation for the given data is:
   - y = intercept + (slope_X1 * X1) + (slope_X2 * X2)

This equation represents the relationship between the dependent variable (y) and the independent variables (X1 and X2) based on the given data.
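
For reference, the general two-variable case can be written as a pair of simultaneous equations in the slopes. The notation here is an assumption made for illustration: b_0 stands for the intercept, b_1 and b_2 for slope_X1 and slope_X2, and var and cov are the population variance and covariance defined in steps 2 and 3.

```latex
\begin{aligned}
\operatorname{var}(X_1)\, b_1 + \operatorname{cov}(X_1, X_2)\, b_2 &= \operatorname{cov}(X_1, y) \\
\operatorname{cov}(X_1, X_2)\, b_1 + \operatorname{var}(X_2)\, b_2 &= \operatorname{cov}(X_2, y) \\
b_0 &= \bar{y} - b_1 \bar{X}_1 - b_2 \bar{X}_2
\end{aligned}
```

When cov(X1, X2) = 0 the first two equations decouple and reduce to the per-variable slope formulas of step 4.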

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

class LinearRegression:
    """
    Linear regression model class.
    """

    def __init__(self, dataset):
        """
        Initialize LinearRegression object.

        Args:
            dataset (str): Path to the CSV dataset file.
        """
        self.dataset = dataset
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.slopes = None
        self.intercept = None

    def load_data(self):
        """
        Load the dataset from CSV using pandas.

        Returns:
            None
        """
        data = pd.read_csv(self.dataset)
        self.X = data.drop('income', axis=1)
        self.y = data['income']

    def train_test_split(self, test_size=0.2, random_state=42):
        """
        Split the data into training and testing sets.

        Args:
            test_size (float): The proportion of the dataset to include in the test split (default: 0.2).
            random_state (int): Controls the shuffling applied to the data before splitting (default: 42).

        Returns:
            None
        """
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y, test_size=test_size, random_state=random_state)
    
    def calculate_mean(self, data):
        """
        Calculate the mean of the given data.

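        Args:
            data (pandas.DataFrame or pandas.Series): Data to average.

        Returns:
            pandas.Series or float: Column-wise means for a DataFrame, a single mean for a Series.
        """
        # axis=0 makes the column-wise reduction explicit instead of relying on
        # the default behaviour of np.sum on a DataFrame.
        return np.sum(data, axis=0) / len(data)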


    def calculate_variance(self, data, mean):
        """
        Calculate the variance of the given data.

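        Args:
            data (pandas.DataFrame or pandas.Series): Data array.
            mean (pandas.Series or float): Mean of the data, as returned by calculate_mean.

        Returns:
            pandas.Series or float: Population variance (divides by n), per column for a DataFrame.
        """
        return np.sum((data - mean) ** 2, axis=0) / len(data)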

    def calculate_covariance(self, X, y, mean_X, mean_y):
        """
        Calculate the covariance between X and y.

        Args:
            X (pandas.DataFrame): Independent variables.
            y (pandas.Series): Dependent variable.
            mean_X (pandas.Series): Column-wise means of X.
            mean_y (float): Mean value of y.

        Returns:
            numpy.ndarray: Covariance of each column of X with y (one value per column).
        """
        return np.dot((X - mean_X).T, y - mean_y) / len(y)

    def predict(self, X):
        """
        Predict the dependent variable using the linear regression model.

        Args:
            X (pandas.DataFrame): Independent variables.

        Returns:
            numpy.ndarray: Predicted values.
        """
        # Step 7: Predict the values using the linear regression equation
        y_pred = self.intercept + np.dot(X, self.slopes)
        return y_pred

    def calculate_mse(self, y_true, y_pred):
        """
        Calculate the Mean Squared Error (MSE) between the true and predicted values.

        Args:
            y_true (array-like): True values.
            y_pred (array-like): Predicted values.

        Returns:
            float: Mean Squared Error (MSE).
        """
        residuals = y_true - y_pred
        squared_residuals = residuals ** 2
        mse = np.mean(squared_residuals)
        return mse

    def calculate_r2(self, y_true, y_pred):
        """
        Calculate the R-squared (coefficient of determination) between the true and predicted values.

        Args:
            y_true (array-like): True values.
            y_pred (array-like): Predicted values.

        Returns:
            float: R-squared (coefficient of determination).
        """
        tss = np.sum((y_true - self.calculate_mean(y_true)) ** 2)
        rss = np.sum((y_true - y_pred) ** 2)
        r2 = 1 - (rss / tss)
        return r2

    def fit(self):
        """
        Fit the linear regression model to the training data.

        Returns:
            None
        """
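        # Step 1: Load the data and split it into training and testing sets
        self.load_data()
        self.train_test_split()

        # Step 2: Mean Calculation
        mean_X = self.calculate_mean(self.X_train)
        mean_y = self.calculate_mean(self.y_train)
        # Step 3: Variance Calculation
        var_X = self.calculate_variance(self.X_train, mean_X)
        var_y = self.calculate_variance(self.y_train, mean_y)

        # Step 4: Covariance Calculation
        cov_X_y = self.calculate_covariance(self.X_train, self.y_train, mean_X, mean_y)

        # Step 5: Slope Calculation
        # Note: each slope is computed from its own column only (covariance with y
        # divided by that column's variance). This matches the multiple-regression
        # coefficients only when the columns of X are uncorrelated.
        self.slopes = cov_X_y / var_X

        # Step 6: Intercept Calculation
        self.intercept = mean_y - np.dot(self.slopes, mean_X)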

    def evaluate(self, y_true, y_pred):
        """
        Evaluate the model using MSE and R-squared.

        Args:
            y_true (array-like): True values.
            y_pred (array-like): Predicted values.

        Returns:
            tuple: (MSE, R-squared).
        """
        mse = self.calculate_mse(y_true, y_pred)
        r2 = self.calculate_r2(y_true, y_pred)
        return mse, r2

    def plot_data(self):
        """
        Plot the data.
        """
        plt.scatter(self.X_test['age'], self.y_test, color='blue', label='Actual')
        plt.scatter(self.X_test['age'], self.predict(self.X_test), color='red', label='Predicted')
        plt.xlabel('Age')
        plt.ylabel('Income')
        plt.legend()
        plt.show()

    def run_linear_regression(self):
        """
        Perform linear regression on the loaded dataset.

        Returns:
            None
        """
        self.fit()
        y_pred = self.predict(self.X_test)
        mse, r2 = self.evaluate(self.y_test, y_pred)
        print("Mean Squared Error (MSE):", mse)
        print("R-squared (R2):", r2)
        # self.plot_data()

# Usage
regression = LinearRegression('multi-dataset.csv')
regression.run_linear_regression()


Mean Squared Error (MSE): 33182463.612196453
R-squared (R2): -1.6980203779486103
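
A negative R-squared means the predictions on the test set are worse than simply predicting the mean of y. One plausible cause here is that the per-variable slope formula is applied to predictors that are correlated with each other, so their effects are counted more than once. The sketch below illustrates this on synthetic data; the data, coefficients, and helper names are hypothetical and are not taken from `multi-dataset.csv`. It compares the per-variable slopes with a joint least-squares solution from `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: x2 is built from x1, so the two predictors are correlated.
x1 = rng.normal(40.0, 10.0, n)
x2 = 0.5 * x1 + rng.normal(0.0, 2.0, n)
y = 1000.0 + 300.0 * x1 + 2000.0 * x2  # noiseless, so an exact fit exists

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)  # centered predictors
yc = y - y.mean()        # centered target

# Per-variable slopes: cov(Xj, y) / var(Xj), one column at a time.
# The 1/n factors in the covariance and variance cancel.
slopes_separate = (Xc.T @ yc) / (Xc ** 2).sum(axis=0)

# Joint least-squares slopes: both coefficients are solved for at once.
slopes_joint, *_ = np.linalg.lstsq(Xc, yc, rcond=None)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss


def predict(slopes):
    """Predict y from X using the given slopes and the matching intercept."""
    intercept = y.mean() - X.mean(axis=0) @ slopes
    return intercept + X @ slopes


r2_separate = r2(y, predict(slopes_separate))
r2_joint = r2(y, predict(slopes_joint))
print("per-variable slopes:", slopes_separate, "R2:", r2_separate)
print("joint slopes:       ", slopes_joint, "R2:", r2_joint)
```

On this synthetic data the joint solution recovers the generating coefficients, while the per-variable slopes are inflated because each one also absorbs part of the other predictor's effect. Whether the same mechanism fully explains the result above depends on how strongly the predictors in the actual dataset are correlated.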


In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

class LinearRegressionModel:
    def __init__(self, data_file):
        """
        Initialize the LinearRegressionModel object.

        Args:
            data_file (str): Path to the CSV data file.
        """
        self.data_file = data_file
        self.data = None
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.model = None

    def read_data(self):
        """
        Read the data from the CSV file.
        """
        self.data = pd.read_csv(self.data_file)
        self.X = self.data[['age', 'experience']]
        self.y = self.data['income']

    def split_data(self, test_size=0.2, random_state=42):
        """
        Split the data into training and testing sets.

        Args:
            test_size (float, optional): The proportion of the data to include in the test split.
            random_state (int, optional): Seed used by the random number generator.

        Returns:
            None
        """
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X, self.y, test_size=test_size, random_state=random_state
        )

    def train_model(self):
        """
        Train the linear regression model using the provided training data.
        """
        self.model = LinearRegression()
        self.model.fit(self.X_train, self.y_train)

    def evaluate_model(self):
        """
        Evaluate the trained linear regression model using the testing data.

        Returns:
            tuple: Mean Squared Error (MSE) and R-squared (coefficient of determination).
        """
        predictions = self.model.predict(self.X_test)
        mse = self.calculate_mse(predictions, self.y_test)
        r2 = self.calculate_r2(predictions, self.y_test)
        return mse, r2

    def calculate_mse(self, prediction, y):
        """
        Calculate the Mean Squared Error (MSE) between the predicted and actual values.

        Args:
            prediction (array-like): Predicted values.
            y (array-like): Actual values.

        Returns:
            float: Mean Squared Error (MSE).
        """
        residuals = prediction - y
        squared_residuals = residuals ** 2
        mse = np.mean(squared_residuals)
        return mse

    def calculate_r2(self, prediction, y):
        """
        Calculate the R-squared (coefficient of determination) between the predicted and actual values.

        Args:
            prediction (array-like): Predicted values.
            y (array-like): Actual values.

        Returns:
            float: R-squared (coefficient of determination).
        """
        tss = np.sum((y - np.mean(y)) ** 2)
        rss = np.sum((y - prediction) ** 2)
        r2 = 1 - (rss / tss)
        return r2

    def plot_data(self):
        """
        Plot the data.
        """
        plt.scatter(self.X['age'], self.y, color='blue', label='Actual')
        plt.scatter(self.X['age'], self.model.predict(self.X), color='red', label='Predicted')
        plt.xlabel('Age')
        plt.ylabel('Income')
        plt.legend()
        plt.show()

# Create an instance of the LinearRegressionModel class
model = LinearRegressionModel('multi-dataset.csv')

# Read the data from the dataset file
model.read_data()

# Split the data into training and testing sets
model.split_data()

# Train the linear regression model using the training set
model.train_model()

# Evaluate the model using the testing set
# (evaluate_model returns (mse, r2); the values are not printed in this cell)
mse, r2 = model.evaluate_model()
# Plot the data
# model.plot_data()
mean_age = model.data['age'].mean()
mean_ex = model.data['experience'].mean()

mean_income = model.data['income'].mean()
print("Mean Age:", mean_age)
print("Mean ex:", mean_ex)
print("Mean in:", mean_income)

Mean Age: 39.65
Mean ex: 6.2
Mean in: 40735.5
