# Linear Regression from Scratch in Python
This Jupyter Notebook implements a linear regression algorithm from scratch in Python for educational purposes. The model predicts a person's salary from their age and years of experience: in this dataset, older and more experienced employees tend to receive higher predicted salaries, while younger and less experienced employees receive lower ones. The scikit-learn machine learning library, together with NumPy, Matplotlib, Pandas and SciPy, is used to pre-process the data and to compute performance metrics for the model.
At the end of the notebook, scikit-learn's own linear regression model is trained on the same data so that its test results can be compared against the from-scratch Python implementation.

Although this dataset is very small for a statistical model to make predictions on, it provides an introduction to machine learning algorithms and gradient descent. On a larger dataset, this implementation may return missing values (NaN) as output predictions where linear regression implementations from well-optimised machine learning libraries such as scikit-learn, PyTorch or TensorFlow would not. Larger datasets also require optimised data loading code, which has been left out of the linear regression object to keep the code simple for educational purposes.
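
One common way NaN values can arise in plain gradient descent is divergence: if the features are left in raw units, or the learning rate is too large for the data, each step overshoots, the weights overflow, and later updates become NaN. The sketch below illustrates this on hypothetical synthetic data; the `gradient_descent` helper and the generated `age` column are assumptions made for the illustration and are not part of the notebook.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Plain batch gradient descent on the mean squared error loss."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    # Overflow is expected in the diverging run, so NumPy warnings are silenced
    with np.errstate(all="ignore"):
        for _ in range(n_iters):
            error = X @ weights + bias - y
            weights -= lr * (2 / n_samples) * (X.T @ error)
            bias -= lr * (2 / n_samples) * np.sum(error)
    return weights, bias

# Hypothetical synthetic data: one "age"-like feature in raw units
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)
X_raw = age.reshape(-1, 1)
y = 1000.0 * age + 5000.0

# Raw features: the steps overshoot and the weights end up non-finite
w_raw, b_raw = gradient_descent(X_raw, y)

# Standardised features: the same learning rate converges
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
w_std, b_std = gradient_descent(X_std, y)

print("raw weights finite:", np.isfinite(w_raw).all())
print("standardised weights finite:", np.isfinite(w_std).all())
```

This is only one possible cause; checking `np.isfinite` on the weights after training is a cheap way to detect it.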

# 1. Linear Regression Object

In [1]:
# Imports the NumPy library module
import numpy as np

# Linear Regression Python class object
class LinearRegression:
    """Linear Regression (LR) model in Python from scratch.
       The code comments throughout the LR model explain the code.
       Object Parameters:
       x = features of the dataset (2D, one row per sample)
       y = target values of the dataset
       lr = learning rate (step size of gradient descent)
       n_iters = number of gradient descent iterations
       lr and n_iters are hyperparameters; x and y are the training data,
       so the model can be trained on different datasets.
    """

    # Linear Regression Python object constructor
    def __init__(self, x, y, lr=0.01, n_iters=10000):
        self.features = x           # Stores the features from the dataset
        self.target = y             # Stores the target values y from the dataset
        self.lr = lr                # Learning rate
        self.n_iters = n_iters      # Number of iterations
        self.weights = None         # Placeholder; the weights are initialised to zeros in fit()
        self.bias = None            # Placeholder; the bias is initialised to zero in fit()
        self.n_samples = len(x)     # Sets the number of samples to the length of the features x

    # Python method to fit the data to the Multiple Linear Regression model
    def fit(self):
        '''
        Fit method to train the linear regression model.
        '''

        # Initialize weights and bias to zero values
        self.weights = np.zeros(self.features.shape[1])
        self.bias = 0

        # Gradient Descent implementation
        for i in range(self.n_iters):
            # Line equation that stores the predictions of the LR model during training
            y_pred = np.dot(self.features, self.weights) + self.bias

            # Calculate derivatives of the weights and bias
            dw = (1 / self.n_samples) * (2 * np.dot(self.features.T, (y_pred - self.target)))
            db = (1 / self.n_samples) * (2 * np.sum(y_pred - self.target))

            # Updates the weights and the bias of the LR during training
            self.weights = self.weights - self.lr * dw
            self.bias = self.bias - self.lr * db

    # Python prediction method for the LR algorithm
    def predict(self, X):
        ''' Makes predictions using the line equation.
            X: features from the dataset (X_test)
        '''

        # Returns the prediction made from the LR model
        return np.dot(X, self.weights) + self.bias
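
The `dw` and `db` expressions in `fit` are the partial derivatives of the mean squared error with respect to the weights and the bias. A minimal way to check such hand-derived gradients is to compare them against central finite differences of the loss. The helper names and the synthetic data below are hypothetical and exist only for this check.

```python
import numpy as np

def mse_loss(weights, bias, X, y):
    """Mean squared error of the line equation X @ weights + bias."""
    return np.mean((X @ weights + bias - y) ** 2)

def analytic_gradients(weights, bias, X, y):
    """Same formulas as in fit(): dw = (2/n) X^T (y_pred - y), db = (2/n) sum(y_pred - y)."""
    n_samples = len(X)
    error = X @ weights + bias - y
    dw = (2 / n_samples) * (X.T @ error)
    db = (2 / n_samples) * np.sum(error)
    return dw, db

def numerical_gradients(weights, bias, X, y, eps=1e-6):
    """Central finite differences of the loss, one parameter at a time."""
    dw = np.zeros_like(weights)
    for j in range(len(weights)):
        step = np.zeros_like(weights)
        step[j] = eps
        dw[j] = (mse_loss(weights + step, bias, X, y)
                 - mse_loss(weights - step, bias, X, y)) / (2 * eps)
    db = (mse_loss(weights, bias + eps, X, y)
          - mse_loss(weights, bias - eps, X, y)) / (2 * eps)
    return dw, db

# Hypothetical synthetic data and an arbitrary point in parameter space
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=20)
weights, bias = np.array([0.5, -0.5]), 0.2

dw_a, db_a = analytic_gradients(weights, bias, X, y)
dw_n, db_n = numerical_gradients(weights, bias, X, y)
print("weights gradient matches:", np.allclose(dw_a, dw_n, atol=1e-5))
print("bias gradient matches:", np.isclose(db_a, db_n, atol=1e-5))
```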

# 2. Load the Dataset

Dataset Available from Kaggle:

https://www.kaggle.com/datasets/hussainnasirkhan/multiple-linear-regression-dataset

In [2]:
# Imports the Pandas library to use in the Jupyter Notebook
import pandas as pd

# Load the data to the Jupyter Notebook
df = pd.read_csv('./dataset/multiple_linear_regression_dataset.csv')

df

Unnamed: 0,age,experience,income
0,25,1,30450
1,30,3,35670
2,47,2,31580
3,32,5,40130
4,43,10,47830
5,51,7,41630
6,28,5,41340
7,33,4,37650
8,37,5,40250
9,39,8,45150


In [3]:
# Separate the target values from the dataset using the Pandas pop method
y = df.pop('income')

y

0     30450
1     35670
2     31580
3     40130
4     47830
5     41630
6     41340
7     37650
8     40250
9     45150
10    27840
11    46110
12    36720
13    34800
14    51300
15    38900
16    63600
17    30870
18    44190
19    48700
Name: income, dtype: int64

In [4]:
# Stores the remaining feature columns in X and displays them
X = df

X

Unnamed: 0,age,experience
0,25,1
1,30,3
2,47,2
3,32,5
4,43,10
5,51,7
6,28,5
7,33,4
8,37,5
9,39,8


# 3. Normalise the Data

In [5]:
# Convert column names to a list
list_numerical = X.columns.tolist()

list_numerical

['age', 'experience']

In [6]:
# Python function to standardise the features X of the dataset (zero mean, unit variance)
def get_scale(list_numerical, X):

    # Import the StandardScaler object from scikit-learn to standardise the data
    from sklearn.preprocessing import StandardScaler

    # Creates the StandardScaler object and fits it to the listed feature columns
    scaler = StandardScaler().fit(X[list_numerical])

    # Applies the fitted scaling to the dataset features X (modifies X in place)
    X[list_numerical] = scaler.transform(X[list_numerical])

    # Returns the standardised dataset features X
    return X
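
As a rough sketch of what this scaling does per column: subtract the column mean and divide by the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses. The five rows below are copied from the top of the dataset purely for illustration; because the notebook fits the scaler on all rows, the numbers will not match the notebook output.

```python
import numpy as np

# First five (age, experience) rows of the dataset, used only as a small example
X_raw = np.array([[25.0, 1.0],
                  [30.0, 3.0],
                  [47.0, 2.0],
                  [32.0, 5.0],
                  [43.0, 10.0]])

mean = X_raw.mean(axis=0)      # per-column mean
std = X_raw.std(axis=0)        # per-column population standard deviation (ddof=0)
X_scaled = (X_raw - mean) / std

# After scaling, every column has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```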

In [7]:
# Calls the get_scale Python method to normalise the dataset features X
X = get_scale(list_numerical, X)

X

Unnamed: 0,age,experience
0,-1.498903,-1.293548
1,-0.987332,-0.79603
2,0.752009,-1.044789
3,-0.782703,-0.298511
4,0.342752,0.945285
5,1.161266,0.199007
6,-1.19196,-0.298511
7,-0.680389,-0.54727
8,-0.271133,-0.298511
9,-0.066504,0.447767


# 4. Train Test Split

In [8]:
# Imports the train test split method from ScikitLearn
from sklearn.model_selection import train_test_split

# Carry out a train, test split from ScikitLearn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Prints out the X_train features to the screen
X_train

Unnamed: 0,age,experience
8,-0.271133,-0.298511
5,1.161266,0.199007
11,0.752009,0.696526
3,-0.782703,-0.298511
18,0.445067,0.696526
16,1.877465,2.6866
13,1.161266,-0.54727
2,0.752009,-1.044789
9,-0.066504,0.447767
19,-0.271133,0.945285


In [9]:
# Prints out the y_train targets to the screen
y_train

8     40250
5     41630
11    46110
3     40130
18    44190
16    63600
13    34800
2     31580
9     45150
19    48700
4     47830
12    36720
7     37650
10    27840
14    51300
6     41340
Name: income, dtype: int64

In [10]:
# Prints out the X_test features to the screen
X_test

Unnamed: 0,age,experience
0,-1.498903,-1.293548
17,-1.703531,-1.293548
15,0.138124,-0.049752
1,-0.987332,-0.79603


In [11]:
# Prints out the y_test targets to the screen
y_test

0     30450
17    30870
15    38900
1     35670
Name: income, dtype: int64
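
Conceptually, the split shuffles the row indices and holds out a fraction of them as the test set. The sketch below shows that idea with NumPy only. `simple_train_test_split` is a hypothetical helper, and its shuffle will not reproduce the exact rows that `train_test_split` selects with `random_state=42`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices, then hold out a test_size fraction as the test set."""
    n_samples = len(X)
    n_test = int(np.ceil(n_samples * test_size))
    indices = np.random.default_rng(seed).permutation(n_samples)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical data with 20 rows, the same number of rows as the dataset
X_demo = np.arange(40, dtype=float).reshape(20, 2)
y_demo = np.arange(20, dtype=float)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

With 20 rows and `test_size=0.2`, this gives 16 training rows and 4 test rows, matching the sizes shown above.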

# 5. Train the Linear Regression Model

In [12]:
# Creating the LinearRegression class object
model = LinearRegression(X_train, y_train)

# Train the model with the fit method
model.fit()

# 6. Make Predictions on the Test Dataset

In [13]:
# Makes predictions on the test dataset X_test
y_pred = model.predict(X_test)

y_pred

array([31093.38107376, 31295.49954076, 40250.46080162, 34897.6958918 ])

In [14]:
# Prints out the y_test variable to the screen
y_test

0     30450
17    30870
15    38900
1     35670
Name: income, dtype: int64

In [15]:
# Converts the y_test Series (already 1D) to a NumPy array and stores it in the y variable
y = y_test.values.flatten()

y

array([30450, 30870, 38900, 35670], dtype=int64)

# 7. PLCC, SRCC and KRCC Correlation Performance Metrics Method

In [16]:
# Python method to calculate correlation coefficient metrics
def correlation(X, y):
    
    # Import the SciPy library
    import scipy.stats
    
    # Calculate Pearson's Linear Correlation Coefficient with Scipy
    PLCC = scipy.stats.pearsonr(X, y)[0]    # Pearson's r

    # Calculate Spearman Rank Correlation Coefficient (SRCC) with Scipy
    SRCC = scipy.stats.spearmanr(X, y)[0]   # Spearman's rho

    # Calculate Kendalls Rank Correlation Coefficient (KRCC) with Scipy
    KRCC = scipy.stats.kendalltau(X, y)[0]  # Kendall's tau
    
    # Prints out the correlation performance metric results to the screen
    print("PLCC: ", PLCC)
    print("SRCC: ", SRCC)
    print("KRCC: ", KRCC)

# 8. Evaluate PLCC, SRCC and KRCC Performance Metrics

In [17]:
# Calls the correlation coefficient Python method to calculate PLCC, SRCC, and KRCC
# with SciPy
correlation(y_pred, y)

PLCC:  0.9791244408611758
SRCC:  1.0
KRCC:  1.0
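
To make the three metrics less of a black box, the sketch below recomputes them with NumPy from the four predictions and targets printed earlier in the notebook. The helper functions are hypothetical, and the rank helper assumes there are no tied values, which holds for these four points.

```python
import numpy as np

# Predictions and targets copied from the notebook output
y_pred = np.array([31093.38107376, 31295.49954076, 40250.46080162, 34897.6958918])
y_true = np.array([30450, 30870, 38900, 35670], dtype=float)

def pearson(a, b):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2))

def ranks(a):
    """Rank positions 1..n (valid only when there are no tied values)."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

def kendall_tau(a, b):
    """Kendall's tau without ties: (concordant - discordant) pairs / total pairs."""
    n = len(a)
    score = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            score += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return score / (n * (n - 1) / 2)

plcc = pearson(y_pred, y_true)
srcc = pearson(ranks(y_pred), ranks(y_true))   # Spearman is Pearson applied to ranks
krcc = kendall_tau(y_pred, y_true)
print("PLCC:", plcc, "SRCC:", srcc, "KRCC:", krcc)
```

SRCC and KRCC are exactly 1.0 here because the predictions sort the four test samples in the same order as the targets; with only four points, rank metrics are coarse.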


# Root Mean Square Error (RMSE)

In [18]:
# Imports the mean_squared_error function from scikit-learn
from sklearn.metrics import mean_squared_error

# Calculates the Mean Squared Error (MSE); the signature is (y_true, y_pred)
mse = mean_squared_error(y, y_pred)

# Takes the square root of the MSE to obtain the RMSE and displays it
np.sqrt(mse)

868.2147023480362
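
The RMSE is the square root of the mean of the squared residuals. A minimal NumPy version, using the four predictions and targets printed earlier in the notebook, might look like this:

```python
import numpy as np

# Predictions and targets copied from the notebook output
y_pred = np.array([31093.38107376, 31295.49954076, 40250.46080162, 34897.6958918])
y_true = np.array([30450, 30870, 38900, 35670], dtype=float)

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)   # mean squared error
rmse = np.sqrt(mse)             # back in the units of the target (income)
print(round(rmse, 4))
```

Because the RMSE is in the same units as the target, a value of about 868 can be read directly as a typical prediction error in income units.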

# R2 Score

In [19]:
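# Imports the r2_score function from scikit-learn
from sklearn.metrics import r2_score

# Calls the R2 Score function from scikit-learn
# Note: the signature is r2_score(y_true, y_pred) and the score is not symmetric;
# the call below passes the predictions first, so the value shown treats
# y_pred as the reference values
r2 = r2_score(y_pred, y)

# Displays the R2 Score value
r2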

0.9452244815314687
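
Unlike the mean squared error, R² depends on which array is treated as the ground truth, because the residual sum of squares is divided by the variance of the reference array. The sketch below uses a hypothetical `r2` helper that follows the usual definition, 1 - SS_res / SS_tot, to compare both argument orders on the four test points.

```python
import numpy as np

# Predictions and targets copied from the notebook output
y_pred = np.array([31093.38107376, 31295.49954076, 40250.46080162, 34897.6958918])
y_true = np.array([30450, 30870, 38900, 35670], dtype=float)

def r2(reference, estimate):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot from the reference."""
    ss_res = np.sum((reference - estimate) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_targets_as_reference = r2(y_true, y_pred)      # conventional order
r2_predictions_as_reference = r2(y_pred, y_true)  # swapped order
print(round(r2_targets_as_reference, 4), round(r2_predictions_as_reference, 4))
```

On these four points the conventional order gives roughly 0.9387, while the swapped order gives roughly 0.9452.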

# 9. Test on the ScikitLearn Linear Regression Model

In [20]:
# Imports the Linear Regression model from scikit-learn
# (this rebinds the name LinearRegression, replacing the from-scratch class above)
from sklearn.linear_model import LinearRegression

# Trains the scikit-learn Linear Regression model
model = LinearRegression().fit(X_train, y_train)

# 10. Make Predictions

In [21]:
# Makes predictions on the Linear Regression model
y_pred = model.predict(X_test)

y_pred

array([31093.38107376, 31295.49954076, 40250.46080162, 34897.6958918 ])
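
These predictions agree with the from-scratch model to many decimal places. A plausible explanation is that both methods minimise the same squared error: scikit-learn solves the ordinary least squares problem directly, and gradient descent on standardised features converges to the same minimum given enough iterations. The sketch below illustrates this on hypothetical synthetic data, using `np.linalg.lstsq` as a stand-in for the direct solution.

```python
import numpy as np

# Hypothetical synthetic data with two standardised features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([4.0, -1.5]) + 10.0 + rng.normal(scale=0.5, size=50)

# Direct least squares solution: append a column of ones for the intercept
A = np.column_stack([X, np.ones(len(X))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
w_exact, b_exact = solution[:2], solution[2]

# Gradient descent with the same update rule and defaults as the notebook's fit()
w, b = np.zeros(2), 0.0
for _ in range(10000):
    error = X @ w + b - y
    w -= 0.01 * (2 / len(X)) * (X.T @ error)
    b -= 0.01 * (2 / len(X)) * np.sum(error)

print("weights agree:", np.allclose(w, w_exact))
print("bias agrees:", np.isclose(b, b_exact))
```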

In [22]:
# Flattens the prediction array (already 1D here) and stores it in the flat Python variable
flat = y_pred.flatten()

flat

array([31093.38107376, 31295.49954076, 40250.46080162, 34897.6958918 ])

In [23]:
# Prints out the y_test variable to the screen
y_test

0     30450
17    30870
15    38900
1     35670
Name: income, dtype: int64

In [24]:
# Converts the y_test Series (already 1D) to a NumPy array and stores it in the y variable
y = y_test.values.flatten()

y

array([30450, 30870, 38900, 35670], dtype=int64)

# 11. Test the ScikitLearn Linear Regression Model

In [25]:
# Calls the correlation coefficient Python method to calculate PLCC, SRCC, and KRCC
# with SciPy
correlation(flat, y)

PLCC:  0.9791244408611761
SRCC:  1.0
KRCC:  1.0


# Root Mean Square Error (RMSE)

In [26]:
# Imports the mean_squared_error function from scikit-learn
from sklearn.metrics import mean_squared_error

# Calculates the Mean Squared Error (MSE); the signature is (y_true, y_pred)
mse = mean_squared_error(y, flat)

# Takes the square root of the MSE to obtain the RMSE and displays it
np.sqrt(mse)

868.2147023481325

# R2 Score

In [27]:
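# Imports the r2_score function from scikit-learn
from sklearn.metrics import r2_score

# Calls the R2 Score function from scikit-learn
# Note: the signature is r2_score(y_true, y_pred) and the score is not symmetric;
# the call below passes the predictions first, so the value shown treats
# flat as the reference values
r2 = r2_score(flat, y)

# Displays the R2 Score value
r2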

0.9452244815314554

# 12. Display Jupyter Notebook Python Variables

In [28]:
# Prints out all the Python variables from the Jupyter Notebook
%whos

Variable             Type                Data/Info
--------------------------------------------------
LinearRegression     ABCMeta             <class 'sklearn.linear_mo<...>._base.LinearRegression'>
X                    DataFrame                    age  experience\<...>n19 -0.271133    0.945285
X_test               DataFrame                    age  experience\<...>n1  -0.987332   -0.796030
X_train              DataFrame                    age  experience\<...>n6  -1.191960   -0.298511
correlation          function            <function correlation at 0x00000226D5E643A0>
df                   DataFrame                    age  experience\<...>n19 -0.271133    0.945285
flat                 ndarray             4: 4 elems, type `float64`, 32 bytes
get_scale            function            <function get_scale at 0x00000226D3835C10>
list_numerical       list                n=2
mean_squared_error   function            <function mean_squared_er<...>or at 0x00000226D5D938B0>
model                Li