# Lasso Regression from Scratch in Python

This Jupyter Notebook implements a Lasso regression machine learning algorithm from scratch in Python for educational purposes. The model predicts a person's salary from their age and years of experience: the older and more experienced an employee is, the higher the predicted salary, and the younger and less experienced, the lower. ScikitLearn, NumPy, Matplotlib, Pandas and SciPy are used to pre-process the data and evaluate the model's performance metrics. At the end of the notebook, the ScikitLearn Lasso regression algorithm is trained on the same data so that its test results can be compared against the from-scratch Python implementation.

The Lasso regression algorithm adds an L1 regularisation penalty to the model to reduce overfitting. With L1 regularisation, the sum of the absolute values of the model weights, scaled by a regularisation parameter lambda, is added to the least squares cost function during model training.
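As a sketch of the objective described above, the cost and its (sub)gradient can be written as follows. The notation is an assumption rather than the notebook's own: $n$ is the number of samples, $X$ is the feature matrix with a leading column of ones, $w$ is the weight vector (whose first entry acts as the bias) and $\lambda$ is the regularisation parameter. The terms follow the `__calculate_cost` method and the `dw` update in the `Regression` class of Section 1.

```latex
J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 + \lambda \sum_{j} \lvert w_j \rvert,
\qquad \hat{y} = X w

\frac{\partial J}{\partial w} = \frac{1}{n} X^{\top} \left( X w - y \right) + \lambda \, \operatorname{sign}(w)
```

Because the bias is stored as the first weight, the penalty in this implementation also applies to the bias term.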

Although this is a very small dataset for a statistical model to make predictions on, it provides an introduction to machine learning algorithms and to gradient descent with hand-derived partial derivatives. On a larger dataset, this implementation may return missing values (NaN) as output predictions, unlike Lasso regression implementations built on well-optimised machine learning libraries such as ScikitLearn, PyTorch or TensorFlow. Larger datasets also require optimised data loading software, which has been left out of the Lasso regression model object to reduce code complexity for educational purposes.


# 1. Lasso Regression Object

In [1]:
import numpy as np

# Regression class object to process continuous data
class Regression:
    
    # Regression class constructor
    def __init__(self, regularization, lr, n_iters):
        self.n_samples = None # Dataset samples
        self.features = None # Features from the dataset
        self.weights = None # weight parameters
        self.bias = None # bias parameter
        self.regularization = regularization # regularisation penalty object
        self.lr = lr # learning rate
        self.n_iters = n_iters # iterations
        
    # Regression class cost function to calculate the loss of the model to determine the performance
    def __calculate_cost(self, y, y_pred):
        
        # Calculates the cost of the regression model during training and returns the cost value
        return (1 / (2*self.n_samples)) * np.sum(np.square(y_pred-y)) + self.regularization(self.weights)
        
    # Python fit method to fit the regression model during training
    def fit(self, X, y):

        # Retrieves the length of the y_train target values in preparation for reshaping them into a column vector
        dim_1 = len(y)

        # Converts the y_train target values (a Pandas Series) to a NumPy array of shape (n_samples, 1)
        y = y.values.reshape(dim_1, 1)
        
        # Inserts a column of ones at index 0 so that the first weight acts as the bias term
        X = np.insert(X, 0, 1, axis=1)
        
        # Stores the number of samples and features (including the bias column) from the dataset
        self.n_samples, self.features = X.shape
        
        # Initialises the weight parameters to zeros before training
        self.weights = np.zeros((self.features,1))
        
        # Training loop for the regression model
        for e in range(1, self.n_iters+1):
            
            # Stores the training predictions made by the regression model
            y_pred = np.dot(X, self.weights)
            
            # Calculates the cost of the regression model during training
            cost = self.__calculate_cost(y, y_pred)
            
            # Partial derivatives of the cost with respect to the weight parameters
            dw = (1/self.n_samples) * np.dot(X.T, (y_pred - y)) + self.regularization.derivation(self.weights)
            
            # Updates the weight parameters after each training iteration
            self.weights = self.weights - self.lr * dw
            
            # Prints the cost of the regression model after every 100 iterations
            if e % 100 == 0:
                
                # Print statement to print the cost to the screen with an F string
                print(f"The Cost in iteration {e}:\t {cost}")

        # Prints out training complete to the screen
        print("Training Complete")
        
    # Python predict method to make predictions on the trained regression model at test time
    def predict(self, X_test):
        
        # Inserts a column of ones at index 0 of X_test to match the bias column added during training
        X_test = np.insert(X_test, 0 , 1, axis= 1)
        
        # Stores the predictions made from the regression model at test time
        y_pred = np.dot(X_test, self.weights)
        
        # Returns the predictions made from the regression model at test time
        return y_pred

In [2]:
# LassoPenalty class object
class LassoPenalty:
    
    # LassoPenalty Class constructor
    def __init__(self, l):
        self.l = l # lambda regularization parameter
        
    # Calculates the L1 penalty: lambda multiplied by the sum of the absolute weight values
    def __call__(self,w):
        
        # Returns the L1 penalty term that is added to the cost function
        return self.l * np.sum(np.abs(w))
        
    # Calculates the derivative (subgradient) of the L1 penalty with respect to the weights
    def derivation(self, w):
        
        # Returns lambda multiplied by the sign of each weight
        return self.l * np.sign(w)
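The two methods of `LassoPenalty` can be checked by hand on a small weight vector. The snippet below is a minimal sketch with illustrative values (`w_demo` and `lam` are hypothetical and not taken from the notebook); it repeats the same NumPy expressions outside the class.

```python
import numpy as np

# Hypothetical weight vector; the first entry plays the role of the bias weight
w_demo = np.array([[2.0], [-0.5], [0.0]])
lam = 0.1  # illustrative regularisation parameter lambda

# L1 penalty: lambda * (|2.0| + |-0.5| + |0.0|)
penalty = lam * np.sum(np.abs(w_demo))

# Subgradient of the penalty: lambda * sign(w)
# np.sign returns 0 for a weight that is exactly 0
subgradient = lam * np.sign(w_demo)

print(penalty)
print(subgradient.ravel())
```

Note that a weight that is exactly zero receives no push from the penalty, while any non-zero weight is pushed towards zero by a constant amount `lam`, regardless of its size.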

In [3]:
# Lasso regression Python object that inherits the regression class object
class Lasso(Regression):
    
    # Lasso regression class constructor
    def __init__(self, l, lr, n_iters):
        
        # Creates the LassoPenalty object with the regularization parameter lambda
        self.regularization = LassoPenalty(l)
        
        # Calls the Regression class constructor to pass in the regularisation penalty object, 
        # learning rate and number of iterations (n_iters)
        super().__init__(self.regularization, lr, n_iters)
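To see the training loop in isolation, the following is a condensed, function-based sketch of the same update rule on synthetic data. The function name `lasso_gd`, the synthetic data and the true coefficients are assumptions made for illustration; only the cost and gradient expressions mirror the classes above.

```python
import numpy as np

def lasso_gd(X, y, lam=0.1, lr=0.1, n_iters=700):
    """Gradient descent on the L1-penalised least squares cost (illustrative sketch)."""
    # Prepend a column of ones so that the first weight acts as the bias
    X = np.insert(X, 0, 1.0, axis=1)
    n_samples, n_features = X.shape
    y = y.reshape(n_samples, 1)
    w = np.zeros((n_features, 1))
    costs = []
    for _ in range(n_iters):
        y_pred = X @ w
        # Least squares cost plus the L1 penalty
        costs.append(np.sum((y_pred - y) ** 2) / (2 * n_samples) + lam * np.sum(np.abs(w)))
        # Gradient of the squared error plus the subgradient of the penalty
        grad = X.T @ (y_pred - y) / n_samples + lam * np.sign(w)
        w = w - lr * grad
    return w, costs

# Synthetic data (hypothetical): y = 3 + 2*x1 - 1*x2, with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

w_fit, costs = lasso_gd(X_demo, y_demo)
print(w_fit.ravel())
print(costs[0], costs[-1])
```

The fitted weights should land close to, but slightly smaller in magnitude than, the true coefficients, because the penalty shrinks every weight (including the bias) towards zero.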

# 2. Load the Dataset

Dataset Available from Kaggle:

https://www.kaggle.com/datasets/hussainnasirkhan/multiple-linear-regression-dataset


In [4]:
# Imports the Pandas library to use in the Jupyter Notebook
import pandas as pd

# Load the data to the Jupyter Notebook
df = pd.read_csv('./dataset/multiple_linear_regression_dataset.csv')

df

Unnamed: 0,age,experience,income
0,25,1,30450
1,30,3,35670
2,47,2,31580
3,32,5,40130
4,43,10,47830
5,51,7,41630
6,28,5,41340
7,33,4,37650
8,37,5,40250
9,39,8,45150


In [5]:
# Separate the target values from the dataset using the Pandas pop method
y = df.pop('income')

y

0     30450
1     35670
2     31580
3     40130
4     47830
5     41630
6     41340
7     37650
8     40250
9     45150
10    27840
11    46110
12    36720
13    34800
14    51300
15    38900
16    63600
17    30870
18    44190
19    48700
Name: income, dtype: int64

In [6]:
# Stores the remaining feature columns in X and displays them
X = df

X

Unnamed: 0,age,experience
0,25,1
1,30,3
2,47,2
3,32,5
4,43,10
5,51,7
6,28,5
7,33,4
8,37,5
9,39,8


# 3. Normalise the Data

In [7]:
# Convert column names to a list
list_numerical = X.columns.tolist()

list_numerical

['age', 'experience']

In [8]:
# Python function to normalise the features X of the dataset (zero mean, unit variance)
def get_scale(list_numerical, X):
    
    # Import the StandardScaler object from ScikitLearn to normalise the data
    from sklearn.preprocessing import StandardScaler
    
    # Creates the StandardScaler object and fits it to the dataset features X
    scaler = StandardScaler().fit(X[list_numerical]) 

    # Applies the fitted scaling to the dataset features X
    X[list_numerical] = scaler.transform(X[list_numerical])

    # Returns the normalised dataset features X
    return X

In [9]:
# Calls the get_scale function to normalise the dataset features X
X = get_scale(list_numerical, X)

X

Unnamed: 0,age,experience
0,-1.498903,-1.293548
1,-0.987332,-0.79603
2,0.752009,-1.044789
3,-0.782703,-0.298511
4,0.342752,0.945285
5,1.161266,0.199007
6,-1.19196,-0.298511
7,-0.680389,-0.54727
8,-0.271133,-0.298511
9,-0.066504,0.447767
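`StandardScaler` subtracts each column's mean and divides by its population standard deviation. In this notebook the scaler is fitted on all rows before the train/test split, so the test rows contribute to the mean and standard deviation. A common alternative is to compute the statistics from the training rows only. The sketch below shows that alternative in plain NumPy; the five rows are the first rows of the table in Section 2, and the 4/1 split is hypothetical, so the scaled values will differ from those shown above.

```python
import numpy as np

# First five (age, experience) rows from the table in Section 2, used as a toy sample
X_demo = np.array([[25, 1], [30, 3], [47, 2], [32, 5], [43, 10]], dtype=float)

# Hypothetical split: first four rows for training, last row for testing
X_train_demo, X_test_demo = X_demo[:4], X_demo[4:]

# Column means and population standard deviations (ddof=0) from the training rows only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

# Apply the same training statistics to both splits
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(X_train_scaled)
print(X_test_scaled)
```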


# 4. Train Test Split

In [10]:
# Imports the train test split method from ScikitLearn
from sklearn.model_selection import train_test_split

# Splits the features and targets into training (80%) and test (20%) sets with ScikitLearn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Prints out the X_train features to the screen
X_train

Unnamed: 0,age,experience
8,-0.271133,-0.298511
5,1.161266,0.199007
11,0.752009,0.696526
3,-0.782703,-0.298511
18,0.445067,0.696526
16,1.877465,2.6866
13,1.161266,-0.54727
2,0.752009,-1.044789
9,-0.066504,0.447767
19,-0.271133,0.945285


In [11]:
# Prints out the y_train targets to the screen
y_train

8     40250
5     41630
11    46110
3     40130
18    44190
16    63600
13    34800
2     31580
9     45150
19    48700
4     47830
12    36720
7     37650
10    27840
14    51300
6     41340
Name: income, dtype: int64

In [12]:
# Prints out the X_test features to the screen
X_test

Unnamed: 0,age,experience
0,-1.498903,-1.293548
17,-1.703531,-1.293548
15,0.138124,-0.049752
1,-0.987332,-0.79603


In [13]:
# Prints out the y_test targets to the screen
y_test

0     30450
17    30870
15    38900
1     35670
Name: income, dtype: int64
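With 20 samples and `test_size=0.2`, the split above leaves 16 rows for training and 4 for testing. The idea behind a shuffled split can be sketched in NumPy as below. The helper `simple_train_test_split` is hypothetical and does not reproduce the exact row order that ScikitLearn selects for `random_state=42`.

```python
import numpy as np

def simple_train_test_split(n_samples, test_size=0.2, seed=42):
    """Return shuffled (train_indices, test_indices) for n_samples rows (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Random ordering of the row indices 0 .. n_samples-1
    indices = rng.permutation(n_samples)
    # Number of rows held out for testing
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(20)
print(len(train_idx), len(test_idx))
```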

# 5. Train the Lasso Regression Model

In [14]:
# Stores the input parameters for the Lasso regression algorithm
parameters = {
    "l" : 0.1,
    "lr" : 0.1,
    "n_iters" : 700
}

# Creates the Lasso Regression algorithm and stores it in the model object
model = Lasso(**parameters)

In [15]:
# Fits the data to the Lasso regression algorithm during training
model.fit(X_train, y_train)

The Cost in iteration 100:	 877251.2670377009
The Cost in iteration 200:	 874800.7225684733
The Cost in iteration 300:	 874800.5295188489
The Cost in iteration 400:	 874800.5295035316
The Cost in iteration 500:	 874800.5295035315
The Cost in iteration 600:	 874800.5295035309
The Cost in iteration 700:	 874800.5295035329
Training Complete
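The log above suggests that the cost stops changing meaningfully well before iteration 700. One way to quantify this is to look for the first logged iteration at which the relative change in cost falls below a tolerance. The sketch below applies that check to the logged values; the helper `first_converged_iteration` and the tolerance `1e-9` are assumptions for illustration and are not part of the `Regression` class.

```python
# Costs printed every 100 iterations in the training log above
logged_costs = {
    100: 877251.2670377009,
    200: 874800.7225684733,
    300: 874800.5295188489,
    400: 874800.5295035316,
    500: 874800.5295035315,
    600: 874800.5295035309,
    700: 874800.5295035329,
}

def first_converged_iteration(costs, rel_tol=1e-9):
    """Return the first logged iteration whose relative cost change is below rel_tol."""
    iterations = sorted(costs)
    for prev, curr in zip(iterations, iterations[1:]):
        change = abs(costs[curr] - costs[prev]) / abs(costs[prev])
        if change < rel_tol:
            return curr
    return None

converged_at = first_converged_iteration(logged_costs)
print(converged_at)
```

A check of this kind could be used to stop training early, at the price of one extra comparison per logged iteration.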


# 6. Make Predictions on the Test Dataset

In [16]:
# Makes predictions on the test dataset X_test
y_pred = model.predict(X_test)

y_pred

array([[31093.12479731],
       [31295.19156223],
       [40250.38191698],
       [34897.47438062]])

In [17]:
# Flattens the 2D y_pred array to 1D
y_pred = y_pred.flatten()

y_pred

array([31093.12479731, 31295.19156223, 40250.38191698, 34897.47438062])

In [18]:
# Prints out the y_test variable to the screen
y_test

0     30450
17    30870
15    38900
1     35670
Name: income, dtype: int64

In [19]:
# Converts the y_test Pandas Series to a 1D NumPy array and stores it in the y variable
y = y_test.values.flatten()

y

array([30450, 30870, 38900, 35670])

# 7. PLCC, SRCC and KRCC Correlation Performance Metrics Method

In [20]:
# Python function to calculate correlation coefficient metrics
def correlation(X, y):
    
    # Import the SciPy stats module
    import scipy.stats
    
    # Calculate Pearson's Linear Correlation Coefficient (PLCC) with SciPy
    PLCC = scipy.stats.pearsonr(X, y)[0]    # Pearson's r

    # Calculate Spearman's Rank Correlation Coefficient (SRCC) with SciPy
    SRCC = scipy.stats.spearmanr(X, y)[0]   # Spearman's rho

    # Calculate Kendall's Rank Correlation Coefficient (KRCC) with SciPy
    KRCC = scipy.stats.kendalltau(X, y)[0]  # Kendall's tau
    
    # Prints out the correlation performance metric results to the screen
    print("PLCC: ", PLCC)
    print("SRCC: ", SRCC)
    print("KRCC: ", KRCC)

# 8. Evaluate PLCC, SRCC and KRCC Performance Metrics

In [21]:
# Calls the correlation coefficient Python method to calculate PLCC, SRCC, and KRCC
# with SciPy
correlation(y_pred, y)

PLCC:  0.9791237866482259
SRCC:  1.0
KRCC:  1.0
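The three coefficients can be reproduced by hand from the four predictions and targets printed in Section 6. The sketch below uses only NumPy and the standard library; it assumes there are no tied values (true for these four points), in which case rank-based Pearson correlation gives Spearman's rho and simple pair counting gives Kendall's tau. The variable names ending in `_demo` are introduced here for illustration.

```python
from itertools import combinations

import numpy as np

# Predictions and targets copied from the outputs in Section 6
y_pred_demo = np.array([31093.12479731, 31295.19156223, 40250.38191698, 34897.47438062])
y_true_demo = np.array([30450, 30870, 38900, 35670], dtype=float)

# Pearson's r: linear correlation of the raw values
plcc = np.corrcoef(y_pred_demo, y_true_demo)[0, 1]

def ranks(values):
    """Rank positions (0 = smallest); valid when there are no ties."""
    return np.argsort(np.argsort(values))

# Spearman's rho: Pearson's r computed on the ranks
srcc = np.corrcoef(ranks(y_pred_demo), ranks(y_true_demo))[0, 1]

# Kendall's tau: (concordant pairs - discordant pairs) / number of pairs
pairs = list(combinations(range(len(y_true_demo)), 2))
signs = [
    np.sign(y_pred_demo[i] - y_pred_demo[j]) * np.sign(y_true_demo[i] - y_true_demo[j])
    for i, j in pairs
]
krcc = sum(signs) / len(pairs)

print(plcc, srcc, krcc)
```

SRCC and KRCC equal 1.0 here because the predictions place the four test samples in exactly the same order as the true incomes, even though the predicted values themselves are not exact.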


## Root Mean Square Error (RMSE)

In [22]:
# Imports the mean_squared_error function from ScikitLearn
from sklearn.metrics import mean_squared_error

# Calculates the Mean Squared Error (MSE) between the targets y and the predictions y_pred
mse = mean_squared_error(y, y_pred)

# Takes the square root of the MSE to display the RMSE value
np.sqrt(mse)

868.1481042727627

## R2 Score

In [23]:
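# Imports the r2_score function from ScikitLearn
from sklearn.metrics import r2_score

# Calls the R2 Score function from ScikitLearn
# Note: the ScikitLearn signature is r2_score(y_true, y_pred); here the predictions are passed
# first, so the score below treats y_pred as the reference values
r2 = r2_score(y_pred, y)

# Displays the R2 Score value
r2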

0.9452353084209956
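Unlike the mean squared error, the R2 score is not symmetric in its two arguments, because the total sum of squares is computed around the mean of whichever array is treated as the reference. The sketch below recomputes RMSE and R2 in NumPy from the values printed in Section 6, in both argument orders. The helper `r2` and the `_demo` variable names are assumptions for illustration.

```python
import numpy as np

# Predictions and targets copied from the outputs in Section 6
y_pred_demo = np.array([31093.12479731, 31295.19156223, 40250.38191698, 34897.47438062])
y_true_demo = np.array([30450, 30870, 38900, 35670], dtype=float)

# RMSE is symmetric: swapping the arguments gives the same value
rmse_demo = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

def r2(reference, estimate):
    """R2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of `reference`."""
    ss_res = np.sum((reference - estimate) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Predictions used as the reference (the argument order used in the cell above)
r2_pred_as_reference = r2(y_pred_demo, y_true_demo)

# True targets used as the reference (the conventional order)
r2_true_as_reference = r2(y_true_demo, y_pred_demo)

print(rmse_demo, r2_pred_as_reference, r2_true_as_reference)
```

On these four points the conventional order gives a somewhat lower score than the one reported above, since the true incomes vary less around their mean than the predictions do.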

# 9. Test on the ScikitLearn Lasso Regression Model

Lasso Regression ScikitLearn Documentation Link:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

## Train the Model

In [24]:
# Imports the Lasso Regression model from ScikitLearn
# Note: this import replaces the from-scratch Lasso class defined in Section 1 under the same name
from sklearn.linear_model import Lasso

# Trains the ScikitLearn Lasso Regression model
model = Lasso(alpha=0.1).fit(X_train, y_train)

# 10. Make Predictions

In [25]:
# Makes predictions with the ScikitLearn Lasso Regression model
y_pred = model.predict(X_test)

y_pred

array([31093.28218034, 31295.35328037, 40250.48790683, 34897.61424096])

In [26]:
# Flattens the predictions to 1D (they are already 1D here) and stores them in the flat Python variable
flat = y_pred.flatten()

flat

array([31093.28218034, 31295.35328037, 40250.48790683, 34897.61424096])
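The ScikitLearn predictions are very close to those of the from-scratch model in Section 6, but not identical. The sketch below quantifies the gap using the two printed arrays (the variable names are introduced here for illustration). A plausible explanation, stated as an assumption rather than something verified in this notebook, is that the from-scratch model also penalises the bias weight and runs a fixed number of gradient steps, whereas ScikitLearn's `Lasso` leaves the intercept unpenalised and uses a coordinate descent solver.

```python
import numpy as np

# Predictions copied from the outputs in Section 6 (from scratch) and above (ScikitLearn)
scratch_pred = np.array([31093.12479731, 31295.19156223, 40250.38191698, 34897.47438062])
sklearn_pred = np.array([31093.28218034, 31295.35328037, 40250.48790683, 34897.61424096])

# Absolute and relative differences between the two models' predictions
abs_diff = np.abs(sklearn_pred - scratch_pred)
max_abs_diff = abs_diff.max()
max_rel_diff = (abs_diff / np.abs(sklearn_pred)).max()

print(max_abs_diff, max_rel_diff)
```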

In [27]:
# Prints out the y_test variable to the screen
y_test

0     30450
17    30870
15    38900
1     35670
Name: income, dtype: int64

# 11. Test the ScikitLearn Lasso Regression Model

In [28]:
# Calls the correlation coefficient Python method to calculate PLCC, SRCC, and KRCC
# with SciPy
correlation(flat, y)

PLCC:  0.9791238509822464
SRCC:  1.0
KRCC:  1.0


## Root Mean Square Error (RMSE)

In [29]:
# Imports the mean_squared_error function from ScikitLearn
from sklearn.metrics import mean_squared_error

# Calculates the Mean Squared Error (MSE) between the targets y and the ScikitLearn predictions flat
mse = mean_squared_error(y, flat)

# Takes the square root of the MSE to display the RMSE value
np.sqrt(mse)

868.2071647849484

## R2 Score

In [30]:
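# Imports the r2_score function from ScikitLearn
from sklearn.metrics import r2_score

# Calls the R2 Score function from ScikitLearn
# Note: the ScikitLearn signature is r2_score(y_true, y_pred); here the predictions are passed
# first, so the score below treats flat as the reference values
r2 = r2_score(flat, y)

# Displays the R2 Score value
r2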

0.9452272123280635

# 12. Display Jupyter Notebook Python Variables

In [31]:
# Prints out all the Python variables from the Jupyter Notebook
%whos

Variable             Type         Data/Info
-------------------------------------------
Lasso                ABCMeta      <class 'sklearn.linear_mo<...>oordinate_descent.Lasso'>
LassoPenalty         type         <class '__main__.LassoPenalty'>
Regression           type         <class '__main__.Regression'>
X                    DataFrame             age  experience\<...>n19 -0.271133    0.945285
X_test               DataFrame             age  experience\<...>n1  -0.987332   -0.796030
X_train              DataFrame             age  experience\<...>n6  -1.191960   -0.298511
correlation          function     <function correlation at 0x7ad9d7dc0900>
df                   DataFrame             age  experience\<...>n19 -0.271133    0.945285
flat                 ndarray      4: 4 elems, type `float64`, 32 bytes
get_scale            function     <function get_scale at 0x7ad9dc561760>
list_numerical       list         n=2
mean_squared_error   function     <function mean_squared_error at 0x7ad9d7d