## Lasso Regression

A basic example of implementing Lasso regression.

Lasso (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that performs both variable selection and regularization to improve the prediction accuracy and interpretability of the resulting linear model.

## About the data


The dataset has 9 variables, including the name of the car and attributes such as horsepower, weight, and region of origin. Missing values in the data are marked with a question mark ('?').

A detailed description of the variables is given below.

1. mpg: miles per gallon
2. cylinders: number of cylinders
3. displacement: engine displacement in cubic inches
4. horsepower: horsepower of the car
5. weight: weight of the car in pounds
6. acceleration: time taken, in seconds, to accelerate from 0 to 60 mph
7. model year: year of manufacture of the car (modulo 100)
8. origin: region of origin of the car (1 - American, 2 - European, 3 - Asian)
9. car name: name of the car


## 1. Importing Libraries

In [1]:
# for reading and manipulating data
import pandas as pd
import numpy as np

# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# for splitting data
from sklearn.model_selection import train_test_split

# for model building
from sklearn.linear_model import Lasso

# for suppressing warnings
import warnings
warnings.simplefilter('ignore')

## 2. Data Load & Overview

In [2]:
data = pd.read_csv('../Data/auto_mpg.csv')

In [3]:
df = data.copy()

In [4]:
# checking the size of the dataframe
print(f'This data set has {df.shape[0]} rows')
print(f'There are {df.shape[1]} variables')

This data set has 398 rows
There are 9 variables


In [5]:
# looking at first 5 rows of the dataset
df.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |


This dataset contains features of different cars.

The main goal is to build a model that predicts a car's miles per gallon (mpg) from its other features.

In [6]:
# checking data information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


* Most of the columns in the data are numeric in nature ('int64' or 'float64' type).
* The horsepower and car name columns are string columns ('object' type).
* The first rows shown above contain only numbers in horsepower, so this column should be numeric.

For prediction purposes, I will drop the 'car name' column.

In [7]:
# dropping 'car name' column
df.drop('car name',axis=1, inplace=True)

In [8]:
# Checking for missing values
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
dtype: int64

In [9]:
# changing 'horsepower' type from object to int
df.horsepower = df.horsepower.astype('int')


ValueError: invalid literal for int() with base 10: '?'

The attempt above failed to change the data type of the `horsepower` column and raised an error because the column contains the string '?'. We will have to deal with these strings before converting the column to a numeric type.

In [10]:
# finding all unconvertible strings in the 'horsepower' column
strings = [string for string in df['horsepower'] if not string.isdigit()]
strings

['?', '?', '?', '?', '?', '?']

There are 6 unconvertible strings in the column, and all of them are '?'.

We could drop these rows to keep the example simple. Instead, we will replace these strings with the median of the column.
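
As an alternative to the step-by-step replacement in the following cells, the same cleanup can be sketched with `pd.to_numeric` and `errors='coerce'`. The `toy` DataFrame and its values are made up for illustration and stand in for the real `horsepower` column.

```python
import pandas as pd

# Hypothetical toy column mimicking 'horsepower': numbers stored as strings, '?' for missing
toy = pd.DataFrame({'horsepower': ['130', '165', '?', '150', '?']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy['horsepower'] = pd.to_numeric(toy['horsepower'], errors='coerce')

# filling the missing entries with the median of the parsed values
median_hp = toy['horsepower'].median()
toy['horsepower'] = toy['horsepower'].fillna(median_hp)

print(toy['horsepower'].tolist())  # [130.0, 165.0, 150.0, 150.0, 150.0]
```

This approach converts and cleans the column in one pass, and the column ends up with a float dtype without a separate `astype` call.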

In [11]:
# replacing the '?' strings with None types
df['horsepower'] = np.where(df['horsepower'] == '?',None,df['horsepower'])

In [12]:
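# filling the None values with the column median
# (horsepower is converted to numeric first so that its median is well defined)
df['horsepower'] = pd.to_numeric(df['horsepower'])
medianFiller = lambda x: x.fillna(x.median())
df = df.apply(medianFiller, axis=0)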

In [13]:
# let's convert the horsepower column from object type to float type
df.horsepower = df.horsepower.astype(float)

In [14]:
# looking at final dataset
df.head()

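| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 |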


## 3. Model

* Build Model
* Evaluate Model
* Make Predictions

## 3.1 Building Model

In [15]:
# Separating Independent variables
X = df.drop('mpg',axis=1)

# Separating Dependent variable (Target variable)
y = df['mpg']

In [16]:
# splitting data into training and testing sets
X_train,X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=123)

In [17]:
# defining model
model = Lasso()

# training model
model.fit(X_train, y_train);
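
The model above is fit on unscaled features. Because the Lasso penalty acts on the size of the coefficients, a feature measured in large units (such as weight in pounds) is penalized differently from one measured in small units. A common option is to standardize the features first. The sketch below uses synthetic stand-in data, not the auto-mpg data, and the names `X_demo`, `y_demo`, `raw_model`, and `scaled_model` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): two informative features on very different scales
rng = np.random.default_rng(0)
small = rng.normal(0, 1, 200)        # a feature with values around 1
large = rng.normal(3000, 800, 200)   # a feature with values in the thousands
X_demo = np.column_stack([small, large])
y_demo = 2.0 * small - 0.005 * large + rng.normal(0, 0.5, 200)

# Lasso on raw features: the same alpha affects the two features very differently
raw_model = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Lasso on standardized features: every feature is penalized on a comparable scale
scaled_model = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X_demo, y_demo)

print('raw coefficients:   ', raw_model.coef_)
print('scaled coefficients:', scaled_model.named_steps['lasso'].coef_)
```

With a pipeline, the scaler is fit on the training data only and reused at prediction time, so the coefficients refer to standardized features rather than the original units.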


## 3.2 Evaluating Model

We will use the R^2 value to evaluate the model.

In [18]:
# Evaluate Model
print(f'Training R^2: {model.score(X_train,y_train):.2f}')
print(f'Testing R^2: {model.score(X_test,y_test):.2f}')

Training R^2: 0.81
Testing R^2: 0.80


The R^2 values for the training and testing data are very close, which suggests that the model is not overfitting.

The model explains about 80% of the variation in mpg on the testing data.
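
To make the R^2 value concrete, the sketch below computes it by hand and compares it with `r2_score` from scikit-learn, which is the same quantity that `model.score` reports for regressors. The `y_true` and `y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions, for illustration only
y_true = np.array([18.0, 15.0, 18.0, 16.0, 17.0])
y_pred = np.array([17.5, 15.5, 18.5, 16.0, 16.5])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# both values should agree
print(r2_manual, r2_score(y_true, y_pred))
```

An R^2 of 1 means the predictions match the data exactly, while 0 means the model does no better than always predicting the mean.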

## 3.3 Making Predictions

Let's predict the mpg of a car with the following attributes:

* Cylinders: 6
* Displacement: 250
* Horsepower: 100
* Weight: 3282
* Acceleration: 15
* Model Year: 71
* Origin: 1 (American)

In [19]:
model.predict([[6,250,100,3282,15,71,1]])

array([17.90570095])
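
Passing a bare list to `predict` relies on the values being in exactly the same order as the training columns. A more defensive option is to pass a one-row DataFrame with named columns. The sketch below uses a small made-up training set with only two features, and the names `train`, `toy_model`, and `new_car` are illustrative.

```python
import pandas as pd
from sklearn.linear_model import Lasso

# Small made-up training set (illustrative only, not the auto-mpg data)
train = pd.DataFrame({
    'weight':     [3504, 3693, 3436, 2372, 2130, 1835, 2672, 2430],
    'model year': [70, 70, 71, 72, 74, 76, 78, 80],
    'mpg':        [18.0, 15.0, 18.0, 24.0, 26.0, 30.0, 27.0, 32.0],
})
X_toy = train[['weight', 'model year']]
y_toy = train['mpg']
toy_model = Lasso(alpha=0.1).fit(X_toy, y_toy)

# building the new observation as a one-row DataFrame with named columns
new_car = pd.DataFrame([{'model year': 71, 'weight': 3282}])

# selecting by the training columns puts the values in the order used during fitting
new_car = new_car[X_toy.columns]
prediction = toy_model.predict(new_car)
print(prediction)
```

Naming the columns makes it harder to swap two values by accident, for example weight and displacement.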

## 4. Understanding the model

**Lasso regression model:**

Lasso (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting linear model.

Equation of a line:

$
y = mx + b
$

Where:
        
    y: dependent variable
    m: coefficient (slope)
    x: independent variable
    b: y-intercept
            
With n independent variables, each variable gets its own coefficient, so the equation becomes:


$
y = m_1 x_1 + m_2 x_2 + m_3 x_3 + \dots + m_n x_n + b
$
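
The linear equation describes the fitted model. What sets Lasso apart is how the coefficients are chosen. As a sketch, writing $m_j$ for the coefficient of $x_j$, $N$ for the number of training rows, and $\alpha$ for the regularization strength (labels introduced here for illustration), scikit-learn's `Lasso` minimizes:

$
\frac{1}{2N} \sum_{i=1}^{N} \left( y_i - b - \sum_{j=1}^{n} m_j x_{ij} \right)^2 + \alpha \sum_{j=1}^{n} |m_j|
$

The first term is the usual squared error. The second term is the L1 penalty: it shrinks the coefficients toward zero and can set some of them exactly to zero, which is how Lasso performs variable selection. A larger $\alpha$ means a stronger penalty.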


In [20]:
# getting the coefficients of the model's linear equation 
model.coef_

array([-0.        , -0.00415628,  0.        , -0.00628693,  0.        ,
        0.68956463,  0.        ])

The `cylinders`, `horsepower`, `acceleration`, and `origin` variables were dropped from the final model because their coefficients are exactly zero.
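
Reading the dropped variables off a bare array means matching positions to column names by eye. The sketch below pairs the coefficients printed above with the column names, in the column order of the cleaned dataset. The name `coef_by_name` is illustrative.

```python
import numpy as np
import pandas as pd

# Column names and coefficients copied from the notebook output above
columns = ['cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model year', 'origin']
coef = np.array([-0.0, -0.00415628, 0.0, -0.00628693, 0.0, 0.68956463, 0.0])

# pairing each coefficient with its variable name
coef_by_name = pd.Series(coef, index=columns)

dropped = coef_by_name[coef_by_name == 0].index.tolist()
kept = coef_by_name[coef_by_name != 0].index.tolist()
print('dropped:', dropped)  # ['cylinders', 'horsepower', 'acceleration', 'origin']
print('kept:', kept)        # ['displacement', 'weight', 'model year']
```

In the notebook itself, the same Series could be built with `pd.Series(model.coef_, index=X.columns)`.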

In [21]:
len(model.coef_[model.coef_ == 0]) # total dropped variables

4

In [22]:
# getting the y-intercept of the model's linear equation
model.intercept_

-9.380614845397869
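
With the coefficients and the intercept in hand, the prediction from section 3.3 can be reproduced by hand. The sketch below applies the linear equation as a dot product plus the intercept, using the values printed above.

```python
import numpy as np

# Coefficients and intercept copied from the notebook output
coef = np.array([-0.0, -0.00415628, 0.0, -0.00628693, 0.0, 0.68956463, 0.0])
intercept = -9.380614845397869

# The car from section 3.3: cylinders, displacement, horsepower, weight,
# acceleration, model year, origin
car = np.array([6, 250, 100, 3282, 15, 71, 1])

# multiplying each value by its coefficient, summing, and adding the intercept
manual_prediction = float(car @ coef + intercept)
print(round(manual_prediction, 3))  # 17.906
```

The result agrees with the output of `model.predict` up to the rounding of the printed coefficients.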

In [23]:
# creating a list of the independent variable names to use in the equation
factors1 = X.columns

# creating a list of the coefficients and reducing the number of digits for simplicity and better visualization of the equation
coefficients = np.around(model.coef_,4).tolist()

In [24]:
# creating a string of the full equation of our model
string = ''
for name, coef in zip(factors1, coefficients):
    if coef == 0:
        continue  # skipping the variables dropped by the model
    sign = '-' if coef < 0 else '+'
    string += f' {sign} {abs(coef)}*{name}'

# appending the intercept with its sign
intercept = np.around(model.intercept_, 4)
sign = '-' if intercept < 0 else '+'
string = 'y =' + string + f' {sign} {abs(intercept)}'

In [25]:
string

'y = - 0.0042*displacement - 0.0063*weight + 0.6896*model year - 9.3806'

## Lasso Regression Model

$
y = -0.0042 \cdot \text{displacement} - 0.0063 \cdot \text{weight} + 0.6896 \cdot \text{model year} - 9.3806
$

## 5. Hyperparameter tuning

Let's try to improve the model's score by changing the regularization parameter `alpha`.

In [26]:
# Lasso regression: effect of alpha regularization parameter

for alpha in [0.1,1,2,3,4,5,6,7]:
    
    # Initialize and train the model
    model = Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    
    # counting the dropped variables and getting the R^2 values
    dropped_variables = len(model.coef_[model.coef_ == 0])
    r2_train = model.score(X_train, y_train)
    r2_test = model.score(X_test, y_test)
    
    print(f'''Alpha = {alpha:.2f},
    dropped variables: {dropped_variables},
    r-squared training: {r2_train:.2f},
    r-squared test: {r2_test:.2f} \n''')

Alpha = 0.10,
    dropped variables: 0,
    r-squared training: 0.83,
    r-squared test: 0.79 

Alpha = 1.00,
    dropped variables: 4,
    r-squared training: 0.81,
    r-squared test: 0.80 

Alpha = 2.00,
    dropped variables: 4,
    r-squared training: 0.80,
    r-squared test: 0.80 

Alpha = 3.00,
    dropped variables: 4,
    r-squared training: 0.80,
    r-squared test: 0.79 

Alpha = 4.00,
    dropped variables: 4,
    r-squared training: 0.79,
    r-squared test: 0.78 

Alpha = 5.00,
    dropped variables: 4,
    r-squared training: 0.78,
    r-squared test: 0.77 

Alpha = 6.00,
    dropped variables: 4,
    r-squared training: 0.76,
    r-squared test: 0.76 

Alpha = 7.00,
    dropped variables: 3,
    r-squared training: 0.74,
    r-squared test: 0.74 
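
The loop above compares alpha values by their score on the test set. Picking alpha that way lets the test data influence the model, so a common alternative is to choose alpha by cross-validation on the training data only. The sketch below uses `LassoCV` on synthetic stand-in data. The data, the candidate alphas, and the names `X_demo`, `y_demo`, and `cv_model` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 7 features, only 3 of which matter
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 7))
true_coef = np.array([0.0, -2.0, 0.0, -3.0, 0.0, 1.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

# LassoCV evaluates each candidate alpha with 5-fold cross-validation
# on the training data, then refits on all of it with the best alpha
candidate_alphas = [0.01, 0.1, 1, 2, 3]
cv_model = LassoCV(alphas=candidate_alphas, cv=5, random_state=123)
cv_model.fit(X_tr, y_tr)

print('chosen alpha:', cv_model.alpha_)
print('test R^2:', round(cv_model.score(X_te, y_te), 2))
```

The test set is then used only once, to report the score of the model that cross-validation selected.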



Below are the parameters that can be tuned to improve the scores, shown here with their default values.

In [27]:
model_params = Lasso(alpha=1.0, 
            fit_intercept=True, 
            precompute=False, 
            copy_X=True, 
            max_iter=1000, 
            tol=0.0001, 
            warm_start=False, 
            positive=False, 
            random_state=None, 
            selection='cyclic')
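
Several of these parameters can be searched together. The sketch below uses `GridSearchCV` on synthetic stand-in data. The grid values are illustrative rather than recommendations, and the names `X_demo`, `y_demo`, `param_grid`, and `search` are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption): 7 features, only 3 of which matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
true_coef = np.array([0.0, -2.0, 0.0, -3.0, 0.0, 1.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Hypothetical grid: every combination of these values is tried
param_grid = {
    'alpha': [0.01, 0.1, 1.0],
    'fit_intercept': [True, False],
    'selection': ['cyclic', 'random'],
}

# 5-fold cross-validation, scored by R^2
search = GridSearchCV(Lasso(max_iter=10000, random_state=123),
                      param_grid, cv=5, scoring='r2')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 2))
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` with the best parameter combination.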