## Lasso regularisation

Regularisation adds a penalty on the parameters of a machine learning model, which reduces the freedom of the model and helps avoid overfitting. In linear models, the penalty is applied to the coefficients that multiply each of the predictors. Lasso (L1) regularisation can shrink some of these coefficients exactly to zero, so the corresponding features can be removed from the model.
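For reference, the L1-penalised least-squares problem can be written as below. This is the form documented for sklearn's Lasso; the symbols ($n$ samples, design matrix $X$, target $y$, coefficient vector $w$, penalty strength $\alpha$) are introduced here for illustration only.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Because the $\ell_1$ term is not differentiable at zero, the minimiser can sit exactly at $w_j = 0$ for some features, which is what makes the penalty usable for feature selection.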

I will demonstrate how to select features using Lasso regularisation, first on a classification problem and then on a regression problem.

In [1]:
# %load_ext autoreload
# %autoreload 2

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

from helper_fe_v2 import (
            get_full_datapath_nm,
            read_df_from_file,
            check_module_members,
            gen_correlation,
            do_bkwd_fwd_selection,
            yaml_path,
            read_yaml_conf,
            remove_duplicates, 
            drop_const_features,
            drop_quasi_const_features ,
            run_randomForestClassifier,
            run_logistic,
            run_randomForestRegressor
)

## Classification

In [3]:
config = read_yaml_conf(yaml_path())
print ("yaml_ conf ", config ) 

yaml_ conf  {'write_file': True, 'base_dir': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments', 'full_config_file': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\py-projects.yaml', 'cur_kaggle_expt': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\feature-engineering\\fe-recipes', 'data_dir_nm': 'input_data', 'files': {'test_data_set2': 'fselect_dataset_2.csv', 'housing_data': 'housing_prices_train.csv', 'test_data_set1': 'fselect_dataset_1.csv', 'titanic_data': 'fe-cookbook-titanic.csv', 'comp_eda_file': 'none.csv'}, 'project_parms': {'use_mlxtnd': 'False'}, 'process_eda': {'main_file': 'fselect_dataset_2.csv', 'compre_file': 'none.csv', 'pairwise_analysis': 'on', 'show_html': 'False'}, 'RandomForestConfig': {'n_estimators': 200, 'rand_state': 39, 'max_depth': 4}}


In [4]:
data = read_df_from_file ( config['files']['test_data_set2'], set_nrows=False, nrws=0 ) 
data.shape

Full path NM exists  C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_2.csv
full_path nm -from read_df C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_2.csv


(50000, 109)

In [5]:
data.head()

| | var_1 | var_2 | var_3 | var_4 | var_5 | var_6 | var_7 | var_8 | var_9 | var_10 | ... | var_100 | var_101 | var_102 | var_103 | var_104 | var_105 | var_106 | var_107 | var_108 | var_109 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.53271 | 3.280834 | 17.982476 | 4.404259 | 2.34991 | 0.603264 | 2.784655 | 0.323146 | 12.009691 | 0.139346 | ... | 2.079066 | 6.748819 | 2.941445 | 18.360496 | 17.726613 | 7.774031 | 1.473441 | 1.973832 | 0.976806 | 2.541417 |
| 1 | 5.821374 | 12.098722 | 13.309151 | 4.125599 | 1.045386 | 1.832035 | 1.833494 | 0.70909 | 8.652883 | 0.102757 | ... | 2.479789 | 7.79529 | 3.55789 | 17.383378 | 15.193423 | 8.263673 | 1.878108 | 0.567939 | 1.018818 | 1.416433 |
| 2 | 1.938776 | 7.952752 | 0.972671 | 3.459267 | 1.935782 | 0.621463 | 2.338139 | 0.344948 | 9.93785 | 11.691283 | ... | 1.861487 | 6.130886 | 3.401064 | 15.850471 | 14.620599 | 6.849776 | 1.09821 | 1.959183 | 1.575493 | 1.857893 |
| 3 | 6.02069 | 9.900544 | 17.869637 | 4.366715 | 1.973693 | 2.026012 | 2.853025 | 0.674847 | 11.816859 | 0.011151 | ... | 1.340944 | 7.240058 | 2.417235 | 15.194609 | 13.553772 | 7.229971 | 0.835158 | 2.234482 | 0.94617 | 2.700606 |
| 4 | 3.909506 | 10.576516 | 0.934191 | 3.419572 | 1.871438 | 3.340811 | 1.868282 | 0.439865 | 13.58562 | 1.153366 | ... | 2.738095 | 6.565509 | 4.341414 | 15.893832 | 11.929787 | 6.954033 | 1.853364 | 0.511027 | 2.599562 | 0.811364 |


In [6]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 108), (15000, 108))

In [7]:
# linear models benefit from feature scaling

scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [8]:
# here I do the model fitting and the feature selection
# together in a single call

# first I specify the Logistic Regression model, making sure
# to select the Lasso (l1) penalty

# then I wrap it in the SelectFromModel class from sklearn, which
# selects the features whose coefficients are non-zero

sel_ = SelectFromModel(
    LogisticRegression(C=0.5, penalty='l1', solver='liblinear', random_state=10))

sel_.fit(scaler.transform(X_train), y_train)

SelectFromModel(estimator=LogisticRegression(C=0.5, penalty='l1',
                                             random_state=10,
                                             solver='liblinear'))
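As an alternative layout, the scaling and selection steps can be chained in a single sklearn pipeline, so the scaler is always fitted on the training split only. This is a minimal sketch on synthetic data from make_classification, not on the notebook's dataset; the variable names (selector, X_tr, X_te) are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (assumption: toy data only).
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling and L1-based selection chained in one estimator.
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(
        LogisticRegression(C=0.5, penalty='l1', solver='liblinear', random_state=10)
    ),
)
selector.fit(X_tr, y_tr)

# The scaler fitted on the training split is reused for the test split.
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)
print(X_tr_sel.shape, X_te_sel.shape)
```

Note that the pipeline returns the scaled values of the selected features, whereas the cells in this notebook apply the selector to the unscaled data.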

In [9]:
# this command lets me inspect the boolean mask of the
# features that were selected

sel_.get_support()

array([ True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True, False,  True, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True, False,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True, False])

In [10]:
# now I make a list with the selected features
selected_feat = X_train.columns[(sel_.get_support())]

print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 108
selected features: 93
features with coefficients shrunk to zero: 15


### Examine coefficients that shrank to zero

In [11]:
# the number of features whose coefficient was shrunk to zero:
np.sum(sel_.estimator_.coef_ == 0)

15

In [12]:
# we can identify the removed features like this:

removed_feats = X_train.columns[(sel_.estimator_.coef_ == 0).ravel().tolist()]
removed_feats

Index(['var_8', 'var_19', 'var_42', 'var_47', 'var_53', 'var_59', 'var_62',
       'var_64', 'var_73', 'var_75', 'var_85', 'var_87', 'var_91', 'var_105',
       'var_109'],
      dtype='object')

In [13]:
# we can then remove the features from the training and testing set
# like this:

X_train_selected = sel_.transform(X_train)
X_test_selected = sel_.transform(X_test)

X_train_selected.shape, X_test_selected.shape



((35000, 93), (15000, 93))

Remember that the transform method of sklearn's SelectFromModel returns a NumPy array by default, so if you need a dataframe, capture the feature names first and then convert the array back to a dataframe.
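A minimal sketch of that conversion, on a toy dataframe rather than the notebook's data. The column names and the Lasso settings are placeholders chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Toy frame with named columns (assumption: stands in for X_train).
X_arr, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                           noise=1.0, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'var_{i + 1}' for i in range(8)])

sel = SelectFromModel(Lasso(alpha=1.0, random_state=10)).fit(X, y)

# Capture the names through the boolean mask, then rebuild the frame,
# keeping the original row index so later joins still line up.
kept = X.columns[sel.get_support()]
X_selected = pd.DataFrame(sel.transform(X), columns=kept, index=X.index)
print(X_selected.head())
```

In scikit-learn 1.2 and later, calling set_output(transform='pandas') on the selector is another way to get a dataframe back directly.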

### Ridge regularisation does not shrink coefficients to zero

For comparison, let's check whether Ridge (L2) regularisation shrinks any coefficients to zero.

In [14]:
# for comparison, I fit a logistic regression with
# Ridge (l2) regularisation and evaluate the coefficients

l2_logit = LogisticRegression(C=0.5, penalty='l2', max_iter=300, random_state=10)
l2_logit.fit(scaler.transform(X_train), y_train)

# I count the number of coefficients with zero values
# and it is zero, as expected
np.sum(l2_logit.coef_ == 0)

0
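The same contrast can be reproduced on a small synthetic regression problem, where it is known that most features carry no signal. This is only a sketch: the data come from make_regression and the alpha values are arbitrary choices, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data where only 5 of 20 features carry signal (assumption: toy data).
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print('Lasso zero coefficients:', n_zero_lasso)
print('Ridge zero coefficients:', n_zero_ridge)
```

The L2 penalty shrinks coefficients towards zero but does not set them exactly to zero, so it cannot be used for selection in the same way.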

## Regression

In [15]:
data = read_df_from_file ( config['files']['housing_data'], set_nrows=False, nrws=0 ) 
data.shape

Full path NM exists  C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\housing_prices_train.csv
full_path nm -from read_df C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\housing_prices_train.csv


(1460, 81)

In [16]:
# In practice, feature selection should be done after data pre-processing,
# so ideally all the categorical variables are encoded into numbers,
# and then you can assess how predictive they are of the target

# here, for simplicity, I will use only numerical variables
# select numerical columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1460, 38)

In [17]:
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((1022, 37), (438, 37))

In [18]:
# replace missing values with 0 (a simple placeholder imputation)
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

In [19]:
# the features in the house dataset are in very
# different scales, so it helps the regression to scale
# them

scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [20]:
# here, again, I will train a Lasso linear regression and select
# the features with non-zero coefficients in a single call

# bear in mind that the LinearRegression object from sklearn does
# not allow for regularisation, so if you want a regularised
# linear regression you need to import "Lasso" specifically

# alpha is the penalty strength, so I set it high
# to force the algorithm to shrink some coefficients to zero

sel_ = SelectFromModel(Lasso(alpha=100, random_state=10))
sel_.fit(scaler.transform(X_train), y_train)

SelectFromModel(estimator=Lasso(alpha=100, random_state=10))

In [21]:
sel_.get_support()

array([False,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True])

In [22]:
# make a list with the selected features and print the outputs
selected_feat = X_train.columns[(sel_.get_support())]

print('total features: {}'.format((X_train.shape[1])))
print('selected features: {}'.format(len(selected_feat)))
print('features with coefficients shrunk to zero: {}'.format(
    np.sum(sel_.estimator_.coef_ == 0)))

total features: 37
selected features: 33
features with coefficients shrunk to zero: 4
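To see which features were dropped and how the remaining ones rank, the coefficients can be paired with the column names. A small sketch on synthetic data; the feat_ names are placeholders and not the housing columns.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy regression data with placeholder column names (assumption).
X_arr, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                           noise=1.0, random_state=0)
cols = [f'feat_{i}' for i in range(10)]
X = pd.DataFrame(StandardScaler().fit_transform(X_arr), columns=cols)

lasso = Lasso(alpha=1.0, random_state=10).fit(X, y)

# Pair each coefficient with its feature name.
coefs = pd.Series(lasso.coef_, index=cols)
removed = coefs[coefs == 0].index.tolist()
ranked = coefs[coefs != 0].abs().sort_values(ascending=False)

print('removed:', removed)
print(ranked)
```

Because the inputs were standardised, the absolute coefficient sizes are roughly comparable across features.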


As we can see, we used Lasso regularisation to remove unimportant features from the dataset with both logistic regression (classification) and linear regression.

Keep in mind that increasing the penalty increases the number of features removed. You therefore need to monitor the final model's performance to make sure the penalty is neither so high that it removes useful features nor so low that useless features are retained. Note that in sklearn the penalty grows with alpha for Lasso, but with smaller values of C for LogisticRegression, because C is the inverse of the regularisation strength.
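One way to get a feel for this trade-off is to sweep the penalty and count the surviving features. The sketch below does this with Lasso on synthetic data; the alpha grid is an arbitrary choice for illustration, and in practice each setting would also be scored on a validation set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Toy data: 5 informative features out of 20 (assumption: synthetic only).
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
kept_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Features with a non-zero coefficient survive the selection.
    kept_counts.append(int(np.sum(model.coef_ != 0)))
    print(f'alpha={alpha:>7}: {kept_counts[-1]} features kept')
```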


## Basic Filter Methods plus LASSO pipeline
### Putting it all together

In [23]:
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.feature_selection import SelectFromModel

from sklearn.metrics import roc_auc_score

from helper_fe_v2 import (
            get_full_datapath_nm,
            read_df_from_file,
            check_module_members,
            gen_correlation,
            do_bkwd_fwd_selection,
            yaml_path,
            read_yaml_conf,
            remove_duplicates, 
            drop_const_features,
            drop_quasi_const_features ,
            run_randomForestClassifier,
            run_logistic,
            run_randomForestRegressor
)

In [24]:
config = read_yaml_conf(yaml_path())
print ("yaml_ conf ", config ) 

yaml_ conf  {'write_file': True, 'base_dir': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments', 'full_config_file': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\py-projects.yaml', 'cur_kaggle_expt': 'C:\\Users\\Arindam Banerji\\CopyFolder\\IOT_thoughts\\python-projects\\kaggle_experiments\\feature-engineering\\fe-recipes', 'data_dir_nm': 'input_data', 'files': {'test_data_set2': 'fselect_dataset_2.csv', 'housing_data': 'housing_prices_train.csv', 'test_data_set1': 'fselect_dataset_1.csv', 'titanic_data': 'fe-cookbook-titanic.csv', 'comp_eda_file': 'none.csv'}, 'project_parms': {'use_mlxtnd': 'False'}, 'process_eda': {'main_file': 'fselect_dataset_2.csv', 'compre_file': 'none.csv', 'pairwise_analysis': 'on', 'show_html': 'False'}, 'RandomForestConfig': {'n_estimators': 200, 'rand_state': 39, 'max_depth': 4}}


In [25]:
data = read_df_from_file ( config['files']['test_data_set1'], set_nrows=False, nrws=0 ) 
data.shape

Full path NM exists  C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv
full_path nm -from read_df C:\Users\Arindam Banerji\CopyFolder\IOT_thoughts\python-projects\kaggle_experiments\input_data\fselect_dataset_1.csv


(50000, 301)

In [26]:
# separate dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

((35000, 300), (15000, 300))

In [27]:
# I keep a copy of the dataset with all the variables
# to measure the performance of the machine learning models
# at the end of the notebook

X_train_original = X_train.copy()
X_test_original = X_test.copy()

In [28]:
constant_features = drop_const_features(X_train, X_test, drop_feat='True')

(35000, 266) (15000, 266)


In [29]:
X_train, X_test, features_to_keep = drop_quasi_const_features(threshold=0.01,
                                                              X_train=X_train, 
                                                              X_test=X_test, 
                                                              drop_feat='True')

(35000, 215) (15000, 215)
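The helpers drop_const_features and drop_quasi_const_features come from helper_fe_v2, whose code is not shown in this notebook. Here is a minimal sketch of one way such filters could work, using sklearn's VarianceThreshold (imported above). The toy dataframe, the function name columns_above_variance and the reading of threshold=0.01 as a variance cut-off are all assumptions; the real helper may define quasi-constant differently, for example by the share of the most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    'informative': rng.normal(size=n),
    'constant': np.ones(n),
    # 999 zeros and a single one: variance is about 0.001
    'quasi_constant': np.r_[np.zeros(n - 1), 1.0],
})

def columns_above_variance(X_train, threshold):
    """Return the columns of X_train whose variance exceeds `threshold`
    (hypothetical helper, not the one from helper_fe_v2)."""
    vt = VarianceThreshold(threshold=threshold).fit(X_train)
    return X_train.columns[vt.get_support()]

keep_const = columns_above_variance(X, threshold=0.0)    # drops constant columns only
keep_quasi = columns_above_variance(X, threshold=0.01)   # also drops quasi-constant ones
print(list(keep_const), list(keep_quasi))
```

In a real workflow the selector would be fitted on the training set and the same column list applied to the test set.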


In [30]:
X_train.shape, X_test.shape

((35000, 215), (15000, 215))

In [31]:
duplicated_feat = remove_duplicates(X_train, X_test, drop_dup=True)

(35000, 205) (15000, 205)
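The body of remove_duplicates is likewise not shown. As a rough sketch of the idea, duplicated columns can be found with pandas by transposing the frame and flagging repeated rows. The tiny dataframes below are invented for illustration, and the actual helper may use a different approach.

```python
import pandas as pd

X_train = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [0, 1, 0, 1],
    'a_copy': [1, 2, 3, 4],   # identical values to 'a'
})
X_test = pd.DataFrame({'a': [5, 6], 'b': [1, 0], 'a_copy': [5, 6]})

# Transpose so that columns become rows, then flag repeated rows.
# keep='first' (the default) retains the first column of each duplicate group.
dup_mask = X_train.T.duplicated()
duplicated_cols = list(X_train.columns[dup_mask.to_numpy()])

# Duplicates are identified on the training set and dropped from both sets.
X_train_dedup = X_train.drop(columns=duplicated_cols)
X_test_dedup = X_test.drop(columns=duplicated_cols)
print(duplicated_cols, X_train_dedup.shape, X_test_dedup.shape)
```

Transposing a large dataframe is memory-hungry, so on wide datasets a pairwise column comparison may be the more practical choice.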


In [32]:
# I keep a copy of the dataset without constant, quasi-constant and duplicated variables
# to measure the performance of machine learning models
# at the end of the notebook

X_train_basic_filter = X_train.copy()
X_test_basic_filter = X_test.copy()

In [33]:
# remove correlated features to reduce the feature space

corr_features = gen_correlation(X_train, 0.8)
print('correlated features: ', len(set(corr_features)) )


correlated features:  93


In [34]:
X_train.drop(labels=corr_features, axis=1, inplace=True)
X_test.drop(labels=corr_features, axis=1, inplace=True)

X_train.shape, X_test.shape

((35000, 112), (15000, 112))
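gen_correlation is another helper from helper_fe_v2 whose implementation is not shown. A plausible sketch, assuming it returns every column whose absolute Pearson correlation with an earlier column exceeds the threshold; the function name correlated_columns and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def correlated_columns(X, threshold):
    """Return columns whose absolute Pearson correlation with an
    earlier column exceeds `threshold` (hypothetical helper)."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = pd.DataFrame({
    'x1': base,
    'x2': base + rng.normal(scale=0.01, size=500),  # near copy of x1
    'x3': rng.normal(size=500),                      # independent of x1
})

to_drop = correlated_columns(X, threshold=0.8)
print(to_drop)
```

With this rule the first feature of each correlated group is kept and the later ones are dropped, so the result depends on column order.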

In [35]:
# keep a copy of the dataset without
# correlated features

X_train_corr = X_train.copy()
X_test_corr = X_test.copy()

## Remove features using Lasso

In [36]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

In [37]:
# fit a Lasso-penalised logistic regression and select features; make sure to set penalty='l1'

sel_ = SelectFromModel(
    LogisticRegression(C=0.5,
                       penalty='l1',
                       solver='liblinear',
                       random_state=10))

sel_.fit(scaler.transform(X_train), y_train)

# remove features with zero coefficient from the dataset
# and convert the result back to a dataframe

X_train_lasso = pd.DataFrame(sel_.transform(X_train))
X_test_lasso = pd.DataFrame(sel_.transform(X_test))

# add the column names
X_train_lasso.columns = X_train.columns[(sel_.get_support())]
X_test_lasso.columns = X_train.columns[(sel_.get_support())]



In [38]:
X_train_lasso.shape, X_test_lasso.shape

((35000, 90), (15000, 90))

### Compare the performance of Logistic Regression with the different feature subsets

In [39]:
# original dataset - all variables

run_logistic(X_train_original,
             X_test_original,
             y_train,
             y_test,
             C=1.0, max_iter=500, penalty='l2' )

Train set
Logistic Regression roc-auc: 0.8028290544821253
Test set
Logistic Regression roc-auc: 0.795107708741321


array([[0.98589778, 0.01410222],
       [0.93240872, 0.06759128],
       [0.99671579, 0.00328421],
       ...,
       [0.98943534, 0.01056466],
       [0.98810889, 0.01189111],
       [0.92512968, 0.07487032]])
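The run_logistic helper also comes from helper_fe_v2 and its body is not shown. Judging only from the printed output and the returned two-column array, it might look roughly like the sketch below. The function name run_logistic_sketch, the internal scaling step, the random_state value and the choice to return the test-set probabilities are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def run_logistic_sketch(X_train, X_test, y_train, y_test,
                        C=1.0, max_iter=500, penalty='l2'):
    """Hypothetical stand-in for run_logistic: scale, fit, report ROC-AUC."""
    scaler = StandardScaler().fit(X_train)
    logit = LogisticRegression(C=C, max_iter=max_iter, penalty=penalty,
                               random_state=44)
    logit.fit(scaler.transform(X_train), y_train)

    train_proba = logit.predict_proba(scaler.transform(X_train))
    test_proba = logit.predict_proba(scaler.transform(X_test))
    print('Train set')
    print('Logistic Regression roc-auc: {}'.format(
        roc_auc_score(y_train, train_proba[:, 1])))
    print('Test set')
    print('Logistic Regression roc-auc: {}'.format(
        roc_auc_score(y_test, test_proba[:, 1])))
    return test_proba

# Usage on synthetic data (assumption: not the notebook's dataset).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
proba = run_logistic_sketch(X_tr, X_te, y_tr, y_te)
```

Using the same model settings for every feature subset, as the cells here do, keeps the ROC-AUC values comparable.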

In [40]:
# filter methods - basic

run_logistic(X_train_basic_filter,
             X_test_basic_filter,
             y_train,
             y_test, C=1.0, max_iter=500, penalty='l2')

Train set
Logistic Regression roc-auc: 0.8022717196742084
Test set
Logistic Regression roc-auc: 0.794739778054715


array([[0.98590733, 0.01409267],
       [0.93233699, 0.06766301],
       [0.99685249, 0.00314751],
       ...,
       [0.98945936, 0.01054064],
       [0.98805048, 0.01194952],
       [0.92508176, 0.07491824]])

In [41]:
# filter methods - correlation

run_logistic(X_train_corr,
             X_test_corr,
             y_train,
             y_test, C=1.0, max_iter=500, penalty='l2')

Train set
Logistic Regression roc-auc: 0.7942729298770534
Test set
Logistic Regression roc-auc: 0.7881886337111546


array([[0.98592044, 0.01407956],
       [0.93597827, 0.06402173],
       [0.99689585, 0.00310415],
       ...,
       [0.98950632, 0.01049368],
       [0.98118991, 0.01881009],
       [0.92669791, 0.07330209]])

In [42]:
# embedded methods - Lasso

run_logistic(X_train_lasso,
             X_test_lasso,
             y_train,
             y_test, C=1.0, max_iter=500, penalty='l2')

Train set
Logistic Regression roc-auc: 0.7941651632070669
Test set
Logistic Regression roc-auc: 0.7882319263234343


array([[0.98590989, 0.01409011],
       [0.93598939, 0.06401061],
       [0.99678196, 0.00321804],
       ...,
       [0.98949726, 0.01050274],
       [0.98120589, 0.01879411],
       [0.92668904, 0.07331096]])

As you can see, these procedures reduced the feature space from 300 to 90 features, while the test-set ROC-AUC of the logistic regression dropped only slightly, from 0.795 to 0.788.
