

# **Learn to automate feature selection with lasso regression**

Estimated time needed: **30** minutes

This project is based on the <a href="https://developer.ibm.com/tutorials/awb-lasso-regression-automatic-feature-selection/" target="_blank" rel="noopener noreferrer">IBM developer tutorial</a> by Eda Kavlakoglu.



# __Table of contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li><a href="#What is Lasso Regression?">What is lasso regression?</a></li>
    <li><a href="#Setup">Setup</a></li>
    <li><a href="">Steps</a></li>
        <ul>
            <li><a href="">Step 1. Import libraries and load the data set</a></li>
            <li><a href="">Step 2. Explore the data set</a></li>
            <li><a href="">Step 3. Split the data set</a></li>
            <li><a href="">Step 4. Standardize data points through feature scaling</a></li>
            <li><a href="">Step 5. Implement and evaluate the model</a></li>
            <li><a href="">Step 6. Optimize model with hyperparameter tuning</a></li>
        </ul>
    <li><a href="">Summary and next steps</a></li>
    <li><a href="">Exercises</a></li>
    
</ol>


---


# Objectives

After completing this lab, you will be able to:
 - Explain regularization concepts in the context of linear regression models
 - Implement lasso regression for linear models by using scikit-learn, and use grid search for hyperparameter tuning
 - Apply lasso regression in Python to regularize linear regression models and keep the features with the most predictive value


---


# What is lasso regression?


Lasso regression, also known as L1 regularization, is a form of regularization for linear regression models. Regularization is a statistical method to reduce errors caused by overfitting on training data.

Lasso stands for Least Absolute Shrinkage and Selection Operator. It's frequently used in machine learning to handle high-dimensional data because it facilitates automatic feature selection. It does this by adding a penalty term to the residual sum of squares (RSS): the sum of the absolute values of the coefficients, multiplied by the regularization parameter (lambda or λ). This regularization parameter controls the amount of regularization applied. Larger values of lambda increase the penalty and shrink more of the coefficients towards zero, which reduces the importance of some features or eliminates them from the model altogether. This is what produces automatic feature selection. Conversely, smaller values of lambda reduce the effect of the penalty, retaining more features within the model.
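
One common way to write the objective described above is sketched below. The notation is an assumption made here for illustration and is not taken from the original tutorial: $y_i$ is the target, $x_{ij}$ are the predictors, $\beta_j$ are the coefficients, and there are $n$ observations and $p$ features.

$$
\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
$$

The first term is the RSS and the second is the L1 penalty. Note that scikit-learn calls the regularization parameter `alpha` and scales the RSS by $1/(2n)$ in its `Lasso` objective, so a value of λ in this formula and a value of `alpha` in scikit-learn are not directly interchangeable.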

This penalty promotes sparsity within the model, which can help avoid issues of multicollinearity and overfitting within data sets. Multicollinearity occurs when two or more independent variables are highly correlated with one another, which can be problematic for causal modeling. Overfit models generalize poorly to new data, which diminishes their value. By reducing regression coefficients to zero, lasso regression can effectively eliminate independent variables from the model, sidestepping these potential issues within the modeling process. Model sparsity can also improve the interpretability of the model compared to other regularization techniques such as ridge regression (also known as L2 regularization).
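
To make the contrast with ridge regression concrete, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of `alpha=0.5` are assumptions for illustration only and are not part of the lab. It counts how many coefficients each penalty sets exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): 10 features,
# but only the first two actually drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Zero coefficients with lasso:", n_zero_lasso)
print("Zero coefficients with ridge:", n_zero_ridge)
```

The L1 penalty can set coefficients to exactly zero, while the L2 penalty only shrinks them, which is why lasso is the one used for feature selection.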


# Setup


For this lab, you use the following libraries:

*   [`pandas`](https://pandas.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for managing the data
*   [`NumPy`](https://numpy.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for mathematical operations
*   [`sklearn`](https://scikit-learn.org/stable/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for machine learning and machine learning pipeline-related functions
*   [`seaborn`](https://seaborn.pydata.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for visualizing the data
*   [`Matplotlib`](https://matplotlib.org/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML0187ENSkillsNetwork31430127-2021-01-01) for additional plotting tools
*   [`statsmodels`](https://www.statsmodels.org/stable/index.html) for statistical data exploration


### Installing required libraries

The following required libraries are pre-installed in the Skills Network Labs environment. However, if you run this notebook's commands in a different Jupyter environment (for example, Watson Studio or Anaconda), you must install these libraries by removing the `#` sign before `!pip` in the following code cell.


In [None]:
# All Libraries required for this lab are listed below. The libraries pre-installed on Skills Network Labs are commented.
# !pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1
# - Update a specific package
# !pip install pmdarima -U
# - Update a package to specific version
# !pip install --upgrade pmdarima==2.0.2
# Note: If your environment doesn't support "!pip install", use "!mamba install"

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You must run the following cell__ to install them:


In [None]:
!pip install tqdm seaborn skillsnetwork scikit-learn

# Steps


## Step 1. Import libraries and load the data set

In this step, you import the necessary Python libraries for implementing lasso regression and load the data set.



In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# You can also use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

sns.set_context('notebook')
sns.set_style('white')
warn()

In [None]:
#Load the dataset
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
df.sample(5)

This data set is originally from the 1974 Motor Trend US magazine, highlighting different attributes of car models from 1973-1974. To explore the different definitions in this data set, please check out <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html">this link</a>.


---


## Step 2. Explore the data set

Before initiating data preprocessing, you should conduct an <a href="https://www.ibm.com/topics/exploratory-data-analysis?utm_source=skills_network&utm_content=in_lab_content_link&utm_id=Lab-593&cm_sp=ibmdev-_-developer-tutorials-_-ibmcom">exploratory data analysis</a> to understand the data's structure and format, including the types of variables, their distributions, and the overall organization of information.



In [None]:
#check shape of the data
df.shape

In [None]:
#Check for missing values in each column of the dataset
print(df.isnull().sum())

There are no missing values in any of the columns of this 11-dimensional data set, which means that you won’t have to drop any rows from the data set or impute any values.

To understand whether any of the features are correlated with one another, you can use data visualizations such as pair plots and correlation heatmaps, which can show potential signs of multicollinearity. This can subsequently indicate the need for dimensionality reduction.


In [None]:
#Create a heatmap
corr = df.corr()
plt.figure(figsize=(10,6))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap="BuPu",
            vmin=-1,
            vmax=1,
            annot=True)
plt.title("Correlation Heatmap of mtcars dataset")
plt.show()

The heatmap displays the correlation between each pair of variables. The values range from -1 to 1, where -1 indicates perfect negative correlation, 0 indicates no correlation, and 1 denotes perfect positive correlation. In this data set, there is a strong positive correlation of 0.9 between cylinders (cyl) and displacement (disp), meaning that as one increases, the other also tends to increase. There is also a strong negative correlation between mpg and cyl (-0.85), mpg and disp (-0.85), and mpg and wt (-0.87). This means that as displacement, weight, and the number of cylinders increase, the fuel efficiency (mpg) tends to decrease.
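
With more features, reading strong correlations off a heatmap becomes tedious. The sketch below lists them programmatically. The helper name `strongly_correlated_pairs`, the 0.8 threshold, and the toy data are all assumptions for illustration and are not part of the lab.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    corr = frame.corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison is False for any NaN entries, so masked cells are dropped here too
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)

# Toy data (hypothetical): b is almost a multiple of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})

strong_pairs = strongly_correlated_pairs(toy, threshold=0.8)
print(strong_pairs)
```

In the lab, the same helper could be called on `df` to list the pairs discussed above.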


---


## Step 3. Split the data set


Given the exploratory analysis, you can conclude that there are some correlated features within the data set, and as a result, you can expect that the lasso regression model will automatically drop features to reduce the redundancy within the data set. That said, it is important to keep in mind that lasso regression has its limitations: it tends to keep one feature from a group of correlated features and drop the others, and which one it keeps can be somewhat arbitrary.

From here, you split the data set into two sets: a training set and a test set. Setting the random state ensures that the splits you generate are reproducible.
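
The effect of `test_size` and `random_state` can be checked on placeholder data first. The sketch below uses arrays with the same shape as the mtcars predictors (32 rows, 10 features). The array contents and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the lab's data (32 rows, 10 features)
X_demo = np.arange(320).reshape(32, 10)
y_demo = np.arange(32)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print("Training rows:", X_tr.shape[0], "Test rows:", X_te.shape[0])

# The same random_state yields identical splits on repeated calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
```

With 32 rows and `test_size=0.2`, the test set holds 7 rows and the training set 25, so the test metrics in this lab rest on very few observations.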


In [None]:
features = df.columns[1:]
target = df.columns[0]
X = df[features].values
y = df[target].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

---


## Step 4. Standardize data points through feature scaling


Next, you standardize the data so that the scale of your predictors does not distort variable selection, because the scale of a variable affects the size of its coefficient and therefore how heavily the L1 penalty weighs on it. Rescaling each variable to a mean of zero and a standard deviation of one is a common feature scaling technique. This allows the lasso model to select the most important features more accurately. Note that the scaler is fit on the training set only and then applied to the test set, so no information from the test set leaks into training.
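
As a quick check of what standardization does, the sketch below scales two made-up features that live on very different scales. The data and variable names are assumptions for illustration. It fits the scaler on the training portion only and reuses those statistics for the test portion.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up features on very different scales
train_demo = np.column_stack([rng.normal(3, 1, 50), rng.normal(200, 80, 50)])
test_demo = np.column_stack([rng.normal(3, 1, 10), rng.normal(200, 80, 10)])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from the training data
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean and std

print("Train means:", train_scaled.mean(axis=0).round(6))
print("Train stds: ", train_scaled.std(axis=0).round(6))
print("Test means: ", test_scaled.mean(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1 by construction. The scaled test columns generally do not, because they are transformed with the training statistics.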


In [None]:
#scaling and centering the data
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

---


## Step 5. Implement and evaluate the model


Now, you apply this algorithm to your data set and return the predicted values. It’s worth noting that you’re using `GridSearchCV` with the lasso model for this project, but you can also use `LassoCV`, which automatically incorporates cross-validation into the model.

Important: While you can use `LassoCV` or `GridSearchCV` with the lasso model in scikit-learn, they won’t necessarily yield the same results. According to the documentation, `LassoCV` is warm started using the coefficients of a previous iteration of the model on the regularization path. This tends to speed up the hyperparameter search for the alpha value. This is further explained in a <a href="https://github.com/scikit-learn/scikit-learn/issues/24877">scikit-learn GitHub issue</a>.
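
The two approaches can be run side by side over the same candidate alphas. The sketch below does so on synthetic data. The data set, the alpha grid, and the variable names are assumptions for illustration, and the two selected values may or may not coincide.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=10.0, random_state=0
)
alpha_grid = 10.0 ** np.arange(-3, 3)

# LassoCV fits the whole regularization path within each fold
lasso_cv_demo = LassoCV(alphas=alpha_grid, cv=5, max_iter=10000).fit(X_demo, y_demo)

# GridSearchCV fits an independent Lasso model for every alpha and fold
grid_demo = GridSearchCV(
    Lasso(max_iter=10000),
    {"alpha": alpha_grid},
    scoring="neg_mean_squared_error",
    cv=5,
).fit(X_demo, y_demo)

print("LassoCV alpha:     ", lasso_cv_demo.alpha_)
print("GridSearchCV alpha:", grid_demo.best_params_["alpha"])
```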


In [None]:
#Initialize lasso regression model
model = Lasso(max_iter=10000) #default alpha is 1
model.fit(X_train_scaled,y_train)

y_pred = model.predict(X_test_scaled)

#Calculate R-squared
rsquared = r2_score(y_test, y_pred)
print(f"R-squared: {rsquared}")

#Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
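
The two metrics computed in this cell can also be reproduced by hand, which makes their definitions explicit. The sketch below uses made-up values that are for illustration only and are not the lab's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only, not the lab's results
y_true_demo = np.array([21.0, 22.8, 18.7, 14.3, 30.4])
y_pred_demo = np.array([20.1, 24.0, 17.9, 15.5, 27.6])

residuals = y_true_demo - y_pred_demo

# MSE: the average squared residual
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 minus the ratio of residual variation to total variation
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("Manual MSE:", mse_manual, "sklearn MSE:", mean_squared_error(y_true_demo, y_pred_demo))
print("Manual R2: ", r2_manual, "sklearn R2: ", r2_score(y_true_demo, y_pred_demo))
```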

The R-squared and mean squared error (MSE) for this model are 0.77 and 9.26, respectively. To see which coefficients shrank to zero, you display your coefficients in both a table and a graph.


In [None]:
coeff = pd.Series(model.coef_, index=features)
print(coeff)

alphas = np.linspace(0.01, 1000, 100)
coefs = []

# Refit a separate model for each alpha so that `model` (alpha=1) is left unchanged
path_model = Lasso(max_iter=10000)
for a in alphas:
    path_model.set_params(alpha=a)
    path_model.fit(X_train_scaled, y_train)
    coefs.append(path_model.coef_)

ax = plt.gca()
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.legend(features)
ax.grid(False)
plt.axis('tight')
plt.xlabel('alpha')
plt.ylabel('coefficients')
plt.title("Coefficient values with increasing values of alpha")
plt.show()

In the table of coefficients, you see that 7 of the 10 variables have shrunk down to zero. According to this model, three features are key predictors of mpg: the number of cylinders, horsepower, and the weight of the vehicle. However, when you observe the line graph, it looks like the model might have dropped some variables that also had good predictive power, such as disp, because of their high collinearity with the retained features. As you can see, lasso regression is helpful in reducing the number of features in a model to some of the most important ones, but it's not without its limitations.
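
Rather than reading the surviving features off the table, you can filter them. The sketch below shows one way to do this. The helper name `nonzero_coefficients`, the tolerance, and the coefficient values are hypothetical and are not the lab's output.

```python
import pandas as pd

def nonzero_coefficients(coefficients, tol=1e-12):
    """Return the coefficients whose magnitude exceeds `tol`, largest magnitude first."""
    kept = coefficients[coefficients.abs() > tol]
    order = kept.abs().sort_values(ascending=False).index
    return kept.loc[order]

# Hypothetical coefficient values for illustration only
demo_coeff = pd.Series({
    "feature_a": -0.4,
    "feature_b": 0.0,
    "feature_c": 1.7,
    "feature_d": 0.0,
    "feature_e": -2.3,
})

kept_demo = nonzero_coefficients(demo_coeff)
print(kept_demo)
```

In the lab, calling the helper on `coeff` would list only the features that the lasso model retained.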


---


## Step 6. Optimize model with hyperparameter tuning


You want the lowest possible value of MSE for the optimal lasso model, and you find this by trying different values of alpha through grid search. `GridSearchCV` helps you to conduct this optimization through cross-validation, allowing you to find or confirm the best value for the alpha hyperparameter.
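
One detail that is easy to miss: scikit-learn scorers follow a "greater is better" convention, so the MSE is reported with its sign flipped under the name `neg_mean_squared_error`. The sketch below shows the conversion on synthetic data. The data set and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=10.0, random_state=0
)

# One score per fold; each is the negated MSE on that fold's held-out data
neg_mse_scores = cross_val_score(
    Lasso(alpha=1.0, max_iter=10000),
    X_demo,
    y_demo,
    scoring="neg_mean_squared_error",
    cv=5,
)

# Flip the sign to recover the usual, non-negative MSE
cv_mse = -neg_mse_scores.mean()
print("Per-fold scores:", neg_mse_scores)
print("Cross-validated MSE:", cv_mse)
```

The same applies to `best_score_` from a grid search that uses this scorer: the value closest to zero corresponds to the lowest MSE.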


In [None]:
alphas = {"alpha": 10.0 ** np.arange(-5, 6)}
grid_search = GridSearchCV(model, alphas, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train_scaled,y_train)

print("Best value for alpha (lambda): ", grid_search.best_params_)
print("Best score (negative mean squared error): ", grid_search.best_score_)

This grid search confirms that, among the candidate values tested (powers of 10 from 1e-5 to 1e5), the default alpha value of 1 gives the best cross-validated score.


---


# Summary and next steps


In this project, you learned how to apply lasso regression to conduct automatic feature selection, which identified the subset of features in the data set that have the most predictive value to your target variable.

To learn more about other supervised machine learning models that you can apply to classification and regression problems, see these tutorials in the <a href="https://developer.ibm.com/learningpaths/learning-path-machine-learning-for-developers">Getting started</a> with machine learning learning path: 
- <a href="https://developer.ibm.com/learningpaths/learning-path-machine-learning-for-developers/learn-classification-algorithms">Tutorial: Learn classification algorithms using Python and scikit-learn</a>
- <a href="https://developer.ibm.com/learningpaths/learning-path-machine-learning-for-developers/learn-regression-algorithms">Tutorial: Learn regression algorithms using Python and scikit-learn</a>


---


# Exercises


### Exercise 1 - Initialize a lasso regression model and fit it to your data


In [None]:
from sklearn.linear_model import LassoCV
# TODO

<details>
    <summary>Click here for solution</summary>

```python
#Initialize lasso regression model
model = Lasso(max_iter=10000) #default alpha is 1
model.fit(X_train_scaled,y_train)

y_pred = model.predict(X_test_scaled)
```

</details>


### Exercise 2 - From exercise 1, get the R-squared and the MSE


In [None]:
#Calculate R-squared
# TODO

<details>
    <summary>Click here for solution</summary>

```python
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")

#Calculate the mean squared error
mse_lasso = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse_lasso}")
```

</details>


### Exercise 3 - Using `LassoCV`, get the optimal alpha and best cost function value


In [None]:
# TODO

<details>
    <summary>Click here for solution</summary>
    
```python
lasso_cv = LassoCV(cv=5)
lasso_cv.fit(X_train_scaled,y_train)

# Print the optimal alpha parameter
optimal_alpha = lasso_cv.alpha_
print("Optimal alpha:", optimal_alpha)

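# Print the cross-validated mean squared error at the optimal alpha
# (mse_path_ has shape (n_alphas, n_folds); LassoCV picks the alpha with the lowest fold-averaged MSE)
best_cost_function_value = lasso_cv.mse_path_.mean(axis=1).min()
print("Best cost function value:", best_cost_function_value)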
```
</details>


---


# Congratulations! You have completed the lab


## Lead Instructor


[Eda Kavlakoglu](https://author.skills.network/instructors/eda_kavlakoglu)

Eda Kavlakoglu is a Marketing leader with a technical background in data science.


## Instructor


[Lucy Xu](https://author.skills.network/instructors/lucy_xu)

Lucy Xu is a data scientist at the Ecosystems Skills Network at IBM and a fourth year student in Statistics at the University of Waterloo.


## Other Contributors


[Wojciech Fulmyk](https://author.skills.network/instructors/wojciech_fulmyk)

Wojciech Fulmyk is a data scientist at the Ecosystems Skills Network at IBM and a Ph.D. candidate in Economics at the University of Calgary.


Copyright © 2023 IBM Corporation. All rights reserved.
