# Shrinkage Methods: The Lasso

## Objective and Prerequisites

In this notebook you will learn how to:

1. Apply the Lasso using mathematical programming.
2. Perform hyper-parameter tuning using random search. 


To fully understand the content of this notebook, the reader should be familiar with the following:

- Differential calculus.
- Linear algebra (matrix multiplication, transpose and inverse).
- Linear regression analysis.

---
## Motivation

For regression problems, the standard linear model is often used to describe the relationship between a response variable and a set of features. In fact, when applied to real-world problems, it is usually competitive in relation to non-linear methods. However, it may fall short when dealing with problems that have few observations with respect to the number of features considered. Particularly, and provided that the true relationship is approximately linear:

- If the number of observations $n$ is much larger than the number of features $d$, i.e. $n >> d$, then the Ordinary Least Squares (OLS) estimation will tend to have low variance. Hence, the linear model will perform well on unseen data.
- If $n$ is not much larger that $d$, then there can be a lot of variability in the fitting process, resulting in overfitting and thus a poor performance on new observations.
- If $n < d$, there is no longer a unique solution to the OLS algorithm, as the variance is infinite.

Use-cases that commonly have few observations per feature include:

- Genome-scale data analysis.
- Clinical trials.
- Destructive testing.
- Analysis of production systems under abnormal conditions.

To exacerbate this, oftentimes some or many features may not be associated with the response, and including them would only increase the complexity of the resulting model. This notebook will present the Least Absolute Shrinkage and Selection Operator, better known as the Lasso, a technique that has gained a lot of traction because of its ability to deal with both issues.

---
## Problem Description

Linear Regression is a supervised learning algorithm used to predict a quantitative response. It assumes that there is a linear relationship between the feature vector $x_i \in \mathbb{R}^d$ and the response $y_i \in \mathbb{R}$. Mathematically speaking, for sample $i$ we have $y_i = \beta^T x_i + \epsilon_i$, where $\beta \in \mathbb{R}^d$ is the vector of feature weights, including the intercept, and  $\epsilon_i$ is a normally-distributed random variable with zero mean and constant variance representing the error term. We can learn the weights from a training dataset with $n$ observations $\{X \in \mathbb{M}^{nxd},y \in \mathbb{R}^n\}$ by minimizing the Residual Sum of Squares (RSS): $e^Te =(y-X\beta)^T (y-X\beta)=\beta^T X^T X\beta- 2y^TX\beta+y^T y$. The Ordinary Least Squares (OLS) method achieves this by taking the derivative of this quadratic and convex function and then finding the stationary point: $\beta_{OLS}=(X^T X)^{-1} X^T y$.

As previously discussed, Linear Regression does not perform well when the observations-per-feature ratio is low. Shrinkage methods, such as the Lasso, address this problem by constraining —or shrinking— the weight estimates, thus significantly reducing the variance at the expense of a slight increase in bias. 

Specifically, the Lasso fits a model containing all $d$ predictors, while incorporating a budget constraint based on the L1-norm of $\beta$, disregarding the intercept component. Mathematically speaking, this method minimizes the RSS, subject to $\sum_{l=1}^{d-1}\mathopen|\beta_l\mathclose| \leq s$, where $s$ is a hyper-parameter representing the budget that is usually tuned via cross-validation (because of this constraint, the procedure is no longer scale-invariant). This constraint has the effect of shrinking all weight estimates, also known as regularization, allowing some of them to be exactly zero when $s$ is small enough. In fact, when $s \rightarrow 0$ the null model (intercept only) is retrieved, and when $s \geq ||\beta_{OLS}||_1$ the OLS model is recovered (whenever the budget constraint is non-binding). In light of this, the Lasso implicitly assumes that some of the weights truly equal to zero, allowing it to perform feature selection.

It is worth noting that the unconstrained version of the Lasso is more frequently used. This version solves an unconstrained optimization problem where $RSS + \lambda \sum_{l=1}^{d-1}\mathopen|\beta_l\mathclose|$ is minimized, for a given value of the —modified— lagrangian multiplier $\lambda \in \mathbb{R}^+$. However, this notebook will focus on the model presented in the previous paragraph, which may be casted as a Quadratic Program (quadratic and convex objective function with linear constraints).

---
## Solution Approach

Mathematical programming is a declarative approach where the modeler formulates a mathematical optimization model that captures the key aspects of a complex decision problem. The Gurobi optimizer solves such models using state-of-the-art mathematics and computer science. 

A mathematical optimization model has five components, namely:

- Sets and indices.
- Parameters.
- Decision variables.
- Objective function(s).
- Constraints.

Before delving into the mathematical model, Note that the budget constraint involves the L1-norm of the feature weights, which is defined as the sum of their absolute values. This constraint, in its current form, is not amenable to Quadratic Programming because it is nonlinear.  However, it is possible to linearize it using a variable transformation technique, at the expense of additional decision variables.

Let $\beta_j = \beta_j^+ - \beta_j^-$ and $|\beta_j| = \beta_j^+ + \beta_j^-$, where $\beta_j^+, \beta_j^- \geq 0 \quad \forall j \in \{1,2,\dots, d\}$ (for brevity, the intercept is also transformed). Then, the budget constraint can be re-expressed as $\sum_{l=1}^{d-1}(\beta_l^+ + \beta_l^-) \leq s$. Consequently, the RSS is modified as follows:

$$
\begin{equation}
e^Te \\
=(y^T-\beta^TX)(y-X\beta) \\
=[y^T-(\beta^+-\beta^-)^TX^T][y-X(\beta^+-\beta^-)] \\
=y^Ty -y^TX\beta^+ +y^TX\beta^- -\beta^{+T}X^Ty +\beta^{+T}X^TX\beta^+ -\beta^{+T}X^TX\beta^- +\beta^{-T}X^Ty -\beta^{-T}X^TX\beta^+ +\beta^{-T}X^TX\beta^- \\
= \beta^{+T}X^TX\beta^+ +\beta^{-T}X^TX\beta^- -2\beta^{+T}X^TX\beta^- -2y^TX\beta^+ +2y^TX\beta^- +y^Ty
\end{equation}
$$

As a side note, it can be shown from the definitions that $\beta_j^+ = \frac{|\beta_j|+\beta_j}{2}$ and $\beta_j^- = \frac{|\beta_j|-\beta_j}{2}$, which in turn implies that $\beta_j^-=0$ when $\beta_j \geq 0$ and $\beta_j^+=0$ when $\beta_j \leq 0$. In other words, $\beta_j^+ \cdot \beta_j^- = 0 \quad \forall j \in \{1,2,\dots, d\}$.

We are now ready to present a QP formulation that finds the weight estimates for a linear regression problem with an L1-norm constraint on the feature weights:

### Sets and Indices
$i \in I=\{1,2,\dots,n\}$: Set of observations.

$j \in J=\{1,2,\dots,d\}$: Set of features, where the last ID corresponds to the intercept.

$l \in L = J \backslash \{d\}$: Set of features, where the intercept is excluded.

### Parameters
$s \in \mathbb{R}^+$: Budget available for the L1-norm of $\beta$.

$Q = X^T X \in \mathbb{M}^{|J|x|J|}$: Quadratic component of the objective function.

$c = y^TX \in \mathbb{R}^{|J|}$: Linear component of the objective function.

***Note:** Recall that the RSS is defined as $\beta^{+T}X^TX\beta^+ +\beta^{-T}X^TX\beta^- -2\beta^{+T}X^TX\beta^- -2y^TX\beta^+ +2y^TX\beta^- +y^Ty$. However, Since $y^T y$ is constant w.r.t. the decision variables, we can drop this term altogether.

### Decision Variables
$\beta_j = \beta_j^+ - \beta_j^- \in \mathbb{R}$: Weight of feature $j$, representing the change in the response variable per unit-change of feature $j$.

$\beta_j^+, \beta_j^- \in \mathbb{R}^+$: Auxiliary variables used to model the absolute value of $\beta_j$.

### Objective Function

- **Training error**: Minimize the Residual Sum of Squares (RSS):

\begin{equation}
\text{Min} \quad Z = \frac{1}{2}\sum_{j \in J}\sum_{j' \in J}Q_{j,j'}\beta_j^+\beta_{j'}^+ +\frac{1}{2}\sum_{j \in J}\sum_{j' \in J}Q_{j,j'}\beta_j^-\beta_{j'}^- -\sum_{j \in J}\sum_{j' \in J}Q_{j,j'}\beta_j^+\beta_{j'}^- - \sum_{j \in J}c_j\beta_j^+ + \sum_{j \in J}c_j\beta_j^-
\tag{0}
\end{equation}

Note that we use the fact that if $x^*$ is a minimizer of $f(x)$, it is also a minimizer of $a\cdot f(x)$, as long as $a > 0$.

### Constraints

- **Budget constraint**: The L1-norm of the feature weights cannot exceed the budget:

\begin{equation}
\sum_{l \in L}(\beta_l^+ + \beta_l^-) \leq s
\tag{1}
\end{equation}

---
## Python Implementation

In the following implementation, we make use of three main libraries:

- **Numpy** for scientific computing.
- **Scikit learn** for machine learning algorithms.
- **Gurobi** for mathematical optimization.

In [1]:
# Let's import all the necessary libraries

import numpy as np
from itertools import product
from gurobipy import *

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.metrics import mean_squared_error as mse

In [2]:
# Create and deploy the optimization model

def lasso(X, y, budget, verbose=False):
    """
    Deploy and optimize the QP formulation of the Lasso. It assumes that data is
    centered and scaled.
    """
    regressor = Model()
    samples, dim = X.shape
    assert samples == y.shape[0]  
    assert budget >= 0
    
    # Append a column of ones to the feature matrix to account for the y-intercept
    X = np.concatenate([X, np.ones((samples, 1))], axis=1)
    
    # Decision variables. Note that a change of variable is used
    beta_plus = regressor.addVars(dim + 1, name="beta_plus") # Weight estimates
    beta_minus = regressor.addVars(dim + 1, name="beta_minus") # Weight estimates
    intercept_plus = beta_plus[dim] # Last decision variable captures the y-intercept
    intercept_minus = beta_minus[dim] # Last decision variable captures the y-intercept
    intercept_plus.varname = 'intercept_plus'
    intercept_minus.varname = 'intercept_minus'
    
    # Objective Function (OF): minimize 1/2 * RSS using the facts that
    # if x* is a minimizer of f(x), it is also a minimizer of k*f(x) iff k > 0
    # if x* is a minimizer of f(x), it is also a minimizer of f(x) + k
    Quad = np.dot(X.T, X)
    lin = np.dot(y.T, X)
    obj = sum(0.5 * Quad[i,j] * beta_plus[i] * beta_plus[j] for i, j in product(range(dim + 1), repeat=2))
    obj += sum(0.5 * Quad[i,j] * beta_minus[i] * beta_minus[j] for i, j in product(range(dim + 1), repeat=2))
    obj -= sum(Quad[i,j] * beta_plus[i] * beta_minus[j] for i, j in product(range(dim + 1), repeat=2))
    obj -= sum(lin[i] * beta_plus[i] for i in range(dim + 1))
    obj += sum(lin[i] * beta_minus[i] for i in range(dim + 1))
    regressor.setObjective(obj, GRB.MINIMIZE)
    
    # Budget constraint. Note that the intercept is not included
    regressor.addConstr(quicksum([beta_plus[j] + beta_minus[j] for j in range(dim)]) == budget)
    
    if not verbose:
        regressor.params.OutputFlag = 0
    regressor.params.timelimit = 60
    regressor.optimize()
    
    coeff_plus = np.array([beta_plus[i].X for i in range(dim)])
    coeff_minus = np.array([beta_minus[i].X for i in range(dim)])
    return intercept_plus.X - intercept_minus.X, coeff_plus - coeff_minus

---
## Hyper-parameter Tuning

Unlike Ordinary Least Squares (OLS), the Lasso produces a different set of weight estimates $\beta(s)$ for each value of the budget $s$. Thus, we require a method for selecting a value for that hyper-parameter. As $s \in \mathbb{R}^+$, we will:

1. Try several values using random search and compute the cross-validated Mean Square Error (MSE) for each of them.
2. Select the budget for which the cross-validated MSE is smallest.
3. Fit a model using all of the available observations and the selected value for the budget.

***Note:** As previously stated, we can bound our search to $0 \leq s \leq ||\beta_{OLS}||_1$.

In [3]:
def split_folds(features, response, train_mask):
    """
    Assign folds to either train or test partitions based on train_mask.
    """
    xtrain = features[train_mask,:]
    xtest = features[~train_mask,:]
    ytrain = response[train_mask]
    ytest = response[~train_mask]
    return xtrain, xtest, ytrain, ytest

def cross_validate(features, response, budget, folds, seed):
    """
    Train a Lasso model for each fold and report the cross-validated MSE.
    """
    if seed is not None:
        np.random.seed(seed)
    samples, dim = features.shape
    assert samples == response.shape[0]
    fold_size = int(np.ceil(samples / folds))
    # Randomly assign each sample to a fold
    shuffled = np.random.choice(samples, samples, replace=False)
    mse_cv = 0
    # Exclude folds from training, one at a time, to get out-of-sample estimates of the MSE
    for fold in range(folds):
        idx = shuffled[fold * fold_size : min((fold + 1) * fold_size, samples)]
        train_mask = np.ones(samples, dtype=bool)
        train_mask[idx] = False
        xtrain, xtest, ytrain, ytest = split_folds(features, response, train_mask)
        intercept, beta = lasso(xtrain, ytrain, budget)
        ypred = np.dot(xtest, beta) + intercept
        mse_cv += mse(ytest, ypred) / folds
    # Report the average out-of-sample MSE
    return mse_cv

def get_budget_UB(X, y):
    """
    Calculates the L1 norm of the OLS estimates, in log10 units.
    """
    X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
    arg1 = np.linalg.inv(np.dot(X.T, X))
    arg2 = np.dot(X.T, y)
    budget = np.linalg.norm(np.dot(arg1, arg2)[:-1], ord=1)
    return np.log10(budget)

def lasso_cv(features, response, trials=10, folds=5, seed=None):
    """
    Looks for a promising value for the budget on a log-linear space and
    returns the weight estimates for the corresponding Lasso model.
    """
    if seed is not None:
        np.random.seed(seed)
    upper_bound = get_budget_UB(features, response)
    best_mse = np.inf
    best_budget = None
    for i in range(trials):
        # Explore values between the null and OLS models, in log-linear space
        exponent = np.random.uniform(0, upper_bound)
        budget = np.power(10, exponent)
        val = cross_validate(features, response, budget, folds=folds, seed=seed)
        if val < best_mse:
            best_mse = val
            best_budget = budget
    intercept, beta = lasso(features, response, best_budget)
    return intercept, beta

---
## Benchmark

We will validate the output of the implementation presented above with the one provided by Scikit learn. The Boston dataset is used for this purpose. This dataset measures the prices of 506 houses, along with 13 features that provide insights about their neighbourhoods. We will use the original feature terminology, so the interested reader is referred to [this website](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). 

Note that 20% of the samples are reserved for computing the out-of-sample MSE. Furthermore, as the Lasso is not scale-invariant, we first standardize all features so they are expressed in the same units. Such preprocessing entails three steps, namely:

For each feature $x_l$:
1. Compute its sample average $\mu_l$ and sample standard deviation $\sigma_l$.
2. Center by subtracting $\mu_l$ from $x_l$.
3. Scale by dividing the resulting difference by $\sigma_l$. 

This procedure has the effect of coercing the sample average and variance to be equal to zero and one, respectively, across all features.

In [4]:
# Load data and split into train (80%) and test (20%)
boston = load_boston()
X = boston['data']
y = boston['target']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20, random_state=10101)
# Center and scale data to have zero mean and variance of one
scaler = StandardScaler()
scaler.fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)

In [5]:
# Hyper-parameter tuning using random search
b0, beta = lasso_cv(Xtrain, ytrain, seed=10101)
mse(ytest, np.dot(Xtest, beta) + b0)

23.36210006159952

In [6]:
# Hyper-parameter tuning performed using grid search:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
lr = linear_model.LassoCV(cv=5)
lr.fit(Xtrain, ytrain)
mse(ytest, lr.predict(Xtest))

23.36501227078064

In [7]:
# intercept is not accounted for
norm_qp = np.round(np.linalg.norm(np.array(beta), 1), 2)
norm_sklearn = np.round(np.linalg.norm(lr.coef_, 1), 2)
print("The L1-norm of Qp's output is: {0}".format(norm_qp))
print("The L1-norm of Scikit-learn's output is: {0}".format(norm_sklearn))

The L1-norm of Qp's output is: 22.1
The L1-norm of Scikit-learn's output is: 22.14


We can see that both metrics, the MSE and the L1-norm of the vector of feature weights, are virtually the same in both implementations. Notice that the norms are not exactly equal because the implementation in Scikit-learn uses grid search, rather than random search.

### Final Model

The previous analysis indicates that the best model is as follows:

$$
\text{medv} = 22.56-1.00\text{crim}+1.41\text{zn}+0.49\text{chas}-1.84\text{nox}+2.57\text{rm}
$$

$$
-1.84\text{age}-3.48\text{dis}+2.52\text{rad}-2.12\text{tax}-1.85\text{ptratio}+1.01\text{b}-3.64\text{lstat}
$$

Since we standardized the data, the intercept represents the estimated median value (in thousands) of a house with mean values across features. Likewise, we can interpret $\beta_1=-1.00$ as the decrease in the house value when the per-capita crime rate increases by one standard deviation from the average value, *ceteris paribus* (similar statements can be made for the rest of the features).

---
## Conclusion

As already shown, the Lasso is especially useful when we have few observations for the learning process, or when we want to increase the interpretability of the model.

Also, this notebook presented a mathematical model based on Quadratic Programming to find an optimal solution for the Lasso. While this approach may not be the fastest, it can easily accommodate additional linear constraints (Bertsimas, 2015), such as: 

- Enforcing group sparcity among features.
- Limiting pairwise multicollinearity.
- Limiting global multicollinearity.
- Considering a fixed set of nonlinear transformations.

Thus, mathematical programming provides a systematic approach to address common aspects of the model-building process, such as imposing statistical properties, which otherwise would be hard to incorporate into specialized algorithms. Ultimately, it comes down to choosing between efficiency and versatility.

---
## References

1. Bertsimas, D., & King, A. (2015). OR forum—An algorithmic approach to linear regression. Operations Research, 64(1), 2-16.
2. Busa, J. (2012). Solving quadratic programming problem with linear constraints containing absolute values. Acta Electrotechnica et Informatica, 12(3), 11.
3. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. New York: springer.
4. The Boston housing dataset (1996, October 10). Retrieved from https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
5. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

Copyright &copy; 2019 Gurobi Optimization, LLC