## Lasso
The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. 

### Setting regularization parameter
* The alpha parameter controls the degree of sparsity of the estimated coefficients: the larger alpha is, the more coefficients are set exactly to zero.
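The effect of alpha on sparsity can be sketched on synthetic data. The dataset, the grid of alpha values, and the names `X_demo`, `y_demo` and `nonzero_counts` are assumptions made for this illustration only; they are not part of the notebook's own experiment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): 20 features, only 5 carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Fit one Lasso model per alpha and count the coefficients that stay non-zero.
nonzero_counts = {}
for alpha in (0.1, 1.0, 10.0, 100.0):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

The count of non-zero coefficients should shrink as alpha grows, which is the feature-selection behaviour described above.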

### Using cross-validation
* scikit-learn exposes objects that set the Lasso alpha parameter by cross-validation: LassoCV and LassoLarsCV.
* LassoLarsCV is based on the Least Angle Regression algorithm.

* For high-dimensional datasets with many collinear features, LassoCV is most often preferable.
* However, LassoLarsCV has the advantage of exploring more relevant values of the alpha parameter, and if the number of samples is very small compared to the number of features, it is often faster than LassoCV.
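Both estimators can be tried side by side. The following is a minimal sketch on synthetic data; the dataset shape, `cv=5`, and the variable names are assumptions chosen for illustration, and the alphas it prints say nothing about which estimator is preferable in general.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LassoLarsCV

# Synthetic data (illustrative assumption): 100 samples, 30 features.
X_demo, y_demo = make_regression(
    n_samples=100, n_features=30, n_informative=5, noise=5.0, random_state=0
)

# Coordinate-descent Lasso, alpha chosen on a grid by 5-fold cross-validation.
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_demo, y_demo)

# LARS-based Lasso, alpha chosen among the knots of the regularization path.
lars_cv = LassoLarsCV(cv=5).fit(X_demo, y_demo)

print("LassoCV alpha:    ", lasso_cv.alpha_)
print("LassoLarsCV alpha:", lars_cv.alpha_)
```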


### Information-criteria based model selection
* The estimator LassoLarsIC selects alpha using the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
* It is a computationally cheaper way to find the optimal value of alpha, since the regularization path is computed only once instead of k+1 times as in k-fold cross-validation.

Indeed, these criteria are computed on the in-sample training set. In short, they penalize the over-optimistic scores of the different Lasso models by their flexibility.
They tend to break down when the problem is badly conditioned (e.g. more features than samples).

**The AIC criterion is defined as:**

$$\mathrm{AIC} = -2\log(\hat{L}) + 2d$$

where $\hat{L}$ is the maximum likelihood of the model and $d$ is the number of parameters (also referred to as degrees of freedom).

**The BIC criterion is defined as:**

$$\mathrm{BIC} = -2\log(\hat{L}) + \log(N)\,d$$

where $N$ is the number of samples.
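The two definitions translate directly into code. This is a small sketch; the function names `aic` and `bic` and the example numbers are hypothetical, chosen only to show how the penalty terms differ.

```python
import math


def aic(log_likelihood: float, d: int) -> float:
    """Akaike information criterion: -2 log(L-hat) + 2 d."""
    return -2.0 * log_likelihood + 2.0 * d


def bic(log_likelihood: float, d: int, n_samples: int) -> float:
    """Bayesian information criterion: -2 log(L-hat) + log(N) d."""
    return -2.0 * log_likelihood + math.log(n_samples) * d


# Hypothetical values: maximum log-likelihood of -100, 5 parameters, 442 samples.
print(aic(-100.0, 5))
print(bic(-100.0, 5, 442))
```

Because `log(N)` exceeds 2 as soon as N is at least 8, BIC penalizes each extra parameter more heavily than AIC on all but tiny datasets.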


**For a linear Gaussian model, the maximum log-likelihood is defined as:**

$$\log(\hat{L}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{2\sigma^2}$$

where $\sigma^2$ is an estimate of the noise variance, $y_i$ and $\hat{y}_i$ are respectively the true and predicted targets, and $n$ is the number of samples.
 
 
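Plugging the maximum log-likelihood into the AIC formula yields:

$$\mathrm{AIC} = -2\left(-\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{2\sigma^2}\right) + 2d$$

Using $\log(ab) = \log(a) + \log(b)$ to merge the two logarithms gives:

$$\mathrm{AIC} = n\log(2\pi\sigma^2) + \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sigma^2} + 2d$$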
 
 
 
The first term of the above expression is sometimes discarded since it is a constant when $\sigma^2$ is provided.
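As a quick numerical check of this algebra, the sketch below computes the AIC twice: once from the log-likelihood and once from the expanded form, in which the residual term enters with a plus sign. The helper names and the toy numbers are assumptions made for this illustration.

```python
import math


def residual_sum_of_squares(y_true, y_pred):
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))


def gaussian_log_likelihood(y_true, y_pred, sigma2):
    """Maximum log-likelihood of a linear Gaussian model with noise variance sigma2."""
    n = len(y_true)
    rss = residual_sum_of_squares(y_true, y_pred)
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2) - rss / (2 * sigma2)


def gaussian_aic(y_true, y_pred, sigma2, d):
    """Expanded form: n log(2 pi sigma^2) + RSS / sigma^2 + 2 d."""
    n = len(y_true)
    rss = residual_sum_of_squares(y_true, y_pred)
    return n * math.log(2 * math.pi * sigma2) + rss / sigma2 + 2 * d


# Toy values (hypothetical): four targets, their predictions, a noise variance, d = 2.
y_toy = [3.0, -0.5, 2.0, 7.0]
y_toy_pred = [2.5, 0.0, 2.0, 8.0]
sigma2_toy, d_toy = 1.5, 2

aic_direct = -2 * gaussian_log_likelihood(y_toy, y_toy_pred, sigma2_toy) + 2 * d_toy
aic_expanded = gaussian_aic(y_toy, y_toy_pred, sigma2_toy, d_toy)
print(aic_direct, aic_expanded)
```

The two printed values should agree up to floating-point rounding.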
 
$\sigma^2$ is an estimate of the noise variance. In LassoLarsIC, when the parameter noise_variance is not provided (default), the noise variance is estimated via the unbiased estimator

$$\sigma^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - P}$$

where $P$ is the number of features and $\hat{y}_i$ is the target predicted by an ordinary least squares regression. Note that this formula is valid only when `n_samples > n_features` (i.e. $n > P$).
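The estimator can be sketched with NumPy alone. The synthetic data below has no intercept and a known noise variance of 4, and every name in it is an assumption for illustration; it shows the formula, not the library's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5

# Synthetic linear model without intercept; the true noise variance is 2.0**2 = 4.
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=n_samples)

# Ordinary least squares fit, then residual sum of squares divided by (n - P).
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
residuals = y_demo - X_demo @ coef
sigma2_hat = np.sum(residuals ** 2) / (n_samples - n_features)

print(sigma2_hat)
```

The denominator shows why `n > P` is required: with as many features as samples it is zero, and the least squares fit leaves no residual degrees of freedom.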

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Add some random features to the original data to better illustrate
# the feature selection performed by the Lasso model.
rng = np.random.RandomState(42)
num_random_features = 14
# Floating-point samples from the standard normal distribution
X_random = pd.DataFrame(
    data=rng.randn(X.shape[0], num_random_features),
    columns=[f'random-{i}' for i in range(num_random_features)],
)
# X_random.shape --> (442, 14)
X = pd.concat([X, X_random], axis=1)
X.head()

,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,...,random-4,random-5,random-6,random-7,random-8,random-9,random-10,random-11,random-12,random-13
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,...,-0.234153,-0.234137,1.579213,0.767435,-0.469474,0.54256,-0.463418,-0.46573,0.241962,-1.91328
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,...,-0.908024,-1.412304,1.465649,-0.225776,0.067528,-1.424748,-0.544383,0.110923,-1.150994,0.375698
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,...,-0.013497,-1.057711,0.822545,-1.220844,0.208864,-1.95967,-1.328186,0.196861,0.738467,0.171368
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,...,-0.460639,1.057122,0.343618,-1.76304,0.324084,-0.385082,-0.676922,0.611676,1.031,0.93128
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,...,-0.479174,-0.185659,-1.106335,-1.196207,0.812526,1.35624,-0.07201,1.003533,0.361636,-0.64512


In [3]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoLarsIC
from sklearn.preprocessing import StandardScaler

In [4]:
%%time
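# StandardScaler already standardizes the features, so LassoLarsIC needs no extra normalization
# (the `normalize` argument has been removed from recent scikit-learn releases).
pipe = make_pipeline(StandardScaler(), LassoLarsIC(criterion='aic'))
pipe.fit(X, y)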

CPU times: total: 15.6 ms
Wall time: 15.6 ms


### The alpha chosen by the information criterion

In [5]:
pipe[-1].alpha_

0.5867979213449432

### The value of the information criterion (AIC here) across all alphas; the alpha with the smallest value is chosen

In [7]:
result = pd.DataFrame(data={'alphas':pipe[-1].alphas_, 'AIC':pipe[-1].criterion_}).set_index('alphas')

In [8]:
def highlight(result):
    # Color the minimum of each column red and every other value grey
    min_val = result.min()
    return ['color:red' if val == min_val else 'color:grey' for val in result]

In [9]:
result.style.apply(highlight)

alphas,AIC
45.16003,5244.764779
42.300343,5208.250639
21.542052,4928.0189
15.034077,4869.678359
6.189631,4815.437362
5.329616,4810.423641
4.306012,4803.573491
4.124225,4804.126502
3.820705,4803.621645
3.750389,4805.012521


## Noise variance

$\text{noise variance} = \dfrac{\sum (y - \hat{y})^2}{n - p}$ <br>
$n$ = number of samples<br>
$p$ = number of features<br>

In [10]:
pipe[-1].noise_variance_

2870.3303455996593

In [28]:

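# Manual check of the formula, using the pipeline's own (Lasso) predictions
y_hat = pipe.predict(X)
n = X.shape[0]  # number of samples
p = X.shape[1]  # number of features

est_noise_variance = np.sum(np.power(y - y_hat, 2)) / (n - p)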

In [29]:
est_noise_variance

2888.2696215888777

https://direct.mit.edu/neco/article-abstract/15/7/1691/6752/Comparison-of-Model-Selection-for-Regression?redirectedFrom=fulltext