## Automatic feature selection by Lasso

$$
\frac{1}{2N_{training}} \sum_{i=1}^{N_{training}} \left(Y_{real}^{(i)} - Y_{predict}^{(i)}\right)^2 + \alpha \sum_{j=1}^{N} |a_j|
$$
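
As a minimal sketch, the cost function can be evaluated directly with NumPy. The helper name `lasso_cost` and the toy numbers are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np


def lasso_cost(y_real, y_predict, coefs, alpha):
    """Squared-error term plus L1 penalty, following the formula above."""
    y_real = np.asarray(y_real, dtype=float)
    y_predict = np.asarray(y_predict, dtype=float)
    n_training = y_real.shape[0]
    # First term: sum of squared residuals divided by 2 * N_training
    squared_error = np.sum((y_real - y_predict) ** 2) / (2 * n_training)
    # Second term: alpha times the sum of absolute coefficient values
    l1_penalty = alpha * np.sum(np.abs(coefs))
    return squared_error + l1_penalty


# Toy values (hypothetical): one prediction is off by 2, coefficients are 1 and -2
cost = lasso_cost([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], coefs=[1.0, -2.0], alpha=0.5)
cost_without_penalty = lasso_cost([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], coefs=[1.0, -2.0], alpha=0.0)
print(cost, cost_without_penalty)
```

With `alpha=0` the penalty vanishes and only the squared-error term remains, which is the ordinary least-squares objective up to the constant factor.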


Minimizing this cost function means trading off the squared prediction error against the size of the coefficients $a_j$.
The L1 term adds a cost of $\alpha |a_j|$ for every nonzero coefficient, so a coefficient stays nonzero only if it reduces the error enough to justify its penalty.
This is how Lasso performs automatic feature selection: a feature that is useless, or that is collinear with another feature and therefore adds little new information, does not lower the error enough to pay for its coefficient, so Lasso shrinks that coefficient to exactly zero.
The larger $\alpha$ is, the more coefficients are driven to zero.
Lasso works best on scaled data: the penalty treats all coefficients alike, so the features should be standardized to comparable scales before fitting.
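
The shrinking behaviour described above can be sketched on synthetic data. Everything in this example is an assumption made for illustration (the generated data, the variable names, and `alpha=0.5`); it is not taken from the diabetes experiment below. One column drives the target, a second column is a near copy of it, and two columns are pure noise.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500

# Hypothetical data: one informative feature, a near-duplicate of it, two noise features
x_informative = rng.normal(size=n_samples)
x_collinear = x_informative + 0.01 * rng.normal(size=n_samples)
x_noise = rng.normal(size=(n_samples, 2))
X_demo = np.column_stack([x_informative, x_collinear, x_noise])
y_demo = 3.0 * x_informative + 0.1 * rng.normal(size=n_samples)

# Standardize first so the L1 penalty treats every coefficient alike
X_scaled = StandardScaler().fit_transform(X_demo)
demo_coef = Lasso(alpha=0.5).fit(X_scaled, y_demo).coef_

demo_names = ["informative", "collinear", "noise_1", "noise_2"]
for name, value in zip(demo_names, demo_coef):
    print(f"{name:>12}: {value: .4f}")
```

In a run like this the noise columns should receive coefficients of exactly zero, and typically only one column of the collinear pair carries the weight; which of the two survives is not something to rely on.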


In [1]:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_diabetes

In [3]:
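# Feature matrix and target of the diabetes dataset
X, y = load_diabetes(return_X_y=True)
# Column names, in the same order as the columns of X
features = load_diabetes().feature_names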

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Standardize the features, then fit Lasso; the scaler is refit inside every CV fold
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Lasso())
])


In [9]:
search = GridSearchCV(
    pipeline,
    {'model__alpha': np.arange(0.1, 3, 0.1)},  # alpha from 0.1 to 2.9 in steps of 0.1 (the stop value 3 is excluded)
    cv=5,  # 5-fold cross-validation
    scoring='neg_mean_squared_error',
    verbose=3,
)

In [None]:
search.fit(X_train,y_train)

In [16]:
# Best alpha
search.best_params_

{'model__alpha': 1.2000000000000002}

In [12]:
# Coefficients of the Lasso step in the best pipeline found by the search
coef = search.best_estimator_.named_steps['model'].coef_

In [14]:
# Features kept by the model (nonzero coefficients)
np.array(features)[coef!=0]

array(['age', 'sex', 'bmi', 'bp', 's1', 's3', 's5'], dtype='<U3')

In [15]:
# Features discarded by the model (coefficients shrunk to exactly zero)
np.array(features)[coef==0]

array(['s2', 's4', 's6'], dtype='<U3')
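
To see how strongly each kept feature contributes, the coefficients can be paired with their names. The sketch below is self-contained and makes two assumptions: it refits the pipeline directly with `alpha=1.2` (the rounded value reported by the search above) instead of re-running the grid search, and the name `coef_by_name` is introduced here only for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42
)

# Assumption: alpha is fixed to the value reported by the grid search above
model = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=1.2)),
])
model.fit(X_train, y_train)

# Map each feature name to its fitted coefficient
coef_by_name = dict(zip(data.feature_names, model.named_steps['model'].coef_))

# Print features from largest to smallest absolute coefficient
for name, value in sorted(coef_by_name.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>4}: {value: .3f}")
```

Because the features are standardized before fitting, the absolute values of the coefficients are roughly comparable across features.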