# Lasso Regularization

In [80]:
import pandas as pd

data = pd.read_csv('data.csv')

data.head()

Unnamed: 0,age,children,bmi,sex_female,sex_male,smoker,region_northeast,region_northwest,region_southeast,region_southwest,price_range
0,28,1,37.62,1,0,0,0,0,1,0,cheap
1,28,1,24.32,1,0,0,1,0,0,0,expensive
2,35,1,34.8,1,0,0,0,0,0,1,cheap
3,51,3,36.385,1,0,0,0,1,0,0,expensive
4,20,0,30.59,1,0,0,1,0,0,0,cheap


You should know the dataset by now! 
- Each row corresponds to the profile of a health insurance client
- The features are client characteristics
- `price_range` labels the charges paid by the client for the insurance as `cheap` or `expensive`

👇 Optimize the regularization penalty of a Lasso classification model. According to your optimal model, which features do not influence the charges paid by a client?

In [170]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# X and y
X = data.drop(columns="price_range")
y = data['price_range']

# Classification model with Lasso (L1) regularization
model = LogisticRegression(penalty='l1', solver = 'liblinear')

# Hyperparameter search space for C (inverse of the regularization strength: smaller C means a stronger penalty)
search_space = {'C': uniform(0,10)}

# Instantiate Random Search
search = RandomizedSearchCV(model, param_distributions = search_space, n_jobs=-1, scoring = 'accuracy', cv = 10, n_iter = 50)

# Fit data to Random Search
search.fit(X, y)

RandomizedSearchCV(cv=10,
                   estimator=LogisticRegression(penalty='l1',
                                                solver='liblinear'),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1280dfbd0>},
                   scoring='accuracy')

In [171]:
# Best regularization penalty
search.best_params_

{'C': 1.634581030837473}
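
In scikit-learn, `C` is the inverse of the regularization strength, so a smaller `C` means a stronger L1 penalty and more coefficients driven to exactly zero. Here is a minimal sketch of that effect. The synthetic data from `make_classification` and the grid of `C` values are assumptions chosen for illustration; they are not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: NOT the insurance dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

nonzero_counts = {}
for C in [0.001, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_demo, y_demo)
    # Count the coefficients that survive the L1 penalty
    nonzero_counts[C] = int((clf.coef_[0] != 0).sum())

print(nonzero_counts)
```

With a very small `C` the penalty dominates and few (if any) features keep a non-zero coefficient; as `C` grows, more features re-enter the model.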

In [172]:
# Mean CV accuracy averaged over all 50 sampled values of C (the optimal model's own score is search.best_score_)

search.cv_results_['mean_test_score'].mean()

0.8676153846153847
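
Note that `cv_results_['mean_test_score'].mean()` averages the score over every sampled candidate, whereas `best_score_` is the mean cross-validated score of the best candidate only. The sketch below shows the difference on synthetic data. The dataset, the search space bounds, and the `demo_` names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the insurance dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

demo_search = RandomizedSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    # C sampled uniformly in [0.01, 10.01]; the lower bound avoids C = 0
    param_distributions={"C": uniform(loc=0.01, scale=10)},
    scoring="accuracy",
    cv=5,
    n_iter=10,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

# Average over ALL sampled candidates (what the cell above computes)
mean_over_candidates = demo_search.cv_results_["mean_test_score"].mean()

# Mean CV score of the single best candidate
best_cv_score = demo_search.best_score_

print(mean_over_candidates, best_cv_score)
```

`best_cv_score` can never be lower than `mean_over_candidates`, so reporting the average over all candidates understates the performance of the selected model.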

In [179]:
# Sort the features by coefficient (a coefficient of exactly 0 means the feature was dropped by the L1 penalty)
pd.Series(search.best_estimator_.coef_.tolist()[0], index = X.columns).sort_values(ascending=False)



smoker              5.396185
age                 0.127587
region_northwest    0.000000
region_northeast    0.000000
sex_female          0.000000
children            0.000000
bmi                -0.068469
sex_male           -0.175765
region_southeast   -0.331855
region_southwest   -0.582688
dtype: float64
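
To answer the question, one option is to separate the features whose coefficient is exactly zero from the others. The sketch below hard-codes the coefficients printed above so that it runs on its own; the names `coefs`, `dropped_features` and `kept_features` are illustrative. In the notebook, the same filtering could be applied directly to the Series built from `search.best_estimator_.coef_`.

```python
import pandas as pd

# Coefficients copied from the output above
coefs = pd.Series({
    "smoker": 5.396185,
    "age": 0.127587,
    "region_northwest": 0.0,
    "region_northeast": 0.0,
    "sex_female": 0.0,
    "children": 0.0,
    "bmi": -0.068469,
    "sex_male": -0.175765,
    "region_southeast": -0.331855,
    "region_southwest": -0.582688,
})

# Features whose coefficient was shrunk to exactly zero by the L1 penalty
dropped_features = coefs[coefs == 0].index.tolist()

# Remaining features, ranked by absolute coefficient (largest first)
kept_features = coefs[coefs != 0].abs().sort_values(ascending=False)

print(dropped_features)
print(kept_features)
```

Keep in mind that the features were not scaled before fitting, so the magnitudes of the non-zero coefficients are not directly comparable across features such as `age` and `smoker`.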

⚠️ Please push the exercise once you have completed it 🙃

# 🏁