### Codio Activity 13.6: Use L1 Regularization to Select Features

**Expected Time = 90 minutes** 

**Total Points = 60** 

This activity focuses on using the L1 regularization penalty to select features in a classification setting.  In the following, you will explore how the coefficient values change as you increase regularization.  Be sure to use the `liblinear` solver in your models throughout.

### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV


import seaborn as sns

### The Data

For this exercise, you will use the built-in dataset from seaborn containing information on passengers on the Titanic.  Here, you will only use the numeric features.  The data is loaded and prepared below.  We will use a single `X` and `y` (no train/test split) to explore the effect of added regularization. 

In [None]:
data = sns.load_dataset('titanic').dropna()
# data = data.frame

In [None]:
data.head()

In [None]:
X, y = data.select_dtypes(np.number).drop('survived', axis = 1), data.survived

[Back to top](#-Index)

### Problem 1

#### Scaling the Data

**10 Points**

Because we are using regularization, it is important to have each of the features represented on the same scale.  To do so, use the `StandardScaler` to create `X_scaled` below.  
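
As a minimal sketch of what scaling does, the example below standardizes a small synthetic array whose two columns live on very different scales. The array `toy` and its values are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales
rng = np.random.default_rng(0)
toy = np.column_stack([rng.normal(30, 10, size=100),
                       rng.normal(1000, 300, size=100)])

toy_scaled = StandardScaler().fit_transform(toy)

# After scaling, each column has mean ~0 and standard deviation ~1
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0).round(6))
```

Because the penalty acts on the size of the coefficients, putting every feature on the same scale keeps any one feature from being penalized more or less simply because of its units.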

In [None]:
### GRADED

scaler = ''
X_scaled = ''

### BEGIN SOLUTION
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
### END SOLUTION

# Answer check
X_scaled.mean()

In [None]:
### BEGIN HIDDEN TESTS
scaler = StandardScaler()
X_scaled_ = scaler.fit_transform(X)
#
#
#
np.testing.assert_array_equal(X_scaled, X_scaled_)
### END HIDDEN TESTS

### Problem 2

#### `C` values to explore

**20 Points**

Next, you want to create an array of different `C` values to explore.  Remember that `C` is the inverse of the regularization strength, so smaller values of `C` mean stronger regularization.  
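
To make the role of `C` concrete, here is one common way to write the objective that an L1-penalized logistic regression minimizes. It follows the form given in the scikit-learn user guide and assumes the labels are coded as $y_i \in \{-1, 1\}$; the symbols $w$ (coefficients), $c$ (intercept), and $x_i$ (feature vector of observation $i$) are notation introduced here.

$$
\min_{w,\, c} \; \|w\|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i \left(x_i^\top w + c\right)\right) + 1\right)
$$

Because `C` multiplies the data-fit term rather than the penalty, shrinking `C` makes the penalty $\|w\|_1$ relatively heavier, which pushes more entries of $w$ to exactly zero.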


Use the array `Cs` below to fit a model on `X_scaled` and `y` for each value of `C`.  Store each model's coefficients (as a list, not an array!) in the list `coef_list` below.  

In [None]:
Cs = np.logspace(-5, .5)
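
The sketch below illustrates the effect of shrinking `C` on a synthetic stand-in dataset rather than the Titanic data. The names `X_demo`, `y_demo`, and `demo_Cs`, along with the particular `C` values, are assumptions chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set): 5 features, 2 informative
X_demo, y_demo = make_classification(n_samples=200, n_features=5, n_informative=2,
                                     n_redundant=0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

demo_Cs = [1e-4, 1e-2, 1e-1, 1.0, 10.0]
nonzero_counts = []
for C in demo_Cs:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C,
                               random_state=42, max_iter=1000).fit(X_demo, y_demo)
    # Count how many coefficients survived the penalty at this C
    nonzero_counts.append(int(np.count_nonzero(model.coef_[0])))

for C, k in zip(demo_Cs, nonzero_counts):
    print(f"C = {C:<8} nonzero coefficients: {k}")
```

At the smallest `C` the penalty dominates and every coefficient is driven to zero; as `C` grows, coefficients enter the model.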

In [None]:
### GRADED

coef_list = []

### BEGIN SOLUTION
coef_list = []
for C in Cs:
    lgr = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C, random_state=42, max_iter = 1000).fit(X_scaled, y)
    coef_list.append(list(lgr.coef_[0]))
### END SOLUTION

### ANSWER CHECK
coef_list[0]

In [None]:
### BEGIN HIDDEN TESTS
coef_list_ = []
for C in Cs:
    lgr_ = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C, random_state=42, max_iter = 1000).fit(X_scaled, y)
    coef_list_.append(list(lgr_.coef_[0]))
#
#
#
assert coef_list == coef_list_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 3

#### DataFrame of Coefficients

**10 Points**

Next, create a DataFrame based on the coefficients in `coef_list`.  Set the index of this DataFrame to the `Cs` values.  
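
As a small illustration of the shape involved, the sketch below builds a DataFrame from a list of lists, with one row per `C` value and one column per feature. The values and the column names `feat_a` and `feat_b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient rows: one inner list per C value, one entry per feature
toy_Cs = np.array([0.01, 0.1, 1.0])
toy_coefs = [[0.0, 0.0], [0.0, -0.3], [0.2, -0.5]]

# Rows are labeled by C, columns by feature name
toy_df = pd.DataFrame(toy_coefs, columns=['feat_a', 'feat_b'], index=toy_Cs)
print(toy_df)
```

Indexing by `C` means a row lookup such as `toy_df.loc[0.1]` returns the coefficients fitted at that level of regularization.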

In [None]:
### GRADED

coef_df = ''

### BEGIN SOLUTION
coef_df = pd.DataFrame(coef_list, columns = X.columns)
coef_df.index = Cs
### END SOLUTION

### ANSWER CHECK
coef_df.head()

In [None]:
### BEGIN HIDDEN TESTS
coef_list_ = []
for C in Cs:
    lgr_ = LogisticRegression(penalty = 'l1', solver = 'liblinear', C = C, random_state=42, max_iter = 1000).fit(X_scaled, y)
    coef_list_.append(list(lgr_.coef_[0]))
coef_df_ = pd.DataFrame(coef_list_, columns = X.columns)
coef_df_.index = Cs
#
#
#
pd.testing.assert_frame_equal(coef_df, coef_df_)
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 4

#### Visualizing the Results

**10 Points**

Below, the coefficients are plotted as regularization increases.  Based on this plot, which feature seems more important -- `age` or `parch`?  Assign your answer as a string to `ans4` below.

<center>
    <img src = 'images/coefl1.png' />
</center>
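
One way to read a plot like this numerically is to ask, for each feature, at what level of regularization its coefficient is still nonzero. The sketch below does this on a hypothetical coefficient table; the column names and values are invented for illustration and do not come from the Titanic fit.

```python
import pandas as pd

# Hypothetical coefficient table: index = C values (ascending), columns = features
toy_df = pd.DataFrame(
    {'feat_a': [0.0, 0.0, 0.4, 0.6],
     'feat_b': [0.0, -0.2, -0.5, -0.7]},
    index=[0.001, 0.01, 0.1, 1.0],
)

# Smallest C (strongest regularization) at which each feature is still nonzero
first_nonzero_C = {col: toy_df.index[toy_df[col] != 0].min() for col in toy_df.columns}
print(first_nonzero_C)
```

Under this reading, a feature that stays nonzero at a smaller `C` (here `feat_b`) survives more regularization, which is one reasonable signal of importance.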

In [None]:
# plt.figure(figsize = (12, 5))
# plt.semilogx(coef_df)
# plt.gca().invert_xaxis()
# plt.grid()
# plt.legend(list(coef_df.columns));
# plt.title('Increasing Regularization on Titanic Features')
# plt.xlabel("Increasing 1/C")
# plt.savefig('images/coefl1.png')

In [None]:
### GRADED

ans4 = ''

### BEGIN SOLUTION
ans4 = 'age'
### END SOLUTION

### ANSWER CHECK
print(ans4)

In [None]:
### BEGIN HIDDEN TESTS
ans4_ = 'age'
#
#
#
assert ans4 == ans4_
### END HIDDEN TESTS

[Back to top](#-Index)

### Problem 5

#### Using `SelectFromModel`

**10 Points**

In a similar manner, you can use `SelectFromModel` together with `LogisticRegression` to select features based on coefficient values.  Below, create an instance of the `SelectFromModel` selector with a `LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear', random_state = 43)` as the estimator.  Fit and transform the data to select the most important features.  Assign their names as a list to `best_features` below.
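
The sketch below shows the general pattern on synthetic stand-in data, including one way to map the selector's boolean mask back to column names. The data and the names `X_demo`, `demo_selector`, and `f0` through `f4` are assumptions made for this illustration. For an L1-penalized estimator, `SelectFromModel` uses a default threshold of `1e-5`, so it keeps the features whose coefficients are not (numerically) zero.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=200, n_features=5, n_informative=2,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])
X_demo_scaled = StandardScaler().fit_transform(X_demo)  # returns a NumPy array

demo_selector = SelectFromModel(
    LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=43)
)
X_selected = demo_selector.fit_transform(X_demo_scaled, y_demo)

# get_support() is a boolean mask over the input columns
selected_names = list(X_demo.columns[demo_selector.get_support()])
print(selected_names, X_selected.shape)
```

Because the scaled data is a NumPy array, the selector has no column names of its own; indexing the original columns with the mask recovers them.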

In [None]:
### GRADED

selector = ''
best_features = ''
### BEGIN SOLUTION
selector = SelectFromModel(LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear', random_state = 43))
ans = selector.fit_transform(X_scaled, y)
best_features = ['age', 'fare']
### END SOLUTION

### ANSWER CHECK
# X_scaled is a NumPy array, so map the selector's mask back to the column names of X
print(list(X.columns[selector.get_support()]))

In [None]:
### BEGIN HIDDEN TESTS
best_features_ = ['age', 'fare']
#
#
#
assert set(best_features) == set(best_features_)
### END HIDDEN TESTS