In [1]:
from lib.project_5_ADH import load_data_from_database, make_data_dict, general_model, general_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Step 2 - Identify Salient Features Using $\ell_1$-penalty

### Domain and Data

The data, referred to as Madelon, has 2000 rows and 500 features, plus one index column and one target column. The dataset is artificial, with a two-class target (-1, 1) and continuous input variables.

### Problem Statement

Implement a machine learning pipeline using logistic regression with the l1 penalty to select salient features programmatically.

### Solution Statement

Provide a Jupyter notebook with a regularized pipeline that reports the number of salient features used in the model.

### Metric

I would like to identify the number of salient features, counted here as the features whose coefficient has an absolute value greater than 0.0001.

### Benchmark

I'd like to get as few salient features as possible.

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5_ADH.py`.

<img src="assets/identify_features.png" width="600px">

In [2]:
madelon_df = load_data_from_database('dsi_student', 'correct horse battery staple', 'joshuacook.me', 
                                     '5432', 'dsi', 'madelon')

In [3]:
madelon_df.head()

Unnamed: 0,index,feat_000,feat_001,feat_002,feat_003,feat_004,feat_005,feat_006,feat_007,feat_008,...,feat_491,feat_492,feat_493,feat_494,feat_495,feat_496,feat_497,feat_498,feat_499,label
0,0,485,477,537,479,452,471,491,476,475,...,481,477,485,511,485,481,479,475,496,-1
1,1,483,458,460,487,587,475,526,479,485,...,478,487,338,513,486,483,492,510,517,-1
2,2,487,542,499,468,448,471,442,478,480,...,481,492,650,506,501,480,489,499,498,-1
3,3,480,491,510,485,495,472,417,474,502,...,480,474,572,454,469,475,482,494,461,1
4,4,484,502,528,489,466,481,402,478,487,...,479,452,435,486,508,481,504,495,511,1


In [4]:
data_dict = make_data_dict(madelon_df, random_state = 40, test_size = 0.20)

In [5]:
data_dict = general_transformer(StandardScaler(), data_dict)

In [6]:
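# liblinear supports the l1 penalty; newer scikit-learn versions default to lbfgs, which does not
lg_l1 = general_model(LogisticRegression(penalty = 'l1', solver = 'liblinear'), data_dict)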

In [7]:
lg_l1['train score'],lg_l1['test score']

(0.77562500000000001, 0.54249999999999998)

In [8]:
coefs = data_dict['processes'][1].coef_.flatten()

In [9]:
coefs_abs = [abs(coef) for coef in coefs]

In [10]:
len([num for num in coefs_abs if num > 0.0001])

472
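
The count above can also be expressed with NumPy, which additionally gives the positions of the retained features. This is only a minimal sketch: the `coefs` array below is a small hypothetical stand-in (not the notebook's fitted coefficients), and the `1e-4` cutoff mirrors the threshold used in the cell above.

```python
import numpy as np

# Hypothetical coefficient vector standing in for the fitted `coefs`.
coefs = np.array([0.0, 0.25, -0.00003, -1.2, 0.0, 0.0007])
threshold = 1e-4  # same cutoff as the list comprehension above

# Boolean mask of coefficients whose magnitude exceeds the cutoff.
salient_mask = np.abs(coefs) > threshold

n_salient = int(np.count_nonzero(salient_mask))  # how many features survive
salient_idx = np.flatnonzero(salient_mask)       # which positions survive

print(n_salient)    # 3
print(salient_idx)  # [1 3 5]
```

Keeping the indices as well as the count makes it possible to map the surviving coefficients back to column names such as `feat_000`.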

## Results

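Using the l1 penalty with logistic regression, the model kept 472 of the 500 features (those with an absolute coefficient above 0.0001), so the penalty as configured removes very little. The train score (about 0.78) is well above the test score (about 0.54), which suggests the model is overfitting and that these 472 features do not generalize well to unseen data.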
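
One way to explore getting fewer salient features is to vary the inverse regularization strength `C` of `LogisticRegression` and watch how many coefficients survive. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in for Madelon, it does not use the `lib/project_5_ADH.py` helpers, and the grid of `C` values is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Madelon: 50 features, only 5 of them informative.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=0, random_state=40)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

counts = {}
for C in (0.001, 0.01, 0.1, 1.0):  # smaller C means a stronger l1 penalty
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X_train_s, y_train)
    # Count coefficients above the same 1e-4 cutoff used in this notebook.
    counts[C] = int(np.count_nonzero(np.abs(model.coef_) > 1e-4))
    print(C, counts[C],
          model.score(X_train_s, y_train),
          model.score(X_test_s, y_test))
```

Comparing the train and test scores at each `C` shows whether dropping features also narrows the gap between them. Whether the same pattern holds on Madelon would need to be checked on the real data.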