In [1]:
#from lib.project_5 import load_data_from_database, make_data_dict, general_model, general_transformer

# Step 2 - Identify Salient Features Using $\ell_1$-penalty

**NOTE: EACH OF THESE SHOULD BE WRITTEN SOLELY WITH REGARD TO STEP 2 - Identify Features**

### Domain and Data

**We ran a naive logistic regression on the Madelon data set and reviewed the coefficient of each of its 500 features. Some of the largest coefficients are listed below:**

![](http://localhost:8888/notebooks/dsi/dsi-workspace/project-05/images/sample_co_eff_step1.png)

### Problem Statement

**The naive LogisticRegression scored 1.0 on the training set but only `0.544` on the test set, which suggests that the model overfits and generalizes poorly. We need to improve the test score and identify the salient features, i.e. those that carry the most information about our target.**
  
### Solution Statement

**I will build a pipeline with LogisticRegression and a Lasso-style penalty, i.e. `l1`, using small C values (a smaller C means stronger regularization) to help identify salient features. I plan to run the model over several C values to find the highest test score with between 8 and 10 features.**
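For reference, the optimization problem behind this plan can be written out. This is the form given in the scikit-learn documentation for $\ell_1$-penalized logistic regression, with labels $y_i \in \{-1, +1\}$, weights $w$, and intercept $c$:

$$\min_{w,\,c}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)$$

Because $C$ multiplies the data-fit term, shrinking $C$ gives the $\ell_1$ term relatively more weight, which pushes more coefficients to exactly zero.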

### Metric

**Of the 5-10 models fitted as described in the solution statement, we will use the one with the highest test score and use its coefficients to identify the salient features: the larger a coefficient's magnitude, the more salient the feature.**

### Benchmark

**We need to identify `5-10` salient features.**

In [1]:
from sqlalchemy import create_engine
import pandas as pd
#from os import chdir; chdir('..')
from os import chdir; chdir('./lib')
from sklearn.preprocessing import StandardScaler
from project_5 import load_data_from_database, make_data_dict,general_transformer, general_model
from sklearn import linear_model
#from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression,Lasso
from sklearn import metrics
from sklearn.feature_selection import SelectKBest, f_regression, f_classif
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

In [2]:
!pwd

/src/dsi/dsi-workspace/project-05/lib


**Load the data, define X and y, split the data into train and test sets, and scale it for analysis. We will split the data into 50% train and 50% test.**

In [3]:
madelon_df = load_data_from_database()
del madelon_df["index"]
# Define X and y
y = madelon_df["label"]
X = madelon_df.drop("label",axis=1)
data_dict= make_data_dict(X,y,0.5,random_state=42)
data_dict = general_transformer(StandardScaler(),data_dict)
data_dict

{'X_test': array([[ 1.12156758, -0.51026631,  0.33005212, ..., -0.52713324,
         -0.07137678,  0.47371054],
        [-0.3047583 , -0.67637944,  1.08365312, ..., -0.01450019,
         -0.07137678,  0.47371054],
        [ 0.17068366, -1.34083196, -0.64962918, ...,  1.66986554,
          0.79543799, -0.22014492],
        ..., 
        [ 0.17068366, -0.77604732, -0.09698845, ..., -1.47916605,
          0.30785468,  0.04968776],
        [ 0.80460628,  2.34687951,  0.48077232, ...,  1.96279872,
         -0.28808047,  0.82063828],
        [ 0.48764497, -0.04514955, -2.15683118, ..., -0.74683312,
         -2.02171001,  1.01337591]]),
 'X_train': array([[-1.41412287, -0.44382106,  0.02861172, ..., -0.08773348,
          0.44329449,  0.04968776],
        [-0.14627764,  0.65252559,  0.43053225, ...,  0.93753262,
          1.74351665, -0.52852513],
        [-0.78020026, -1.2743867 , -0.27282868, ...,  0.86429932,
         -0.17972863, -0.60562018],
        ..., 
        [-0.14627764,  0.287076
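
The loading cell above relies on `make_data_dict` and `general_transformer` from `lib/project_5.py`, whose source is not shown in this notebook. The following is a minimal sketch of the same split-then-scale idea on synthetic data. The variable names are hypothetical, and fitting the scaler on the training half only is an assumption about good practice, not a statement of what the helpers actually do.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Madelon features and labels (hypothetical data).
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 6))
y_demo = rng.randint(0, 2, size=200)

# 50/50 split with a fixed seed, mirroring make_data_dict(X, y, 0.5, random_state=42).
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)

# Fit the scaler on the training half only, then apply it to both halves,
# so no information from the test half leaks into the scaling parameters.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape)
print(X_test_scaled.shape)
```

After this step the training columns have mean 0 and unit variance; the test columns are only approximately standardized because they reuse the training statistics.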

## Running LogisticRegression with Lasso and C values
### Round 1:  With C 1.0
1. Run logistic regression with Lasso and C as 1.0
2. Review Score


In [4]:
# run logistic regression with lasso and C as 1.0
model = linear_model.LogisticRegression(penalty = 'l1', C=1.0)
data_dict = general_model(model, data_dict)
print "Train Score:", data_dict["train_score"], "Test Score:", data_dict["test_score"]

Train Score: 0.919 Test Score: 0.533


**The gap between the train and test scores is large, which indicates overfitting, so we will continue with stronger regularization to get a better test score.**
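The round above relies on `general_model` from `lib/project_5.py`, whose source is not shown here. A minimal sketch of what such a helper might do, assuming it fits the model, stores the train and test scores, and appends the fitted estimator to `data_dict["processes"]`. The function name `general_model_sketch` and the `y_train`/`y_test` keys are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def general_model_sketch(model, data_dict):
    """Hypothetical stand-in for general_model: fit, score, and log the model."""
    model.fit(data_dict["X_train"], data_dict["y_train"])
    data_dict["train_score"] = model.score(data_dict["X_train"], data_dict["y_train"])
    data_dict["test_score"] = model.score(data_dict["X_test"], data_dict["y_test"])
    # Keep every fitted estimator so later cells can look it up by position.
    data_dict.setdefault("processes", []).append(model)
    return data_dict

# Synthetic data (hypothetical): the label depends on the first column plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
demo = {"X_train": X_demo[:100], "y_train": y_demo[:100],
        "X_test": X_demo[100:], "y_test": y_demo[100:], "processes": []}

# The liblinear solver supports the l1 penalty.
demo = general_model_sketch(
    LogisticRegression(penalty="l1", C=1.0, solver="liblinear"), demo
)
print(demo["train_score"])
print(demo["test_score"])
```

Under this assumption, each call adds one entry to `processes`, which is why later cells can retrieve a given round's model by its index in that list.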

### Round 2 with C as 0.1

1. Run logistic regression with Lasso and C as 0.1
2. Review Score


In [5]:
# run logistic regression with lasso and C as 0.1 for EDA purpose
model = linear_model.LogisticRegression(penalty = 'l1', C=0.1)
data_dict = general_model(model, data_dict)
print "Train Score:", data_dict["train_score"], "Test Score:", data_dict["test_score"]

Train Score: 0.798 Test Score: 0.554


**The gap between the train and test scores is still large, so we will continue with stronger regularization to get a better test score.**

### Round 3 with C as 0.05

1. Run logistic regression with Lasso and C as 0.05
2. Review Score


In [6]:
# run logistic regression with lasso and C as 0.05 for EDA purpose
model = linear_model.LogisticRegression(penalty = 'l1', C=0.05)
data_dict = general_model(model, data_dict)
print "Train Score:", data_dict["train_score"], "Test Score:", data_dict["test_score"]

Train Score: 0.724 Test Score: 0.582


**The gap between the train and test scores is still large, so we will continue with stronger regularization to get a better test score.**

### Round 4 with C as 0.03
1. Run logistic regression with Lasso and C as 0.03
2. Review Score
3. Review Features

In [7]:
# run logistic regression with lasso and C as 0.03 for EDA purpose
model = linear_model.LogisticRegression(penalty = 'l1', C=0.03)
data_dict = general_model(model, data_dict)
print "Train Score:", data_dict["train_score"], "Test Score:", data_dict["test_score"]

Train Score: 0.657 Test Score: 0.591


In [8]:
# Pair each feature name with its coefficient from the Round 4 model (C=0.03)
feature_coef = []
lr_round4 = data_dict["processes"][4]
for i,k in enumerate(X.columns):
    feature_coef.append([k, lr_round4.coef_[0][i]])
df_coef=pd.DataFrame(feature_coef)
df_coef.columns = ["feature","coef_"]

# Show the 15 largest (most positive) coefficients
df_coef.sort_values(["coef_"],ascending=False).head(15)

| | feature | coef_ |
|---|---|---|
| 475 | feat_475 | 0.323946 |
| 48 | feat_048 | 0.122353 |
| 378 | feat_378 | 0.05502 |
| 307 | feat_307 | 0.053997 |
| 46 | feat_046 | 0.044833 |
| 424 | feat_424 | 0.032432 |
| 329 | feat_329 | 0.028854 |
| 282 | feat_282 | 0.023329 |
| 116 | feat_116 | 0.018506 |
| 136 | feat_136 | 0.007145 |


**So far we have run the models below:**

In [9]:
data_dict["processes"][1:5]

[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 LogisticRegression(C=0.03, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_sta

**This is the best test score so far; however, the model still keeps 14 features. We will narrow down the salient features with smaller C values (stronger regularization).**
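Counting the surviving features, as in the remark above, and ranking them is easier with absolute values, because sorting by signed `coef_` in descending order places any negative coefficients below the zeros. A minimal sketch with made-up coefficients; the array `coef` is a hypothetical stand-in for `model.coef_[0]`, not output from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficient vector standing in for model.coef_[0].
coef = np.array([0.32, 0.0, -0.21, 0.05, 0.0, -0.01, 0.0, 0.12])
names = ["feat_{:03d}".format(i) for i in range(len(coef))]

df = pd.DataFrame({"feature": names, "coef_": coef})
df["abs_coef"] = df["coef_"].abs()

# Features the L1 penalty kept: coefficient is not exactly zero.
kept = df[df["coef_"] != 0]
n_kept = len(kept)

# Rank by magnitude so strong negative coefficients are not pushed to the bottom.
ranked = kept.sort_values("abs_coef", ascending=False)
print(n_kept)
print(ranked[["feature", "coef_"]].to_string(index=False))
```

With this ranking, a feature with a large negative weight appears next to positive features of similar magnitude instead of being hidden by `head()`.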

### Round 5 with C as 0.025 
1. Run logistic regression with Lasso and C as 0.025
2. Review Score
3. Review Features

In [10]:
# run logistic regression with lasso and C as 0.025 for EDA purpose
model = linear_model.LogisticRegression(penalty = 'l1', C=0.025)
data_dict = general_model(model, data_dict)
print "Train Score:", data_dict["train_score"], "Test Score:", data_dict["test_score"]

Train Score: 0.64 Test Score: 0.594


**The test score has improved. Let's take a look at the features.**

In [11]:
# Pair each feature name with its coefficient from the Round 5 model (C=0.025)
feature_coef = []
lr_round5 = data_dict["processes"][5]
for i,k in enumerate(X.columns):
    feature_coef.append([k, lr_round5.coef_[0][i]])
df_coef=pd.DataFrame(feature_coef)
df_coef.columns = ["feature","coef_"]

# Show the 10 largest (most positive) coefficients
df_coef.sort_values(["coef_"],ascending=False).head(10)

| | feature | coef_ |
|---|---|---|
| 475 | feat_475 | 0.2956 |
| 48 | feat_048 | 0.141942 |
| 307 | feat_307 | 0.023158 |
| 46 | feat_046 | 0.016922 |
| 378 | feat_378 | 0.007889 |
| 424 | feat_424 | 0.003327 |
| 329 | feat_329 | 0.000256 |
| 334 | feat_334 | 0.0 |
| 331 | feat_331 | 0.0 |
| 332 | feat_332 | 0.0 |


**The number of features has dropped to 7. Let's increase C slightly to see whether the score improves while keeping at least 8 features.**
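The manual search above, nudging C up and down and checking how many coefficients survive, can also be run as a loop. A minimal sketch on synthetic data from `make_classification`, not on Madelon; the grid of C values and all variable names are hypothetical, and the scores it prints say nothing about the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 5 informative features hidden among 50.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=50, n_informative=5, n_redundant=0,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42
)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = []
for C in [0.001, 0.01, 0.03, 0.1, 1.0]:
    # liblinear supports the l1 penalty; smaller C means stronger regularization.
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X_tr, y_tr)
    n_nonzero = int(np.sum(model.coef_[0] != 0))
    results.append((C, n_nonzero, model.score(X_te, y_te)))

for C, n_nonzero, test_score in results:
    print("C={:<6} non-zero={:<3} test={:.3f}".format(C, n_nonzero, test_score))
```

Reading the printed rows from top to bottom shows the trade-off explored by hand in the rounds: very small C removes every feature, while larger C keeps more features and can overfit.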

### Round 6 with C as 0.027
1. Run logistic regression with Lasso and C as 0.027
2. Review Score
3. Review Features

In [12]:
# run logistic regression with lasso and C as 0.027 for EDA purpose
model = linear_model.LogisticRegression(penalty = 'l1', C=0.027)
data_dict = general_model(model, data_dict)
print "Train Score:", data_dict["train_score"], "Test Score:", data_dict["test_score"]

Train Score: 0.649 Test Score: 0.599


**The test score is the best so far at about 0.60. Let's review the features.**

In [13]:
# Pair each feature name with its coefficient from the Round 6 model (C=0.027)
feature_coef = []
lr_round6 = data_dict["processes"][6]
for i,k in enumerate(X.columns):
    feature_coef.append([k, lr_round6.coef_[0][i]])
df_coef=pd.DataFrame(feature_coef)
df_coef.columns = ["feature","coef_"]

# Show the 10 largest (most positive) coefficients
df_coef.sort_values(["coef_"],ascending=False).head(10)

| | feature | coef_ |
|---|---|---|
| 475 | feat_475 | 0.308325 |
| 48 | feat_048 | 0.135008 |
| 307 | feat_307 | 0.036766 |
| 46 | feat_046 | 0.029403 |
| 378 | feat_378 | 0.027127 |
| 424 | feat_424 | 0.016196 |
| 329 | feat_329 | 0.013342 |
| 282 | feat_282 | 0.009831 |
| 116 | feat_116 | 0.003299 |
| 338 | feat_338 | 0.0 |


In [14]:
data_dict["processes"][1:7]

[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 LogisticRegression(C=0.05, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
           verbose=0, warm_start=False),
 LogisticRegression(C=0.03, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
           penalty='l1', random_sta

**We ran 6 models with LogisticRegression. With the `l1` (Lasso) penalty and C = 0.027 we found the highest test score, 0.599, with 9 non-zero coefficients in the listing above, inside the 8-10 feature target. We will use this model and compare it against other approaches, e.g. SelectKBest, KNN and GridSearchCV.**
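To make the comparison across rounds explicit, the scores printed in the six rounds above can be collected in one place. A minimal sketch using only those recorded numbers; the variable names are hypothetical.

```python
# (C, train score, test score) for the six rounds, copied from the outputs above.
rounds = [
    (1.0, 0.919, 0.533),
    (0.1, 0.798, 0.554),
    (0.05, 0.724, 0.582),
    (0.03, 0.657, 0.591),
    (0.025, 0.640, 0.594),
    (0.027, 0.649, 0.599),
]

for C, train, test in rounds:
    # The train/test gap is a rough indicator of overfitting.
    gap = train - test
    print("C={:<6} train={:.3f} test={:.3f} gap={:.3f}".format(C, train, test, gap))

# Pick the round with the highest test score.
best_C, best_train, best_test = max(rounds, key=lambda r: r[2])
print("Best test score: {:.3f} at C={}".format(best_test, best_C))
```

The listing shows the gap shrinking as C decreases, while the test score peaks at C = 0.027.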

## Implementation

Implement the following code pipeline using the functions you write in `lib/project_5.py`.

<img src="assets/identify_features.png" width="600px">