##### Jupyter Notebook, Step 3 - Feature Importance
- Use the results from step 2 to discuss feature importance in the dataset
- Considering these results, develop a strategy for building a final predictive model
- Recommended approaches:
    - Use feature selection to reduce the dataset to a manageable size, then use conventional methods
    - Use dimension reduction to reduce the dataset to a manageable size, then use conventional methods
    - Use an iterative model training method to use the entire dataset

For this section, I will build a grid search pipeline to tune hyperparameters for each of the models listed below. I will run the grid search on each of the 3 feature sets produced by the feature selection methods in notebook 2.

The results will be appended to a list of dictionaries, which I will then transform into a DataFrame for readability. The end result of this notebook should be a final model that I can test on the full Madelon dataset, and potentially on a very large dataset from Josh's page.
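The bookkeeping described above could look roughly like the sketch below. The helper `record_result` and the dictionary keys are hypothetical names chosen for illustration, and the scores are placeholders, not results from this notebook.

```python
import pandas as pd

all_results = []  # one dictionary per (feature set, model) grid search


def record_result(feature_set, model_name, best_params, cv_score, test_score):
    """Append one summary row (hypothetical helper, not part of the notebook)."""
    all_results.append({
        'feature_set': feature_set,
        'model': model_name,
        'best_params': best_params,
        'mean_cv_score': cv_score,
        'test_score': test_score,
    })


# Placeholder values only, to show the shape of the final table
record_result('supports[0]', 'KNeighborsClassifier', {'n_neighbors': 5}, 0.80, 0.81)
record_result('supports[1]', 'KNeighborsClassifier', {'n_neighbors': 3}, 0.85, 0.84)

# Turn the list of dictionaries into a DataFrame, best CV score first
results_df = (pd.DataFrame(all_results)
                .sort_values('mean_cv_score', ascending=False)
                .reset_index(drop=True))
print(results_df)
```

Keeping one flat row per search makes it easy to compare feature sets and models side by side once all searches have run.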

Pipeline to include: Standard Scaler, Model

Models to search through: 
### LogisticRegression

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)

### KNeighborsRegressor / KNeighborsClassifier

n_neighbors: 1 up to an upper bound somewhere between 10 and 100
weights: 'uniform', 'distance'

### DecisionTreeClassifier

params = {
    'max_depth': [1,2,3,4,None],
    'max_features': [2,3,4,5,6,7],
    'max_leaf_nodes': [5,10,15,20,25,30,35,40,None],
    'min_samples_leaf': [1,2,3,4,5,6]
}

### SVC

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

## Steps
1. Load the datasets.
2. Load the feature sets, then split the data with train_test_split.
3. Build the pipeline (StandardScaler, model), define the params grid for that model, and wrap both in GridSearchCV(pipeline, params).
4. Show the results: results = pd.DataFrame(clf.cv_results_), then results.sort_values('mean_test_score', ascending=False).head(1) and clf.best_estimator_.
5. Repeat steps 3-4 for each model.
6. Note the best model and save it.
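Steps 3 and 4 can be sketched as follows. This is a minimal illustration and not the notebook's final code: the data is a synthetic stand-in generated with `make_classification`, and the step names `scaler` and `knn` are arbitrary labels chosen here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one Madelon feature subset (an assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# Step 3: scaler and model share one pipeline, so the scaler is refit
# on the training portion of every cross-validation fold
pipe = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>
params = {'knn__n_neighbors': [1, 3, 5, 7, 9],
          'knn__weights': ['uniform', 'distance']}

clf = GridSearchCV(pipe, params, cv=3)
clf.fit(X_tr, y_tr)

# Step 4: best row of the CV table, the refit pipeline, and the held-out score
results = pd.DataFrame(clf.cv_results_)
top = results.sort_values('mean_test_score', ascending=False).head(1)
print(top[['params', 'mean_test_score']])
print(clf.best_estimator_)
print(clf.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it never sees the validation fold during the search, which is not the case when the data is scaled once before calling `GridSearchCV`.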

In [4]:
import pickle
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.decomposition import PCA
import csv

In [5]:
madelon_file ='madelon_train.csv'
madelon_data = []        

with open(madelon_file) as f:
    readcsv = csv.reader(f, delimiter=' ')
    
    for row in readcsv:
        madelon_data.append(row)
        
madelon_file_target ='madelon_train_targets.csv'
madelon_data_target = []        

with open(madelon_file_target) as f:
    readcsv = csv.reader(f, delimiter=' ')
    
    for row in readcsv:
        madelon_data_target.append(row)
        
# Drop column 500 as in the original cell (not a feature; presumably an empty
# field produced by a trailing delimiter), then cast every feature to int
X = pd.DataFrame(madelon_data).drop([500], axis=1).astype(int)

# The targets file holds a single column of class labels
y = pd.DataFrame(madelon_data_target)[0].astype(int).rename('y')

# Additional estimators for the model search
from sklearn.linear_model import LogisticRegression
from sklearn import tree

In [6]:
with open('supports.pkl', 'rb') as f:
    supports = pickle.load(f)

madelon_uci = pd.read_pickle('m_uci_1.pickle')

In [7]:
supports

[0      28
 1      48
 2      64
 3     105
 4     128
 5     153
 6     241
 7     281
 8     318
 9     336
 10    338
 11    378
 12    433
 13    442
 14    451
 15    453
 16    455
 17    472
 18    475
 19    493
 Name: 0, dtype: int64,
 array([ 32,  34,  40,  47,  48,  70, 105, 128, 193, 235, 282, 378, 380,
        402, 415, 417, 420, 435, 474, 477]),
 array([  1,  32,  34,  40,  43,  47,  51,  55,  70,  73,  75,  80,  83,
         85,  93, 111, 126, 131, 141, 155, 162, 192, 193, 196, 200, 207,
        209, 213, 218, 231, 287, 295, 299, 306, 376, 387, 389, 395, 407,
        415, 417, 418, 420, 424, 430, 435, 441, 452, 461, 463, 473, 476])]

In [67]:
madelon_uci[supports[0]].shape

(440, 20)

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X[supports[0]], y, test_size=0.3, random_state=42)

In [69]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [71]:
params = {
    'n_neighbors': list(range(1,30)), 
    'weights': ['uniform','distance']
}           

In [72]:
knc1 = KNeighborsClassifier(n_neighbors=14, weights='distance')
knc1.fit(X_train_sc, y_train)
knc1.score(X_test_sc, y_test)

0.90500000000000003

In [73]:
X_train_sc.shape

(1400, 20)

In [74]:
knc = KNeighborsClassifier()
grd = GridSearchCV(knc, params)

In [75]:
grd.fit(X_train_sc, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [77]:
results = pd.DataFrame(grd.cv_results_)
results.sort_values('mean_test_score',ascending=False)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_n_neighbors,param_weights,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
11,0.001727,0.009516,0.876429,1.0,6,distance,"{'n_neighbors': 6, 'weights': 'distance'}",1,0.867238,1.0,0.873662,1.0,0.888412,1.0,3.9e-05,3e-05,0.008862,0.0
9,0.001749,0.009015,0.875,1.0,5,distance,"{'n_neighbors': 5, 'weights': 'distance'}",2,0.860814,1.0,0.875803,1.0,0.888412,1.0,3e-05,5.6e-05,0.011279,0.0
13,0.001882,0.009818,0.875,1.0,7,distance,"{'n_neighbors': 7, 'weights': 'distance'}",2,0.860814,1.0,0.882227,1.0,0.881974,1.0,2.9e-05,0.000152,0.010037,0.0
17,0.001783,0.010535,0.874286,1.0,9,distance,"{'n_neighbors': 9, 'weights': 'distance'}",4,0.856531,1.0,0.888651,1.0,0.877682,1.0,3.8e-05,0.000125,0.013335,0.0
15,0.001767,0.010442,0.872857,1.0,8,distance,"{'n_neighbors': 8, 'weights': 'distance'}",5,0.860814,1.0,0.877944,1.0,0.879828,1.0,3.5e-05,0.000291,0.008555,0.0
19,0.001742,0.011065,0.872857,1.0,10,distance,"{'n_neighbors': 10, 'weights': 'distance'}",5,0.856531,1.0,0.88651,1.0,0.875536,1.0,3.9e-05,0.000191,0.012388,0.0
12,0.00179,0.009847,0.872857,0.914286,7,uniform,"{'n_neighbors': 7, 'weights': 'uniform'}",5,0.862955,0.915327,0.875803,0.915327,0.879828,0.912206,6.6e-05,0.000156,0.007196,0.001471
7,0.001706,0.008759,0.872143,1.0,4,distance,"{'n_neighbors': 4, 'weights': 'distance'}",8,0.865096,1.0,0.869379,1.0,0.881974,1.0,4.2e-05,0.000438,0.007161,0.0
8,0.001744,0.009015,0.872143,0.921427,5,uniform,"{'n_neighbors': 5, 'weights': 'uniform'}",8,0.862955,0.916399,0.867238,0.920686,0.886266,0.927195,3e-05,0.000117,0.010128,0.004439
21,0.001757,0.011335,0.872143,1.0,11,distance,"{'n_neighbors': 11, 'weights': 'distance'}",8,0.862955,1.0,0.884368,1.0,0.869099,1.0,4.6e-05,0.000138,0.009006,0.0


In [89]:
grd.score(X_test_sc, y_test)

0.89000000000000001

In [79]:
# TODO: build a for loop that runs GridSearchCV over several pipelines. Each pipeline is
# determined by its entry in the params grid, so the loop also needs a list of the models to try.

# [PCA][]
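One way the loop described in these comments might be written is sketched below. Everything in it is an assumption made for illustration: the data is synthetic, `feature_sets` stands in for the `supports` list loaded earlier, and the parameter grids are deliberately much smaller than the ones planned at the top of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins (assumptions): the real X, y and supports come from earlier cells
X_arr, y_demo = make_classification(n_samples=200, n_features=30,
                                    n_informative=5, random_state=0)
X_demo = pd.DataFrame(X_arr)
feature_sets = {'set_a': list(range(0, 10)), 'set_b': list(range(5, 20))}

# One (estimator, grid) pair per model; grid keys use the 'model' step name
searches = {
    'logreg': (LogisticRegression(max_iter=1000), {'model__C': [0.1, 1, 10]}),
    'knn': (KNeighborsClassifier(), {'model__n_neighbors': [3, 5, 7],
                                     'model__weights': ['uniform', 'distance']}),
    'tree': (DecisionTreeClassifier(random_state=0), {'model__max_depth': [2, 4, None]}),
    'svc': (SVC(), {'model__C': [1, 10], 'model__kernel': ['linear', 'rbf']}),
}

fitted = {}  # (feature set name, model name) -> fitted GridSearchCV
for set_name, cols in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo[cols], y_demo, test_size=0.3, random_state=42)
    for model_name, (estimator, grid) in searches.items():
        # Same two-step pipeline every time; only the final estimator changes
        pipe = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
        search = GridSearchCV(pipe, grid, cv=3)
        search.fit(X_tr, y_tr)
        fitted[(set_name, model_name)] = search
        print(set_name, model_name, search.best_params_, round(search.best_score_, 3))

# Pick the combination with the highest mean cross-validated score
best_key = max(fitted, key=lambda k: fitted[k].best_score_)
best_model = fitted[best_key].best_estimator_
print('best:', best_key)
```

A PCA step could be added to the pipeline between the scaler and the model in the same way, with its `n_components` included in the grid.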

In [85]:
pca = PCA(n_components=5)

In [86]:
pca.fit(X_train_sc)

PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [87]:
pca.explained_variance_

array([ 6.15175447,  4.74905286,  4.05084292,  2.79973138,  2.09100074])

In [95]:
X_train_pca = pca.transform(X_train_sc)
X_test_pca = pca.transform(X_test_sc)

In [107]:
knc2 = KNeighborsClassifier(n_neighbors=7, weights='distance')
knc2.fit(X_train_pca, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='distance')

In [108]:
knc2.score(X_train_pca, y_train)

1.0

In [109]:
knc2.score(X_test_pca, y_test)

0.89166666666666672

In [103]:
grd.fit(X_train_pca, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [104]:
results_pca = pd.DataFrame(grd.cv_results_)

In [105]:
results_pca.sort_values('mean_test_score',ascending=False)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_n_neighbors,param_weights,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
13,0.00144,0.003871,0.882857,1.0,7,distance,"{'n_neighbors': 7, 'weights': 'distance'}",1,0.862955,1.0,0.882227,1.0,0.903433,1.0,1e-05,0.000129,0.016528,0.0
12,0.001441,0.003592,0.881429,0.913929,7,uniform,"{'n_neighbors': 7, 'weights': 'uniform'}",2,0.865096,0.912111,0.880086,0.916399,0.899142,0.913276,1.7e-05,3.1e-05,0.013929,0.00181
9,0.00144,0.003348,0.88,1.0,5,distance,"{'n_neighbors': 5, 'weights': 'distance'}",3,0.867238,1.0,0.875803,1.0,0.896996,1.0,1e-05,2.6e-05,0.012504,0.0
15,0.001427,0.003917,0.879286,1.0,8,distance,"{'n_neighbors': 8, 'weights': 'distance'}",4,0.865096,1.0,0.884368,1.0,0.888412,1.0,1.7e-05,7.6e-05,0.010174,0.0
11,0.001441,0.003578,0.878571,1.0,6,distance,"{'n_neighbors': 6, 'weights': 'distance'}",5,0.865096,1.0,0.873662,1.0,0.896996,1.0,4.6e-05,2.4e-05,0.013476,0.0
21,0.001387,0.004458,0.876429,1.0,11,distance,"{'n_neighbors': 11, 'weights': 'distance'}",6,0.869379,1.0,0.884368,1.0,0.875536,1.0,2.3e-05,0.000185,0.006154,0.0
8,0.001465,0.003217,0.875714,0.919284,5,uniform,"{'n_neighbors': 5, 'weights': 'uniform'}",7,0.869379,0.913183,0.867238,0.920686,0.890558,0.923983,2.8e-05,5e-05,0.010521,0.004519
7,0.001462,0.003168,0.875,1.0,4,distance,"{'n_neighbors': 4, 'weights': 'distance'}",8,0.867238,1.0,0.867238,1.0,0.890558,1.0,1.8e-05,6e-05,0.010989,0.0
19,0.001437,0.004263,0.875,1.0,10,distance,"{'n_neighbors': 10, 'weights': 'distance'}",8,0.856531,1.0,0.88651,1.0,0.881974,1.0,3.2e-05,0.000101,0.013197,0.0
17,0.001432,0.004068,0.874286,1.0,9,distance,"{'n_neighbors': 9, 'weights': 'distance'}",10,0.862955,1.0,0.884368,1.0,0.875536,1.0,1e-05,1.8e-05,0.00879,0.0


In [106]:
grd.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='distance')

In [110]:
grd.fit(X[supports[0]], y)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [111]:
results = pd.DataFrame(grd.cv_results_)
results.sort_values('mean_test_score',ascending=False)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_n_neighbors,param_weights,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
15,0.003376,0.014355,0.88,1.0,8,distance,"{'n_neighbors': 8, 'weights': 'distance'}",1,0.88024,1.0,0.89039,1.0,0.869369,1.0,8.8e-05,0.000464,0.008579,0.0
7,0.003475,0.012014,0.8795,1.0,4,distance,"{'n_neighbors': 4, 'weights': 'distance'}",2,0.886228,1.0,0.885886,1.0,0.866366,1.0,0.000133,0.000329,0.009281,0.0
19,0.003391,0.015503,0.879,1.0,10,distance,"{'n_neighbors': 10, 'weights': 'distance'}",3,0.881737,1.0,0.888889,1.0,0.866366,1.0,5e-05,0.000357,0.009392,0.0
23,0.003417,0.016331,0.877,1.0,12,distance,"{'n_neighbors': 12, 'weights': 'distance'}",4,0.878743,1.0,0.888889,1.0,0.863363,1.0,5.5e-05,0.000509,0.010488,0.0
11,0.003547,0.013336,0.877,1.0,6,distance,"{'n_neighbors': 6, 'weights': 'distance'}",4,0.877246,1.0,0.887387,1.0,0.866366,1.0,0.000125,0.000524,0.008579,0.0
13,0.003444,0.014201,0.877,1.0,7,distance,"{'n_neighbors': 7, 'weights': 'distance'}",4,0.88024,1.0,0.885886,1.0,0.864865,1.0,7.9e-05,0.000685,0.008879,0.0
4,0.003374,0.011242,0.8745,0.930248,3,uniform,"{'n_neighbors': 3, 'weights': 'uniform'}",7,0.88024,0.927177,0.882883,0.930285,0.86036,0.933283,3.4e-05,0.000423,0.010049,0.002493
27,0.003487,0.017355,0.874,1.0,14,distance,"{'n_neighbors': 14, 'weights': 'distance'}",8,0.877246,1.0,0.89039,1.0,0.854354,1.0,0.000123,0.000363,0.014883,0.0
9,0.003564,0.012762,0.874,1.0,5,distance,"{'n_neighbors': 5, 'weights': 'distance'}",8,0.884731,1.0,0.87988,1.0,0.857357,1.0,0.000132,0.00035,0.011925,0.0
14,0.003355,0.014455,0.874,0.903999,8,uniform,"{'n_neighbors': 8, 'weights': 'uniform'}",8,0.86976,0.901652,0.89039,0.9003,0.861862,0.910045,2e-05,0.000392,0.012022,0.004311
