##### Jupyter Notebook, Step 3 - Feature Importance
- Use the results from step 2 to discuss feature importance in the dataset
- Considering these results, develop a strategy for building a final predictive model
- recommended approaches:
    - Use feature selection to reduce the dataset to a manageable size then use conventional methods
    - Use dimension reduction to reduce the dataset to a manageable size then use conventional methods
    - Use an iterative model training method to use the entire dataset

For this section, I will build a gridsearch pipeline to tune hyperparameters on the five models I have chosen. I will perform this gridsearch using the results from the 3 different feature selection methods used in notebook 2. 

The results will be appended to a list of dictionaries, which I will then transform into a dataframe for readability. The top result of this notebook should be a final model that I can test on the full Madelon dataset, and potentially on a very large dataset from Josh's page.
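
A minimal sketch of that bookkeeping, assuming hypothetical record fields (`model`, `feature_set`, `test_score`) and placeholder scores that do not come from this notebook: each run is stored as one dictionary, and the list is converted to a dataframe at the end.

```python
import pandas as pd

# Hypothetical records: one dictionary per (model, feature set) run.
# Field names and scores are placeholders for illustration only.
run_records = []
run_records.append({'model': 'KNeighborsClassifier', 'feature_set': 0, 'test_score': 0.5})
run_records.append({'model': 'SVC', 'feature_set': 0, 'test_score': 0.75})
run_records.append({'model': 'SVC', 'feature_set': 1, 'test_score': 0.25})

# One row per run, sorted so the strongest run is on top.
summary = pd.DataFrame(run_records).sort_values('test_score', ascending=False)
best_run = summary.iloc[0]
print(summary)
```

Keeping the raw records as dictionaries makes it cheap to add a new field later (for example the fitted estimator) without changing the rest of the loop.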

Pipeline to include: Standard Scaler, Model

Models to search through: 
### LogisticRegression

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)

### KNeighborsClassifier

n_neighbors [1 through some number 10-100]
weights: 'uniform', 'distance'

### DecisionTreeClassifier

params = {
    'max_depth': [1,2,3,4,None],
    'max_features': [2,3,4,5,6,7],
    'max_leaf_nodes': [5,10,15,20,25,30,35,40,None],
    'min_samples_leaf': [1,2,3,4,5,6]
}

### SVC

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
 
### Naive Bayes

### fit voting classifier with Logistic Regression, SVC, KNC, Naive Bayes, Decision Tree Classifier

## Steps
1. Load the datasets
2. Load the feature sets 
2a. train_test_split
3. make the pipeline (standardscaler, model), params = {' ': ,} , and gridsearchcv(model, params)
4. show results (results = pd.DataFrame(clf.cv_results_), results.sort_values('mean_test_score', ascending=False, axis=0).head(1), .best_estimator_) 
5. repeat 3-4 for all 4 models
6. Note best model and save
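
Steps 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in from `make_classification` rather than the Madelon files, and the names `X_demo`, `pipe`, `grid`, and `cv_table` are hypothetical. Putting the scaler inside the `Pipeline` means it is refit on the training part of every cross-validation fold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for one feature subset (assumption, not the real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# Step 3: scaler and model in one pipeline, searched together.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', KNeighborsClassifier())])
param_grid = {'model__n_neighbors': [3, 5, 7],
              'model__weights': ['uniform', 'distance']}
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_tr, y_tr)

# Step 4: tabulate the search and keep the top row.
cv_table = pd.DataFrame(grid.cv_results_)
top_row = cv_table.sort_values('mean_test_score', ascending=False).head(1)
test_accuracy = grid.score(X_te, y_te)
print(top_row[['params', 'mean_test_score']])
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid keys carry the `model__` prefix.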

In [95]:
import pickle
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.decomposition import PCA
import csv
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import dbscan
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier

In [2]:
madelon_file ='madelon_train.csv'
madelon_data = []        

with open(madelon_file) as f:
    readcsv = csv.reader(f, delimiter=' ')
    
    for row in readcsv:
        madelon_data.append(row)
        
madelon_file_target ='madelon_train_targets.csv'
madelon_data_target = []        

with open(madelon_file_target) as f:
    readcsv = csv.reader(f, delimiter=' ')
    
    for row in readcsv:
        madelon_data_target.append(row)
        
madelon1 = madelon_data

madelon_data_df = pd.DataFrame(madelon1)
madelon_targets_df = pd.DataFrame(madelon_data_target)

X = madelon_data_df
y = madelon_targets_df
X['y'] = y

X = X.drop([500],axis=1)
X['y'] = X['y'].map(int)
for column in X.columns:
    X[column] = X[column].map(int)

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

y = X['y']
X = X.drop(['y'], axis=1)

In [3]:
with open('supports.pkl', 'rb') as f:
    supports = pickle.load(f)

madelon_uci = pd.read_pickle('m_uci_1.pickle')

In [4]:
supports

[0      28
 1      48
 2      64
 3     105
 4     128
 5     153
 6     241
 7     281
 8     318
 9     336
 10    338
 11    378
 12    433
 13    442
 14    451
 15    453
 16    455
 17    472
 18    475
 19    493
 Name: 0, dtype: int64,
 array([ 32,  34,  40,  47,  48,  70, 105, 128, 193, 235, 282, 378, 380,
        402, 415, 417, 420, 435, 474, 477]),
 array([  1,  32,  34,  40,  43,  47,  51,  55,  70,  73,  75,  80,  83,
         85,  93, 111, 126, 131, 141, 155, 162, 192, 193, 196, 200, 207,
        209, 213, 218, 231, 287, 295, 299, 306, 376, 387, 389, 395, 407,
        415, 417, 418, 420, 424, 430, 435, 441, 452, 461, 463, 473, 476])]

In [5]:
madelon_uci[supports[0]].shape

(440, 20)

In [64]:
results = pd.DataFrame()

In [65]:
results

### Set train and test based on unsupervised learning stack

In [61]:
X_train, X_test, y_train, y_test = train_test_split(X[supports[0]], y, test_size=0.3, random_state=42)

In [62]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

### Reduce to 5 principal components with PCA and rebuild as a dataframe

In [63]:
pca = PCA(n_components=5)
pca.fit(X_train_sc)

X_train_pca = pd.DataFrame(pca.transform(X_train_sc))
X_test_pca = pd.DataFrame(pca.transform(X_test_sc))
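
One way to sanity-check the choice of five components is to look at how much variance they retain. The sketch below runs on synthetic data from `make_classification` (an assumption, not the Madelon features), and the names `X_demo_sc` and `pca_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic stand-in for the 20 selected features (assumption).
X_demo, _ = make_classification(n_samples=200, n_features=20,
                                n_informative=5, random_state=42)
X_demo_sc = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=5).fit(X_demo_sc)

# Share of total variance captured by each component, and the running total.
ratios = pca_demo.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(cumulative)
```

If the running total is low, it may be worth comparing a few values of `n_components` inside the grid search rather than fixing it at 5.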

### Gridsearch best K Neighbors Classifier

In [85]:
params = {
    'n_neighbors': list(range(1,30)), 
    'weights': ['uniform','distance']
}   

knc = KNeighborsClassifier()
knc_grd = GridSearchCV(knc, params)
knc_grd.fit(X_train_pca, y_train)
results = pd.DataFrame(knc_grd.cv_results_)
results['model'] = 'KNeighborsClassifier'
results.sort_values('mean_test_score',ascending=False).head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_n_neighbors,param_weights,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score,model
13,0.002564,0.003771,0.882857,1.0,7,distance,"{'n_neighbors': 7, 'weights': 'distance'}",1,0.862955,1.0,0.882227,1.0,0.903433,1.0,4.7e-05,5.9e-05,0.016528,0.0,KNeighborsClassifier
12,0.002533,0.003759,0.881429,0.913929,7,uniform,"{'n_neighbors': 7, 'weights': 'uniform'}",2,0.865096,0.912111,0.880086,0.916399,0.899142,0.913276,2.8e-05,5.4e-05,0.013929,0.00181,KNeighborsClassifier
9,0.002439,0.003357,0.88,1.0,5,distance,"{'n_neighbors': 5, 'weights': 'distance'}",3,0.867238,1.0,0.875803,1.0,0.896996,1.0,1.5e-05,3.7e-05,0.012504,0.0,KNeighborsClassifier
15,0.00239,0.003889,0.879286,1.0,8,distance,"{'n_neighbors': 8, 'weights': 'distance'}",4,0.865096,1.0,0.884368,1.0,0.888412,1.0,2e-05,2e-05,0.010174,0.0,KNeighborsClassifier
11,0.00246,0.003534,0.878571,1.0,6,distance,"{'n_neighbors': 6, 'weights': 'distance'}",5,0.865096,1.0,0.873662,1.0,0.896996,1.0,7.5e-05,3.8e-05,0.013476,0.0,KNeighborsClassifier


In [115]:
estimators = []
estimators.append(knc_grd.best_estimator_)
knc_grd.best_estimator_

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='distance')

In [87]:
knc_grd.score(X_test_pca, y_test)

0.89166666666666672

### Gridsearch best Logistic Regression

In [89]:
lg_params = {
    'penalty': ['l1', 'l2'],
    'C': np.logspace(-2,5,10)
}   

lgr = LogisticRegression()
lgr_grd = GridSearchCV(lgr, lg_params)
lgr_grd.fit(X_train_pca, y_train)
log_results = pd.DataFrame(lgr_grd.cv_results_)
log_results['model'] = 'Logistic Regression'
results = pd.concat([log_results, results])
results.sort_values('mean_test_score',ascending=False).head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,model,param_C,param_n_neighbors,param_penalty,param_weights,params,...,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
13,0.002564,0.003771,0.882857,1.0,KNeighborsClassifier,,7,,distance,"{'n_neighbors': 7, 'weights': 'distance'}",...,0.862955,1.0,0.882227,1.0,0.903433,1.0,4.7e-05,5.9e-05,0.016528,0.0
12,0.002533,0.003759,0.881429,0.913929,KNeighborsClassifier,,7,,uniform,"{'n_neighbors': 7, 'weights': 'uniform'}",...,0.865096,0.912111,0.880086,0.916399,0.899142,0.913276,2.8e-05,5.4e-05,0.013929,0.00181
9,0.002439,0.003357,0.88,1.0,KNeighborsClassifier,,5,,distance,"{'n_neighbors': 5, 'weights': 'distance'}",...,0.867238,1.0,0.875803,1.0,0.896996,1.0,1.5e-05,3.7e-05,0.012504,0.0
15,0.00239,0.003889,0.879286,1.0,KNeighborsClassifier,,8,,distance,"{'n_neighbors': 8, 'weights': 'distance'}",...,0.865096,1.0,0.884368,1.0,0.888412,1.0,2e-05,2e-05,0.010174,0.0
11,0.00246,0.003534,0.878571,1.0,KNeighborsClassifier,,6,,distance,"{'n_neighbors': 6, 'weights': 'distance'}",...,0.865096,1.0,0.873662,1.0,0.896996,1.0,7.5e-05,3.8e-05,0.013476,0.0


In [90]:
#estimators.append(lgr_grd.best_estimator_)
lgr_grd.best_estimator_

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [91]:
lgr_grd.score(X_test_pca, y_test)

0.58666666666666667

### Gridsearch best SVC

In [92]:
svc_params = {
    'C': np.logspace(-3,7,20),
}   

svc = SVC()
svc_grd = GridSearchCV(svc, svc_params)
svc_grd.fit(X_train_pca, y_train)
svc_results = pd.DataFrame(svc_grd.cv_results_)
svc_results['model'] = 'SVC'
results = pd.concat([svc_results, results])
results.sort_values('mean_test_score',ascending=False).head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,model,param_C,param_n_neighbors,param_penalty,param_weights,params,...,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
13,0.002564,0.003771,0.882857,1.0,KNeighborsClassifier,,7.0,,distance,"{'n_neighbors': 7, 'weights': 'distance'}",...,0.862955,1.0,0.882227,1.0,0.903433,1.0,4.7e-05,5.9e-05,0.016528,0.0
12,0.002533,0.003759,0.881429,0.913929,KNeighborsClassifier,,7.0,,uniform,"{'n_neighbors': 7, 'weights': 'uniform'}",...,0.865096,0.912111,0.880086,0.916399,0.899142,0.913276,2.8e-05,5.4e-05,0.013929,0.00181
9,0.002439,0.003357,0.88,1.0,KNeighborsClassifier,,5.0,,distance,"{'n_neighbors': 5, 'weights': 'distance'}",...,0.867238,1.0,0.875803,1.0,0.896996,1.0,1.5e-05,3.7e-05,0.012504,0.0
6,0.029237,0.008094,0.88,0.95107,SVC,1.43845,,,,{'C': 1.43844988829},...,0.865096,0.950697,0.882227,0.947481,0.892704,0.955032,0.000461,0.000165,0.011378,0.003094
15,0.00239,0.003889,0.879286,1.0,KNeighborsClassifier,,8.0,,distance,"{'n_neighbors': 8, 'weights': 'distance'}",...,0.865096,1.0,0.884368,1.0,0.888412,1.0,2e-05,2e-05,0.010174,0.0


In [116]:
estimators.append(svc_grd.best_estimator_)
svc_grd.best_estimator_

SVC(C=1.4384498882876631, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [94]:
svc_grd.score(X_test_pca, y_test)

0.90833333333333333

### Gridsearch best Naive Bayes

In [97]:
# nb_params = {
#     'penalty': ['l1', 'l2'],
#     'C': np.logspace(-2,5,10)
# }   

nb = GaussianNB()
nb.fit(X_train_pca, y_train)

nb.score(X_train_pca, y_train)
# nb_grd = GridSearchCV(nb_grd, nb_params)
# nb_grd.fit(X_train_pca, y_train)
# nb_results = pd.DataFrame(nb_grd.cv_results_)
# nb_results['model'] = 'Naive Bayes'
# results = pd.concat([log_results, results])
# results.sort_values('mean_test_score',ascending=False).head()
nb.score(X_test_pca, y_test)

0.58999999999999997

### Gridsearch best Decision Tree Classifier

In [100]:
dtc = DecisionTreeClassifier() 
dtc_params = { 
    'max_depth': [1,2,3,4,None], 
    'max_features': [2,3,4,5], 
    'max_leaf_nodes': [5,10,15,20,25,30,35,40,None], 
    'min_samples_leaf': [1,2,3,4,5,6] }

dtc_grd = GridSearchCV(dtc, dtc_params)
dtc_grd.fit(X_train_pca, y_train)
dtc_results = pd.DataFrame(dtc_grd.cv_results_)
dtc_results['model'] = 'Decision Tree Classifier'
results = pd.concat([dtc_results, results])
results.sort_values('mean_test_score',ascending=False).head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,model,param_C,param_max_depth,param_max_features,param_max_leaf_nodes,param_min_samples_leaf,...,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
13,0.002564,0.003771,0.882857,1.0,KNeighborsClassifier,,,,,,...,0.862955,1.0,0.882227,1.0,0.903433,1.0,4.7e-05,5.9e-05,0.016528,0.0
12,0.002533,0.003759,0.881429,0.913929,KNeighborsClassifier,,,,,,...,0.865096,0.912111,0.880086,0.916399,0.899142,0.913276,2.8e-05,5.4e-05,0.013929,0.00181
9,0.002439,0.003357,0.88,1.0,KNeighborsClassifier,,,,,,...,0.867238,1.0,0.875803,1.0,0.896996,1.0,1.5e-05,3.7e-05,0.012504,0.0
6,0.029237,0.008094,0.88,0.95107,SVC,1.43845,,,,,...,0.865096,0.950697,0.882227,0.947481,0.892704,0.955032,0.000461,0.000165,0.011378,0.003094
15,0.00239,0.003889,0.879286,1.0,KNeighborsClassifier,,,,,,...,0.865096,1.0,0.884368,1.0,0.888412,1.0,2e-05,2e-05,0.010174,0.0


In [101]:
#estimators.append(dtc_grd.best_estimator_)
dtc_grd.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=5, max_leaf_nodes=40, min_impurity_split=1e-07,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [103]:
dtc_grd.score(X_test_pca, y_test)

0.78666666666666663

### Implement a voting classifier on the tuned models

In [117]:
# estimators holds only the tuned KNC and SVC, so label them accordingly
estimators_list = list(zip(['KNC','SVC'], estimators))

In [118]:
estimators_list

[('KNC',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
             metric_params=None, n_jobs=1, n_neighbors=7, p=2,
             weights='distance')),
 ('SVC', SVC(C=1.4384498882876631, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]

In [119]:
vc = VotingClassifier(estimators=estimators_list)
vc.fit(X_train_pca, y_train)
vc.score(X_test_pca, y_test)

0.89333333333333331

Unfortunately, despite implementing more models and combining them in a voting classifier, I still do not achieve results that are superior to a simple KNC or SVC alone (0.86 accuracy).

Dropping the Logistic Regression and the Decision Tree from the voting classifier, I now achieve 0.89 accuracy, which is okay.
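
To make the voting step concrete: by default `VotingClassifier` uses hard voting, meaning each member predicts a label and the majority wins. The sketch below uses three members on synthetic data so the majority is always well defined, then rebuilds the vote by hand. The data, the third member (`GaussianNB`), and the hyperparameters are assumptions for illustration, not the tuned models from above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Synthetic binary problem with labels 0/1 (assumption, not Madelon).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

members = [('KNC', KNeighborsClassifier(n_neighbors=7, weights='distance')),
           ('SVC', SVC(C=1.0)),
           ('NB', GaussianNB())]
vote = VotingClassifier(estimators=members, voting='hard')
vote.fit(X_tr, y_tr)

# Rebuild the majority vote from the fitted members: class 1 wins with 2+ votes.
member_preds = np.array([est.predict(X_te) for est in vote.estimators_])
manual_majority = (member_preds.sum(axis=0) >= 2).astype(int)

ensemble_preds = vote.predict(X_te)
ensemble_accuracy = vote.score(X_te, y_te)
```

With only two members, as in the ensemble above, every disagreement is a tie, and scikit-learn resolves ties by taking the class that sorts first. That may explain why a two-model vote tracks its members so closely.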

Next I may try implementing bagging, random forest, extra trees, and xgboost to see whether they can help at all.

### Add Bagging, Random Forest, Extra trees

In [131]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=14, weights='distance'), max_samples=0.5, max_features=0.5)
bag.fit(X_train_pca, y_train)
bag.score(X_test_pca, y_test)

0.745

In [132]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10)
rfc.fit(X_train_pca, y_train)
rfc.score(X_test_pca, y_test)

0.82166666666666666

In [133]:
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
etc.fit(X_train_pca, y_train)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
           verbose=0, warm_start=False)

In [134]:
etc.score(X_test_pca, y_test)

0.83833333333333337

### Add Decision Tree Classifier to data

In [51]:
dtc = DecisionTreeClassifier()


In [52]:
dtc.fit(X_train_synth_1, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [53]:
dtc.score(X_train_synth_1, y_train)

1.0

In [54]:
dtc.score(X_test_synth_1, y_test)

0.90500000000000003

### build a DTC prior to KNC predictions

In [55]:
dtc = DecisionTreeClassifier() 
params = { 
    'max_depth': [1,2,3,4,None], 
    'max_features': [2,3,4,5,6,7], 
    'max_leaf_nodes': [5,10,15,20,25,30,35,40,None], 
    'min_samples_leaf': [1,2,3,4,5,6] }

In [56]:
grd_dtc = GridSearchCV(dtc, params)

In [57]:
grd_dtc.fit(X_train_sc, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'max_depth': [1, 2, 3, 4, None], 'max_features': [2, 3, 4, 5, 6, 7], 'max_leaf_nodes': [5, 10, 15, 20, 25, 30, 35, 40, None], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [58]:
grd_dtc_results = pd.DataFrame(grd_dtc.cv_results_)
grd_dtc_results.sort_values('mean_test_score',ascending=False)

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_max_depth,param_max_features,param_max_leaf_nodes,param_min_samples_leaf,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
1564,0.004157,0.000420,0.800714,0.905355,,6,,5,"{'max_depth': None, 'max_features': 6, 'max_le...",1,0.798715,0.900322,0.783726,0.903537,0.819742,0.912206,0.000246,0.000009,0.014769,0.005019
1561,0.004646,0.000424,0.799286,0.962858,,6,,2,"{'max_depth': None, 'max_features': 6, 'max_le...",2,0.788009,0.969989,0.800857,0.958199,0.809013,0.960385,0.000156,0.000005,0.008645,0.005121
1612,0.004298,0.000386,0.798571,0.894287,,7,40,5,"{'max_depth': None, 'max_features': 7, 'max_le...",3,0.792291,0.897106,0.800857,0.894962,0.802575,0.890792,0.000166,0.000011,0.004498,0.002621
1616,0.005067,0.000412,0.794286,0.941786,,7,,3,"{'max_depth': None, 'max_features': 7, 'max_le...",4,0.807281,0.945338,0.788009,0.939979,0.787554,0.940043,0.000151,0.000011,0.009196,0.002511
1609,0.004556,0.000394,0.788571,0.902139,,7,40,2,"{'max_depth': None, 'max_features': 7, 'max_le...",5,0.753747,0.878885,0.794433,0.914255,0.817597,0.913276,0.000262,0.000020,0.026389,0.016448
1542,0.003691,0.000408,0.787857,0.870003,,6,30,1,"{'max_depth': None, 'max_features': 6, 'max_le...",6,0.768737,0.856377,0.813704,0.891747,0.781116,0.861884,0.000164,0.000030,0.018972,0.015539
1615,0.004958,0.000412,0.786429,0.960358,,7,,2,"{'max_depth': None, 'max_features': 7, 'max_le...",7,0.779443,0.964630,0.796574,0.959271,0.783262,0.957173,0.000144,0.000026,0.007345,0.003140
1541,0.003618,0.000391,0.786429,0.859286,,6,25,6,"{'max_depth': None, 'max_features': 6, 'max_le...",7,0.766595,0.844587,0.805139,0.875670,0.787554,0.857602,0.000151,0.000023,0.015761,0.012745
1610,0.004239,0.000376,0.785000,0.891795,,7,40,3,"{'max_depth': None, 'max_features': 7, 'max_le...",9,0.807281,0.901393,0.760171,0.907824,0.787554,0.866167,0.000100,0.000018,0.019323,0.018311
1501,0.003617,0.000370,0.785000,0.892150,,5,40,2,"{'max_depth': None, 'max_features': 5, 'max_le...",9,0.781585,0.897106,0.792291,0.907824,0.781116,0.871520,0.000104,0.000023,0.005162,0.015230


In [59]:
grd_dtc.score(X_test_sc, y_test)

0.77666666666666662

### PCA to 5 features and Polynomial features with n^2

In [60]:
poly_pipe = 

SyntaxError: invalid syntax (<ipython-input-60-a11999593681>, line 1)
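
The cell above was left unfinished. One possible shape for it, offered only as a sketch: scale, reduce to 5 principal components, expand to degree-2 polynomial features, then classify. The synthetic data, the choice of `KNeighborsClassifier` as the final step, and the names `feature_steps` and `poly_pipe_sketch` are assumptions, not the original intent.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the selected features (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=5, random_state=42)

# Scale -> PCA to 5 components -> degree-2 polynomial expansion.
feature_steps = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=5)),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])
# 5 linear + 5 squared + 10 pairwise terms = 20 columns.
expanded = feature_steps.fit_transform(X_demo)

# Full pipeline with a classifier on top (the model choice is an assumption).
poly_pipe_sketch = Pipeline([('features', feature_steps),
                             ('model', KNeighborsClassifier(n_neighbors=7))])
poly_pipe_sketch.fit(X_demo, y_demo)
train_accuracy = poly_pipe_sketch.score(X_demo, y_demo)
```

Because everything lives in one pipeline, it can be passed straight to `GridSearchCV` like any single estimator.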

### DBSCAN

In [None]:
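# dbscan is unsupervised: it takes no targets (its second argument is eps)
# and returns the core-sample indices and the cluster label of each row
core_samples, dbs_labels = dbscan(X_train_sc)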

In [None]:
# modify skfold for loop to do a few things
# 1) should ss, pca, df new data, skfold and store folds
# set up params and models list
# this looks like list of tuples as [(model, {params dict}),  ]
# for loop over the list of model, params
# gridsearchcv(model, params)
# store the results by concating a df
# store the best estimator settings
# add the predict column to the copy dictionary
# (possibly fit the original data / copy data both to provide individual scores and ensemble scores)
# return the final results into df and get the best score.. ? 
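
The plan in the comments above could look roughly like this. It is a sketch under assumptions: the data is synthetic, the grids are deliberately small, and the names `model_grid_pairs`, `frames`, and `best_estimators` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the PCA-reduced training data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=42)
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# List of (name, model, params dict) tuples, as described in the plan.
model_grid_pairs = [
    ('KNC', KNeighborsClassifier(), {'n_neighbors': [3, 7],
                                     'weights': ['uniform', 'distance']}),
    ('DTC', DecisionTreeClassifier(random_state=0), {'max_depth': [2, 4, None]}),
    ('SVC', SVC(), {'C': [0.1, 1, 10]}),
]

frames, best_estimators = [], {}
for name, model, grid_params in model_grid_pairs:
    search = GridSearchCV(model, grid_params, cv=folds)
    search.fit(X_demo, y_demo)
    frame = pd.DataFrame(search.cv_results_)
    frame['model'] = name                    # tag rows with the model name
    frames.append(frame)
    best_estimators[name] = search.best_estimator_

# Stack every search into one table and pull out the overall winner.
all_results = pd.concat(frames, ignore_index=True, sort=False)
best_row = all_results.sort_values('mean_test_score', ascending=False).iloc[0]
print(best_row['model'], best_row['params'])
```

Sharing one `StratifiedKFold` object across the searches keeps the folds identical, so the mean test scores of different models are directly comparable.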

In [None]:
pca = PCA(n_components=5)
pca.fit(X_train_sc)

X_train_pca = pd.DataFrame(pca.transform(X_train_sc))
X_test_pca = pd.DataFrame(pca.transform(X_test_sc))

In [None]:
skfold = StratifiedKFold(n_splits=3, random_state=42)
X_train_sk = []
X_test_sk = []
y_train_sk = []
y_test_sk = []
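for train, val in skfold.split(X_train, y_train):
    print ("train :", train, "val :", val)
    # store this fold's rows (positional indices), not the whole training set
    X_train_sk.append(X_train.iloc[train])
    X_test_sk.append(X_train.iloc[val])
    y_train_sk.append(y_train.iloc[train])
    y_test_sk.append(y_train.iloc[val])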

In [None]:
skfold = StratifiedKFold(n_splits=3, random_state=42)

# 1) already have X_train_sc, X_test_sc, y_train, y_test

# 2) apply pca for 5 and put back into a df 
pca = PCA(n_components=5)
pca.fit(X_train_sc)

X_train_pca = pd.DataFrame(pca.transform(X_train_sc))
X_test_pca = pd.DataFrame(pca.transform(X_test_sc))

# 3) skfold and store the train and val indices
tr_indice, test_indice = [], []

#store the train and test indices generated by stratified k fold
# for train_indices, val_indices in skfold.split(X_train_pca, y_train):
#     tr_indice.append(train_indices)
#     test_indice.append(val_indices)

model_performance_list = []

#Set up list of models and param dictionaries
knc = KNeighborsClassifier()
dtc = DecisionTreeClassifier()
lgc = LogisticRegression()
svc = SVC()

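# NOTE: the grids were left blank in the original cell. The values below are
# assumptions: both criterion options for the tree, and the grids from the
# earlier cells (lg_params, svc_params, the KNC grid) for the rest.
models_sequence = [
    (dtc, {'criterion': ['gini', 'entropy']}),
    (lgc, lg_params),
    (svc, svc_params),
    (knc, {'n_neighbors': list(range(1, 30)), 'weights': ['uniform', 'distance']})
]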

# Try a bunch of different values of k
ks = list(range(1,32))
for k in ks:
    
    results = {}
    results['k'] = k
    train_scores = []
    val_scores = []
    
    for fold, (train_indices, val_indices) in enumerate(skfold.split(X_train, y_train)):
        
        # split your train data by indices into train and validation
        X_train_kf, y_train_kf = X_train.values[train_indices], y_train.values[train_indices]
        X_val_kf, y_val_kf = X_train.values[val_indices], y_train.values[val_indices]
        
        # scale the data
        scaler = StandardScaler()
        X_train_kf_scaled = scaler.fit_transform(X_train_kf)
        X_val_kf_scaled = scaler.transform(X_val_kf)
        
        # fit your model
        knc = KNeighborsClassifier(n_neighbors=k)
        knc.fit(X_train_kf_scaled, y_train_kf)
        
        # generate a train and test accuracy
        train_score = knc.score(X_train_kf_scaled, y_train_kf)
        val_score = knc.score(X_val_kf_scaled, y_val_kf)
        
        # add to your results dict
        results['fold_{}_train'.format(fold)] = train_score
        results['fold_{}_test'.format(fold)] = val_score
        results['fold_{}_model'.format(fold)] = knc
        
        # generate your mean train and validation scores
        train_scores.append(train_score)
        val_scores.append(val_score)
        results['mean_train_score'] = np.mean(train_scores)
        results['mean_val_score'] = np.mean(val_scores)
    
    model_performance_list.append(results)

cv_results = pd.DataFrame(model_performance_list)

### Fit an extra trees classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

In [None]:
xrtc = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
xrt_score = cross_val_score(xrtc, X_train_pca, y_train)

In [None]:
xrt_score

In [None]:
knc1.fit(X_train_pca, y_train)

In [None]:
knc1.score(X_train_pca, y_train)

In [None]:
knc1.score(X_test_pca, y_test)

### fit voting classifier with SVC, KNC, Naive Bayes, Decision Tree Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier