##### Jupyter Notebook, Step 3 - Feature Importance
- Use the results from step 2 to discuss feature importance in the dataset
- Considering these results, develop a strategy for building a final predictive model
- Recommended approaches:
    - Use feature selection to reduce the dataset to a manageable size then use conventional methods
    - Use dimension reduction to reduce the dataset to a manageable size then use conventional methods
    - Use an iterative model training method to use the entire dataset

For this section, I will build a grid search pipeline to tune hyperparameters on the five models I have chosen. I will run this grid search once for each of the 3 feature sets produced by the feature selection methods in notebook 2.

The results will be appended to a list of dictionaries, which I will then convert into a dataframe for readability. The top result of this notebook should be a final model that I can test on the full Madelon dataset, and potentially on a very large dataset from Josh's page.

Pipeline to include: StandardScaler, Model
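As a rough sketch of that pipeline idea: putting the scaler and the model in one `Pipeline` and handing the whole thing to `GridSearchCV` means the scaler is refit on each training fold, so the validation fold never influences the scaling. The data below is synthetic (`make_classification`), and the names `X_demo`, `pipe`, `pipe_params`, and `pipe_grid` are placeholders of mine, not objects from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is NOT the Madelon data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Step names ('scaler', 'model') are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
pipe_params = {
    'model__n_neighbors': [3, 5, 7],
    'model__weights': ['uniform', 'distance'],
}

pipe_grid = GridSearchCV(pipe, pipe_params, cv=3)
pipe_grid.fit(X_tr, y_tr)

print(pipe_grid.best_params_)
print(pipe_grid.score(X_te, y_te))  # accuracy on the held-out split
```

The cells further down scale the data once before the search instead. That is simpler, but the cross-validation folds then share scaling statistics.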

Models to search through: 
### LogisticRegression

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }
clf = GridSearchCV(LogisticRegression(penalty='l2'), param_grid)
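A minimal runnable version of the logistic regression search above, again on synthetic data rather than Madelon. `penalty='l2'` is left out only because L2 is scikit-learn's default penalty. `max_iter=1000` and `cv=3` are choices I made for this sketch, and the names `X_lr`, `y_lr`, `lr_results`, and `lr_top` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_lr, y_lr = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# L2 is the default penalty; max_iter is raised to avoid convergence warnings.
clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
clf.fit(X_lr, y_lr)

# One row per candidate value of C, best mean CV score first.
lr_results = pd.DataFrame(clf.cv_results_)
lr_top = lr_results.sort_values('mean_test_score', ascending=False).head(1)
print(lr_top[['param_C', 'mean_test_score']])
```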

### KNeighborsRegressor / KNeighborsClassifier

n_neighbors: 1 through an upper bound somewhere between 10 and 100
weights: 'uniform', 'distance'

### DecisionTreeClassifier

params = {
    'max_depth': [1,2,3,4,None],
    'max_features': [2,3,4,5,6,7],
    'max_leaf_nodes': [5,10,15,20,25,30,35,40,None],
    'min_samples_leaf': [1,2,3,4,5,6]
}
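Before running this grid it is worth checking how large it is, since every combination is fit once per cross-validation fold. A quick count using scikit-learn's `ParameterGrid` (the variable names `tree_params` and `n_candidates` are mine):

```python
from sklearn.model_selection import ParameterGrid

tree_params = {
    'max_depth': [1, 2, 3, 4, None],
    'max_features': [2, 3, 4, 5, 6, 7],
    'max_leaf_nodes': [5, 10, 15, 20, 25, 30, 35, 40, None],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6],
}

# A single dict is expanded as a full cross product: 5 * 6 * 9 * 6 candidates.
n_candidates = len(ParameterGrid(tree_params))
print(n_candidates)  # 1620
```

With k-fold cross-validation that means 1620 × k tree fits per feature set. Trees are cheap to fit, so this should still be manageable, but the count grows quickly if more values are added.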

### SVC

param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
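Because this grid is a list of dictionaries, each dictionary is expanded on its own and the results are combined, so `gamma` is only varied for the RBF kernel. A small check with `ParameterGrid` (the names `svc_grid` and `candidates` are mine):

```python
from sklearn.model_selection import ParameterGrid

svc_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

candidates = list(ParameterGrid(svc_grid))

# 4 linear candidates + 4 * 2 rbf candidates
print(len(candidates))  # 12

# Linear-kernel candidates carry no 'gamma' entry at all.
linear = [c for c in candidates if c['kernel'] == 'linear']
print(all('gamma' not in c for c in linear))  # True
```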

## Steps
1. Load the datasets
2. Load the feature sets 
2a. train_test_split
3. Make the pipeline (StandardScaler, model), define the parameter grid (params = {...}), and build GridSearchCV(model, params)
4. Show results (results = pd.DataFrame(clf.cv_results_), results.sort_values('mean_test_score', ascending=False).head(1), clf.best_estimator_)
5. Repeat steps 3-4 for all 4 models
6. Note best model and save
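The steps above could be sketched roughly as follows. This is only an outline under stated assumptions: the data is synthetic, the grids are cut down so the loop runs quickly, and every name here (`searches`, `all_results`, `fitted`, `summary`, `best_name`) is a placeholder of mine, not something defined elsewhere in the notebook.

```python
import pickle

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2a: synthetic stand-in for one feature set (NOT the Madelon data).
X_all, y_all = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

# Deliberately small grids; the full grids are listed in the sections above.
searches = {
    'LogisticRegression': (LogisticRegression(max_iter=1000),
                           {'model__C': [0.01, 1, 100]}),
    'KNeighborsClassifier': (KNeighborsClassifier(),
                             {'model__n_neighbors': [3, 5, 9],
                              'model__weights': ['uniform', 'distance']}),
    'DecisionTreeClassifier': (DecisionTreeClassifier(random_state=42),
                               {'model__max_depth': [2, 4, None]}),
    'SVC': (SVC(), {'model__C': [1, 10], 'model__kernel': ['linear', 'rbf']}),
}

all_results = []  # list of dictionaries, one per model
fitted = {}       # best refit pipeline per model

# Steps 3-5: build the pipeline, run the grid search, record the result.
for name, (model, grid_params) in searches.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', model)])
    search = GridSearchCV(pipe, grid_params, cv=3)
    search.fit(X_tr, y_tr)
    fitted[name] = search.best_estimator_
    all_results.append({
        'model': name,
        'best_params': search.best_params_,
        'cv_score': search.best_score_,
        'test_score': search.score(X_te, y_te),
    })

# Step 6: rank by cross-validated score and serialize the winner.
summary = pd.DataFrame(all_results).sort_values('cv_score', ascending=False)
best_name = summary.iloc[0]['model']
best_model_bytes = pickle.dumps(fitted[best_name])
print(summary)
```

To run this for all three feature sets, the same loop can be wrapped in an outer loop over the feature sets, with a `feature_set` key added to each result dictionary.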

In [16]:
import pickle
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

In [2]:
with open('supports.pkl', 'rb') as f:
    supports = pickle.load(f)

madelon_uci = pd.read_pickle('m_uci_1.pickle')

In [3]:
madelon_uci[supports[0]].shape

(440, 20)

In [5]:
# Hold out 30% of the rows; supports[0] selects the first feature set from notebook 2
X_train, X_test, y_train, y_test = train_test_split(
    madelon_uci[supports[0]],
    madelon_uci['y'],
    test_size=0.3,
    random_state=42,
)

In [7]:
ss = StandardScaler()
# Fit the scaler on the training split only, then apply the same transform to the test split
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [20]:
params = {
    'n_neighbors': list(range(1, 30)),  # k = 1..29
    'weights': ['uniform', 'distance']
}

In [21]:
knr = KNeighborsClassifier()
grd = GridSearchCV(knr, params)

In [22]:
grd.fit(X_train_sc, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], 'weights': ['uniform', 'distance']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [24]:
results = pd.DataFrame(grd.cv_results_)