# 13 Model Selection Tutorial
## Grid Search for *k*-NN

To get started, here is an example that fits a *k*-NN model to the `HotelRevHelpfulness` dataset. The grid search assesses three choices:
- which scaler to use: `StandardScaler`, `MinMaxScaler` or no scaler
- which value of *k* to use for *k*-NN
- which weighting policy to use (`uniform` or `distance`)

In [2]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
import pandas as pd

In [3]:
hotel_rev = pd.read_csv('HotelRevHelpfulnessV2.csv')
y = hotel_rev.pop('reviewHelpfulness').values
X = hotel_rev.values

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=1/2,
                                                    random_state=42)
X_train.shape, X_test.shape

((243, 23), (243, 23))

In [5]:
kNNpipe  = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('kNN', KNeighborsClassifier())])

# Parameters for kNN are prefixed with kNN__
param_grid = {'scaler':[StandardScaler(), MinMaxScaler(),'passthrough'], 
              'kNN__n_neighbors':[1,3,5,7],
              'kNN__weights':['uniform','distance']
             }
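
The `kNN__` prefix is the step name given in the pipeline, joined to the estimator's parameter name by a double underscore. One way to check which names a grid may use is to list the pipeline's parameters. The minimal sketch below rebuilds the same pipeline under the hypothetical name `pipe` so that it runs on its own.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as kNNpipe, rebuilt here so the snippet is self-contained.
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('kNN', KNeighborsClassifier())])

# Every tunable name has the form <step name>__<parameter name>.
param_names = sorted(pipe.get_params().keys())
knn_param_names = [name for name in param_names if name.startswith('kNN__')]
print(knn_param_names)

# A bare step name as the key replaces the whole step, which is how a grid
# can swap scalers or disable scaling with 'passthrough'.
pipe.set_params(scaler='passthrough', kNN__n_neighbors=3)
```

A misspelled key in `param_grid` raises an error when the search is fitted, so printing the valid names first can save a round trip.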

In [6]:
grid_search = GridSearchCV(kNNpipe, param_grid=param_grid, verbose = 1)
grid_search = grid_search.fit(X_train,y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
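
The counts in this message follow from the grid: 3 scaler options × 4 values of *k* × 2 weighting policies give 24 candidates, and each is fitted once per fold. As a quick check, `ParameterGrid` can enumerate the combinations without fitting anything; the sketch below repeats the grid so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import MinMaxScaler, StandardScaler

param_grid = {'scaler': [StandardScaler(), MinMaxScaler(), 'passthrough'],
              'kNN__n_neighbors': [1, 3, 5, 7],
              'kNN__weights': ['uniform', 'distance']}

# Number of distinct parameter combinations in the grid.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # GridSearchCV's default when cv is not specified
print(n_candidates, n_candidates * n_folds)
```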


In [7]:
grid_search.best_params_

{'kNN__n_neighbors': 5, 'kNN__weights': 'uniform', 'scaler': 'passthrough'}

### All grid search results
The `cv_results_` attribute gives access to the results for every candidate tested.  
We store these results in a data frame and display the most informative columns.

In [8]:
scores_df = pd.DataFrame(grid_search.cv_results_)
# Sort by rank and discard the old index (drop=True) so rows are numbered from 0
scores_df = scores_df.sort_values(by=['rank_test_score']).reset_index(drop=True)
scores_df[['rank_test_score', 'mean_test_score', 'param_kNN__n_neighbors',
           'param_kNN__weights', 'param_scaler']]

| | rank_test_score | mean_test_score | param_kNN__n_neighbors | param_kNN__weights | param_scaler |
|---|---|---|---|---|---|
| 0 | 1 | 0.695748 | 5 | uniform | passthrough |
| 1 | 2 | 0.683503 | 5 | distance | passthrough |
| 2 | 3 | 0.683163 | 7 | uniform | passthrough |
| 3 | 3 | 0.683163 | 3 | uniform | passthrough |
| 4 | 5 | 0.679167 | 7 | distance | passthrough |
| 5 | 6 | 0.674915 | 3 | distance | passthrough |
| 6 | 7 | 0.654507 | 7 | uniform | MinMaxScaler() |
| 7 | 7 | 0.654507 | 7 | distance | MinMaxScaler() |
| 8 | 9 | 0.65034 | 7 | distance | StandardScaler() |
| 9 | 9 | 0.65034 | 7 | uniform | StandardScaler() |
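
The top candidates here are separated by roughly 0.01 in mean accuracy, so the spread across folds may help in judging whether the ranking is meaningful. `cv_results_` also holds a `std_test_score` entry for this purpose. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the hotel reviews, and the names `X_demo`, `y_demo`, `demo_search` and `demo_scores` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

demo_search = GridSearchCV(KNeighborsClassifier(),
                           param_grid={'n_neighbors': [1, 3, 5, 7]})
demo_search.fit(X_demo, y_demo)

demo_scores = pd.DataFrame(demo_search.cv_results_)
# std_test_score is the standard deviation of the score across the folds.
cols = ['rank_test_score', 'mean_test_score', 'std_test_score',
        'param_n_neighbors']
print(demo_scores.sort_values(by='rank_test_score')[cols])
```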


## Pipelines and Naive Bayes
**Q1**  
Two Naive Bayes options for building classifiers on the Hotels dataset are `GaussianNB` and `CategoricalNB` with discretization. Scikit-learn provides a basic discretizer, `KBinsDiscretizer` (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html).
Compare three options using cross-validation:
- Gaussian Naive Bayes
- Gaussian Naive Bayes on scaled data
- Categorical Naive Bayes with discretization (try `KBinsDiscretizer(encode='ordinal')`)

In [9]:
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.model_selection import cross_val_score

Pipeline for `CategoricalNB`.  

In [10]:
CNBpipe  = Pipeline(steps=[
    ('discretize', KBinsDiscretizer(encode = 'ordinal')),
    ('naive_bayes', CategoricalNB())])
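
One possible way to set up the three-way comparison is sketched below. It is not a reference solution: it runs on synthetic data from `make_classification` (replace `X_demo`, `y_demo` with the hotel training data), and the names `candidates` and `cv_means`, the choice of `StandardScaler`, and the use of `min_categories` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)

n_bins = 5  # KBinsDiscretizer's default number of bins
candidates = {
    'GaussianNB': GaussianNB(),
    'GaussianNB + scaling': Pipeline(steps=[
        ('scaler', StandardScaler()),
        ('naive_bayes', GaussianNB())]),
    'CategoricalNB + bins': Pipeline(steps=[
        ('discretize', KBinsDiscretizer(n_bins=n_bins, encode='ordinal')),
        # min_categories guards against a bin index missing from a training fold
        ('naive_bayes', CategoricalNB(min_categories=n_bins))]),
}

cv_means = {}
for name, model in candidates.items():
    # Each pipeline is refitted inside every fold, so the scaler and the
    # discretizer only ever see that fold's training portion.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f}')
```

Wrapping the preprocessing in a pipeline, rather than transforming the data once up front, keeps information from the validation folds out of the fitted bin edges and scaling statistics.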

## Grid Search for Decision Trees
**Q2**  
Find the best decision tree model for the `HotelRevHelpfulness` dataset, considering `max_leaf_nodes` and the splitting `criterion`. The `criterion` can be either `'gini'` or `'entropy'`; you can choose your own options for `max_leaf_nodes`.

In [21]:
from sklearn.tree import DecisionTreeClassifier

In [22]:
dt = DecisionTreeClassifier()
param_grid = {'max_leaf_nodes': [3, 4, 5, 6, 10, 50],
              'criterion': ['gini', 'entropy']}

In [23]:
dt_gs = GridSearchCV(dt,param_grid,cv=10, verbose = 1, n_jobs = -1)
dt_gs = dt_gs.fit(X_train,y_train)

Fitting 10 folds for each of 12 candidates, totalling 120 fits


In [24]:
dt_gs.best_params_

{'criterion': 'gini', 'max_leaf_nodes': 5}

In [25]:
dt_gs.best_score_

0.6951666666666666
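
`best_score_` is the mean cross-validated accuracy of the winning candidate on the training split. Because that same score was used to pick the winner, it can be somewhat optimistic, and a natural follow-up is to score the refitted model on the held-out test split. The sketch below shows the pattern on synthetic data; `X_demo`, `y_demo`, `demo_gs` and the reduced grid are illustrative assumptions, not the tutorial's setup.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=1/2, random_state=42)

demo_gs = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       {'max_leaf_nodes': [3, 5, 10],
                        'criterion': ['gini', 'entropy']},
                       cv=10)
demo_gs.fit(X_tr, y_tr)

# With the default refit=True, the best candidate is retrained on all of
# X_tr, and predict() delegates to that refitted model.
test_acc = accuracy_score(y_te, demo_gs.predict(X_te))
print(demo_gs.best_params_, round(demo_gs.best_score_, 3), round(test_acc, 3))
```

The same pattern applies to the *k*-NN search earlier in the tutorial, with `grid_search` in place of `demo_gs` and `X_test`, `y_test` as the held-out data.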