<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Model Selection & Evaluation

In this notebook we look at strategies for dividing a dataset into subsets for model selection and testing, in ways that do not bias the measurement of model performance.

We will use a dataset from a study that used sonar signals to differentiate between a mine (simulated using a metal cylinder) and a rock.  Details on the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).

In [None]:
# Import the libraries we know we need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import KFold

import warnings
warnings.filterwarnings("ignore")

In [None]:
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
data = pd.read_csv(url, header=None)
print(data.shape)
data.head()

## Part 1: Training and test sets
In this part, you should complete the following:  
- Split your data into a feature matrix X and a target vector y  
- Split the data into a training set and a test set, using 85% of the data for training and 15% for testing (hint: use scikit-learn's `train_test_split()` function, already imported for you).  Name the resulting arrays `X_train, y_train, X_test, y_test`  
- Train (fit) your model on the X and y training sets  
- Use your trained model to get predictions on the `X_test` test set, and name the predictions `preds`  
- Finally, run the next code cell to calculate and display the accuracy of your classifier model
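
The steps above can be sketched as follows. This is a minimal illustration rather than the intended solution: it uses synthetic data from `make_classification` as a stand-in for the sonar data, a `KNeighborsClassifier` as a placeholder model, and an arbitrary `random_state=0`. The name `demo_model` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the sonar data (assumption): 60 feature columns, binary labels
X, y = make_classification(n_samples=208, n_features=60, random_state=0)

# Hold out 15% of the rows for testing; the seed is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Fit on the training set only, then predict on the held-out test set
demo_model = KNeighborsClassifier(n_neighbors=5)
demo_model.fit(X_train, y_train)
preds = demo_model.predict(X_test)

print(X_train.shape, X_test.shape)
print('Test accuracy: {:.3f}'.format(np.mean(preds == y_test)))
```

Note that `train_test_split` returns its outputs in the order `X_train, X_test, y_train, y_test`, which differs from the order in which the names are listed above.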

In [None]:
def run_model(data,model):
    '''
    Splits the data, trains a model on the training data and then generates and returns predictions on the test set

    Inputs:
        data(DataFrame): dataframe containing the data features and labels
        model(sklearn.base.BaseEstimator): instantiated scikit-learn model object

    Returns:
        preds(np.ndarray): numpy array containing the model predictions for the test set
        y_test(np.ndarray): numpy array containing the labels for the test set
    '''
    ### BEGIN SOLUTION ###
    
    
    ### END SOLUTION ###

In [None]:
# Create an instance of the MLPClassifier algorithm and set the hyperparameter values
model = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.001,max_iter=2000, random_state=0)
                      
# Evaluate the performance of our model using the test predictions
preds,y_test = run_model(data,model)
assert len(preds) == len(y_test)
acc_test = np.sum(preds==y_test)/len(y_test)
print('Accuracy of our classifier on the test set is {:.3f}'.format(acc_test))

## Part 2: Model selection using validation sets
But what if we want to compare different models (for example, evaluate different algorithms or fine-tune our hyperparameters)?  Can we use the same strategy of training each model on the training data and then comparing their performance on the test set to select the best model?

When we are seeking to optimize models by tuning hyperparameters or comparing different algorithms, it is a best practice to do so by comparing the performance of your model options using a "validation" set, and then reserve use of the test set to evaluate the performance of the final model you have selected.  To utilize this approach we must split our data three ways to create a training set, validation set, and test set.

To illustrate this, let's compare two different models.  Complete the function below which performs the following:
- Split your data into X and y arrays and then into a training set and a test set, using 15% of data for the test set.  Store the training data as `X_train_full, y_train_full` and the test set data as `X_test, y_test`
- Now, split your training set again into a training set and a validation set, using 15% of the training set for the new validation set (and the remaining 85% is still available for training). Store the final training data as `X_train, y_train` and the validation set data as `X_val, y_val`
- Train (fit) model1 and model2 using the training data only  
- Now, use your trained model1 and model2 to generate predictions on the validation set.  Store model1's predictions as `val_preds_model1` and model2's predictions as `val_preds_model2`  
- Finally, run the code cell below to calculate the accuracy of each on the validation set.  Based on this, which model would you select as your final model?
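
Because the second split takes 15% of the remaining 85%, the validation set ends up at roughly 0.15 × 0.85 ≈ 12.75% of all rows and the final training set at roughly 72.25%. The sketch below checks this on synthetic stand-in data (an assumption, not the sonar data) with an arbitrary `random_state=0`; it shows only the splitting, not the model fitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sonar data (assumption)
X, y = make_classification(n_samples=208, n_features=60, random_state=0)

# First split: set aside 15% of all rows as the test set
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

# Second split: take 15% of the remaining rows as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.15, random_state=0)

# Report each subset's size as a share of the whole dataset
n = len(X)
for name, part in [('train', X_train), ('validation', X_val), ('test', X_test)]:
    print('{:>10}: {:3d} rows ({:.1%} of all data)'.format(
        name, len(part), len(part) / n))
```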

In [None]:
def compare_models(data, model1, model2):
    '''
    Splits data into training, validation and test sets, trains two models and then generates predictions on the validation set using both models

    Inputs:
        data(DataFrame): dataframe containing the data features and labels
        model1(sklearn.base.BaseEstimator): instantiated scikit-learn model object
        model2(sklearn.base.BaseEstimator): instantiated scikit-learn model object

    Returns:
        val_preds_model1(np.ndarray): numpy array containing model1's predictions for the validation set
        val_preds_model2(np.ndarray): numpy array containing model2's predictions for the validation set
        y_val(np.ndarray): numpy array containing the labels for the validation set
        y_test(np.ndarray): numpy array containing the labels for the test set
    '''
    ### BEGIN SOLUTION ###

    

    ### END SOLUTION ###

In [None]:
# Create an instance of each model we want to evaluate

model1 = MLPClassifier(hidden_layer_sizes=(100,50,10),activation='tanh',
                      solver='sgd',learning_rate_init=0.001,max_iter=2000, random_state=0)

model2 = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=2000, random_state=0)

# Calculate the validation accuracy of each model
val_preds_model1, val_preds_model2, y_val, y_test = compare_models(data, model1, model2)
acc_val_model1 = np.sum(val_preds_model1==y_val)/len(y_val)
acc_val_model2 = np.sum(val_preds_model2==y_val)/len(y_val)

print('Accuracy of model1 on the validation set is {:.3f}'.format(acc_val_model1))
print('Accuracy of model2 on the validation set is {:.3f}'.format(acc_val_model2))

Now that we've chosen our final model, we can use the test set to evaluate its performance.  Before we do that, let's retrain our model using the training plus validation data.

In [None]:
# Train our selected model on the training plus validation sets
preds,y_test = run_model(data,model2)

# Evaluate its performance on the test set
acc_test = np.sum(preds==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))
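
The cell above reuses `run_model`, which evaluates on the same test rows only if it splits the data with the same `random_state` as `compare_models` (an assumption about how the two functions were completed). The same idea can be sketched with explicit arrays. The example uses synthetic stand-in data and a placeholder `KNeighborsClassifier`; the names `X_refit`, `y_refit` and `final_model` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the sonar data (assumption)
X, y = make_classification(n_samples=208, n_features=60, random_state=0)

# Same two-stage split as in Part 2, with an arbitrary fixed seed
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.15, random_state=0)

# Recombine the training and validation rows for the final fit
X_refit = np.concatenate([X_train, X_val])
y_refit = np.concatenate([y_train, y_val])

# Placeholder for whichever model was selected on the validation set
final_model = KNeighborsClassifier(n_neighbors=5)
final_model.fit(X_refit, y_refit)

# The test set is used once, only after model selection is finished
test_preds = final_model.predict(X_test)
print('Test accuracy: {:.3f}'.format(np.mean(test_preds == y_test)))
```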

## Part 3: Model selection using cross-validation

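A common approach to comparing and optimizing models is to use cross-validation rather than a single validation set to compare model performance.  In k-fold cross-validation the training data is divided into k folds; each fold serves once as the validation set while the model is trained on the remaining k-1 folds, and the k validation scores are averaged.  We will then select the better model based on the cross-validation performance and use the test set to determine its performance.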
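
A minimal sketch of a manual k-fold loop is shown below, assuming 5 shuffled folds, synthetic stand-in data and a `KNeighborsClassifier` as the model. The fold count, the seed and the names `base_model`, `fold_model` and `fold_accs` are illustrative choices, not requirements of the exercise.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training portion of the sonar data (assumption)
X_train, y_train = make_classification(n_samples=176, n_features=60, random_state=0)

base_model = KNeighborsClassifier(n_neighbors=5)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accs = []
for train_idx, val_idx in kf.split(X_train):
    # Use a fresh, unfitted copy of the model for each fold
    fold_model = clone(base_model)
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    # Score on the fold that was held out of training
    fold_preds = fold_model.predict(X_train[val_idx])
    fold_accs.append(np.mean(fold_preds == y_train[val_idx]))

print('Per-fold accuracy:', np.round(fold_accs, 3))
print('Mean cross-validation accuracy: {:.3f}'.format(np.mean(fold_accs)))
```

Cloning the model inside the loop keeps each fold independent of what was learned on the previous folds.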

In [None]:
def run_crossvalidation(X_train,y_train,models):
    '''
    Performs k-fold cross-validation on an arbitrary number of models provided as inputs and returns the mean cross-validation accuracy of each

    Inputs:
        X_train(np.ndarray): numpy array containing the training set features
        y_train(np.ndarray): numpy array containing the training set labels
        models(list): list of instantiated scikit-learn model objects

    Returns:
        crossval_accs(list): list containing the mean cross-validation accuracy of each model across the validation folds
    '''

    ### BEGIN SOLUTION ###

    
        
    ### END SOLUTION ###
            

In [None]:
# Let's set aside a test set and use the remainder for training and cross-validation
X = data.iloc[:,:60].to_numpy()
y = data.iloc[:,60].to_numpy()
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

# Set up the two models we want to compare: a neural network model and a KNN model
model_a = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=1000,random_state=0)

model_b = KNeighborsClassifier(n_neighbors=5)

models = [model_a, model_b]

accs = run_crossvalidation(X_train,y_train,models)
for model,acc in zip(models,accs):
    print('For model: {}'.format(model))
    print('Mean cross-validation accuracy across all folds is {:.3f} \n'.format(acc))

In [None]:
# Let's set aside a test set and use the remainder for training and cross-validation
X = data.iloc[:,:60].to_numpy()
y = data.iloc[:,60].to_numpy()
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=0,test_size=0.15)

# Set up the two models we want to compare: a neural network model and a KNN model
model_a = MLPClassifier(hidden_layer_sizes=(100,50),activation='relu',
                      solver='sgd',learning_rate_init=0.01,max_iter=1000,random_state=0)

model_b = KNeighborsClassifier(n_neighbors=5)

models = [model_a, model_b]

# Cross-validation using cross_val_score
from sklearn.model_selection import cross_val_score
for model in models:
    scores = cross_val_score(model,X_train,y_train,scoring="accuracy",cv=3)
    mean_score = np.mean(scores)
    print(model)
    print('Mean cross-validation accuracy across all folds is {:.3f} \n'.format(mean_score))
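
For a classifier, scikit-learn's documentation states that an integer `cv` is interpreted as stratified folds (`StratifiedKFold`) without shuffling, so results can differ from a plain `KFold` loop. The sketch below checks this on synthetic stand-in data (an assumption) by comparing `cv=3` with an explicit `StratifiedKFold(n_splits=3)`; the names `X_demo`, `y_demo` and `knn` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=176, n_features=60, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# Integer cv: scikit-learn chooses the splitter
scores_int = cross_val_score(knn, X_demo, y_demo, scoring='accuracy', cv=3)

# Explicit stratified splitter with the same number of folds
scores_skf = cross_val_score(knn, X_demo, y_demo, scoring='accuracy',
                             cv=StratifiedKFold(n_splits=3))

print('Same scores:', np.allclose(scores_int, scores_skf))
print('Mean {:.3f}, std {:.3f}'.format(scores_int.mean(), scores_int.std()))
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold, which matters on a dataset this small.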

As we can see above, the cross-validation accuracy of model_a is higher than that of model_b, so we will use model_a.  Let's now evaluate the performance of model_a on the test set.

In [None]:
# Retrain our selected model on the full training set (all cross-validation folds)
preds,y_test = run_model(data,model_a)

# Evaluate its performance on the test set
acc_test = np.sum(preds==y_test)/len(y_test)
print('Accuracy of our model on the test set is {:.3f}'.format(acc_test))