# Part One: Split your data

This part is exactly what we did to split the data in the example provided in the lesson. Import the data, separate the target labels from the input data, and use `train_test_split`.

In [13]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, KFold

# Load the data
heart_dataset = pd.read_csv('../../datasets/uci-heart-disease/heart.csv')

# Split the data into input and labels
labels = heart_dataset['target']
input_data = heart_dataset.drop(columns=['target'])

# Split the data into training and test
training_data, test_data, training_labels, test_labels = train_test_split(
    input_data, 
    labels, 
    test_size=0.20
)
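The split above is reshuffled on every run, so the scores later in this notebook will vary from run to run. As a minimal sketch (using made-up stand-in data rather than `heart.csv`, and a hypothetical helper named `split`), the optional `random_state` and `stratify` arguments of `train_test_split` can make a split repeatable and keep the class ratio similar in both pieces:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, a 70/30 class mix.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
targets = np.array([1] * 70 + [0] * 30)


def split(seed):
    """Hypothetical helper: an 80/20 split that is repeatable for a given seed."""
    return train_test_split(
        features,
        targets,
        test_size=0.20,
        random_state=seed,  # fixes the shuffle so the split can be repeated
        stratify=targets,   # keeps the class ratio similar in both pieces
    )


X_train, X_test, y_train, y_test = split(seed=42)
X_train_again, _, _, _ = split(seed=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
print(y_train.mean(), y_test.mean())
```

Whether stratifying is appropriate depends on the problem; it is shown here only as one option.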

## Bonus Challenge: K-Fold Cross Validation

K-Fold validation is a method that makes `k` different training and validation sets. Its purpose is to provide a more rigorous form of validation by training on and holding out different data each time.

There are many kinds of "Cross Validation", and k-fold is the simplest form. For more details and some very helpful visualizations of how k-fold works, see: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py

In [14]:
# 5 splits means each validation set will be 20% of the data. 4 splits would be 25%, 3 splits would be 33%...
kf = KFold(n_splits = 5)

# Looping over the splits gives us the row indices for each set of data.
# We can use pandas .iloc to select the data represented by those indices.
for train_index, validation_index in kf.split(input_data):
    X_train, X_validation = input_data.iloc[train_index], input_data.iloc[validation_index]
    y_train, y_validation = labels.iloc[train_index], labels.iloc[validation_index]
    
    # Now, we'd do our model training and evaluation inside this loop.
    # Train the model once for each different subset of training/validation
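To make the index lists concrete, here is a small pure-Python sketch of how unshuffled k-fold indices can be produced. The function name `kfold_indices` is hypothetical and this is not scikit-learn's code; it only imitates the contiguous, unshuffled folds you get from `KFold(n_splits=5)`:

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


for train, validation in kfold_indices(n_samples=10, n_splits=5):
    print("train:", train, "validation:", validation)
```

Every row lands in exactly one validation fold, and each fold is a consecutive block of rows, which matters if the rows of the file are ordered in some way.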

# Part Two: Explore KNN

Build at least 6 different versions of the KNN model using different hyperparameters and evaluate their performance on the validation data.

In [20]:
# There is a nearly infinite number of correct solutions to this section.
model_one = KNeighborsClassifier()
model_two = KNeighborsClassifier(weights='distance')
model_three = KNeighborsClassifier(n_neighbors=10, weights='distance')
model_four = KNeighborsClassifier(n_neighbors=10, weights='distance', p=1)
model_five = KNeighborsClassifier(n_neighbors=10, weights='distance', p=3)
model_six = KNeighborsClassifier(n_neighbors=3, p=1)

# Let's make it easier to train each one by putting them in a list...
models = [
    model_one,
    model_two,
    model_three,
    model_four,
    model_five,
    model_six
]

for index, model in enumerate(models):
    model.fit(training_data, training_labels)
    print(f'{index}: {model.score(test_data, test_labels):.3f}')

0: 0.623
1: 0.639
2: 0.705
3: 0.721
4: 0.689
5: 0.738
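To give some intuition for what `n_neighbors`, `weights`, and `p` change, here is a from-scratch sketch of a nearest-neighbor vote. It is a simplified illustration, not scikit-learn's implementation: the functions `minkowski` and `knn_predict` and the toy points are hypothetical, and the guard against a zero distance is a shortcut chosen for this example.

```python
from collections import defaultdict


def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan distance, p=2 is Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(points, point_labels, query, n_neighbors=5, weights="uniform", p=2):
    """Classify `query` by a vote among its `n_neighbors` closest points."""
    distances = sorted(
        (minkowski(point, query, p), label)
        for point, label in zip(points, point_labels)
    )
    votes = defaultdict(float)
    for distance, label in distances[:n_neighbors]:
        if weights == "distance":
            # Closer neighbors count for more; avoid dividing by zero.
            votes[label] += 1.0 / max(distance, 1e-12)
        else:
            # Every neighbor gets one equal vote.
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical toy data: one "a" point close to the query, two "b" points farther away.
toy_points = [(0, 0), (4, 0), (5, 0), (9, 0)]
toy_labels = ["a", "b", "b", "b"]
query = (1, 0)

print(knn_predict(toy_points, toy_labels, query, n_neighbors=3, weights="uniform"))
print(knn_predict(toy_points, toy_labels, query, n_neighbors=3, weights="distance"))
print(minkowski((0, 0), (3, 4), p=1), minkowski((0, 0), (3, 4), p=2))
```

With uniform weights the two farther "b" points outvote the single nearby "a" point; with distance weights the nearby point wins. The same pair of points is also a different distance apart depending on `p`, which can change which neighbors are considered closest.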


## Applying K-Fold

Above, we used K-Fold Cross Validation to create several validation sets. Let's see that applied to the training process:

In [27]:
# Each model trains and scores on each of the "k folds".
# We'll see its score on each set, as well as the average score.
for i, model in enumerate(models):
    validation_scores = []
    for train_index, validation_index in kf.split(input_data):
        X_train, X_validation = input_data.iloc[train_index], input_data.iloc[validation_index]
        y_train, y_validation = labels.iloc[train_index], labels.iloc[validation_index]
        
        model.fit(X_train, y_train)
        score = model.score(X_validation, y_validation)
        validation_scores.append(score)
    
    # We've fit and scored this model on each set now.
    pretty_scores = ', '.join('{:.2f}'.format(score) for score in validation_scores)
    print(f'{i}:\n   {pretty_scores}\n   {np.mean(validation_scores):.3f}')

0:
   0.48, 0.64, 0.67, 0.38, 0.32
   0.497
1:
   0.48, 0.62, 0.69, 0.38, 0.33
   0.501
2:
   0.54, 0.61, 0.64, 0.35, 0.35
   0.497
3:
   0.54, 0.66, 0.67, 0.38, 0.38
   0.527
4:
   0.54, 0.64, 0.64, 0.33, 0.28
   0.487
5:
   0.61, 0.67, 0.66, 0.40, 0.42
   0.550


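Notice that our average K-fold accuracy is less impressive than our single-shot accuracy. The data splitting process is not trivial: `KFold` does not shuffle the rows by default, so if the rows in the file happen to be ordered (for example, grouped by `target`), a validation fold can have a very different class mix from the data the model was trained on.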
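One way to see how much the splitting strategy can matter is to look at the class balance inside each validation fold. The sketch below uses made-up labels, deliberately ordered with all positives first; whether `heart.csv` is ordered like this is an assumption you would need to check. The helper `validation_positive_rates` is hypothetical.

```python
import random


def validation_positive_rates(fold_labels, n_splits):
    """Fraction of positive labels in each contiguous validation fold."""
    fold_size = len(fold_labels) // n_splits
    rates = []
    for fold in range(n_splits):
        chunk = fold_labels[fold * fold_size:(fold + 1) * fold_size]
        rates.append(sum(chunk) / len(chunk))
    return rates


# Hypothetical labels, ordered so that every positive row comes first.
ordered = [1] * 60 + [0] * 40

# The same labels in a shuffled order (fixed seed so the run is repeatable).
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)

print("ordered: ", validation_positive_rates(ordered, 5))
print("shuffled:", validation_positive_rates(shuffled, 5))
```

With ordered rows, some folds contain only one class, so a model can be validated on a class mix it barely saw during training. In scikit-learn, `KFold(n_splits=5, shuffle=True)` or `StratifiedKFold` are options for avoiding this.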

## Bonus: Using Grid Search

To use grid search, we specify all the values of the different parameters that we're interested in. The grid search interface will automatically test ALL combinations of the provided parameters. The default for Grid Search is to automatically apply k-fold cross validation.
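To get a feel for how quickly "all combinations" grows, here is a small sketch that expands a parameter grid by hand with `itertools.product`. The helper `expand_grid` is hypothetical and is not part of scikit-learn, and the fold count of 5 is an assumption used only to estimate the number of fits.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the listed values as a list of dicts."""
    names = sorted(param_grid)
    return [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]


knn_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3],
}

combinations = expand_grid(knn_grid)
n_folds = 5  # assumption: number of cross-validation folds per combination

print(len(combinations))            # 5 * 2 * 3 = 30 candidate settings
print(len(combinations) * n_folds)  # 150 model fits under that assumption
print(combinations[0])
```

Adding one more parameter with five values would multiply the work by five. The decision tree grid in Part Three, for example, has 2 * 5 * 5 * 5 = 250 combinations.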

In [33]:
from sklearn.model_selection import GridSearchCV

tuned_parameters = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['distance', 'uniform'],
    'p': [1, 2, 3]
}

# These two lines will cause every possible combination of the above parameters to be fit and scored,
# which will take a LONG TIME with large datasets.
grid_s_classifier = GridSearchCV(KNeighborsClassifier(), tuned_parameters)

# NOTE: GridSearchCV performs k-fold cross validation by default.
# This is why we use the whole dataset, rather than the pre-split data we made above.
grid_s_classifier.fit(input_data, labels) 

print("Best parameters set found on development set:")
print()
print(grid_s_classifier.best_params_, grid_s_classifier.best_score_)
print()
print("Grid scores on development set:")
print()
means = grid_s_classifier.cv_results_['mean_test_score']
stds = grid_s_classifier.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_s_classifier.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))

Best parameters set found on development set:

{'n_neighbors': 9, 'p': 1, 'weights': 'uniform'} 0.6932786885245901

Grid scores on development set:

0.650 (+/-0.108) for {'n_neighbors': 3, 'p': 1, 'weights': 'distance'}
0.654 (+/-0.107) for {'n_neighbors': 3, 'p': 1, 'weights': 'uniform'}
0.608 (+/-0.155) for {'n_neighbors': 3, 'p': 2, 'weights': 'distance'}
0.614 (+/-0.129) for {'n_neighbors': 3, 'p': 2, 'weights': 'uniform'}
0.618 (+/-0.171) for {'n_neighbors': 3, 'p': 3, 'weights': 'distance'}
0.621 (+/-0.157) for {'n_neighbors': 3, 'p': 3, 'weights': 'uniform'}
0.673 (+/-0.068) for {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
0.687 (+/-0.074) for {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'}
0.647 (+/-0.102) for {'n_neighbors': 5, 'p': 2, 'weights': 'distance'}
0.644 (+/-0.108) for {'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
0.631 (+/-0.104) for {'n_neighbors': 5, 'p': 3, 'weights': 'distance'}
0.631 (+/-0.102) for {'n_neighbors': 5, 'p': 3, 'weights': 'uniform'}
0.670

# Part Three: Another Model

Similar to previous sections, there are many correct solutions to this part. Below is a grid search cross validation for a decision tree.

In [36]:
from sklearn.tree import DecisionTreeClassifier

tuned_parameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 10, 20],
    'min_samples_split': [2, 4, 8, 16, 32],
    'max_leaf_nodes': [None, 10, 20, 40, 80]
}

# These two lines will cause every possible combination of the above parameters to be fit and scored,
# which will take a LONG TIME with large datasets.
grid_tree = GridSearchCV(DecisionTreeClassifier(), tuned_parameters)

# NOTE: GridSearchCV performs k-fold cross validation by default.
# This is why we use the whole dataset, rather than the pre-split data we made above.
grid_tree.fit(input_data, labels) 

print("Best parameters set found on development set:")
print()
print(grid_tree.best_params_, grid_tree.best_score_)
print()
print("Grid scores on development set:")
print()
means = grid_tree.cv_results_['mean_test_score']
stds = grid_tree.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_tree.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
          % (mean, std * 2, params))

Best parameters set found on development set:

{'criterion': 'entropy', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 32} 0.8150273224043716

Grid scores on development set:

0.759 (+/-0.106) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 2}
0.746 (+/-0.118) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 4}
0.772 (+/-0.067) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 8}
0.755 (+/-0.100) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 16}
0.799 (+/-0.066) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': None, 'min_samples_split': 32}
0.795 (+/-0.115) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 10, 'min_samples_split': 2}
0.792 (+/-0.113) for {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 10, 'min_samples_split': 4}
0.795 (+/-0.105) for {'criterion': 'gin