<div class="alert alert-block alert-warning">

# K-Nearest Neighbor Exercises

Create a new notebook, `knn_model`, and work with the Titanic dataset to answer the following:

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler

from acquire import new_titanic_data
from prepare import prep_titanic, split_data

#### Acquire 

In [2]:
# Acquire data
titanic = prep_titanic(new_titanic_data())
titanic.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,3,male,22.0,1,0,7.25,Southampton,0
1,1,1,female,38.0,1,0,71.2833,Cherbourg,0
2,1,3,female,26.0,0,0,7.925,Southampton,1
3,1,1,female,35.0,1,0,53.1,Southampton,0
4,0,3,male,35.0,0,0,8.05,Southampton,1


In [3]:
titanic['sex'] = titanic.sex.map({'male': 1, 'female': 0})
titanic['embark_town'] = titanic.embark_town.map({'Southampton': 0, 'Queenstown': 1, 'Cherbourg': 2})
titanic['age'] = titanic.age.astype(int)

In [4]:
# take a look
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone
0,0,3,1,22,1,0,7.25,0,0
1,1,1,0,38,1,0,71.2833,2,0
2,1,3,0,26,0,0,7.925,0,1
3,1,1,0,35,1,0,53.1,0,0
4,0,3,1,35,0,0,8.05,0,1


#### Prepare

In [5]:
# Split the data into train, validate, and test sets
train, validate, test = split_data(titanic, 'survived')

#### Isolate the target variable

In [6]:
# we know what our X and y are, let's be explicit about defining them
X_train = train.drop(columns='survived')
y_train = train.survived

X_val = validate.drop(columns='survived')
y_val = validate.survived

X_test = test.drop(columns='survived')
y_test = test.survived

#### Create the baseline

In [7]:
# write a function to compute the baseline for a classification model

def establish_baseline(y_train):
    #  establish the value we will predict for all observations
    baseline_prediction = y_train.mode()

    # create a series of predictions with that value, 
    # the same length as our training set
    y_train_pred = pd.Series((baseline_prediction[0]), range(len(y_train)))

    # compute accuracy of baseline
    cm = confusion_matrix(y_train, y_train_pred)
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp+tn)/(tn+fp+fn+tp)
    return accuracy

In [8]:
establish_baseline(y_train)

0.5943775100401606
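For a binary target, predicting the mode for every row gives an accuracy equal to the share of the most frequent class, so the baseline can be cross-checked without a confusion matrix. A minimal sketch, using a small made-up series (`y_example` is hypothetical, not the Titanic target):

```python
import pandas as pd

# Hypothetical target: 6 of 10 observations belong to class 0
y_example = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Share of the most frequent class == accuracy of always predicting the mode
baseline_accuracy = y_example.value_counts(normalize=True).max()

# Same number, computed the long way
mode_value = y_example.mode()[0]
long_way = (y_example == mode_value).mean()

print(baseline_accuracy, long_way)
```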

<div class="alert alert-block alert-success">

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [24]:
mms = MinMaxScaler()

X_train[['age', 'fare']] = mms.fit_transform(X_train[['age', 'fare']])
X_val[['age', 'fare']] = mms.transform(X_val[['age', 'fare']])

X_train.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embark_town,alone
381,3,0,0.014085,0,2,0.030726,2,0
567,3,0,0.408451,0,4,0.041136,0,0
296,3,1,0.323944,0,0,0.01411,2,1
155,1,1,0.71831,0,1,0.119804,2,0
521,3,1,0.309859,0,0,0.015412,0,1


In [16]:
# MAKE the thing
knn = KNeighborsClassifier()

# FIT the thing
knn.fit(X_train, y_train)

# USE the thing
y_train_pred = knn.predict(X_train)
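To make the `predict` step less of a black box, here is a rough sketch of the idea behind it: find the k closest training rows by Euclidean distance and take a majority vote. `knn_predict_one` is a hypothetical helper written for illustration, not scikit-learn's implementation, and the toy data is made up.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def knn_predict_one(X_fit, y_fit, x_new, k=5):
    """Hypothetical helper: classify x_new by majority vote of its k nearest rows."""
    distances = np.linalg.norm(X_fit - x_new, axis=1)    # Euclidean distance to every row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    return Counter(y_fit[nearest]).most_common(1)[0][0]  # most frequent label among them

# Made-up, well-separated toy data
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([0.15, 0.15])

manual = knn_predict_one(X_toy, y_toy, x_query, k=3)
library = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy).predict([x_query])[0]

print(manual, library)
```

The sketch ignores tie-breaking, so it is only meant to agree with the library on clear-cut cases like this one.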

<div class="alert alert-block alert-success">

2. Evaluate your results using the model score, confusion matrix, and classification report.

In [11]:
# accuracy score of train set
train_score = knn.score(X_train, y_train)
train_score

0.8514056224899599

In [12]:
# confusion matrix
cm = confusion_matrix(y_train, y_train_pred)
pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], columns=['Pred 0', 'Pred 1'])

Unnamed: 0,Pred 0,Pred 1
Actual 0,133,15
Actual 1,22,79


In [13]:
# classification report
pd.DataFrame(classification_report(y_train, y_train_pred, output_dict=True))

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.858065,0.840426,0.851406,0.849245,0.85091
recall,0.898649,0.782178,0.851406,0.840413,0.851406
f1-score,0.877888,0.810256,0.851406,0.844072,0.850455
support,148.0,101.0,0.851406,249.0,249.0


<div class="alert alert-block alert-success">

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [18]:
def print_cm_metrics(cm):
    tn, fp, fn, tp = cm.ravel()

    accuracy = (tp + tn)/(tn + fp + fn + tp)

    true_positive_rate = tp/(tp + fn)
    false_positive_rate = fp/(fp + tn)
    true_negative_rate = tn/(tn + fp)
    false_negative_rate = fn/(fn + tp)

    precision = tp/(tp + fp)
    recall = tp/(tp + fn)
    f1_score = 2*(precision*recall)/(precision+recall)

    support_pos = tp + fn
    support_neg = fp + tn

    output = {
        'metric': ['accuracy', 'true_positive_rate', 'false_positive_rate',
                   'true_negative_rate', 'false_negative_rate', 'precision',
                   'recall', 'f1_score', 'support_pos', 'support_neg'],
        'score': [accuracy, true_positive_rate, false_positive_rate,
                  true_negative_rate, false_negative_rate, precision,
                  recall, f1_score, support_pos, support_neg],
    }

    return pd.DataFrame(output)

In [19]:
print_cm_metrics(cm)

Unnamed: 0,metric,score
0,accuracy,0.851406
1,true_positive_rate,0.782178
2,false_positive_rate,0.101351
3,true_negative_rate,0.898649
4,false_negative_rate,0.217822
5,precision,0.840426
6,recall,0.782178
7,f1_score,0.810256
8,support_pos,101.0
9,support_neg,148.0
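The hand-computed values can be cross-checked against scikit-learn's own metric functions. The sketch below rebuilds label arrays from the four counts in the training confusion matrix above (133, 15, 22, 79) and scores them; the array construction is just a convenience for the check, not part of the original workflow.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts copied from the training confusion matrix above
tn, fp, fn, tp = 133, 15, 22, 79

# Rebuild label arrays that produce exactly those counts
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)    # same quantity as the true positive rate
f1 = f1_score(y_true, y_pred)

print(f"accuracy:  {acc:.6f}")
print(f"precision: {prec:.6f}")
print(f"recall:    {rec:.6f}")
print(f"f1-score:  {f1:.6f}")
```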


<div class="alert alert-block alert-success">

4. Run through steps 1-3 setting k to 10

In [25]:
def knn_scale_fit_predict(k, X_train, y_train, X_validate):
    # scale age and fare; note this overwrites those columns on the frames passed in
    mms = MinMaxScaler()

    X_train[['age', 'fare']] = mms.fit_transform(X_train[['age', 'fare']])
    # use the X_validate argument here, not the global X_val
    X_validate[['age', 'fare']] = mms.transform(X_validate[['age', 'fare']])

    # MAKE the thing
    knn = KNeighborsClassifier(n_neighbors=k)

    # FIT the thing
    knn.fit(X_train, y_train)

    # USE the thing
    y_train_pred = knn.predict(X_train)
    y_validate_pred = knn.predict(X_validate)
    
    return knn, y_train_pred, y_validate_pred
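The function above rescales the frames it is given in place. One alternative, shown here only as a sketch of a different technique (scikit-learn's `Pipeline` with a `ColumnTransformer`, not what this notebook uses), bundles the scaler and the classifier so the input frames are left untouched. `make_knn_pipeline` and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def make_knn_pipeline(k, scale_cols=("age", "fare")):
    """Hypothetical helper: scale selected columns, pass the rest through, then KNN."""
    scaler = ColumnTransformer(
        [("minmax", MinMaxScaler(), list(scale_cols))],
        remainder="passthrough",
    )
    return Pipeline([("scale", scaler),
                     ("knn", KNeighborsClassifier(n_neighbors=k))])

# Made-up rows that reuse three of the notebook's column names
X_toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2, 3],
    "age": [22, 38, 26, 35, 54, 2],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
})
y_toy = pd.Series([0, 1, 1, 1, 0, 0])
X_before = X_toy.copy()

pipe = make_knn_pipeline(k=3).fit(X_toy, y_toy)
toy_pred = pipe.predict(X_toy)

print(toy_pred)
print(X_toy.equals(X_before))  # the input frame is left unscaled
```

With this design, `pipe.score(X, y)` applies the training-fitted scaling internally, so the same unscaled frame can be reused for every value of k.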


In [26]:
def evaluate_clf(model, X, y, y_pred):
    # model score
    accuracy = model.score(X, y)

    # confusion matrix
    cm = confusion_matrix(y, y_pred)
    cmdf = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], 
                       columns=['Pred 0', 'Pred 1'])

    # classification report
    crdf = pd.DataFrame(classification_report(y, y_pred, output_dict=True))
    
    # confusion matrix metrics
    metrics = print_cm_metrics(cm)
    
    return accuracy, cmdf, crdf, metrics

In [29]:
def print_results():
    print(f"""KNN where K = {k}
    ********Train Evaluation********
    Accuracy: {accuracy_t}
    Confusion Matrix:
    {cmdf_t}
    Classification Report:
    {crdf_t}
    Metrics: 
    {met_t}
    ________________________________________________
    ********Validate Evaluation********
    Accuracy: {accuracy_v}
    Confusion Matrix:
    {cmdf_v}
    Classification Report:
    {crdf_v}
    Metrics: 
    {met_v}
    """)

In [31]:
k = 10
knn, y_train_pred, y_validate_pred = knn_scale_fit_predict(k, X_train, y_train, X_val)
accuracy_t, cmdf_t, crdf_t, met_t = evaluate_clf(knn, X_train, y_train, y_train_pred)

accuracy_v, cmdf_v, crdf_v, met_v = evaluate_clf(knn, X_val, y_val, y_validate_pred)

print_results()

KNN where K = 10
    ********Train Evaluation********
    Accuracy: 0.8313253012048193
    Confusion Matrix:
              Pred 0  Pred 1
Actual 0     137      11
Actual 1      31      70
    Classification Report:
                        0           1  accuracy   macro avg  weighted avg
precision    0.815476    0.864198  0.831325    0.839837      0.835239
recall       0.925676    0.693069  0.831325    0.809372      0.831325
f1-score     0.867089    0.769231  0.831325    0.818160      0.827395
support    148.000000  101.000000  0.831325  249.000000    249.000000
    Metrics: 
                    metric       score
0             accuracy    0.831325
1   true_positive_rate    0.693069
2  false_positive_rate    0.074324
3   true_negative_rate    0.925676
4  false_negative_rate    0.306931
5            precision    0.864198
6               recall    0.693069
7             f1_score    0.769231
8          support_pos  101.000000
9          support_neg  148.000000
    ________________________

<div class="alert alert-block alert-success">

5. Run through steps 1-3 setting k to 20

In [32]:
k = 20
knn, y_train_pred, y_validate_pred = knn_scale_fit_predict(k, X_train, y_train, X_val)
accuracy_t, cmdf_t, crdf_t, met_t = evaluate_clf(knn, X_train, y_train, y_train_pred)

accuracy_v, cmdf_v, crdf_v, met_v = evaluate_clf(knn, X_val, y_val, y_validate_pred)

print_results()

KNN where K = 20
    ********Train Evaluation********
    Accuracy: 0.8192771084337349
    Confusion Matrix:
              Pred 0  Pred 1
Actual 0     133      15
Actual 1      30      71
    Classification Report:
                        0           1  accuracy   macro avg  weighted avg
precision    0.815951    0.825581  0.819277    0.820766      0.819857
recall       0.898649    0.702970  0.819277    0.800809      0.819277
f1-score     0.855305    0.759358  0.819277    0.807332      0.816387
support    148.000000  101.000000  0.819277  249.000000    249.000000
    Metrics: 
                    metric       score
0             accuracy    0.819277
1   true_positive_rate    0.702970
2  false_positive_rate    0.101351
3   true_negative_rate    0.898649
4  false_negative_rate    0.297030
5            precision    0.825581
6               recall    0.702970
7             f1_score    0.759358
8          support_pos  101.000000
9          support_neg  148.000000
    ________________________

<div class="alert alert-block alert-success">

6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

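* Of the two k options (10 and 20), k = 10 performs better on the in-sample data: accuracy is 0.831 vs. 0.819, precision is 0.864 vs. 0.826, and the f1-score for the survived class is 0.769 vs. 0.759. The k = 20 model has slightly higher recall (0.703 vs. 0.693), so it catches one more survivor at the cost of four more false positives. A smaller k follows the training points more closely, which is why in-sample scores tend to fall as k grows. k = 10 also works best on the out-of-sample data.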
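The in-sample figures for the two models can be lined up side by side by recomputing them from the training confusion matrices printed above. This is a small sketch for comparison only; the counts are copied by hand from those outputs.

```python
import pandas as pd

# (tn, fp, fn, tp) copied from the training confusion matrices printed above
train_counts = {10: (137, 11, 31, 70), 20: (133, 15, 30, 71)}

rows = []
for k, (tn, fp, fn, tp) in train_counts.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows.append({
        "k": k,
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
    })

comparison = pd.DataFrame(rows).set_index("k")
print(comparison.round(6))
```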

<div class="alert alert-block alert-success">

7. Which model performs best on our out-of-sample data from validate?

In [None]:
metrics = []

for k in range(1,21):
    knn, y_train_pred, y_val_pred = knn_scale_fit_predict(k, X_train, 
                                                    y_train, 
                                                    X_val)
    train_acc = knn.score(X_train, y_train)
    val_acc = knn.score(X_val, y_val)
    
    output = {
            "k": k,
            "train_accuracy": train_acc,
            "validate_accuracy": val_acc
    }

    metrics.append(output)
    
eval_df = pd.DataFrame(metrics)
eval_df['difference'] = eval_df['train_accuracy'] - eval_df['validate_accuracy']

eval_df

* k = 12 is the best model, because its difference between train and validate accuracy is closest to zero.
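The rule above picks the k with the smallest train/validate gap. A second common rule is to pick the k with the highest validate accuracy, and the two can disagree. The sketch below shows both selections on a frame shaped like `eval_df`; the accuracies are hypothetical values chosen for illustration, not results from this notebook.

```python
import pandas as pd

# Hypothetical accuracies, NOT the notebook's results
eval_example = pd.DataFrame({
    "k": [1, 5, 10, 15, 20],
    "train_accuracy": [0.99, 0.85, 0.83, 0.82, 0.79],
    "validate_accuracy": [0.72, 0.78, 0.80, 0.81, 0.79],
})
eval_example["difference"] = (
    eval_example["train_accuracy"] - eval_example["validate_accuracy"]
)

# Rule 1: smallest absolute gap between train and validate accuracy
k_smallest_gap = eval_example.loc[eval_example["difference"].abs().idxmin(), "k"]

# Rule 2: highest validate accuracy
k_best_validate = eval_example.loc[eval_example["validate_accuracy"].idxmax(), "k"]

print(k_smallest_gap, k_best_validate)
```

A near-zero gap only says the model generalizes consistently; it can still be consistently mediocre, so it helps to look at both columns before settling on k.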