# K-Nearest Neighbor Exercises

Create a new notebook, `knn_model`, and work with the titanic dataset to answer the following:

In [1]:
#data manipulation
import pandas as pd
import numpy as np

#visualization
import matplotlib.pyplot as plt
import seaborn as sns

# statistical tests
from scipy import stats

#my own files with my own functions
import acquire
import prepare

# os handles operating-system tasks such as file paths
# env is my .py file with the credentials for the SQL databases
import os
import env

# pydataset offers other datasets if I need them, but they come raw (uncleaned)
from pydataset import data

# ML stuff: (modeling imports)
from sklearn.model_selection import train_test_split

# The big 4 for classification
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression #logistic not linear!
from sklearn.neighbors import KNeighborsClassifier #pick the classifier one

# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
df = acquire.get_titanic_data()

this file exists, reading csv


In [3]:
df = prepare.clean_titanic(df)

In [4]:
train, validate, test = prepare.splitting_data(df, 'survived', seed=123)
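
The body of `prepare.splitting_data` is not shown in this notebook. As a rough sketch of what a stratified three-way split could look like, the hypothetical helper below (`split_train_validate_test`, with assumed 20% test and 30% validate proportions) calls `train_test_split` twice. The toy frame is made up for illustration and is not the titanic data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_validate_test(df, target, seed=123):
    """Hypothetical stand-in for prepare.splitting_data: stratified three-way split."""
    # First carve off 20% of the rows as the test set (assumed proportion).
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Then split the remainder 70/30 into train and validate (assumed proportion).
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy frame standing in for the cleaned titanic data.
toy = pd.DataFrame({"fare": range(100), "survived": [0, 1] * 50})
toy_train, toy_validate, toy_test = split_train_validate_test(toy, "survived")
print(len(toy_train), len(toy_validate), len(toy_test))
```

Stratifying on the target keeps the share of survivors roughly equal across the three samples, which makes their accuracy scores easier to compare.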

In [5]:
train, validate, test = prepare.preprocess_titanic(train, validate, test)

In [6]:
### We want everything EXCEPT the target variable
X_train = train.drop(columns = 'survived')
X_validate = validate.drop(columns = 'survived')
X_test = test.drop(columns = 'survived')

In [7]:
### We want ONLY the target variable
y_train = train.survived
y_validate = validate.survived
y_test = test.survived

### The setup above is carried over from the decision tree exercises and is reused for the remaining machine learning models and ensemble methods

## 1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [9]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [10]:
y_pred = knn.predict(X_train)
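
To make the prediction step less of a black box, here is a minimal from-scratch sketch of the idea behind `knn.predict`: measure the Euclidean distance to every training row, take the `k` closest, and let them vote. The function name `knn_predict_one` and the toy arrays are made up for illustration. scikit-learn's implementation is more efficient and may break ties differently.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest training rows."""
    # Euclidean distance from x_new to every training row.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (made up for illustration).
X_toy = np.array(
    [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
)
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict_one(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_high = knn_predict_one(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_low, pred_high)
```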

## 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [11]:
knn.score(X_train, y_train)

0.7798594847775175

In [15]:
pd.crosstab(y_train, y_pred, rownames=['actual'], colnames=['pred'])

| actual \ pred | 0 | 1 |
|---|---|---|
| 0 | 210 | 44 |
| 1 | 50 | 123 |


In [14]:
confusion_matrix(y_train, y_pred)

array([[210,  44],
       [ 50, 123]])
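
Both outputs use the same layout: rows are actual classes and columns are predicted classes. The sketch below rebuilds label arrays that reproduce the reported counts (the real `y_train` and `y_pred` are not available here, so the row order is arbitrary) and unpacks the four cells with `ravel()`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Rebuild label arrays that reproduce the counts reported above
# (210 TN, 44 FP, 50 FN, 123 TP); the row order is arbitrary.
y_true = np.array([0] * 210 + [0] * 44 + [1] * 50 + [1] * 123)
y_hat = np.array([0] * 210 + [1] * 44 + [0] * 50 + [1] * 123)

cm = confusion_matrix(y_true, y_hat)
# For binary labels 0/1, ravel() walks the matrix row by row:
# actual 0 first (TN, FP), then actual 1 (FN, TP).
tn, fp, fn, tp = cm.ravel()

# crosstab arranges the same counts with labeled axes.
ct = pd.crosstab(y_true, y_hat, rownames=["actual"], colnames=["pred"])
print(cm)
print(tn, fp, fn, tp)
```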

In [16]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.83      0.82       254
           1       0.74      0.71      0.72       173

    accuracy                           0.78       427
   macro avg       0.77      0.77      0.77       427
weighted avg       0.78      0.78      0.78       427



## 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [17]:
prepare.compute_class_metrics(y_train, y_pred)

Accuracy: 0.7798594847775175

True Positive Rate/Sensitivity/Recall/Power: 0.7109826589595376
False Positive Rate/False Alarm Ratio/Fall-out: 0.1732283464566929
True Negative Rate/Specificity/Selectivity: 0.8267716535433071
False Negative Rate/Miss Rate: 0.28901734104046245

Precision/PPV: 0.7365269461077845
F1 Score: 0.7235294117647059

Support (0): 254
Support (1): 173
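
The source of `prepare.compute_class_metrics` is not shown, so the following is only a sketch of how these rates follow from the four confusion-matrix cells. The function name `rates_from_counts` is hypothetical. The counts are the ones reported in step 2, and the support values agree with the classification report (254 rows of class 0, 173 rows of class 1).

```python
def rates_from_counts(tn, fp, fn, tp):
    """Derive the usual binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": recall,           # sensitivity / recall
        "false_positive_rate": fp / (fp + tn),  # fall-out
        "true_negative_rate": tn / (tn + fp),   # specificity
        "false_negative_rate": fn / (fn + tp),  # miss rate
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "support_0": tn + fp,  # number of actual negatives
        "support_1": tp + fn,  # number of actual positives
    }


metrics = rates_from_counts(tn=210, fp=44, fn=50, tp=123)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Note that the true positive rate and false negative rate sum to 1, as do the true negative rate and false positive rate, so only two of the four rates carry independent information.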


## 4. Run through steps 1-3 setting k to 10

In [18]:
knn10 = KNeighborsClassifier(n_neighbors = 10)
knn10.fit(X_train, y_train)
y_pred = knn10.predict(X_train)
prepare.compute_class_metrics(y_train, y_pred)

Accuracy: 0.7189695550351288

True Positive Rate/Sensitivity/Recall/Power: 0.47398843930635837
False Positive Rate/False Alarm Ratio/Fall-out: 0.1141732283464567
True Negative Rate/Specificity/Selectivity: 0.8858267716535433
False Negative Rate/Miss Rate: 0.5260115606936416

Precision/PPV: 0.7387387387387387
F1 Score: 0.5774647887323944

Support (0): 254
Support (1): 173


## 5. Run through steps 1-3 setting k to 20

In [19]:
knn20 = KNeighborsClassifier(n_neighbors = 20)
knn20.fit(X_train, y_train)
y_pred = knn20.predict(X_train)
prepare.compute_class_metrics(y_train, y_pred)

Accuracy: 0.7236533957845434

True Positive Rate/Sensitivity/Recall/Power: 0.4797687861271676
False Positive Rate/False Alarm Ratio/Fall-out: 0.11023622047244094
True Negative Rate/Specificity/Selectivity: 0.889763779527559
False Negative Rate/Miss Rate: 0.5202312138728323

Precision/PPV: 0.7477477477477478
F1 Score: 0.5845070422535211

Support (0): 254
Support (1): 173


## 6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

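> The default model (k = 5) performed best on the in-sample data: accuracy of 0.780, versus 0.719 for k = 10 and 0.724 for k = 20.
> The largest difference is in recall, which falls from 0.711 at k = 5 to 0.474 (k = 10) and 0.480 (k = 20), pulling the F1 score down from 0.724 to 0.577 and 0.585. Precision barely moves (0.737, 0.739, 0.748), and specificity rises from 0.827 to 0.886 and 0.890, so the larger-k models predict "survived" less often.
> A smaller k follows the training data more closely. When predicting on the training sample, each row counts among its own nearest neighbors, so in-sample accuracy tends to favor small k.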
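
To illustrate why in-sample scores tend to favor a small k, the sketch below fits several values of k on synthetic data from `make_classification` (not the titanic data; the sample size and noise level are arbitrary assumptions). With k = 1, every training row is its own nearest neighbor, so training accuracy is perfect as long as no two identical rows carry different labels.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with some label noise (flip_y).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, flip_y=0.1, random_state=123
)

train_scores = {}
for k in (1, 5, 10, 20):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_syn, y_syn)
    # Score on the same rows the model was fit on (in-sample accuracy).
    train_scores[k] = model.score(X_syn, y_syn)
    print(f"k={k:>2}  train accuracy={train_scores[k]:.3f}")
```

A perfect training score at k = 1 says nothing about how well the model generalizes, which is why the comparison on validate matters.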

## 7. Which model performs best on our out-of-sample data from validate?

In [20]:
knn.score(X_train, y_train)

0.7798594847775175

In [21]:
knn10.score(X_train, y_train)

0.7189695550351288

In [22]:
knn20.score(X_train, y_train)

0.7236533957845434

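> The three scores above were computed on `X_train` and `y_train`, so they repeat the in-sample accuracies (0.780, 0.719, 0.724). On that basis the default model with 5 nearest neighbors ranks first, but the out-of-sample comparison requires scoring each model on `X_validate` and `y_validate` instead.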
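
An out-of-sample comparison scores each fitted model on rows it has not seen. The sketch below shows the pattern on synthetic data (an assumption, since the titanic splits are not available here). In this notebook the equivalent calls would be `knn.score(X_validate, y_validate)`, `knn10.score(X_validate, y_validate)`, and `knn20.score(X_validate, y_validate)`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared data.
X_all, y_all = make_classification(
    n_samples=600, n_features=6, flip_y=0.1, random_state=123
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.3, random_state=123, stratify=y_all
)

results = []
for k in (5, 10, 20):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results.append(
        {
            "k": k,
            "train_accuracy": model.score(X_tr, y_tr),      # in-sample
            "validate_accuracy": model.score(X_val, y_val),  # out-of-sample
        }
    )

for row in results:
    print(row)
```

A large gap between the train and validate columns for a given k suggests overfitting. The model to prefer is the one with the best validate accuracy, not the best train accuracy.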