# k-Nearest-Neighbors
k-Nearest-Neighbors (kNN) is a classification method that assigns a data point the category held by the majority of its k nearest neighbors in the training data. This applies to anomaly/hospitalization detection because a new data point can be checked for similarity to known anomalies/hospitalizations.

For more information, see the scikit-learn documentation on nearest neighbors:

https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors
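
To make the majority vote concrete, here is a minimal sketch of the idea in plain NumPy. It is an illustration only, not how scikit-learn implements the search: the helper name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed (it is also the default metric of `KNeighborsClassifier`).

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins the vote.
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters labelled 0 and 1.
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(toy_x, toy_y, np.array([0.3, 0.3]))
near_five = knn_predict(toy_x, toy_y, np.array([4.8, 5.0]))
print(near_origin, near_five)
```

With an odd `k` and two classes the vote cannot tie, which is one reason small odd values such as 3 are a common starting point.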

In [68]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score, recall_score, precision_score, classification_report

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split

random_state = 42

## Data Prep
We make sure that the training and testing datasets have the same ratio of hospitalizations. Hospitalizations are rare, so a purely random split could, by chance, leave one of the two sets with very few of them.

In [69]:
df = pd.read_csv('../data/frequencies.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,ProviderID_127318.0,ProviderID_203236159.0,ProviderID_486278.0,ServiceTypeName_AC Agency with Choice,ServiceTypeName_AC Attendant Care,ServiceTypeName_AC Companion Care,ServiceTypeName_AC Family Non Resident U7,ServiceTypeName_AC Family Resident,ServiceTypeName_AC Family Resident U7,...,TaskCodeGroupName_Toileting:Bathroom (58),TaskCodeGroupName_Toileting:Urinal (60),TaskCodeGroupName_Transfer:Chair (61),TaskCodeGroupName_Transfer:Gait Belt (62),TaskCodeGroupName_Transfer:Hoyer (63),TaskCodeGroupName_Transfer:Transfer (65),TaskCodeGroupName_Transfer:Walker (66),TaskCodeGroupName_Transfer:Wheelchair (67),TaskCodeGroupName_Transportation:Client Errands - Do Not Transport Client (68),hasHospitilization
0,0,0.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.0,0.0
1,1,0.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,0.0
2,2,0.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,0.0
3,3,0.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,34.0,0.0
4,4,0.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.0,0.0


In [70]:
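def balanceData(df, percentTrain):
    """Split df into train/test sets that keep the same hospitalization ratio.

    percentTrain is the fraction of each class that goes into the training set.
    """
    # Shuffle each class separately so the split can be taken per class.
    positive = shuffle(df[df['hasHospitilization'] == 1], random_state = random_state)
    negative = shuffle(df[df['hasHospitilization'] == 0], random_state = random_state)
    
    sizePos = int(len(positive) * percentTrain)
    sizeNeg = int(len(negative) * percentTrain)
    
    # The first percentTrain of each class is used for training, the rest for testing.
    trainPos = positive[:sizePos]
    trainNeg = negative[:sizeNeg]
    testPos = positive[sizePos:]
    testNeg = negative[sizeNeg:]
    
    train = pd.concat([trainPos, trainNeg])
    test = pd.concat([testPos, testNeg])
    
    # Every column except the label is used as a feature. Note that this
    # includes the 'Unnamed: 0' index columns left over from the CSV.
    train_x = train.loc[:, train.columns != 'hasHospitilization'].to_numpy()
    train_y = train['hasHospitilization'].to_numpy()
    test_x = test.loc[:, test.columns != 'hasHospitilization'].to_numpy()
    test_y = test['hasHospitilization'].to_numpy()
    
    return train_x, train_y, test_x, test_y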

In [71]:
train_x,train_y, test_x, test_y = balanceData(df, .7)

In [72]:
#y = df['hasHospitilization']
#x = df.drop(['hasHospitilization'], axis=1)
#train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.2)

#splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2)

# for train_index, test_index in splitter.split(x,y):
#    train_x, test_x = x[train_index], x[test_index]
#    train_y, test_y = y[train_index], y[test_index]
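
The per-class split done by `balanceData` can also be expressed with scikit-learn's built-in stratification, which the commented-out cell above was heading toward. The following is a sketch on synthetic data, not the notebook's actual pipeline: the `demo_*` names, the array sizes, and the 2% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (hypothetical): 1000 rows, 2% positives.
rng = np.random.default_rng(42)
demo_x = rng.normal(size=(1000, 5))
demo_y = np.zeros(1000)
demo_y[:20] = 1.0

# stratify=demo_y keeps the positive rate the same in both partitions.
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_x, demo_y, train_size=0.7, stratify=demo_y, random_state=42
)

print("train positive rate:", demo_train_y.mean())
print("test positive rate:", demo_test_y.mean())
```

If a DataFrame is split by index arrays instead, as in the commented-out `StratifiedShuffleSplit` loop, positional indexing with `.iloc` would be needed rather than `x[train_index]`.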


## Training/Predicting with the Model
We have not yet tuned any hyperparameters; the values below are initial choices. This section will be updated once optimal hyperparameters have been tested for.

* n_neighbors: The number of neighbors that vote on which category a data point belongs to
* algorithm='ball_tree': The data structure used for the neighbor search; ball trees tend to be faster than kd_tree for higher-dimensional data like ours
* n_jobs: The number of parallel jobs used for the neighbor search; -1 means all available processors are used
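
One possible way to test for better hyperparameters is a cross-validated grid search. The sketch below runs on synthetic data so that it is self-contained; the candidate values, the `demo_*` names, and the choice of F1 as the selection metric are assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (hypothetical stand-in for frequencies.csv).
demo_x, demo_y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Candidate values are illustrative assumptions, not tuned recommendations.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(
    KNeighborsClassifier(algorithm="ball_tree"),
    param_grid,
    scoring="f1",  # accuracy would be dominated by the majority class
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(demo_x, demo_y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Stratified folds are used for the same reason as in the data prep step: each fold should contain a comparable share of the rare positive class.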

In [73]:
model = KNeighborsClassifier(n_neighbors=3, algorithm='ball_tree', n_jobs= -1)
model.fit(train_x,train_y)

KNeighborsClassifier(algorithm='ball_tree', n_jobs=-1, n_neighbors=3)

In [74]:
preds = model.predict(test_x)

In [75]:
accuracy_score(test_y, preds)

0.9990021382751247

## Metrics
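Using standard anomaly detection metrics (precision, recall, F1 score), we can see that kNN detects hospitalizations well on this test set: precision 0.83, recall 0.93, and F1 0.88 for the positive class, which has 54 test samples. These metrics are more informative than the accuracy above, which is dominated by the 13976 negative samples: labeling every row as negative would already score about 0.996.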

In [76]:
print("Precision Score: ", precision_score(test_y, preds))
print("Recall Score: ", recall_score(test_y, preds))
print("F1 Score: ", f1_score(test_y, preds))
print(classification_report(test_y, preds))

Precision Score:  0.8333333333333334
Recall Score:  0.9259259259259259
F1 Score:  0.8771929824561403
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     13976
         1.0       0.83      0.93      0.88        54

    accuracy                           1.00     14030
   macro avg       0.92      0.96      0.94     14030
weighted avg       1.00      1.00      1.00     14030
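
To see how these scores relate to each other, the confusion matrix can be reconstructed from the report. The counts below are inferred from the reported values (recall 0.9259 = 50/54, precision 0.8333 = 50/60) and are an assumption rather than notebook output; the sketch rebuilds label arrays with those counts and recomputes the metrics.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Counts inferred from the reported scores (an assumption, not notebook output).
tp, fn, fp = 50, 4, 10
tn = 14030 - tp - fn - fp  # remaining rows are true negatives

# Rebuild label arrays that produce exactly these counts.
y_true = np.concatenate([np.ones(tp + fn), np.zeros(fp + tn)])
y_pred = np.concatenate([np.ones(tp), np.zeros(fn), np.ones(fp), np.zeros(tn)])

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

print(confusion_matrix(y_true, y_pred))
print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
print("accuracy:", accuracy)

# Baseline: predicting "no hospitalization" for every row.
baseline_accuracy = accuracy_score(y_true, np.zeros_like(y_true))
print("all-negative accuracy:", baseline_accuracy)
```

Under these inferred counts, the model misses 4 of the 54 hospitalizations and raises 10 false alarms, which is the trade-off that precision and recall summarize.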

