# K-Nearest Neighbors

K-nearest neighbors (KNN) memorizes observations from a training set and predicts classification labels for new, unseen observations, based on how similar their features are to those in the training set.
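
To make the idea concrete, here is a minimal from-scratch sketch of the prediction step. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are made up for illustration and are not part of the notebook below.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored training observation
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Toy data (made up for illustration): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_near_first_group = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]))
label_near_second_group = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]))
print(label_near_first_group, label_near_second_group)
```

Nothing is learned ahead of time: the "model" is the stored training data, and all the work happens at prediction time.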

### Popular use cases:
1. Stock price prediction
2. Credit risk analysis
3. Predictive trip planning
4. Recommendation systems

### Prerequisites:
1. Data has little noise
2. Dataset is labeled
3. Dataset contains only relevant features
4. Dataset has distinguishable subgroups

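KNN is not well suited to large datasets: each prediction compares the new observation against the stored training observations, which becomes very time-consuming as the data grows.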

## Setup

In [1]:
import numpy as np
import pandas as pd
import scipy
import urllib
import sklearn

import matplotlib.pyplot as plt
from pylab import rcParams

from sklearn import neighbors, preprocessing, metrics
from sklearn.model_selection import train_test_split

In [3]:
from sklearn.neighbors import KNeighborsClassifier

In [2]:
np.set_printoptions(precision=4, suppress=True)
%matplotlib inline
rcParams['figure.figsize'] = 7, 4
# Matplotlib 3.6+ renamed the seaborn styles; on older versions use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')

## Importing data

In [4]:
address = './Data/mtcars.csv'
cars = pd.read_csv(address)
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

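# Predictors: fuel economy, displacement, horsepower, weight
X_prime = cars[['mpg', 'disp', 'hp', 'wt']].values
# Target: 'am' (transmission: 0 = automatic, 1 = manual), the same column as cars.iloc[:, 9]
y = cars['am'].values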

In [5]:
X = preprocessing.scale(X_prime)
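
KNN relies on distances, so features measured on large scales (such as `disp`) would otherwise dominate features on small scales (such as `wt`). The sketch below shows what `preprocessing.scale` does by comparing it with a manual standardization. The numbers in `X_raw` are made up for illustration and are not read from the mtcars file.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values on very different scales (roughly like mpg and disp)
X_raw = np.array([[21.0, 160.0],
                  [22.8, 108.0],
                  [18.7, 360.0],
                  [14.3, 440.0]])

X_scaled = preprocessing.scale(X_raw)

# Equivalent manual computation: subtract each column's mean,
# then divide by each column's (population) standard deviation
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes comparably to the distance.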

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=17)

## Building the model

In [7]:
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print(clf)

KNeighborsClassifier()
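
Called with no arguments, `KNeighborsClassifier` uses its default of 5 neighbors. As a rough sketch of how the choice of `n_neighbors` could be explored, the example below fits a few values of k and records test accuracy. It uses synthetic data from `make_classification` as a stand-in, not the mtcars data, so the scores say nothing about the model above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=1,
                                     random_state=17)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=17)

default_clf = KNeighborsClassifier()
print(default_clf.n_neighbors)  # the default number of neighbors

# Fit one model per candidate k and record its accuracy on the held-out data
scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)
print(scores)
```

Small values of k follow the training data closely, while larger values smooth the decision boundary.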


## Evaluating the model predictions

In [9]:
y_pred = clf.predict(X_test)

print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.67      0.80         3

    accuracy                           0.86         7
   macro avg       0.90      0.83      0.84         7
weighted avg       0.89      0.86      0.85         7



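#### Recall
Recall is a measure of the model's completeness: of all the observations that truly belong to a class, it is the fraction the model labeled correctly.

For class 1, the model found only 67% of the true class-1 observations (2 of 3).
Averaged over both classes (macro average), recall is 83%.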
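
The figures in the report can be reproduced from a confusion matrix. The label arrays below are hypothetical: they are chosen only to match the counts in the report (4 true class-0 cars, 3 true class-1 cars, one class-1 car misclassified as class 0) and are not the notebook's actual `y_test` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels consistent with the counts in the report
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)

# Recall per class: correct predictions / number of true members of the class
recall_per_class = cm.diagonal() / cm.sum(axis=1)
# Precision per class: correct predictions / number of predictions of the class
precision_per_class = cm.diagonal() / cm.sum(axis=0)
# Macro-average recall: unweighted mean of the per-class recalls
macro_recall = recall_per_class.mean()

print(cm)
print(recall_per_class, precision_per_class, macro_recall)
```

Recall divides by the row totals (what is truly there), while precision divides by the column totals (what the model returned), which is why the two differ for each class.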