# k-Nearest Neighbor
- Distance functions allow us to identify the points closest to a given target, that is, the nearest neighbors (NN) of a given point. This is the basis of nearest neighbor classification.
- The advantages of NN include simplicity, interpretability and non-linearity.
- KNN is a supervised classifier that memorizes observations from a labeled training set and predicts classifications for new, unlabeled observations.
- It is not well suited to large datasets: every prediction has to search the stored training observations, which takes too much time.
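
To make the idea of a distance function concrete, here is a minimal sketch using NumPy. The helper `minkowski_distance` and the sample points are hypothetical and exist only for illustration. The Minkowski distance is used because it covers two common cases: `p=1` gives the Manhattan distance and `p=2` gives the Euclidean distance.

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two vectors (p=1: Manhattan, p=2: Euclidean)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

# Hypothetical points and target, chosen only for illustration
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
target = np.array([1.0, 2.0])

# Euclidean distance from each point to the target
euclidean = [minkowski_distance(pt, target, p=2) for pt in points]

# Index of the nearest neighbor of the target
nearest_index = int(np.argmin(euclidean))
print(nearest_index, euclidean)
```

The nearest neighbor is simply the point with the smallest distance; changing `p` changes what "closest" means.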


**k-Nearest Neighbors**: <br>
Given a positive integer $k$ and a point $x_0$, the KNN classifier first identifies the $k$ points in the training data most similar to $x_0$, then estimates the conditional probability of $x_0$ being in class $j$ as the fraction of the $k$ points whose labels are $j$. The optimal value for $k$ can be found using cross validation.
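
The definition above can be sketched directly in NumPy. This is an illustrative implementation, not the one scikit-learn uses: the function name `knn_class_probabilities`, the toy data and the choice of Euclidean distance as the similarity measure are all assumptions made for the example.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x0, k=3):
    """Estimate P(class j | x0) as the fraction of the k nearest points labeled j."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x0 = np.asarray(x0, dtype=float)

    # Euclidean distance from x0 to every training point (assumed similarity measure)
    distances = np.sqrt(((X_train - x0) ** 2).sum(axis=1))

    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    labels = y_train[nearest]

    # Fraction of the k neighbors in each class
    return {int(c): float(np.mean(labels == c)) for c in np.unique(y_train)}

# Hypothetical toy data: three points of class 0, two of class 1
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.0], [4.2, 3.9]])
y_toy = np.array([0, 0, 0, 1, 1])

probs = knn_class_probabilities(X_toy, y_toy, x0=[3.0, 3.0], k=3)
predicted = max(probs, key=probs.get)  # class with the largest estimated probability
print(probs, predicted)
```

With `k=3`, two of the three nearest neighbors of `x0` belong to class 1, so the estimated probability for class 1 is 2/3 and that class is predicted.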

**Assumptions**<br>
The dataset:<br>
- has little noise
- is labeled
- contains relevant features
- has distinguishable subgroups


## Setting up for classification analysis

In [2]:
import numpy as np
import pandas as pd
import scipy
import urllib
import sklearn

import matplotlib.pyplot as plt
from pylab import rcParams

from sklearn import neighbors
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

In [3]:
np.set_printoptions(precision=4, suppress=True) 
%matplotlib inline
rcParams['figure.figsize'] = 7, 4
plt.style.use('seaborn-v0_8-whitegrid')  # named 'seaborn-whitegrid' in matplotlib < 3.6

## Importing your data

In [4]:
address = 'data/mtcars.csv'

cars = pd.read_csv(address)
cars.columns = ['car_names','mpg','cyl','disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']

X_prime = cars[['mpg', 'disp', 'hp', 'wt']].values
y = cars.iloc[:,9].values # am variable in column at index 9

## Scaling and splitting your data

In [5]:
X = preprocessing.scale(X_prime) # standardize each feature to zero mean and unit variance
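
Scaling matters for KNN because features with large numeric ranges (such as `disp`) would otherwise dominate the distance. As a rough sketch of what `preprocessing.scale` does, the snippet below compares it with a manual z-score. The array `X_demo` holds illustrative values in the same column order as `X_prime` (mpg, disp, hp, wt); it is an assumption for the example and is not read from `data/mtcars.csv`.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative rows in the column order mpg, disp, hp, wt (assumed values)
X_demo = np.array([[21.0, 160.0, 110.0, 2.62],
                   [22.8, 108.0, 93.0, 2.32],
                   [18.7, 360.0, 175.0, 3.44],
                   [14.3, 360.0, 245.0, 3.57]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation using scikit-learn
X_scaled = preprocessing.scale(X_demo)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distance.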

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=17)

## Building and training your model with training data

In [10]:
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)



## Model Parameters

In [11]:
clf.get_params() 

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 5,
 'p': 2,
 'weights': 'uniform'}
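
The default `n_neighbors` of 5 is not necessarily the best choice. One way to pick k by cross validation is sketched below. It uses synthetic data from `make_classification` rather than the cars data, and the candidate list `candidate_ks` and the 5-fold setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (an assumption; stands in for a real, scaled feature matrix)
X_syn, y_syn = make_classification(n_samples=120, n_features=4, n_informative=3,
                                   n_redundant=1, random_state=0)

candidate_ks = [1, 3, 5, 7, 9]  # hypothetical grid of k values
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X_syn, y_syn, cv=5)
    mean_scores[k] = float(scores.mean())

# k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

With a dataset as small as the cars data (32 rows), the fold scores will be noisy, so the selected k should be treated as a rough guide.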

## Evaluating your model's predictions

**Recall**: a measure of the model's completeness.
- Recall (also known as sensitivity) is $\frac{tp}{tp+fn}$.
- It is the fraction of relevant instances that were retrieved.
- In the report below, 67% of all points with label 1 were recognized as such.

**Macro avg**:<br>
the unweighted mean of the per-class scores. For recall, $\frac{1.00+0.67}{2} \approx 0.83$, so on average 83% of the points in each class were recognized as such.

**Precision**: a measure of the model's relevancy.
- Precision (also known as positive predictive value) is $\frac{tp}{tp+fp}$.
- It is the fraction of correct predictions among the retrieved instances.

High precision + low recall<br>
= few results returned, but most of the label predictions that are returned are correct:<br>
high relevancy but low completeness.


In [12]:
y_pred = clf.predict(X_test)
y_expect = y_test

print(metrics.classification_report(y_expect, y_pred))

              precision    recall  f1-score   support

           0       0.80      1.00      0.89         4
           1       1.00      0.67      0.80         3

    accuracy                           0.86         7
   macro avg       0.90      0.83      0.84         7
weighted avg       0.89      0.86      0.85         7
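
The figures in the report can be checked by hand. The labels below are reconstructed from the counts in the report (4 points of class 0, all predicted correctly; 3 points of class 1, one of them predicted as 0). They are an assumption consistent with the report and do not reflect the real ordering of `y_test`.

```python
from sklearn import metrics

# Labels reconstructed from the report's counts (assumed ordering)
y_true_demo = [0, 0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_1 = tp / (tp + fp)  # precision for label 1
recall_1 = tp / (tp + fn)     # recall for label 1
recall_0 = tn / (tn + fp)     # recall for label 0

# Macro average: unweighted mean of the per-class recalls
macro_recall = (recall_0 + recall_1) / 2

print(round(precision_1, 2), round(recall_1, 2), round(macro_recall, 2))
```

Label 1 has perfect precision because no class 0 point was predicted as 1, but its recall drops to 2/3 because one class 1 point was missed.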

