# K-Nearest Neighbors

K-Nearest Neighbors is an algorithm that uses euclidean distance to determine the distance between two points. Clusters are determined by their distance to one another, or once clusters are defined which one has the most samples within a cluster closest to any added point.

Euclidean distance is mathematically defined as below:

distance between two points = $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

1. Do the math in the parentheses which will give you the difference in distance among the x and y axes. The difference will be denoted with delta: $\Delta$
 - $P_\Delta = \sqrt{(x_\Delta)^2 + (y_\Delta)^2}$
2. Then square those differences.
 - $P_\Delta = \sqrt{x_\Delta^2 + y_\Delta^2}$
3. Add the resulting squares.
 - $P_\Delta = \sqrt{x_\Delta^2 + y_\Delta^2}$ = $\sqrt{z}$
4. Take the square root of the final sum, that is your euclidean distance between two points.
 - $P_\Delta = \sqrt{z}$
 
It is tempting to take the easy route and cancel out the exponents with the square root, but that does not work for this equation.

## Problem Statement:

Determine whether or not someone likely bought a car based on their age and income.

## Index
1. [Data Preprocessing](http://localhost:8888/notebooks/MachineLearningModels/K-Nearest%20Neighbors.ipynb#Data-Preprocessing)
2. [KMeans Model Training](http://localhost:8888/notebooks/MachineLearningModels/K-Nearest%20Neighbors.ipynb#Train-the-KMeans-Cluster-model)
3. [Predictions](http://localhost:8888/notebooks/MachineLearningModels/K-Nearest%20Neighbors.ipynb#Predictions-and-final-results)
4. [Conclusion](http://localhost:8888/notebooks/MachineLearningModels/Classification/K-Nearest%20Neighbors.ipynb#Conclusion)

## Data Preprocessing

### Importing libraries

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Import the dataset

In [11]:
dataset = pd.read_csv('data/Social_Network_Ads.csv')

x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

### Split the dataset

In [5]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)

### Feature Scaling

Not using a trig function this time, but smaller numbers improve the distance calculation.

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.fit_transform(x_test)

## Train the KMeans Cluster model

This relies on the distance between neighbors to form its clusters. In particular, it uses the distance between the five nearest points across both clusters to make its decision.

In [16]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(x_train, y_train)

KNeighborsClassifier()

## Predictions and final results

### Get the group of our prediction

In [17]:
print(classifier.predict(sc.transform([[30,87000]])))

[0]


### Compare our prediction clusters with their real clusters

In [18]:
y_pred = classifier.predict(x_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [0 0]
 [0 1]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [1 0]
 [1 1]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [1 1]]


### Making the Confusion Matrix

The confusion matrix gives us a picture of the accuracy of the model by showing the values of: true positives, false positives, false negatives, and true negatives in a 2x2 matrix.

In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[54  4]
 [ 1 21]]


0.9375

## Conclusion

Upon further review we could classify the people who are and are not purchasing cars based on income and age, and adjust our marketing strategy accordingly. 