# K-Nearest Neighbors

## Introduction

In this lesson, you'll learn about a supervised learning algorithm, **_K-Nearest Neighbors_**, and how you can use it to make predictions for classification and regression tasks!

## Objectives

You will be able to:

* Describe how KNN makes classifications


## What is K-Nearest Neighbors?

**_K-Nearest Neighbors_** (or KNN, for short) is a supervised learning algorithm that can be used for both **_Classification_** and **_Regression_** tasks. However, in this section, we will cover KNN only in the context of classification. KNN is a distance-based classifier, meaning that it implicitly assumes that the smaller the distance between two points, the more similar they are. In KNN, each feature column acts as a dimension. In a dataset with two columns, we can easily visualize this by treating the values of one column as X coordinates and the values of the other as Y coordinates. Since this is a **_Supervised learning algorithm_**, you must also have the labels for each point in the dataset, or else you wouldn't know what to predict!

## Fitting the model

KNN is unusual among classifiers in that it does almost nothing during the "fit" step and does all of its work during the "predict" step. During the "fit" step, KNN just stores the training data and the corresponding labels. No distances are calculated at this point.
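As a rough illustration of how little happens during "fit", here is a minimal sketch using only the standard library. The class name `LazyKNN` and its attributes `X_train` and `y_train` are hypothetical names chosen for this example, not part of any library.

```python
class LazyKNN:
    """Hypothetical minimal class; only the 'fit' step is sketched here."""

    def fit(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of rows")
        # Store copies of the training points and their labels.
        # No distances are computed at this stage.
        self.X_train = [list(row) for row in X]
        self.y_train = list(y)
        return self


model = LazyKNN().fit([[1.0, 2.0], [3.0, 4.0]], ["a", "b"])
print(len(model.X_train), model.y_train)
```

Because nothing is learned up front, the cost of the algorithm is deferred to prediction time.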

## Making predictions with K

All the magic happens during the "predict" step. During this step, KNN takes a point that you want a class prediction for and calculates the distance between that point and every point in the training set. It then finds the `K` closest points, or **_Neighbors_**, and examines the label of each. You can think of each of the K closest points as getting a 'vote' on the predicted class, and each neighbor votes for the class it belongs to. The majority wins: the algorithm assigns the point in question to whichever class has the highest count among the K nearest neighbors.
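The voting procedure described above can be sketched in a few lines. This is a simplified illustration that assumes Euclidean distance and a small in-memory dataset; the function name `knn_predict` and the toy data are made up for this example, and ties are resolved in favor of whichever tied class was counted first.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, point, k=3):
    """Predict a class for `point` by majority vote among its k nearest neighbors."""
    # Distance from the query point to every training point
    distances = [math.dist(row, point) for row in X_train]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Each neighbor votes for its own label
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


X_train = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.5]]
y_train = ["red", "red", "red", "blue", "blue"]

print(knn_predict(X_train, y_train, [2.0, 2.0], k=3))  # red
print(knn_predict(X_train, y_train, [6.0, 7.0], k=3))  # blue
```

In the second call, two of the three nearest neighbors are "blue" and one is "red", so "blue" wins the vote.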

In the following animation, K=3: 

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/knn.gif'>

[gif source](https://gfycat.com/wildsorrowfulchevrotain)

## Distance metrics

When using KNN, you can use **_Manhattan_**, **_Euclidean_**, **_Minkowski distance_**, or any other distance metric. Choosing an appropriate distance metric is essential and will depend on the context of the problem at hand.
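To make these metrics concrete, here is a small sketch of the Minkowski distance, which reduces to Manhattan distance when `p=1` and to Euclidean distance when `p=2`. The function name `minkowski_distance` is a hypothetical helper written for this example.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two points of equal length."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the p-th powers of the absolute coordinate differences,
    # then take the p-th root of the total.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean)
```

The same pair of points is 7 units apart under Manhattan distance and 5 units apart under Euclidean distance, so the choice of metric can change which neighbors count as "nearest".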

## Evaluating model performance

How you evaluate model performance depends on whether you're using the model for a classification or regression task. KNN can be used for regression (by averaging the target values of the K-nearest neighbors), as well as for both binary and multiclass classification tasks.
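For the regression case, the averaging step might look like the following sketch. It assumes Euclidean distance and an unweighted mean; `knn_regress` and the toy data are hypothetical and only meant to show the idea.

```python
import math


def knn_regress(X_train, y_train, point, k=3):
    """Predict a numeric target as the mean target of the k nearest neighbors."""
    distances = [math.dist(row, point) for row in X_train]
    # Indices of the k closest training points
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Unweighted average of the neighbors' target values
    return sum(y_train[i] for i in nearest) / len(nearest)


X_train = [[1.0], [2.0], [3.0], [10.0]]
y_train = [10.0, 20.0, 30.0, 100.0]

print(knn_regress(X_train, y_train, [2.5], k=2))  # 25.0
```

Here the two nearest neighbors of 2.5 are the points at 2.0 and 3.0, so the prediction is the average of their targets, 20.0 and 30.0.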

Evaluating classification performance for KNN works the same as evaluating performance for any other classification algorithm -- you need a set of predictions, and the corresponding ground-truth labels for each of the points you made a prediction on. You can then compute evaluation metrics such as **_Precision, Recall, Accuracy, F1-Score_** etc. 
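As a minimal sketch of how these metrics are derived from predictions and ground-truth labels, the example below computes them by hand for one class treated as "positive". The helper name `precision_recall_f1` and the toy labels are made up for illustration; in practice a library function would usually do this for you.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["cat", "cat", "dog", "dog", "cat"]
y_pred = ["cat", "dog", "dog", "cat", "cat"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="cat")
print(accuracy, precision, recall, f1)
```

With these labels, three of the five predictions are correct, and for the "cat" class there are two true positives, one false positive, and one false negative.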

## Summary
Great! Now that you know how the KNN classifier works, you'll implement KNN using Python from scratch in the next lab.

```python
# Import the required libraries
import pandas as pd
import numpy as np

# Read the dataset
df = pd.read_csv("./data/Iris.csv")

# View the first five rows
print(df.head())

# Features: every column except the last one.
# Note: this selection also includes `Id`, which is a row identifier rather
# than a measurement; use df.iloc[:, 1:-1] to leave it out.
X = df.iloc[:, :-1].values
# Labels: the column at index 5 (the sixth column, `Species`)
y = df.iloc[:, 5].values
```

```
   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa
```


```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Hold out 30% of the rows for testing.
# No random_state is set, so the split and the scores vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Optional preprocessing step: encode the string labels as integers.
# Fit the encoder on the training labels, then reuse it for the test labels.
# from sklearn.preprocessing import LabelEncoder
# label_encoder = LabelEncoder()
# y_train = label_encoder.fit_transform(y_train)
# y_test = label_encoder.transform(y_test)

# Fit a KNN classifier with K=5
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = classifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

```
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        18
Iris-versicolor       0.93      1.00      0.97        14
 Iris-virginica       1.00      0.92      0.96        13

       accuracy                           0.98        45
      macro avg       0.98      0.97      0.98        45
   weighted avg       0.98      0.98      0.98        45

[[18  0  0]
 [ 0 14  0]
 [ 0  1 12]]
```
