# K Nearest Neighbors with Python

You've been given a classified data set from a company! They've hidden the feature column names but have given you the data and the target classes. 

We'll use KNN to build a model that predicts the class of a new data point directly from its features.

Let's grab it and use it!

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data

Set index_col=0 to use the first column as the index.

In [None]:
df = pd.read_csv("Classified Data",index_col=0)

In [None]:
df.head()

## Standardize the Variables

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variable on a large scale will have a much larger effect on the distance between observations, and hence on the KNN classifier, than a variable on a small scale. That is why, when using KNN, we first standardize every feature so that they all share a common scale.
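To make the effect of scale concrete, here is a minimal sketch on made-up numbers. The two columns (income in dollars, age in years) and all four rows are hypothetical and have nothing to do with the classified data set; the point is only that the nearest neighbor of the first row changes once each column is standardized.

```python
import numpy as np

# Hypothetical data: columns are income (dollars) and age (years).
points = np.array([
    [50000.0, 25.0],   # query point
    [50400.0, 65.0],   # candidate A: close in income, far in age
    [51000.0, 27.0],   # candidate B: close in age, a bit further in income
    [58000.0, 45.0],   # extra point so each column has a sensible spread
])

def nearest_to_first(data):
    """Return the row index (excluding row 0) closest to row 0 in Euclidean distance."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

# Standardize each column: subtract its mean and divide by its standard deviation.
standardized = (points - points.mean(axis=0)) / points.std(axis=0)

raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(standardized)
print(raw_nearest, scaled_nearest)
```

On the raw values the income column dominates the distance, so candidate A (row 1) is nearest even though its age differs by 40 years. After standardization both columns contribute comparably and candidate B (row 2) becomes the nearest neighbor.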

In [None]:
from sklearn.preprocessing import StandardScaler # Used much like an ML model: create an instance, fit it, then transform.

In [None]:
scaler = StandardScaler() # Create an instance of StandardScaler, just as we would for an ML algorithm.

In [None]:
scaler.fit(df.drop('TARGET CLASS',axis=1))
# Fit the scaler on the feature columns only, which is why we drop the TARGET CLASS column.

In [None]:
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
# Transform method performs the standardization by centering and scaling.
# Gives us a scaled version of values in df excluding TARGET CLASS

In [None]:
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1]) # Create a DataFrame of the scaled features
# The data is the scaled_features array; the column names are df.columns without the last one (TARGET CLASS), hence [:-1].
df_feat.head()
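As a quick check of what `fit` and `transform` do, the sketch below runs `StandardScaler` on a tiny made-up array (not the classified data) and compares the result with the manual formula. `fit_transform` is simply the two steps combined in one call.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical array with two columns on very different scales.
raw = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

demo_scaler = StandardScaler()
scaled = demo_scaler.fit_transform(raw)  # fit, then transform, in one call

# The same result by hand: subtract the learned mean, divide by the learned scale.
manual = (raw - demo_scaler.mean_) / demo_scaler.scale_

print(scaled.mean(axis=0), scaled.std(axis=0))
print(np.allclose(scaled, manual))
```

Each scaled column ends up with mean 0 and standard deviation 1, which is what puts all features on an equal footing for the distance calculation.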

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = scaled_features
y = df['TARGET CLASS']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
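The cells above fit the scaler on the full data set before splitting. A common alternative, sketched here on synthetic data rather than the classified data set, is to split first and let a pipeline fit the scaler on the training portion only, so that no test-set statistics reach the model. The `make_classification` data and the names `X_demo`, `y_demo`, and `model` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (hypothetical: 1000 rows, 10 features).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=101)

# Split first, using the same proportions as the notebook.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=101)

# The pipeline fits the scaler on the training rows only, then applies it to any data it sees later.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```

For a simple walkthrough the difference is usually small, but the pipeline version keeps the test set fully unseen during preprocessing.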

## Using KNN

Remember that we are trying to come up with a model that predicts the TARGET CLASS of an observation. We'll start with k=1.
Then we will use the elbow method to choose a better K value.
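For intuition about what K controls, here is a minimal from-scratch sketch of the KNN prediction rule: find the k closest training points and take a majority vote. It is an illustration on toy data only, not how scikit-learn implements `KNeighborsClassifier`; the function `knn_predict` and the arrays `toy_X` and `toy_y` are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to every training row
    nearest = np.argsort(distances)[:k]                  # indices of the k closest rows
    votes = Counter(int(y_train[i]) for i in nearest)    # count the labels among those neighbors
    return votes.most_common(1)[0][0]

# Toy data: two points of class 0 near the origin, three points of class 1 near (1, 1).
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
toy_y = np.array([0, 0, 1, 1, 1])

query = np.array([0.2, 0.1])
pred_k1 = knn_predict(toy_X, toy_y, query, k=1)
pred_k5 = knn_predict(toy_X, toy_y, query, k=5)
print(pred_k1, pred_k5)
```

With k=1 the query takes the label of its single closest neighbor (class 0). With k=5 every training point votes, so the majority class (class 1) wins. Choosing K is a balance between these two extremes.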

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
# n_neighbors is the number of neighbors (K) the model uses. We set it to 1 for now.

In [None]:
knn.fit(X_train,y_train) # Fitting on our training data.

In [None]:
pred = knn.predict(X_test)
pred  # The predicted class for each observation in the test set.

## Predictions and Evaluation

Let's evaluate our KNN model! After this, we'll use the elbow method to find a better K value.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(confusion_matrix(y_test,pred))

In [None]:
print(classification_report(y_test,pred))

So far the model looks good, with an accuracy of (TP + TN) / Total = (151 + 126) / 300 ≈ 0.92.
- Let's see if we can improve the model further.
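The accuracy formula above is the sum of the confusion matrix diagonal divided by the total count. The sketch below checks this on a small made-up set of labels (not the notebook's predictions) and compares it with scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for eight observations.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)  # rows are true classes, columns are predicted classes

# Correct predictions sit on the diagonal, so accuracy = trace / total.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_hat))
```

Both calculations give the same value, so the accuracy can be read straight off any confusion matrix.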

## Choosing a K Value

Let's go ahead and use the elbow method to pick a good K Value:

In [None]:
error_rate = []  # Test-set error rate for each K value tried below.
# We then plot the error rates and look for the K values with the lowest error.

# This loop may take a little while to run.
for i in range(1,40):
    # For each K from 1 to 39: build a KNeighborsClassifier with n_neighbors=i, fit it on the
    # training data, predict on the test set, and append the fraction of wrong predictions
    # (the mean of pred_i != y_test) to error_rate.
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,5)) # A little larger than the default so the plot is easier to read.
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10) # Plot of range(1,40) vs error_rate
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.grid(True)

- The error rate starts at around 0.075 for K=1, goes up at K=2, and then drops sharply at K=3.

Here we can see that once K is above roughly 23, the error rate tends to hover around 0.05-0.06. Let's retrain the model with K=17 and K=23 and check the classification reports!

In [None]:
# FIRST A QUICK COMPARISON TO OUR ORIGINAL K=1
knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

In [None]:
# NOW WITH K=17
knn = KNeighborsClassifier(n_neighbors=17)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=17')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

In [None]:
# NOW WITH K=23
knn = KNeighborsClassifier(n_neighbors=23)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=23')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

# Great job!

We were able to squeeze some more performance out of our model by tuning to a better K value!
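One caveat about the K selection above: the error rates were measured on the test set, so the chosen K is tuned to that particular split. A possible follow-up, which is not what this notebook does, is to pick K by cross-validation on the training data and keep the test set for a final check. The sketch below uses synthetic data from `make_classification` as a stand-in; `X_demo`, `y_demo`, `cv_error`, and `best_k` are names chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (hypothetical: 500 rows, 10 features).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)

k_values = list(range(1, 40))  # K from 1 to 39, matching the loop in the notebook
cv_error = []
for k in k_values:
    # 5-fold cross-validated accuracy for this K; the error rate is 1 - mean accuracy.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]
print(best_k, min(cv_error))
```

The resulting `cv_error` list can be plotted against `k_values` in the same way as `error_rate` to produce an elbow plot that does not depend on the test set.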