# K Nearest Neighbors

In this Notebook, we will implement a K Nearest Neighbors classifier. The aim is to get familiar with Scikit-Learn's most important functions (scaling the features, splitting the dataset, creating a classifier, and checking the accuracy score, confusion matrix, and so on) rather than to analyze a specific problem and dataset.

If you are already quite familiar with Scikit-Learn, you can skip this Notebook and start with the other two.


## Import Libraries
**Import pandas, seaborn, and the usual libraries.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data
**Read the 'KNN_Project_Data.csv' file into a dataframe.**

In [None]:
# File name taken from the instruction above; adjust it if your copy has no extension
df = pd.read_csv('KNN_Project_Data.csv')

**Check the head of the dataframe.**

In [None]:
df.head() 

**Use seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.**

In [None]:
sns.pairplot(df,hue='TARGET CLASS',palette='coolwarm')

## Standardize the Variables


**Import `StandardScaler` from Scikit-Learn.**

In [None]:
from sklearn.preprocessing import StandardScaler

**Create a `StandardScaler()` object called `scaler`.**

In [None]:
scaler = StandardScaler()

**Fit `scaler` to the features.**

In [None]:
scaler.fit(df.drop('TARGET CLASS',axis=1))

**Use the `.transform()` method to transform the features to a scaled version.**

In [None]:
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))

**Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.**

In [None]:
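# Take the feature names from the same frame that was scaled, so this works
# even if 'TARGET CLASS' is not the last column
df_feat = pd.DataFrame(scaled_features, columns=df.drop('TARGET CLASS', axis=1).columns)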
df_feat.head()
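
To see what the scaling step does, here is a minimal sketch on a made-up array (`X_toy` and `toy_scaler` are hypothetical names, not part of the project data). It checks that `StandardScaler` matches a manual z-score: subtract each column's mean and divide by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: two columns on very different scales
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0],
                  [4.0, 4000.0]])

toy_scaler = StandardScaler()
X_scaled = toy_scaler.fit_transform(X_toy)

# Manual z-score: (x - column mean) / column std, with the population std (ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))   # both approaches should agree
print(X_scaled.mean(axis=0))             # each column mean is (numerically) 0
print(X_scaled.std(axis=0))              # each column std is (numerically) 1
```

Scaling matters for KNN because the classifier relies on distances: without it, the column with the largest numeric range would dominate the distance computation.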

## Train Test Split

**Use `train_test_split` to split your data into a training set and a testing set.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_features,df['TARGET CLASS'],
                                                    test_size=0.30)
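
The split above is random, so results change from run to run. As a possible variation (not what this Notebook does), the sketch below fixes `random_state` for reproducibility, stratifies on the labels, and fits the scaler on the training portion only so that no test-set statistics leak into preprocessing. The data is synthetic and every name here (`X_demo`, `y_demo`, `demo_scaler`, the seed values) is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features, balanced binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0, 1] * 50)

# random_state makes the split repeatable; stratify keeps the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101, stratify=y_demo)

# Fit the scaler on the training rows only, then apply it to both parts
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```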

## Build the KNN Model

**Import `KNeighborsClassifier` from Scikit-Learn.**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Create a KNN model instance with n_neighbors=1.**

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

**Fit the KNN model to the training data.**

In [None]:
knn.fit(X_train,y_train)
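
For intuition about what the fitted model does at prediction time, here is a rough NumPy sketch of the idea: find the `k` closest training points by Euclidean distance and take a majority vote. The function `knn_predict` and the toy data are hypothetical. This is not Scikit-Learn's implementation, which uses faster neighbor searches and its own tie-breaking rules.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Predict labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in np.asarray(X_query, dtype=float):
        # Euclidean distance from the query point to every training point
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbors' labels
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Hypothetical toy data: two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])

toy_pred = knn_predict(X_toy, y_toy, [[0.2, 0.1], [5.5, 5.2]], k=1)
print(toy_pred)
```

With `k=1`, each query point simply takes the label of its single closest training point, which is why a model with `n_neighbors=1` tends to be sensitive to noise.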

## Predictions and Evaluations

**Use the `predict` method to predict values using your KNN model and X_test.**

In [None]:
pred = knn.predict(X_test)

**Compute the confusion matrix, accuracy score and classification report.**

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [None]:
# Confusion matrix
print(confusion_matrix(y_test,pred))

In [None]:
# Accuracy score
print('Test accuracy score: '+ str(accuracy_score(y_test,pred)))

In [None]:
# Classification report
print(classification_report(y_test,pred))
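
The three outputs above are closely related. The sketch below uses a made-up 2x2 confusion matrix (the counts are hypothetical, not results from this dataset) to show how accuracy, precision, recall and F1 for the positive class follow from it, assuming Scikit-Learn's layout where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([[40, 10],
               [5, 45]])

# For a binary problem the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()        # share of all predictions that are correct
precision = tp / (tp + fp)             # of predicted positives, how many are right
recall = tp / (tp + fn)                # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```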

## Choosing a K Value

**Create a for loop that trains KNN models with different K values, and keep track of the error rate of each model in a list called `error_rate`.**

In [None]:
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

**Now that you have computed the `error_rate` for each model, plot the error rate against the K value.**

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
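
Besides reading the best K off the plot, it can be picked programmatically. A minimal sketch, using hypothetical error rates rather than the values computed above (`k_values`, `error_rate_demo` and `best_k` are illustrative names):

```python
import numpy as np

# Hypothetical error rates for K = 1..10, standing in for the list built in the loop
k_values = list(range(1, 11))
error_rate_demo = [0.10, 0.09, 0.07, 0.08, 0.06, 0.06, 0.07, 0.05, 0.06, 0.05]

# np.argmin returns the index of the first minimum, i.e. the smallest K among ties
best_index = int(np.argmin(error_rate_demo))
best_k = k_values[best_index]

print(f"Lowest error rate {error_rate_demo[best_index]:.2f} at K={best_k}")
```

Keep in mind that choosing K by its error on the test set makes the final test score somewhat optimistic, since the test data has then influenced the model choice.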

## Retrain with new K Value

**Retrain your model with the best K value and re-compute the confusion matrix, accuracy score and classification report.**

In [None]:
# NOW WITH K=30
knn = KNeighborsClassifier(n_neighbors=30)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=30')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print('Test accuracy score: '+ str(accuracy_score(y_test,pred)))
print('\n')
print(classification_report(y_test,pred))
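
As a possible alternative to tuning K against the test set, the sketch below selects K by cross-validation on the training data only, with scaling and the classifier combined in a `Pipeline`. It runs on synthetic data from `make_classification`; the dataset, the seed values, and the names `pipe`, `search` and `best_k_cv` are all assumptions made for illustration and not part of this Notebook's exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=0)

# The pipeline re-fits the scaler inside each cross-validation fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Try the same range of K values as the loop above, with 5-fold cross-validation
search = GridSearchCV(
    pipe, param_grid={'knn__n_neighbors': list(range(1, 40))}, cv=5)
search.fit(X_tr, y_tr)

best_k_cv = search.best_params_['knn__n_neighbors']
test_accuracy = search.score(X_te, y_te)  # accuracy of the refit best model

print(f"Best K by cross-validation: {best_k_cv}")
print(f"Test accuracy score: {test_accuracy:.3f}")
```

Here the test set is used only once, for the final score, so that score stays an honest estimate of performance on unseen data.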