# Machine Learning with K Nearest Neighbors

In this project, we'll use the KNN algorithm to classify instances from a synthetic dataset into one of two target classes.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Data

In [5]:
# The first CSV column is a saved row index, so read it as the index, not as a feature
df = pd.read_csv('../data/KNN_Project_Data.csv', index_col=0)

In [6]:
df.head()

|   | XVPM | GWYH | TRAT | TLLZ | IGGA | HYKR | EDFS | GUUB | MGJM | JHZC | TARGET CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1636.670614 | 817.988525 | 2565.995189 | 358.347163 | 550.417491 | 1618.870897 | 2147.641254 | 330.727893 | 1494.878631 | 845.136088 | 0 |
| 1 | 1013.40276 | 577.587332 | 2644.141273 | 280.428203 | 1161.873391 | 2084.107872 | 853.404981 | 447.157619 | 1193.032521 | 861.081809 | 1 |
| 2 | 1300.035501 | 820.518697 | 2025.854469 | 525.562292 | 922.206261 | 2552.355407 | 818.676686 | 845.491492 | 1968.367513 | 1647.186291 | 1 |
| 3 | 1059.347542 | 1066.866418 | 612.000041 | 480.827789 | 419.467495 | 685.666983 | 852.86781 | 341.664784 | 1154.391368 | 1450.935357 | 0 |
| 4 | 1018.340526 | 1313.679056 | 950.622661 | 724.742174 | 843.065903 | 1370.554164 | 905.469453 | 658.118202 | 539.45935 | 1899.850792 | 0 |


## Standardizing the Variables
KNN classifies a point by its distance to neighboring points, so features measured on larger scales dominate the distance computation. Standardizing the variables to zero mean and unit variance before training gives each feature a comparable influence.
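
To make the effect of scale concrete, here is a minimal sketch on a made-up two-feature array (the values are illustrative and not taken from the project data). It checks that `StandardScaler` matches a manual z-score computation and compares the distance between two rows before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): feature 0 is in the thousands, feature 1 is below ten.
X_toy = np.array([[1000.0, 1.0],
                  [2000.0, 5.0],
                  [3000.0, 9.0]])

# Manual z-score: subtract each column's mean, divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), np.std's default.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

scaled_toy = StandardScaler().fit_transform(X_toy)

# Before scaling, the distance between rows 0 and 1 is dominated by feature 0.
raw_dist = np.linalg.norm(X_toy[0] - X_toy[1])
scaled_dist = np.linalg.norm(scaled_toy[0] - scaled_toy[1])
print(raw_dist, scaled_dist)
```

After scaling, both features contribute equally to the distance between the two rows.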

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
df.columns

The target class is a label, not a feature, so we exclude it when fitting the scaler.

In [None]:
scaler.fit(df.drop('TARGET CLASS',axis=1))

Now we'll use the .transform() method to transform the features to a scaled version.

In [None]:
scaled_feats = scaler.transform(df.drop('TARGET CLASS',axis=1))

In [None]:
# Convert the scaled features to a DataFrame, reusing the original feature names
feature_cols = df.drop('TARGET CLASS', axis=1).columns
scaled_df = pd.DataFrame(scaled_feats, columns=feature_cols)

In [None]:
scaled_df.head()

## Model Building

In [None]:
#Begin with splitting the data into training and test sets
from sklearn.model_selection import train_test_split

In [None]:
X = scaled_df
y = df['TARGET CLASS']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
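
The scaler above was fit on the full dataset before the split, so the test rows contributed to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows one way to do that with a scikit-learn pipeline. It is an illustration under assumptions: the data comes from `make_classification` as a stand-in for the project file, and names such as `X_demo` and `pipe` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (assumption: 10 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)

# Split first, so the test rows play no part in computing the scaling statistics.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

# The pipeline fits the scaler on the training rows only, then reuses
# those means and standard deviations when transforming the test rows.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
pipe.fit(Xd_train, yd_train)
demo_accuracy = pipe.score(Xd_test, yd_test)
print(demo_accuracy)
```

On a dataset of this size the difference from scaling everything up front is usually small, but the pipeline keeps the test set untouched until evaluation.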

Next, we fit a KNN classifier, starting with K = 1.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train,y_train)
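
To show roughly what a KNN classifier does at prediction time, here is a minimal from-scratch sketch. It assumes Euclidean distance and a simple majority vote. The function `knn_predict` and the toy arrays are hypothetical, and this is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_ref, y_ref, x_new, k=1):
    """Predict one label by majority vote among the k nearest reference points."""
    # Euclidean distance from x_new to every reference row.
    distances = np.linalg.norm(X_ref - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters.
X_pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_pts = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_pts, y_pts, np.array([0.3, 0.3]), k=3))  # → 0
print(knn_predict(X_pts, y_pts, np.array([4.8, 5.0]), k=3))  # → 1
```

"Fitting" a KNN model mostly means storing the training data. The real work happens when a new point is compared against it.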

## Predictions and Evaluations

In [None]:
predictions = knn.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(confusion_matrix(y_test,predictions))

In [None]:
print(classification_report(y_test,predictions))
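
As a reading aid for the two reports above, the following sketch derives precision, recall, and accuracy from a confusion matrix by hand. The labels are made up for illustration and are not the project's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (hypothetical), not the project's predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted 1s, how many were right
recall = tp / (tp + fn)      # of the true 1s, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, precision, recall, accuracy)
```

The per-class precision and recall values printed by `classification_report` follow the same formulas, applied to each class in turn.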

## Choosing a K Value

A major part of building a KNN model is choosing a value of K that improves its performance. We'll use the elbow method: plot the error rate against K and look for the point where the error stops dropping noticeably.

We will create a loop that trains various KNN models with different k values, then keep track of the error_rate for each of these models with a list.

In [None]:
error_rate = []
for i in range(1, 50):
    # Train a model with K = i and record its share of misclassified test points
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    predictions = knn.predict(X_test)
    error_rate.append(np.mean(predictions != y_test))

Now, to make it easier to see what values of K had lower error rates, we'll create a plot using the information from our for loop.

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,50),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

Looking at the plot, 31 seems to be a reasonable value for K, so let's retrain the model with it.
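
Because the loop above scores each K on the test set, the chosen K may be tuned to that particular split. One alternative, sketched here under assumptions, is to pick K by cross-validation on the training rows and keep the test set for the final evaluation. The data is a `make_classification` stand-in for the project file, and the names `k_values`, `cv_error`, and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

k_values = list(range(1, 50, 2))  # odd K only, to avoid tied votes with two classes
cv_error = []
for k in k_values:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validated accuracy on the training rows only.
    scores = cross_val_score(model, Xd_train, yd_train, cv=5)
    cv_error.append(1 - scores.mean())

# K with the lowest mean cross-validated error.
best_k = k_values[int(np.argmin(cv_error))]
print(best_k, min(cv_error))
```

The resulting `cv_error` list can be plotted against `k_values` in the same way as `error_rate` above.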

## Retrain with new K Value

In [None]:
knn = KNeighborsClassifier(n_neighbors=31)
knn.fit(X_train,y_train)
predictions = knn.predict(X_test)

In [None]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))

With K = 31, the model performs better on the test set than it did with K = 1.

This concludes this project.