# Experiment 9: Understanding Classification using KNN

## Task 1

Predict whether a person has diabetes using a k-nearest neighbors (KNN) classifier.

### Import Libraries

Import the libraries used in this experiment.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score, confusion_matrix,accuracy_score,precision_score
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

### Import Dataset

Load and view the provided dataset ‘diabetes.csv’

In [2]:
df=pd.read_csv("data/diabetes.csv")  # forward slash works on Windows, Linux and macOS
df.head()

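| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |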


### Data Cleaning

Clean the data by replacing missing values, which this dataset records as 0 rather than as empty cells, with the mean of the respective column so that they do not distort the outcome. Also separate the independent variables (features) from the dependent variable (label).

In [3]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [4]:
y=df.Outcome  # label (dependent variable)
df.drop(columns=["Outcome"],inplace=True)  # df now holds only the features

In [5]:
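colName=df.columns
# Replace every 0 with the column mean (the mean is computed with the zeros included).
# This treats 0 as missing in all columns, including Pregnancies.
for col in colName:
    df[col]=df[col].replace(0,df[col].mean())
df.head()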

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 79.799479 | 33.6 | 0.627 | 50 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 79.799479 | 26.6 | 0.351 | 31 |
| 2 | 8.0 | 183.0 | 64.0 | 20.536458 | 79.799479 | 23.3 | 0.672 | 32 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 3.845052 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |
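
In this dataset a 0 in measurement columns such as Glucose or Insulin is commonly treated as a placeholder for a missing reading, whereas 0 pregnancies is a legitimate value. The sketch below shows one possible alternative to replacing zeros everywhere: it restricts the replacement to selected columns and computes the mean from the non-zero readings only. The frame `toy` and the list `zero_means_missing` are hypothetical names used for illustration, and the choice of columns is an assumption, not part of the original exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses three column names from diabetes.csv.
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0, 3],
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 94, 168],
})

# Assumption: 0 is a placeholder for "missing" only in these columns.
zero_means_missing = ["Glucose", "Insulin"]

cleaned = toy.astype(float)
# Mark placeholder zeros as NaN so they do not pull the column mean down.
cleaned[zero_means_missing] = cleaned[zero_means_missing].replace(0, np.nan)
# Fill each gap with the mean of the observed (non-zero) values in its column.
cleaned[zero_means_missing] = cleaned[zero_means_missing].fillna(
    cleaned[zero_means_missing].mean()
)
print(cleaned)
```

With this approach the row with 0 pregnancies keeps its value, and the imputed Insulin value is the average of the two real readings rather than an average diluted by zeros.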


### Normalization

Apply feature scaling to the independent variables so that every feature has zero mean and unit variance; otherwise features with large ranges, such as Insulin, would dominate the distance calculation. Then split the data into a training set and a test set.


In [6]:
scaler=StandardScaler()
df[colName]=scaler.fit_transform(df[colName])  # colName holds the eight feature columns
df.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.536251 | 0.865276 | -0.021044 | 0.872057 | -0.417768 | 0.167255 | 0.468492 | 1.425995 |
| 1 | -1.140353 | -1.205989 | -0.516583 | 0.248678 | -0.417768 | -0.851535 | -0.365061 | -0.190672 |
| 2 | 1.206893 | 2.015979 | -0.681762 | -0.630654 | -0.417768 | -1.331821 | 0.604397 | -0.105584 |
| 3 | -1.140353 | -1.07448 | -0.516583 | -0.3747 | -0.265107 | -0.633222 | -0.920763 | -1.041549 |
| 4 | -0.186348 | 0.503626 | -2.663916 | 0.872057 | 0.530423 | 1.549899 | 5.484909 | -0.020496 |


In [7]:
X_train,X_test,y_train,y_test=train_test_split(df,y,test_size=0.2,random_state=26)
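
The cells above scale the whole dataset before splitting it, so the test rows contribute to the mean and standard deviation used for scaling. A common way to avoid this leakage is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on randomly generated stand-in data; `X_demo`, `y_demo` and `demo_scaler` are hypothetical names, and the data is not the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 50 rows, 3 features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[100.0, 30.0, 0.5], scale=[30.0, 7.0, 0.3], size=(50, 3))
y_demo = rng.integers(0, 2, size=50)

# Split first so the test rows play no part in computing the statistics.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=26
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse those statistics on the test rows

print(X_tr_scaled.mean(axis=0).round(6))  # close to 0 for every feature
print(X_te_scaled.mean(axis=0).round(6))  # generally not exactly 0
```

The test-set means are not exactly zero because the test rows are scaled with statistics they did not help compute, which mirrors how the model would treat unseen data.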

### Define the Model

Define the k-nearest neighbors model and fit it on the training set. The fitted model is then used to predict the test results.

In [8]:
cls=KNeighborsClassifier(n_neighbors=11,p=2,metric="euclidean")
cls.fit(X_train,y_train)
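
To make the classifier's behaviour concrete, the following is a minimal from-scratch sketch of the KNN prediction rule with Euclidean distance: find the k training points closest to the query and take a majority vote over their labels. The function `knn_predict` and the toy arrays are hypothetical and for illustration only; scikit-learn's implementation is more elaborate (for example, it can use tree-based neighbor search).

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Return the majority label among the k training points nearest to x_query."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([1.0, 1.05]), k=3))   # 1
```

An odd `k`, such as the 11 used above, avoids tied votes in a two-class problem.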

### Prediction 

Evaluate the model using the confusion matrix, f1_score and accuracy score by comparing the predicted and actual test values.

In [9]:
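from sklearn.metrics import recall_score
y_pred=cls.predict(X_test)
# scikit-learn metrics take their arguments as (y_true, y_pred)
accuracy=accuracy_score(y_test,y_pred)
# recall of the positive class: TP / (TP + FN)
recall=recall_score(y_test,y_pred)
accuracy,recall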

(0.7792207792207793, 0.6274509803921569)

### Confusion Matrix

In [10]:
cm=confusion_matrix(y_test,y_pred)
cm

array([[88, 15],
       [19, 32]], dtype=int64)
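
The scores can be checked by hand from the confusion matrix above. For binary labels 0 and 1, `confusion_matrix(y_true, y_pred)` puts the actual classes in rows and the predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below plugs in the four counts from the Task 1 matrix; the variable names are illustrative.

```python
# Counts read from the Task 1 confusion matrix [[88, 15], [19, 32]].
tn, fp, fn, tp = 88, 15, 19, 32

accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are real
recall = tp / (tp + fn)                     # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 4))   # 0.7792
print(round(precision, 4))  # 0.6809
print(round(recall, 4))     # 0.6275
print(round(f1, 4))         # 0.6531
```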

### F1 Score

In [11]:
f1=f1_score(y_test,y_pred)
f1

0.6530612244897959

## Task 2: Cosine Similarity

Reusing the code above, replace the Euclidean distance with the cosine distance and determine which of the two gives the better accuracy and f1_score.

In [12]:
cls=KNeighborsClassifier(n_neighbors=11,p=2,metric="cosine")
cls.fit(X_train,y_train)
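
The difference between the two metrics can be seen on a few vectors. Cosine distance is one minus the cosine of the angle between two vectors, so it depends only on direction, while Euclidean distance also depends on magnitude. The sketch below is a small illustration with hypothetical vectors and a hand-written `cosine_distance` helper; it is not the code path scikit-learn uses internally.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])   # same direction as a, twice the length
c = np.array([2.0, -1.0])  # perpendicular to a

print(cosine_distance(a, b))  # approximately 0: same direction
print(cosine_distance(a, c))  # approximately 1: perpendicular
print(np.linalg.norm(a - b))  # Euclidean distance is clearly non-zero
```

Two patients whose standardized features point in the same direction but differ in magnitude are therefore close under the cosine metric and farther apart under the Euclidean one, which is why the two models can pick different neighbors.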

In [13]:
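from sklearn.metrics import recall_score
y_pred2=cls.predict(X_test)
# scikit-learn metrics take their arguments as (y_true, y_pred)
accuracy2=accuracy_score(y_test,y_pred2)
# recall of the positive class: TP / (TP + FN)
recall2=recall_score(y_test,y_pred2)
accuracy2,recall2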

(0.7857142857142857, 0.6078431372549019)

### Confusion Matrix

In [14]:
cm=confusion_matrix(y_test,y_pred2)
cm

array([[90, 13],
       [20, 31]], dtype=int64)

### F1 Score

In [15]:
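f1=f1_score(y_test,y_pred2)
f1

From the confusion matrix above, F1 = 2·31 / (2·31 + 13 + 20) = 62/95 ≈ 0.6526. The cosine model therefore has a slightly higher accuracy than the Euclidean model (0.7857 vs. 0.7792) but a marginally lower F1 score (0.6526 vs. 0.6531).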