In [None]:
6. kNN (k-Nearest Neighbors)
kNN can be used for both classification and regression problems.
However, it is more widely used for classification problems in industry.
K nearest neighbors is a simple algorithm that stores all available cases and classifies a new case by a majority vote of its K nearest neighbors.
The new case is assigned to the class that is most common among those K neighbors, with nearness measured by a distance function.
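
The majority-vote rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than how scikit-learn implements it: the helper names `euclidean` and `knn_predict` and the toy data are assumptions made up for this example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every stored case with its label and sort by distance to the query.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    # Keep the labels of the k closest cases and count the votes.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two small clusters labelled "a" and "b".
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0), k=3))  # a
print(knn_predict(points, labels, (8.0, 9.0), k=3))  # b
```

Note that there is no training step beyond storing the data; all the work happens at prediction time, when the query is compared against every stored case.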

These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance.
The first three are used for continuous variables and the fourth (Hamming) for categorical variables.
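
As a rough sketch, the four distances can be written directly from their definitions. The function names here are illustrative, and `hamming` is written to count mismatching positions; some libraries report the proportion of mismatches instead.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences (Minkowski with p = 1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance (Minkowski with p = 2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


u, v = (0.0, 0.0), (3.0, 4.0)
print(manhattan(u, v))     # 7.0
print(euclidean(u, v))     # 5.0
print(minkowski(u, v, 1))  # 7.0, same as Manhattan
print(hamming(("red", "small", "round"), ("red", "large", "round")))  # 1
```

Manhattan and Euclidean are the p = 1 and p = 2 special cases of Minkowski, which is why a single order parameter is enough to switch between them.
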
If K = 1, then the case is simply assigned to the class of its nearest neighbor. 
Choosing a good value of K is often the challenging part of kNN modeling.
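
One common way to pick K is to try several values and keep the one that predicts held-out cases best. The sketch below uses leave-one-out validation on a made-up toy dataset that contains one deliberately mislabelled point; the helpers `knn_predict` and `loo_accuracy` are hypothetical names for this example, not library functions.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def loo_accuracy(points, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (query, truth) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest_points, rest_labels, query, k) == truth
    return hits / len(points)


# Hypothetical toy data: two clusters, plus one point inside the "a"
# cluster that carries the label "b" (label noise).
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (8, 8), (8, 9), (9, 8), (9, 9), (8.5, 8.5)]
labels = ["a", "a", "a", "a", "b",
          "b", "b", "b", "b", "b"]

scores = {k: loo_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

With K = 1 the single noisy point misleads all of its neighbors, while a larger K lets the vote outweigh it. On real data the trade-off runs both ways: a very large K can blur genuine class boundaries.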

kNN maps easily onto real life. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in, and infer what that person is like from them.

Things to consider before selecting kNN:

KNN is computationally expensive
Variables should be normalized; otherwise variables with a larger range can bias the distance
Spend more effort on pre-processing, such as outlier and noise removal, before applying kNN
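
To illustrate the normalization point, the sketch below finds the nearest neighbor of one record before and after standardizing each column to zero mean and unit standard deviation. The records and the helper names `standardize` and `nearest` are invented for this example; with scikit-learn one would typically use a scaler such as `StandardScaler` instead of hand-written code.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumes no column is constant; a zero standard deviation would divide by zero.
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))


# Hypothetical records: (age in years, income in currency units).
people = [(25, 50_000), (60, 50_500), (27, 58_000), (45, 90_000)]

print(nearest(people, 0))               # 1: income dominates the raw distance
print(nearest(standardize(people), 0))  # 2: age now counts as well
```

On the raw values, the 25-year-old is matched with the 60-year-old because a 500-unit income gap is tiny next to the other income gaps, and the 35-year age gap barely registers. After standardizing, both features contribute on a comparable scale and the 27-year-old becomes the nearest neighbor.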

In [None]:
# importing required libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')

# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)

# Now, we need to predict the target variable for the test data
# target variable - Survived

# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'])
train_y = train_data['Survived']

# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'])
test_y = test_data['Survived']

'''
Create the object of the K-Nearest Neighbor model
You can also add other parameters and test your code here
Some parameters are : n_neighbors, leaf_size
Documentation of sklearn K-Neighbors Classifier: 

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

 '''
model = KNeighborsClassifier()  

# fit the model with the training data
model.fit(train_x,train_y)

# Number of Neighbors used to predict the target
print('\nThe number of neighbors used to predict the target : ',model.n_neighbors)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 

# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('Target on test data',predict_test) 

# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

In [48]:
"""
KNN using pima-indians-diabetes dataset
"""
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

column_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset
dataset = pd.read_csv('../pima-indians-diabetes.csv', header=None, names=column_names)

# split dataset into features and target variable
feature_columns = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = dataset[feature_columns] # Features
y = dataset.label # Target variable

dataset.info(verbose=True)

dataset.describe().T


# split X and y into training (75%) and testing (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25)

# shape of the dataset
# print('Shape of training data :',X_train.shape)
# print('Shape of testing data :',X_test.shape)

classifier = KNeighborsClassifier().fit(X_train, y_train)

# predict the target on the train dataset
predict_train = classifier.predict(X_train)
print('\n Target on train data \n',predict_train, sep="\n") 

# predict the target on the test dataset
predict_test = classifier.predict(X_test)
print('Target on test data',predict_test, sep="\n")

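# train and test accuracy for k = 1..14, to see how the choice of k affects the model
k_values = list(range(1, 15))
train_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

plt.figure(figsize=(12,5))
p = sns.lineplot(x=k_values, y=train_scores, marker='*', label='Train Score')
p = sns.lineplot(x=k_values, y=test_scores, marker='o', label='Test Score')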

print('\n The number of neighbors used to predict the target : ',classifier.n_neighbors)

# Accuracy Score on train dataset
accuracy_train = accuracy_score(y_train,predict_train)
print('accuracy_score on train dataset : ', accuracy_train)

# Accuracy Score on test dataset
accuracy_test = accuracy_score(y_test,predict_test)
print('accuracy_score on test dataset : ', accuracy_test)

dataset.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnant    768 non-null int64
glucose     768 non-null int64
bp          768 non-null int64
skin        768 non-null int64
insulin     768 non-null int64
bmi         768 non-null float64
pedigree    768 non-null float64
age         768 non-null int64
label       768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

 Target on train data 

[0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0
 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 1 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 1 0
 1 0 1 1 1 0 1 1 1 1 1 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0
 0 0 1 1 0 0 1 1 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 0 1 0 0 1 1 0
 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 0 0 0 1 1
 0 1 0 0 0 1

   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1
