In [1]:
import warnings 
warnings.filterwarnings('ignore')

## K-Nearest-Neighbors

KNN belongs to the supervised learning family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations (x, y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h: X→Y so that, given an unseen observation x, h(x) can confidently predict the corresponding output y.

In this module we will explore the inner workings of KNN, how to choose an optimal value of K, and how to use KNN from scikit-learn.
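
To make the inner workings concrete, here is a minimal from-scratch sketch of the KNN prediction rule, assuming Euclidean distance and a plain majority vote. The function name `knn_predict` and the toy points are illustrative only; they are not part of scikit-learn or of this module's dataset.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points (sorted by distance only)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data: two well-separated groups (illustrative values only)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # query near the first group
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # query near the second group
```

There is no training step beyond storing the data: all the work happens at prediction time, which is why the choice of K and the scale of the features matter so much.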

## Overview

1. Read the problem statement.

2. Get the dataset.

3. Explore the dataset.

4. Pre-process the dataset.

5. Visualize the data.

6. Transform the dataset for building the machine learning model.

7. Split the data into train and test sets.

8. Build the model.

9. Apply the model.

10. Evaluate the model.

11. Find the optimal K value.

12. Repeat steps 8, 9 and 10.

## Problem statement

### Dataset

The data set we’ll be using is the Iris Flower Dataset which was first introduced in 1936 by the famous statistician Ronald Fisher and consists of 50 observations from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals.

**Attributes of the dataset:** https://archive.ics.uci.edu/ml/datasets/Iris

**Train the KNN algorithm to be able to distinguish the species from one another given the measurements of the 4 features.**

## Question 1

Import the data set and print 10 random rows from the data set

In [114]:
import pandas as pd
data = pd.read_csv("iris.csv")
data.sample(10)

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 26 | 27 | 5.0 | 3.4 | 1.6 | 0.4 | Iris-setosa |
| 113 | 114 | 5.7 | 2.5 | 5.0 | 2.0 | Iris-virginica |
| 11 | 12 | 4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa |
| 132 | 133 | 6.4 | 2.8 | 5.6 | 2.2 | Iris-virginica |
| 51 | 52 | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
| 88 | 89 | 5.6 | 3.0 | 4.1 | 1.3 | Iris-versicolor |
| 140 | 141 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
| 44 | 45 | 5.1 | 3.8 | 1.9 | 0.4 | Iris-setosa |
| 55 | 56 | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor |
| 54 | 55 | 6.5 | 2.8 | 4.6 | 1.5 | Iris-versicolor |


## Data Pre-processing

## Question 2 - Estimating missing values

*It is not always a good idea to remove records that have missing values, because we may end up losing data points. Instead, we will replace the missing values with an estimate (the column median).*

In [115]:
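# Get the list of numerical columns
list_of_cols = [x for x in data.columns if data[x].dtype!='object']

# Replace missing values in each numerical column with that column's median
# (assigning the result back avoids in-place fillna on a column selection)
for each_col in list_of_cols:
    data[each_col] = data[each_col].fillna(data[each_col].median())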

In [116]:
data.PetalWidthCm[pd.isna(data.PetalWidthCm)]

Series([], Name: PetalWidthCm, dtype: float64)

## Question 3 - Dealing with categorical data

Convert all the class labels to numeric values (0 to 2).

In [117]:
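list_of_uniques = list(data.Species.unique())
dict_of_classes = {}
for idx, each_val in enumerate(list_of_uniques):
    # Map each species name to its position in the list of unique labels
    dict_of_classes[each_val] = idx
    # .loc writes into the DataFrame itself instead of a chained selection
    data.loc[data.Species==each_val, 'Species'] = idx
dict_of_classes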


{'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}


## Question 4

Observe the association of each independent variable with the target variable, and drop from the feature set any variable whose correlation with the target lies in the range -0.1 to 0.1.

In [127]:
data.Species = data.Species.astype('int64')
data.corr()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| Id | 1.0 | 0.702734 | -0.392693 | 0.872346 | 0.890676 | 0.942753 |
| SepalLengthCm | 0.702734 | 1.0 | -0.109369 | 0.87112 | 0.815986 | 0.775061 |
| SepalWidthCm | -0.392693 | -0.109369 | 1.0 | -0.420713 | -0.35651 | -0.417318 |
| PetalLengthCm | 0.872346 | 0.87112 | -0.420713 | 1.0 | 0.962043 | 0.944477 |
| PetalWidthCm | 0.890676 | 0.815986 | -0.35651 | 0.962043 | 1.0 | 0.952513 |
| Species | 0.942753 | 0.775061 | -0.417318 | 0.944477 | 0.952513 | 1.0 |
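
In the correlation output above, every variable's correlation with `Species` lies outside the range -0.1 to 0.1, so this criterion would not drop anything here. For cases where it does, the selection can be sketched as follows. The helper name `low_corr_features` and the toy columns are hypothetical and only illustrate the rule.

```python
import pandas as pd

def low_corr_features(df, target, threshold=0.1):
    """Return feature names whose absolute correlation with `target` is below `threshold`."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target[corr_with_target.abs() < threshold].index.tolist()

# Toy frame (illustrative values): `signal` tracks the target, `noise` does not
toy = pd.DataFrame({
    "signal": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise":  [1.0, -1.0, -1.0, 1.0, 1.0, -1.0],
    "target": [0, 0, 1, 1, 2, 2],
})

to_drop = low_corr_features(toy, "target")
reduced = toy.drop(columns=to_drop)
print(to_drop)
```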


## Question 5

*Observe the variance of each independent variable and drop any variable with zero or almost zero variance (variance < 0.1). Such variables have almost no influence on the classification.*

In [128]:
data.var()

Id               1938.000000
SepalLengthCm       0.676645
SepalWidthCm        0.185552
PetalLengthCm       3.076516
PetalWidthCm        0.577141
Species             0.675322
dtype: float64
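
In the output above, the smallest variance is 0.185552 (`SepalWidthCm`), which is above the 0.1 cutoff, so no variable is dropped on this dataset. A minimal sketch of the filter on a toy frame follows; the column names and values are illustrative assumptions. Note that variance depends on the unit of measurement, so the cutoff only makes sense on the raw, unstandardized columns.

```python
import pandas as pd

# Toy frame (illustrative values): `constant` never changes, `almost_flat` barely does
toy = pd.DataFrame({
    "varied":      [1.0, 3.0, 5.0, 7.0],
    "almost_flat": [2.0, 2.1, 2.0, 2.1],
    "constant":    [4.0, 4.0, 4.0, 4.0],
})

variances = toy.var()  # sample variance per column, the same call as data.var()
low_variance_cols = variances[variances < 0.1].index.tolist()
reduced = toy.drop(columns=low_variance_cols)
print(low_variance_cols)
```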

## Question 6

*Plot the scatter matrix for all the variables.*
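
One possible way to draw the scatter matrix is `pandas.plotting.scatter_matrix`. The sketch below assumes scikit-learn's bundled copy of the Iris data as a stand-in for `iris.csv`, so the column names differ from the ones used in this notebook.

```python
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris(as_frame=True)
features = iris.data  # the four measurement columns

# One scatter panel per pair of features, histograms on the diagonal,
# points coloured by the numeric species code
axes = scatter_matrix(features, c=iris.target, figsize=(8, 8), diagonal="hist")
```

In a notebook the figure is rendered inline after the cell runs. Panels in which the colours separate cleanly point to feature pairs that distinguish the species well.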

## Split the dataset into training and test sets

## Question 7

*Split the dataset into training and test sets with 80-20 ratio.*

In [213]:
from sklearn.model_selection import train_test_split
train,test= train_test_split(data,train_size = 0.8,test_size = 0.2)


## Question 8 - Model

*Build the model and train and test on training and test sets respectively using **scikit-learn**. Print the Accuracy of the model with different values of **k=3,5,9**.*

**Hint:** For accuracy you can check **accuracy_score()** in scikit-learn

In [187]:
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import zscore
from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.metrics import accuracy_score
import numpy as np


In [216]:
y_train = train['Species']
y_test = test['Species']
x_train_z = train[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']].apply(zscore)
x_test_z = test[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']].apply(zscore)
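
The cell above standardizes the training rows and the test rows separately, each with its own mean and standard deviation. A common alternative, sketched here under the assumption that scikit-learn's `StandardScaler` is an acceptable substitute for `scipy.stats.zscore`, is to estimate the scaling from the training rows only and reuse it for the test rows, so that both sets live on the same scale. The bundled Iris data stands in for `iris.csv`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0  # random_state fixed only for repeatability
)

scaler = StandardScaler()
X_train_z = scaler.fit_transform(X_train)  # learn mean and std from the training rows only
X_test_z = scaler.transform(X_test)        # reuse the training mean and std on the test rows

print(X_train_z.mean(axis=0).round(6))  # approximately 0 for every feature
```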

In [222]:
NNH = KNeighborsClassifier(n_neighbors = 3, weights = 'uniform', metric = 'euclidean')
NNH.fit(x_train_z,y_train)
# predict() returns, for each row, the class with the most votes among its neighbours
prediction = NNH.predict(x_train_z)
# accuracy_score() is the fraction of rows predicted correctly
print('Accuracy for k=3 on training data {}'.format(accuracy_score(y_train,prediction)*100))

Accuracy for k=3 on training data 96.69421487603306


In [224]:
# Refit on the test rows, then score those same rows
# (this is resubstitution accuracy on the test set, not held-out accuracy)
NNH.fit(x_test_z,y_test)
prediction = NNH.predict(x_test_z)
print('Accuracy for k=3 on test data {}'.format(accuracy_score(y_test,prediction)*100))

Accuracy for k=3 on test data 96.7741935483871


## Question 9 - Finding Optimal value of k.

Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the **optimal number of neighbours** from that list using the misclassification error.

Hint:

Misclassification error (MSE) = 1 - test accuracy score. Calculate the MSE for each model with neighbours = 1, 3, 5, ..., 19 and find the model with the lowest MSE.

In [228]:
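def classifier(x,label,k):
    # Fit a KNN model on (x, label) and return its predictions for those same rows
    NNH = KNeighborsClassifier(n_neighbors = k, weights = 'uniform', metric = 'euclidean')
    NNH.fit(x,label)
    probablities = NNH.predict_proba(x)
    # For each row, take the index of the class with the highest probability
    prediction = [np.where(each_predicted==max(each_predicted))[0].item() for each_predicted in probablities]
    return(prediction)

def calculate_accuracy(x,predicted_values):
    # Fraction of rows in x whose 'Species' equals the prediction at the same position
    correct_prediction = 0
    incorrect_prediction = 0
    idx = 0
    for each_val in x['Species']:
        if(each_val==predicted_values[idx]):
            correct_prediction+=1
        else:
            incorrect_prediction+=1
        idx+=1

    return(correct_prediction/(correct_prediction+incorrect_prediction))

dict_of_k = {}

# Note: p holds predictions for the training rows, and calculate_accuracy()
# compares the first len(test) of them with the test labels position by position
for k in range(1,21,2):
    p = classifier(x_train_z,y_train,k)
    dict_of_k[k] = calculate_accuracy(test,p)
print(dict_of_k)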

{1: 0.2903225806451613, 3: 0.25806451612903225, 5: 0.25806451612903225, 7: 0.25806451612903225, 9: 0.25806451612903225, 11: 0.25806451612903225, 13: 0.25806451612903225, 15: 0.25806451612903225, 17: 0.25806451612903225, 19: 0.25806451612903225}
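
The accuracies recorded above sit between roughly 0.26 and 0.29, close to chance level for three classes, which is what to expect when predictions for one set of rows are scored against the labels of another. A minimal sketch of the sweep that fits on the training rows and scores predictions for the held-out rows is given below. It assumes scikit-learn's bundled Iris data as a stand-in for `iris.csv`, and the names `error_by_k` and `best_k` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

# Scale with statistics estimated from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_z, X_test_z = scaler.transform(X_train), scaler.transform(X_test)

error_by_k = {}
for k in range(1, 21, 2):  # k = 1, 3, ..., 19
    model = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    model.fit(X_train_z, y_train)        # fit on the training rows
    predicted = model.predict(X_test_z)  # predict the held-out rows
    error_by_k[k] = 1 - accuracy_score(y_test, predicted)

best_k = min(error_by_k, key=error_by_k.get)  # smallest k among those with the lowest error
print(error_by_k)
print("Lowest misclassification error at k =", best_k)
```

With a test set this small, several values of k often tie, so the chosen k can change with the random split.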


## Question 10

*Plot misclassification error vs k (with k value on X-axis) using matplotlib.*
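
A minimal matplotlib sketch of this plot follows. To stay self-contained it reuses the accuracies printed in the Question 9 output as its input; any other mapping from k to accuracy could be substituted.

```python
import matplotlib.pyplot as plt

# Accuracies copied from the Question 9 output (k -> accuracy)
dict_of_k = {1: 0.2903225806451613,
             **{k: 0.25806451612903225 for k in range(3, 21, 2)}}

ks = sorted(dict_of_k)
errors = [1 - dict_of_k[k] for k in ks]  # misclassification error = 1 - accuracy

fig, ax = plt.subplots()
ax.plot(ks, errors, marker="o")
ax.set_xticks(ks)  # one tick per tested value of k
ax.set_xlabel("Number of neighbours k")
ax.set_ylabel("Misclassification error")
ax.set_title("Misclassification error vs k")
```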

### Question 11: Read the data given in the bc2.csv file

### Question 12: Observe the number of records in the dataset and the type of each feature

### Question 13: Use summary statistics to check whether missing-value, outlier and encoding treatment is necessary

### Check Missing Values

### Question 14: Check how many `?` values there are in the Bare Nuclei feature (they are also unknown or missing values). Replace them with the `top` value reported by the describe function for the Bare Nuclei feature.

#### Check include='all' parameter in describe function
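
For object columns, `describe(include='all')` reports `top`, the most frequent value. The sketch below counts the `?` placeholders, replaces them with that value, and converts the column to integers. Because the contents of `bc2.csv` are not shown here, a small toy frame with made-up values stands in for it.

```python
import pandas as pd

# Toy stand-in for bc2.csv (illustrative values; the real file is not shown here)
bc = pd.DataFrame({"Bare Nuclei": ["1", "10", "?", "1", "2", "?", "1"]})

# Count the '?' placeholders
n_unknown = (bc["Bare Nuclei"] == "?").sum()

# `top` in describe() is the most frequent value of an object column
top_value = bc.describe(include="all").loc["top", "Bare Nuclei"]

# Replace the placeholders, then convert the column to integers
bc["Bare Nuclei"] = bc["Bare Nuclei"].replace("?", top_value).astype(int)
print(n_unknown, top_value)
```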

### Question 15: Find the distribution of target variable (Class) 

#### Plot the distribution of target variable using histogram

### Convert the data type of Bare Nuclei to `int`

### Question 16: Standardization of Data

### Question 17: Plot Scatter Matrix to understand the distribution of variables and check if any variables are collinear and drop one of them.

### Question 18: Divide the dataset into feature set and target set

### Divide the data into training and test sets in a 70:30 ratio
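
A minimal sketch of both steps, assuming the target column is named `Class` as in Question 15. The feature columns and all values in the toy frame are hypothetical placeholders for the prepared bc2 data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared bc2 data (illustrative values and feature names)
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "Class": [2, 4] * 5,
})

X = df.drop(columns="Class")  # feature set
y = df["Class"]               # target set

# 70:30 split; stratify keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```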

## Question 19 - Finding Optimal value of k

Run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the **optimal number of neighbours** from that list using the misclassification error.

Hint:

Misclassification error (MSE) = 1 - test accuracy score. Calculate the MSE for each model with neighbours = 1, 3, 5, ..., 19 and find the model with the lowest MSE.

### Question 20: Print the optimal number of neighbors