# KNN Algorithm - Classification

In this lab, you will learn how to train and test a KNN (k-nearest neighbors) classifier.

### Dataset

You will be using the Iris flower dataset for this lab. The Iris dataset has 4 attributes, i.e., each example in the dataset is a point in 4-dimensional feature space:

 1. sepal length in cm 
 2. sepal width in cm 
 3. petal length in cm 
 4. petal width in cm 




In addition, the dataset has 3 classes or categories of flowers:

1. Iris Setosa 
2. Iris Versicolour 
3. Iris Virginica

Each example therefore belongs to exactly one of the 3 classes listed above. Because there are more than two classes, the task is not a __binary classification problem__ but a __multi-class classification problem__. You can read more about the dataset at: https://archive.ics.uci.edu/ml/datasets/iris
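To make the description above concrete, the following minimal sketch loads the copy of the dataset bundled with scikit-learn and counts the examples per class. The variable names `iris` and `class_counts` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# 150 examples, each described by 4 features
print(iris.data.shape)

# number of examples for each integer class label (0, 1, 2)
class_counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count}")
```

The classes are balanced, with 50 examples each, which is one reason plain accuracy is a reasonable metric for this lab.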

### Imports of necessary packages 

In [1]:
from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split
import numpy as np

Please refer to the following documentation for detailed information about the `numpy` and `scikit-learn` libraries:

- [numpy](https://numpy.org/)
- [scikit-learn](https://scikit-learn.org/stable/)


## 1. Load dataset

In [7]:
# load the iris dataset
iris_data = load_iris() 

# extract input features (data)
X = iris_data.data

# get the label of the dataset
y = iris_data.target

### Let us understand the dataset

#### How many data points are in the dataset?

In [13]:
print('Number of data points in X = ', len(X))
print('Number of data points in y = ', len(y))

Number of data points in X =  150
Number of data points in y =  150


#### What are the names of the features (or columns) in the dataset?

In [3]:
# column names in our dataset/list of features
iris_data.feature_names 

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

#### What are the labels in the dataset?

In [8]:
# list of labels
iris_data.target_names 

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

#### Display the first five input data points 

In [9]:
X[:5,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

#### Display the labels of the first five input data points 

In [11]:
y[:5]

array([0, 0, 0, 0, 0])

In [9]:
"""
What we see as the output below can be interpreted as:

sepal length (cm) -> 5.1
sepal width (cm) -> 3.5
petal length (cm) -> 1.4
petal width (cm) -> 0.2

label -> 0 i.e. 'setosa' (value at index 0 in the array data.target_names)
"""
# print the features of the 1st example/data point from the dataset 
print("Features: ", X[0])
# print the label of the 1st example/data point from the dataset
print("Label: ", y[0])  

Features:  [5.1 3.5 1.4 0.2]
Label:  0
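The integer labels can be translated back into flower names by indexing `target_names` with the label array. The sketch below shows one way to do this; `label_names` and `first_example` are illustrative names that do not appear elsewhere in this lab.

```python
from sklearn.datasets import load_iris

iris_data = load_iris()
X, y = iris_data.data, iris_data.target

# indexing target_names with an array of integer labels returns the matching names
label_names = iris_data.target_names[y]
print(label_names[:5])

# pair each feature name with its value for the first example
first_example = dict(zip(iris_data.feature_names, X[0]))
print(first_example, "->", label_names[0])
```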


## 2. Split the dataset into Train, Test & Validation datasets


### Split the dataset into a train set (80%) and a test set (20%)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) 

In [11]:
# number of examples in the train and test set
print(len(X_train), len(X_test)) 

120 30
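Because `train_test_split` shuffles the data randomly, the rows shown in this lab will differ from run to run. If a repeatable split is wanted, one option is to pass `random_state`, and optionally `stratify` to keep the class proportions equal in both parts. This is a sketch of that variant, not the call used in this lab; the seed value 42 is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class balance in each part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
# number of test examples per class
print(np.bincount(y_test))
```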


In [14]:
print("X_train = ")
print(X_train[0:5])
print('----------------------')
print()
print("y_train = ")
print(y_train[0:5])

X_train = 
[[4.4 3.2 1.3 0.2]
 [4.4 2.9 1.4 0.2]
 [6.7 3.1 4.7 1.5]
 [4.8 3.4 1.9 0.2]
 [5.6 2.9 3.6 1.3]]
----------------------

y_train = 
[0 0 1 0 1]


In [15]:
print("X_test = ")
print(X_test[0:5])
print('----------------------')
print()
print("y_test = ")
print(y_test[0:5])

X_test = 
[[5.4 3.4 1.5 0.4]
 [6.7 3.1 4.4 1.4]
 [6.1 2.6 5.6 1.4]
 [6.7 3.3 5.7 2.5]
 [5.5 4.2 1.4 0.2]]
----------------------

y_test = 
[0 1 2 2 0]


Split the **training** set again into a **training** set and a **validation** set using the same library function, with 20% of the training examples going into the validation set.

In [16]:
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.2
)

In [17]:
# number of examples in the train, validation and test set
print(len(X_train), len(X_validation), len(X_test)) 

96 24 30
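The sizes above follow from applying the 20% fraction twice: first to all 150 examples, then to the 120 that remain. The short sketch below reproduces that arithmetic and shows each part's share of the full dataset. All variable names are illustrative, and since both products are whole numbers here, the rounding rule does not affect the result.

```python
n_total = 150
test_fraction = 0.2
validation_fraction = 0.2  # applied to the examples left after removing the test set

n_test = round(n_total * test_fraction)
n_remaining = n_total - n_test
n_validation = round(n_remaining * validation_fraction)
n_train = n_remaining - n_validation

print(n_train, n_validation, n_test)

# share of the full dataset that ends up in each part
for name, n in [("train", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{name}: {n / n_total:.0%}")
```

The model is therefore trained on 64% of the data, tuned on 16%, and evaluated on the final 20%.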


In [18]:
print("X_train = ")
print(X_train[0:5])
print('----------------------')
print()
print("y_train = ")
print(y_train[0:5])

X_train = 
[[5.6 3.  4.5 1.5]
 [6.1 3.  4.9 1.8]
 [6.6 2.9 4.6 1.3]
 [4.9 2.5 4.5 1.7]
 [5.9 3.  5.1 1.8]]
----------------------

y_train = 
[1 2 1 2 2]


In [19]:
print("X_validation = ")
print(X_validation[0:5])
print('----------------------')
print()
print("y_validation = ")
print(y_validation[0:5])

X_validation = 
[[6.7 3.1 5.6 2.4]
 [6.  3.  4.8 1.8]
 [7.9 3.8 6.4 2. ]
 [5.1 3.4 1.5 0.2]
 [5.8 4.  1.2 0.2]]
----------------------

y_validation = 
[2 2 2 0 0]


## 3. Use scikit-learn to build the KNN model

In [21]:
# import KNN model 
from sklearn.neighbors import KNeighborsClassifier as KNN 
# import the function that computes the accuracy of the model
from sklearn.metrics import accuracy_score 

### Train KNN model

Assume k = 3.

In [22]:
# number of neighbors
K = 3 
# initialize KNN model with K as the number of neighbors
model = KNN(n_neighbors=K) 
# fit the model/train the model
model.fit(X_train, y_train) 


KNeighborsClassifier(n_neighbors=3)
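To see what the fitted model does at prediction time, here is a minimal from-scratch sketch of the idea in NumPy: measure the Euclidean distance from the query to every training point, take the `k` closest, and return the most common label among them. The function `knn_predict` and the toy data are hypothetical and only illustrate the idea; scikit-learn's implementation is more elaborate, for example in how it searches for neighbors.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count the votes per label; argmax picks the lowest label on a tie
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# two well-separated toy clusters with labels 0 and 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```

Note that there is no real training step in this sketch: the "model" is just the stored training data, and all the work happens when a prediction is requested.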

### Get the `predictions` for all examples in the validation set

In [24]:
predictions = model.predict(X_validation)

### Get the `accuracy` of the model on the validation dataset

In [25]:
# accuracy_score expects the true labels first, then the predicted labels
accuracy = accuracy_score(y_validation, predictions)

print(accuracy)

0.9583333333333334
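Accuracy is simply the fraction of predictions that match the true labels. With 24 validation examples, the value 0.9583 corresponds to 23 correct predictions out of 24. The sketch below computes accuracy by hand on a small made-up label array and compares it with `accuracy_score`; the arrays `y_true` and `y_pred` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up labels: 5 of the 6 predictions are correct
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# element-wise comparison gives True/False; the mean is the fraction of True values
manual_accuracy = np.mean(y_true == y_pred)

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))

# one mistake among 24 validation examples
print(23 / 24)
```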


###  Make a prediction on the validation set

In [26]:
i = 5
X_validation[i]

array([5. , 3.6, 1.4, 0.2])

In [27]:
model.predict([X_validation[i]])

array([0])

In [28]:
y_validation[i]

0
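A single prediction can be inspected further by asking the model which neighbors it used. The sketch below uses `kneighbors` and `predict_proba` from `KNeighborsClassifier`. For simplicity it fits on the whole dataset and queries the first example, which is an assumption made for illustration and differs from the train/validation setup used in this lab.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# double brackets keep the 2D shape (1, 4) that scikit-learn expects
query = X[[0]]

# distances to, and row indices of, the 3 nearest training examples
distances, indices = model.kneighbors(query)
neighbor_labels = y[indices[0]]
print(distances[0], indices[0], neighbor_labels)

# fraction of neighbors voting for each class
probabilities = model.predict_proba(query)
print(probabilities)
```

Since the query point is itself part of the fitted data here, its nearest neighbor lies at distance zero, and all three neighbors are setosa examples.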

## Do it yourself
You can see that __K = 3__ gives fairly good accuracy on the validation set, but it may not be the optimal value. Your task now is to **find the best value of K** from a set of candidate values that you define yourself. Repeat the process above for each candidate K and find out which one gives the highest accuracy on the validation set.

Then by using the best value for K, calculate the overall accuracy of the model on the test set.

#### Compute accuracy score for k = 3, 4, 5, 6, 7

In [29]:
# candidate values of k (a list keeps them in a fixed order)
K = [3, 4, 5, 6, 7]
accuracy = {}
for k in K:
    ## write your code 
    
    
# print the accuracy score
print(accuracy)

{3: 0.9583333333333334, 4: 0.9583333333333334, 5: 0.9583333333333334, 6: 0.9166666666666666, 7: 0.9166666666666666}
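One possible way to fill in the loop is sketched below as a self-contained script. It is not the reference solution: the `random_state=0` seeds are an assumption added for repeatability, so the accuracies it prints will generally differ from the ones shown above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# same 80/20 split applied twice, with fixed seeds (an assumption) for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

accuracy = {}
for k in [3, 4, 5, 6, 7]:
    # train a model with k neighbors on the training set only
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # score it on the validation set, never on the test set
    accuracy[k] = accuracy_score(y_validation, model.predict(X_validation))

print(accuracy)
```

The test set is deliberately left untouched in this loop so that it can still give an unbiased estimate once K has been chosen.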


#### Find the `best value of k` based on `accuracy score`

In [31]:
# key with the highest accuracy; on a tie, the first such key in the dict wins
best_k = max(accuracy, key=accuracy.get)

print(best_k)

3
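In the accuracy scores printed earlier, K = 3, 4 and 5 all share the highest value, so the choice of 3 comes from a tie-breaking rule rather than from the data. The sketch below makes the tie visible by listing every K that reaches the maximum, using the accuracy values copied from the output above. Picking the smallest tied K is just one convention, assumed here for illustration.

```python
# validation accuracies copied from the output shown earlier in this lab
accuracy = {
    3: 0.9583333333333334,
    4: 0.9583333333333334,
    5: 0.9583333333333334,
    6: 0.9166666666666666,
    7: 0.9166666666666666,
}

max_accuracy = max(accuracy.values())

# every k that reaches the maximum accuracy
tied_ks = [k for k, value in accuracy.items() if value == max_accuracy]
print(tied_ks)

# one possible tie-break: take the smallest k
best_k = min(tied_ks)
print(best_k)
```

With only 24 validation examples, each one accounts for about 4 percentage points of accuracy, so ties and small differences between neighboring K values should be read with caution.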


#### Compute `accuracy score` on the test dataset

In [32]:
final_model = KNN(n_neighbors=best_k)
final_model.fit(X_train,y_train)

## Write your code


# Print accuracy score on test dataset
print(accuracy_test)

1.0
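A possible way to complete this last step is sketched below as a self-contained script. The fixed seeds and the value `best_k = 3` are assumptions made so that the script runs on its own; in the lab, `best_k` comes from the search above, and the printed accuracy depends on the random split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# recreate the train / validation / test split (fixed seeds are an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

best_k = 3  # assumed here; use the value found on the validation set

final_model = KNeighborsClassifier(n_neighbors=best_k)
final_model.fit(X_train, y_train)

# evaluate once on the held-out test set
test_predictions = final_model.predict(X_test)
accuracy_test = accuracy_score(y_test, test_predictions)
print(accuracy_test)
```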


#### What are your observations?