### For implementing the KNN algorithm for classification, we will be using the Iris-Flower dataset. 

Each example in the dataset has 4 attributes:
1. sepal length in cm 
2. sepal width in cm 
3. petal length in cm 
4. petal width in cm 

So each example is 4-dimentional. The dataset has 3 classes:
1. Iris Setosa 
2. Iris Versicolour 
3. Iris Virginica

So each example falls into one of the 3 above mentioned classes. The classification task here is no longer a __binary classification problem__ but a __multi-class classification problem__. You can read more about the dataset at: https://archive.ics.uci.edu/ml/datasets/iris

In [40]:
from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split
import numpy as np

data = load_iris() #load the iris dataset
X = data.data
y = data.target
print(len(X)) #print number of examples
print(X)

150
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.

In [0]:
#split into train set and test set first by using the library function, 20% of the data goes to the test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2) 

In [32]:
print(len(X_train),len(X_test)) #number of examples in the train and test set

120 30


In [0]:
#split the training set again into training and vaidation set by using the library function, 
#with 20% of the training set examples going inside the validation set
X_train, X_validation, y_train, y_validation = train_test_split(X_train,y_train,test_size=0.2) 

In [34]:
print(len(X_train),len(X_validation),len(X_test)) #number of examples in the train, validation and test set

96 24 30


## Use scikit-learn to build the KNN model

In [0]:
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import accuracy_score

In [28]:
K = 3 #number of neighbors
model = KNN(n_neighbors=K) #initialize KNN model with n as the number of neighbors
model.fit(X_train,y_train) #fit the model/train the model
predictions = model.predict(X_validation) #get the predictions for all examples in the validation set
accuracy = accuracy_score(predictions,y_validation) #get the accuaracy on the validation set by using the built in function accuracy score
print(accuracy)

0.9583333333333334


## Do it yourself
You can see that using __K = 3__, results in a fairly good accuracy in the validation set, but it may not be the optimal value. What you need to do now is to find out the best value for K from a set of values which you must define yourself. Run the above process for each value of K and find out which value of K gives the maximum accuracy on the validation set. 

Then by using the best value for K, calculate the overall accuracy of the model on the test set.

In [29]:
second_predictions =  model.predict(X_test)
acc = accuracy_score(second_predictions, y_test) 
print(acc)

0.9666666666666667


In [35]:
for i in range(1,20):
  model2 = KNN(n_neighbors=i)
  model2.fit(X_train, y_train)
  predictions1 = model.predict(X_validation)
  accuracy1 = accuracy_score(predictions1, y_validation)


  predictions2 = model.predict(X_test)
  accuracy2 = accuracy_score(predictions2, y_test)
  print(accuracy1 , accuracy2)

1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
1.0 0.9333333333333333
