
# K-nearest neighbours classification
## Instructions:
* Go through the notebook and complete the tasks. 
* Make sure you understand the examples given. If you need help, refer to the Essential readings or the documentation link provided, or go to the Topic 2 discussion forum. 
* When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook. 
* Save your notebooks when you are done.
 
**Task 1:**
Run the cell below to load our data. Notice the last line, where we add some random Gaussian noise to our data to make the task more challenging (data in real life usually contains some form of noise).


In [1]:
%matplotlib inline

from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt

iris = datasets.load_iris()

#view a description of the dataset 
print(iris.DESCR)

#Set X to the samples-by-features matrix and y to the targets
X=iris.data 
y=iris.target 


#we add some random noise to our data to make the task more challenging
X=X+np.random.normal(0,0.4,X.shape)


.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
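
The noise added in Task 1 is drawn at random, so each run of the notebook produces a different `X` and slightly different results in the later tasks. If repeatable results are wanted, one option is to draw the noise from a seeded generator. The sketch below is only an illustration: the seed value, the names `rng`, `X_clean` and `X_noisy`, and the stand-in array used in place of `iris.data` are all assumptions, not part of the original lab.

```python
import numpy as np

# Stand-in for iris.data: same shape (150 samples, 4 features), hypothetical values
X_clean = np.ones((150, 4))

# Hypothetical seed; any fixed integer makes the noise repeatable
rng = np.random.default_rng(42)

# Same noise model as in Task 1: mean 0, standard deviation 0.4
X_noisy = X_clean + rng.normal(0, 0.4, X_clean.shape)

# Re-creating the generator with the same seed reproduces the same noise
rng_again = np.random.default_rng(42)
X_noisy_again = X_clean + rng_again.normal(0, 0.4, X_clean.shape)

print(np.array_equal(X_noisy, X_noisy_again))
```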

**Task 2:**
1.	How many data samples do we have?
2.	Print the value below using shape on ```X``` appropriately. 

In [2]:
#Enter code here
print(X.shape[0])

150


**Task 3:**
1.	How many features do we have?
2.	Print the value below using shape on ```X``` appropriately. 


In [3]:
#Enter code here
print(X.shape[1])

4


**Task 4:**
1.	How many classes do we have?
2.	Print the value below using ```np.unique``` appropriately. 


In [4]:
#Enter code here
print(np.unique(y))

[0 1 2]


**Task 5:**
1.	How many samples do we have that belong to class 1?
2.	Print this in the cell below using the ```np.where``` function appropriately. 


In [10]:
#Enter code here
len(np.where(y==1)[0])

50
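
Tasks 4 and 5 can also be answered in one step, since `np.unique` can return the number of samples per class alongside the class labels. A small sketch on a toy label vector (`y_toy` and its values are made up for illustration):

```python
import numpy as np

# Toy label vector standing in for y (hypothetical values)
y_toy = np.array([0, 1, 1, 2, 1, 0])

# Count the samples of one class with np.where, as in Task 5
n_class1 = len(np.where(y_toy == 1)[0])

# Count every class at once with np.unique
classes, counts = np.unique(y_toy, return_counts=True)

print(n_class1)
print(dict(zip(classes.tolist(), counts.tolist())))
```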

**Task 6:** 

Assume we want to generate a list of shuffled indices of our data. Use the function ```numpy.random.permutation``` to do that. In the cell below, you can already see how to create a list of indices that is not shuffled.


In [14]:
L=list(range(X.shape[0]))
print(L)
#Enter code here
print("\n")

print(np.random.permutation(L))


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149]


[ 50 136  52 114  81  94 146  84  96 120 105  13  82  43  36  35  12  48
  37   1  74  42 149   6 104  51   0  91 125 101  69  31  14  87  16 126
   8 106  20 140 100  44 123 145  59  58  54  93 133  18  65  72  30  29
 112  67  62  86 127  41  22  77  76  88   5  11  98 147 121 144  53  19
  45  32   7 115  27  34 131  55  21 103  60 143  10 137 116 110 
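
One use for shuffled indices like these is to split a dataset into training and test parts by hand. The sketch below assumes an 80/20 split and uses a small toy array rather than the iris data; the names `X_toy`, `y_toy`, `n_test`, `train_idx` and `test_idx` are illustrative only.

```python
import numpy as np

# Toy data: 10 samples with 2 features each, and binary labels (hypothetical)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Shuffled indices 0..9
idx = np.random.permutation(X_toy.shape[0])

# Hold out the first 20% of the shuffled indices for testing
n_test = int(0.2 * len(idx))
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Index features and labels with the same indices so they stay aligned
X_train_toy, y_train_toy = X_toy[train_idx], y_toy[train_idx]
X_test_toy, y_test_toy = X_toy[test_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape)
```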

**Task 7:**
Here is an example of using the k-NN classifier. We split our data into training and test sets (holding out 20% of the samples for testing), fit the classifier on the training data, and evaluate it on the test data. 
Go through the code and make sure you understand it.
Now do the same for the next cell, which prints the confusion matrix and the total accuracy. 
You can find some documentation to help you here: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html. 
Note that for this lab, we use the Euclidean distance along with 10 neighbours.


In [15]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

#split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define knn classifier, with 10 neighbors and use the euclidean distance
knn=KNeighborsClassifier(n_neighbors=10, metric='euclidean')
#fit the classifier on the training data
knn.fit(X_train,y_train)
#predict values for test data based on training data
y_pred=knn.predict(X_test)
#print values
print(y_test) # true values
print(y_pred) # predicted values


[0 2 2 0 2 0 0 1 0 0 2 1 2 0 2 0 0 2 0 0 0 2 2 1 0 0 0 1 1 1]
[0 2 2 0 2 0 0 1 0 0 2 1 2 0 2 0 0 1 0 0 0 2 2 1 0 0 0 2 1 1]


In [16]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))


[[15  0  0]
 [ 0  5  1]
 [ 0  1  8]]
0.9333333333333333
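
To make the idea behind the classifier concrete, here is a minimal NumPy sketch of k-NN prediction: compute the Euclidean distance from a query point to every training point, take the k closest, and predict the majority label. It is a simplified illustration and not scikit-learn's implementation, which may search for neighbours and break ties differently. The function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training points."""
    predictions = []
    for x in X_query:
        # Euclidean distance from x to every training sample
        dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        # Indices of the k closest training samples
        nearest = np.argsort(dists)[:k]
        # Majority vote; with np.argmax, ties go to the smallest label
        votes = np.bincount(y_train[nearest])
        predictions.append(np.argmax(votes))
    return np.array(predictions)

# Toy data: two well-separated clusters (hypothetical values)
X_train_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                        [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_train_toy = np.array([0, 0, 0, 1, 1, 1])
X_query_toy = np.array([[0.05, 0.1], [5.0, 5.1]])

pred_toy = knn_predict(X_train_toy, y_train_toy, X_query_toy, k=3)
print(pred_toy)
```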


**Task 8:**
Write your <b>own</b> functions that return the confusion matrix given the true and predicted labels, as well as the accuracy. To do so, fill in the code in the next two cells. 


In [25]:

#create a matrix with entries equal to zero, and subsequently build the confusion matrix
#the method should return the confusion matrix in a numpy array
def myConfMat(y_ground,y_pred,classno):

  #rows index the true class, columns index the predicted class
  C = np.zeros((classno, classno), dtype=int)
  for i in range(len(y_ground)):
    C[y_ground[i], y_pred[i]] += 1
  #return only after every sample has been counted
  return C

#note: len(np.unique(y))  indicates the dimensions of the confusion matrix (why?)
print(myConfMat(y_test,y_pred, len(np.unique(y))))



[[15  0  0]
 [ 0  5  1]
 [ 0  1  8]]


In [26]:
# model answers

#use the numpy function where to return the accuracy given the true/predicted labels.  i.e., #correct/#total
def myAccuracy(y_ground,y_pred):

  correct = np.where(y_ground==y_pred, 1, 0)
  total = len(y_ground)
  return sum(correct)/total
    
    
print('accuracy: %.2f' % myAccuracy(y_test,y_pred))



accuracy: 0.93


**Optional task:** Write your own functions to calculate class-relative precision and recall. Compare these to the sklearn functions ``precision_score`` and ``recall_score`` on your y_test and y_pred values.

In [27]:
#hint: you can use the output from your myConfMat function above

def myPrecision(y_ground,y_pred):
  classes = np.unique(y_ground)
  precision = np.zeros(classes.shape) 

  C = myConfMat(y_ground, y_pred, len(classes))

  #precision of class i: correct predictions of i over all predictions of i (column sum)
  for i in classes:
    precision[i] = C[i,i] / sum(C[:, i])

  return precision
            
def myRecall(y_ground,y_pred):

  classes = np.unique(y_ground)
  recall = np.zeros(classes.shape) 

  C = myConfMat(y_ground, y_pred, len(classes))

  #recall of class i: correct predictions of i over all true samples of i (row sum)
  for i in classes:
    recall[i] = C[i,i] / sum(C[i,:])

  return recall

print('classes:      %s' % np.unique(y_pred) )    
print('my precision: %s' % myPrecision(y_test,y_pred))
print('my recall:    %s' % myRecall(y_test,y_pred))


classes:      [0 1 2]
my precision: [1.         0.83333333 0.88888889]
my recall:    [1.         0.83333333 0.88888889]


In [23]:
from sklearn.metrics import precision_score, recall_score 
# check that your functions do the same thing as the library versions

print('library precision: %s' % precision_score(y_test,y_pred,average=None))
print('library recall: %s' % recall_score(y_test,y_pred,average=None))



library precision: [1.         0.83333333 0.88888889]
library recall: [1.         0.83333333 0.88888889]
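
Accuracy, precision and recall can all be read off a confusion matrix directly, which gives a quick way to cross-check the numbers above. The sketch below hard-codes the confusion matrix that scikit-learn printed in Task 7 (rows are true classes, columns are predicted classes) and uses vectorised NumPy operations instead of loops; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the Task 7 output (rows: true class, columns: predicted class)
C = np.array([[15, 0, 0],
              [0, 5, 1],
              [0, 1, 8]])

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = np.trace(C) / C.sum()

# Precision per class: diagonal over column sums (all predictions of that class)
precision = np.diag(C) / C.sum(axis=0)

# Recall per class: diagonal over row sums (all true samples of that class)
recall = np.diag(C) / C.sum(axis=1)

print(accuracy)
print(precision)
print(recall)
```

Note that a class that is never predicted has a zero column sum, so its precision is undefined (a division by zero); that case does not arise for this matrix.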
