# Nearest Neighbour

### Nearest Neighbour for iris

First I load iris and split it into training and test sets.

In [5]:
import numpy as np

In [1]:
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn.model_selection import train_test_split
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(iris['data'], iris['target'], random_state=0)

What are the sizes of the training and test sets?

In [2]:
iris_X_train.shape

(112, 4)

In [3]:
iris_X_test.shape

(38, 4)
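The 112/38 split comes from calling train_test_split without a test_size, so the default fraction of 0.25 is used. The sketch below reproduces the arithmetic under the assumption that the test set size is rounded up and the training set takes the remainder; the helper name default_split_sizes is hypothetical and is not part of scikit-learn.

```python
import math

def default_split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)   # assumption: test set size is rounded up
    return n_samples - n_test, n_test           # training set gets the remainder

print(default_split_sizes(150))  # iris has 150 samples
```

The same arithmetic applied to 351 samples gives the ionosphere sizes reported later in this notebook.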

In this section I implement the basic Nearest Neighbour algorithm.

First I write a simple function for computing Euclidean distance and test it.

In [6]:
def dist(x1,x2):
    return np.linalg.norm(x1-x2)
dist(np.array([0,0]),np.array([1,1]))

1.4142135623730951

In [8]:
def sq_length(x,n):
    """Squared Euclidean length of the first n components of x"""
    v = 0
    for i in range(n):
        v = v+x[i]**2
    return v
sq_length([1,2],2)

5

(An alternative to the function np.linalg.norm, which computes the Euclidean norm of a vector, is the function sq_length above. It computes the squared Euclidean norm, but this will not affect the output of the Nearest Neighbour algorithm based on it, since squaring does not change which distance is smallest.)
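To illustrate why ranking by squared distance gives the same nearest neighbour as ranking by distance, here is a small sketch. The points X and the query x are made-up toy data, not taken from iris.

```python
import numpy as np

# Hypothetical toy data: three training points and one query point.
X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
x = np.array([1.0, 2.0])

diffs = X - x                            # broadcasting: one row per training point
sq_dists = np.sum(diffs ** 2, axis=1)    # squared Euclidean distances
dists = np.sqrt(sq_dists)                # Euclidean distances

# The square root is increasing on non-negative numbers,
# so the smallest distance and the smallest squared distance share an index.
nearest_by_dist = int(np.argmin(dists))
nearest_by_sq_dist = int(np.argmin(sq_dists))
print(nearest_by_dist, nearest_by_sq_dist)
```

Both indices agree, so skipping the square root only saves work and never changes the prediction.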

The next function finds the nearest neighbour to x in X. We also have a very primitive test of the code.

In [9]:
import math
def nearest(x,X):
    current_record = math.inf
    for n in range(X.shape[0]):
        current_dist = dist(x,X[n])
        if current_dist < current_record:
            current_record = current_dist
            current_record_holder = n
    return current_record, current_record_holder
nearest(np.array([1,1,0]),np.zeros((3,3)))

(1.4142135623730951, 0)
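The loop in "nearest" can also be written without an explicit Python loop. The following is only a sketch of one possible vectorised variant; the name nearest_vectorized is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def nearest_vectorized(x, X):
    """Return (distance, index) of the row of X closest to x."""
    dists = np.linalg.norm(X - x, axis=1)   # one Euclidean distance per row of X
    idx = int(np.argmin(dists))             # first minimum, like the strict '<' in the loop
    return float(dists[idx]), idx

print(nearest_vectorized(np.array([1, 1, 0]), np.zeros((3, 3))))
```

On the same primitive test it should return the same pair as the loop version, because np.argmin keeps the first of several tied minima.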

Now we can go over the test set, applying the function "nearest" to each test sample. Along the way we count the number of errors and print the index of each error.

In [11]:
import time
start = time.time()
n_iris_test = iris_X_test.shape[0]
n_errors = 0
prediction = np.zeros(n_iris_test,dtype=int)
for n in range(n_iris_test):
    prediction[n] = iris_y_train[nearest(iris_X_test[n],iris_X_train)[1]]
    if prediction[n] != iris_y_test[n]:
        n_errors = n_errors + 1
        print("Error:",n)
print("Number of errors:",n_errors)
print("Error rate:",n_errors / n_iris_test)
print(time.time() - start,"seconds")

Error: 37
Number of errors: 1
Error rate: 0.02631578947368421
0.030513763427734375 seconds


And let's see our predictions and the actual labels.

In [12]:
prediction

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2])

In [13]:
iris_y_test

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1])
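The two arrays can also be compared directly instead of by eye. This sketch copies the printed arrays above and locates the disagreements with np.flatnonzero; the variable name wrong is just for illustration.

```python
import numpy as np

# Arrays copied from the two outputs above.
prediction = np.array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
                       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2])
iris_y_test = np.array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
                        0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1])

wrong = np.flatnonzero(prediction != iris_y_test)   # indices where the labels differ
print("Misclassified test indices:", wrong)
print("Error rate:", wrong.size / prediction.size)
```

The only disagreement is the last test sample, which matches the single error reported by the loop.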

### Nearest Neighbour for ionosphere

Let us now load the ionosphere dataset:

In [15]:
is_data = np.genfromtxt("ionosphere.txt", delimiter=",", usecols=np.arange(34))
is_target = np.genfromtxt("ionosphere.txt", delimiter=",", usecols=34, dtype='int')
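To show what the two usecols arguments do, here is a sketch on a tiny in-memory stand-in for the file. The contents of toy_csv are made up (two feature columns and a label column), so the column indices differ from the 34 features and one label used above.

```python
import io
import numpy as np

# Made-up stand-in for ionosphere.txt: two feature columns, label in the last column.
toy_csv = "0.5,1.0,1\n0.1,-0.3,-1\n0.9,0.2,1\n"

# First pass: read only the feature columns (indices 0 and 1) as floats.
toy_data = np.genfromtxt(io.StringIO(toy_csv), delimiter=",", usecols=np.arange(2))
# Second pass: read only the label column (index 2) as integers.
toy_target = np.genfromtxt(io.StringIO(toy_csv), delimiter=",", usecols=2, dtype="int")

print(toy_data.shape, toy_target)
```

Reading the file twice keeps the features as floats and the labels as integers without any later conversion.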

Splitting it into training and test sets:

In [16]:
is_X_train, is_X_test, is_y_train, is_y_test = train_test_split(is_data, is_target, random_state=0)

What are the sizes of the training and test sets?

In [17]:
is_X_train.shape

(263, 34)

In [18]:
is_X_test.shape

(88, 34)

The program for ionosphere is very similar, but now we do not bother remembering the predictions.

In [19]:
start = time.time()
n_is_test = is_X_test.shape[0]
n_errors = 0
for n in range(n_is_test):
    prediction = is_y_train[nearest(is_X_test[n],is_X_train)[1]]
    if prediction != is_y_test[n]:
        print("Error:",n)
        n_errors = n_errors + 1
print("Number of errors:",n_errors)
print("Error rate:",n_errors / n_is_test)
print(time.time() - start,"seconds")

Error: 2
Error: 6
Error: 9
Error: 34
Error: 42
Error: 47
Error: 50
Error: 54
Error: 55
Error: 69
Error: 75
Error: 83
Error: 85
Number of errors: 13
Error rate: 0.14772727272727273
0.13743853569030762 seconds


### Summary of the results

The results for random state 0 are: 97% accuracy on iris (1 error in 38 test samples) and 85% accuracy on ionosphere (13 errors in 88 test samples).
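The percentages follow from the error counts: 1 misclassified sample out of 38 for iris, and the 13 misclassified indices listed in the ionosphere output out of 88. A minimal sketch of the arithmetic, with a hypothetical helper named accuracy:

```python
def accuracy(n_errors, n_test):
    """Fraction of test samples classified correctly."""
    return 1 - n_errors / n_test

print(f"iris: {accuracy(1, 38):.1%}")
print(f"ionosphere: {accuracy(13, 88):.1%}")
```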