### Loading the datasets
This loads the iris dataset from ```sklearn.datasets``` and uses ```np.genfromtxt``` to load the ionosphere data from a text file. We need to split the information loaded from the file into the samples (data) and their labels (target). Each row represents a labelled sample; the first 34 columns are features, and the last column is the label of the sample.

In [1]:
import numpy as np
from sklearn.datasets import load_iris 

iris = load_iris()
ionosphere_data =  np.genfromtxt("ionosphere.txt", delimiter=",", usecols=np.arange(34))
ionosphere_target = np.genfromtxt("ionosphere.txt", delimiter=",", usecols=34, dtype='int')
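
The two `np.genfromtxt` calls above parse the file twice. As a minimal sketch of an alternative, assuming every column parses as a number, the file could be read once and then sliced into features and labels. The in-memory `raw` text below is a made-up stand-in for `ionosphere.txt`, with two feature columns instead of 34.

```python
import io
import numpy as np

# Made-up stand-in for ionosphere.txt: two feature columns, then the label
raw = io.StringIO("0.5,1.0,1\n0.1,0.2,-1\n0.9,0.8,1\n")

table = np.genfromtxt(raw, delimiter=",")   # read the text once
data = table[:, :-1]                        # every column except the last
target = table[:, -1].astype(int)           # last column holds the labels

print(data.shape, target)
```

Slicing with `[:, :-1]` and `[:, -1]` avoids hard-coding the column count, so the same lines would work for any file whose last column is the label.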

### Splitting datasets into training and test sets

This splits the loaded iris and ionosphere datasets into training and test sets using ```train_test_split``` with its default proportions (75% training, 25% test). The ```random_state``` is 308: it comes from the birthday 0308 with the leading zero dropped.

In [2]:
from sklearn.model_selection import train_test_split

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(iris['data'],
                                                                        iris['target'], 
                                                                        random_state=308)

# print(iris['data'][:3])
# print(iris['target'][:3])

X_train_ionosphere, X_test_ionosphere, y_train_ionosphere, y_test_ionosphere = train_test_split(ionosphere_data,
                                                                                                ionosphere_target, 
                                                                                                random_state=308)

# print(ionosphere_data[:3])
# print(ionosphere_target[:3])
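
To see what the split does, here is a small sketch on toy data (the arrays `X` and `y` are invented for illustration). It checks the 75%/25% sizes, that samples stay paired with their labels, and that reusing the same `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each; row i is [2*i, 2*i + 1], label i
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=308)
print(X_train.shape, X_test.shape)

# Each test sample is still paired with its own label
print(np.array_equal(X_test[:, 0], 2 * y_test))

# The same random_state gives the same split again
_, X_test_again, _, _ = train_test_split(X, y, random_state=308)
print(np.array_equal(X_test, X_test_again))
```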

### Euclidean distance/metric

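To implement the nearest neighbour algorithm, we need to be able to measure the distance between data points. 

Here we use the Euclidean distance (metric) ||x − x*|| between two samples x and x*, that is, the square root of the sum of squared differences between their features.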

In [3]:
import math

def getDistance(training_sample, test_sample):
   
    distance = 0
    for i, j in zip(training_sample, test_sample):
        distance += (i - j) ** 2
       
    return math.sqrt(distance)
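
As a cross-check, the same distance can be computed with NumPy. This is only a sketch: `get_distance_loop` restates the loop above and `get_distance_numpy` is a hypothetical name for a vectorised variant built on `np.linalg.norm`.

```python
import math
import numpy as np

def get_distance_loop(a, b):
    # Same computation as getDistance: square root of summed squared differences
    return math.sqrt(sum((i - j) ** 2 for i, j in zip(a, b)))

def get_distance_numpy(a, b):
    # The Euclidean norm of the difference vector gives the same value
    difference = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.linalg.norm(difference))

a, b = [2, 2, 2], [1, 1, 1]
print(get_distance_loop(a, b), get_distance_numpy(a, b))
```

Both calls should agree up to floating-point rounding; here the distance is the square root of 3.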

### Nearest Neighbour algorithm

One Nearest Neighbour:
   >Given a **training set** and **test sample**, we can predict that the *label of a test sample* will be the *same as the label of the nearest training sample* in the set (in terms of Euclidean distance).

In [4]:
def oneNN(training_set, training_labels, test_sample):
    
    """ Finds the distances between each of the training_samples in the training_set 
        and the test_sample. It then finds the minimum distance (which belongs to the 
        nearest training_sample in the training set) and uses that sample's label as 
        the predicted label of the test sample. Returns a tuple consisting of the nearest 
        neighbour (the training_sample), its distance from the test_sample and the predicted_label. """
    
    training_distances = {}
    
    for training_sample, label in zip(training_set, training_labels):

        distance = getDistance(training_sample, test_sample)
        training_distances[distance] = (training_sample, label)
        
    least_distance = min(training_distances)
    nearest_neighbour, predicted_label = training_distances[least_distance]
    
    return nearest_neighbour, least_distance, predicted_label
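
For larger training sets, the per-sample Python loop can be replaced by array operations. The following is a sketch of that idea, not the notebook's implementation; `one_nn_numpy` is a hypothetical name, and it assumes all features are numeric.

```python
import numpy as np

def one_nn_numpy(training_set, training_labels, test_sample):
    training_set = np.asarray(training_set, dtype=float)
    test_sample = np.asarray(test_sample, dtype=float)

    # Distance from the test sample to every training sample at once
    distances = np.linalg.norm(training_set - test_sample, axis=1)

    # Index of the smallest distance (the first one if several are equal)
    nearest = int(np.argmin(distances))
    return training_set[nearest], float(distances[nearest]), training_labels[nearest]

neighbour, distance, label = one_nn_numpy([[2, 2, 2], [4, 4, 4]], ['a', 'b'], [1, 1, 1])
print(neighbour, distance, label)
```

One difference in behaviour: when two training samples are equally near, `np.argmin` keeps the first of them, whereas a dictionary keyed by distance keeps the last one stored.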

In [5]:
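from collections import Counter
import random

def kNN(training_set, training_labels, test_sample, k):
    
    """ Predicts the label of the test_sample by majority vote among the labels 
        of its k nearest training samples (in terms of Euclidean distance). 
        Ties in the vote are broken at random. """
    
    # Keep (distance, label) pairs in a list: a dictionary keyed by distance
    # would silently drop training samples that are equally far away.
    training_distances = [
        (getDistance(training_sample, test_sample), label)
        for training_sample, label in zip(training_set, training_labels)
    ]
    
    # Sort by distance only, then keep the labels of the k nearest samples
    training_distances.sort(key=lambda pair: pair[0])
    k_nearest_neighbours_labels = [label for _, label in training_distances[:k]]
    
    # Majority vote. statistics.mode no longer raises on ties from Python 3.8,
    # so the tie is detected explicitly here.
    votes = Counter(k_nearest_neighbours_labels)
    most_votes = max(votes.values())
    tied_labels = [label for label, count in votes.items() if count == most_votes]
    
    if len(tied_labels) > 1:
        print('Choosing randomly out of tied labels due to tie in majority vote.')
        
    return random.choice(tied_labels)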

In [6]:
# Test values

ts = [[2, 2, 2],
      [4, 4, 4]]

tl = ['a', 'b']
s = [1, 1, 1]
print(oneNN(ts, tl, s))

test_iris = np.array([6.7, 3, 5.5, 2.1])
print(oneNN(iris['data'], iris['target'], test_iris))

print(kNN(ts, tl, s, 1))
print(kNN(iris['data'], iris['target'], test_iris, 3))

([2, 2, 2], 1.7320508075688772, 'a')
(array([6.8, 3. , 5.5, 2.1]), 0.09999999999999964, 2)
a
2


### Making predictions and finding the accuracy

To make predictions, we need to find the nearest neighbour (and therefore the predicted label) for every test sample in the test set. Once we have a list of all the predicted labels, we can compare it to the actual labels for the test set; to get the accuracy, we find the ratio of the number of correct predictions out of the total predictions.

In [7]:
def getPredictions(X_train, y_train, X_test):
    
    """ Makes and returns a list containing the predicted labels for each of the test_samples 
        in the X_test set. It predicts the labels by using the one Nearest Neighbour algorithm."""
    
    predicted_labels = [
        oneNN(X_train, y_train, test_sample)[2]
        for test_sample in X_test
    ]
        
    return predicted_labels

def getAccuracy(predicted_labels, y_test):
    
    return np.mean(predicted_labels == y_test)
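
A tiny worked example of the accuracy calculation, with made-up labels: comparing the predictions element-wise gives an array of booleans, and the mean of that array is the fraction of correct predictions.

```python
import numpy as np

# Made-up predictions and true labels for four test samples
predicted = [0, 1, 1, 2]
actual = np.array([0, 1, 2, 2])

# Element-wise comparison, then the mean of the resulting booleans
accuracy = np.mean(np.asarray(predicted) == actual)
test_error = 1 - accuracy

print(accuracy, test_error)
```

Three of the four predictions match, so the accuracy is 0.75 and the test error rate is 0.25.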
        

In [23]:
ionosphere_predictions = getPredictions(X_train_ionosphere, y_train_ionosphere, X_test_ionosphere)
ionosphere_accuracy = getAccuracy(ionosphere_predictions, y_test_ionosphere)
ionosphere_test_error = 1 - ionosphere_accuracy


iris_predictions = getPredictions(X_train_iris, y_train_iris, X_test_iris)
iris_accuracy = getAccuracy(iris_predictions, y_test_iris)
iris_test_error = 1 - iris_accuracy



In [32]:
print("Ionosphere test_error_rate: %f%%" %(ionosphere_test_error * 100))
print("Iris test_error_rate: %f%%" %(iris_test_error * 100))

Ionosphere test_error_rate: 14.772727%
Iris test_error_rate: 5.263158%
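
The error rates printed above can be read as counts of misclassified test samples. The sketch below assumes the datasets have 351 (ionosphere) and 150 (iris) rows, which the notebook does not print, and that the default 25% test split is rounded up.

```python
import math

# Assumed dataset sizes (not shown in the notebook)
sizes = {"ionosphere": 351, "iris": 150}

# Error rates as fractions, taken from the printed percentages
reported_error = {"ionosphere": 0.1477272727272727, "iris": 0.052631578947368474}

counts = {}
for name, n in sizes.items():
    n_test = math.ceil(0.25 * n)                    # size of the 25% test set
    n_wrong = round(reported_error[name] * n_test)  # misclassified samples
    counts[name] = (n_test, n_wrong)
    print(name, n_test, n_wrong, n_wrong / n_test)
```

Under these assumptions the rates correspond to 13 errors out of 88 ionosphere test samples and 2 errors out of 38 iris test samples.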


# Ionosphere test_error_rate
14.77272727272727% ≈ 14.77%

# Iris test_error_rate
5.2631578947368474% ≈ 5.26%

### Calculating the conformity score

Split the labelled samples by their class/label. For a given sample, find the distance to the nearest sample of a different class and the distance to the nearest sample of the same class, then divide the former by the latter. This ratio is the conformity score: the larger it is, the better the sample fits its label.

In [10]:
def split_by_label(X_train, y_train):
    
    """ Divide the training set by its label.
        Returns a dictionary of labels mapped to
        a list of corresponding training_samples. """
    
    labels = {}

    for training_sample, label in zip(X_train, y_train):
    
        if label in labels:
            labels[label].append(training_sample)

        else:
            labels[label] = [training_sample]
            
    return labels
            
ionosphere_labels = split_by_label(X_train_ionosphere, y_train_ionosphere)
iris_labels = split_by_label(X_train_iris, y_train_iris)
        
print(list(iris_labels))
print(list(ionosphere_labels))

[1, 0, 2]
[-1, 1]
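
The grouping step could also be written with `collections.defaultdict`, which removes the explicit membership check. This is a sketch of an equivalent variant (the name `split_by_label_defaultdict` is made up here), shown on three toy samples.

```python
from collections import defaultdict

def split_by_label_defaultdict(X_train, y_train):
    # A missing label starts out with an empty list automatically
    groups = defaultdict(list)
    for training_sample, label in zip(X_train, y_train):
        groups[label].append(training_sample)
    return dict(groups)

groups = split_by_label_defaultdict([[0, 3], [2, 2], [-1, 1]], [1, 1, -1])
print(groups)
```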


In [11]:
def get_conformity_score(sample, label, labelled_samples):
    
    """ Finds and returns the distance to the nearest sample of a different class, 
        the distance to the nearest sample of the same class, and the conformity 
        score, found by nearest_different / nearest_same. """
    
    # Get nearest sample of same class/label
    nearest_same = min(
        getDistance(training_sample, sample) 
        for training_sample in labelled_samples[label])
    
    # Get all different classes/labels
    different_labels = list(labelled_samples)
    different_labels.remove(label)
    different_samples = []
    
    # Get all samples of a different class/label
    for diff in different_labels:
        different_samples += labelled_samples[diff]
    
    # Get nearest sample of different class/label
    nearest_different = min(
        getDistance(training_sample, sample) 
        for training_sample in different_samples)
    
    conformity_score = (nearest_different / nearest_same) if (nearest_same != 0) else(
        0 if (nearest_different == 0) else math.inf )
    
    return nearest_different, nearest_same, conformity_score
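
The final expression handles three cases, which may be easier to see in isolation. The helper below is a hypothetical restatement of that one expression (`conformity_ratio` is not part of the notebook), with one call per case.

```python
import math

def conformity_ratio(nearest_different, nearest_same):
    # Usual case: ratio of the two nearest distances
    if nearest_same != 0:
        return nearest_different / nearest_same
    # A same-class sample sits exactly on the sample: score 0 if a
    # different-class sample does too, otherwise infinity
    return 0 if nearest_different == 0 else math.inf

print(conformity_ratio(2.0, 4.0))   # ordinary ratio
print(conformity_ratio(1.0, 0.0))   # identical same-class sample
print(conformity_ratio(0.0, 0.0))   # identical samples in both groups
```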

In [12]:
def getConformityScores(X_train, y_train, test_sample, test_label):
    
    """ Appends the test sample (with its assumed label) to the training set, 
        then returns a list containing the conformity score of every sample in 
        the extended set. Each score is computed against all the other samples, 
        and the test sample's score is the last element of the list. """
    
    conformity_scores = []
    
    # Add test sample and label to array for which conformity scores will be calculated
    training_samples = np.append(X_train, [test_sample], axis=0)
    training_labels = np.append(y_train, [test_label], axis=0)

    for i in range(len(training_samples)):
        
        X_copy = list(training_samples.copy())
        y_copy = list(training_labels.copy())
        
        sample = X_copy.pop(i)
        label = y_copy.pop(i)

        labelled_samples = split_by_label(X_copy, y_copy)

        score = get_conformity_score(sample, label, labelled_samples)
        conformity_scores.append(score[2])
        
    return conformity_scores

In [21]:
# Test values

training_set = np.array([[0, 3],
                    [2, 2],
                    [3, 3],
                    [-1, 1],
                    [-1, -1],
                    [0, 1]])

training_labels = [1, 1, 1, -1, -1, -1]

test_sample = [0, 0]
pred_label = 1

conformities = getConformityScores(training_set, training_labels, test_sample, pred_label)
print(conformities)

[0.8944271909999159, 1.5811388300841895, 2.5495097567963922, 1.4142135623730951, 0.7071067811865476, 1.0, 0.35355339059327373]
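
The last value in the list above is the conformity score of the test sample itself, and it can be checked by hand. A short sketch using the same test values (`math.dist` needs Python 3.8 or later):

```python
import math

training_set = [[0, 3], [2, 2], [3, 3], [-1, 1], [-1, -1], [0, 1]]
training_labels = [1, 1, 1, -1, -1, -1]
test_sample, assumed_label = [0, 0], 1

# Distances to samples that share the assumed label, and to all the others
same = [math.dist(x, test_sample)
        for x, y in zip(training_set, training_labels) if y == assumed_label]
different = [math.dist(x, test_sample)
             for x, y in zip(training_set, training_labels) if y != assumed_label]

score = min(different) / min(same)
print(min(different), min(same), score)
```

The nearest sample with a different label is [0, 1] at distance 1, and the nearest with the same label is [2, 2] at distance √8, so the score is 1/√8 ≈ 0.3536.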


In [1]:
def getPValue(training_set, training_labels, test_sample, pred_label):
    
    """ Returns the p_value for the test sample by finding 
        its rank (using its conformity score). The rank is its 
        position in the list of sorted conformity scores, and the 
        p_value is the rank divided by the number of scores. """
    
    conformities = getConformityScores(training_set, training_labels, test_sample, pred_label)
    
    if conformities[-1] != math.inf:
        temp = conformities.copy()
        temp.sort()
        rank = temp.index(conformities[-1]) + 1
        return rank / len(conformities)
    
    # If the conformity score is infinity, it will have the highest rank,
    # so the p_value is len(conformities) / len(conformities) = 1.
    else:
        return 1.0

In [16]:
getPValue(training_set, training_labels, test_sample, pred_label)

0.14285714285714285
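
This p-value follows directly from the seven conformity scores printed earlier. The sketch below repeats the ranking step on that list: the test sample's score is the smallest, so its rank is 1 out of 7.

```python
# Conformity scores from the earlier output; the last one is the test sample's
conformities = [0.8944271909999159, 1.5811388300841895, 2.5495097567963922,
                1.4142135623730951, 0.7071067811865476, 1.0, 0.35355339059327373]

test_score = conformities[-1]

# Rank = 1-based position of the test score in the sorted list
rank = sorted(conformities).index(test_score) + 1
p_value = rank / len(conformities)

print(rank, p_value)
```

A small p-value like this one says that the assumed label 1 makes the test sample look stranger than every training sample.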

In [33]:
def average_false_p_value(X_train, X_test, y_train, y_test, label_space):
    
    """ Calculates the average false p_value given a 
        test and training set, and all possible labels. """
    
    false_p_values = []
    
    for test_sample, test_label in zip(X_test, y_test):
        
        # Calculate all p_values for each sample
        for label in label_space:
            
            # Only considering false p_values
            if label != test_label:
                p_value = getPValue(X_train, y_train, test_sample, label)
                false_p_values.append(p_value)
        
    # Use a name that differs from the function's to avoid shadowing it
    mean_false_p_value = sum(false_p_values) / len(false_p_values)
    return mean_false_p_value


# Average False P value of ionosphere

0.058109504132231454

In [18]:
ionosphere_labels = [1, -1]
average_false_p_value_ionosphere = average_false_p_value(X_train_ionosphere, 
                                                         X_test_ionosphere, 
                                                         y_train_ionosphere, 
                                                         y_test_ionosphere, 
                                                         ionosphere_labels)
print(average_false_p_value_ionosphere)

0.058109504132231454


# Average False P value of iris

0.011760596180717292

In [20]:
iris_labels = [0, 1, 2]
average_false_p_value_iris = average_false_p_value(X_train_iris, 
                                                  X_test_iris, 
                                                  y_train_iris, 
                                                  y_test_iris, 
                                                  iris_labels)
print(average_false_p_value_iris)

0.011760596180717292
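
To make the averaging in `average_false_p_value` concrete, here is a sketch with invented p-values for two test samples over the label space [0, 1, 2]. Only the p-values of labels other than the true one are kept, so each sample contributes two values.

```python
# Invented p-values per candidate label, one dictionary per test sample
p_values = [
    {0: 0.90, 1: 0.10, 2: 0.05},   # true label is 0
    {0: 0.02, 1: 0.03, 2: 0.80},   # true label is 2
]
true_labels = [0, 2]

# Keep only the p-values of the false labels
false_p_values = [
    p
    for table, true_label in zip(p_values, true_labels)
    for label, p in table.items()
    if label != true_label
]

average = sum(false_p_values) / len(false_p_values)
print(len(false_p_values), average)
```

A low average means that wrong labels tend to receive small p-values, which is what the two results above show for the ionosphere and iris datasets.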
