## <font color='magenta'>Classifying spine injuries using nearest neighbor</font>


### <font color='blue'>Step 1: Load in the data, split into training set and test set</font>

Import some standard modules, then load the data.

In [None]:
import numpy as np
from numpy.linalg import norm
# Load the data set and encode labels as 0 = 'NO', 1 = 'DH', 2 = 'SL'
labels = [b'NO', b'DH', b'SL']
data = np.loadtxt('spine-data.txt', converters={6: lambda s: labels.index(s)})
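
Depending on the NumPy version, `np.loadtxt` may hand the converter either `bytes` or `str`, so a lookup against a list of `bytes` labels can fail on some installations. Here is a minimal sketch of a converter that tolerates both. The helper `encode_label` and the two sample rows are made up for illustration; they are not part of the spine data.

```python
import io
import numpy as np

label_names = ['NO', 'DH', 'SL']

def encode_label(s):
    # Hypothetical helper: accept bytes or str and return the label's integer code
    if isinstance(s, bytes):
        s = s.decode()
    return label_names.index(s.strip())

# Two made-up rows in the same layout: six numeric features, then a text label
sample = io.StringIO("1.0 2.0 3.0 4.0 5.0 6.0 DH\n6.0 5.0 4.0 3.0 2.0 1.0 SL\n")
sample_data = np.loadtxt(sample, converters={6: encode_label})
print(sample_data[:, -1])
```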

Hmm, what is the size of this data set?

In [None]:
data.shape

Let's look at a sample data point (the second one). The very last entry is the label.

In [None]:
data[1,]

Create a training set from the first 250 points and a test set from the remaining 60 points.

In [None]:
train_data = data[:250, :data.shape[1] - 1]
test_data = data[-60:, :data.shape[1] - 1]
train_labels = data[:250, data.shape[1] - 1]
test_labels = data[-60:, data.shape[1] - 1]

As a sanity check, let's look at the sizes of the newly-created training and test sets.

In [None]:
print(train_data.shape, train_labels.shape)
print(test_data.shape, test_labels.shape)
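
An equivalent way to write this kind of split uses negative column indices and a single split point. The sketch below runs on a placeholder array, assuming 310 rows and 7 columns as the 250/60 split and the label column suggest; `toy` and the `toy_*` names are hypothetical and stand in for the real data.

```python
import numpy as np

# Placeholder array: 310 rows, 6 feature columns plus a label column
toy = np.arange(310 * 7, dtype=float).reshape(310, 7)

n_train = 250
toy_train_data, toy_train_labels = toy[:n_train, :-1], toy[:n_train, -1]
toy_test_data, toy_test_labels = toy[n_train:, :-1], toy[n_train:, -1]

# The two parts should be disjoint and together cover every row
assert len(toy_train_data) + len(toy_test_data) == len(toy)
print(toy_train_data.shape, toy_train_labels.shape)
print(toy_test_data.shape, toy_test_labels.shape)
```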

### <font color='blue'>Step 2: Procedures for nearest neighbor classification</font>

The `find_NN` procedure takes a data point `x` and a number `norm_order` that specifies which norm to use (1 for L1, 2 for L2). We call `np.linalg.norm` to compute the norms.
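
As a quick illustration of what the `ord` argument selects, here is a small worked example on two made-up vectors (`u` and `v` are placeholders, not data points from the set):

```python
import numpy as np
from numpy.linalg import norm

u = np.array([1.0, 2.0])
v = np.array([4.0, 6.0])

# L1 (Manhattan) distance: |1 - 4| + |2 - 6| = 7
print(norm(u - v, ord=1))
# L2 (Euclidean) distance: sqrt(3^2 + 4^2) = 5
print(norm(u - v, ord=2))
```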

In [None]:
## Takes a vector x and returns the index of its nearest neighbor in train_data
def find_NN(x, norm_order):
    # Compute distances from x to every row in train_data
    distances = [norm(x - train_data[i,], ord=norm_order) for i in range(len(train_labels))]
    # Get the index of the smallest distance
    return np.argmin(distances)

## Takes a vector x and returns the class of its nearest neighbor in train_data
def NN_classifier(x, norm_order):
    # Get the index of the nearest neighbor
    index = find_NN(x, norm_order)
    # Return its class
    return train_labels[index]
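
The list comprehension in `find_NN` can also be replaced by a single call to `norm` with `axis=1`, which computes one distance per row. This is only a sketch of that alternative: `find_NN_vectorized` is a hypothetical variant that takes the training points as an argument instead of reading a global, and the toy arrays are made up.

```python
import numpy as np
from numpy.linalg import norm

def find_NN_vectorized(x, points, norm_order):
    # Hypothetical variant: `points` is passed in rather than read from a global
    # Broadcasting subtracts x from every row; axis=1 gives one norm per row
    distances = norm(points - x, ord=norm_order, axis=1)
    return np.argmin(distances)

toy_points = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 2.0]])
toy_point_labels = np.array([0, 1, 2])
query = np.array([1.0, 1.0])

nearest = find_NN_vectorized(query, toy_points, 2)
print(nearest, toy_point_labels[nearest])
```

On larger training sets this form is usually faster, since the loop over rows happens inside NumPy rather than in Python.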

### <font color='blue'>Step 3: Run the code and get classifications and error rates</font>

Compute error rates for nearest neighbor classification using L1 distance and using L2 distance.

In [None]:
# do NN classification for both distances
l1_test_predictions = [NN_classifier(test_data[i,], 1) for i in range(len(test_labels))]
l2_test_predictions = [NN_classifier(test_data[i,], 2) for i in range(len(test_labels))]

# print errors
l1_err_positions = np.not_equal(l1_test_predictions, test_labels)
l1_error = float(np.sum(l1_err_positions))/len(test_labels) 

print(f"l1 total misclassified was {np.sum(l1_err_positions)}")
print(f"l1 error was {l1_error}")

l2_err_positions = np.not_equal(l2_test_predictions, test_labels)
l2_error = float(np.sum(l2_err_positions))/len(test_labels) 

print(f"l2 total misclassified was {np.sum(l2_err_positions)}")
print(f"l2 error was {l2_error}")
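
Since the same error computation is repeated for each distance, it could be wrapped in a small helper. A minimal sketch, where `error_rate` and the toy inputs are hypothetical names introduced only for this example:

```python
import numpy as np

def error_rate(predictions, true_labels):
    # Fraction of positions where the prediction differs from the true label
    return np.mean(np.not_equal(predictions, true_labels))

# Made-up predictions and labels: one mismatch out of four
toy_predictions = [0, 1, 2, 2]
toy_truth = np.array([0, 1, 1, 2])
print(error_rate(toy_predictions, toy_truth))
```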

Finally, compute the confusion matrices. Rows index the true label and columns index the predicted label.

In [None]:
# compute confusion matrices for both distances
l1_confusion = np.zeros((3, 3))
for i in range(len(test_labels)):
    l1_confusion[int(test_labels[i]), int(l1_test_predictions[i])] += 1

l2_confusion = np.zeros((3, 3))
for i in range(len(test_labels)):
    l2_confusion[int(test_labels[i]), int(l2_test_predictions[i])] += 1

print("l1 confusion matrix:")
print(l1_confusion)

print("l2 confusion matrix:")
print(l2_confusion)
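
The counting loop can also be expressed without an explicit Python loop. The sketch below uses `np.add.at`; the function name `confusion_matrix` and the toy labels are assumptions made for this example, not part of the notebook.

```python
import numpy as np

def confusion_matrix(true_labels, predictions, n_classes=3):
    # Rows are true classes, columns are predicted classes
    counts = np.zeros((n_classes, n_classes), dtype=int)
    rows = np.asarray(true_labels).astype(int)
    cols = np.asarray(predictions).astype(int)
    # np.add.at accumulates correctly even when an index pair repeats
    np.add.at(counts, (rows, cols), 1)
    return counts

# Made-up labels: six points, two of them misclassified
toy_truth = [0, 1, 1, 2, 2, 2]
toy_predictions = [0, 1, 2, 2, 2, 0]
cm = confusion_matrix(toy_truth, toy_predictions)
print(cm)
# Off-diagonal entries are the misclassifications
print(cm.sum() - np.trace(cm))
```

The diagonal holds the correctly classified points, so the total minus the trace should match the misclassification count printed in the error-rate cell.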