## <font color='magenta'>Classifying wineries using nearest neighbor</font>

#### Load the `wine` dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = np.loadtxt('wine.data',delimiter=',')
features = data[:,1:]
labels = data[:,0].astype(int)

Now let's check the size of the dataset: the number of data points and the number of features per point.

In [None]:
print(features.shape)

Let's also print out the labels of the points. Notice that they are 1, 2, 3, and that the points are not in random order!


In [None]:
labels

#### Procedure for 1-NN classifier with Euclidean distance

The function `classifier` takes three inputs:
* `x`, a query point
* `features`, a training set of data points
* `labels`, the labels of the training points

It returns the label of the nearest neighbor of `x` in the training set.

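A word about the code: `x-features` may look strange because `x` is a single vector while `features` is an entire data set. NumPy broadcasting subtracts `x` from every row of `features`, and `axis=1` then sums the squared differences across the columns of each row, producing one distance per training point.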

In [None]:
def classifier(x,features,labels):
    dist = np.sqrt(np.sum((x-features)**2,axis=1))
    return labels[np.argmin(dist)]
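
To make the broadcasting in `classifier` concrete, here is a minimal sketch on a made-up toy data set (the arrays below are hypothetical, not taken from the `wine` data). It shows the shape of `x - features` and what `axis=1` does to it.

```python
import numpy as np

# Hypothetical toy data: 3 training points with 2 features each
features = np.array([[0.0, 0.0],
                     [3.0, 4.0],
                     [1.0, 1.0]])
x = np.array([1.0, 2.0])  # a single query point

# Broadcasting: x (shape (2,)) is subtracted from every row of features (shape (3, 2))
diff = x - features
print(diff.shape)

# axis=1 sums across the columns of each row: one squared distance per training point
sq_dist = np.sum(diff**2, axis=1)
dist = np.sqrt(sq_dist)
print(dist)

# Index of the nearest training point
nearest = np.argmin(dist)
print(nearest)
```

Since the square root is monotonic, `np.argmin(sq_dist)` would select the same neighbor as `np.argmin(dist)`.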

#### Procedure for estimating confusion matrix using leave-one-out cross-validation

The procedure `leave_one_out` takes two inputs:
* `features`, a training set of data points
* `labels`, the labels of the training points

It returns a confusion matrix for the 1-NN classifier based on this training set, estimated using leave-one-out cross-validation.

In [None]:
def leave_one_out(features,labels):
    confusion = np.zeros((3,3))
    for i in range(len(features)):
        x = features[i]
        y = labels[i]
        # exclude x from features and labels
        train_features = np.delete(features,i,axis=0)
        train_labels = np.delete(labels,i)
        yhat = classifier(x,train_features,train_labels)
        confusion[y-1,yhat-1] += 1 # labels are 1,2,3 but matrix is zero-indexed
    return confusion

#### Procedure for estimating confusion matrix using k-fold cross-validation

The procedure `k_fold` takes the same inputs as `leave_one_out`, preceded by one additional parameter:
* `k`, the number of folds

In [None]:
def k_fold(k,features,labels):
    confusion = np.zeros((3,3))
    n = features.shape[0]
    for i in range(k):
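        # integer fold boundaries, so the folds are disjoint and together cover all n points
        test_ids = np.arange(i*n//k,(i+1)*n//k)
        train_ids = np.setdiff1d(np.arange(n),test_ids)
        for j in test_ids:
            yhat = classifier(features[j],features[train_ids],labels[train_ids])
            confusion[labels[j]-1,yhat-1] += 1 # rows: true label, columns: predicted label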
    return confusion
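
Because the points are not in random order, folds built from contiguous blocks of indices can hold out points that mostly share one label. One common remedy, sketched below as an assumption rather than as part of this notebook's procedure, is to shuffle the indices before splitting them. The helper name `shuffled_folds` and the `seed` parameter are hypothetical.

```python
import numpy as np

def shuffled_folds(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds after a random shuffle.

    Hypothetical helper, not part of the notebook above.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    # array_split allows folds of unequal size when k does not divide n
    return np.array_split(perm, k)

folds = shuffled_folds(n=178, k=5)
sizes = [len(f) for f in folds]
print(sizes)
```

Each array in `folds` could serve as `test_ids` for one fold, with the remaining indices used for training.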

#### Estimate confusion matrix and accuracy using LOOCV

In [None]:
conf_mat = leave_one_out(features,labels)
accuracy = np.sum(np.diag(conf_mat))/np.sum(conf_mat)
print("confusion matrix:")
print(conf_mat)
print("accuracy: {:.2f}".format(accuracy))
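
As a reading aid for the confusion matrix, the sketch below uses a made-up 3x3 matrix (the numbers are hypothetical, not results from the `wine` data). Rows correspond to true labels and columns to predicted labels, matching how `confusion[y-1,yhat-1]` is filled in above.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true labels, columns are predicted labels
conf_mat = np.array([[8., 1., 1.],
                     [2., 7., 1.],
                     [0., 2., 8.]])

# Overall accuracy: correctly classified points (the diagonal) over all points
accuracy = np.trace(conf_mat) / np.sum(conf_mat)

# Fraction of each true class that was classified correctly
per_class = np.diag(conf_mat) / np.sum(conf_mat, axis=1)

print("accuracy: {:.2f}".format(accuracy))
print(per_class)
```

The per-class fractions can reveal that one class is harder to classify than the others, which the overall accuracy alone hides.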

#### Estimate confusion matrix and accuracy using k-fold cross-validation

In [None]:
k = np.arange(2,100,5)
accuracy = np.zeros(len(k))
for i in range(len(k)):
    cm = k_fold(k[i],features,labels) # compute the confusion matrix once per value of k
    accuracy[i] = np.sum(np.diag(cm))/np.sum(cm)
plt.plot(k,accuracy)
plt.xlabel('k')
plt.ylabel('accuracy')
plt.show()

#### Normalizing the data

The `wine` data has features with widely varying ranges. Will nearest neighbor classification performance improve if we normalize the data so that each feature takes on (roughly) the same range of values? Let's try two normalization methods.
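
To illustrate why feature ranges matter for nearest neighbor, here is a small sketch on made-up points (all values are hypothetical). The second feature spans a much larger range than the first, so it dominates the raw Euclidean distance; after rescaling each feature to [0,1], a different point becomes the nearest neighbor.

```python
import numpy as np

def nearest_index(x, points):
    """Index of the point closest to x in Euclidean distance."""
    return np.argmin(np.sum((x - points)**2, axis=1))

# Hypothetical data: feature 0 spans a small range, feature 1 a large one
points = np.array([[1.0, 1000.0],
                   [5.0, 1040.0],
                   [9.0, 1500.0]])
x = np.array([1.1, 1050.0])

# Nearest neighbor on the raw features
raw = nearest_index(x, points)

# Rescale every feature to [0,1] using the range of the training points
lo, hi = points.min(axis=0), points.max(axis=0)
points_scaled = (points - lo) / (hi - lo)
x_scaled = (x - lo) / (hi - lo)
scaled = nearest_index(x_scaled, points_scaled)

print(raw, scaled)
```

On the raw data the large-range feature decides the neighbor; after rescaling, both features contribute comparably.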

#### Method 1: (x-mean)/std

First let's try normalizing each feature so that it has mean zero and variance one.

In [None]:
# normalize features by subtracting the mean and dividing by the standard deviation
def normalize1(features):
    return (features - np.mean(features,axis=0))/np.std(features,axis=0)
features_norm = normalize1(features)
# rerun the cross validations
conf_mat = leave_one_out(features_norm,labels)
accuracy = np.sum(np.diag(conf_mat))/np.sum(conf_mat)
print("Leave One Out Cross Validation")
print("confusion matrix:")
print(conf_mat)
print("accuracy: {:.2f}".format(accuracy))

print("k-fold Cross Validation")
k = np.arange(2,100,5)
accuracy = np.zeros(len(k))
for i in range(len(k)):
    cm = k_fold(k[i],features_norm,labels) # compute the confusion matrix once per value of k
    accuracy[i] = np.sum(np.diag(cm))/np.sum(cm)
plt.plot(k,accuracy)
plt.xlabel('k')
plt.ylabel('accuracy')
plt.show()

#### Method 2: Map x into [0,1]

Another form of normalization is to linearly remap each feature so that it takes values in the range [0,1].

In [None]:
# normalize features by projecting coordinates into [0,1]
def normalize2(features):
    return (features - np.min(features,axis=0))/(np.max(features,axis=0) - np.min(features,axis=0))
features_norm = normalize2(features)
# rerun the cross validations
conf_mat = leave_one_out(features_norm,labels)
accuracy = np.sum(np.diag(conf_mat))/np.sum(conf_mat)
print("Leave One Out Cross Validation")
print("confusion matrix:")
print(conf_mat)
print("accuracy: {:.2f}".format(accuracy))

print("k-fold Cross Validation")
k = np.arange(2,100,5)
accuracy = np.zeros(len(k))
for i in range(len(k)):
    cm = k_fold(k[i],features_norm,labels) # compute the confusion matrix once per value of k
    accuracy[i] = np.sum(np.diag(cm))/np.sum(cm)
plt.plot(k,accuracy)
plt.xlabel('k')
plt.ylabel('accuracy')
plt.show()
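
As a quick sanity check on the two normalizations, the sketch below applies the same formulas to synthetic data (randomly generated, not the `wine` data) and verifies the properties each method is meant to produce.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 points, 3 features on very different scales
features = rng.normal(loc=[0.0, 50.0, 1000.0],
                      scale=[1.0, 10.0, 300.0],
                      size=(100, 3))

# Method 1: subtract the mean and divide by the standard deviation, per feature
z = (features - np.mean(features, axis=0)) / np.std(features, axis=0)

# Method 2: linearly remap each feature into [0,1]
fmin = np.min(features, axis=0)
fmax = np.max(features, axis=0)
m = (features - fmin) / (fmax - fmin)

print(np.mean(z, axis=0), np.std(z, axis=0))   # means near 0, standard deviations near 1
print(np.min(m, axis=0), np.max(m, axis=0))    # minima 0, maxima 1
```

Both formulas assume that no feature is constant; a constant feature would give a zero denominator.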

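Conclusion: normalization does improve the classification accuracy. This is expected for a nearest neighbor classifier with Euclidean distance: without normalization, the features with the largest ranges dominate the distance.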