This assignment considers data on the occupancy of buildings, which is relevant for potential energy savings. The data are treated in three steps: classification is performed using a nearest neighbor algorithm, model selection is performed using cross-validation, and lastly the data are standardized (zero mean, unit variance) as a preprocessing step.

The data are split into a training set for fitting a model and a test set for measuring the accuracy of the model. In each data set the first five columns contain data on temperature, relative humidity, light, CO2 and humidity ratio and serve as the X-values. The sixth and last column contains the occupancy data and serves as the y-values.

In [1]:
#import packages
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

In [2]:
#load data
#forward slashes avoid the invalid '\O' escape sequence and work on all platforms
dataTrain = np.loadtxt('./OccupancyTrain.csv', delimiter=',')
dataTest = np.loadtxt('./OccupancyTest.csv', delimiter=',')

#split input variables and labels
XTrain = dataTrain[:, :-1]
YTrain = dataTrain[:, -1]
XTest = dataTest[:, :-1]
YTest = dataTest[:, -1]

# Exercise 1 {-}

As mentioned above, classification of the data is performed using a nearest neighbor classifier. The classification function is defined below. It assigns a class label to each query point based on the labels of its k nearest neighbors in the training set, choosing the label that occurs most often among them. The k nearest neighbors are the k training points with the shortest Euclidean distance to the point in question.
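Written out, the distance and the majority vote take the following form for a query point $x$ and a training point $x^{(n)}$ with $d$ features (here $d = 5$). The symbols $N_k(x)$, for the index set of the k nearest training points, and $\hat{y}(x)$, for the predicted label, are notation introduced here for illustration only.

$$
d\bigl(x, x^{(n)}\bigr) = \sqrt{\sum_{m=1}^{d} \bigl(x_m - x^{(n)}_m\bigr)^2},
\qquad
\hat{y}(x) = \arg\max_{c} \sum_{n \in N_k(x)} \mathbf{1}\bigl[y^{(n)} = c\bigr]
$$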

In [3]:
#define k-NN function
def kNN(XTrain, YTrain, XTest, k):
    #np.bincount requires non-negative integer labels, so cast the float labels
    labels = np.asarray(YTrain).astype(int)
    outputs = []
    for x in XTest:
        #Euclidean distance from x to every training point
        distances = np.sqrt(np.sum((XTrain - x)**2, axis=1))
        #labels of the k nearest neighbors
        k_nearest = labels[np.argsort(distances)[:k]]
        #choose the label most common among the k nearest neighbors
        outputs.append(np.argmax(np.bincount(k_nearest)))

    return np.array(outputs)

The accuracy of the classification function is tested below for k = 1, i.e. where the label of the single nearest neighbor alone determines the label of a data point. The accuracy is measured by comparing the labels predicted by the k-NN function to the actual recorded labels.

In [4]:
#classification accuracy
accTrain = accuracy_score(YTrain, kNN(XTrain, YTrain, XTrain, 1))
accTest = accuracy_score(YTest, kNN(XTrain, YTrain, XTest, 1))
print('Training results of 1-NN classifier: ' + str(accTrain))
print('Test results of 1-NN classifier: ' + str(accTest))

Training results of 1-NN classifier: 1.0
Test results of 1-NN classifier: 0.9775


The training accuracy of the 1-NN classifier implemented above is 1.0. This is expected: when the training set is also used as the query set, the nearest neighbor of each data point is the point itself (at distance zero), and 1-NN considers only that single neighbor. The test accuracy is 0.9775, so the model predicts the correct label for 97.75% of the points in the test set.
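The argument for the perfect training accuracy can be checked on a small example. The sketch below uses four made-up 2D points (not the occupancy data) and confirms that each point's nearest training point is itself. It assumes all points are distinct; duplicated points with different labels could break the guarantee.

```python
import numpy as np

# Toy training set (made-up values): four distinct points in 2D.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])

# Pairwise Euclidean distances between all training points.
diffs = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=2))

# Index of the nearest training point for each training point.
nearest = dist.argmin(axis=1)
print(nearest)
```

Since every point finds itself first, 1-NN simply returns each training point's own label.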

# Exercise 2 {-}

In exercise 1 the accuracy of the nearest neighbor classification was tested for k = 1. However, another choice of the hyperparameter k might give a higher accuracy. To determine the best hyperparameter, cross-validation is used. Specifically, 5-fold cross-validation is used, meaning that the training data is split into 5 parts; in each of 5 rounds, 4 parts act as the new training data and the remaining part acts as the new test data, so that every part is held out exactly once. The best value, k_best, of the hyperparameter k is chosen from {1,3,5,7,9,11}.

In [5]:
#model selection using cross-validation
def model_selection(k_values, XTrain, YTrain):
    accuracies = []
    #splitting training data into 5 parts
    cv = KFold(n_splits=5)
    #iterate over the argument k_values rather than the global list ks
    for k in k_values:
        k_specific_accuracies = []
        for train_index, test_index in cv.split(XTrain):
            XTrainCV, XTestCV = XTrain[train_index], XTrain[test_index]
            YTrainCV, YTestCV = YTrain[train_index], YTrain[test_index]
            #test accuracy of kNN for given k on the held-out part
            k_specific_accuracies.append(accuracy_score(YTestCV, kNN(XTrainCV, YTrainCV, XTestCV, k)))
        accuracies.append(np.round(np.mean(k_specific_accuracies),4))

    #find best k as the one with highest mean accuracy
    best_k = k_values[np.argmax(accuracies)]
    return best_k

ks = [1,3,5,7,9,11]
print(model_selection(ks, XTrain, YTrain))


3


For each value of k the training data was split into five parts. For each of the five rounds XTrainCV, XTestCV, YTrainCV and YTestCV were defined, and the kNN function was run with XTrainCV, YTrainCV, XTestCV and k as input to predict labels for XTestCV. The accuracy for each round was then computed by comparing the output of the kNN function to YTestCV, and the average accuracy for the given k was computed as the mean over the five rounds. The procedure was repeated for each k-value, and the k-value with the highest average accuracy (lowest 0-1 loss) was chosen as k_best.
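The splitting step of this procedure can be sketched without scikit-learn. The example below partitions ten hypothetical sample indices into five contiguous, unshuffled folds with `np.array_split`, which is meant to mirror what `KFold(n_splits=5)` does with its default settings; the variable names and the sample count are assumptions made for illustration.

```python
import numpy as np

n_samples = 10   # hypothetical number of training points
n_folds = 5

# Contiguous, unshuffled folds over the sample indices.
indices = np.arange(n_samples)
folds = np.array_split(indices, n_folds)

splits = []
for i, val_idx in enumerate(folds):
    # Every index that is not in the held-out fold is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))
    print("fold", i, "train:", train_idx, "validation:", val_idx)
```

Each index lands in exactly one validation fold, so every training point contributes to the accuracy estimate exactly once.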

The value of k_best was computed to be 3, so according to the cross-validation, taking the 3 nearest neighbors into consideration gives the most accurate classifier among the candidate values for this dataset.

# Exercise 3 {-}

In exercise 2 the value of the hyperparameter k_best was found to be 3. The k_best-NN classifier is now built from the complete training set, and its accuracy is measured on both the training set and the test set.

In [6]:
#training and test accuracy of k_best
k_best = 3
accTrain_kbest = accuracy_score(YTrain, kNN(XTrain, YTrain, XTrain, k_best))
accTest_kbest = accuracy_score(YTest, kNN(XTrain, YTrain, XTest, k_best))
print('Training results of k_best-NN classifier: ' + str(accTrain_kbest))
print('Test results of k_best-NN classifier: ' + str(accTest_kbest))

Training results of k_best-NN classifier: 0.9933333333333333
Test results of k_best-NN classifier: 0.9875


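The training accuracy of k_best = 3 is 0.9933 and the test accuracy is 0.9875. Compared to the 1-NN classifier, the training accuracy is lower (0.9933 versus 1.0), since a point's own label can now be outvoted by its two other neighbors, while the test accuracy is higher (0.9875 versus 0.9775).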

# Exercise 4 {-}

As a preprocessing step normalization of the data is now performed. The data is normalized to generate zero-mean, unit variance input data. 
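As a worked formula, feature $j$ of data point $i$ is transformed as shown below, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ over the $N$ training points. The index notation is introduced here for illustration; the $1/N$ form of the standard deviation matches the default of `np.std`. The same $\mu_j$ and $\sigma_j$ are applied to the test set.

$$
\tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl(x_{ij} - \mu_j\bigr)^2}
$$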

In [7]:
#normalize training and test data using the mean and standard deviation of the training data
mu = np.mean(XTrain, axis=0)
sigma = np.std(XTrain, axis=0)
XTrainN = (XTrain - mu) / sigma
XTestN = (XTest - mu) / sigma

As the data have now been normalized, the selection of the best value k_best of the hyperparameter k is repeated.

In [8]:
#model-selecting using normalized training data
ks = [1,3,5,7,9,11]
print(model_selection(ks, XTrainN, YTrain))

3


Having determined the k_best value for the normalized data the accuracy testing of the k_best-NN model is also repeated. 

In [9]:
#training and test accuracy of k_best for normalized data
k_best = 3
accTrain_kbest_N = accuracy_score(YTrain, kNN(XTrainN, YTrain, XTrainN, k_best))
accTest_kbest_N = accuracy_score(YTest, kNN(XTrainN, YTrain, XTestN, k_best))
print('Training results of k_best-NN classifier for normalized data: ' + str(accTrain_kbest_N))
print('Test results of k_best-NN classifier for normalized data: ' + str(accTest_kbest_N))

Training results of k_best-NN classifier for normalized data: 0.9933333333333333
Test results of k_best-NN classifier for normalized data: 0.9875


After normalization of the data k_best is still found to be 3 through cross-validation. The training accuracy of k_best = 3 is still 0.9933 and the test accuracy of k_best = 3 is still 0.9875, so the accuracies have not changed after normalization of the data. 

Considering the three different ways one could have applied the preprocessing from scikit-learn:
Version 1 is the correct version for this use. In version 1 the mean and standard deviation of the training set are used to normalize both the training set and the test set.
In version 2 and version 3 the means and standard deviations of both the training set and the test set are used for normalization. This is flawed, since no information from the test set should be used for cross-validation and model selection; otherwise the test accuracy is no longer an unbiased estimate of performance on unseen data.
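The effect of letting test data into the normalization can be illustrated with a small sketch. The example below uses made-up one-feature data and compares statistics computed on the training set alone with statistics computed on training and test data pooled together. Pooling is only one way in which test-set information could enter, and it is not claimed to reproduce version 2 or version 3 exactly.

```python
import numpy as np

# Made-up one-feature data: the test values sit in a different range.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [12.0]])

# Train-only statistics: nothing about the test set is used.
mu_train = X_train.mean(axis=0)
sigma_train = X_train.std(axis=0)

# Pooled statistics: the test values shift both mean and standard deviation.
X_all = np.vstack([X_train, X_test])
mu_all = X_all.mean(axis=0)
sigma_all = X_all.std(axis=0)

X_train_ok = (X_train - mu_train) / sigma_train
X_train_leaky = (X_train - mu_all) / sigma_all

print("train-only mean:", mu_train, "pooled mean:", mu_all)
print("train-only normalized:", X_train_ok.ravel())
print("pooled normalized:    ", X_train_leaky.ravel())
```

With pooled statistics the normalized training inputs depend on the test values, so any model selected on them has indirectly seen the test set.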