## Lab 5 - People in a Room
#### A K-Nearest Neighbor Algorithm Lab by Nikhil Deo

#### Introduction
This lab uses a k-nearest neighbor algorithm to predict whether or not there is a person in a room. The dataset, provided by Emma Anderson, records measurements of the room's environment, including humidity, CO2, and light. The lab addresses two questions: whether the number of nearest neighbors (k) affects the accuracy, and how valuable each attribute is to the accuracy of the algorithm.
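As a quick illustration of the idea before the full lab code, here is a minimal sketch of k-nearest neighbor classification on a handful of made-up readings. The names `euclidean_distance`, `knn_predict`, and `toy_training`, and all of the values, are assumptions for illustration only; they are not part of the lab's dataset or code.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training_rows, query, k):
    """Predict the label of `query` by majority vote among its k nearest rows.

    Each training row is a (features, label) pair.
    """
    nearest = sorted(training_rows,
                     key=lambda row: euclidean_distance(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up (light, CO2) readings; 1 = occupied, 0 = empty.
toy_training = [
    ((450.0, 900.0), 1),
    ((430.0, 850.0), 1),
    ((470.0, 950.0), 1),
    ((0.0, 440.0), 0),
    ((5.0, 450.0), 0),
    ((10.0, 430.0), 0),
]

print(knn_predict(toy_training, (440.0, 880.0), k=3))  # query near the occupied cluster
print(knn_predict(toy_training, (3.0, 445.0), k=3))    # query near the empty cluster
```

The query is assigned whichever label is most common among its k closest training rows, which is the same vote the lab's getresponse function performs.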

In [1]:
import csv
import random
import math
import operator

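def euclidianDistance(item1, item2, attributes):
    # Euclidean distance over the leading feature columns.
    # NOTE: getNeighbors passes attributes = len(row) - 1 (the label is
    # already excluded), so range(attributes-2) also leaves the last two
    # feature columns out of the distance.
    distance = 0
    for x in range(attributes-2):
        distance+=(item1[x]-item2[x])**2
    return math.sqrt(distance)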

def getNeighbors(trainingset, test, k):
    distances = []
    length = len(test) - 1
    for x in range(len(trainingset)):
        dist = euclidianDistance(test, trainingset[x], length)
        distances.append((trainingset[x], dist))
    # sort on distance, not datapoint
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getresponse(neighbors):
    classvotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classvotes:
            classvotes[response]+=1
        else:
            classvotes[response] = 1
    sortedvotes = sorted(classvotes.items(), key=operator.itemgetter(1), reverse = True)
    return sortedvotes[0][0]

def getaccuracy(testset, predictions):
    correct = 0
    for x in range(len(testset)):
        if testset[x][-1] == predictions[x]:
            correct+=1
    return (correct/float(len(testset)))*100.0

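def loadDataset(filename, split, trainingset, testset):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        # drop the header row
        del dataset[0]
        # drop the first two columns of every row (the date/time columns,
        # per the discussion in Part 2)
        for z in range(len(dataset)):
            del dataset[z][0]
            del dataset[z][0]
        # NOTE: this range skips the first data row and the last two rows
        for x in range(1, len(dataset)-2):
            # convert every remaining value, including the label, to float
            for y in range(0, len(dataset[0])):
                dataset[x][y] = float(dataset[x][y])
            # send roughly `split` of the rows to training, the rest to testing
            if random.random() < split:
                trainingset.append(dataset[x])
            else:
                testset.append(dataset[x])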
                
def main(k, dictionary):
    trainingset = []
    testset = []
    split = 0.9
    loadDataset('datatraining.txt', split, trainingset, testset)
    predictions = []
    for x in range(len(testset)):
        neighbors = getNeighbors(trainingset, testset[x], k)
        result = getresponse(neighbors)
        predictions.append(result)
#        print('Predictions:' + str(result) + 'Actual:' + str(testset[x][-1]))
    accuracy = getaccuracy(testset, predictions)
#    print(k, 'Accuracy:' + str(accuracy) + '%')
    # store the accuracy as a number; the key is the tuple ("k =", k)
    dictionary["k =", k] = accuracy

accuracyList = {}
for x in range(3, 10):
    main(x, accuracyList)

### Part 1: Does the value of 'k' change the accuracy of the algorithm?

In [2]:
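# Find the k-value with the highest accuracy in this run.
bestAccuracy = 0
bestKey = None
for majorkey in accuracyList:
    if float(accuracyList[majorkey]) > float(bestAccuracy):
        bestAccuracy = accuracyList[majorkey]
        bestKey = majorkey

print(bestAccuracy)
print(bestKey)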
print("")
print("Dictionary accuracyList:")
for x in accuracyList:
    print(x, ":", accuracyList[x])

99.15458937198068
('k =', 3)

Dictionary accuracyList:
('k =', 3) : 99.15458937198068
('k =', 4) : 98.33729216152018
('k =', 9) : 98.95697522816167
('k =', 7) : 98.55421686746988
('k =', 8) : 99.12935323383084
('k =', 6) : 98.2843137254902
('k =', 5) : 98.64532019704434


#### Discussion
The accuracy of the algorithm varies somewhat between k-values, but there is no trend. After running the code a number of times, I found no value of 'k' that consistently makes the algorithm more accurate than the rest. The code above stores the accuracy of the algorithm for 7 values of k (3 through 9) in a dictionary and compares them to find the k-value with the highest accuracy. Essentially every time I ran the code, a different k-value came out on top. Note that main() reloads the data and draws a new random 90/10 training/test split for each k, so part of the variation between k-values comes from the split rather than from k itself. Additionally, each value of k yields a similar accuracy percentage: every k-value returned an accuracy between 98% and 99.8%. This suggests that, at least for k from 3 to 9, the number of neighbors has little effect on the accuracy of the algorithm. The accuracy also remains exceptionally high throughout, with no k-value dropping below 98% in any of my tests.
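One way to separate the effect of k from the effect of the random split is to repeat the evaluation several times per k and compare means and spreads. The sketch below does this on made-up two-cluster data, not on the lab's dataset; `summarize_by_k`, `knn_accuracy`, `make_toy_rows`, and the seeds are all hypothetical names and values chosen for illustration, and the numbers it prints say nothing about the real data.

```python
import math
import random
import statistics
from collections import Counter

def distance(a, b):
    """Euclidean distance between two equal-length feature lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_accuracy(rows, k, rng, split=0.9):
    """Split rows ([features..., label]) at random; return test accuracy in percent."""
    training, test = [], []
    for row in rows:
        (training if rng.random() < split else test).append(row)
    correct = 0
    for query in test:
        nearest = sorted(training,
                         key=lambda r: distance(r[:-1], query[:-1]))[:k]
        vote = Counter(r[-1] for r in nearest).most_common(1)[0][0]
        if vote == query[-1]:
            correct += 1
    return 100.0 * correct / len(test)

def summarize_by_k(rows, k_values, trials, seed=0):
    """Return {k: (mean accuracy, standard deviation)} over repeated random splits."""
    rng = random.Random(seed)
    summary = {}
    for k in k_values:
        scores = [knn_accuracy(rows, k, rng) for _ in range(trials)]
        summary[k] = (statistics.mean(scores), statistics.stdev(scores))
    return summary

def make_toy_rows(n, seed=1):
    """Made-up data: two noisy clusters standing in for 'empty' and 'occupied'."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 400.0 if label == 1 else 0.0
        rows.append([rng.gauss(center, 80.0), rng.gauss(center, 80.0), label])
    return rows

toy_rows = make_toy_rows(300)
summary = summarize_by_k(toy_rows, range(3, 10), trials=10)
for k, (mean, spread) in summary.items():
    print(f"k = {k}: mean {mean:.2f}%, stdev {spread:.2f}")
```

If the difference between two mean accuracies is smaller than the spread across splits, the data gives no reason to prefer one k over the other.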

### Part 2: Can some attributes be removed while still having an accuracy greater than 90%? Which attribute seems to be the most important?

In [3]:
import csv
import random
import math
import operator

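def euclidianDistance(item1, item2, attributes):
    # Euclidean distance over the leading feature columns.
    # NOTE: getNeighbors passes attributes = len(row) - 1 (the label is
    # already excluded), so range(attributes-2) also leaves the last two
    # feature columns out of the distance.
    distance = 0
    for x in range(attributes-2):
        distance+=(item1[x]-item2[x])**2
    return math.sqrt(distance)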

def getNeighbors(trainingset, test, k):
    distances = []
    length = len(test) - 1
    for x in range(len(trainingset)):
        dist = euclidianDistance(test, trainingset[x], length)
        distances.append((trainingset[x], dist))
    # sort on distance, not datapoint
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getresponse(neighbors):
    classvotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classvotes:
            classvotes[response]+=1
        else:
            classvotes[response] = 1
    sortedvotes = sorted(classvotes.items(), key=operator.itemgetter(1), reverse = True)
    return sortedvotes[0][0]

def getaccuracy(testset, predictions):
    correct = 0
    for x in range(len(testset)):
        if testset[x][-1] == predictions[x]:
            correct+=1
    return (correct/float(len(testset)))*100.0

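def loadDataset(filename, split, trainingset, testset, remAttribute, whichAttribute):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        # record the name of the attribute being removed, read from the header.
        # NOTE: this name matches the column deleted below only if the header
        # row has the same two leading columns as the data rows.
        whichAttribute.append(dataset[0][remAttribute+2])
        # drop the header row
        del dataset[0]
        # drop the first two columns of every row (the date/time columns)
        for z in range(len(dataset)):
            del dataset[z][0]
            del dataset[z][0]
        # drop the attribute under test from every row
        for z in range(len(dataset)):
            del dataset[z][remAttribute]
        # NOTE: this range skips the first data row and the last two rows
        for x in range(1, len(dataset)-2):
            # convert every remaining value, including the label, to float
            for y in range(0, len(dataset[0])):
                dataset[x][y] = float(dataset[x][y])
            # send roughly `split` of the rows to training, the rest to testing
            if random.random() < split:
                trainingset.append(dataset[x])
            else:
                testset.append(dataset[x])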
                
def main(remAttribute, dictionary):
    trainingset = []
    testset = []
    split = 0.9
    whichAttribute = []
    loadDataset('datatraining.txt', split, trainingset, testset, remAttribute, whichAttribute)
    predictions = []
    k = 3
    for x in range(len(testset)):
        neighbors = getNeighbors(trainingset, testset[x], k)
        result = getresponse(neighbors)
        predictions.append(result)
#        print('Predictions:' + str(result) + 'Actual:' + str(testset[x][-1]))
    accuracy = getaccuracy(testset, predictions)
#    print('Accuracy:' + str(accuracy) + '%')
    dictionary[whichAttribute[0]] = float(accuracy)

newAccuracies = {}
for x in range(0,4):
    main(x, newAccuracies)

print(newAccuracies)

{'HumidityRatio': 96.86323713927227, 'Humidity': 98.57328145265889, 'CO2': 96.47058823529412, 'Light': 98.53658536585365}


#### Discussion
There are two questions in this portion of the lab: can attributes be removed without dropping the accuracy percentage below 90%, and which attribute is the most important? Because no details were given about the room (e.g., whether it is a bedroom or an office), I decided that the date and time would be nonfactors in determining the occupancy of the room. In fact, my code removes these two columns of data in both this question and the preceding question due to their irrelevance to the results. Therefore, the four attributes I tested removing were humidity, CO2, humidity ratio, and light. To test the accuracy without each of these attributes, I had the function "loadDataset" remove one column of data, selected by an index passed in from the for loop that calls main(). The dictionary printed above shows the accuracy of the algorithm with each attribute removed. I treated the attribute whose removal lowers the accuracy the most as the most important one, since the algorithm depends on it the most. Note that I copied the whole algorithm into this cell because it was hard to keep track of which functions I was changing, and I wanted to ensure accurate results.
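The column-removal step can also be written so that the attribute is looked up by name, which keeps the recorded label and the deleted column tied together. This is a minimal sketch on made-up rows; `drop_attribute`, the header order, and the values are assumptions for illustration and are not taken from the lab's data file.

```python
def drop_attribute(header, rows, name):
    """Return copies of header and rows without the column called `name`."""
    index = header.index(name)
    new_header = header[:index] + header[index + 1:]
    new_rows = [row[:index] + row[index + 1:] for row in rows]
    return new_header, new_rows

# Made-up header and readings; the last column is the occupancy label.
header = ["Humidity", "Light", "CO2", "HumidityRatio", "Occupancy"]
rows = [
    [30.0, 400.0, 800.0, 0.005, 1.0],
    [25.0, 0.0, 450.0, 0.004, 0.0],
]

# Remove each attribute in turn, never the label in the last column.
for name in header[:-1]:
    reduced_header, reduced_rows = drop_attribute(header, rows, name)
    print(name, "removed ->", reduced_header)
```

Because the function returns new lists, the original `header` and `rows` stay intact and can be reused for every attribute without reloading the file.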

I found that no attribute is clearly more important than the others. No single removal affected the accuracy of the algorithm much more than any other, and after a number of tests there was no clear trend indicating which attribute was the most important. The one consistent pattern was that removing light affected the algorithm's accuracy the least, which makes light the least important of the four. Removing any one attribute makes little difference to the accuracy, likely because the remaining three attributes are good enough to make up for the missing one. In fact, the lowest accuracy I saw in any of my tests was only about 3 percentage points below the general average of 98.5%.
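To make the comparison concrete, the drop in accuracy for each removed attribute can be computed against a baseline. The sketch below uses the dictionary printed above for a single run and takes the "general average" of 98.5% as the baseline; treating that average as the baseline, and the names `baseline` and `drops`, are assumptions for illustration. Since each run uses a different random split, these figures describe this one run only.

```python
# Accuracies copied from the dictionary printed above (one run).
newAccuracies = {
    'HumidityRatio': 96.86323713927227,
    'Humidity': 98.57328145265889,
    'CO2': 96.47058823529412,
    'Light': 98.53658536585365,
}

# Assumed baseline: the "general average" quoted in the discussion.
baseline = 98.5

# Percentage points lost (positive) or gained (negative) when each attribute is removed.
drops = {name: baseline - accuracy for name, accuracy in newAccuracies.items()}

# List the attributes from largest drop to smallest.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {drop:+.2f} percentage points")
```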