## k-Nearest Neighbors

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

1. In k-NN classification, the output is a class membership. An object is classified by a **plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors** (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

2. In k-NN regression, the output is the property value for the object. This value is the **average of the values of its k nearest neighbors**.
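
The two output rules above can be illustrated with a minimal sketch. The neighbor labels and values are made up for illustration, and `knn_classify` and `knn_regress` are hypothetical helper names, not functions defined elsewhere in this notebook.

```python
from collections import Counter

def knn_classify(neighbor_labels):
    # Plurality vote: the most common label among the k neighbors wins.
    return Counter(neighbor_labels).most_common(1)[0][0]

def knn_regress(neighbor_values):
    # Regression: average the target values of the k neighbors.
    return sum(neighbor_values) / float(len(neighbor_values))

# Hypothetical labels and target values of the k = 3 nearest neighbors.
label = knn_classify(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])
value = knn_regress([1.0, 2.0, 6.0])
print(label)
print(value)
```

Here two of the three neighbors vote for `Iris-setosa`, and the regression output is the mean of the three values.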

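The k-NN algorithm is among the simplest of all machine learning algorithms: it has no explicit training step, stores the training examples as they are, and defers all computation until a prediction is requested.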

In [48]:
#CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases.
import csv
import random
import math
#The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python.
import operator

In [49]:
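# Load the CSV file and randomly split its rows into training and test data
def loadDataset(filename, split, trainingSet=None, testSet=None):
    # Avoid mutable default arguments: a default list would be shared across calls.
    if trainingSet is None:
        trainingSet = []
    if testSet is None:
        testSet = []
    # Open in text mode ('r'); on Python 3, csv.reader needs strings rather than bytes.
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        print('lines',type(lines))
        # Skip empty rows, such as a trailing blank line in the file.
        dataset = [row for row in lines if row]
        print('dataset',type(dataset),'len(dataset)',len(dataset))

        for x in range(len(dataset)):
            # Convert the four feature columns to floats; the last column is the class label.
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            # random.random() returns a float in [0.0, 1.0), so each row
            # goes to the training set with probability `split`.
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])
    return trainingSet, testSet


trainingSet=[]
testSet=[]
split = 0.67

loadDataset('/home/johan/repos/GitHub/Introduction-to-Machine-Learning/Datasets/iris.data', split, trainingSet, testSet)
print('Train set: ' + repr(len(trainingSet)))
print('Test set: ' + repr(len(testSet)))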

('lines', <type '_csv.reader'>)
('dataset', <type 'list'>, 'len(dataset)', 150)
Train set:  98
Test set:  52
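
Because the split relies on `random.random()`, the sizes of the two sets change from run to run. One way to make the split repeatable is to seed the generator. The sketch below is an assumption-laden illustration: `splitRows` is a hypothetical helper, and the rows are made-up stand-ins rather than the iris file.

```python
import random

def splitRows(rows, split, seed):
    # A dedicated generator with a fixed seed yields the same split on every run.
    rng = random.Random(seed)
    trainingSet, testSet = [], []
    for row in rows:
        if rng.random() < split:
            trainingSet.append(row)
        else:
            testSet.append(row)
    return trainingSet, testSet

# Made-up stand-in data: 150 rows with one feature and a placeholder label.
rows = [[float(i), 'label'] for i in range(150)]
first = splitRows(rows, 0.67, seed=1)
second = splitRows(rows, 0.67, seed=1)
print(first == second)  # the same seed gives the same split
```

Using a separate `random.Random` instance keeps the seed local to the split and leaves the module-level generator untouched.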


In [50]:
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
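
As a quick check of the distance function, the sketch below repeats its definition so that it runs on its own and applies it to two iris-style rows. Only the first four elements are compared, so the distance is sqrt(0.2^2 + 0.5^2 + 0^2 + 0^2) = sqrt(0.29). The two rows are chosen for illustration.

```python
import math

def euclideanDistance(instance1, instance2, length):
    # Sum the squared differences over the first `length` features only.
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# Two illustrative rows: four numeric features followed by a class label.
a = [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
b = [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
d = euclideanDistance(a, b, 4)
print(round(d, 4))  # approximately 0.5385
```

Passing `length = 4` is what keeps the string label in the last position out of the arithmetic.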

In [58]:
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    # Subtract 1 because the last element of each instance is the class label, not a feature.
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    # Sort the (instance, distance) pairs by distance, nearest first.
    distances.sort(key = operator.itemgetter(1))
    # Keep the k nearest training instances.
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
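
The neighbor search can be tried on a tiny made-up data set. The sketch below is a compact, self-contained restatement of the two functions above (same names, condensed bodies). The two-feature points are assumptions chosen so that the nearest neighbors are obvious.

```python
import math
import operator

def euclideanDistance(instance1, instance2, length):
    return math.sqrt(sum((instance1[i] - instance2[i]) ** 2 for i in range(length)))

def getNeighbors(trainingSet, testInstance, k):
    length = len(testInstance) - 1  # the last element is the label
    distances = [(row, euclideanDistance(testInstance, row, length))
                 for row in trainingSet]
    distances.sort(key=operator.itemgetter(1))  # nearest first
    return [row for row, _ in distances[:k]]

# Made-up data: two points near the origin and one far away.
trainingSet = [[0.0, 0.0, 'a'], [1.0, 1.0, 'a'], [10.0, 10.0, 'b']]
testInstance = [0.2, 0.2, '?']
neighbors = getNeighbors(trainingSet, testInstance, 2)
print(neighbors)
```

With `k = 2` the far point labeled `'b'` is excluded, so both returned neighbors carry the label `'a'`.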

In [59]:
trainingSet=[]
testSet=[]
split = 0.67

loadDataset('/home/johan/repos/GitHub/Introduction-to-Machine-Learning/Datasets/iris.data', split, trainingSet, testSet)
print('Train set: ' + repr(len(trainingSet)))
print('Test set: ' + repr(len(testSet)))
predictions=[]
k = 3

for x in range(len(testSet)):
    neighbors = getNeighbors(trainingSet, testSet[x], k)

('lines', <type '_csv.reader'>)
('dataset', <type 'list'>, 'len(dataset)', 150)
Train set:  92
Test set:  58

In [52]:
def getResponse(neighbors):
    # Count the votes for each class label among the neighbors.
    classVotes = {}
    for x in range(len(neighbors)):
        # The class label is the last element of each neighbor.
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # Sort the labels by vote count, highest first; items() works on Python 2 and 3.
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

In [53]:
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
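
A small worked example shows how the accuracy figure is computed. The function is repeated so the sketch runs on its own. The test rows and predictions are made up: three of the four predictions match the labels, giving 3/4 * 100 = 75 percent.

```python
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        # Compare the true label (last element) with the predicted label.
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

# Made-up test rows (one feature plus a label) and made-up predictions.
testSet = [[5.1, 'a'], [6.0, 'b'], [5.9, 'b'], [4.8, 'a']]
predictions = ['a', 'b', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)
```

The `float(...)` conversion keeps the division from truncating to an integer on Python 2.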

In [56]:
def main():
    trainingSet=[]
    testSet=[]
    split = 0.67

    loadDataset('/home/johan/repos/GitHub/Introduction-to-Machine-Learning/Datasets/iris.data', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print('Test set: ' + repr(len(testSet)))
    predictions=[]
    k = 3

    # Classify each test instance by the most common class among its k nearest neighbors.
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        #print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy))

main()

 

('lines', <type '_csv.reader'>)
('dataset', <type 'list'>, 'len(dataset)', 150)
Train set:  105
Test set:  45


