## kNN classifier
The k-nearest-neighbour (kNN) classifier is one of the simplest classifiers: it has no training phase at all. When a test point comes in, the Euclidean distances between it and every point in the training set are calculated, and the majority class among the k training points with the smallest distances becomes the class label of the test point.
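
Before touching the real data, the rule can be sketched in plain Python. This is only a minimal illustration on made-up points: `knn_predict`, `points` and `labels` are hypothetical names for this sketch and do not appear in the rest of the notebook.

```python
from collections import Counter

def knn_predict(query, points, labels, k):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        sum((q - p) ** 2 for q, p in zip(query, point)) ** 0.5
        for point in points
    ]
    # indexes of the k smallest distances
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # majority vote over the labels of those neighbours
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# hypothetical toy data: two well-separated clusters in the plane
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict((0.5, 0.5), points, labels, 3))
print(knn_predict((5.5, 5.5), points, labels, 3))
```

Nothing is fitted here: all the work happens at query time, which is why kNN needs no training step.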

### Load the data set
A girl called Helen wants to build a classifier that tells whether she would like to date a man (WTF?), based on three features:

1. frequent flier miles earned per year
2. percentage of time spent playing video games
3. liters of ice cream consumed per year

The label is 1 if she does not want to date him at all, 2 if she wants to date him a little, and 3 if she wants to date him very much.

She has a data set that contains these three features for 1000 men (WOW). Let's build a kNN classifier. First, let's load the data set. It is stored as raw text on GitCafe. The loaded data is stored as a pandas data frame.

In [95]:
import pandas as pd
import matplotlib.pyplot as plt
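from numpy import *  #star import: later cells rely on tile, zeros, shape and array
url="https://coding.net/u/HongHuangNeu/p/Machine-Learning-Notes-Data/git/raw/master/DatingData/datingTestSet2.txt"
df=pd.read_csv(url,sep='\t')  #tab-separated text with a header row
df[:3] #show the first 3 rows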

| | fre_flier_miles_earned_per_year | per_of_time_spent_playing_video_games | liters_of_ice_cream_consumed_per_year | label |
|---|---|---|---|---|
| 0 | 40920 | 8.326976 | 0.953952 | 3 |
| 1 | 14488 | 7.153469 | 1.673904 | 2 |
| 2 | 26052 | 1.441871 | 0.805124 | 1 |


Let's play with the data set. In the following cell we make the labels the index of the pandas data frame.

In [96]:
indexed_df = df.set_index(['label'])
indexed_df[:3]

| label | fre_flier_miles_earned_per_year | per_of_time_spent_playing_video_games | liters_of_ice_cream_consumed_per_year |
|---|---|---|---|
| 3 | 40920 | 8.326976 | 0.953952 |
| 2 | 14488 | 7.153469 | 1.673904 |
| 1 | 26052 | 1.441871 | 0.805124 |
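
The effect of `set_index` can be checked on a small in-memory copy of the three rows shown above. This is only a sketch: the short column names `miles`, `games` and `ice_cream` are made up for readability, and the rows are re-typed by hand instead of downloaded.

```python
import io
import pandas as pd

# the three rows shown above, re-typed as tab-separated text
# (shortened, hypothetical column names)
raw = (
    "miles\tgames\tice_cream\tlabel\n"
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)
sample = pd.read_csv(io.StringIO(raw), sep="\t")

# the label column becomes the row index and leaves the regular columns
indexed = sample.set_index("label")

# rows can now be selected by label; passing a list always returns a DataFrame
liked = indexed.loc[[3]]
print(liked)
```

With the labels as index, selecting all men of one class is a single `.loc` lookup.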


### Visualize the data set
In the following cell I draw a scatter plot of two dimensions: 'liters_of_ice_cream_consumed_per_year' and 'per_of_time_spent_playing_video_games'. Rendering this scatter plot takes a long time, so the statements are commented out for now. Interested readers can download this IPython notebook, uncomment these statements, and see the visualization themselves.

In [97]:
#cmap = {1.0: 'red', 2.0: 'blue', 3.0: 'yellow'}

#df.plot(x='liters_of_ice_cream_consumed_per_year', y='per_of_time_spent_playing_video_games', kind='scatter', c=[cmap.get(c, 'black') for c in df.label])

#plt.show()

### Get the features and labels

In [98]:
matrix=df.values #the numpy array of the pandas data frame
labels=matrix[:,3]   #column index 3 (the fourth column) is the label column
features=matrix[:,0:3] #columns 0 to 2 are the feature columns


### Define the classification function
Define a classification function that follows the kNN principle: compute the distances, take the k nearest training points, and return their majority class.

In [99]:
def classify(x,trainingSet,labels,k):
    #x: point to classify, trainingSet: one row per training point,
    #labels: class of each training row, k: number of neighbours that vote
    numberOfRows=trainingSet.shape[0]
    #repeat x numberOfRows times so it has the same shape as trainingSet
    xs=tile(x,(numberOfRows,1))
    #difference of coordinates in each dimension
    diff=trainingSet-xs
    #squared differences
    square=diff*diff
    #Euclidean distance from x to every training point
    distances=(square.sum(axis=1))**0.5

    #indexes of the training points, ordered from nearest to farthest
    sortedIndexes=distances.argsort()
    #count how often each class occurs among the k nearest points
    classCount={}
    for i in range(k):
        label=labels[sortedIndexes[i]]
        classCount[label]=classCount.get(label,0)+1
    #sort the (label, count) pairs by count, largest first
    sortedClassLabels=sorted(classCount.items(),key=lambda item: item[1],reverse=True)
    return sortedClassLabels[0][0]


    

Let's play with this classifier first. Here we use 3-NN.

In [100]:
#play with this classifier first
x=array([1,2,3]) #vector to classify
classify(x,features,labels,3)

2.0
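
As an aside, the same computation can be written more compactly with NumPy broadcasting and `collections.Counter`. The sketch below is an alternative, not the function used in the rest of this notebook; `classify_broadcast` and the toy arrays are hypothetical names and data chosen for illustration.

```python
from collections import Counter
import numpy as np

def classify_broadcast(x, training_set, labels, k):
    # broadcasting subtracts x from every row, so no tile() copy is needed
    distances = np.sqrt(((training_set - x) ** 2).sum(axis=1))
    # indexes of the k nearest rows
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of the k nearest rows
    return Counter(labels[nearest].tolist()).most_common(1)[0][0]

# hypothetical toy data, not the dating data set
toy_features = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]])
toy_labels = np.array([1.0, 1.0, 1.0, 3.0, 3.0])

print(classify_broadcast(np.array([1.5, 1.5]), toy_features, toy_labels, 3))
print(classify_broadcast(np.array([8.5, 8.0]), toy_features, toy_labels, 3))
```

Broadcasting avoids materializing a repeated copy of `x`, which matters little for 1000 rows but keeps the code shorter.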

### Normalization
The kNN classifier is very sensitive to scaling: features with large values dominate the Euclidean distance and therefore the outcome, so we need to scale the features. Here we use the following formula to scale each feature to [0,1]:
$$newValue=\frac{oldValue-minValue}{maxValue-minValue}$$
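
To see how strongly an unscaled feature can dominate, here is a quick check on the first two rows of the data set, copied from the output shown earlier. It is a small illustration only; `miles_share` is a name introduced for this sketch.

```python
# first two rows of the data set, copied from the output shown earlier
a = (40920.0, 8.326976, 0.953952)
b = (14488.0, 7.153469, 1.673904)

# squared difference contributed by each feature to the squared Euclidean distance
squared = [(u - v) ** 2 for u, v in zip(a, b)]

# fraction of the squared distance that comes from the flier miles alone
miles_share = squared[0] / sum(squared)

print(squared)
print(miles_share)
```

On the raw values, the flier miles account for practically the whole distance between these two men, so the other two features barely influence which neighbours are chosen.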

In [101]:
#this function processes the training set
def norm(dataSet):
    #min value of each feature
    minValue=dataSet.min(axis=0)

    #max value of each feature
    maxValue=dataSet.max(axis=0)

    #repeat the min and max rows so they have the same shape as dataSet
    numberOfRows=dataSet.shape[0]
    minValues=tile(minValue,(numberOfRows,1))
    maxValues=tile(maxValue,(numberOfRows,1))
    diff=maxValues-minValues

    #apply the formula; this builds a new array and leaves dataSet unchanged
    normedDataSet=(dataSet-minValues)/diff
    return normedDataSet,minValue,maxValue

#this function processes the data point to classify
#it takes the min values and max values calculated from the training set
def getValue(testPoint,minValue,maxValue):
    diff=maxValue-minValue
    normedTestPoint=(testPoint-minValue)/diff
    return normedTestPoint
    
    

Call norm to normalize the training set, call getValue to normalize the data point, and then do the classification.

In [102]:
normedTrainSet,minValue,maxValue=norm(features)
newX=getValue(x,minValue,maxValue)
classify(newX,normedTrainSet,labels,3)


2.0
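
One edge case the formula does not cover is a feature whose maximum equals its minimum, which would cause a division by zero. The sketch below shows one possible way to handle it, using broadcasting instead of `tile`. Mapping a constant feature to 0 is an assumption made here, not something the notebook prescribes, and `norm_broadcast` is a hypothetical name.

```python
import numpy as np

def norm_broadcast(data_set):
    # per-feature minimum and maximum
    min_value = data_set.min(axis=0)
    max_value = data_set.max(axis=0)
    value_range = max_value - min_value
    # assumption: a constant feature carries no information, so divide by 1
    # instead of 0, which maps every value of that feature to 0
    safe_range = np.where(value_range == 0, 1.0, value_range)
    # broadcasting applies the row of minima and ranges to every data row
    normed = (data_set - min_value) / safe_range
    return normed, min_value, max_value

# hypothetical toy data: the second feature is constant
toy = np.array([[10.0, 5.0], [20.0, 5.0], [50.0, 5.0]])
normed, lo, hi = norm_broadcast(toy)
print(normed)
```

The dating data set does not appear to need this guard, since the classifier above already runs on it, but the check is cheap when the function is reused on other data.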

### Test the classifier
Use part of the data set as the test set and the rest as the training set.

In [104]:
def testClassifier(ratio,dataSet,labels,k):
    #note: min and max are computed over the whole data set, test points included
    normalizedDataSet,minValue,maxValue=norm(dataSet)
    numberOfRows=dataSet.shape[0]
    #the first ratio*numberOfRows rows are the test points, the rest is the training set
    numberOfTestPoints=int(numberOfRows*ratio)
    totalCount=0
    errorCount=0
    for i in range(numberOfTestPoints):
        #the ith test point
        x=normalizedDataSet[i]
        #apply the classifier
        label=classify(x,normalizedDataSet[numberOfTestPoints:numberOfRows],labels[numberOfTestPoints:numberOfRows],k)
        totalCount=totalCount+1
        if label!=labels[i]:
            errorCount=errorCount+1

    #fraction of misclassified test points
    errorRate=errorCount/totalCount
    return errorRate



Use the first 10% of the data points as the test set to test the 3-NN classifier.

In [106]:
testClassifier(0.1,features,labels,3)

0.05
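
`testClassifier` takes the first rows as test points and normalizes with the minimum and maximum of the whole data set. A possible variation, sketched below under my own assumptions, shuffles the rows first and computes the scaling statistics from the training rows only, so the test points stay unseen. `holdout_error`, its `seed` parameter and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np

def holdout_error(features, labels, ratio, k, seed=0):
    n_test = int(len(features) * ratio)
    if n_test == 0:
        raise ValueError("ratio is too small: no test points")
    # shuffle the row indexes, then split them into test and training parts
    order = np.random.default_rng(seed).permutation(len(features))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # scaling statistics come from the training rows only
    lo = features[train_idx].min(axis=0)
    span = features[train_idx].max(axis=0) - lo
    span = np.where(span == 0, 1.0, span)  # guard against constant features
    train = (features[train_idx] - lo) / span
    test = (features[test_idx] - lo) / span
    train_labels = labels[train_idx]

    errors = 0
    for x, truth in zip(test, labels[test_idx]):
        # Euclidean distance from the test point to every training point
        distances = np.sqrt(((train - x) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        # majority vote among the k nearest training labels
        values, counts = np.unique(train_labels[nearest], return_counts=True)
        if values[np.argmax(counts)] != truth:
            errors += 1
    return errors / n_test

# hypothetical toy data: two well-separated clusters of 10 points each
toy_features = np.array(
    [[i, 0.1 * i] for i in range(10)] + [[100 + i, 5 + 0.1 * i] for i in range(10)],
    dtype=float,
)
toy_labels = np.array([1.0] * 10 + [2.0] * 10)
print(holdout_error(toy_features, toy_labels, ratio=0.2, k=3))
```

On the dating data this would be called as `holdout_error(features, labels, 0.1, 3)`; the resulting error rate may differ from the one above because the split is different.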