In [None]:
# This notebook uses the following libraries:
import numpy as np
import pandas as pd
import scipy.stats
import psycopg2 as pg
import pandas.io.sql as psqlg
from sklearn.neighbors import KNeighborsClassifier


# This line suppresses a warning about a future deprecation in
# the KNeighborsClassifier functions; you should ignore it
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [None]:
RoostCoords = pd.read_csv('data/RoostCoords.csv')
RoostCoords.head()

In [None]:
roostyear = RoostCoords.loc[RoostCoords['year']==2013]

In [None]:
HibernationCoords = pd.read_csv('data/HibernationCoords.csv')
HibernationCoords.head()

In [None]:
hibyear = HibernationCoords.loc[HibernationCoords['year']==2013]

Step 1: Select a value of k to examine.
Step 2: Select one of the n training data points as the validation data. The remaining n-1 data points are used as a training set.
Step 3: Build a k-NN classifier with the n-1 training data points, and use this to predict the class of the validation data point. Check the predicted class against the actual class of the validation data point.
Step 4: Repeat Steps 2 and 3 for each of the n labelled data points by choosing a different data point as validation data and using the rest of the n-1 data instances as training data.
Step 5: Calculate an error rate as the ratio of incorrect classifications (f) to the total number of labelled data points (n), i.e. error rate = f/n.
Step 6: With a different value of k, repeat Steps 2 to 5. Repeat this step until all values of k are examined.
Step 7: Choose the value of k with the lowest error rate as an empirical optimal value. If there is a tie, choose the smallest k.
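As a cross-check on the steps above, the same procedure can be sketched with scikit-learn's built-in LeaveOneOut splitter. This is only an illustration, not the implementation developed in this notebook: the toy data, the helper name loo_error_rate and the range of k values are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(5.0, 0.5, size=(10, 2))])
y = np.array(['A'] * 10 + ['B'] * 10)


def loo_error_rate(X, y, k):
    '''Steps 2 to 5: leave each point out in turn, classify it with
       k-NN trained on the rest, and return the error rate f/n.
    '''
    clf = KNeighborsClassifier(n_neighbors=k,
                               metric='euclidean',
                               weights='uniform')
    # Each fold holds exactly one point, so each score is 0.0 or 1.0
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    return 1.0 - scores.mean()


# Step 6: repeat for each value of k under examination
error_rates = {k: loo_error_rate(X, y, k) for k in range(1, 8)}

# Step 7: lowest error rate wins; ties go to the smallest k
best_k = min(error_rates, key=lambda k: (error_rates[k], k))
print(error_rates)
print(best_k)
```

Sorting on the pair (error rate, k) is one simple way to apply the tie-break rule in Step 7.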

The task of implementing the leave-one-out algorithm here is best carried out in two stages.
For stage 1, we will develop a function which takes a single member of a dataset, and uses the remaining data to classify it with the k-NN algorithm.
For stage 2, we will develop a second function which uses the function from stage 1 to calculate how many members of the dataset were correctly classified.
We have provided a description of each function, and a suggested solution for each of the two stages, which you can use. However, you will gain much more benefit if you attempt to write the functions yourself before looking at our proposed solutions, even if you do not manage to build complete working functions.

In [None]:
def classify_single_case(trainingData_df, targetValues_ss, ix, k):
    '''Use k-NN to classify the member of trainingData_df with index
       ix using a k-nearest neighbours classifier. The classifier is
       trained on the data in trainingData_df and the classes in
       targetValues_ss, with the data point indexed by ix omitted.
       Returns the class assigned to the data point with index ix.
    '''

    # Create a classifier instance to do k-nearest neighbours
    myClassifier = KNeighborsClassifier(n_neighbors=k,
                                        metric='euclidean',
                                        weights='uniform')

    # Now train the classifier on all data points except
    # the one indexed by ix
    myClassifier.fit(trainingData_df.drop(ix, axis='index'),
                     targetValues_ss.drop(ix))

    # Return the class predicted by the trained classifier.
    # Selecting with [[ix]] keeps the row as a one-row DataFrame,
    # because predict() expects a 2D input (one row per sample).
    return myClassifier.predict(trainingData_df.loc[[ix]])[0]

The classifier did not prove reliable with latitude and longitude as features, so I amended the data to use coordinates instead to see whether that improved matters. However, this gave poorer results, so I will instead try several iterations to see whether the results improve.

In [None]:
%load_ext sql
%sql postgresql://test:test@localhost:5432/tm351test

In [None]:
conn = pg.connect(dbname='tm351test', host='localhost', user='test', password='test', port=5432)

In [None]:
dfh = pd.read_sql_query('select * from HibernationBats',conn)
dfh.head(20)

In [None]:
dfr = pd.read_sql_query('select * from RoostBats',conn)
dfr.head(20)

In [None]:
dfr1 = pd.read_sql_query("SELECT * FROM RoostBats WHERE RoostBats.year = '2010';", conn)
dfr1.head(20)

In [None]:
nobatsr = dfr1.loc[dfr1['commonname'] != 'Bat']
nobatsr.head(20)

In [None]:
#nobatr = hibyear.loc[hibyear['commonName'] != 'Bat']
#nobatr.head(20)

In [None]:
'''

Predict the class of the data point with index 17, using a k-NN classifier
with k=3

The actual class of the data point with this index is 'Lesser Horseshoe Bat'

'''

# Use the two columns 'latitude' and 'longitude'
# for the training data
trainingData_df = dfr[['latitude', 'longitude']]

# Use the column 'commonname' as the target values
targetValues_ss = dfr['commonname']

# Return the predicted value of the data point with index 17 for k=3:
classify_single_case(trainingData_df,
                     targetValues_ss,
                     17,
                     3)

In [None]:
len(nobatsr)

The function classified the data point in the same way for all values of k between 1 and 5, so I will now try a subset of the whole dataset (the records for 2010).
Next, to obtain a list of predicted values for a given k, apply the function classify_single_case to each data point in the training data in turn. The following cells do this for k=3:

In [None]:
'''

Predict the class of the data point with index 17, using a k-NN classifier
with k=3

The actual class of the data point with this index is 'Lesser Horseshoe Bat'

'''

# Use the two columns 'latitude' and 'longitude'
# for the training data
trainingData_df1 = dfr1[['latitude', 'longitude']]

# Use the column 'commonname' as the target values
targetValues_ss1 = dfr1['commonname']

# Return the predicted value of the data point with index 17 for k=3:
classify_single_case(trainingData_df1,
                     targetValues_ss1,
                     17,
                     3)

In [None]:
[classify_single_case(trainingData_df1,
                      targetValues_ss1,
                      i,
                      3)
 for i in trainingData_df1.index]

To identify the number of discrepancies between the predicted values and the actual values, compare the list of predicted classes with the Series of actual classes. In the result, True means the predicted class is the same as the actual class, and False means that they are different:

In [None]:
[classify_single_case(trainingData_df1,
                      targetValues_ss1,
                      i,
                      3)
 for i in trainingData_df1.index] == targetValues_ss1

In [None]:
list([classify_single_case(trainingData_df1,
                           targetValues_ss1,
                           i,
                           3)
      for i in trainingData_df1.index] == targetValues_ss1).count(True)

To find the optimum value of k, we want the value that gets the prediction correct most often. To determine this value, carry out the above calculation for a range of k values from 1 to 14:

In [None]:
for k in range(1, 15):
    print('{}\t{}'.format(k,
                          list([classify_single_case(trainingData_df1,
                                                     targetValues_ss1,
                                                     i,
                                                     k)
                                for i in trainingData_df1.index
                               ] == targetValues_ss1
                              ).count(True)))
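The second stage described earlier, a function that counts how many members of the dataset are classified correctly, could be sketched as follows. The names count_correct and choose_k and the toy data are assumptions made for this illustration; classify_single_case is repeated in compact form so that the cell runs on its own.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def classify_single_case(trainingData_df, targetValues_ss, ix, k):
    '''Classify the row with index ix using k-NN trained on all other rows.'''
    clf = KNeighborsClassifier(n_neighbors=k,
                               metric='euclidean',
                               weights='uniform')
    clf.fit(trainingData_df.drop(ix, axis='index'),
            targetValues_ss.drop(ix))
    return clf.predict(trainingData_df.loc[[ix]])[0]


def count_correct(trainingData_df, targetValues_ss, k):
    '''Stage 2: return how many rows are classified correctly when
       each one is left out in turn (hypothetical helper name).
    '''
    predicted_ss = pd.Series(
        [classify_single_case(trainingData_df, targetValues_ss, ix, k)
         for ix in trainingData_df.index],
        index=trainingData_df.index)
    return int((predicted_ss == targetValues_ss).sum())


def choose_k(trainingData_df, targetValues_ss, k_values):
    '''Return the k with the lowest error rate f/n (smallest k on a
       tie), together with the error rate found for every k.
    '''
    n = len(trainingData_df)
    error_rates = {k: 1 - count_correct(trainingData_df, targetValues_ss, k) / n
                   for k in k_values}
    best_k = min(error_rates, key=lambda k: (error_rates[k], k))
    return best_k, error_rates


# Hypothetical toy data: two tight, well-separated groups of points
toy_df = pd.DataFrame(
    {'latitude': [50.0, 50.1, 50.2, 50.3, 55.0, 55.1, 55.2, 55.3],
     'longitude': [-4.0, -4.1, -4.2, -4.3, -1.0, -1.1, -1.2, -1.3]})
toy_ss = pd.Series(['A'] * 4 + ['B'] * 4)

best_k, error_rates = choose_k(toy_df, toy_ss, range(1, 4))
print(best_k, error_rates)
```

With the notebook's own data, the call would take trainingData_df1 and targetValues_ss1 in place of the toy data.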

k-NN is very susceptible to outliers: some unusual or extreme points in a dataset can easily lead a classifier to misclassify a new point, or at least to classify it in an unintuitive way. For example, in Figure 20.12 a 3-nearest neighbours classifier would classify the new point (shown by a green triangle) as Class A, whereas looking at the classes independently, the new point seems to be a more natural fit with Class B than with Class A (Figure 20.13). Of course, it could be that the new point is also an outlier: perhaps the outlying points are a result of a quirk of your measuring apparatus? Or are the points accurate measurements of a case which you had not predicted, suggesting that they require further analysis? It is often the case that investigating the borderline cases can give the greatest insight into the data’s meaning.
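The effect of outliers can be sketched with a small made-up dataset. The points below are not the data behind the figures; they are chosen by hand so that two outlying Class A points lie close to the Class B cluster. The helper predict_with_k is a hypothetical name used only here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hand-made illustrative data (an assumption, not the figure's data)
class_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
           (4.0, 4.0), (4.1, 3.9)]   # the last two points are outliers
class_b = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (5.5, 5.5)]

X = np.array(class_a + class_b)
y = np.array(['A'] * len(class_a) + ['B'] * len(class_b))

# A new point that sits on the edge of the Class B cluster
new_point = np.array([[4.3, 4.3]])


def predict_with_k(k):
    '''Train a k-NN classifier on all the points and classify new_point.'''
    clf = KNeighborsClassifier(n_neighbors=k,
                               metric='euclidean',
                               weights='uniform')
    clf.fit(X, y)
    return clf.predict(new_point)[0]


pred_k3 = predict_with_k(3)   # two outlying A points outvote one B point
pred_k7 = predict_with_k(7)   # five B points now outvote the two outliers
print(pred_k3, pred_k7)
```

In this sketch the two outliers are the nearest neighbours of the new point, so k=3 gives Class A, while k=7 takes in the whole Class B cluster and gives Class B.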