K-Nearest Neighbors
Author: Carleton Smith; W.P.G. Peterson

Revised: Jessica Cervi

Expected time = 3 hours

Total points = 95 points

Assignment Overview
K-Nearest Neighbors (KNN) is a reasonably simple algorithm that is easy to grasp and can be very effective. This assignment will test your abilities in three different parts. In the first part, you will familiarize yourself with the problem and data. In the second part, you will code a KNN classifier from scratch, evaluate its performance, and compare it to Scikit-Learn's implementation. Finally, in the last part you will interpret results and explain findings.


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. Remember to run your code from each cell before submitting your assignment. Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting it. You should view your Vocareum submission as if you are delivering a final project to your manager or client.

Vocareum Tips

Do not add arguments or options to functions unless asked specifically. This will cause an error in Vocareum.
Do not use a library unless asked explicitly in the question.
You can download the Grading Report after submitting the assignment. It will include the feedback and hints on the incorrect questions.
Learning Objectives
Have a firm understanding of KNN algorithm
Practice running through the data science workflow to solve a problem
Demonstrate how to translate a mathematical algorithm into effective code
Understand common pitfalls when working with distances
Determine class balance in classification problems
Use Euclidean distance to find the distance between vectors
Code KNN from scratch
Implement KNN using sklearn

K-Nearest Neighbors
Importing the data set and exploratory data analysis
For this assignment, we will be using a data set from the UC Irvine Machine Learning Repository that can be downloaded from the following link: Human Activity Recognition Using Smartphones Data Set.

Please see the Data Folder to explore the data files further.

In this assignment, we will try to classify the type of activity a person is performing based on measurements collected from a smartphone. The activities include:

Walking
Walking_Upstairs
Walking_Downstairs
Sitting
Standing
Laying
Before coding an algorithm, we will take a look at our data using Python's pandas. For visualizations, we will use the matplotlib and seaborn libraries.

Let's import the necessary libraries and load the datasets. We will be using the pandas pd.read_csv() function.

In [None]:
# Import the necessary modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
FEATURE_NAMES = './data/features.txt'
TRAIN_DATA = './data/X_train.txt'
TRAIN_LABELS = './data/y_train.txt'

In [None]:
# Load the data into pandas objects

# read feature names; the file is assumed to contain no tab characters,
# so each line is read as a single column
feats = pd.read_csv(FEATURE_NAMES, sep='\t', header=None)

# read in training data (whitespace-separated values)
har_train = pd.read_csv(TRAIN_DATA, sep=r'\s+', header=None)

# read in training labels (one label per line) as a Series
har_train_labels = pd.read_csv(TRAIN_LABELS, header=None, names=["label"]).squeeze("columns")

In [None]:
har_train.head()

* Question 1:
* 5 points

Find out how many rows and columns are in har_train. Assign your answer to the variable ans1.

In [None]:
### GRADED

### YOUR ANSWER BELOW

ans1 = None

### BEGIN SOLUTION
ans1 = har_train.shape
### END SOLUTION

Next, we display the first 5 rows of feats, the DataFrame of feature names.

In [None]:
feats.head()

Finally, we assign each feature name in feats to the corresponding column of har_train using the .columns attribute, and we display the first 5 rows.

In [None]:
har_train.columns = feats.iloc[:, 0].values
har_train.head()

* Question 2:
* 10 points

Use the .iloc indexer to extract the first 20 features (columns) of the dataframe har_train. Save this new dataframe to first_twenty.

Next, using the seaborn library create a heatmap for the correlation matrix.

First you have to create the correlation matrix from the pandas dataframe (save it in a dataframe called corr) and then plot it using seaborn with these customizations:

Set the seaborn style to white.
Generate a mask using np.triu(np.ones_like()) with the dtype as boolean to only show the lower triangle of the correlation matrix. Save it in a variable called mask.
Set up the figure with matplotlib with figsize=(11,9). Use fig, ax = ...
Generate a custom diverging colormap for the heatmap using sns.diverging_palette with the arguments (220, 10, as_cmap=True). Save it in a variable called cmap.
Draw the heatmap with the mask and correct aspect ratio, using the arguments (corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}).
Finally, use fig.tight_layout() just before saving the plot to produce a nicely centered graph.
You can find more information about how to create a heatmap using seaborn here.
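
To see what the mask in the steps above does, here is a small sketch on a made-up 3x3 matrix (the values in `toy_corr` are illustrative and are not taken from the data set). seaborn hides the cells where the mask is True, so masking the upper triangle, diagonal included, leaves only the lower triangle visible.

```python
import numpy as np

# Hypothetical 3x3 correlation matrix, for illustration only
toy_corr = np.array([[1.0, 0.2, -0.5],
                     [0.2, 1.0, 0.3],
                     [-0.5, 0.3, 1.0]])

# True on and above the diagonal; these are the cells a heatmap would hide
toy_mask = np.triu(np.ones_like(toy_corr, dtype=bool))
print(toy_mask)
```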

The final plot should look like this:



Save your plot as a png with the name "plot2.png" in the folder "results".

In [None]:
### GRADED

### YOUR ANSWER BELOW


### BEGIN SOLUTION

# extract the first 20 features
first_twenty = har_train.iloc[:, :20] 
# compute correlation matrix
corr = first_twenty.corr()

sns.set(style="white")

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(11, 9))

# Generate custom colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
fig.tight_layout()
plt.savefig("results/plot2.png")
plt.close()
### END SOLUTION

As you know, exploratory data analysis (EDA) is used to develop an idea of what the data we are about to work with look like. In particular, looking for null values and correlated features is an important step in order to:

See if any features will not be useful in models because of null values.
See if any model assumptions are violated by correlated features (such as in linear / logistic regression).
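
As a rough illustration of the null-value check, the sketch below counts missing values per column. The frame `toy` is a made-up stand-in; on the real data the same calls would presumably be applied to har_train.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for har_train (values are made up)
toy = pd.DataFrame({"f1": [0.1, np.nan, 0.3],
                    "f2": [1.0, 2.0, 3.0]})

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Total number of missing values in the whole frame
total_nulls = int(null_counts.sum())
print(total_nulls)
```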
For the next question, we will use the pandas Series containing the target variable (har_train_labels).

* Question 3:
* 5 points

How many times does the majority class appear in our data? How many times does the minority class appear in our target data? Assign these values to ans_maj and ans_min, respectively.

In [None]:
### GRADED


### YOUR ANSWER BELOW


ans_maj = None
ans_min = None

### BEGIN SOLUTION
ans_maj = har_train_labels.value_counts().max()
ans_min = har_train_labels.value_counts().min()
### END SOLUTION

As an observation, a large imbalance in the distribution of the target variable categories can cause machine learning algorithms to perform unevenly across classes. In this case, the algorithm can perform well on the majority class and poorly on the minority class.
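
One simple way to quantify imbalance is to look at class proportions and at the ratio between the largest and smallest class. The sketch below uses a made-up label Series (`toy_labels`); it is only an illustration of the idea, not part of the graded answer.

```python
import pandas as pd

# Toy labels standing in for har_train_labels (made up)
toy_labels = pd.Series([1, 1, 1, 1, 2, 2, 3], name="label")

# Share of observations in each class
proportions = toy_labels.value_counts(normalize=True)
print(proportions)

# Ratio of the largest class count to the smallest class count
counts = toy_labels.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)
```

A ratio close to 1 suggests balanced classes; a much larger ratio suggests that per-class metrics deserve a closer look.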

As a final note, EDA should be performed at the beginning of each project and should be tailored to your specific problem to develop an understanding of the data for a particular purpose. This can be a time-consuming process when the data set is large and has many features.

Note that the above example shows just a few of the actions you can perform for EDA.

Having performed EDA, we define a "test" data-set that will help us evaluate different models.

The function train_test_split from the sklearn.model_selection module provides an easy way to do this.

For this exercise we set test_size=.3 and random_state=24 for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

# give targets and observations conventional names
y = har_train_labels 
X = har_train

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=24)

Code K-Nearest Neighbors (KNN)
Note: The following example was adapted from example 2.1.2 in Chapter 2 of Machine Learning in Action by Peter Harrington.

Before fitting a model using KNN and the built-in modules of the sklearn package, we will code our own version of KNN from scratch.

As you know, KNN calculates the distance between the value (or vector) that we want to classify and all other values (or vectors) in the training data-set. Then, the "k" nearest neighbors are identified, and the label held by the majority of them is assigned as the prediction.

Thus, our final goal will be to define a function custom_KNN that accepts the following parameters:

A single data point to be classified (input_vector)
Training data (X_train)
Labels for training data (y_train)
Value of k (some positive integer)
Function definition:
def custom_KNN(input_vector, X_train, y_train, k)

Pseudo Code:

for every point in our dataset:
    calculate the distance between the current point and input_vector
sort the distances in increasing order
take the k items with the lowest distances to input_vector
find the majority class among these items
return the majority class label from the k closest neighbors
Return:

The prediction for input_vector
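
The pseudo code above can be sketched compactly with numpy on toy data. This is only an illustration under simplifying assumptions, not the graded solution: the helper name `knn_sketch` and the toy arrays are hypothetical, and ties are resolved by whatever order Counter returns rather than by the rule introduced later in Question 8.

```python
from collections import Counter

import numpy as np


def knn_sketch(input_vector, X, y, k):
    """Hypothetical helper: predict a label for input_vector by majority vote."""
    # Distance from input_vector to every row of X
    dists = np.linalg.norm(X - input_vector, axis=1)
    # Indices of the k smallest distances, closest first
    nearest = np.argsort(dists, kind="stable")[:k]
    # Majority label among the k nearest neighbors
    votes = Counter(y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up training data: two points near the origin, two near (5, 5)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([1, 1, 2, 2])

prediction = knn_sketch(np.array([0.2, 0.1]), X_toy, y_toy, k=3)
print(prediction)
```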
We will begin by defining some auxiliary functions that we will need to build our KNN algorithm from scratch.



* Question 4:
* 5 points

Find the Euclidean distance between the points p1 = (1,2,3,-4,6) and p2 = (10,2,32,-2,0), defined below. Assign the distance as a float to ans4.

In [None]:
### GRADED

### YOUR ANSWER BELOW

p1 = (1,2,3,-4,6)
p2 = (10,2,32,-2,0)

ans4 = None

### BEGIN SOLUTION
ans4 = 0
for i, j in zip(p1,p2):
    ans4 += (i-j)**2
ans4 = ans4**.5
### END SOLUTION

* Question 5:
* 5 points

Define a function called euclid_dist that takes as input two points represented as tuples in the format p1 = (a1, b1, ..., n1) and p2 = (a2, b2, ..., n2). Your function should return the Euclidean distance between the two points.



In [None]:
### GRADED

### YOUR ANSWER BELOW

def euclid_dist(p1, p2):
    """
    Calculate the Euclidean distance between two points
    
    Positional Arguments:
        p1 -- A tuple of n numbers
        p2 -- A tuple of n numbers
    
    Example:
        p1 = (5,5)
        p2 = (0,0)
        p3 = (5,6,7,8,9,10)
        p4 = (1,2,3,4,5,6)
        print(euclid_dist(p1,p2)) #--> 7.0710678118654755
        print(euclid_dist(p3,p4)) #--> 9.797958971132712
    """

### BEGIN SOLUTION
def euclid_dist(p1, p2):
    # Start with 0 distance
    dist = 0
    # For all pairs of values in two points,
    # Find difference and square
    for a, b in zip(p1,p2):
        dist += (a-b)**2
    
    # Take Square Root
    return dist**.5
### END SOLUTION

Distances with numpy
In the KNN functions that we will define later in the assignment we will use the library numpy to calculate the distance more efficiently.

This can be done via the following code : np.linalg.norm(p1-p2).
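
One detail worth noting: the expression p1-p2 only works element-wise when the points are numpy arrays (or pandas objects), not plain tuples. The sketch below converts two tuples to arrays first and checks that np.linalg.norm agrees with the sum-of-squares formula; the variable names are illustrative.

```python
import numpy as np

p1 = (1, 2, 3, -4, 6)
p2 = (10, 2, 32, -2, 0)

# Tuples do not support element-wise subtraction, so convert to arrays first
a1, a2 = np.array(p1), np.array(p2)
dist_np = np.linalg.norm(a1 - a2)

# Same quantity written out as the square root of the summed squared differences
dist_loop = sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5
print(dist_np, dist_loop)
```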

Now that we can easily calculate the distances between any two points, we can start building our function.

* Question 6:
* 10 points

Define a function called all_distances that takes as inputs an observation from a data set, e.g. har_train.iloc[50,:], and the full data set, e.g. har_train.

Your function should create a list or numpy array with the distances between that observation and all points in the full data set. Your function should return a list or numpy array dists with the distances sorted from smallest to largest.

Hints: Use np.linalg.norm() to find dists as described in the above cell. The smallest distance should be 0, because the observation itself belongs to the data set. Additionally, use the function np.apply_along_axis to apply this function along each row (axis 1). You can find additional documentation about this function here.
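
To illustrate the hint, the sketch below applies np.linalg.norm to each row of a small made-up difference matrix. The array `diff` stands in for "test point minus every observation"; it is not taken from the data set.

```python
import numpy as np

# Toy difference matrix: each row is (test_point - one observation)
diff = np.array([[3.0, 4.0],
                 [0.0, 0.0],
                 [6.0, 8.0]])

# Apply np.linalg.norm to each 1-D slice taken along axis 1 (one slice per row)
row_norms = np.apply_along_axis(np.linalg.norm, 1, diff)
print(row_norms)

# np.linalg.norm can also compute row-wise norms directly through its axis argument
same_norms = np.linalg.norm(diff, axis=1)

# Sort from smallest to largest, as the question asks
sorted_norms = np.sort(row_norms)
print(sorted_norms)
```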

In [None]:
### GRADED

### YOUR ANSWER BELOW


def all_distances(test_point, data_set):
    """
    Find and return a list of distances between the "test_point"
    and all the points in "data_set", sorted from smallest to largest.
    
    Positional Arguments:
        test_point -- a Pandas Series corresponding to a row in "data_set"
        data_set -- a Pandas DataFrame
    
    Example:
        test_point = har_train.iloc[50,:]
        data_set = har_train
        
        print(all_distances(test_point, data_set)[:5])
        #--> [0.0, 2.7970187358249854, 2.922792670143521, 2.966555149052483, 3.033982453218797]
    
    """
    return 
    
### BEGIN SOLUTION
def all_distances(test_point, data_set):
    # Take difference 
    diff = test_point - data_set
    
    # Find distance
    dists = np.apply_along_axis(np.linalg.norm, 1, diff )
    
    # Sort
    dists = np.sort(dists)
    return dists
### END SOLUTION

* Question 7:
* 10 points

Define a function called labels_of_smallest that takes three inputs. The first two are numpy arrays: the first corresponds to a numeric column and the second to a label column, where the i-th element of the numeric column corresponds to the i-th element of the label column. The third input should be a positive integer n.

Your function should execute the following steps:

concatenate the numeric and the label columns
create a new dataframe with columns num and lab with the concatenation
sort the values by the num column
Your function should return a list (or numpy array) of the n labels corresponding to the n smallest values in the numeric array.

NOTE: Make sure the order of labels corresponds to the order of values.

Hint: The labels are found in har_train_labels or y. The function np.concatenate() might be useful for this or subsequent exercises.
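
For comparison, the same pairing of values and labels can be obtained without building a DataFrame. The sketch below is an alternative technique (np.argsort), not the concatenate-and-sort approach the question asks for, and the toy arrays are made up.

```python
import numpy as np

numeric = np.array([2.5, 0.0, 1.2, 3.1])
labels = np.array(["sit", "walk", "stand", "lay"])
n = 3

# Positions that would sort "numeric" from smallest to largest
order = np.argsort(numeric, kind="stable")

# Reorder the labels the same way and keep the first n
n_labels = labels[order][:n]
print(n_labels)
```

Because the labels are indexed with the sort order of the numeric array, the i-th returned label always belongs to the i-th smallest value.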

In [None]:
### GRADED

### YOUR ANSWER BELOW

def labels_of_smallest(numeric, labels, n):
    
    """
    Return the n labels corresponding to the n smallest values in the "numeric"
    numpy array.
    
    Positional Arguments:
        numeric -- a numpy array of numbers
        labels -- a numpy array of labels (string or numeric)
            corresponding to the values in "numeric"
        n -- a positive integer
        
    Example:
        numeric = np.array([7,6,5,4,3,2,1])
        labels = np.array(["a","a","b","b","b","a","a"])
        n = 6
        
        print(labels_of_smallest(numeric, labels, n))
        #--> np.array(['a', 'a', 'b', 'b', 'b', 'a'])
    """

    
### BEGIN SOLUTION
def labels_of_smallest(numeric, labels, n):
    
    # Create a df of the two arrays (to simplify sorting)
    con = np.concatenate((numeric.reshape(-1,1), labels.reshape(-1,1)), axis = 1)
    df = pd.DataFrame(con, columns = ["num","lab"])
    
    # Sort
    df = df.sort_values(by = 'num')
    
    # Return the top "n" values
    return df['lab'].head(n).values
### END SOLUTION    

* Question 8:
* 10 points

For this question, look at the Counter class from the collections module and its .most_common() method.

Define a function called label_voting that takes as input a non-empty numpy array of labels. Your function should return the value that appears most frequently in that array (an integer for our activity labels). In the case of a tie, return the value in the tie that appears first in the array. One way to do this is to:

recast the labels as a list
initiate the counter for labels to find the most common
check to see if there is only one number in the list
loop through the list labels to determine which appears most often
return the value that appears most often in the list
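
The counting and tie-breaking idea can be sketched on a small made-up list. This is one possible way to express the rule, written under the assumption that "appears first" means the smallest index in the original list; the names `votes`, `tied`, and `first` are illustrative.

```python
from collections import Counter

votes = [3, 2, 2, 3, 1]

# Count how often each label occurs
counts = Counter(votes)
print(counts.most_common())

# Labels that share the highest count
top = max(counts.values())
tied = [label for label in counts if counts[label] == top]

# Among the tied labels, pick the one whose first occurrence comes earliest
first = min(tied, key=votes.index)
print(first)
```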

In [None]:
### GRADED
from collections import Counter



### YOUR ANSWER BELOW

def label_voting(labels):
    """
    Given a numpy array of labels. Return the label that appears most frequently
    If there is a tie for most frequent, return the label that appears first.
    
    Positional Argument:
        labels -- a numpy array of labels
    
    Example:
        lab1 = np.array([1,2,2,3,3])
        lab2 = np.array(["a","a","b","b","b"])
        
        print(label_voting(lab1)) #--> 2
        print(label_voting(lab2)) #--> "b"
        
    """
### BEGIN SOLUTION 
def label_voting(labels):
    
    # List methods used in this function, recast labels as list
    labels = list(labels)
    # instantiate counter, find most common, returns tuples
    c = Counter(labels).most_common()
    
    # If only one value present, return that value
    if len(c) == 1:
        return c[0][0]
    # IF first has majority, return first
    if c[0][1] > c[1][1]:
        return c[0][0]
    
    # Otherwise, check to see which comes first in list
    else:
        top_votes = c[0][1]
        #print(top_votes)
        poss = []
        for t in c:
            if t[1] == top_votes:
                poss.append(t[0])
        idx = dict()
       # print(poss)
        for p in poss:
            idx[labels.index(p)] = p
        #print(idx)
        return labels[sorted(idx.keys())[0]]
### END SOLUTION

* Question 9:
* 15 points

Now it is time to put everything together. In questions 5 and 6 we defined functions to calculate distances. In question 7 we defined a function to sort and return n desired labels. Finally, in question 8 we counted the number of "votes."

The next question asks for a KNN modeling function.

Define a function called custom_KNN that takes as inputs

a single observation (row) from X_test (created above with train_test_split)
X_train
the labels y_train
n - the number of nearest neighbors to poll in making predictions.
Your function should calculate the Euclidean distance between that X_test point and every point in X_train, and find the labels of the n nearest neighbors, ordered from the closest to the furthest.

Your function should return a prediction according to the voting rules outlined in question 8.

In [None]:
### GRADED

### YOUR ANSWER BELOW

def custom_KNN(point, X_train, y_train, n):
    """
    Predict the label for a single point, given training data and a specified
    "n" number of neighbors.
    
    Positional Arguments:
        point -- a pandas Series corresponding to an observation of a point with
             unknown label.
        X_train -- a pandas DataFrame corresponding to the measurements
            of points in a dataset. Assume all values are numeric, and
            observations are in the rows; features in the columns
        y_train -- a pandas Series corresponding to the labels for the observations
            in X_train
        n -- a positive integer, the number of nearest neighbors to poll
    
    Example:
        point = pd.Series([1,2])
        X_train = pd.DataFrame([[1,2],[3,4],[5,6]])
        y_train = pd.Series(["a","a","b"])
        n = 2
        print(custom_KNN(point, X_train, y_train, n)) #--> 'a'
    """
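    # Compute unsorted distances so that each one stays paired with its label
    # (all_distances returns sorted distances, which would break the pairing)
    dists = np.linalg.norm(X_train.values - point.values, axis=1)
    labs = labels_of_smallest(dists, y_train.values, n)
    return label_voting(labs)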

### BEGIN SOLUTION
def custom_KNN(point, X_train, y_train, n):
    
    # Helper Function for vote counting
    def countVotes(l):
        c = Counter(l).most_common()
        if len(c) == 1:
            return c[0][0]
        if c[0][1] > c[1][1]:
            return c[0][0]
        else:
            top_votes = c[0][1]
            #print(top_votes)
            poss = []
            for t in c:
                if t[1] == top_votes:
                    poss.append(t[0])
            idx = dict()
           # print(poss)
            for p in poss:
                idx[l.index(p)] = p
            #print(idx)
            return l[sorted(idx.keys())[0]]
        
    # Take difference 
    diff = point - X_train
    
    # Find distance
    dists = np.apply_along_axis(np.linalg.norm, 1, diff )
    
    # Create df of distances; re-index to original data    
    df = pd.DataFrame(dists)
    df.index = X_train.index
    
    # Add labels, column names.
    df = pd.concat([df, y_train], axis = 1)
    
    df.columns = ["dist","label"]
    
    # Take top votes, and count
    votes = list(df.sort_values("dist").head(n)['label'])
    return countVotes(votes)
###END SOLUTION

You should now have a functioning KNN classifier assigned to the function custom_KNN.

Let's now see how good our classifier is using n = 5.

The below cell may or may not complete running on Vocareum due to processing constraints.

FOR FASTER COMPLETION, TRY COMMENTING OUT THE CELL BELOW

In [None]:
%%time

# Create new test train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=24)

print("Total 'test' observations:", len(X_test))
print("Classifying every point in X_test would take too long - classify the first 100")
custom_preds = []
for i, idx in enumerate(X_test.index[:100]):
    if i % 100 == 0: print(i)
    pred = custom_KNN(X_test.loc[idx,:], X_train, y_train, 5)
    custom_preds.append(pred)

KNN in sklearn
While it is useful to learn and see how predictions are made with K-Nearest Neighbors using our own function, the sklearn library has an implementation called KNeighborsClassifier that will run much faster than our home-built version.

In the next question, we will ask you to implement KNN in sklearn.

* Question 10:
* 15 points

Use the class KNeighborsClassifier to instantiate the classifier knn, setting the parameter n_neighbors = 5. Next, use the method fit() to fit the classifier to the X and y training sets.

Finally, create a prediction for the first 100 test observations and assign it to skpreds. This can be accomplished by calling the method predict on the classifier knn with the argument X_test[:100].

Feel free to refer to the documentation.

In [None]:
### GRADED
# Import
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

### YOUR ANSWER BELOW

knn = None
skpreds = None


### BEGIN SOLUTION
knn = KNeighborsClassifier(n_neighbors = 5)

# Fit model with training data
knn.fit(X_train, y_train)

# Create predictions for first 100 test observations
skpreds = knn.predict(X_test[:100])

###END SOLUTION

In the code cell below, we compare the performance of our home-built KNN classifier and the one that comes with sklearn. If we have made no mistakes, the number of differences should equal zero!

In [None]:
print("sklearn prediction performance")
print(classification_report(y_test[:100], skpreds))


### The lines of code below compare the performance of your home-built classifier with
### the sklearn predictions -- if all the cells above ran successfully, you should see identical scores

print("\nHome-Built prediction performance")
print(classification_report(y_test[:100], custom_preds))


### The below lines of code will explicitly compare predictions:
### "differences" should == 0!

### NB: Commenting/uncommenting multiple lines in Jupyter can be accomplished with:
### <ctrl-/> on windows and <cmd-/> on mac
differences = 0
for cust, sk in zip(custom_preds, skpreds):
    if cust != sk:
        differences +=1
print("Total Differences:", differences)

Practice with sklearn:
In the questions below, we will ask you to create a new test/train split, and fit a new KNN model using sklearn.

All of the basic steps for KNN have already been performed above. Feel free to reference.

In [None]:
# Ensure data is consistent

# read feature names; the file is assumed to contain no tab characters,
# so each line is read as a single column
feats = pd.read_csv(FEATURE_NAMES, sep='\t', header=None)

# read in training data
har_train = pd.read_csv(TRAIN_DATA, sep=r'\s+', header=None)

# clean the feature names (drop the leading index) and assign them as column names
clean_features = [feat[0].split(' ')[1] for feat in feats.values]
har_train.columns = clean_features

# read in training labels
har_train_labels = pd.read_csv(TRAIN_LABELS, header=None)
har_train_labels.columns = ['label']
y = har_train_labels.loc[:, 'label']

* Question 11:
* 5 points

Suppose you want to make a new test-train-split on our data such that the labels should be found in y and observations should be found in har_train. When splitting, we want to assign the output from the split to X_train2, X_test2, y_train2, and y_test2. We want to set test_size equal to .4 and the random_state to 1738. Which of the following commands would accomplish this task?

'a') X_train2, X_test2, y_train2, y_test2 = train_test_split(har_train, y, test_size = .4, random_state = 1738)
'b') X_train2, X_test2, y_train2, y_test2 = train_test_split(y, har_train, train_size = .4, random_state = 1738)
'c') X_train2, X_test2, y_train2, y_test2 = train_test_split(har_train, y, .4, 1738)
'd') X_train2, X_test2, y_train2, y_test2 = train_test_split(har_train, y, t_size = .4, rs = 1738)
Assign the letter associated with your choice as a string to ans11.

In [None]:
### GRADED

### YOUR ANSWER BELOW

ans11 = None
### BEGIN SOLUTION
ans11 = 'a'
### END SOLUTION

Building a model using sklearn is just as easy as those last two steps! As long as your data is in the right format, once you make your train/test split, the syntax for fitting pretty much any of the models in sklearn is about the same.
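
To illustrate how uniform the fitting syntax is, the sketch below fits two different sklearn models on the same made-up data with identical fit/predict calls. The toy arrays and the choice of LogisticRegression as the second model are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy, well-separated data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

new_points = np.array([[0.1, 0.1], [5.1, 5.0]])

# The same two calls, fit and predict, work for both estimators
predictions = {}
for model in (KNeighborsClassifier(n_neighbors=3), LogisticRegression()):
    model.fit(X_toy, y_toy)
    predictions[type(model).__name__] = model.predict(new_points).tolist()
print(predictions)
```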

Interpret the results
To interpret the results we will be looking at the trade-off between bias and variance as we change our n_neighbors. In many cases, false negatives are more costly than false positives. As such we will be looking primarily at the change in recall as we build a number of different models.
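
As a reminder, the recall of a class is the fraction of its true members that the model labels correctly, TP / (TP + FN). The sketch below computes it per class by hand on made-up labels; it should correspond to what recall_score returns with average=None, though the arrays here are purely illustrative.

```python
import numpy as np

# Made-up true and predicted labels
y_true = np.array([1, 1, 1, 2, 2, 3])
y_pred = np.array([1, 1, 2, 2, 2, 1])

recall_per_class = {}
for label in np.unique(y_true):
    is_label = y_true == label
    # Fraction of true members of this class that were predicted correctly
    recall_per_class[int(label)] = float(np.mean(y_pred[is_label] == label))
print(recall_per_class)
```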

Note: The code below takes some time to run and it may time out on Vocareum. For this reason, we have commented out the code for you and included only the image it produces.

In [None]:
#%%time
#from sklearn.metrics import recall_score

### Calculating recall scores for multiple "n-neighbors"
#recall_scores = {}
#for n in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,20,25,50,75,100]:
#    knn = KNeighborsClassifier(n_neighbors=n)
#    knn.fit(X_train, y_train)
#    recall_scores[n] = recall_score(y_test, knn.predict(X_test), average = None)

### Put recall scores into DataFrame
#scores_df = pd.DataFrame(recall_scores).T
#scores_df.columns = [str(i) for i in range(1,7)]
#scores_df.index = scores_df.index.astype(str)

### Create plot of recall scores
#plt.figure(figsize = (10,10))
#for col in scores_df:
#    if col != 'n_neighbors':
#        plt.plot(scores_df[col], label = col)

#plt.ylabel(" Recall Score", fontsize = 12)
#plt.xlabel("n_neighbors (NB: not an interval scale)", fontsize = 12)
#plt.legend(title = "activity");

Looking at the recall scores above, as n_neighbors trends towards 100 we see an increase in bias. Furthermore, it looks like the better KNN models have fewer than 15 neighbors.