# Predict Classes with a kNN Classifier
## Mini-Lab: Repurposing the Classifier

Welcome to your final mini-lab! Go ahead and run the following cell to get started. You can do that by clicking on the cell and then clicking `Run` on the top bar. You can also just press `Shift` + `Enter` to run the cell.

In [1]:
from datascience import *
import numpy as np
import otter

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

For our final lab, we'll take things into our own hands and build a classifier using the k-Nearest Neighbors (kNN) algorithm by repurposing the classifier we wrote in the interactive content. The kNN code from the interactive content has already been copied for you, so go ahead and run the next cell to define all of the relevant functions.

In [2]:
def distance(point1, point2):
    """Returns the distance between point1 and point2
    where each argument is an array
    consisting of the coordinates of the point"""
    return np.sqrt(np.sum((point1 - point2)**2))

def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop('Class')
    def distance_from_point(row):
        return distance(np.array(new_point), np.array(row))
    return attributes.apply(distance_from_point)

def table_with_distances(training, new_point):
    """Augments the training table
    with a column of distances from new_point"""
    return training.with_column('Distance', all_distances(training, new_point))

def closest(training, new_point, k):
    """Returns a table of the k rows of the augmented table
    corresponding to the k smallest distances"""
    with_dists = table_with_distances(training, new_point)
    sorted_by_distance = with_dists.sort('Distance')
    topk = sorted_by_distance.take(np.arange(k))
    return topk

def majority(topkclasses):
    """Returns the majority class (1 or 0) among the rows of topkclasses;
    ties go to 0"""
    ones = topkclasses.where('Class', are.equal_to(1)).num_rows
    zeros = topkclasses.where('Class', are.equal_to(0)).num_rows
    if ones > zeros:
        return 1
    else:
        return 0

def classify(training, new_point, k):
    """Returns the majority class among the k training rows
    closest to new_point"""
    closestk = closest(training, new_point, k)
    topkclasses = closestk.select('Class')
    return majority(topkclasses)
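
As a quick sanity check on `distance`, here is a minimal sketch that uses two made-up points (they are not taken from the wine data) chosen so the result is easy to verify by hand. The function body is copied from the cell above; the points `p` and `q` are assumptions for illustration only.

```python
import numpy as np

def distance(point1, point2):
    """Euclidean distance between two coordinate arrays (same formula as in the lab)."""
    return np.sqrt(np.sum((point1 - point2)**2))

# Hypothetical points forming a 3-4-5 right triangle with the origin.
p = np.array([0, 0])
q = np.array([3, 4])

d = distance(p, q)
print(d)  # 5.0
```

The same formula extends to any number of coordinates, which is why it can be applied to a full row of attributes.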

This code was specifically built for the wine dataset, which we'll import and test below. The classifier should output `1`.

In [3]:
wine = Table().read_table("../datasets/wine.csv")
wine.show(5)
classify(wine, wine.drop("Class").rows[0], 5)

Class,Alcohol,Malic Acid,Ash,Alcalinity of Ash,Magnesium,Total Phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color Intensity,Hue,OD280/OD315 of diulted wines,Proline
1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


1

Cool, right? Sadly, this classifier won't work if we try to use it for a different purpose. Below we import data for 500 NBA athletes in 2013. Our final task is to repurpose the classifier so that it classifies NBA athletes by `Position` based on `Height`, `Weight`, and `Age`, and so that it chooses among three separate positions rather than performing the standard binary classification. Go ahead and run the next cell to import our NBA dataset.

In [4]:
nba = Table().read_table("../datasets/nba2013.csv").drop("Name")
nba.show(5)

Position,Height,Weight,Age in 2013
Guard,80,221,23
Guard,80,235,23
Guard,80,210,28
Guard,80,215,32
Guard,79,215,26


If we try running the `classify` function on this table, we'll run into issues. To get around this, we'll have to modify our current code so that it classifies based on our NBA dataset rather than the wine dataset. Below are the three functions that you will need to modify. Change these functions so that we can classify a different dataset!

*Note*: Two of these functions only need minor changes, whereas the remaining function needs a major overhaul in order to work correctly. Can you identify which are which? Also, you may find the following code snippet useful:

```
positions = make_array("Guard", "Center", "Forward")
positions[np.argmax(make_array(x, y, z))]
```

If the variables `x`, `y`, and `z` correspond to the number of `Guard`, `Center`, and `Forward` athletes, then the code snippet returns the position with the highest count. For example, if `x` = 2, `y` = 6, and `z` = 4, then this code snippet would return `Center`. Where would this be useful?
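
To make the snippet concrete, here is a small sketch of the same idea using plain NumPy arrays in place of `make_array`. The counts are the made-up values from the example above, and the tie case is an extra illustration, not something the lab requires.

```python
import numpy as np

positions = np.array(["Guard", "Center", "Forward"])

# Counts from the example: x = 2 Guards, y = 6 Centers, z = 4 Forwards.
counts = np.array([2, 6, 4])
winner = positions[np.argmax(counts)]
print(winner)  # Center

# np.argmax returns the first index of the maximum, so a tie
# resolves to whichever position is listed first.
tied_counts = np.array([5, 5, 5])
tie_winner = positions[np.argmax(tied_counts)]
print(tie_winner)  # Guard
```

Because ties resolve by order, the order of `positions` quietly acts as a tie-breaking rule.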

In [5]:
def all_distances(training, new_point):
    """Returns an array of distances
    between each point in the training set
    and the new point (which is a row of attributes)"""
    attributes = training.drop('Position')
    def distance_from_point(row):
        return distance(np.array(new_point), np.array(row))
    return attributes.apply(distance_from_point)


def majority(topkclasses):
    """Returns the most common position among the rows of topkclasses;
    ties go to the position listed first in positions"""
    positions = make_array("Guard", "Center", "Forward")
    x = topkclasses.where('Position', are.equal_to("Guard")).num_rows
    y = topkclasses.where('Position', are.equal_to("Center")).num_rows
    z = topkclasses.where('Position', are.equal_to("Forward")).num_rows
    return positions[np.argmax(make_array(x, y, z))]


def classify(training, new_point, k):
    """Returns the most common position among the k training rows
    closest to new_point"""
    closestk = closest(training, new_point, k)
    topkclasses = closestk.select('Position')
    return majority(topkclasses)
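
The `majority` function above hard-codes the three position names. As an alternative sketch (not the approach this lab asks for), the vote can be counted without naming the labels in advance by using `collections.Counter` from the standard library. The helper name `majority_label` and the list of votes are hypothetical and only serve the illustration.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label in a sequence of labels (hypothetical helper)."""
    # most_common(1) gives a list with one (label, count) pair.
    return Counter(labels).most_common(1)[0][0]

# Made-up labels standing in for the classes of the k nearest neighbors.
votes = ["Guard", "Forward", "Guard", "Center", "Guard"]
winner = majority_label(votes)
print(winner)  # Guard
```

In the lab's setting, the labels would presumably come from the `Position` column of the k closest rows; that wiring is left out here to keep the sketch self-contained.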

In [6]:
# This cell should output "Guard" if you implemented the above fixes correctly.
classify(nba, nba.drop("Position").rows[16], 15)

'Guard'

Congratulations! Not only have you repurposed a classifier and finished the final lab, you've also completed this course!