# Lesson Five

In this episode, we're going to do something special, and that's write our own classifier from scratch. If you're new to machine learning, this is a big milestone. Because if you can follow along and do this on your own, it means you understand an important piece of the puzzle. The classifier we're going to write today is a scrappy version of k-Nearest Neighbors. That's one of the simplest classifiers around. First, here's a quick outline of what we'll do in this episode. We'll start with our code from Episode 4, Let’s Write a Pipeline:

In [4]:
# Recall that last lesson we did a simple experiment.
# We imported the datasets
from sklearn import datasets
iris = datasets.load_iris()

# X is the feature (input)
X = iris.data

# y is the label (output)
y = iris.target

# We split our dataset into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# We import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
my_classifier = KNeighborsClassifier()

# We used the training data to train a classifier
my_classifier.fit(X_train, y_train)

# We tested to see how accurate it was
predictions = my_classifier.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.973333333333


Writing the classifier is the part we're going to focus on today. Previously we imported the classifier from a library using these two lines. Here we'll comment them out and write our own. The rest of the pipeline will stay exactly the same.

In [8]:
import random

# Step 1: comment out the import and write our own; the rest of the pipeline will stay exactly the same
#from sklearn.neighbors import KNeighborsClassifier

# Step 2: implement a class for our custom classifier
class ScrappyKKN():
    # Step 3: Let's see what methods we need to implement. Looking at the interface for a classifier,
    # we see there are two that we care about: fit(), which does the training, and predict(), which does
    # the prediction.
    
    def fit(self, X_train, y_train):
        '''
        This fit() method will do the training
        Remember it takes the features and labels for the training set as input parameters.
        '''
        # Let's store the training data in this class. You can think of this as just memorizing it.
        # You will see why we do that later on.
        self.X_train = X_train
        self.y_train = y_train
            
        pass
    
    def predict(self, X_test):
        '''
        As parameters, this receives the features for our testing data.
        And as output it returns predictions for the labels
        '''
        
        # Remember that we'll need to return a list of predictions. That's because the parameter, X_test, is
        # actually a 2D array, or list of lists.
        predictions = []
        
        # each row contains the features for one testing example.
        for row in X_test:
            # To make a prediction for each row, I will just randomly pick a label from the training data
            # and append that to our predictions.
            label = random.choice(self.y_train)
            predictions.append(label)
        
        return predictions

# Change our pipeline to use the custom class
# At this point our pipeline is working again
my_classifier = ScrappyKKN()

# We used the training data to train a classifier
my_classifier.fit(X_train, y_train)

# We tested to see how accurate it was
predictions = my_classifier.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.346666666667


Recall there are three different types of flowers in the iris dataset, so random guessing should be right about 33% of the time. Now we know the interface for a classifier. But when we started this exercise, our accuracy was above 90%, so let's see if we can do better. To do that, we'll implement our classifier, which is based on k-Nearest Neighbors. Here's the intuition for how that algorithm works: to classify a testing point, we find the training point that's closest to it and predict that it has the same label.
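
As a rough illustration of the nearest-neighbor intuition, here is a minimal sketch on a made-up toy dataset. The points, the labels, and the helper names `straight_line_distance` and `nearest_label` are all hypothetical and not part of the lesson's code. The only idea it shows is that a new point gets the label of whichever training point lies closest to it.

```python
import math

# Hypothetical toy training data: two features per point, two labels.
toy_points = [(1.0, 1.0), (2.0, 1.5), (6.0, 5.0), (7.0, 6.5)]
toy_labels = ["green", "green", "red", "red"]


def straight_line_distance(p, q):
    # Distance between two 2D points, as you would measure it with a ruler.
    return math.hypot(p[0] - q[0], p[1] - q[1])


def nearest_label(point, points, labels):
    # Find the index of the training point closest to `point`,
    # then return the label stored at that index.
    closest = min(range(len(points)),
                  key=lambda i: straight_line_distance(point, points[i]))
    return labels[closest]


print(nearest_label((1.5, 2.0), toy_points, toy_labels))  # green
print(nearest_label((6.5, 6.0), toy_points, toy_labels))  # red
```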

To find the closest training point, we need to measure the distance between two points. There's a formula for that called the Euclidean distance. For two points a and b with two features each, it is sqrt((a1 - b1)^2 + (a2 - b2)^2). It works a bit like the Pythagorean Theorem: A squared plus B squared equals C squared. You can think of the first term, a1 - b1, as A, the difference between the first pair of features. Likewise, you can think of the second term, a2 - b2, as B, the difference between the second pair of features. The distance we compute is the length of the hypotenuse.

Now here's something cool. Right now we're computing distance in two-dimensional space, because we have just two features in our toy dataset. But what if we had three features, or three dimensions? Well, then we'd be in a cube. We can still visualize how to measure distance in that space with a ruler. But what if we had four features, or four dimensions, like we do in iris? Well, now we're in a hypercube, and we can't visualize this very easily. The good news is that the Euclidean distance works the same way regardless of the number of dimensions. With more features, we can just add more terms to the equation. You can find more details about this online. Now let's code up Euclidean distance.
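
Before turning to the library version, here is a rough pure-Python sketch of the "just add more terms" idea. The function name `euclidean` is an assumption for illustration and is not the lesson's own code. It squares the difference for each pair of features, sums the squares, and takes the square root, so the same function handles two, three, or four features.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points with the same number of features."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of features")
    # One squared-difference term per feature, however many features there are.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean([0, 0], [3, 4]))              # 5.0, the 3-4-5 right triangle
print(euclidean([0, 0, 0, 0], [1, 1, 1, 1]))  # 2.0, four features like iris
```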

In [14]:
# Step 7: implement nearest neighbor algorithm for a classifier.
# To make a prediction for test point we'll calculate the distance to all the training points.
# Then we'll predict the testing point has the same label as the closest one.

# Let's code up Euclidean distance. There are plenty of ways to do that, but
# we'll use a library called scipy.
from scipy.spatial import distance

class ScrappyKKN():
    '''
    Complete version
    '''
    
    def fit(self, X_train, y_train):
        '''
        This fit() method will do the training
        Remember it takes the features and labels for the training set as input parameters.
        '''
        self.X_train = X_train
        self.y_train = y_train
            
        pass
    
    def predict(self, X_test):
        '''
        As parameters, this receives the features for our testing data.
        And as output it returns predictions for the labels
        '''
        
        predictions = []
        for row in X_test:
            
            # Delete the random prediction we made
            #label = random.choice(self.y_train)
            # and replace it with a method that finds the closest training
            # point to the test point.
            label = self.closest(row)
            
            predictions.append(label)
        
        return predictions
    
    def closest(self, row):
        '''
        In this method we will loop over all the training points, and keep track of the
        closest one so far. Remember that we memorized the training data in our fit() method.
        '''
        
        # To start I will calculate the distance from the test point to the first training point.
        best_dist = euc(row, self.X_train[0])
        # I will use this variable to keep track of the index of the training point that's closest.
        # We'll need this later to retrieve its label
        best_index = 0
        
        # Now we'll iterate over all the other training points.
        for i in range(1, len(self.X_train)):
            dist = euc(row, self.X_train[i])
            
            # And every time we find a closer one, we will update our variables.
            if dist < best_dist:
                best_dist = dist
                best_index = i
        
        # Finally we will use the index to return the label for the closest training example.
        return self.y_train[best_index]
        
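# euc() is defined at module level (outside the class), because closest() calls it
# as a plain function rather than as a method.
def euc(a, b):
    '''
    Here (a) and (b) are lists of numeric features. Say (a) is a point from our testing data,
    and (b) is a point from our training data. This function returns the distance between them.
    '''
    return distance.euclidean(a, b)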



my_classifier = ScrappyKKN()
my_classifier.fit(X_train, y_train)
predictions = my_classifier.predict(X_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.96


Our accuracy is over 90% again. Awesome.

Now if you can code this up and understand it, that's a big accomplishment, because it means you can write a simple classifier from scratch. There are a number of pros and cons to this algorithm, many of which you can find online. The basic pro is that it's relatively easy to understand and works reasonably well for some problems. The basic con is that it's slow, because it has to iterate over every training point to make a prediction. And importantly, as we saw in Episode 3, some features are more informative than others, but there's not an easy way to represent that in k-Nearest Neighbors. In the long run, we want a classifier that learns more complex relationships between features and the label we're trying to predict. A decision tree is a good example of that. And a neural network like we saw in TensorFlow Playground is even better.
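
The classifier above only ever looks at the single closest training point, which amounts to k = 1. One common way to use more neighbors is to let the k closest points vote. The following is a minimal sketch of that idea and is not part of the lesson's code. The function names `euclidean` and `k_nearest_label`, the default `k=3`, and the toy data are all assumptions for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    # Same distance as before: one squared-difference term per feature.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest_label(row, X_train, y_train, k=3):
    # Pair the distance from `row` to every training point with that point's label.
    # Like the lesson's version, this still visits every training point.
    scored = [(euclidean(row, x), label) for x, label in zip(X_train, y_train)]
    # Sort by distance and keep the labels of the k closest points.
    scored.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in scored[:k]]
    # Majority vote among those labels (ties are not handled specially in this sketch).
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two clusters with labels 0 and 1.
toy_X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8]]
toy_y = [0, 0, 0, 1, 1]

print(k_nearest_label([2, 2], toy_X, toy_y))  # 0
print(k_nearest_label([8, 9], toy_X, toy_y))  # 1
```

With `k=1` this sketch reduces to the same rule as `closest()`: return the label of the single nearest training point.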