In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from sklearn.tree import DecisionTreeClassifier

# Lecture 10: Decision Trees, Random Forest, Ensemble Learning, K-NN, Hyperparameter Tuning
## 4/26/2022

### Hosted and maintained by [Student Association for Applied Statistics (SAAS)](https://saas.berkeley.edu/). Updated by [Jonathan Pan](mailto:jonathanpan4@berkeley.edu), [Michael Wang](mailto:wangjmichael18@berkeley.edu)

#### Originally authored by [Calvin Chen](mailto:chencalvin99@berkeley.edu), [Michelle Hao](mailto:mhao@berkeley.edu), and [Patrick Chao](mailto:prc@berkeley.edu).
        

## Classification

Decision trees rely on a series of yes or no questions to decide which class an input point falls under. You've seen decision trees your entire life. Here are some examples:

<img src='pictures/mansplaining.png' width=100%>
<img src='pictures/meme.png' height=5px>

In general, we answer a yes or no question at every step, and depending on our answer, we go one way or another through the tree. They are essentially the same idea as flowcharts.

Now let's apply this to the data science setting for a classification task. In particular, you're given a data point $X = \begin{bmatrix} X_1 & X_2 & \ldots & X_k \end{bmatrix}$, and you want to assign it a class $c$. We've seen examples of this before: logistic regression tries to assign a class $c \in \{0, 1\}$ to each data point by predicting $\mathbb{P}(c = 1 \mid X)$.

For a decision tree to work for this, we want to look at $X$, ask yes-no questions about its features, and assign it to a class.

<a id='dataset'></a>
### The Dataset

<img src='pictures/iris.jpg' width="250" height="250">
<center> Image from: A Complete Guide to K-Nearest-Neighbors by Zakka </center>

The dataset we'll be using is the [Iris Flower Dataset](https://archive.ics.uci.edu/ml/datasets/Iris). It contains a series of observations on three species of Iris (Iris setosa, Iris versicolor, and Iris virginica). Each observation contains four features: the *petal length, petal width, sepal length, and sepal width*. The **question** we're asking today is: can we predict the species of Iris from its *petal length, petal width, sepal length, and sepal width*?

In [3]:
#importing the data
iris = datasets.load_iris()
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= ['Sepal Length', 'Sepal Width','Petal Length','Petal Width'] + ['species'])

#y contains the correct classifications (0, 1, 2 for each type of Iris)
#0 = Iris Setosa,1 = Iris Versicolour,2 = Iris Virginica
Y = iris["species"]
Y[50]

1.0

#### Summarize the dataset

In [57]:
iris.describe()

Unnamed: 0,Sepal Length,Sepal Width,Petal Length,Petal Width,species
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


In [58]:
# Let’s now take a look at the number of instances (rows) that 
# belong to each class. We can view this as an absolute count.
iris.groupby('species').size()

species
0.0    50
1.0    50
2.0    50
dtype: int64

#### Dividing the Dataframe into Feature and Labels

In [4]:
feature_columns = ['Sepal Length', 'Sepal Width','Petal Length','Petal Width']
X = iris[feature_columns].values
Y = iris['species'].values

# Alternative way of selecting features and labels arrays:
# X = iris.iloc[:, 0:4].values
# Y = iris.iloc[:, 4].values

#### Splitting the Data into Train and Test Sets

In [5]:
#Splitting dataset into training and test
from sklearn.model_selection import train_test_split
X_train_iris, X_test_iris, Y_train_iris, Y_test_iris = train_test_split(X, Y, test_size = 0.2, random_state = 0)

An example decision tree to solve this **classification** task could look as follows:

<img src='pictures/Example Decision Tree.png'>

Let's see how this decision tree fares on our training data.

In [61]:
class TreeNode:
    #What could be an example of a split function?
    def __init__(self, left=None, right=None, split_fn=None, leaf_evaluate=None):
        self.left = left
        self.right = right
        self.split_fn = split_fn
        self.leaf_evaluate = leaf_evaluate
    
    def is_leaf(self):
        return self.left is None and self.right is None
    
    def evaluate(self, X_i):
        if self.is_leaf():
            return self.leaf_evaluate()
        if self.split_fn(X_i):
            return self.left.evaluate(X_i)
        else:
            return self.right.evaluate(X_i)

#Can you trace what the evaluate function does for a tree of depth 3?

class Leaf(TreeNode):
    
    def __init__(self, label):
        TreeNode.__init__(self, leaf_evaluate=lambda: label)
        


In [62]:
def accuracy(y_pred, y_true):
    return (y_pred == y_true).sum() / y_true.shape[0]

In [63]:
def predict(X, tree):
    if len(X.shape) == 1:
        X = X.reshape(1, -1)
    preds = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        preds[i] = tree.evaluate(X[i])
    return preds

In [64]:
root = TreeNode(
    split_fn=lambda X_i: X_i[0] > 5,
    left=TreeNode(
        split_fn=lambda X_i: X_i[2] > 3,
        left=Leaf(0),
        right=Leaf(2)
    ),
    right=TreeNode(
        split_fn=lambda X_i: X_i[3] > 3,
        left=Leaf(0),
        right=Leaf(2)
    )
)

In [65]:
preds = predict(X_train_iris, root)
preds[:5]

array([0., 0., 2., 0., 0.])

In [66]:
accuracy(preds, Y_train_iris)

0.008333333333333333

This decision tree is horrible! We didn't actually try to train it on our data.

### Training a Decision Tree

The question is now: how do we choose how to make the splits? The answer, of course, comes from our training data.

To make things a little simpler, let's just examine the first ten data points.

In [67]:
X_train_small = X_train_iris[:10]
Y_train_small = Y_train_iris[:10]
X_train_small

array([[6.4, 3.1, 5.5, 1.8],
       [5.4, 3. , 4.5, 1.5],
       [5.2, 3.5, 1.5, 0.2],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.2],
       [5.2, 2.7, 3.9, 1.4],
       [5.7, 3.8, 1.7, 0.3],
       [6. , 2.7, 5.1, 1.6],
       [5.9, 3. , 4.2, 1.5],
       [5.8, 2.6, 4. , 1.2]])

In [68]:
Y_train_small

array([2., 1., 0., 2., 2., 1., 0., 1., 1., 1.])

Say that our first split is based on sepal length (the first feature).

In [69]:
sl_0 = X_train_small[Y_train_small == 0][:, 0]
sl_1 = X_train_small[Y_train_small == 1][:, 0]
sl_2 = X_train_small[Y_train_small == 2][:, 0]

In [70]:
sl_0

array([5.2, 5.7])

In [71]:
sl_1

array([5.4, 5.2, 6. , 5.9, 5.8])

In [72]:
sl_2

array([6.4, 6.1, 6.4])

For our decision tree, how should we split on sepal length?

Just based on our training data, if we split on (Sepal Length > 6), we've isolated all irises that are class 2 (Iris virginica).

<img src='pictures/Simple Decision Tree.png'/>

In [73]:
simple = TreeNode(
    split_fn=lambda X_i: X_i[0] > 6,
    left=Leaf(2),
    right=Leaf(1)
)

In [74]:
preds = predict(X_train_small, simple)
preds

array([2., 1., 1., 2., 2., 1., 1., 1., 1., 1.])

In [75]:
accuracy(preds, Y_train_small)

0.8

Pretty good! Now let's try to come up with a programmatic way of doing this.

The intuition behind making a good decision tree is choosing our questions (the steps in the decision tree) so that each one *splits the data into groups that are as pure as possible*. For example, in the iris case, we would like to find a split that separates the different species of iris as much as possible.

This idea of "splitting" to separate our irises the most introduces the idea of **entropy**. We want to minimize the entropy, or randomness, in each split section of the data.

<a id='entropy'></a>
### Entropy

To begin, let's first define what entropy is. In the context of machine learning or information theory, entropy is **the measure of disorder within a set** or the **amount of surprise**.

Let's take a look at our training data, and the feature we chose to split on, **sepal length**.

In [76]:
sepal_length = X_train_small[:, 0]
sepal_length

array([6.4, 5.4, 5.2, 6.1, 6.4, 5.2, 5.7, 6. , 5.9, 5.8])

In [77]:
Y_train_small

array([2., 1., 0., 2., 2., 1., 0., 1., 1., 1.])

After we split on (Sepal Length > 6), we divided our data into two groups.

In [78]:
yes = Y_train_small[sepal_length > 6]
no = Y_train_small[sepal_length <= 6]

In [79]:
yes

array([2., 2., 2.])

In [80]:
no

array([1., 0., 1., 0., 1., 1., 1.])

Let's consider a different split: (Sepal Length > 5.5).

In [81]:
yes_bad = Y_train_small[sepal_length > 5.5]
no_bad = Y_train_small[sepal_length <= 5.5]

In [82]:
yes_bad

array([2., 2., 2., 0., 1., 1., 1.])

In [83]:
no_bad

array([1., 0., 1.])

Which split was better? The first, because once we made the split, *we were more sure of what class we should predict*. How can we quantify this?

The mathematical definition of entropy is:

$$H(\textbf{p}) = -\sum_i p_i \cdot \log_2(p_i)$$

where $H(\textbf{p})$ is equal to the entropy of the data set, and $p_i$ is the probability of getting each result.


Below, we see a visualization of the entropy of a set with two classes:
<img src='pictures/Entropy.png' width='50%'>

Let's say $Pr(X = 1)$ is the probability that you flip heads, where heads is represented by $1$ and tails is represented by $0$. From this, we can see that the y-value, $H(X)$ (the calculated entropy), is at a minimum when the chance of flipping heads is $0$ or $1$, and at a maximum when the chance of flipping heads is $0.5$. In other words, a data subset is the most random when all classes are equally likely, and the least random when a single class has probability $1$ and every other class has probability $0$.
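To make the curve concrete, here is a small sketch that evaluates the two-class entropy at a few probabilities. The helper name `binary_entropy` is ours and not part of the notebook, and it assumes the usual convention that a class with probability $0$ contributes nothing to the sum.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a coin that lands heads with probability p."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]  # convention: a zero-probability outcome contributes 0
    # -p * log2(p) written as p * log2(1 / p) to avoid printing -0.0
    return float(np.sum(probs * np.log2(1.0 / probs)))

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(p, round(binary_entropy(p), 3))
```

The values at $p$ and $1 - p$ match, which is why the curve is symmetric around $0.5$.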

When we look at a set of y-values, the entropy is:

$$\sum_{\text{class $c_i$}} -\left(\text{proportion of $c_i$'s}\right) \cdot \log_2 \left(\text{proportion of $c_i$'s}\right)$$

In [84]:
def H(y):
    def proportion(val, y):
        return (y == val).sum() / len(y)
    unique = set(y)
    return sum(-1 * proportion(val, y) * np.log2(proportion(val, y)) for val in unique)
        

Let's see how this comes into play in our splits.

In [85]:
original_entropy = H(Y_train_small)
original_entropy

1.4854752972273344

In our good split, our entropies were:

In [86]:
H(yes), H(no)

(0.0, 0.863120568566631)

In our bad split, our entropies were:

In [87]:
H(yes_bad), H(no_bad)

(1.4488156357251847, 0.9182958340544896)

Clearly, the first split was better, because we reduced entropy the most.

To combine these statistics together for one measure, we'll take the **weighted average**, weighting by the sizes of the two sets.

In [88]:
def weighted_entropy(yes, no):
    total_size = len(yes) + len(no)
    return (len(yes) / total_size) * H(yes) + (len(no) / total_size) * H(no)

In [89]:
H(Y_train_small)

1.4854752972273344

In [90]:
weighted_entropy(yes, no)

0.6041843979966417

In [91]:
weighted_entropy(yes_bad, no_bad)

1.289659695223976

This is huge! We now have a way to choose our splits for our decision tree: 

**Find the best split value (of each feature) that reduces our entropy from the original set the most!**
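The amount by which a split reduces entropy is commonly called the *information gain*. The sketch below recomputes it for the two sepal-length splits we compared, using the ten training labels shown earlier. The function names `entropy` and `information_gain` are ours, chosen for illustration.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(y, mask):
    """Entropy of y minus the size-weighted entropy of the two sides of a split."""
    yes, no = y[mask], y[~mask]
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(y)
    return entropy(y) - weighted

# The ten training labels and sepal lengths shown earlier in this section.
y = np.array([2, 1, 0, 2, 2, 1, 0, 1, 1, 1])
sepal_length = np.array([6.4, 5.4, 5.2, 6.1, 6.4, 5.2, 5.7, 6.0, 5.9, 5.8])

gain_good = information_gain(y, sepal_length > 6)    # the "good" split
gain_bad = information_gain(y, sepal_length > 5.5)   # the "bad" split
print(gain_good, gain_bad)
```

Minimizing the weighted entropy and maximizing the information gain pick the same split, since the entropy of the original set is the same for every candidate.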

Check for understanding: why does it make sense to pick the split which reduces our entropy the most?

### Training

In [92]:
from scipy.stats import mode

def train(X_train, Y_train, max_depth=None):
    if len(Y_train) == 0:
        return Leaf(0)
    
    if len(set(Y_train)) == 1 or max_depth == 1:
        return Leaf(mode(Y_train).mode)
    
    def split_weighted_entropy(feature_idx, feature_value):
        feature = X_train[:, feature_idx]
        yes = Y_train[feature > feature_value]
        no = Y_train[feature <= feature_value]
        return weighted_entropy(yes, no)
    
    splits = np.zeros(X_train.shape)
    for feature_idx in range(X_train.shape[1]):
        for i, feature_value in enumerate(X_train[:, feature_idx]): # try to split on each X-value
            splits[i, feature_idx] = split_weighted_entropy(feature_idx, feature_value)
    
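    # Splitting at a feature's largest value would leave the "yes" side empty,
    # so rule that value out as a split candidate.
    max_idxs = X_train.argmax(axis=0)
    for col, max_idx in enumerate(max_idxs):
        splits[max_idx, col] = float('inf')
    
    # Pick the (row, feature) pair with the lowest weighted entropy.
    # np.argmin returns an index into the flattened array, so we unpack it.
    i = np.argmin(splits)
    best_feature_idx = i % splits.shape[1]
    best_feature_value = X_train[i // splits.shape[1], best_feature_idx]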
    
    yes = X_train[:, best_feature_idx] > best_feature_value
    no = X_train[:, best_feature_idx] <= best_feature_value
    
    tree = TreeNode(
        split_fn=lambda X_i: X_i[best_feature_idx] > best_feature_value,
        left=train(X_train[yes], Y_train[yes], max_depth=max_depth - 1 if max_depth is not None else None),
        right=train(X_train[no], Y_train[no], max_depth=max_depth - 1 if max_depth is not None else None)
    )
    
    return tree

In [93]:
tree = train(X_train_iris, Y_train_iris)

In [94]:
preds = predict(X_train_iris, tree)
accuracy(preds, Y_train_iris)

1.0

Whoa! We have a model that performs at 100% training accuracy! Let's see what happens when we try the model on the test set.

In [95]:
preds = predict(X_test_iris, tree)
accuracy(preds, Y_test_iris)

0.9333333333333333

We're doing worse, so we're probably overfitting. 
##### Question: What could be changed to avoid overfitting? (Hint: Look at parameters of our train function)

### Regression

How can we use decision trees to perform regression?

When we decide to make a leaf, take the mean/median of the points that are left, instead of the mode.
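A minimal sketch of that change is shown here. The leaf helper follows the description above. The `weighted_variance` helper is our own assumption about what could replace weighted entropy when the targets are numbers, since entropy is defined over classes; the notebook itself does not specify a split criterion for regression.

```python
import numpy as np

def regression_leaf_value(y, use_median=False):
    """Prediction stored in a regression leaf: the mean (or median) of its targets."""
    y = np.asarray(y, dtype=float)
    return float(np.median(y)) if use_median else float(np.mean(y))

def weighted_variance(yes, no):
    """Size-weighted variance of the two sides of a split (assumed stand-in for entropy)."""
    total = len(yes) + len(no)
    var = lambda v: float(np.var(v)) if len(v) > 0 else 0.0
    return (len(yes) / total) * var(yes) + (len(no) / total) * var(no)

# Hypothetical targets that ended up in one leaf.
targets = np.array([1.0, 2.0, 2.0, 10.0])
print(regression_leaf_value(targets))                   # mean: 3.75
print(regression_leaf_value(targets, use_median=True))  # median: 2.0
```

The median is less sensitive to the outlier at `10.0`, which is one reason to prefer it when the targets are noisy.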

<a id='ensemble_learning'></a>
# Ensemble Learning

<img src='pictures/elephant.jpeg' width="700" height="700">

This is the fable of the blind men and the elephant. Each individual is correct in their own right; together, however, their descriptions paint a much more accurate picture.

We have discussed notions of bias and variance. To refresh these concepts again,  


**Bias** is how far the average model's predictions would be from the truth if you trained models on many data sets.

**Variance** is how different the models would be if you trained models on many data sets. 

In practice, we only have one data set, and we would like a model trained on it to have both low bias and low variance. These two goals usually trade off against each other, but we can use some techniques to try to get the best of both worlds.

Since bias is talking about how well the average model performs, and variance is about how varied the different models are, we can attempt to reduce both of these by considering an *average model*.

Consider the following analogy.

<img src='pictures/weather.jpg' width="700" height="700">

We would like to predict the weather tomorrow. Perhaps we have $3$ separate sources for weather: Channel $4$ on TV, a weather website, and the iPhone weather application.

We may expect that a better estimate for the weather tomorrow is actually the average of all these estimates. Perhaps the different sources all have their own methods and data for creating a prediction, thus taking the average pools together all their resources into a more powerful estimator. 

The important gain of this approach is our improvement in variance. Recall that variance describes how much another, similarly constructed estimator would differ from ours. A single source, such as one weather website, may differ a lot from another single source. An average of several sources, however, should be close to any other such average: if we instead averaged Channel $5$'s predictions, a different weather website, and the Android weather application, we would not expect the result to change much. Note: for those of you who have taken a probability class, you've probably already come across this idea: taking the average generally decreases variance.
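The variance claim can be checked with a quick simulation. This is a sketch under strong assumptions that we made up for illustration: three unbiased sources whose errors are independent and normally distributed with the same standard deviation. Real weather sources share data and methods, so their errors are correlated and the gain would be smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
true_temp = 70.0     # hypothetical "true" temperature
n_days = 100_000     # number of simulated forecasts per source

# Three independent, unbiased sources, each with noise of standard deviation 3.
sources = true_temp + rng.normal(0.0, 3.0, size=(n_days, 3))

single_var = sources[:, 0].var()           # variance of one source alone
averaged_var = sources.mean(axis=1).var()  # variance of the 3-source average

print(single_var)    # close to 9
print(averaged_var)  # close to 3, i.e. 9 / 3
```

With $n$ independent sources of equal variance $\sigma^2$, the average has variance $\sigma^2 / n$.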

Thus, one technique to improve the quality of a model is to *train multiple models on the same data and pool their predictions*. This is known as **ensembling**.



<a id='bootstrapping_and_bagging'></a>
# Bootstrapping and Bagging
An important idea often used in data science is **bootstrapping**. This is a method to generate more samples of data, despite the fact that we only have a single dataset.

**Bootstrapping**: We take the original dataset of size $N$ and draw $M$ new samples from it, each of size $N$ and each drawn with replacement.

For example, if we would like to estimate the average height of people in the U.S., we may take a sample of $1000$ people and average their heights. However, this does not tell us much about the data other than the average. We pooled together $1000$ data points into a single value, but there is much more information available. 

What we can do is draw many samples of size $1000$ with replacement from our original sample, and compute the average height of each. This mimics having many datasets, each with its own average height. We can then look at the distribution of these average heights, and from it determine how good our estimate is. By the Central Limit Theorem, this distribution of bootstrapped averages should be approximately normal.
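Here is a minimal sketch of that procedure. The heights are synthetic numbers generated for illustration, not real survey data, and the number of resamples `M` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample of 1000 heights in inches (synthetic, for illustration only).
heights = rng.normal(loc=67.0, scale=4.0, size=1000)

M = 2000  # number of bootstrap resamples
boot_means = np.empty(M)
for m in range(M):
    # Resample the same number of points, with replacement, from our one sample.
    resample = rng.choice(heights, size=len(heights), replace=True)
    boot_means[m] = resample.mean()

# The spread of the bootstrapped means estimates how uncertain our average is.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(heights.mean(), boot_means.std(), (lower, upper))
```

The interval between the 2.5th and 97.5th percentiles is one common way to turn the bootstrapped distribution into an approximate 95% confidence interval for the mean.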

However, we are not limited to just calculating the average. We may calculate the median, standard deviation, or any other statistic of each bootstrapped sample. Furthermore, we can even create a model for each sample! This allows us to utilize the notion of training many models on the same dataset.

If we create many models and then aggregate our predictions, this is known as **bagging** (bootstrap aggregating). Thus we may create many separate models, each trained on a different bootstrapped sample, to obtain new predictions and better results.

The purpose of bagging is to decrease the variance of our model. Since we essentially consider many models together in parallel, we reduce overfitting.

<a id='random_forest'></a>
# Random Forest

A single decision tree often makes poor predictions, as it is a rather simple model. Cutting the feature space into separate regions with axis-aligned splits makes it hard to capture complex boundaries. Furthermore, especially if you don't restrict the max depth, a single tree is extremely prone to overfitting.

However, if we include many decision trees and create a **random forest**, we obtain drastically better results. 
The idea of a random forest is quite simple: take many decision trees and output their average guess.
<center>
<img src="pictures/RandomForest.png" width="60%">
Image from https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d
</center>


In the case of classification, we take the most popular guess.   
In the case of regression, we take some form of the average guess.
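The two aggregation rules can be sketched as follows. The tree outputs are made-up numbers, and the function names are ours. Note that this version breaks ties in favor of the lowest class index, which is a choice of this sketch.

```python
import numpy as np

def majority_vote(tree_predictions):
    """Most common class per column; rows are trees, columns are data points."""
    tree_predictions = np.asarray(tree_predictions, dtype=int)
    n_classes = tree_predictions.max() + 1
    # Count the votes for each class, one data point (column) at a time.
    counts = np.apply_along_axis(np.bincount, 0, tree_predictions, minlength=n_classes)
    return counts.argmax(axis=0)

def average_prediction(tree_predictions):
    """Mean prediction per data point, for regression."""
    return np.asarray(tree_predictions, dtype=float).mean(axis=0)

# Hypothetical outputs of three trees on four data points.
class_votes = [[0, 1, 2, 2],
               [0, 1, 1, 2],
               [1, 1, 2, 0]]
print(majority_vote(class_votes))                    # [0 1 2 2]
print(average_prediction([[1.0, 2.0], [3.0, 4.0]]))  # [2. 3.]
```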


Now, let's consider this exact setup. What would happen if we created many decision trees and took the most popular guess?

In practice, we could obtain the same decision tree over and over. This is because there is some optimal set of splitting values in the dataset to minimize entropy, even with different sets of data. Perhaps we have one feature that works very well in splitting the data, and it is always utilized as the first split in the tree. Then all decision trees end up looking quite similar, despite our efforts in bagging.

A solution to this problem is **feature bagging**: we also select a random subset of features for each tree to train on, so that each feature has a chance to be split on.
<center>
<img src="pictures/RandomForestPipeline.jpg" width="40%">
Image from https://sites.google.com/site/rajhansgondane2506/publications
</center>


<img src="pictures/random_forest.png" width="60%">

In summary, we begin with a dataset $\mathcal{D}$ of size $N$. 
1. We bootstrap the data so that we have $M$ new datasets $d_1,\ldots, d_M$, each of size $N$ and drawn with replacement from $\mathcal{D}$.
2. Select a subset of features $f_i$ for each new dataset $d_i$.
3. Fit a decision tree for each $d_i$ with features $f_i$.

Now to predict, we take the input data and feed it through each decision tree to get an output. Then we can take the most popular vote or the average output as the output of our model, based on the type of problem we are attempting to solve.
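The three steps and the voting rule can be sketched by hand on the iris data. This is a simplified illustration, not how `RandomForestClassifier` is implemented: scikit-learn picks a random subset of features at every split rather than once per tree. The number of trees and the number of features per tree are illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
M, n_features_per_tree = 50, 2   # illustrative choices, not tuned
N = X_train.shape[0]

forest = []
for _ in range(M):
    rows = rng.integers(0, N, size=N)  # step 1: bootstrap N rows with replacement
    cols = rng.choice(X_train.shape[1], size=n_features_per_tree, replace=False)  # step 2
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[rows][:, cols], y_train[rows])  # step 3: fit on d_i with features f_i
    forest.append((tree, cols))

# Predict: collect each tree's vote, then take the most popular class per point.
votes = np.array([tree.predict(X_test[:, cols]) for tree, cols in forest]).astype(int)
forest_pred = np.array([np.bincount(column, minlength=3).argmax() for column in votes.T])
print((forest_pred == y_test).mean())
```

Each tree must be given the same feature columns at prediction time that it saw during training, which is why the forest stores `cols` alongside each tree.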

In [96]:
from sklearn.ensemble import RandomForestClassifier

In [97]:
model = RandomForestClassifier(n_estimators=200, max_depth=10)

In [98]:
model.fit(X_train_iris, Y_train_iris)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [99]:
accuracy(model.predict(X_train_iris), Y_train_iris)

1.0

In [100]:
accuracy(model.predict(X_test_iris), Y_test_iris)

1.0

<a id='boosting'></a>
# Boosting
We have mentioned bagging as a method of decreasing variance, but what about bias? There are also techniques to do this, namely **boosting**. This is a very popular technique in Kaggle competitions, where many winning solutions use large ensembles of boosted decision trees.

The exact implementation of boosting is out of scope for this discussion, but the main idea is to *fit your models sequentially rather than in parallel as in bagging*.

In random forest, we take many samples of our data, and fit separate decision trees to each one. We account for similar decision trees by feature bagging as well. However, many of these decision trees will end up predicting similar things and essentially only reduce variance.

The key idea of boosting is to **emphasize the specific data points that we fail on**. Rather than trying to improve the prediction by considering the problem from different angles (e.g. new datasets from bagging or new features from feature bagging), consider where we predict incorrectly and attempt to improve our predictions from there.
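One concrete way to "emphasize" the points we fail on is to keep a weight per data point and increase the weights of misclassified points after each model. The sketch below shows a single reweighting step in the style of AdaBoost for two classes. It is only one of several boosting schemes, and the labels and predictions are made up for illustration.

```python
import numpy as np

def reweight(weights, y_true, y_pred):
    """One AdaBoost-style update: upweight misclassified points, then renormalize."""
    missed = (y_true != y_pred)
    err = np.sum(weights[missed]) / np.sum(weights)  # weighted error rate
    err = np.clip(err, 1e-10, 1 - 1e-10)             # avoid division by zero
    alpha = 0.5 * np.log((1 - err) / err)            # how much say this model gets
    new_weights = weights * np.exp(np.where(missed, alpha, -alpha))
    return new_weights / new_weights.sum(), alpha

# Hypothetical labels and predictions from a first weak model.
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])   # the second point is misclassified
weights = np.full(5, 0.2)            # start with equal weights

new_weights, alpha = reweight(weights, y_true, y_pred)
print(new_weights)  # [0.125 0.5   0.125 0.125 0.125]
```

After the update, the single misclassified point carries half of the total weight, so the next model in the sequence is pushed to get it right.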

<center>
<img src="pictures/Boosting.png" width="70%">
Image from https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/
</center>

This is similar to real life. If you would like to learn a new musical piece, it is more beneficial to practice the specific part that is challenging, rather than playing the entire piece from the start every time you mess up. By boosting our model, we attempt to place greater emphasis on the samples that we consistently misclassify.


For further reading, we highly recommend the following resources for explanations for boosting:  
https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/  
https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d

Spectacular visualizations of decision trees and boosting:  
http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html

There are many forms of boosting in practice. Popular ones include AdaBoost, GradientBoost, and XGBoost. XGBoost is famous (or infamous) for excelling at Kaggle competitions, and many winning solutions contain XGBoost. We encourage you to look at the links above.

## k-Nearest Neighbors

A natural observation to make regarding data is that data points close to one another are likely to be the same class. For example, a flower with traits similar to a rose, with the exception of having slightly shorter petals or a longer stem, is still very likely to be a rose. It requires larger and more numerous changes in traits such as color, petal count, and stem size for that flower to be considered another variety, say a lilac.

<center>
<img src="pictures/petals.png" width="40%">
Slight variations in petal length don't change their class
</center>

So how do we determine the type of a flower given its specific traits? What we can do is ask its neighbors. From our conjecture above, the majority class of the data points closest to the flower (i.e. the data points with the most similar traits) should tell us what our flower's class is. If it's near a bunch of roses, our flower is probably a rose. From here, two questions arise:
1. How do we quantify similarity between data points?
2. How many neighbors should we ask?

We can answer the first question by using a distance metric, the most common being Euclidean distance.

<center>
<img src="pictures/euclidean.jpeg" width="80%">
</center>

We can treat each feature as a dimension and place our training data points in $d$-dimensional space, where $d$ is the number of features. We can then apply the Euclidean distance metric to find the $k$ nearest neighbors of the data point that we want to classify. After finding the $k$ nearest points, we'll use the labels of these points to decide what label to assign to our point. For classification, we simply choose the majority class. For regression, we can take the average of these labels.
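The procedure fits in a few lines of NumPy. This is a minimal sketch for classification on a made-up two-feature dataset. The function name `knn_predict` is ours, and it makes no attempt at the efficiency tricks a library implementation would use.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]                           # indices of the k closest
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]

# Hypothetical 2-feature training set with two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.0]), k=3))  # 1
```

For regression, the last two lines of the function would be replaced by the mean of `y_train[nearest]`.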

A key point to note is that we have to manually specify the value of our hyperparameter $k$. The value of $k$ will heavily influence the outcome of our model.

<center>
<img src="pictures/knn.png" width="50%">
</center>

It helps to think about what the extreme choices of $k$ imply for k-NN. Take a moment to think about what it means to set
1. $k = 1$ (we only ask our closest neighbor)
2. $k = N$, where $N$ is the number of data points we have. (we ask every data point)

What are the bias/variance tradeoffs between both options?

In the first case, we'll get a training error of 0, meaning we classify every point correctly at training time. This is because each training point's nearest neighbor is itself, so it is assigned its own label. As a result, our bias is incredibly low. Unfortunately, this choice of $k$ is not very robust to noise, resulting in an extremely high variance.

On the other hand, choosing $k = N$ means we simply assign each point to whatever the majority is. This gives us incredibly low variance, as new points will be assigned to the likeliest class. This also means we have incredibly high bias, as we're assigning every point to be one class and disregarding all else.

Let's take a look at this in code.

In [101]:
iris = datasets.load_iris()
iris = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= ['Sepal Length', 'Sepal Width','Petal Length','Petal Width'] + ['species'])

feature_columns = ['Sepal Length', 'Sepal Width','Petal Length','Petal Width']
X = iris[feature_columns].values
Y = iris['species'].values

#Splitting dataset into training and test
from sklearn.model_selection import train_test_split
X_train_iris, X_test_iris, Y_train_iris, Y_test_iris = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [102]:
# Code adapted from https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

N = X_train_iris.shape[0]

# instantiate learning model (k = 1, N)
one_nn = KNeighborsClassifier(n_neighbors=1)
N_nn = KNeighborsClassifier(n_neighbors=N)

# fitting the model
one_nn.fit(X_train_iris, Y_train_iris)
N_nn.fit(X_train_iris, Y_train_iris)

# predict the response
predOne = one_nn.predict(X_train_iris)
predN = N_nn.predict(X_train_iris)

# evaluate accuracy
print("Training accuracy for 1-NN: {}".format(accuracy_score(Y_train_iris, predOne)))
print("Training accuracy for N-NN: {}".format(accuracy_score(Y_train_iris, predN)))

Training accuracy for 1-NN: 1.0
Training accuracy for N-NN: 0.36666666666666664


We see that our 1-NN does incredibly well at training time, while our N-NN does quite poorly. In fact, if you print out the N-NN's predictions, you'll find that it predicts the same class for everything! The question then becomes: How might we choose our $k$? This leads us to our next topic: hyperparameter tuning!

# Hyperparameter Tuning

For $k$-NN, our main hyperparameter, for which we want to find the optimal value, is $k$. To do this, we can simply train our model at different values of the hyperparameter and compare the accuracies on held-out data to determine the best value. For simplicity, the loop below scores each $k$ on the test set.

In [108]:
for k in range(1, N+1):
    k_nn = KNeighborsClassifier(n_neighbors=k)
    k_nn.fit(X_train_iris, Y_train_iris)
    predK = k_nn.predict(X_test_iris)
    print("Testing accuracy for {}-NN: {}".format(k, accuracy_score(Y_test_iris, predK)))

Testing accuracy for 1-NN: 1.0
Testing accuracy for 2-NN: 0.9666666666666667
Testing accuracy for 3-NN: 0.9666666666666667
Testing accuracy for 4-NN: 1.0
Testing accuracy for 5-NN: 0.9666666666666667
Testing accuracy for 6-NN: 1.0
Testing accuracy for 7-NN: 1.0
Testing accuracy for 8-NN: 1.0
Testing accuracy for 9-NN: 1.0
Testing accuracy for 10-NN: 1.0
Testing accuracy for 11-NN: 1.0
Testing accuracy for 12-NN: 1.0
Testing accuracy for 13-NN: 1.0
Testing accuracy for 14-NN: 1.0
Testing accuracy for 15-NN: 1.0
Testing accuracy for 16-NN: 1.0
Testing accuracy for 17-NN: 1.0
Testing accuracy for 18-NN: 1.0
Testing accuracy for 19-NN: 1.0
Testing accuracy for 20-NN: 1.0
Testing accuracy for 21-NN: 1.0
Testing accuracy for 22-NN: 1.0
Testing accuracy for 23-NN: 1.0
Testing accuracy for 24-NN: 1.0
Testing accuracy for 25-NN: 1.0
Testing accuracy for 26-NN: 0.9666666666666667
Testing accuracy for 27-NN: 0.9333333333333333
Testing accuracy for 28-NN: 0.9666666666666667
Testing accuracy for 29

Exercise: Which value(s) of $k$ give us the best testing accuracy?
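Choosing $k$ by looking at test accuracy makes the test score an optimistic estimate, since the test set has then influenced the model. One common alternative, sketched below, is to pick $k$ with cross-validation on the training set and use the test set only once at the end. The choice of 5 folds and the range of $k$ values are our assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Score each candidate k with 5-fold cross-validation on the training set only.
cv_scores = {}
for k in range(1, 31):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    cv_scores[k] = scores.mean()

# Smallest k among those with the highest mean validation accuracy.
best_k = max(cv_scores, key=cv_scores.get)

# Refit on all training data with the chosen k and touch the test set exactly once.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, cv_scores[best_k], final_model.score(X_test, y_test))
```

scikit-learn's `GridSearchCV` wraps the same loop and is worth a look once there is more than one hyperparameter to tune.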