In [190]:
%matplotlib widget

from sklearn.datasets import fetch_covtype
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import numpy as np

In [191]:
# load the full dataset and display it
dataset = fetch_covtype(shuffle=True) # make sure to shuffle the data so no local patterns emerge
feature_names = dataset.feature_names
target_names = dataset.target_names
data = dataset.data
target = dataset.target

In [192]:
print("Feature names:", feature_names)
print("Target names:", target_names)
print("Features: ", data.shape, data.dtype)
print("Classes:", target.shape, target.dtype)

print("Features: ", data)
print("Classes:", target)

Feature names: ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area_0', 'Wilderness_Area_1', 'Wilderness_Area_2', 'Wilderness_Area_3', 'Soil_Type_0', 'Soil_Type_1', 'Soil_Type_2', 'Soil_Type_3', 'Soil_Type_4', 'Soil_Type_5', 'Soil_Type_6', 'Soil_Type_7', 'Soil_Type_8', 'Soil_Type_9', 'Soil_Type_10', 'Soil_Type_11', 'Soil_Type_12', 'Soil_Type_13', 'Soil_Type_14', 'Soil_Type_15', 'Soil_Type_16', 'Soil_Type_17', 'Soil_Type_18', 'Soil_Type_19', 'Soil_Type_20', 'Soil_Type_21', 'Soil_Type_22', 'Soil_Type_23', 'Soil_Type_24', 'Soil_Type_25', 'Soil_Type_26', 'Soil_Type_27', 'Soil_Type_28', 'Soil_Type_29', 'Soil_Type_30', 'Soil_Type_31', 'Soil_Type_32', 'Soil_Type_33', 'Soil_Type_34', 'Soil_Type_35', 'Soil_Type_36', 'Soil_Type_37', 'Soil_Type_38', 'Soil_Type_39']
Target names: ['Cover_Type']
Features:  (58

Baseline performance is around 95% accuracy. Let's see how close we can get to it. Note that we classify with a nearest neighbor algorithm instead of the more typical logistic regression, which may produce better or worse results. The dataset is large and has many dimensions, so the outcome is hard to predict. We also shuffle the dataset so that the score we compute later on is harsher and more realistic.

In [193]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.33, random_state=1)
print("Train:", X_train.shape, y_train.shape)
print("Test:", X_test.shape, y_test.shape)

Train: (389278, 54) (389278,)
Test: (191734, 54) (191734,)


First, we split the data into training and test sets so that the model can be judged fairly on data it has not seen. From here on, only the training data is used by the nearest neighbor algorithm. The split is 33% test data and 67% training data.

In the most basic case, nearest neighbor finds the training point closest to the point we want to classify and returns that point's class. Closeness is measured with a distance function that we can choose. Two of the most common choices are Manhattan distance and the well-known Euclidean distance. Let's try Euclidean distance, the familiar straight-line distance, and see how it performs.
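To make the two distance functions concrete, here is a minimal sketch on a pair of 2-D points. The helper names `manhattan_distance` and `euclidean_distance` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def manhattan_distance(a, b):
    # L1 distance: sum of absolute per-feature differences
    return np.sum(np.abs(a - b))

def euclidean_distance(a, b):
    # L2 distance: square root of the sum of squared per-feature differences
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print("Manhattan:", manhattan_distance(a, b))  # 7.0
print("Euclidean:", euclidean_distance(a, b))  # 5.0
```

Because the square root is monotonic, ranking points by squared Euclidean distance gives the same nearest point as ranking by the distance itself, so the square root can be skipped when only the closest point is needed.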

In [194]:
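def predict_point(x, y, point):
    point = point.reshape((1,-1)) # 1xn row so it broadcasts against every row of x
    x = x - point # mxn per-feature differences
    x2 = x**2 # squared per-feature differences
    dist_squared = np.sum(x2,axis=1) # sum over features: squared euclidean distance to each training point
    i = np.argmin(dist_squared) # the smallest squared distance is also the smallest distance
    return y[i] # return the class associated with the closest point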

Here, we're just predicting a single point and returning a single class.

In [195]:
def predict_points(x, y, points):
    # same thing as predict_point but we handle multiple points now
    mp = points.shape[0]
    categories = np.zeros((mp))
    for i in range(mp):
        categories[i] = predict_point(x,y,points[i]) # use predict_point we defined
    return categories

In [196]:
# categories = predict_points(X_train, y_train, X_test) # really slow

When we predict the points using a loop like this, it becomes really slow. The test set has about 190K points, and for each one we compute the distance to all of the roughly 390K training points in order to find the closest one. That is around 390K*190K, or about 75 billion distance computations (each over 54 features), just for prediction. Therefore, let's use only 100 test points and see what accuracy the closest point gives.
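One way to cut the Python loop overhead is to process several query points at once with matrix operations. The sketch below is an illustration, not the method used in this notebook: `predict_points_batched` and its `batch_size` parameter are hypothetical names, and the batch size would need tuning to the available memory, since each batch builds a `batch_size` by `m` score matrix. It performs the same number of arithmetic operations as the loop, but hands them to NumPy's matrix multiplication.

```python
import numpy as np

def predict_points_batched(x, y, points, batch_size=64):
    """1-nearest-neighbor prediction, processing query points in batches."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y)
    points = np.asarray(points, dtype=np.float64)
    x_sq = np.sum(x ** 2, axis=1)  # squared norm of every training point, computed once
    predictions = np.empty(points.shape[0], dtype=y.dtype)
    for start in range(0, points.shape[0], batch_size):
        batch = points[start:start + batch_size]
        # ||x - p||^2 = ||x||^2 - 2 x.p + ||p||^2; the ||p||^2 term is constant
        # for a given query point, so dropping it does not change the argmin
        scores = x_sq[None, :] - 2.0 * (batch @ x.T)
        nearest = np.argmin(scores, axis=1)  # index of the closest training point per query
        predictions[start:start + batch_size] = y[nearest]
    return predictions

# Toy data (an assumption for illustration, not the covertype dataset)
x_toy = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
y_toy = np.array([1, 2, 3])
queries = np.array([[1.0, 1.0], [9.0, 9.0], [19.0, 21.0], [0.0, -1.0]])
print(predict_points_batched(x_toy, y_toy, queries, batch_size=2))  # [1 2 3 1]
```

The expanded form can lose a little floating-point precision compared with subtracting first, so near-ties between two training points may occasionally resolve differently than in the loop version.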

In [197]:
samples = 100
predictions = predict_points(X_train, y_train, X_test[:samples]) # try only 100 points from the testing set
print("Predictions", predictions) # show the predicted classes
accuracy = accuracy_score(y_test[:samples], predictions) # compute the accuracy (true labels first, then predictions)
print("Accuracy:",accuracy*100,"%") # show the accuracy

Predictions [1. 1. 2. 1. 2. 1. 1. 1. 1. 2. 1. 6. 1. 3. 4. 2. 2. 1. 2. 1. 7. 2. 2. 2.
 2. 1. 2. 2. 1. 2. 2. 2. 3. 1. 2. 1. 2. 4. 3. 1. 1. 1. 2. 7. 1. 1. 1. 1.
 2. 2. 1. 2. 6. 7. 2. 4. 1. 3. 3. 2. 7. 2. 6. 2. 2. 3. 3. 2. 7. 2. 1. 2.
 2. 2. 2. 2. 1. 1. 3. 2. 2. 2. 1. 3. 6. 2. 2. 2. 7. 5. 2. 2. 2. 2. 7. 2.
 2. 1. 1. 2.]
Accuracy: 95.0 %


Wow, we have ~95% accuracy in predicting the labels. Since the baseline is also around 95%, we are doing surprisingly well. Of course, this is only for 100 points, so each misclassified point shifts the score by a full percentage point, and a true evaluation needs predictions for the entire test set. Still, with shuffled data and a large training set, it seems safe to say that a simple nearest neighbor classifier can perform well in situations like this. The high accuracy suggests that points of the same class lie close together in feature space. But it's important to recognize that nearest neighbor can perform poorly on high-dimensional data and becomes slow to evaluate as the training set grows. Additionally, a simple classifier like this needs a lot of training data, which we clearly have here. These are some of the reasons why logistic regression is often used instead of the nearest neighbor algorithm.
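To evaluate on more than a handful of points, one option is scikit-learn's `KNeighborsClassifier`, which with `n_neighbors=1` and its default Euclidean metric applies the same closest-point rule as above. The sketch below runs on synthetic data from `make_blobs` as a stand-in (an assumption for illustration), so its score says nothing about the covertype dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the covertype dataset)
X, y = make_blobs(n_samples=2000, centers=3, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

# n_neighbors=1 with the default metric (Euclidean) matches the closest-point rule
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_tr, y_tr)

test_predictions = knn.predict(X_te)
score = accuracy_score(y_te, test_predictions)
print("Accuracy:", score * 100, "%")
```

On the real data, the same pattern would use `X_train`, `y_train`, `X_test`, and `y_test` from the split earlier in the notebook.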