# Nonparametric ML Models - Cumulative Lab

## Introduction

In this cumulative lab, you will apply two nonparametric models you have just learned — k-nearest neighbors and decision trees — to the forest cover dataset.

## Objectives

* Practice identifying and applying appropriate preprocessing steps
* Perform an iterative modeling process, starting from a baseline model
* Explore multiple model algorithms, and tune their hyperparameters
* Practice choosing a final model across multiple model algorithms and evaluating its performance

## Your Task: Complete an End-to-End ML Process with Nonparametric Models on the Forest Cover Dataset


The task is to predict the `Cover_Type` based on the available cartographic variables:

In [52]:
# Run this cell without changes
import pandas as pd

df = pd.read_csv('/content/forest_cover.csv')
df

Unnamed: 0,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type_31,Soil_Type_32,Soil_Type_33,Soil_Type_34,Soil_Type_35,Soil_Type_36,Soil_Type_37,Soil_Type_38,Soil_Type_39,Cover_Type
0,2553,235,17,351,95,780,188,253,199,1410,...,0,0,0,0,0,0,0,0,0,0
1,2011,344,17,313,29,404,183,211,164,300,...,0,0,0,0,0,0,0,0,0,0
2,2022,24,13,391,42,509,212,212,134,421,...,0,0,0,0,0,0,0,0,0,0
3,2038,50,17,408,71,474,226,200,102,283,...,0,0,0,0,0,0,0,0,0,0
4,2018,341,27,351,34,390,152,188,168,190,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38496,2396,153,20,85,17,108,240,237,118,837,...,0,0,0,0,0,0,0,0,0,0
38497,2391,152,19,67,12,95,240,237,119,845,...,0,0,0,0,0,0,0,0,0,0
38498,2386,159,17,60,7,90,236,241,130,854,...,0,0,0,0,0,0,0,0,0,0
38499,2384,170,15,60,5,90,230,245,143,864,...,0,0,0,0,0,0,0,0,0,0


> As you can see, we have over 38,000 rows, each with 52 feature columns and 1 target column:

> * `Elevation`: Elevation in meters
> * `Aspect`: Aspect in degrees azimuth
> * `Slope`: Slope in degrees
> * `Horizontal_Distance_To_Hydrology`: Horizontal dist to nearest surface water features in meters
> * `Vertical_Distance_To_Hydrology`: Vertical dist to nearest surface water features in meters
> * `Horizontal_Distance_To_Roadways`: Horizontal dist to nearest roadway in meters
> * `Hillshade_9am`: Hillshade index at 9am, summer solstice
> * `Hillshade_Noon`: Hillshade index at noon, summer solstice
> * `Hillshade_3pm`: Hillshade index at 3pm, summer solstice
> * `Horizontal_Distance_To_Fire_Points`: Horizontal dist to nearest wildfire ignition points, meters
> * `Wilderness_Area_x`: Wilderness area designation (3 columns)
> * `Soil_Type_x`: Soil Type designation (39 columns)
> * `Cover_Type`: 1 for cottonwood/willow, 0 fo

This is also an imbalanced dataset, since cottonwood/willow trees are relatively rare in this forest:

In [53]:
# Run this cell without changes
print("Raw Counts")
print(df["Cover_Type"].value_counts())
print()
print("Percentages")
print(df["Cover_Type"].value_counts(normalize=True))

Raw Counts
Cover_Type
0    35754
1     2747
Name: count, dtype: int64

Percentages
Cover_Type
0    0.928651
1    0.071349
Name: proportion, dtype: float64


### Requirements

#### 1. Prepare the Data for Modeling

#### 2. Build a Baseline kNN Model

#### 3. Build Iterative Models to Find the Best kNN Model

#### 4. Build a Baseline Decision Tree Model

#### 5. Build Iterative Models to Find the Best Decision Tree Model

#### 6. Choose and Evaluate an Overall Best Model


## 1. Prepare the Data for Modeling

The target is `Cover_Type`. In the cell below, split `df` into `X` and `y`, then perform a train-test split with `random_state=42` and `stratify=y` to create variables with the standard `X_train`, `X_test`, `y_train`, `y_test` names.

Include the relevant imports as you go.

In [54]:

# Import the relevant function
from sklearn.model_selection import train_test_split

# Split df into X and y
X = df.drop("Cover_Type", axis=1)
y = df["Cover_Type"]

# Perform train-test split with random_state=42 and stratify=y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

Now, instantiate a StandardScaler, fit it on X_train, and create new variables X_train_scaled and X_test_scaled containing values transformed with the scaler.

In [55]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [56]:

# Checking that df was separated into correct X and y
assert type(X) == pd.DataFrame and X.shape == (38501, 52)
assert type(y) == pd.Series and y.shape == (38501,)

# Checking the train-test split
assert type(X_train) == pd.DataFrame and X_train.shape == (28875, 52)
assert type(X_test) == pd.DataFrame and X_test.shape == (9626, 52)
assert type(y_train) == pd.Series and y_train.shape == (28875,)
assert type(y_test) == pd.Series and y_test.shape == (9626,)

# Checking the scaling
assert X_train_scaled.shape == X_train.shape
assert round(X_train_scaled[0][0], 3) == -0.636
assert X_test_scaled.shape == X_test.shape
assert round(X_test_scaled[0][0], 3) == -1.370

## 2. Build a Baseline kNN Model

Build a scikit-learn kNN model with default hyperparameters. Then use `cross_val_score` with `scoring="neg_log_loss"` to find the mean log loss for this model (passing in `X_train_scaled` and `y_train` to `cross_val_score`). You'll need to find the mean of the cross-validated scores, and negate the value (either put a `-` at the beginning or multiply by `-1`) so that your answer is a log loss rather than a negative log loss.

Call the resulting score `knn_baseline_log_loss`.

Your code might take a minute or more to run.

In [57]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn_baseline_model = KNeighborsClassifier()

knn_baseline_log_loss = -cross_val_score(knn_baseline_model, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_baseline_log_loss

np.float64(0.12964546386734577)

## 3. Build Iterative Models to Find the Best kNN Model

Build and evaluate at least two more kNN models to find the best one. Explain why you are changing the hyperparameters you are changing as you go. These models will be *slow* to run, so be thinking about what you might try next as you run them.


In [58]:

"""
Your work will not look identical to this, and that is ok!
The goal is that there should be an explanation for
everything you try, and accurate reporting on the outcome
"""

"""
Maybe we are overfitting, since the default neighbors of 5
seems small compared to the large number of records in this
dataset. Let's increase that number of neighbors 10x to see
if it improves the results.
"""

knn_second_model = KNeighborsClassifier(n_neighbors=50)

knn_second_log_loss = -cross_val_score(knn_second_model, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_second_log_loss

np.float64(0.07871799429857204)

In [59]:

"""
Great, that looks good. What if we keep that number of
neighbors, and change the distance metric from euclidean
to manhattan?
"""

knn_third_model = KNeighborsClassifier(n_neighbors=50, metric="manhattan")

knn_third_log_loss = -cross_val_score(knn_third_model, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_third_log_loss

np.float64(0.07631568557001107)

In [60]:

"""
Ok, slightly better but it's a much smaller difference now.

Maybe we can get even better performance by increasing the
number of neighbors again.
"""

knn_fourth_model = KNeighborsClassifier(n_neighbors=75, metric="manhattan")

knn_fourth_log_loss = -cross_val_score(knn_fourth_model, X_train_scaled, y_train, scoring="neg_log_loss").mean()
knn_fourth_log_loss

np.float64(0.0859644295080102)


"""
While this was still better than when n_neighbors was 5
(the default), it's worse than n_neighbors being 50

If we were to build more models, we would probably start
investigating the space between 5 and 50 neighbors to find
the best number, but for now we'll just stop and say that
knn_third_model is our best one.
"""

## 4. Build a Baseline Decision Tree Model

Now that you have chosen your best kNN model, start investigating decision tree models. First, build and evaluate a baseline decision tree model, using default hyperparameters (with the exception of `random_state=42` for reproducibility).

(Use cross-validated log loss, just like with the previous models.)


In [61]:
from sklearn.tree import DecisionTreeClassifier

dtree_baseline_model = DecisionTreeClassifier(random_state=42)

dtree_baseline_log_loss = -cross_val_score(dtree_baseline_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_baseline_log_loss

np.float64(0.7352281158853683)

Interpret this score. How does this compare to the log loss from our best logistic regression and best kNN models? Any guesses about why?


This is much worse than either the logistic regression or the
kNN models. We can probably assume that the model is badly
overfitting, since we have not "pruned" it at all.

## 5. Build Iterative Models to Find the Best Decision Tree Model

Build and evaluate at least two more decision tree models to find the best one. Explain why you are changing the hyperparameters you are changing as you go.


In [62]:

"""
There are a lot of ways to reduce overfitting in this
model, so it can be overwhelming to choose which one
to try!

Let's start with increasing min_samples_leaf by an
order of magnitude. This is conceptually the most
similar to increasing the number of neighbors, although
the process for determining similarity of neighbors and
samples in the same leaf is quite different.
"""

dtree_second_model = DecisionTreeClassifier(random_state=42, min_samples_leaf=10)

dtree_second_log_loss = -cross_val_score(dtree_second_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_second_log_loss

np.float64(0.29824280507779183)

In [63]:

"""
Ok, increasing the minimum samples per leaf from 1 to 10
helped reduce overfitting. What if we increase them again?
"""

dtree_third_model = DecisionTreeClassifier(random_state=42, min_samples_leaf=100)

dtree_third_log_loss = -cross_val_score(dtree_third_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_third_log_loss

np.float64(0.12065990783277543)

In [64]:

"""
Now we are getting scores in the same range as the best
logistic regression model or the baseline kNN model

Wait, we just realized that this is a very imbalanced dataset,
but we haven't told the model that. Let's try using the same
min_samples_leaf as before, and also specifying the class weight
"""

dtree_fourth_model = DecisionTreeClassifier(random_state=42, min_samples_leaf=100, class_weight="balanced")

dtree_fourth_log_loss = -cross_val_score(dtree_fourth_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_fourth_log_loss

np.float64(0.20949578809641323)

In [65]:

"""
Oh well, sometimes that backfires when the model overcompensates
trying to create the right balance. We'll leave off the
class_weight hyperparameter for now.

We also notice that this dataset has a lot of dimensions. What if
we limit the number of features that can be used in a given split,
while keeping min_samples_leaf the same?
"""

dtree_fifth_model = DecisionTreeClassifier(random_state=42, min_samples_leaf=100, max_features="sqrt")

dtree_fifth_log_loss = -cross_val_score(dtree_fifth_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_fifth_log_loss

np.float64(0.14405851143696752)

In [66]:

"""
Still not better than dtree_third_model

Let's try one more value for min_samples_leaf
"""

dtree_sixth_model = DecisionTreeClassifier(random_state=42, min_samples_leaf=75)

dtree_sixth_log_loss = -cross_val_score(dtree_sixth_model, X_train, y_train, scoring="neg_log_loss").mean()
dtree_sixth_log_loss

np.float64(0.11677101481401571)

That looks good. Maybe in the future we would do something more
systematic (like a grid search) but for now we'll say that the
sixth model is the best one of the decision tree models

## 6. Choose and Evaluate an Overall Best Model

Which model had the best performance? What type of model was it?

Instantiate a variable `final_model` using your best model with the best hyperparameters.

In [67]:

"""
Note that this model is MUCH slower than best logistic
regression or decision tree models, but we think the
performance is worth it. That might not always be the
case, depending on the business goal.
"""

final_model = KNeighborsClassifier(n_neighbors=50, metric="manhattan")

# Fit the model on the full training data
# (scaled or unscaled depending on the model)
final_model.fit(X_train_scaled, y_train)

Now, evaluate the log loss, accuracy, precision, and recall. This code is mostly filled in for you, but you need to replace `None` with either `X_test` or `X_test_scaled` depending on the model you chose.


In [68]:
# Replace None with appropriate code
from sklearn.metrics import accuracy_score, precision_score, recall_score, log_loss

preds = final_model.predict(X_test_scaled)
probs = final_model.predict_proba(X_test_scaled)

print("log loss: ", log_loss(y_test, probs))
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds))
print("recall:   ", recall_score(y_test, preds))

log loss:  0.07523086602462235
accuracy:  0.9716393102015375
precision: 0.8876404494382022
recall:    0.6899563318777293


Interpret your model performance. How would it perform on different kinds of tasks? How much better is it than a "dummy" model that always chooses the majority class, or the logistic regression described at the start of the lab?


This model has 97% accuracy, meaning that it assigns the
correct label 97% of the time. This is definitely an
improvement over a "dummy" model, which would have about
92% accuracy.

If our model labels a given forest area a 1, there is
about an 89% chance that it really is class 1, compared
to about a 67% chance with the logistic regression

The recall score is also improved from the logistic
regression model. If a given cell of forest really is
class 1, there is about a 69% chance that our model
will label it correctly. This is better than the 48%
of the logistic regression model, but still doesn't
instill a lot of confidence. If the business really
cared about avoiding "false negatives" (labeling
cottonwood/willow as ponderosa pine) more so than
avoiding "false positives" (labeling ponderosa pine
as cottonwood/willow), then we might want to adjust
the decision threshold on this
