
---
---
---

# 💠 **TUTORIAL: Applying Decision Trees on Forest Cover Data**

In this tutorial, we'll extend our conceptual knowledge of decision tree classifiers by classifying forest cover types in the Roosevelt National Forest (Colorado) dataset, available on Kaggle via **[this link](https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset)**.

---

To start, let's import everything we'll need.

In [None]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.tree import export_graphviz
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from scipy.stats import randint

Specifically, we'll be using the **`DecisionTreeClassifier`** class from scikit-learn.

---

As always, let's first load our dataset.

In [None]:
dataset = pd.read_csv("covtype.csv")

Then we can take a look at the first few rows.

In [None]:
dataset.head()

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |


Any pandas DataFrame can be summarized with the **`.info()`** method, which reports the number of rows, each column's name and dtype, and how many non-null values each column holds.

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 324634 entries, 0 to 324633
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Elevation                           324634 non-null  int64  
 1   Aspect                              324634 non-null  int64  
 2   Slope                               324634 non-null  int64  
 3   Horizontal_Distance_To_Hydrology    324634 non-null  int64  
 4   Vertical_Distance_To_Hydrology      324634 non-null  int64  
 5   Horizontal_Distance_To_Roadways     324634 non-null  int64  
 6   Hillshade_9am                       324634 non-null  int64  
 7   Hillshade_Noon                      324634 non-null  int64  
 8   Hillshade_3pm                       324634 non-null  int64  
 9   Horizontal_Distance_To_Fire_Points  324634 non-null  int64  
 10  Wilderness_Area1                    324634 non-null  int64  
 11  Wilderness_Area2          

---

Just in case, let's also quickly drop any rows that contain null values.

Dropping rows is a fairly naive approach, since it ignores imputation methods that could preserve the signal in partially missing rows.

For the sake of brevity and simplicity, however, we can simply delete the rows with nulls since there are very few of them.

In [None]:
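# Drop every row that contains at least one null value.
# dropna() does not raise when there are no nulls, so no try/except is needed.
dataset.dropna(inplace=True)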
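
If dropping rows ever discarded too much data, imputation would be the usual alternative. The sketch below contrasts the two on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions (not the forest cover data), and the median strategy of scikit-learn's `SimpleImputer` is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a few missing values (not the real dataset).
toy = pd.DataFrame({
    "Elevation": [2596.0, np.nan, 2804.0, 2785.0],
    "Slope": [3.0, 2.0, np.nan, 18.0],
})

# Option 1: drop every row containing at least one null.
dropped = toy.dropna()

# Option 2: fill each null with the median of its column instead.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(len(toy), len(dropped), len(imputed))
print(imputed)
```

Dropping keeps two of the four rows, while imputing keeps all four. In a real workflow the imputer would typically be fit on the training subset only, so that no information from the test rows leaks into the fill values.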

---
---

Now let's start preparing for machine learning analysis.

We'll start by segmenting our data into **`X`** and **`y`** segments.

In [None]:
X, y = dataset.drop("Cover_Type", axis=1), dataset["Cover_Type"]

From there, we can produce training and testing subsets with the trusty **`train_test_split()`** function.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)
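
If the cover types turn out to be unevenly represented, a purely random split can leave rare classes under-represented in one subset. One possible variation, sketched here on made-up data (the arrays `X_demo` and `y_demo` are assumptions for illustration), passes `stratify` so both subsets keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 1, 10 of class 2.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 90 + [2] * 10)

# stratify=y_demo keeps the 90/10 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Class counts (class 1, class 2) in the training and testing subsets.
print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```

With 100 rows and a 20% test share, the test subset receives 18 samples of class 1 and 2 of class 2, mirroring the full data.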

---

We're now ready to make use of our decision tree classifier.

In [None]:
classifier = DecisionTreeClassifier()

In [None]:
classifier.fit(X_train, y_train)

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
metrics.accuracy_score(y_test, y_pred)

0.9372218029479261

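About 94% on this run! Not bad! (The tree isn't seeded with a `random_state`, so the exact score can shift a little between runs.)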
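
Accuracy is a single number averaged over every class, so it can hide classes that the model handles poorly. As a rough sketch on synthetic stand-in data (the three-class dataset from `make_classification` and all `demo_` names are assumptions, not the forest cover data), a confusion matrix and a per-class report break the score down:

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: three classes, ten features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
demo_pred = demo_tree.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_te, demo_pred)
print(cm)

# Precision, recall, and F1 for each class.
print(metrics.classification_report(y_te, demo_pred))
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy score.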

Let's see how we can improve our classifier.

---

To start, let's investigate how much each feature in our dataset contributes to the tree's decisions!

(Yes, we can actually do that!)

In [None]:
classifier.feature_importances_

array([2.63328624e-01, 3.05098278e-02, 1.65785628e-02, 5.07706928e-02,
       4.14546344e-02, 1.42147335e-01, 2.80329501e-02, 3.15706538e-02,
       2.59664323e-02, 1.48663483e-01, 1.42319947e-02, 1.95589038e-05,
       1.72936104e-03, 1.38789299e-01, 3.02448909e-04, 1.94809787e-03,
       2.52535060e-03, 3.18759193e-03, 1.19439645e-03, 1.64295293e-03,
       0.00000000e+00, 1.37083581e-04, 1.85777917e-04, 4.49780912e-03,
       1.03945142e-03, 1.26677426e-03, 1.63057370e-03, 1.11614231e-04,
       2.23779168e-05, 1.33829566e-03, 5.17852941e-04, 0.00000000e+00,
       1.19121517e-03, 3.89596538e-03, 1.11997251e-05, 4.25495677e-03,
       1.03719414e-02, 1.50478205e-03, 0.00000000e+00, 8.77119482e-05,
       2.00249520e-05, 2.62188638e-04, 1.31419681e-02, 3.26901241e-03,
       1.20316689e-03, 1.43047807e-03, 1.80758594e-03, 1.81899550e-05,
       2.76883558e-04, 0.00000000e+00, 8.57415258e-05, 7.14877185e-04,
       9.71197170e-04, 1.39054092e-04])

Yikes, that looks a little... uninterpretable.

Let's polish this up so it's clearer as to what we're looking at!

In [None]:
# Pair each column name with its importance score.
importances, features = classifier.feature_importances_, list(X)

feature_importances = list(zip(features, importances))
# Sort from most to least important.
feature_importances.sort(reverse=True, key=lambda pair: pair[1])

In [None]:
feature_importances

[('Elevation', 0.26332862383955574),
 ('Horizontal_Distance_To_Fire_Points', 0.14866348269010518),
 ('Horizontal_Distance_To_Roadways', 0.14214733527857806),
 ('Wilderness_Area4', 0.13878929912492988),
 ('Horizontal_Distance_To_Hydrology', 0.05077069276288228),
 ('Vertical_Distance_To_Hydrology', 0.04145463444062626),
 ('Hillshade_Noon', 0.03157065380403105),
 ('Aspect', 0.03050982783301966),
 ('Hillshade_9am', 0.028032950146433203),
 ('Hillshade_3pm', 0.025966432305934088),
 ('Slope', 0.016578562806385956),
 ('Wilderness_Area1', 0.014231994712370457),
 ('Soil_Type29', 0.013141968093320968),
 ('Soil_Type23', 0.010371941360034628),
 ('Soil_Type10', 0.00449780912102281),
 ('Soil_Type22', 0.004254956766817089),
 ('Soil_Type20', 0.003895965376142676),
 ('Soil_Type30', 0.0032690124115354177),
 ('Soil_Type4', 0.0031875919289032226),
 ('Soil_Type3', 0.0025253505958489444),
 ('Soil_Type2', 0.001948097873150176),
 ('Soil_Type33', 0.0018075859391242155),
 ('Wilderness_Area3', 0.00172936103518494
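
The same ranking can also be built with a pandas Series, which keeps names and scores aligned automatically. This is only a sketch on synthetic stand-in data; the column names `f0` to `f5` and the `demo_tree` model are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names f0..f5.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Index the importances by column name, then sort from most to least important.
ranked = pd.Series(
    demo_tree.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(ranked)
```

Because the importances of a fitted tree are normalized, the values in `ranked` add up to 1, which makes them easy to read as shares.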

We can also compare the memory footprint of all features against that of our top 10 features, which together already account for about 90% of the total importance.

In [None]:
print("All Features: {} MB".format(X_train.memory_usage(index=True).sum() / 1000000))

NUM_FEATURES_TO_PERSIST = 10
# Column names of the highest-ranked features.
top_features = [feature[0] for feature in feature_importances[:NUM_FEATURES_TO_PERSIST]]
print(f"Top {NUM_FEATURES_TO_PERSIST} Features: {X_train[top_features].memory_usage(index=True).sum() / 1000000} MB")

All Features: 114.27064 MB
Top 10 Features: 22.854128 MB
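
How many features to keep is a judgment call. One way to inform it is a running total of the sorted importances. The sketch below reuses the ten largest values from the sorted output printed earlier (rounded to five decimals); the 0.85 threshold is an arbitrary illustration, not a recommended setting.

```python
import numpy as np

# Ten largest importances from the sorted output, rounded to 5 decimals.
top_importances = np.array([
    0.26333, 0.14866, 0.14215, 0.13879, 0.05077,
    0.04145, 0.03157, 0.03051, 0.02803, 0.02597,
])

# Running total: entry k is the share of importance held by the top k+1 features.
cumulative = np.cumsum(top_importances)
print(cumulative.round(3))

# Smallest number of features whose combined importance reaches the threshold.
threshold = 0.85  # arbitrary, for illustration only
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)
```

Here the top ten features sum to roughly 0.90, and nine of them are enough to pass the 0.85 mark.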


---

Let's keep only our top 10 features to save on memory and reduce training time.

In [None]:
X_train = X_train[[feature[0] for feature in feature_importances[:NUM_FEATURES_TO_PERSIST]]]
X_test = X_test[[feature[0] for feature in feature_importances[:NUM_FEATURES_TO_PERSIST]]]

In [None]:
subspace_classifier = DecisionTreeClassifier()

In [None]:
subspace_classifier.fit(X_train, y_train)

In [None]:
y_pred = subspace_classifier.predict(X_test)

In [None]:
metrics.accuracy_score(y_test, y_pred)

0.9255933586951499

Hmm... our accuracy slipped slightly (from about 93.7% to 92.6%), but we cut our training time by a third and our feature matrix now takes up about a fifth of the memory.
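
Training-time comparisons like this are easy to measure directly. A minimal sketch, assuming synthetic stand-in data and a hypothetical helper `time_fit` (neither is part of the original notebook); keeping the first 10 columns is an arbitrary choice here, not an importance-based selection.

```python
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: 40 features, of which we keep the first 10).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=40, n_informative=8, random_state=42
)

def time_fit(features, labels):
    """Return the wall-clock seconds needed to fit one decision tree."""
    start = time.perf_counter()
    DecisionTreeClassifier(random_state=42).fit(features, labels)
    return time.perf_counter() - start

full_seconds = time_fit(X_demo, y_demo)
subset_seconds = time_fit(X_demo[:, :10], y_demo)
print(f"all 40 features: {full_seconds:.3f}s, first 10 features: {subset_seconds:.3f}s")
```

Single timings are noisy, so repeating each measurement a few times and comparing medians gives a steadier picture.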

---

In [None]:
# Candidate values for each hyperparameter; grid search tries every combination.
hyperparameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30],
    'max_leaf_nodes': [1000, 5000, 10000],
    'min_samples_leaf': [20, 50, 100],
    'min_samples_split': [10, 50, 100]
}
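
The `criterion` entry chooses how the impurity of a node is scored when candidate splits are compared. As a rough illustration (the helper functions `gini` and `entropy` and the toy label lists are written for this example, and entropy is measured here in bits), both measures are zero for a pure node and largest for an even mix:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: the sum of p * log2(1 / p) over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = [2, 2, 2, 2]    # a node containing one class only
mixed = [1, 1, 2, 2]   # a node split evenly between two classes

print(gini(pure), entropy(pure))
print(gini(mixed), entropy(mixed))
```

A split is attractive when the child nodes it produces are purer than their parent under whichever measure is selected.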

In [None]:
tuned_classifier = DecisionTreeClassifier(random_state=42)

In [None]:
model_tuner = GridSearchCV(tuned_classifier, hyperparameters, cv=5)
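
Before launching a grid search it helps to know how much work it implies. This sketch counts the combinations in the grid above with scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `n_folds` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

hyperparameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30],
    'max_leaf_nodes': [1000, 5000, 10000],
    'min_samples_leaf': [20, 50, 100],
    'min_samples_split': [10, 50, 100],
}

# Every combination of values is one candidate model: 2 * 3 * 3 * 3 * 3.
n_candidates = len(ParameterGrid(hyperparameters))

# Each candidate is fit once per cross-validation fold.
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 162 810
```

So the search fits 810 trees during cross-validation, plus one final refit of the best candidate on the full training set under the default `refit=True`.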

In [None]:
model_tuner.fit(X_train, y_train)


In [None]:
optimally_tuned_classifier = model_tuner.best_estimator_

optimally_tuned_classifier
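
Besides `best_estimator_`, a fitted search object exposes which settings won and how they scored. The example below is a sketch on synthetic stand-in data with a deliberately tiny grid; `small_grid` and `demo_tuner` are hypothetical names chosen so it runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the forest cover data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)

# A tiny illustrative grid: 3 * 2 = 6 candidates.
small_grid = {'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 10]}
demo_tuner = GridSearchCV(DecisionTreeClassifier(random_state=42), small_grid, cv=3)
demo_tuner.fit(X_demo, y_demo)

# The winning combination and its mean cross-validated accuracy.
print(demo_tuner.best_params_)
print(round(demo_tuner.best_score_, 3))

# cv_results_ holds one entry per candidate, useful for comparing runners-up.
print(len(demo_tuner.cv_results_['params']))
```

Comparing `best_score_` with the accuracy on the held-out test set is a quick check on whether the cross-validated estimate carries over to unseen data.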

In [None]:
y_pred = optimally_tuned_classifier.predict(X_test)

In [None]:
metrics.accuracy_score(y_test, y_pred)

0.8975310733593113

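Test accuracy dropped from about 92.6% to 89.8%. Cross-validation itself is not the cause: it only picks the best candidate from the grid we supplied. Every candidate in that grid is more constrained than the default tree, with at least 20 samples per leaf, at most 10,000 leaf nodes, and a depth of at most 30. The default **`DecisionTreeClassifier()`** has none of these limits, so the search never gets to consider a tree as flexible as the one we fit earlier. What cross-validation does give us is a score averaged over five folds, which is less sensitive to the quirks of any single split of the data.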
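
When an exhaustive grid becomes expensive, a randomized search samples a fixed number of candidates instead. The sketch below uses `RandomizedSearchCV` with `randint` on synthetic stand-in data; the ranges, `n_iter=10`, and the names `distributions` and `random_tuner` are illustrative assumptions rather than tuned choices.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the forest cover data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)

# randint(low, high) samples integers from low up to, but not including, high.
distributions = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(2, 31),
    'min_samples_leaf': randint(1, 101),
}

random_tuner = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    distributions,
    n_iter=10,        # number of sampled candidates
    cv=3,
    random_state=42,  # makes the sampled candidates repeatable
)
random_tuner.fit(X_demo, y_demo)
print(random_tuner.best_params_)
```

The cost is set by `n_iter` rather than by the size of the search space, so wider ranges can be explored without the number of fits growing.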

---
---

Finally, we can visualize any fitted decision tree: the **`export_graphviz()`** function writes the tree to a Graphviz DOT file, which can then be rendered as a PNG or another image format.

In [None]:
labels = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
          'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']

export_graphviz(
    subspace_classifier,
    out_file="forest.dot",
    feature_names=list(X_train),
    class_names=labels,
    rounded=True,
    filled=True
)

In [None]:
# In a notebook, the leading ! runs this as a shell command (Graphviz must be installed).
# From a terminal, run the same command without the !.
!dot -Tpng forest.dot -o forest.png
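
If Graphviz isn't available, scikit-learn's `export_text` offers a plain-text view of a tree's rules. This is a sketch on synthetic stand-in data; the feature names in `demo_names` and the depth limit of 2 are assumptions chosen to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
demo_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules readable.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Each indented line is a split condition; leaves show the predicted class.
rules = export_text(shallow_tree, feature_names=demo_names)
print(rules)
```

A fully grown tree on a large dataset can have thousands of nodes, so limiting the depth is often what makes either kind of rendering legible.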

And that's that!

You now know how to use a basic CART-style decision tree algorithm for classification!

---
---
---