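Lecture note:
A binary tree is not the same as a decision tree (a common interviewer trick question). A binary tree is a general data structure in which each node has at most two children; a decision tree is a predictive model, and it may or may not be binary.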

---
---
---

# 💠 **TUTORIAL: Applying Decision Trees on Forest Cover Data**

In this tutorial, we'll extend our conceptual knowledge of decision tree classifiers by classifying forest cover types in the Roosevelt National Forest (Colorado) dataset, available on Kaggle via **[this link](https://www.kaggle.com/datasets/uciml/forest-cover-type-dataset)**.

---

To start, let's get all of our relevant imports out of the way.

In [1]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.tree import export_graphviz
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from scipy.stats import randint

Specifically, we'll be making use of the **`DecisionTreeClassifier`** estimator available in scikit-learn.

---

As always, let's first get access to our dataset and take a look at our data.

In [2]:
dataset = pd.read_csv("covtype.csv")

Let's preview the first five rows.

In [3]:
dataset.head()

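|   | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | 6279 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | 6225 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | 6121 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | 6211 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | 6172 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |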


A quick way to inspect any DataFrame's structure, including its column names, non-null counts, and dtypes, is the **`.info()`** method.

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           581012 non-null  int64
 1   Aspect                              581012 non-null  int64
 2   Slope                               581012 non-null  int64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  int64
 4   Vertical_Distance_To_Hydrology      581012 non-null  int64
 5   Horizontal_Distance_To_Roadways     581012 non-null  int64
 6   Hillshade_9am                       581012 non-null  int64
 7   Hillshade_Noon                      581012 non-null  int64
 8   Hillshade_3pm                       581012 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  int64
 10  Wilderness_Area1                    581012 non-null  int64
 11  Wilderness_Area2                    581012 non-null 

---

Just in case, let's also drop any rows that contain null values.

Normally, this is a pretty naive approach, since it ignores imputation methods that could preserve signal from rows with missing values.

For the sake of brevity and simplicity, however, we can simply drop them: the `.info()` output above shows no missing values in the columns listed, so little or nothing is lost.

In [5]:
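# The instructor prefers a try/except block here to avoid errors
try:
    dataset.dropna(inplace=True)
except Exception:  # narrower than a bare `except`, which would also swallow KeyboardInterrupt
    pass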
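
As a contrast to dropping rows, the imputation alternative mentioned above can be sketched as follows. This is a minimal illustration on a made-up three-row frame (the `toy` DataFrame and its values are assumptions, not part of the forest cover data), using scikit-learn's `SimpleImputer` with a median strategy.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Elevation": [2596.0, np.nan, 2804.0],
    "Slope": [3.0, 2.0, np.nan],
})

# Option 1: drop every row that contains a null (what the cell above does)
dropped = toy.dropna()

# Option 2: fill each null with its column median, keeping all rows
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(len(dropped), len(imputed))
```

Dropping keeps only the fully populated row, while imputing keeps all three; which is preferable depends on how many rows are affected.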

---
---

Now let's start preparing for machine learning analysis.

We'll start by segmenting our data into **`X`** and **`y`** segments.

In [6]:
# Cover_Type is the target variable; every other column is a feature
X, y = dataset.drop("Cover_Type", axis=1), dataset["Cover_Type"]

From there, we can produce training and testing subsets with our trusty **`train_test_split()`** function.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

---

We're now ready to make use of our decision tree classifier.

In [8]:
classifier = DecisionTreeClassifier()

In [9]:
classifier.fit(X_train, y_train)

In [10]:
y_pred = classifier.predict(X_test)

In [11]:
metrics.accuracy_score(y_test, y_pred)

0.939235647960896

About 94%! Not bad!

Let's see how we can improve our classifier.
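
A single train/test split yields one accuracy number, which can shift with the split. `cross_val_score` (imported in the first cell but not otherwise used) reports one score per fold instead. The sketch below runs on synthetic data from `make_classification`, which is an assumption standing in for the forest cover data so that the example is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the forest cover data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Five-fold cross-validated accuracy of an untuned tree
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5)
print("fold accuracies:", scores)
print("mean +/- std:", scores.mean(), scores.std())
```

The spread across folds gives a rough sense of how much of a later "improvement" could be split-to-split noise.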

---

To start, let's investigate how much signal each feature in our dataset contributes to the fitted tree!

(Yes, we can actually do that!)

In [12]:
classifier.feature_importances_

array([3.38056053e-01, 2.65923204e-02, 1.52915210e-02, 6.29788769e-02,
       4.34021874e-02, 1.48550349e-01, 3.00303415e-02, 3.27590498e-02,
       2.43305915e-02, 1.43563034e-01, 7.49547590e-03, 4.86259583e-03,
       1.32954518e-02, 1.48491036e-03, 1.60464750e-04, 1.02193065e-02,
       1.99904101e-03, 1.22856195e-02, 6.07180706e-04, 8.69392999e-04,
       0.00000000e+00, 3.54712543e-05, 1.09199270e-04, 2.56772344e-03,
       1.85888586e-03, 1.18381336e-03, 2.77150856e-03, 1.55697687e-04,
       6.59228194e-06, 8.09240075e-04, 1.29252620e-03, 4.60413342e-06,
       8.37298240e-04, 3.11574169e-03, 5.29360333e-04, 8.23104319e-03,
       9.40269213e-03, 5.29506723e-03, 5.36326816e-05, 3.66794771e-04,
       7.89574004e-04, 1.62847299e-04, 7.24501505e-03, 2.90036347e-03,
       5.82203604e-03, 1.25197101e-02, 4.75833494e-03, 4.18506227e-04,
       8.75320913e-04, 1.98258134e-05, 1.58068047e-04, 2.13254670e-03,
       3.44490567e-03, 1.29228928e-03])

Yikes, that looks a little... uninterpretable.

Let's polish this up so it's clearer as to what we're looking at!

In [13]:
importances, features = classifier.feature_importances_, list(X)

# Pair each feature name with its importance, then sort from most to least important
feature_importances = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)

In [14]:
feature_importances

[('Elevation', 0.3380560533007945),
 ('Horizontal_Distance_To_Roadways', 0.14855034854123647),
 ('Horizontal_Distance_To_Fire_Points', 0.14356303441811372),
 ('Horizontal_Distance_To_Hydrology', 0.06297887694495703),
 ('Vertical_Distance_To_Hydrology', 0.04340218741401295),
 ('Hillshade_Noon', 0.03275904977754823),
 ('Hillshade_9am', 0.030030341458491094),
 ('Aspect', 0.02659232036875537),
 ('Hillshade_3pm', 0.024330591523529965),
 ('Slope', 0.015291520954420157),
 ('Wilderness_Area3', 0.013295451751083013),
 ('Soil_Type32', 0.012519710111139984),
 ('Soil_Type4', 0.012285619493295707),
 ('Soil_Type2', 0.010219306520091679),
 ('Soil_Type23', 0.009402692130896626),
 ('Soil_Type22', 0.008231043194376944),
 ('Wilderness_Area1', 0.007495475898651351),
 ('Soil_Type29', 0.007245015052067079),
 ('Soil_Type31', 0.005822036044601275),
 ('Soil_Type24', 0.00529506722761015),
 ('Wilderness_Area2', 0.004862595833986963),
 ('Soil_Type33', 0.004758334943520184),
 ('Soil_Type39', 0.003444905674386879),
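
The same ranking can also be produced with a pandas `Series`, which keeps names and values together and prints as a labeled column. This is a sketch on synthetic data; the `feature_0` ... `feature_5` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Label each importance with its column name and sort from highest to lowest
ranked = pd.Series(tree.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranked)
print("total:", ranked.sum())  # impurity-based importances are normalized to sum to 1
```

Because the importances are normalized, each value can be read as a share of the total impurity reduction achieved by the tree.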

We can also compare the memory footprint of all features against that of our top 10 features, which already account for most of our signal (about 87% of the total importance listed above).

In [15]:
print("All Features: {} Mb".format(X_train.memory_usage(index=True).sum() / 1000000))

NUM_FEATURES_TO_PERSIST = 10
print(f"Top {NUM_FEATURES_TO_PERSIST} Features: {X_train[[feature[0] for feature in feature_importances[:NUM_FEATURES_TO_PERSIST]]].memory_usage(index=True).sum() / 1000000} Mb")

All Features: 204.51596 Mb
Top 10 Features: 40.903192 Mb


---

Let's restrict our data to the top 10 features by importance to save on memory and reduce training time.

In [16]:
X_train = X_train[[feature[0] for feature in feature_importances[:NUM_FEATURES_TO_PERSIST]]]
X_test = X_test[[feature[0] for feature in feature_importances[:NUM_FEATURES_TO_PERSIST]]]

In [17]:
subspace_classifier = DecisionTreeClassifier()

In [18]:
subspace_classifier.fit(X_train, y_train)

In [19]:
y_pred = subspace_classifier.predict(X_test)

In [20]:
metrics.accuracy_score(y_test, y_pred)

0.9175666721168989

Hmm... our accuracy dropped by about two points (from roughly 93.9% to 91.8%), but we cut our training time by a third and, per the memory figures above, shrank the feature matrix to about a fifth of its original size.
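
The notebook does not show how the training time was measured. One way to check such a claim is to time `fit()` directly, as in this sketch. The `time_fit` helper and the synthetic 54-column data are assumptions for illustration, and the numbers will vary by machine.

```python
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def time_fit(X, y):
    """Return the wall-clock seconds taken to fit one default decision tree."""
    start = time.perf_counter()
    DecisionTreeClassifier(random_state=42).fit(X, y)
    return time.perf_counter() - start

# Synthetic stand-in: 54 columns, matching the width of the full feature set
X_demo, y_demo = make_classification(n_samples=5000, n_features=54, n_informative=10, random_state=42)

full_seconds = time_fit(X_demo, y_demo)
subset_seconds = time_fit(X_demo[:, :10], y_demo)  # an arbitrary 10-column subset
print(f"all 54 features: {full_seconds:.3f}s, 10 features: {subset_seconds:.3f}s")
```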

---

In [21]:
hyperparameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30],
    'max_leaf_nodes': [1000, 5000, 10000],
    'min_samples_leaf': [20, 50, 100],
    'min_samples_split': [10, 50, 100]
}

In [22]:
tuned_classifier = DecisionTreeClassifier(random_state=42)

In [23]:
model_tuner = GridSearchCV(tuned_classifier, hyperparameters, cv=5)

In [24]:
model_tuner.fit(X_train, y_train)

In [25]:
optimally_tuned_classifier = model_tuner.best_estimator_

optimally_tuned_classifier

In [26]:
y_pred = optimally_tuned_classifier.predict(X_test)

In [27]:
metrics.accuracy_score(y_test, y_pred)

0.8795728165365783
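
It helps to know how much work this grid search performs, since every combination is fitted once per fold. The count can be sketched with `ParameterGrid`, using the same grid as above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the cell above
hyperparameters = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30],
    'max_leaf_nodes': [1000, 5000, 10000],
    'min_samples_leaf': [20, 50, 100],
    'min_samples_split': [10, 50, 100],
}

n_candidates = len(ParameterGrid(hyperparameters))  # 2 * 3 * 3 * 3 * 3
n_fits = n_candidates * 5                           # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)  # → 162 810
```

After fitting, `model_tuner.best_params_` and `model_tuner.best_score_` show which combination won and its mean cross-validated accuracy. Note also that every candidate in this grid is more constrained than the default tree (the smallest `min_samples_leaf` is 20), which is one plausible reason the tuned model scores below the untuned one on the test set.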

### Homework: Raise accuracy. Don't go back; continue from here.

In [34]:
# 2. Feature Engineering
# Transform existing ones to improve model performance

# 3. Ensemble Methods
# Combine predictions from multiple models to improve accuracy

# 4. Data Augmentation
# Increase the size of the training dataset by adding modified copies of existing data or newly created synthetic data

# 5. Cross-Validation
# Use cross-validation to ensure the model generalizes well to unseen data

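# 6. Regularization
# Constrain tree growth (max_depth, min_samples_leaf, or cost-complexity pruning via ccp_alpha) to prevent overfitting;
# scikit-learn's decision trees have no L1/L2 penalty, which applies to linear models instead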

# END: Suggestions to improve accuracy
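
Two of the suggestions above, ensembles and regularization, can be sketched side by side. This is a minimal illustration on synthetic data, not a tuned solution to the homework: the `ccp_alpha=0.005` value is arbitrary, and `RandomForestClassifier` is just one possible ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the forest cover data
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Unpruned tree versus the same tree with cost-complexity pruning
# (ccp_alpha=0.005 is an arbitrary illustrative value, not a tuned one)
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.005).fit(X_tr, y_tr)

# Ensemble: 100 trees, each grown on a bootstrap sample of the training rows
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

for name, model in [("full tree", full_tree), ("pruned tree", pruned_tree), ("random forest", forest)]:
    print(name, model.score(X_te, y_te))
print("leaves:", full_tree.get_n_leaves(), "->", pruned_tree.get_n_leaves())
```

Pruning trades training fit for a smaller tree, while the forest averages many trees; which one helps more on the forest cover data would need to be measured.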

### Hyperparameter Tuning

In [35]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# Define the expanded hyperparameter grid
hyperparameters = {
    'criterion': ['gini', 'entropy'],              # Split criteria
    'max_depth': [10, 20, 30, None],               # Expanded max depth, including None for no limit
    'max_leaf_nodes': [1000, 5000, 10000],         # Maximum number of leaf nodes
    'min_samples_leaf': [10, 20, 50, 100]          # Minimum samples required at a leaf node
}

# Initialize the Decision Tree Classifier
tuned_classifier = DecisionTreeClassifier(random_state=42)

# Use GridSearchCV to find the best hyperparameters with cross-validation
model_tuner = GridSearchCV(tuned_classifier, hyperparameters, cv=5)

# Fit the model on the training data
model_tuner.fit(X_train, y_train)

# Retrieve the best estimator from the grid search
optimally_tuned_classifier = model_tuner.best_estimator_

# Predict on the test data using the best model
y_pred = optimally_tuned_classifier.predict(X_test)

# Calculate and print the accuracy score
accuracy_score = metrics.accuracy_score(y_test, y_pred)
print("Accuracy with optimally tuned Decision Tree:", accuracy_score)


Accuracy with optimally tuned Decision Tree: 0.8974208927480357


In [32]:
# Randomized Search

from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Define hyperparameter space
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3; None considers all features
}

# RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(), 
                                   param_distributions=param_dist, 
                                   n_iter=20, # set the number of iterations
                                   cv=5, 
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting and tuning
random_search.fit(X_train, y_train)
best_dt_classifier = random_search.best_estimator_

# Predict and evaluate with the best estimator
y_pred_dt_best = best_dt_classifier.predict(X_test)
accuracy_dt_best = metrics.accuracy_score(y_test, y_pred_dt_best)

print("Best Decision Tree Classifier Accuracy:", accuracy_dt_best)




Best Decision Tree Classifier Accuracy: 0.8846415324905553


In [33]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Hyperparameter space (updated)
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [10, 20, 30, 40, None],  # max_depth increased
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3; None considers all features
}

# RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(), 
                                   param_distributions=param_dist, 
                                   n_iter=30,  # increased the number of iterations
                                   cv=5, 
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting and tuning
random_search.fit(X_train, y_train)
best_dt_classifier = random_search.best_estimator_

# Predict and evaluate with the best estimator
y_pred_dt_best = best_dt_classifier.predict(X_test)
accuracy_dt_best = metrics.accuracy_score(y_test, y_pred_dt_best)

print("Best Decision Tree Classifier Accuracy:", accuracy_dt_best)




Best Decision Tree Classifier Accuracy: 0.8969303718492638
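
The first cell imports `randint` from `scipy.stats`, but the searches above pass fixed lists. `RandomizedSearchCV` also accepts distributions, in which case each iteration draws fresh values rather than picking from a short list. A sketch on synthetic data, with ranges chosen only for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)

# Distributions instead of fixed lists: each iteration draws a fresh integer
param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_depth': randint(5, 41),          # integers 5..40 (upper bound is exclusive)
    'min_samples_split': randint(2, 21),  # integers 2..20
    'min_samples_leaf': randint(1, 11),   # integers 1..10
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```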


In [36]:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

# Step 1: Apply PCA for dimensionality reduction
pca = PCA(n_components=10)  # X_train already has 10 columns here, so this rotates the data rather than reducing it
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Step 2: Define the hyperparameter grid
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, None],               
    'max_leaf_nodes': [1000, 5000, 10000],         
    'min_samples_leaf': [10, 20, 50, 100]          
}

# Step 3: Initialize and tune the Decision Tree model with PCA-transformed data
tuned_classifier = DecisionTreeClassifier(random_state=42)
model_tuner = GridSearchCV(tuned_classifier, hyperparameters, cv=5)
model_tuner.fit(X_train_pca, y_train)

# Step 4: Predict and evaluate using the best model
optimally_tuned_classifier = model_tuner.best_estimator_
y_pred = optimally_tuned_classifier.predict(X_test_pca)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Accuracy with PCA-transformed features:", accuracy_score)


Accuracy with PCA-transformed features: 0.8845296593031161
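
The comment in the cell above notes that the number of components can be adjusted based on variance. One common way to do that is to inspect `explained_variance_ratio_`, sketched here on synthetic data; the 95% threshold is an illustrative assumption, not a rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, _ = make_classification(n_samples=1000, n_features=10, n_informative=4, random_state=42)

pca = PCA(n_components=10).fit(X_demo)

# Fraction of total variance captured by each component, largest first
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches 95% (threshold is illustrative)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(cumulative, 3), n_needed)
```

Keep in mind that variance captured is not the same as predictive signal, so a component count chosen this way still needs to be validated against accuracy.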


In [37]:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

# Step 1: Apply PCA for dimensionality reduction
pca = PCA(n_components=10)  # X_train already has 10 columns here, so this rotates the data rather than reducing it
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Step 2: Define the hyperparameter space (more efficient with RandomizedSearch)
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, None],               
    'max_leaf_nodes': [1000, 5000, 10000],         
    'min_samples_leaf': [10, 20, 50, 100]          
}

# Step 3: Use RandomizedSearchCV for faster tuning with fewer parameter combinations
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=20, # Limit the number of random combinations
                                   cv=5, 
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting
random_search.fit(X_train_pca, y_train)

# Best model with PCA and tuned parameters
best_classifier = random_search.best_estimator_

# Prediction and evaluation
y_pred = best_classifier.predict(X_test_pca)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Accuracy with PCA and RandomizedSearch:", accuracy_score)


Accuracy with PCA and RandomizedSearch: 0.8845296593031161


In [41]:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Step 1: Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Apply PCA with valid n_components for dimensionality reduction
pca = PCA(n_components=10)  # 10 components from 10 scaled columns: a rotation, not a reduction
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Step 3: Define the hyperparameter space
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, None],               
    'max_leaf_nodes': [1000, 5000, 10000],         
    'min_samples_leaf': [10, 20, 50, 100]          
}

# Step 4: Use RandomizedSearchCV for faster tuning
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=20, 
                                   cv=5, 
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting
random_search.fit(X_train_pca, y_train)

# Best model with PCA and tuned parameters
best_classifier = random_search.best_estimator_

# Prediction and evaluation
y_pred = best_classifier.predict(X_test_pca)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Accuracy with Scaled, PCA-transformed features:", accuracy_score)


Accuracy with Scaled, PCA-transformed features: 0.8205898298666988


In [43]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

# Step 1: Apply SelectKBest with f_classif for feature selection
selector = SelectKBest(score_func=f_classif, k=10)  # X_train already has 10 columns, so k=10 keeps all of them
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Step 2: Define the hyperparameter space
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, None],               
    'max_leaf_nodes': [1000, 5000, 10000],         
    'min_samples_leaf': [10, 20, 50, 100]          
}

# Step 3: Use RandomizedSearchCV for faster tuning
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=20, 
                                   cv=5, 
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting
random_search.fit(X_train_selected, y_train)

# Best model with selected features and tuned parameters
best_classifier = random_search.best_estimator_

# Prediction and evaluation
y_pred = best_classifier.predict(X_test_selected)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Accuracy with Selected Features (f_classif):", accuracy_score)


Accuracy with Selected Features (f_classif): 0.8974208927480357


In [45]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

# Step 1: Apply SelectKBest with all features (since n_features=10)
selector = SelectKBest(score_func=f_classif, k='all')  # Select all features
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

# Step 2: Define an expanded hyperparameter space
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, 40, None],           # Expanded depth range            
    'max_leaf_nodes': [500, 1000, 5000, 10000],    # Expanded leaf node options
    'min_samples_leaf': [5, 10, 20, 50, 100]       # Adjusted for more granular options
}

# Step 3: Use RandomizedSearchCV for faster tuning
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=30,     # Increased n_iter to explore more combinations
                                   cv=5, 
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting
random_search.fit(X_train_selected, y_train)

# Best model with all features and tuned parameters
best_classifier = random_search.best_estimator_

# Prediction and evaluation
y_pred = best_classifier.predict(X_test_selected)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Accuracy with All Features and Enhanced Tuning:", accuracy_score)


Accuracy with All Features and Enhanced Tuning: 0.9034104110909357


In [46]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

# Step 1: Apply MinMaxScaler (tree splits depend on feature ordering, so scaling is not expected to change much)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Apply SelectKBest with all features (all features are retained)
selector = SelectKBest(score_func=f_classif, k='all')  # Select all features
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Step 3: Define an even more expanded hyperparameter space
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, 40, None],            
    'max_leaf_nodes': [500, 1000, 5000, 10000],     
    'min_samples_leaf': [5, 10, 20, 50, 100],       
    'min_samples_split': [2, 5, 10, 20]             # Added min_samples_split for finer tuning
}

# Step 4: Use RandomizedSearchCV with more cross-validation folds
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=40,         # Increased n_iter for broader search
                                   cv=10,             # Increased to 10 folds for stability
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting
random_search.fit(X_train_selected, y_train)

# Best model with scaled features and further tuned parameters
best_classifier = random_search.best_estimator_

# Prediction and evaluation
y_pred = best_classifier.predict(X_test_selected)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Enhanced Accuracy with MinMax Scaling and Expanded Tuning:", accuracy_score)


Enhanced Accuracy with MinMax Scaling and Expanded Tuning: 0.9022744679569374
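
The near-identical scores with and without `MinMaxScaler` are what one would expect: a tree splits on thresholds, and min-max scaling preserves the ordering of values within each feature. The sketch below checks this on synthetic data by fitting the same tree on raw and scaled features; the two accuracies should be nearly identical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same tree settings, fitted once on raw features and once on min-max scaled features
raw_acc = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

scaler = MinMaxScaler().fit(X_tr)
scaled_acc = (
    DecisionTreeClassifier(random_state=42)
    .fit(scaler.transform(X_tr), y_tr)
    .score(scaler.transform(X_te), y_te)
)
print(raw_acc, scaled_acc)
```

Scaling still matters when a scale-sensitive step such as PCA sits in front of the tree, as the scaled PCA cell earlier shows.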


In [53]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics

# Step 1: Apply MinMaxScaler (tree splits depend on feature ordering, so scaling is not expected to change much)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Apply SelectKBest with all features (all features are retained)
selector = SelectKBest(score_func=f_classif, k='all')  # Select all features
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Step 3: Define an even more expanded hyperparameter space
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, 40, 50],            
    'max_leaf_nodes': [500, 1000, 5000, 10000],     
    'min_samples_leaf': [5, 10, 20, 50, 100],       
    'min_samples_split': [2, 5, 10, 20]             # Added min_samples_split for finer tuning
}

# Step 4: Use RandomizedSearchCV with more cross-validation folds
random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=50,         # Increased n_iter for broader search
                                   cv=10,             # Increased to 10 folds for stability
                                   n_jobs=-1, 
                                   verbose=0, 
                                   random_state=42)

# Model fitting
random_search.fit(X_train_selected, y_train)

# Best model with scaled features and further tuned parameters
best_classifier = random_search.best_estimator_

# Prediction and evaluation
y_pred = best_classifier.predict(X_test_selected)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Enhanced Accuracy with MinMax Scaling and Expanded Tuning:", accuracy_score)


Enhanced Accuracy with MinMax Scaling and Expanded Tuning: 0.7465728079309484


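Model performance has dropped sharply here (from about 0.90 to about 0.75), and cross-validation does not explain it: every search above already used cross-validation, and the score is still measured on a held-out test split. The execution counts point to a more likely cause: this cell (`In [53]`) ran after `In [51]` and `In [52]` below, which overwrite `X_train` and `X_test` with `SelectKBest` features reduced to 5 principal components, so the search was probably fitted on that much weaker feature set.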

### Select Features

In [51]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Step 1: Feature Selection with SelectKBest
# Select top 10 features based on ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)

# Get the selected feature names for clarity (optional)
selected_features = [X.columns[i] for i in selector.get_support(indices=True)]
print("Selected Top 10 Features:", selected_features)

# Step 2: Train/Test Split with Selected Features
X_train, X_test, y_train, y_test = train_test_split(X_new, y, train_size=0.8, test_size=0.2, random_state=42)

# Step 3: Train the Decision Tree Classifier with Selected Features
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Step 4: Make Predictions and Calculate Accuracy
y_pred = classifier.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy with Top 10 Features:", accuracy)


Selected Top 10 Features: ['Elevation', 'Horizontal_Distance_To_Roadways', 'Wilderness_Area1', 'Wilderness_Area4', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type10', 'Soil_Type38', 'Soil_Type39']
Accuracy with Top 10 Features: 0.6709809557412459
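
Two observations on the cell above. First, `f_classif` scores each feature on its own with an ANOVA F-test, so its top 10 can differ from the tree's own importance ranking, which may explain the lower accuracy. Second, the selector is fitted on all rows before the split, so the test rows influence which features are chosen. A `Pipeline` avoids the second issue by fitting the selector on training rows only; the sketch below uses synthetic data as a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=42)

# Split first, so the held-out rows play no part in choosing features
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),  # fitted on training rows only
    ("tree", DecisionTreeClassifier(random_state=42)),
])
pipeline.fit(X_tr, y_tr)

kept = pipeline.named_steps["select"].get_support(indices=True)
print("kept columns:", kept)
print("test accuracy:", pipeline.score(X_te, y_te))
```

The same pipeline object can be passed to `GridSearchCV` or `RandomizedSearchCV`, so that selection is refitted inside every fold.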


### PCA

In [52]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Step 1: Feature Selection with SelectKBest
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Step 2: PCA for Dimensionality Reduction (optional)
# Let's reduce the selected features to 5 principal components
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_selected)

# Step 3: Train/Test Split with Reduced Features
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, train_size=0.8, test_size=0.2, random_state=42)

# Step 4: Train the Decision Tree Classifier with Reduced Features
classifier = DecisionTreeClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Step 5: Make Predictions and Calculate Accuracy
y_pred = classifier.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy with SelectKBest and PCA:", accuracy)


Accuracy with SelectKBest and PCA: 0.6761185167336472


### Cross-Validation

In [54]:
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Step 1: Apply MinMaxScaler (tree splits depend on feature ordering, so scaling is not expected to change much)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Select all features with SelectKBest for now, or a reduced number if necessary
selector = SelectKBest(score_func=f_classif, k='all')  # Keep all features
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Step 3: Define an expanded hyperparameter space for better tuning
hyperparameters = {
    'criterion': ['gini', 'entropy'],              
    'max_depth': [10, 20, 30, 40, None],            
    'max_leaf_nodes': [500, 1000, 5000, 10000],     
    'min_samples_leaf': [5, 10, 20, 50, 100],       
    'min_samples_split': [2, 5, 10, 20]             
}

# Step 4: Use Stratified K-Fold Cross-Validation with RandomizedSearchCV
stratified_kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

random_search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=42), 
                                   param_distributions=hyperparameters, 
                                   n_iter=40,        # Iterations for the randomized search
                                   cv=stratified_kfold,  # Use Stratified K-Fold here
                                   n_jobs=-1, 
                                   random_state=42)

# Model fitting with cross-validation
random_search.fit(X_train_selected, y_train)

# Retrieve the best model from the search
best_classifier = random_search.best_estimator_

# Prediction and accuracy evaluation
y_pred = best_classifier.predict(X_test_selected)
accuracy_score = metrics.accuracy_score(y_test, y_pred)

print("Enhanced Accuracy with Stratified K-Fold and Expanded Tuning:", accuracy_score)


Enhanced Accuracy with Stratified K-Fold and Expanded Tuning: 0.7465728079309484
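
To see what stratification does, the sketch below splits a toy imbalanced label vector (an assumption for illustration) and counts the minority class in each test fold. Note that when `cv` is an integer and the estimator is a classifier, scikit-learn's search classes already use stratified folds, so the explicit `StratifiedKFold` above mainly adds shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 samples of class 0 and 10 of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # features are irrelevant to how the folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
minority_per_fold = [int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)]
print(minority_per_fold)
```

Each test fold receives the same share of the minority class, so no fold ends up without examples of it.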


---
---

Finally, we can export any fitted decision tree as a Graphviz `.dot` file, which can then be rendered as a PNG or another image format, using the **`export_graphviz()`** function.

In [47]:
labels = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
          'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']

export_graphviz(
    subspace_classifier,
    out_file="forest.dot",
    # Read the names stored on the fitted model; X_train may have been reassigned by other cells
    feature_names=list(subspace_classifier.feature_names_in_),
    class_names=labels,
    rounded=True,
    filled=True
)

In [50]:
# RUN THIS IN YOUR COMMAND LINE TO GENERATE A PNG! (requires the Graphviz `dot` executable on your PATH)
!dot -Tpng forest.dot -o forest.png

zsh:1: command not found: dot
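
The error above means the Graphviz `dot` executable is not installed on this machine. When installing it is not an option, `export_text` from `sklearn.tree` prints a fitted tree as indented rules with no external binaries. A minimal sketch, using the built-in iris dataset and a depth-2 tree purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small built-in dataset used purely for illustration
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# Plain-text rendering of the fitted tree; no external binaries required
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

For a tree as large as the unrestricted forest cover model, the `max_depth` argument of `export_text` limits how many levels are printed.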


And that's that!

You now know how to use a basic CART-style decision tree algorithm for classification!
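
Returning to the lecture note at the top: scikit-learn happens to store its fitted CART trees as a binary tree, in which every internal node has exactly two children. That is a property of this implementation rather than of decision trees in general. The sketch below walks the stored structure of a small tree fitted on the built-in iris dataset, chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(iris.data, iris.target)

structure = clf.tree_
for node in range(structure.node_count):
    left, right = structure.children_left[node], structure.children_right[node]
    if left == -1:  # -1 marks a missing child, so this node is a leaf
        print(f"node {node}: leaf")
    else:
        # Internal nodes hold a learned test: a feature index and a threshold
        print(f"node {node}: x[{structure.feature[node]}] <= {structure.threshold[node]:.2f} "
              f"-> left {left}, right {right}")
```

The binary tree is the container; the decision tree is the container plus the learned tests and the class predictions at its leaves.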

---
---
---