# Lab 07: Decision Trees



#### Part 1

Train and fine-tune a decision tree for the moons dataset:
* Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset.
* Use train_test_split() to split the dataset into a training set and a test set.
* Use grid search with cross-validation (with the help of the GridSearchCV
class) to find good hyperparameter values for a DecisionTreeClassifier.
Hint: try various values for max_leaf_nodes.
* Train it on the full training set using these hyperparameters, and measure
your model’s performance on the test set. You should get roughly 85% to 87%
accuracy.

In [1]:
# Source: Chapter 5 of Hands On Machine Learning, page 157
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4) # Generate moons dataset

In [2]:
# Source: Chapter 2 of Hands On Machine Learning, page 53
from sklearn.model_selection import train_test_split

# Divide moons dataset into training and test sets where 20% is for testing
X_moons_train, X_moons_test, y_moons_train, y_moons_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('X_moons_train size: ', len(X_moons_train))
print('X_moons_test size: ', len(X_moons_test))
print('y_moons_train size: ', len(y_moons_train))
print('y_moons_test size: ', len(y_moons_test))

X_moons_train size:  8000
X_moons_test size:  2000
y_moons_train size:  8000
y_moons_test size:  2000


In [3]:
# Sources: Chapter 6 of Hands On Machine Learning, page 175
#          Chapter 2 of Hands On Machine Learning, page 76
# https://www.geeksforgeeks.org/how-to-tune-a-decision-tree-in-hyperparameter-tuning/
# https://www.geeksforgeeks.org/building-and-implementing-decision-tree-classifiers-with-scikit-learn-a-comprehensive-guide/
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Create base model for the Decision Tree Classifier before training it
decision_tree_base_model = DecisionTreeClassifier()

# Perform Grid Search Cross-Validation hyperparameter tuning to find the best model for the Decision Tree Classifier
# Based on results, this will be used for training the Decision Tree Classifier
decision_tree_params = {
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': list(range(2, 5, 1)),
    'max_leaf_nodes': list(range(2, 28, 1))
}

decision_tree_GridSearchCV = GridSearchCV(estimator=decision_tree_base_model, param_grid=decision_tree_params, cv=5)

# Get the best model for the Decision Tree Classifier using Grid Search CV
decision_tree_GridSearchCV.fit(X_moons_train, y_moons_train)

#print('X_moons_train GridSearchCV size: ', len(X_moons_train))
#print('y_moons_train GridSearchCV size: ', len(y_moons_train))

best_decision_tree_model = decision_tree_GridSearchCV.best_estimator_
best_decision_tree_params = decision_tree_GridSearchCV.best_params_

print('Best decision tree model: ', best_decision_tree_model)
print('Best decision tree hyperparameters: ', best_decision_tree_params)


Best decision tree model:  DecisionTreeClassifier(max_leaf_nodes=25, min_samples_leaf=2)
Best decision tree hyperparameters:  {'criterion': 'gini', 'max_leaf_nodes': 25, 'min_samples_leaf': 2}
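
Beyond the single best combination, the cross-validated scores show how close the other candidates were. The following is a minimal sketch, assuming a smaller dataset, a reduced grid, and fixed random seeds; all of these are assumptions chosen here for speed and reproducibility, not the settings used in the cell above, and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: smaller dataset and fixed seeds so the sketch runs quickly
X_demo, y_demo = make_moons(n_samples=2000, noise=0.4, random_state=42)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Assumption: a reduced grid over max_leaf_nodes only
demo_grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_leaf_nodes': [2, 4, 8, 16, 32]},
    cv=5)
demo_grid_search.fit(X_demo_train, y_demo_train)

# Mean cross-validated accuracy of the best candidate
print('Best CV accuracy: ', demo_grid_search.best_score_)

# Mean cross-validated accuracy of every candidate in the grid
for params, mean_score in zip(demo_grid_search.cv_results_['params'],
                              demo_grid_search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 4))
```

If several candidates score within a fraction of a percent of each other, the "best" value of max_leaf_nodes is likely to change from run to run, which would be consistent with the varying results noted in the next cell.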


In [4]:
# Measure best model's performance with the test set

# max_leaf_nodes: 4, min_samples_leaf: 2 yielded 0.8635 for accuracy (when list(range(2, 37, 1)))
# max_leaf_nodes: 4, min_samples_leaf: 2 yielded 0.858 for accuracy (when list(range(2, 45, 1)))
# max_leaf_nodes: 4, min_samples_leaf: 2 yielded 0.8605 for accuracy (when list(range(2, 48, 1)))

# max_leaf_nodes: 21, min_samples_leaf: 2 yielded 0.86 for accuracy (when list(range(2, 28, 1)))
# max_leaf_nodes: 21, min_samples_leaf: 2 yielded 0.857 for accuracy (when list(range(2, 44, 1)))
# max_leaf_nodes: 21, min_samples_leaf: 2 yielded 0.866 for accuracy (when list(range(2, 46, 1)))
# max_leaf_nodes: 21, min_samples_leaf: 2 yielded 0.8645 for accuracy (when list(range(2, 50, 1)))

# max_leaf_nodes: 22, min_samples_leaf: 2 yielded 0.858 for accuracy (when list(range(2, 32, 1)))
# max_leaf_nodes: 22, min_samples_leaf: 3 yielded 0.8625 for accuracy (when list(range(2, 30, 1)))
# max_leaf_nodes: 22, min_samples_leaf: 2 yielded 0.859 for accuracy (when list(range(2, 43, 1)))

# max_leaf_nodes: 24, min_samples_leaf: 3 yielded 0.864 for accuracy (when list(range(2, 27, 1)))
# max_leaf_nodes: 24, min_samples_leaf: 2 yielded 0.866 for accuracy (when list(range(2, 47, 1)))

from sklearn.metrics import accuracy_score, classification_report

y_moons_pred = best_decision_tree_model.predict(X_moons_test)

print('Accuracy on moons test set:', accuracy_score(y_moons_test, y_moons_pred))
print('Classification Report for moons test set:\n', classification_report(y_moons_test, y_moons_pred))

Accuracy on moons test set: 0.8585
Classification Report for moons test set:
               precision    recall  f1-score   support

           0       0.85      0.86      0.86       988
           1       0.86      0.85      0.86      1012

    accuracy                           0.86      2000
   macro avg       0.86      0.86      0.86      2000
weighted avg       0.86      0.86      0.86      2000



#### Part 2: Grow a forest by following these steps:

* Continuing the previous exercise, generate 1,000 subsets of the training set,
each containing 100 instances selected randomly. Hint: you can use
Scikit-Learn’s ShuffleSplit class for this.
* Train one Decision Tree on each subset, using the best hyperparameter values
found in the previous exercise. Evaluate these 1,000 Decision Trees on the test
set. Since they were trained on smaller sets, these Decision Trees will likely
perform worse than the first Decision Tree, achieving only about 80%
accuracy.
* Now comes the magic. For each test set instance, generate the predictions of
the 1,000 Decision Trees, and keep only the most frequent prediction (you can
use SciPy’s mode() function for this). This approach gives you majority-vote
predictions over the test set.
* Evaluate these predictions on the test set: you should obtain a slightly higher
accuracy than your first model (about 0.5 to 1.5% higher). Congratulations,
you have trained a Random Forest classifier!
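
The majority-vote step can be illustrated on a tiny example. This is only a sketch: the prediction matrix below is made up for illustration (3 hypothetical trees, 4 test instances), and `keepdims=False` assumes SciPy 1.9 or newer.

```python
import numpy as np
from scipy import stats

# Hypothetical predictions: one row per tree, one column per test instance
toy_predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

# axis=0 takes the mode down each column, so every tree casts one vote per instance
toy_majority, toy_vote_counts = stats.mode(toy_predictions, axis=0, keepdims=False)

print('Majority vote: ', toy_majority)       # [0 1 0 0]
print('Votes received:', toy_vote_counts)    # [2 3 2 2]
```

With 1,000 trees and 2,000 test instances the array has shape (1000, 2000), and the same call returns one majority prediction per test instance.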

In [5]:
from sklearn.model_selection import ShuffleSplit
from scipy import stats
import numpy as np
import copy as c

# Store subsets in a list
decision_tree_subsets = []

# Store best model estimates for Decision Trees in a list
decision_tree_forest = []

# Store predictions in a list
decision_tree_predictions = []

# Store accuracies in a list
decision_tree_accuracy_scores = []

# Create 1,000 subsets with 100 instances/samples each
# An integer train_size is the absolute number of instances per subset (100 of the 8,000 in X_moons_train)
decision_tree_shuffle_split = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)

# Sources: Chapter 2 of Hands On Machine Learning, page 55
# Source: Chapter 6 of Hands On Machine Learning
# Source: Chapter 7 of Hands On Machine Learning
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
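for moons_train_index, moons_test_index in decision_tree_shuffle_split.split(X_moons_train):
  # Index into the training set (not the full dataset X) so that no test instances leak into a subset
  # Separate names keep X_moons_train and y_moons_train from being overwritten
  X_moons_subset = X_moons_train[moons_train_index]
  y_moons_subset = y_moons_train[moons_train_index]
  decision_tree_subsets.append((X_moons_subset, y_moons_subset))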


# Train Decision Tree on each subset using hyperparameters from best model
# Sources: Chapters 6 and 7 of Hands On Machine Learning
# https://www.freecodecamp.org/news/python-for-loop-for-i-in-range-example/
# https://realpython.com/python-zip-function/

# Give every subset its own untrained tree with the best hyperparameters
# clone() returns a new, unfitted estimator, so fitting one tree does not overwrite another
from sklearn.base import clone

# 1,000 because we're evaluating 1,000 Decision Trees on test set
for i in range(1000):
  decision_tree_forest.append(clone(best_decision_tree_model))

for some_decision_tree, (X_moons_subset, y_moons_subset) in zip(decision_tree_forest, decision_tree_subsets):
  some_decision_tree.fit(X_moons_subset, y_moons_subset)

  # Get prediction of Decision Tree with test set and store in list
  y_moons_pred = some_decision_tree.predict(X_moons_test)
  #print('\nMoons pred: ', y_moons_pred) # Type is numpy.ndarray
  decision_tree_predictions.append(y_moons_pred)

  # Get accuracy score of Decision Tree and store in list
  decision_tree_accuracy_score = accuracy_score(y_moons_test, y_moons_pred)
  decision_tree_accuracy_scores.append(decision_tree_accuracy_score)

# Get average accuracy score for all 1,000 Decision Trees
decision_tree_accuracy_sum = sum(decision_tree_accuracy_scores)
num_accuracy_scores = len(decision_tree_accuracy_scores)
accuracy_score_avg = decision_tree_accuracy_sum / num_accuracy_scores

# Print average accuracy score for all 1,000 Decision Trees
print('Average accuracy score for 1,000 Decision Trees: ', accuracy_score_avg)

# Convert decision_tree_predictions from list to 2D array
# Sources: https://www.geeksforgeeks.org/numpy-vstack-in-python/
# https://numpy.org/doc/stable/reference/generated/numpy.vstack.html
decision_tree_predictions_arr = np.vstack(decision_tree_predictions)

# For comparing the outputs (but the contents should be the same)
# Source: https://www.geeksforgeeks.org/how-to-append-a-numpy-array-to-an-empty-array-in-python/
print('\nDecision Tree Predictions:\n', decision_tree_predictions)
print('\nDecision Tree Predictions 2D Array:\n', decision_tree_predictions_arr)


# For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction using SciPy’s mode() function.

# Sources: Chapters 6 and 7 of Hands On Machine Learning
# Each row holds one tree's predictions for the 2,000 test instances, so the shape is (1000, 2000)
# Reuse the predictions stored right after each tree was fitted instead of predicting again
y_moons_pred_tree_arr = np.vstack(decision_tree_predictions)

# Get most frequent prediction over test set
y_most_frequent_pred, votes_num = stats.mode(y_moons_pred_tree_arr, axis=0, keepdims=False)
print('\nMost frequent prediction:\n', y_most_frequent_pred)

# Get accuracy score of the most frequent prediction over test set
most_frequent_pred_accuracy = accuracy_score(y_moons_test, y_most_frequent_pred)
print('\nMost frequent prediction accuracy: ', most_frequent_pred_accuracy)

Average accuracy score for 1,000 Decision Trees:  0.7909734999999996

Decision Tree Predictions:
 [array([0, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 1, 1, ..., 1, 0, 0]), array([1, 1, 0, ..., 1, 0, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([0, 1, 0, ..., 1, 0, 0]), array([1, 1, 0, ..., 1, 0, 0]), array([0, 1, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([0, 1, 0, ..., 0, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 1, 0, ..., 1, 0, 0]), array([0, 0, 1, ..., 1, 0, 0]), array([1, 0, 0, ..., 1, 0, 0]), array([1, 0, 1, ..., 1, 1, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([0, 0, 0, ..., 1, 0, 0]), array([1, 1, 0, ..., 0, 1, 0]), array