### Decision Tree

In this example, we take a look at the Decision Tree and test its performance on several datasets, comparing it to scikit-learn's Decision Tree on the same datasets. We use five datasets in total, three for classification and two for regression, with increasing complexity.

In [1]:
# Load modules
from models.decision_tree import DecisionTreeClassifier as OwnDecisionTreeClassifier, DecisionTreeRegressor as OwnDecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier as SklearnDecisionTreeClassifier, DecisionTreeRegressor as SklearnDecisionTreeRegressor

from sklearn.model_selection import train_test_split

from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error
from sklearn.metrics import accuracy_score, precision_score, recall_score

First, we load the necessary datasets and split them into training and testing sets.

In [2]:
# Load datasets
# classification: one easy, one medium, one hard; regression: one easy, one medium
from sklearn.datasets import load_iris, load_breast_cancer, load_digits
from sklearn.datasets import load_diabetes, fetch_california_housing


ds_c_easy = load_iris()
X, Y = ds_c_easy.data, ds_c_easy.target
X_c_easy_train, X_c_easy_test, Y_c_easy_train, Y_c_easy_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_c_medium = load_breast_cancer()
X, Y = ds_c_medium.data, ds_c_medium.target 
X_c_medium_train, X_c_medium_test, Y_c_medium_train, Y_c_medium_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_c_hard = load_digits()
X, Y = ds_c_hard.data, ds_c_hard.target
X_c_hard_train, X_c_hard_test, Y_c_hard_train, Y_c_hard_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_easy = fetch_california_housing()
X, Y = ds_r_easy.data, ds_r_easy.target
X_r_easy_train, X_r_easy_test, Y_r_easy_train, Y_r_easy_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_medium = load_diabetes()
X, Y = ds_r_medium.data, ds_r_medium.target
X_r_medium_train, X_r_medium_test, Y_r_medium_train, Y_r_medium_test = train_test_split(X, Y, test_size=0.2, random_state=42)
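
The `support` totals in the classification reports further down follow directly from this split. As a quick sanity check, the sketch below reproduces the test-set sizes by hand. It assumes the documented sample counts of the three bundled classification datasets and that a fractional `test_size` is rounded up to a whole number of samples; `expected_test_size` is a hypothetical helper written for illustration, not part of either library.

```python
import math

# Sample counts of the bundled scikit-learn datasets (assumed from their documentation).
n_samples = {"iris": 150, "breast_cancer": 569, "digits": 1797}


def expected_test_size(n, test_size=0.2):
    """Number of test samples for a fractional test_size, rounded up."""
    return math.ceil(test_size * n)


test_sizes = {name: expected_test_size(n) for name, n in n_samples.items()}

for name, n in n_samples.items():
    print(f"{name}: {n - test_sizes[name]} train / {test_sizes[name]} test")
```

Under these assumptions the test sets hold 30, 114, and 360 samples, which matches the `support` totals printed in the reports.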


## Decision Tree Classifier

First we will look at our own implementation:

In [3]:
dt_classifier = OwnDecisionTreeClassifier(max_depth=5)

dt_classifier.fit(X_c_easy_train, Y_c_easy_train)
Y_c_easy_pred = dt_classifier.predict(X_c_easy_test)

print(classification_report(Y_c_easy_test, Y_c_easy_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



As we can see, our model achieves perfect accuracy on the Iris dataset. This is to be expected, as the dataset is very simple.

In [4]:
dt_classifier = SklearnDecisionTreeClassifier(max_depth=5)
dt_classifier.fit(X_c_easy_train, Y_c_easy_train)
Y_c_easy_pred = dt_classifier.predict(X_c_easy_test)

print(classification_report(Y_c_easy_test, Y_c_easy_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



Unsurprisingly, scikit-learn's Decision Tree also achieves perfect accuracy here. Moving on to the Breast Cancer dataset:

In [5]:
dt_classifier = OwnDecisionTreeClassifier(max_depth=5)

dt_classifier.fit(X_c_medium_train, Y_c_medium_train)
Y_c_medium_pred = dt_classifier.predict(X_c_medium_test)

print(classification_report(Y_c_medium_test, Y_c_medium_pred))

              precision    recall  f1-score   support

           0       0.95      0.88      0.92        43
           1       0.93      0.97      0.95        71

    accuracy                           0.94       114
   macro avg       0.94      0.93      0.93       114
weighted avg       0.94      0.94      0.94       114



In [6]:
dt_classifier = SklearnDecisionTreeClassifier(max_depth=5)

dt_classifier.fit(X_c_medium_train, Y_c_medium_train)
Y_c_medium_pred = dt_classifier.predict(X_c_medium_test)

print(classification_report(Y_c_medium_test, Y_c_medium_pred))

              precision    recall  f1-score   support

           0       0.93      0.91      0.92        43
           1       0.94      0.96      0.95        71

    accuracy                           0.94       114
   macro avg       0.94      0.93      0.93       114
weighted avg       0.94      0.94      0.94       114



Here, while not perfect, both models achieve a similarly high accuracy of 94%. Finally, the digits dataset. This is particularly tricky since we are now dealing with ten classes instead of two or three.

In [7]:
dt_classifier = OwnDecisionTreeClassifier(max_depth=5)

dt_classifier.fit(X_c_hard_train, Y_c_hard_train)
Y_c_hard_pred = dt_classifier.predict(X_c_hard_test)

print(classification_report(Y_c_hard_test, Y_c_hard_pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.96        33
           1       0.21      0.32      0.25        28
           2       0.61      0.67      0.64        33
           3       0.82      0.82      0.82        34
           4       0.81      0.76      0.79        46
           5       0.71      0.32      0.44        47
           6       0.94      0.86      0.90        35
           7       0.92      0.71      0.80        34
           8       0.41      0.57      0.48        30
           9       0.56      0.70      0.62        40

    accuracy                           0.67       360
   macro avg       0.69      0.67      0.67       360
weighted avg       0.71      0.67      0.67       360
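
The summary rows of the report above can be recomputed from the per-class rows, which helps when reading the digits results: `macro avg` treats every class equally, `weighted avg` weights each class by its `support`, and the support-weighted recall equals the overall accuracy. The following is a minimal sketch that uses the rounded values printed above, so the results are only accurate to about two decimals.

```python
# Per-class precision, recall, and support copied from the report above
# (values are rounded to two decimals, so the results are approximate).
precision = [0.94, 0.21, 0.61, 0.82, 0.81, 0.71, 0.94, 0.92, 0.41, 0.56]
recall = [0.97, 0.32, 0.67, 0.82, 0.76, 0.32, 0.86, 0.71, 0.57, 0.70]
support = [33, 28, 33, 34, 46, 47, 35, 34, 30, 40]

total = sum(support)

# Macro average: plain mean over classes, ignoring class sizes.
macro_precision = sum(precision) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total

# recall * support is the number of correctly classified samples per class,
# so the support-weighted recall is the overall accuracy.
accuracy = sum(r * s for r, s in zip(recall, support)) / total

print(f"macro precision:    {macro_precision:.2f}")
print(f"weighted precision: {weighted_precision:.2f}")
print(f"accuracy:           {accuracy:.2f}")
```

The weighted precision ends up slightly above the macro precision because the weakest class (digit 1) is also the smallest one in the test set.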



In [8]:
dt_classifier = SklearnDecisionTreeClassifier(max_depth=5)

dt_classifier.fit(X_c_hard_train, Y_c_hard_train)
Y_c_hard_pred = dt_classifier.predict(X_c_hard_test)

print(classification_report(Y_c_hard_test, Y_c_hard_pred))

              precision    recall  f1-score   support

           0       1.00      0.88      0.94        33
           1       0.44      0.25      0.32        28
           2       0.58      0.21      0.31        33
           3       0.42      0.82      0.55        34
           4       0.81      0.85      0.83        46
           5       0.98      0.91      0.95        47
           6       0.94      0.91      0.93        35
           7       0.92      0.65      0.76        34
           8       0.31      0.73      0.44        30
           9       0.87      0.33      0.47        40

    accuracy                           0.67       360
   macro avg       0.73      0.65      0.65       360
weighted avg       0.75      0.67      0.67       360
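
Although the overall accuracy of the two digits reports is the same, the per-class rows differ noticeably. The sketch below compares the f1-scores of the two reports digit by digit; the values are copied from the outputs above, and the 0.10 threshold is an arbitrary choice made for illustration.

```python
# Per-class f1-scores copied from the two reports above (rounded to two decimals).
f1_own = [0.96, 0.25, 0.64, 0.82, 0.79, 0.44, 0.90, 0.80, 0.48, 0.62]
f1_sklearn = [0.94, 0.32, 0.31, 0.55, 0.83, 0.95, 0.93, 0.76, 0.44, 0.47]

# Positive difference: our own tree scores higher on that digit.
diffs = {
    digit: round(own - ref, 2)
    for digit, (own, ref) in enumerate(zip(f1_own, f1_sklearn))
}

# Digits where the two trees differ by at least 0.10 in f1-score.
large_gaps = {digit: d for digit, d in diffs.items() if abs(d) >= 0.10}
print(large_gaps)
```

With these numbers, the larger gaps fall on digits 2, 3, 5, and 9: our own tree is ahead on 2, 3, and 9, while scikit-learn's tree is clearly ahead on 5. The two trees therefore make different mistakes even though they are right about equally often.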



We can immediately see that both models take a considerable hit in accuracy. Given the complexity of the dataset, this is to be expected. However, the de facto standard implementation of a decision tree by sklearn does not outperform our own implementation: both reach a rounded accuracy of 67%. This indicates that the issue doesn't lie in our implementation, but rather that we are reaching the limits of what a single decision tree limited to a depth of 5 can achieve on this dataset. Let's see how our ensemble methods can improve on this.

In [9]:
from models.decision_tree import DecisionTreeClassifier as OwnDecisionTreeClassifier
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.datasets import load_digits
from models.grid_search_cv import GridSearchCV

ds_c_hard = load_digits()
X, Y = ds_c_hard.data, ds_c_hard.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)


params = {
    'max_depth': [7, 9, 11],
    'min_samples_split': [1, 2, 3, 4],
    'min_samples_leaf': [1, 2, 3]
}

param_grid = list(ParameterGrid(params))

grid_search = GridSearchCV(OwnDecisionTreeClassifier, param_grid, cv=5)


grid_search.fit(X_train, Y_train)

print(grid_search.best_params)

{'max_depth': 9, 'min_samples_leaf': 1, 'min_samples_split': 3}
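
To get a feel for the cost of this search, the sketch below enumerates the same grid with the standard library and counts the resulting model fits. It assumes that the custom `GridSearchCV` trains one model per parameter combination and fold, which is not confirmed by the code shown here; the variable names are chosen for illustration.

```python
from itertools import product

# Same search space as in the cell above.
params = {
    "max_depth": [7, 9, 11],
    "min_samples_split": [1, 2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
}
cv = 5

# Enumerate every combination, iterating over the keys in sorted order.
keys = sorted(params)
combinations = [
    dict(zip(keys, values))
    for values in product(*(params[key] for key in keys))
]

# Assumption: one fit per combination and fold.
n_fits = len(combinations) * cv
print(f"{len(combinations)} combinations x {cv} folds = {n_fits} fits")
```

The grid contains 3 x 4 x 3 = 36 combinations, so a 5-fold search amounts to 180 fits under this assumption. Each added parameter value multiplies that number, which is worth keeping in mind before widening the grid.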
