## Decision Tree

In this example, we take a look at our Decision Tree and test its performance on several datasets, comparing it to scikit-learn's Decision Tree on the same datasets. Six datasets are used in total, three for classification and three for regression, each set ordered by increasing complexity.

First we load the necessary datasets and split them into training and testing sets:

In [1]:
# Load modules
from models.decision_tree import DecisionTreeClassifier as OwnDecisionTreeClassifier, DecisionTreeRegressor as OwnDecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier as SklearnDecisionTreeClassifier, DecisionTreeRegressor as SklearnDecisionTreeRegressor

from utils.reports import evaluate_classification, evaluate_regression
from utils.grid_search_cv import GridSearchCV
from sklearn.model_selection import train_test_split, ParameterGrid

# set up a list of hyperparameters to search over for all trees
params = {
    'max_depth': [1, 3, 5, 7, 9, 11],
    'min_samples_split': [2, 3, 5, 7],
    'min_samples_leaf': [1, 3, 5, 7]
}
param_grid = list(ParameterGrid(params))

# Load datasets
# one easy, one medium and one hard dataset each for classification and regression
from sklearn.datasets import load_iris, load_breast_cancer, load_digits
from sklearn.datasets import load_diabetes, fetch_california_housing

# diamonds dataset is a very large regression dataset which will test the efficiency of the algorithms
from datasets.diamonds import load_diamonds


ds_c_easy = load_iris()
X, Y = ds_c_easy.data, ds_c_easy.target
X_c_easy_train, X_c_easy_test, Y_c_easy_train, Y_c_easy_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_c_medium = load_breast_cancer()
X, Y = ds_c_medium.data, ds_c_medium.target
X_c_medium_train, X_c_medium_test, Y_c_medium_train, Y_c_medium_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_c_hard = load_digits()
X, Y = ds_c_hard.data, ds_c_hard.target
X_c_hard_train, X_c_hard_test, Y_c_hard_train, Y_c_hard_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_easy = fetch_california_housing()
X, Y = ds_r_easy.data, ds_r_easy.target
X_r_easy_train, X_r_easy_test, Y_r_easy_train, Y_r_easy_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_medium = load_diabetes()
X, Y = ds_r_medium.data, ds_r_medium.target
X_r_medium_train, X_r_medium_test, Y_r_medium_train, Y_r_medium_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_hard = load_diamonds()
X, Y = ds_r_hard.data, ds_r_hard.target
X_r_hard_train, X_r_hard_test, Y_r_hard_train, Y_r_hard_test = train_test_split(X, Y, test_size=0.2, random_state=42)
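
The `params` dictionary above spans 6 × 4 × 4 = 96 combinations, and the regression cells further down quote "GridSearch optimal params" in their comments. The search itself is not shown in this notebook, and the interface of `utils.grid_search_cv.GridSearchCV` is not documented here, so the following is only a sketch of how such a search could look using scikit-learn's own `GridSearchCV` on the diabetes data. The scoring metric, the five folds and the `random_state` are assumptions, and the selected parameters need not match the ones quoted later.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same hyperparameter grid as in the notebook
params = {
    "max_depth": [1, 3, 5, 7, 9, 11],
    "min_samples_split": [2, 3, 5, 7],
    "min_samples_leaf": [1, 3, 5, 7],
}

X, Y = load_diabetes(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Assumption: 5-fold cross-validation scored by (negated) mean squared error
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=params,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, Y_train)  # only the training split is used for the search

best_params = search.best_params_
print(best_params)
```

Searching on the training split only keeps the test split untouched for the final evaluation.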

### Decision Tree Classifier
First we will look at our own implementation:

In [2]:
dt_classifier = OwnDecisionTreeClassifier(max_depth=5)

dt_classifier.fit(X_c_easy_train, Y_c_easy_train)
Y_c_easy_pred = dt_classifier.predict(X_c_easy_test)

evaluate_classification(Y_c_easy_test, Y_c_easy_pred)

Precision: 1.00, Recall: 1.00, F1-Score: 1.00


As we can see, our model achieves perfect precision, recall and F1 on the Iris dataset (this is to be expected, as the dataset is very simple).
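
The three numbers printed above are precision, recall and F1, which are related to but not the same as accuracy. How `evaluate_classification` averages them over several classes is not shown in this notebook. The sketch below assumes macro averaging (an unweighted mean over classes) and uses a hypothetical helper `macro_scores` on made-up labels to show how the values are derived.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    n = len(labels)
    # Unweighted mean of each metric over all classes
    return tuple(sum(scores[i] for scores in per_class) / n for i in range(3))


# Made-up example: one sample of class 1 is mistaken for class 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

precision, recall, f1 = macro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, "
      f"F1-Score: {f1:.2f}, Accuracy: {accuracy:.2f}")
```

For this toy input the sketch prints a precision of 0.89, a recall of 0.83, an F1-Score of 0.82 and an accuracy of 0.83, so a single misclassification already moves the metrics apart.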

In [3]:
dt_classifier = SklearnDecisionTreeClassifier(max_depth=5)
dt_classifier.fit(X_c_easy_train, Y_c_easy_train)
Y_c_easy_pred = dt_classifier.predict(X_c_easy_test)

evaluate_classification(Y_c_easy_test, Y_c_easy_pred)

Precision: 1.00, Recall: 1.00, F1-Score: 1.00


Unsurprisingly, scikit-learn's Decision Tree also achieves perfect scores here. Moving on to the Breast Cancer dataset:

In [4]:
dt_classifier = OwnDecisionTreeClassifier()

dt_classifier.fit(X_c_medium_train, Y_c_medium_train)
Y_c_medium_pred = dt_classifier.predict(X_c_medium_test)

evaluate_classification(Y_c_medium_test, Y_c_medium_pred)

Precision: 0.96, Recall: 1.00, F1-Score: 0.98


In [5]:
dt_classifier = SklearnDecisionTreeClassifier()

dt_classifier.fit(X_c_medium_train, Y_c_medium_train)
Y_c_medium_pred = dt_classifier.predict(X_c_medium_test)

evaluate_classification(Y_c_medium_test, Y_c_medium_pred)

Precision: 0.96, Recall: 0.96, F1-Score: 0.96


Here, while not perfect, both models still score very high. Both reach a precision of 0.96, and our model even slightly edges out scikit-learn's implementation in recall (1.00 versus 0.96) and therefore in F1 (0.98 versus 0.96). Finally, the digits dataset. This one is trickier, since we are now dealing with ten classes instead of two or three.

In [6]:
dt_classifier = OwnDecisionTreeClassifier()

dt_classifier.fit(X_c_hard_train, Y_c_hard_train)
Y_c_hard_pred = dt_classifier.predict(X_c_hard_test)

evaluate_classification(Y_c_hard_test, Y_c_hard_pred)

Precision: 0.88, Recall: 0.87, F1-Score: 0.87


In [7]:
dt_classifier = SklearnDecisionTreeClassifier()

dt_classifier.fit(X_c_hard_train, Y_c_hard_train)
Y_c_hard_pred = dt_classifier.predict(X_c_hard_test)

evaluate_classification(Y_c_hard_test, Y_c_hard_pred)

Precision: 0.86, Recall: 0.85, F1-Score: 0.85


We can immediately see that both models take a considerable hit in all three scores. Given the complexity of the dataset, this is to be expected. However, our implementation stays within two percentage points of sklearn's de-facto standard decision tree, and on this split it even scores slightly higher (F1 of 0.87 versus 0.85). This indicates that the issue doesn't lie in our implementation, but rather that we are reaching the limits of what a single decision tree can achieve on this dataset. After taking a look at the regression datasets, we will see how our ensemble methods can help us improve on that.
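
A gap of one or two percentage points between two trees should be read with some care. scikit-learn's tree permutes the features at every split, so ties between equally good splits can be broken differently from run to run unless `random_state` is fixed. The sketch below gives a rough idea of that run-to-run spread on the digits split used above. The macro-averaged F1 is an assumption about what `evaluate_classification` reports, and our own implementation may well be deterministic.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Fit the same model with different seeds and collect the test F1 of each run
scores = []
for seed in range(5):
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_train, Y_train)
    Y_pred = tree.predict(X_test)
    scores.append(f1_score(Y_test, Y_pred, average="macro"))

spread = max(scores) - min(scores)
print("F1 per seed:", [round(s, 3) for s in scores])
print("Spread between best and worst seed:", round(spread, 3))
```

If the spread is of the same order as the gap between the two implementations, the gap alone does not show that one implementation is better.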

### Decision Tree Regressor
Easy (California Housing):

In [None]:
# GridSearch optimal params: {'max_depth': 11, 'min_samples_leaf': 7, 'min_samples_split': 2}
dt_regressor = OwnDecisionTreeRegressor(max_depth=11, min_samples_leaf=7, min_samples_split=2)

dt_regressor.fit(X_r_easy_train, Y_r_easy_train)
Y_r_easy_pred = dt_regressor.predict(X_r_easy_test)

evaluate_regression(Y_r_easy_test, Y_r_easy_pred)

MAE: 0.41, MSE: 0.38, R²: 0.71


In [29]:
# GridSearch optimal params: {'max_depth': 11, 'min_samples_leaf': 7, 'min_samples_split': 2}
dt_regressor = SklearnDecisionTreeRegressor(max_depth=11, min_samples_leaf=7, min_samples_split=2)

dt_regressor.fit(X_r_easy_train, Y_r_easy_train)
Y_r_easy_pred = dt_regressor.predict(X_r_easy_test)

evaluate_regression(Y_r_easy_test, Y_r_easy_pred)

MAE: 0.41, MSE: 0.37, R²: 0.72


Medium (Diabetes):

In [30]:
# GridSearch optimal params: {'max_depth': 5, 'min_samples_leaf': 7, 'min_samples_split': 2}
dt_regressor = OwnDecisionTreeRegressor(max_depth=5, min_samples_leaf=7, min_samples_split=2)

dt_regressor.fit(X_r_medium_train, Y_r_medium_train)
Y_r_medium_pred = dt_regressor.predict(X_r_medium_test)

evaluate_regression(Y_r_medium_test, Y_r_medium_pred)

MAE: 45.20, MSE: 3257.41, R²: 0.39


In [34]:
# GridSearch optimal params: {'max_depth': 5, 'min_samples_leaf': 7, 'min_samples_split': 2}
dt_regressor = SklearnDecisionTreeRegressor(max_depth=5, min_samples_leaf=7, min_samples_split=2)

dt_regressor.fit(X_r_medium_train, Y_r_medium_train)
Y_r_medium_pred = dt_regressor.predict(X_r_medium_test)

evaluate_regression(Y_r_medium_test, Y_r_medium_pred)

MAE: 41.86, MSE: 2810.52, R²: 0.47


Hard (Diamonds):

In [39]:
# GridSearch optimal params: {'max_depth': 11, 'min_samples_leaf': 5, 'min_samples_split': 2}
dt_regressor = OwnDecisionTreeRegressor(max_depth=11, min_samples_leaf=5, min_samples_split=2)

dt_regressor.fit(X_r_hard_train, Y_r_hard_train)
Y_r_hard_pred = dt_regressor.predict(X_r_hard_test)

evaluate_regression(Y_r_hard_test, Y_r_hard_pred)

MAE: 321.93, MSE: 364038.26, R²: 0.98


In [41]:
# GridSearch optimal params: {'max_depth': 11, 'min_samples_leaf': 5, 'min_samples_split': 2}
dt_regressor = SklearnDecisionTreeRegressor(max_depth=11, min_samples_leaf=5, min_samples_split=2)

dt_regressor.fit(X_r_hard_train, Y_r_hard_train)
Y_r_hard_pred = dt_regressor.predict(X_r_hard_test)

evaluate_regression(Y_r_hard_test, Y_r_hard_pred)

MAE: 320.34, MSE: 378656.13, R²: 0.98
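
When comparing the regression outputs above across datasets, note that MAE and MSE are expressed in the units (and squared units) of the target, while R² is scale-free. This is why the diamonds errors look huge next to the California Housing ones even though the R² is higher. The sketch below assumes that `evaluate_regression` follows the conventional definitions (its implementation is not shown in this notebook) and uses a hypothetical helper `regression_scores` on made-up values.

```python
def regression_scores(y_true, y_pred):
    """Return (MAE, MSE, R^2) using the conventional definitions.

    Assumes y_true is not constant, otherwise R^2 is undefined.
    """
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot                       # 1 - SS_res / SS_tot
    return mae, mse, r2


# Made-up targets and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae, mse, r2 = regression_scores(y_true, y_pred)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, R²: {r2:.2f}")

# The same data with the target expressed in units 1000 times smaller
scale = 1000.0
mae_s, mse_s, r2_s = regression_scores(
    [scale * t for t in y_true], [scale * p for p in y_pred]
)
print(f"MAE: {mae_s:.2f}, MSE: {mse_s:.2f}, R²: {r2_s:.2f}")
```

Rescaling the target multiplies MAE by the scale factor and MSE by its square, while R² stays the same, so R² is the metric to use when comparing results between datasets.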
