In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

os.chdir("..")

In [3]:
import pandas as pd

from custom_tree_classifier import CustomDecisionTreeClassifier

To illustrate, we use the `titanic.csv` dataset and try to predict whether a passenger survived (`Survived`) from their boarding class (`Pclass`) and age (`Age`).

In [4]:
df = (
    pd
    .read_csv("notebooks/data/titanic.csv")
    [["Survived", "Pclass", "Age"]]
    .dropna()
)

df.head()

|   | Survived | Pclass | Age  |
|---|----------|--------|------|
| 0 | 0        | 3      | 22.0 |
| 1 | 1        | 1      | 38.0 |
| 2 | 1        | 3      | 26.0 |
| 3 | 1        | 1      | 35.0 |
| 4 | 0        | 3      | 35.0 |


To define your own splitting metric, create a class derived from `custom_tree_classifier.metrics.MetricBase`. Your class must implement the same methods, with the same input and output types.

As a reminder, a decision tree chooses each split by optimizing (maximizing or minimizing) a measure.

*   `.compute_metric`: computes the metric for a group of observations. For example, the Gini index measures the impurity of a group from the proportion of observations in each class (0 and 1):

$$ I_{G} = 1 - p_0^2 - p_1^2 $$

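*   `.compute_delta`: computes the change in the metric produced by a split. For example, optimizing the Gini index means minimizing the weighted average of the Gini index across the child nodes ($L$ and $R$), which is equivalent to maximizing $\Delta$:

$$ \Delta = \frac{N_t}{N} \times \left( I_G - \frac{N_{t_L}}{N_t} I_{G_L} - \frac{N_{t_R}}{N_t} I_{G_R} \right) $$

Here $N$ is the total number of observations, $N_t$ the number of observations in the node being split, and $N_{t_L}$ and $N_{t_R}$ the numbers of observations in its left and right children.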

Here is an example class for the Gini index:

In [5]:
import numpy as np

from custom_tree_classifier.metrics import MetricBase


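class Gini(MetricBase):
    """Gini impurity for a binary target stored in the first column of `metric_data`."""

    @staticmethod
    def compute_metric(metric_data: np.ndarray) -> np.float64:
        # The first column of `metric_data` holds the binary target (0 or 1).
        y = metric_data[:, 0]

        # Proportion of observations in each class.
        prop0 = np.sum(y == 0) / len(y)
        prop1 = np.sum(y == 1) / len(y)

        # Gini impurity: 1 - p_0^2 - p_1^2.
        metric = 1 - (prop0**2 + prop1**2)

        return metric

    @staticmethod
    def compute_delta(
            split: np.ndarray,
            metric_data: np.ndarray
        ) -> np.float64:
        # `split` is a boolean mask over the rows of `metric_data`:
        # True rows go to one child, False rows to the other.
        weight = np.mean(split)

        # Parent impurity minus the weighted impurities of the two children.
        # The factor N_t / N from the formula is omitted here: it is constant
        # within a node, so it does not change which split is best.
        delta = (
            Gini.compute_metric(metric_data) -
            Gini.compute_metric(metric_data[split]) * weight -
            Gini.compute_metric(metric_data[~split]) * (1 - weight)
        )

        return delta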
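As a further illustration of the two-method pattern, here is a minimal sketch of how an entropy-based metric could be written. It uses standalone NumPy functions so that it runs without the package; the names `entropy_metric` and `entropy_delta` are hypothetical and not part of the library. In practice they would presumably become the `compute_metric` and `compute_delta` static methods of a `MetricBase` subclass.

```python
import numpy as np


def entropy_metric(metric_data: np.ndarray) -> np.float64:
    """Shannon entropy (in bits) of the binary labels in column 0."""
    y = metric_data[:, 0]
    if len(y) == 0:
        # Assumption: an empty group is treated as perfectly pure.
        return np.float64(0.0)
    # Class proportions; drop empty classes so 0 * log2(0) counts as 0.
    props = np.array([np.mean(y == 0), np.mean(y == 1)])
    props = props[props > 0]
    return np.float64(-np.sum(props * np.log2(props)))


def entropy_delta(split: np.ndarray, metric_data: np.ndarray) -> np.float64:
    """Information gain: parent entropy minus weighted child entropies."""
    weight = np.mean(split)  # share of rows where the mask is True
    return np.float64(
        entropy_metric(metric_data)
        - entropy_metric(metric_data[split]) * weight
        - entropy_metric(metric_data[~split]) * (1 - weight)
    )


# Toy data: two observations of each class, separated perfectly by the mask.
toy = np.array([[0], [0], [1], [1]])
perfect_split = np.array([True, True, False, False])

print(entropy_metric(toy))                # 1.0 bit for a 50/50 group
print(entropy_delta(perfect_split, toy))  # 1.0: all impurity is removed
```

As with the Gini index, a larger delta means a better split, so both metrics are optimized in the same direction.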

# CustomDecisionTreeClassifier

After instantiating the model, you can specify the metric to be considered using the `.setup_metric` method:

In [6]:
decision_tree = CustomDecisionTreeClassifier(
    max_depth=8
)

decision_tree.setup_metric(metric=Gini)

Train the model with `.fit`:

In [7]:
X = np.array(df[["Pclass", "Age"]])
y = np.array(df["Survived"])
metric_data = np.array(df[["Survived"]])

decision_tree.fit(
    X=X,
    y=y,
    metric_data=metric_data
)

Retrieve predicted probabilities with `.predict_proba` (here, on the training data):

In [8]:
X = np.array(df[["Pclass", "Age"]])

probas = decision_tree.predict_proba(
    X=X
)

probas[:5]

array([[0.77358491, 0.22641509],
       [0.27272727, 0.72727273],
       [0.77358491, 0.22641509],
       [0.        , 1.        ],
       [0.83333333, 0.16666667]])
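To turn these probabilities into class labels, one option is to take the most probable class in each row. The sketch below reuses the five rows printed above and the `Survived` values of the same five passengers from `df.head()`. It assumes that column 0 holds the probability of `Survived == 0` and column 1 the probability of `Survived == 1`, which the source does not state explicitly.

```python
import numpy as np

# First five rows of the predict_proba output shown above.
probas = np.array([
    [0.77358491, 0.22641509],
    [0.27272727, 0.72727273],
    [0.77358491, 0.22641509],
    [0.0, 1.0],
    [0.83333333, 0.16666667],
])

# Survived values of the same five passengers, taken from df.head().
y_true = np.array([0, 1, 1, 1, 0])

# Assumption: column index equals class label, so argmax gives the label.
labels = probas.argmax(axis=1)
accuracy = np.mean(labels == y_true)

print(labels)    # [0 1 0 1 0]
print(accuracy)  # 0.8
```

Because these rows were also used for training, this agreement rate says nothing about performance on unseen passengers.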

Display the classification tree using the `.print_tree` method:

In [9]:
features_names = {
    0: "Pclass",
    1: "Age"
}

decision_tree.print_tree(
    features_names=features_names,
    digits=2,
    metric_name="Gini index"
)

[1] -> Gini index = 0.48 | repartition = [424, 290]
|    Δ Gini index = +0.05
|   [2] Pclass <= 2.0 -> Gini index = 0.49 | repartition = [154, 205]
|   |    Δ Gini index = +0.03
|   |   [4] Age <= 17.0 -> Gini index = 0.16 | repartition = [3, 32]
|   |   |    Δ Gini index = +0.01
|   |   |   [8] Age <= 15.0 -> Gini index = 0.08 | repartition = [1, 24]
|   |   |   |    Δ Gini index = +0.01
|   |   |   |   [16] Pclass <= 1.0 -> Gini index = 0.28 | repartition = [1, 5]
|   |   |   |   |    Δ Gini index = +0.11
|   |   |   |   |   [32] Age <= 2.0 -> Gini index = 0.5 | repartition = [1, 1]
|   |   |   |   |   |    Δ Gini index = +0.5
|   |   |   |   |   |   [64] Age <= 0.92 -> Gini index = 0.0 | repartition = [0, 1]
|   |   |   |   |   |   [65] Age > 0.92 -> Gini index = 0.0 | repartition = [1, 0]
|   |   |   |   |   [33] Age > 2.0 -> Gini index = 0.0 | repartition = [0, 4]
|   |   |   |   [17] Pclass > 1.0 -> Gini index = 0.0 | repartition = [0, 19]
|   |   |   [9] Age > 15.0 -> Gini index
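The root split printed above can be checked by hand from the class counts. The sketch below uses only the `repartition` values of nodes `[1]` and `[2]`. The counts of the other child of the root are not visible in the output shown here, so they are derived by subtraction. `gini_from_counts` is a hypothetical helper, not a library function.

```python
import numpy as np


def gini_from_counts(counts) -> float:
    """Gini impurity computed from per-class observation counts."""
    counts = np.asarray(counts, dtype=float)
    props = counts / counts.sum()
    return float(1.0 - np.sum(props ** 2))


root = np.array([424, 290])  # node [1]: repartition = [424, 290]
left = np.array([154, 205])  # node [2]: Pclass <= 2.0
right = root - left          # other child, derived: [270, 85]

n = root.sum()
gini_root = gini_from_counts(root)
gini_left = gini_from_counts(left)
gini_right = gini_from_counts(right)

# Parent impurity minus the weighted average of the child impurities.
delta = (
    gini_root
    - left.sum() / n * gini_left
    - right.sum() / n * gini_right
)

print(f"Gini root = {gini_root:.2f}")  # 0.48, as printed for node [1]
print(f"Gini left = {gini_left:.2f}")  # 0.49, as printed for node [2]
print(f"Delta     = {delta:+.3f}")     # about +0.055
```

The delta of roughly 0.055 is consistent with the `+0.05` displayed under node `[1]` when `digits=2`.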

# CustomRandomForestClassifier

In [10]:
from custom_tree_classifier import CustomRandomForestClassifier

In [11]:
random_forest = CustomRandomForestClassifier(
    n_estimators=100,
    max_depth=5
)

random_forest.setup_metric(metric=Gini)

In [12]:
X = np.array(df[["Pclass", "Age"]])
y = np.array(df["Survived"])
metric_data = np.array(df[["Survived"]])

random_forest.fit(
    X=X,
    y=y,
    metric_data=metric_data
)

100%|██████████| 100/100 [00:01<00:00, 54.85it/s]


In [13]:
X = np.array(df[["Pclass", "Age"]])

probas = random_forest.predict_proba(
    X=X
)

probas[:5]

array([[0.70183596, 0.29816404],
       [0.46371079, 0.53628921],
       [0.69864171, 0.30135829],
       [0.43616443, 0.56383557],
       [0.64645878, 0.35354122]])