# Decision Trees

A decision tree is a map of the possible outcomes of a series of related decisions. It lets an individual or an organization compare possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used to guide an informal brainstorming session or to lay out an algorithm that mathematically predicts the best option.

## Case study

### Deciding whether a COVID-19 patient needs an ICU bed based on their medical data

These data include the patient's calcium, glucose, hemoglobin, and lymphocyte levels, among others.

This will help clinics prioritize admission of the highest-risk patients to their limited ICU beds.

### Import the required libraries

In [60]:
import pandas as pd  # needed by pd.read_csv in the next step
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

## STEP 0: Data collection

Read the dataset.

In [75]:
uci_df_og = pd.read_csv("Kaggle_Sirio_Libanes_ICU_Prediction.csv",sep=";")

## STEP 1: Data preparation

Keep only the records from the `ABOVE_12` window, that is, more than 12 hours after the patient was admitted.

In [82]:
# Keep only the records from the ABOVE_12 window (more than 12 hours after admission)
uci_df = uci_df_og.loc[uci_df_og['WINDOW'] == 'ABOVE_12']

# Select only the columns needed to solve the problem
columns = ["CALCIUM_MEAN","GLUCOSE_MEAN","HEMOGLOBIN_MEAN","LINFOCITOS_MEAN","PH_ARTERIAL_MEAN","OXYGEN_SATURATION_MEAN","HEART_RATE_MEAN","ICU"]
uci_df = uci_df[columns]

# Drop the rows with null values
uci_df_dropna = uci_df.dropna()

In [85]:
uci_df_dropna.head()

Unnamed: 0,CALCIUM_MEAN,GLUCOSE_MEAN,HEMOGLOBIN_MEAN,LINFOCITOS_MEAN,PH_ARTERIAL_MEAN,OXYGEN_SATURATION_MEAN,HEART_RATE_MEAN,ICU
4,0.326531,-0.891993,-0.353659,-0.643154,0.574468,0.665932,-0.213031,1
9,0.530612,-0.891993,-0.719512,-0.717842,0.361702,0.841977,-0.141163,1
14,0.367347,-0.891993,0.0,-0.73029,0.234043,0.797149,-0.28066,1
19,0.326531,-0.891993,-0.219512,-0.8361,0.234043,0.694035,-0.270189,0
24,0.357143,-0.891993,0.292683,-0.537344,0.234043,0.820327,0.051399,0


## STEP 2: Data splitting

In [96]:
from sklearn.model_selection import train_test_split

training_data, testing_data = train_test_split(uci_df_dropna, test_size=0.3, random_state=25)

print(f"No. of training examples: {training_data.shape[0]}")
print(f"No. of testing examples: {testing_data.shape[0]}")

No. of training examples: 261
No. of testing examples: 113
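The split above is purely random. With a binary target such as `ICU`, one common option is to stratify on the label so that both partitions keep a similar class balance. The sketch below illustrates this with `train_test_split(..., stratify=...)`. The frame `demo_df` and its random values are invented stand-ins (an assumption, not the real dataset). Only the row count of 374 is taken from the output above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for uci_df_dropna: 374 rows with random values
rng = np.random.default_rng(25)
demo_df = pd.DataFrame({
    "CALCIUM_MEAN": rng.uniform(-1, 1, 374),
    "HEART_RATE_MEAN": rng.uniform(-1, 1, 374),
    "ICU": rng.integers(0, 2, 374),
})

# stratify keeps the share of ICU == 1 nearly identical in both partitions
train_df, test_df = train_test_split(
    demo_df, test_size=0.3, random_state=25, stratify=demo_df["ICU"]
)

print(f"Training rows: {len(train_df)}, ICU rate: {train_df['ICU'].mean():.3f}")
print(f"Testing rows:  {len(test_df)}, ICU rate: {test_df['ICU'].mean():.3f}")
```

Whether stratification is worth it here depends on how unbalanced the real `ICU` column is. With roughly balanced classes the effect is small.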


In [90]:
X.head()

Unnamed: 0,CALCIUM_MEAN,GLUCOSE_MEAN,HEMOGLOBIN_MEAN,LINFOCITOS_MEAN,PH_ARTERIAL_MEAN,OXYGEN_SATURATION_MEAN,HEART_RATE_MEAN
4,0.326531,-0.891993,-0.353659,-0.643154,0.574468,0.665932,-0.213031
9,0.530612,-0.891993,-0.719512,-0.717842,0.361702,0.841977,-0.141163
14,0.367347,-0.891993,0.0,-0.73029,0.234043,0.797149,-0.28066
19,0.326531,-0.891993,-0.219512,-0.8361,0.234043,0.694035,-0.270189
24,0.357143,-0.891993,0.292683,-0.537344,0.234043,0.820327,0.051399


In [91]:
Y

array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
       0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1,

## STEP 3: Applying the model

In [98]:
# Define the target column and split the dataset into features (X) and target (Y)
target_column = 'ICU'
X = training_data.drop(['ICU'], axis=1)
Y = training_data[target_column].values

# Define the hyperparameters
hp = {
 "max_depth": 10,
 "min_samples_split": 1
}
root = Node(Y, X, **hp)

In [None]:
Applying the model: a decision tree built from scratch
Results: precision-recall AUC and ROC AUC as primary metrics; precision, recall, and accuracy as secondary metrics.
Evaluation: the results of the proposed model are compared against scikit-learn's DecisionTreeClassifier
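The metrics listed in the plan above can all be computed with `sklearn.metrics`. The following is a minimal sketch on synthetic data, not the notebook's actual evaluation. The data from `make_classification`, the `_demo` names, and the choice of `average_precision_score` as the precision-recall AUC estimate are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 374 rows, 7 features, binary target
X_demo, y_demo = make_classification(
    n_samples=374, n_features=7, n_informative=4, random_state=25
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=25
)

clf = DecisionTreeClassifier(max_depth=10, random_state=25).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)               # hard class labels
y_score = clf.predict_proba(X_te)[:, 1]  # estimated probability of class 1

metrics = {
    # Primary metrics use the scores, not the hard labels
    "pr_auc": average_precision_score(y_te, y_score),
    "roc_auc": roc_auc_score(y_te, y_score),
    # Secondary metrics use the hard labels
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),
    "accuracy": accuracy_score(y_te, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The from-scratch `Node` class only returns hard labels through `predict`, so computing the two AUC metrics for it would require exposing a score per leaf, for example the share of class 1 in the leaf.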

In [5]:
# Data wrangling 
import pandas as pd 

# Array math
import numpy as np 

# Quick value count calculator
from collections import Counter


class Node: 
    """
    Class for creating the nodes for a decision tree 
    """
    def __init__(
        self, 
        Y: list,
        X: pd.DataFrame,
        min_samples_split=None,
        max_depth=None,
        depth=None,
        node_type=None,
        rule=None
    ):
        # Saving the data to the node 
        self.Y = Y 
        self.X = X

        # Saving the hyper parameters
        self.min_samples_split = min_samples_split if min_samples_split else 20
        self.max_depth = max_depth if max_depth else 5

        # Default current depth of node 
        self.depth = depth if depth else 0

        # Extracting all the features
        self.features = list(self.X.columns)

        # Type of node 
        self.node_type = node_type if node_type else 'root'

        # Rule for splitting 
        self.rule = rule if rule else ""

        # Calculating the counts of Y in the node 
        self.counts = Counter(Y)

        # Getting the GINI impurity based on the Y distribution
        self.gini_impurity = self.get_GINI()

        # Sorting the counts and saving the final prediction of the node 
        counts_sorted = list(sorted(self.counts.items(), key=lambda item: item[1]))

        # Getting the last item
        yhat = None
        if len(counts_sorted) > 0:
            yhat = counts_sorted[-1][0]

        # Saving to object attribute. This node predicts the most frequent class among its observations
        self.yhat = yhat 

        # Saving the number of observations in the node 
        self.n = len(Y)

        # Initiating the left and right nodes as empty nodes
        self.left = None 
        self.right = None 

        # Default values for splits
        self.best_feature = None 
        self.best_value = None 

    @staticmethod
    def GINI_impurity(y1_count: int, y2_count: int) -> float:
        """
        Given the observations of a binary class calculate the GINI impurity
        """
        # Ensuring the correct types
        if y1_count is None:
            y1_count = 0

        if y2_count is None:
            y2_count = 0

        # Getting the total observations
        n = y1_count + y2_count
        
        # If n is 0 then we return the lowest possible gini impurity
        if n == 0:
            return 0.0

        # Getting the probability to see each of the classes
        p1 = y1_count / n
        p2 = y2_count / n
        
        # Calculating GINI 
        gini = 1 - (p1 ** 2 + p2 ** 2)
        
        # Returning the gini impurity
        return gini

    @staticmethod
    def ma(x: np.array, window: int) -> np.array:
        """
        Calculates the moving average of the given list. 
        """
        return np.convolve(x, np.ones(window), 'valid') / window

    def get_GINI(self):
        """
        Function to calculate the GINI impurity of a node 
        """
        # Getting the 0 and 1 counts
        y1_count, y2_count = self.counts.get(0, 0), self.counts.get(1, 0)

        # Getting the GINI impurity
        return self.GINI_impurity(y1_count, y2_count)

    def best_split(self) -> tuple:
        """
        Given the X features and Y targets calculates the best split 
        for a decision tree
        """
        # Creating a dataset for splitting
        df = self.X.copy()
        df['Y'] = self.Y

        # Getting the GINI impurity for the base input 
        GINI_base = self.get_GINI()

        # Finding which split yields the best GINI gain 
        max_gain = 0

        # Default best feature and split
        best_feature = None
        best_value = None

        for feature in self.features:
            # Dropping missing values
            Xdf = df.dropna().sort_values(feature)

            # Sorting the values and getting the rolling average
            xmeans = self.ma(Xdf[feature].unique(), 2)

            for value in xmeans:
                # Splitting the dataset 
                left_counts = Counter(Xdf[Xdf[feature]<value]['Y'])
                right_counts = Counter(Xdf[Xdf[feature]>=value]['Y'])

                # Getting the Y distribution from the dicts
                y0_left, y1_left, y0_right, y1_right = left_counts.get(0, 0), left_counts.get(1, 0), right_counts.get(0, 0), right_counts.get(1, 0)

                # Getting the left and right gini impurities
                gini_left = self.GINI_impurity(y0_left, y1_left)
                gini_right = self.GINI_impurity(y0_right, y1_right)

                # Getting the obs count from the left and the right data splits
                n_left = y0_left + y1_left
                n_right = y0_right + y1_right

                # Calculating the weights for each of the nodes
                w_left = n_left / (n_left + n_right)
                w_right = n_right / (n_left + n_right)

                # Calculating the weighted GINI impurity
                wGINI = w_left * gini_left + w_right * gini_right

                # Calculating the GINI gain 
                GINIgain = GINI_base - wGINI

                # Checking if this is the best split so far 
                if GINIgain > max_gain:
                    best_feature = feature
                    best_value = value 

                    # Setting the best gain to the current one 
                    max_gain = GINIgain

        return (best_feature, best_value)

    def grow_tree(self):
        """
        Recursive method to create the decision tree
        """
        # Making a df from the data 
        df = self.X.copy()
        df['Y'] = self.Y

        # Split further only while the depth and sample-size limits allow it 
        if (self.depth < self.max_depth) and (self.n >= self.min_samples_split):

            # Getting the best split 
            best_feature, best_value = self.best_split()

            if best_feature is not None:
                # Saving the best split to the current node 
                self.best_feature = best_feature
                self.best_value = best_value

                # Getting the left and right nodes
                left_df, right_df = df[df[best_feature]<=best_value].copy(), df[df[best_feature]>best_value].copy()

                # Creating the left and right nodes
                left = Node(
                    left_df['Y'].values.tolist(), 
                    left_df[self.features], 
                    depth=self.depth + 1, 
                    max_depth=self.max_depth, 
                    min_samples_split=self.min_samples_split, 
                    node_type='left_node',
                    rule=f"{best_feature} <= {round(best_value, 3)}"
                    )

                self.left = left 
                self.left.grow_tree()

                right = Node(
                    right_df['Y'].values.tolist(), 
                    right_df[self.features], 
                    depth=self.depth + 1, 
                    max_depth=self.max_depth, 
                    min_samples_split=self.min_samples_split,
                    node_type='right_node',
                    rule=f"{best_feature} > {round(best_value, 3)}"
                    )

                self.right = right
                self.right.grow_tree()

    def print_info(self, width=4):
        """
        Method to print the information about the tree
        """
        # Defining the number of spaces 
        const = int(self.depth * width ** 1.5)
        spaces = "-" * const
        
        if self.node_type == 'root':
            print("Root")
        else:
            print(f"|{spaces} Split rule: {self.rule}")
        print(f"{' ' * const}   | GINI impurity of the node: {round(self.gini_impurity, 2)}")
        print(f"{' ' * const}   | Class distribution in the node: {dict(self.counts)}")
        print(f"{' ' * const}   | Predicted class: {self.yhat}")   

    def print_tree(self):
        """
        Prints the whole tree from the current node to the bottom
        """
        self.print_info() 
        
        if self.left is not None: 
            self.left.print_tree()
        
        if self.right is not None:
            self.right.print_tree()

    def predict(self, X:pd.DataFrame):
        """
        Batch prediction method
        """
        predictions = []

        for _, x in X.iterrows():
            values = {}
            for feature in self.features:
                values.update({feature: x[feature]})
        
            predictions.append(self.predict_obs(values))
        
        return predictions

    def predict_obs(self, values: dict) -> int:
        """
        Method to predict the class given a set of features
        """
        cur_node = self
        while cur_node.depth < cur_node.max_depth:
            # Traversing the nodes all the way to the bottom
            best_feature = cur_node.best_feature
            best_value = cur_node.best_value

            # Stop at nodes that were never split (leaves) or are too small
            if best_feature is None or cur_node.n < cur_node.min_samples_split:
                break

            # Same rule as grow_tree: left branch is <=, right branch is >
            if values.get(best_feature) <= best_value:
                next_node = cur_node.left
            else:
                next_node = cur_node.right

            # Defensive stop if the chosen child does not exist
            if next_node is None:
                break
            cur_node = next_node

        return cur_node.yhat
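
For reference, the quantities that `GINI_impurity` and `best_split` compute can be written out as follows. The symbols ($n_0$, $n_1$, $G_L$, $G_R$, $\Delta G$) are notation introduced here for illustration. They mirror `y1_count`/`y2_count`, `gini_left`, `gini_right`, and `GINIgain` in the code.

$$
G = 1 - \left(p_0^2 + p_1^2\right), \qquad p_k = \frac{n_k}{n_0 + n_1}
$$

$$
G_{\text{split}} = \frac{n_L}{n_L + n_R}\, G_L + \frac{n_R}{n_L + n_R}\, G_R, \qquad \Delta G = G - G_{\text{split}}
$$

As a worked case, a node with class counts $\{0\!: 29,\ 1\!: 21\}$ has $p_0 = 0.58$ and $p_1 = 0.42$, so

$$
G = 1 - \left(0.58^2 + 0.42^2\right) = 1 - (0.3364 + 0.1764) = 0.4872 .
$$

`best_split` keeps the feature and threshold with the largest $\Delta G$, and returns `(None, None)` when no candidate gives a strictly positive gain.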

In [6]:
d = pd.read_csv("titanic-train.csv")
dtree = d[["Survived", "Age", "Fare"]].dropna().copy()

In [7]:
Y = dtree["Survived"][:50].values
X = dtree[["Age", "Fare"]][:50]
features = list(X.columns)

In [8]:
print(Y)
print(X)
print(features)

[0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0
 0 1 1 0 1 0 1 0 0 1 0 0 1]
     Age      Fare
0   22.0    7.2500
1   38.0   71.2833
2   26.0    7.9250
3   35.0   53.1000
4   35.0    8.0500
6   54.0   51.8625
7    2.0   21.0750
8   27.0   11.1333
9   14.0   30.0708
10   4.0   16.7000
11  58.0   26.5500
12  20.0    8.0500
13  39.0   31.2750
14  14.0    7.8542
15  55.0   16.0000
16   2.0   29.1250
18  31.0   18.0000
20  35.0   26.0000
21  34.0   13.0000
22  15.0    8.0292
23  28.0   35.5000
24   8.0   21.0750
25  38.0   31.3875
27  19.0  263.0000
30  40.0   27.7208
33  66.0   10.5000
34  28.0   82.1708
35  42.0   52.0000
37  21.0    8.0500
38  18.0   18.0000
39  14.0   11.2417
40  40.0    9.4750
41  27.0   21.0000
43   3.0   41.5792
44  19.0    7.8792
49  18.0   17.8000
50   7.0   39.6875
51  21.0    7.8000
52  49.0   76.7292
53  29.0   26.0000
54  65.0   61.9792
56  21.0   10.5000
57  28.5    7.2292
58   5.0   27.7500
59  11.0   46.9000
60  22.0    7.2292
61  

In [9]:
hp = {
 "max_depth": 50,
 "min_samples_split": 1
}

In [10]:
root = Node(Y, X, **hp)

In [11]:
root.grow_tree()

In [12]:
root.print_tree()

Root
   | GINI impurity of the node: 0.49
   | Class distribution in the node: {0: 29, 1: 21}
   | Predicted class: 0
|-------- Split rule: Fare <= 7.867
           | GINI impurity of the node: 0.0
           | Class distribution in the node: {0: 5}
           | Predicted class: 0
|-------- Split rule: Fare > 7.867
           | GINI impurity of the node: 0.5
           | Class distribution in the node: {1: 21, 0: 24}
           | Predicted class: 0
|---------------- Split rule: Fare <= 8.04
                   | GINI impurity of the node: 0.0
                   | Class distribution in the node: {1: 3}
                   | Predicted class: 1
|---------------- Split rule: Fare > 8.04
                   | GINI impurity of the node: 0.49
                   | Class distribution in the node: {1: 18, 0: 24}
                   | Predicted class: 0
|------------------------ Split rule: Fare <= 9.988
                           | GINI impurity of the node: 0.0
                           | Class di

In [14]:
tree_one=tree.DecisionTreeClassifier()
tree_one=tree_one.fit(X,Y)
tree_one_accuracy= round(tree_one.score(X,Y),4)
print('Accuracy: %0.4f'% (tree_one_accuracy))

Accuracy: 1.0000
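
The accuracy above is measured on the same rows the tree was fitted on, and an unrestricted tree can usually memorize its training data, so a score of 1.0 here says little about generalization. The sketch below contrasts training accuracy with hold-out accuracy. The data comes from `make_classification`, and it and every `_demo` name are assumptions for illustration, not the Titanic subset used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), continuous features without duplicates
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until every leaf is pure
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)  # accuracy on the rows used for fitting
test_acc = clf.score(X_te, y_te)   # accuracy on rows the tree never saw

print(f"Training accuracy: {train_acc:.4f}")
print(f"Hold-out accuracy: {test_acc:.4f}")
```

The hold-out figure is the one to compare between the from-scratch tree and `DecisionTreeClassifier`.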


In [15]:
# !pip install -r requirements.txt --user
!pip install pydotplus --user



In [16]:
from io import StringIO
from IPython.display import Image,display
import pydotplus

out=StringIO()
tree.export_graphviz(tree_one,out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
graph.write_png('titanic.png')

True
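
If Graphviz or `pydotplus` is not available, scikit-learn can also render a fitted tree as plain text with `tree.export_text`. This is a minimal sketch on synthetic data. The feature names `Age` and `Fare` are reused from the cell above only as labels, and the values themselves are generated, not read from `titanic-train.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data (assumption), labelled with the notebook's column names
X_demo, y_demo = make_classification(
    n_samples=50, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# A shallow tree keeps the printed rules short
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per node: split rules on inner nodes, predicted class on leaves
tree_text = export_text(clf, feature_names=["Age", "Fare"])
print(tree_text)
```

The text layout is close to what `root.print_tree()` prints for the from-scratch tree, which makes the two easy to compare side by side.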

In [40]:
X

Unnamed: 0,CALCIUM_MEAN,GLUCOSE_MEAN,HEMOGLOBIN_MEAN,LINFOCITOS_MEAN,PH_ARTERIAL_MEAN,OXYGEN_SATURATION_DIFF,HEART_RATE_DIFF_REL
4,0.326531,-0.891993,-0.353659,-0.643154,0.574468,-0.818182,-0.230462
6,0.530612,-0.891993,-0.914634,-0.858921,0.234043,-1.0,-1.0
8,0.367347,-0.891993,0.019956,-0.614108,0.148936,-1.0,-0.940967
9,0.530612,-0.891993,-0.719512,-0.717842,0.361702,-0.79798,-0.239515
14,0.367347,-0.891993,0.0,-0.73029,0.234043,-0.89899,-0.576744
19,0.326531,-0.891993,-0.219512,-0.8361,0.234043,-0.171717,-0.069094
22,0.357143,-0.891993,0.019956,-0.614108,0.234043,-0.979798,-0.956805
24,0.357143,-0.891993,0.292683,-0.537344,0.234043,-0.939394,-0.634847
29,0.489796,-0.891993,0.134146,-0.624481,0.234043,-0.919192,-0.581849
34,0.357143,-0.891993,-0.487805,-0.821577,0.234043,0.858586,-0.78123


In [17]:
df=X.copy()
df['Y']=Y
n=len(Y)
features=root.features
features[0]

'Age'

In [18]:
Xdf = df.dropna().sort_values(features[0])
Xdf

Unnamed: 0,Age,Fare,Y
16,2.0,29.125,0
7,2.0,21.075,0
43,3.0,41.5792,1
63,4.0,27.9,0
10,4.0,16.7,1
58,5.0,27.75,1
50,7.0,39.6875,0
24,8.0,21.075,0
59,11.0,46.9,0
9,14.0,30.0708,1


In [19]:
a = Xdf[features[0]].unique()
print(a)
pro = Xdf[features[0]]<5
pro

[ 2.   3.   4.   5.   7.   8.  11.  14.  15.  18.  19.  20.  21.  22.
 26.  27.  28.  28.5 29.  31.  34.  35.  38.  39.  40.  42.  45.  49.
 54.  55.  58.  65.  66. ]


16     True
7      True
43     True
63     True
10     True
58    False
50    False
24    False
59    False
9     False
39    False
14    False
22    False
38    False
49    False
44    False
27    False
12    False
51    False
56    False
37    False
60    False
0     False
2     False
8     False
41    False
23    False
34    False
57    False
53    False
66    False
18    False
21    False
20    False
3     False
4     False
61    False
25    False
1     False
13    False
40    False
30    False
35    False
62    False
52    False
6     False
15    False
11    False
54    False
33    False
Name: Age, dtype: bool

In [20]:
b = np.ones(2)
print(b)
xmenas = np.convolve(a, b, 'valid') / 2
xmenas

[1. 1.]


array([ 2.5 ,  3.5 ,  4.5 ,  6.  ,  7.5 ,  9.5 , 12.5 , 14.5 , 16.5 ,
       18.5 , 19.5 , 20.5 , 21.5 , 24.  , 26.5 , 27.5 , 28.25, 28.75,
       30.  , 32.5 , 34.5 , 36.5 , 38.5 , 39.5 , 41.  , 43.5 , 47.  ,
       51.5 , 54.5 , 56.5 , 61.5 , 65.5 ])
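
The moving average with a window of 2 is simply the midpoint between each pair of consecutive sorted unique values, and those midpoints are the candidate thresholds. The short sketch below checks this on the first few ages from the output above. The variable names are illustrative.

```python
import numpy as np

# First seven unique ages from the sorted array printed above
ages = np.array([2.0, 3.0, 4.0, 5.0, 7.0, 8.0, 11.0])

# Approach used in Node.ma: convolve with a window of ones and divide by its length
via_convolve = np.convolve(ages, np.ones(2), "valid") / 2

# Equivalent formulation: average each value with its right-hand neighbour
via_slices = (ages[:-1] + ages[1:]) / 2

print(via_convolve)
print(via_slices)
```

Because every threshold falls strictly between two observed values, no observation ever equals a threshold, so splitting with `<` or with `<=` sends each row to the same side.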

In [21]:
value = xmenas[0]
left_counts = Counter(Xdf[Xdf[features[0]]<value]['Y'])
right_counts = Counter(Xdf[Xdf[features[0]]>=value]['Y'])
print(left_counts)
print(right_counts)

Counter({0: 2})
Counter({0: 27, 1: 21})


In [22]:
y0_left, y1_left, y0_right, y1_right = left_counts.get(0, 0), left_counts.get(1, 0), right_counts.get(0, 0), right_counts.get(1, 0)

In [23]:
def GINI_impurity(y1_count: int, y2_count: int) -> float:
        """
        Given the observations of a binary class calculate the GINI impurity
        """
        # Ensuring the correct types
        if y1_count is None:
            y1_count = 0

        if y2_count is None:
            y2_count = 0

        # Getting the total observations
        n = y1_count + y2_count
        
        # If n is 0 then we return the lowest possible gini impurity
        if n == 0:
            return 0.0

        # Getting the probability to see each of the classes
        p1 = y1_count / n
        p2 = y2_count / n
        
        # Calculating GINI 
        gini = 1 - (p1 ** 2 + p2 ** 2)
        
        # Returning the gini impurity
        return gini

In [24]:
gini_left = GINI_impurity(y0_left, y1_left)
gini_right = GINI_impurity(y0_right, y1_right)
print(gini_left)
print(gini_right)

0.0
0.4921875


In [25]:
n_left = y0_left + y1_left
n_right = y0_right + y1_right
GINI_base=0.48
print(n_left)
print(n_right)
# Calculating the weights for each of the nodes
w_left = n_left / (n_left + n_right)
w_right = n_right / (n_left + n_right)
print("=======================")
print(w_left)
print(w_right)
# Calculating the weighted GINI impurity
wGINI = w_left * gini_left + w_right * gini_right
print("=======================")
print(wGINI)

# Calculating the GINI gain 
GINIgain = GINI_base - wGINI

print("=======================")
print(GINIgain)

2
48
0.04
0.96
0.4725
0.007500000000000007
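
The cell above uses a hard-coded, rounded base impurity (`GINI_base=0.48`). The sketch below derives the base from the class counts instead, using the same left and right counts. The helper names `gini` and `gini_gain` are hypothetical and are not part of the `Node` class.

```python
from collections import Counter


def gini(counts: Counter) -> float:
    """Gini impurity of a node given its class counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_gain(parent: Counter, left: Counter, right: Counter) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    if n == 0:
        return 0.0
    weighted = (n_left / n) * gini(left) + (n_right / n) * gini(right)
    return gini(parent) - weighted


# Counts from the split at Age < 2.5 shown above
left = Counter({0: 2})
right = Counter({0: 27, 1: 21})
parent = left + right  # class counts of the node before the split

gain = gini_gain(parent, left, right)
print(f"Base impurity: {gini(parent):.4f}")
print(f"Gini gain:     {gain:.4f}")
```

With the unrounded base of 0.4872 the gain for this split comes out at about 0.0147 rather than 0.0075. Only the absolute value changes: the ranking of candidate splits is unaffected because the same base is subtracted from every candidate.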


In [26]:
print(Y)
counter=Counter(Y)
print(counter)
c=Counter(Y).items()
print(c)

[0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0
 0 1 1 0 1 0 1 0 0 1 0 0 1]
Counter({0: 29, 1: 21})
dict_items([(0, 29), (1, 21)])


In [27]:
d=list(sorted(c, key=lambda item: item[1]))
print(d)
counter.get(1,0)

[(1, 21), (0, 29)]


21

In [28]:
p1,p2 = counter.get(0, 0), counter.get(1, 0)
n=p1+p2
p1 = p1 / n
p2 = p2 / n
gini = 1 - (p1 ** 2 + p2 ** 2)
gini

0.4872000000000001

In [29]:
dict(counter)

{0: 29, 1: 21}

In [30]:
d[-1][0]

0

In [31]:
root.features

['Age', 'Fare']

In [32]:
a=5 if 1 else 20

In [33]:
a

5

In [34]:
pro = X.iterrows()
pro

<generator object DataFrame.iterrows at 0x000001B30BD70D60>

In [35]:
X[['Age','Fare']]

Unnamed: 0,Age,Fare
0,22.0,7.25
1,38.0,71.2833
2,26.0,7.925
3,35.0,53.1
4,35.0,8.05
6,54.0,51.8625
7,2.0,21.075
8,27.0,11.1333
9,14.0,30.0708
10,4.0,16.7


In [36]:
print(features)
for _,x in X.iterrows():
    for feature in features:
        print(x[feature])
    print("====================")

['Age', 'Fare']
22.0
7.25
38.0
71.2833
26.0
7.925
35.0
53.1
35.0
8.05
54.0
51.8625
2.0
21.075
27.0
11.1333
14.0
30.0708
4.0
16.7
58.0
26.55
20.0
8.05
39.0
31.275
14.0
7.8542
55.0
16.0
2.0
29.125
31.0
18.0
35.0
26.0
34.0
13.0
15.0
8.0292
28.0
35.5
8.0
21.075
38.0
31.3875
19.0
263.0
40.0
27.7208
66.0
10.5
28.0
82.1708
42.0
52.0
21.0
8.05
18.0
18.0
14.0
11.2417
40.0
9.475
27.0
21.0
3.0
41.5792
19.0
7.8792
18.0
17.8
7.0
39.6875
21.0
7.8
49.0
76.7292
29.0
26.0
65.0
61.9792
21.0
10.5
28.5
7.2292
5.0
27.75
11.0
46.9
22.0
7.2292
38.0
80.0
45.0
83.475
4.0
27.9
29.0
10.5


In [48]:
d = pd.read_csv("Kaggle_Sirio_Libanes_ICU_Prediction.csv",sep=";")
d = d.loc[d['WINDOW'] == 'ABOVE_12']
dtree = d[["AGE_ABOVE65", "AGE_PERCENTIL", "GENDER","CALCIUM_MEAN","GLUCOSE_MEAN","HEMOGLOBIN_MEAN","LINFOCITOS_MEAN","PH_ARTERIAL_MEAN","OXYGEN_SATURATION_DIFF","HEART_RATE_DIFF_REL","ICU"]].dropna().copy()
Y = dtree["ICU"][:50].values
X = dtree[["CALCIUM_MEAN","GLUCOSE_MEAN","HEMOGLOBIN_MEAN","LINFOCITOS_MEAN","PH_ARTERIAL_MEAN","OXYGEN_SATURATION_DIFF","HEART_RATE_DIFF_REL"]][:50]
features = list(X.columns)
hp = {
 "max_depth": 10,
 "min_samples_split": 1
}
root = Node(Y, X, **hp)

In [50]:
X.head()

Unnamed: 0,CALCIUM_MEAN,GLUCOSE_MEAN,HEMOGLOBIN_MEAN,LINFOCITOS_MEAN,PH_ARTERIAL_MEAN,OXYGEN_SATURATION_DIFF,HEART_RATE_DIFF_REL
4,0.326531,-0.891993,-0.353659,-0.643154,0.574468,-0.818182,-0.230462
9,0.530612,-0.891993,-0.719512,-0.717842,0.361702,-0.79798,-0.239515
14,0.367347,-0.891993,0.0,-0.73029,0.234043,-0.89899,-0.576744
19,0.326531,-0.891993,-0.219512,-0.8361,0.234043,-0.171717,-0.069094
24,0.357143,-0.891993,0.292683,-0.537344,0.234043,-0.939394,-0.634847


In [51]:
root.grow_tree()

In [52]:
root.print_tree()

Root
   | GINI impurity of the node: 0.5
   | Class distribution in the node: {1: 24, 0: 26}
   | Predicted class: 0
|-------- Split rule: HEART_RATE_DIFF_REL <= -0.437
           | GINI impurity of the node: 0.32
           | Class distribution in the node: {1: 6, 0: 24}
           | Predicted class: 0
|---------------- Split rule: HEART_RATE_DIFF_REL <= -0.691
                   | GINI impurity of the node: 0.0
                   | Class distribution in the node: {0: 6}
                   | Predicted class: 0
|---------------- Split rule: HEART_RATE_DIFF_REL > -0.691
                   | GINI impurity of the node: 0.38
                   | Class distribution in the node: {1: 6, 0: 18}
                   | Predicted class: 0
|------------------------ Split rule: HEART_RATE_DIFF_REL <= -0.662
                           | GINI impurity of the node: 0.0
                           | Class distribution in the node: {1: 1}
                           | Predicted class: 1
|-------------------

In [53]:
tree_one=tree.DecisionTreeClassifier()
tree_one=tree_one.fit(X,Y)
tree_one_accuracy= round(tree_one.score(X,Y),4)
print('Accuracy: %0.4f'% (tree_one_accuracy))

Accuracy: 1.0000


In [41]:
out=StringIO()
tree.export_graphviz(tree_one,out_file=out)
graph = pydotplus.graph_from_dot_data(out.getvalue())
graph.write_png('uci.png')

True