# **MC8843 - Machine Learning**
# **Practical assignment 1: Decision trees**

Prerequisites for the notebook.

In [1]:
import torch
import pandas
import numpy as np
import random
import optuna
import plotly

## **1. Implementation of multi-class classification with decision trees (60 points)**

Next, implement the decision tree algorithm (CART, described in the course material), using the definition and description of the following functions as a basis:

1. The dataset available at https://www.kaggle.com/yashsawarn/wifi-stretgth-for-rooms contains readings from 7 Wi-Fi signal sources, which are intended to be used to determine whether the receiver of the readings is in room 1, 2, 3 or 4. Figure 1 shows a sample of the dataset, where the Wi-Fi signal readings are usually negative values.

<div style="text-align:center">
  <figure>
    <img src="images/sample.png" alt="sample">
    <figcaption>Figure 1: Sample of the dataset to be used.</figcaption>
  </figure>
</div>

In [2]:
def read_dataset(csv_name='wifi_localization.txt'):
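    """
    Reads the Wi-Fi localization dataset into a tensor.

    :param csv_name: path to the whitespace-separated data file
    :return: float64 tensor with the 7 signal columns followed by the room label
    """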
    data_frame = pandas.read_table(csv_name, sep=r'\s+', names=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'ROOM'),
                                   dtype={'A': np.int64, 'B': np.float64, 'C': np.float64, 'D': np.float64,
                                          'E': np.float64, 'F': np.float64, 'G': np.float64, 'ROOM': np.float64})
    # targets_torch = torch.tensor(data_frame['ROOM'].values)
    dataset_torch = torch.tensor(data_frame.values)
    return dataset_torch

In [3]:
dataset_torch = read_dataset()

print(dataset_torch)

tensor([[-64., -56., -61.,  ..., -82., -81.,   1.],
        [-68., -57., -61.,  ..., -85., -85.,   1.],
        [-63., -60., -60.,  ..., -85., -84.,   1.],
        ...,
        [-62., -59., -46.,  ..., -87., -88.,   4.],
        [-62., -58., -52.,  ..., -90., -85.,   4.],
        [-59., -50., -45.,  ..., -88., -87.,   4.]], dtype=torch.float64)


2. To solve the problem of determining which room the Wi-Fi receiver is in, your team decided to build a decision tree (CART, Classification and Regression Tree). To do so, you will use the code provided in the Jupyter notebook.

a) The provided code defines the classes "CART" and "Node_CART", which allow building a binary CART. Each node of the tree has attributes such as the feature, the threshold, and the Gini coefficient (or the entropy) of the partition defined at that node. It also defines the "dominant_class" attribute of the node, which is the class with the most occurrences in the partition that defines the node. Finally, the code includes functionality to generate an "xml" file (which can be opened in any web browser) to represent the tree easily.
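As an illustration of the "dominant_class" attribute, the following minimal sketch picks the most frequent label from a 1-D tensor of labels. The helper name `dominant_class` and the toy labels are assumptions made for this example only; they are not part of the provided code.

```python
import torch


def dominant_class(labels: torch.Tensor) -> float:
    """Return the label with the most occurrences in a 1-D tensor of labels."""
    # unique() returns the sorted distinct labels and, with return_counts=True,
    # how many times each one appears.
    tags, counts = labels.unique(return_counts=True)
    # argmax() gives the position of the largest count.
    return tags[counts.argmax()].item()


# Hypothetical labels: room 3 appears three times, rooms 1 and 2 once each.
toy_labels = torch.tensor([3.0, 1.0, 3.0, 2.0, 3.0])
toy_dominant = dominant_class(toy_labels)
```

Because `unique` returns sorted labels and `argmax` returns the first maximum, a tie in this sketch would be resolved in favor of the smallest label.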

**b) (10 points)** Implement the method "calculate_gini(data_partition_torch, num_classes = 4)", which computes the Gini coefficient for the dataset received as a PyTorch tensor. Use the definition given in the course material. Write a matrix-based implementation (avoiding loops as much as possible). Comment the implementation, detailing each function used in the external documentation.

$E_{\textrm{gini},\rho}\left(\tau_{d}\right)=1-\sum_{k=1}^{K}a_{k}^{2}$

1) Design and implement at least 2 unit tests for this function.
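The Gini formula can be checked on two small cases: a pure partition and a perfectly balanced one. The sketch below assumes that $a_{k}$ is the proportion of observations of class $k$ and that an empty partition has zero impurity. The standalone helper `gini_from_labels` is a hypothetical name used only for this illustration.

```python
import torch


def gini_from_labels(labels: torch.Tensor) -> torch.Tensor:
    """Gini coefficient 1 - sum_k a_k^2 for a 1-D tensor of class labels."""
    if labels.numel() == 0:
        # Assumed convention: an empty partition has zero impurity.
        return torch.tensor(0.0, dtype=torch.float64)
    # Number of observations per class, without an explicit loop.
    _, counts = labels.unique(return_counts=True)
    proportions = counts.double() / labels.numel()
    return 1.0 - torch.sum(proportions ** 2)


# Hypothetical label vectors used as test cases.
pure = torch.tensor([1.0, 1.0, 1.0, 1.0])      # a single class
balanced = torch.tensor([1.0, 2.0, 3.0, 4.0])  # four equally frequent classes

gini_pure = gini_from_labels(pure).item()
gini_balanced = gini_from_labels(balanced).item()
```

A pure partition gives $1 - 1^{2} = 0$, and four equally frequent classes give $1 - 4\cdot(1/4)^{2} = 0.75$, so these two cases are natural candidates for the requested unit tests.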

**c) (10 points)** Implement the method "calculate_entropy(data_partition_torch, num_classes = 4)", which computes the entropy of the labels, represented in a PyTorch tensor. Use the definition given in the course material. Write a matrix-based implementation (avoiding loops as much as possible). $p\left[k\right]$ is the density function of the labels in the data partition received as a parameter.

$E_{\textrm{entropy},\rho}\left(\tau_{d}\right)=-\sum_{k=1}^{K}p\left[k\right]\log\left(p\left[k\right]\right)$

1) Design and implement at least 2 unit tests for this function.
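The entropy formula can be verified in the same way. This sketch assumes the natural logarithm and estimates $p\left[k\right]$ from the observed label counts; the helper name `entropy_from_labels` is hypothetical. Since only classes that actually appear are counted, every probability is strictly positive and the logarithm stays finite.

```python
import math

import torch


def entropy_from_labels(labels: torch.Tensor) -> torch.Tensor:
    """Entropy -sum_k p[k] log(p[k]) for a 1-D tensor of class labels."""
    if labels.numel() == 0:
        # Assumed convention: an empty partition has zero entropy.
        return torch.tensor(0.0, dtype=torch.float64)
    # Empirical probabilities of the classes present in the partition.
    _, counts = labels.unique(return_counts=True)
    p = counts.double() / labels.numel()
    return -torch.sum(p * torch.log(p))


# Hypothetical label vectors used as test cases.
entropy_pure = entropy_from_labels(torch.tensor([2.0, 2.0, 2.0])).item()
entropy_uniform = entropy_from_labels(torch.tensor([1.0, 2.0, 3.0, 4.0])).item()
expected_uniform = math.log(4)  # entropy of 4 equally likely classes
```

With a single class the entropy is 0, and with four equally likely classes it is $\log 4$, which gives two simple reference values for unit tests.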

**d) (30 points)** Implement the methods "select_best_feature_and_thresh(data_torch, num_classes = 4)" and "create_with_children" of the "Node_CART" class. The first method receives as a parameter the dataset to analyze, as a torch tensor. It must exhaustively try all possible features and their corresponding thresholds in the received data until it finds the lowest weighted Gini coefficient (or the minimum entropy, depending on the error function in use). **Use logical indexing to avoid "for" loops as much as possible.** You may only use loops to iterate over the features and the candidate thresholds in the dataset. Recall that evaluating a candidate partition requires a weighted Gini coefficient; the suggested weighting for choosing the optimal feature and threshold is:

$\overline{E}_{\textrm{gini}}\left(\tau_{d},d\right)=\frac{n_{i}}{n}E_{\textrm{gini}}\left(D_{i}\right)+\frac{n_{d}}{n}E_{\textrm{gini}}\left(D_{d}\right).$

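Here $D_{i}$ and $D_{d}$ are the left and right partitions, with $n_{i}$ and $n_{d}$ observations out of $n$. A similar weighting applies to the entropy. Comment the implementation, detailing each function used in the external documentation.

1) Design and implement at least 2 unit tests for this function.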
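The threshold search for a single feature can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's implementation: the labels are taken from the last column, rows with a value below the threshold go to the left partition, and the names `gini` and `best_threshold_for_feature` are hypothetical. The only loop iterates over candidate thresholds; each partition is selected with a boolean mask.

```python
import torch


def gini(labels: torch.Tensor) -> torch.Tensor:
    """Gini coefficient of a 1-D tensor of labels (0 for an empty tensor)."""
    if labels.numel() == 0:
        return torch.tensor(0.0, dtype=torch.float64)
    _, counts = labels.unique(return_counts=True)
    p = counts.double() / labels.numel()
    return 1.0 - torch.sum(p ** 2)


def best_threshold_for_feature(data: torch.Tensor, feature: int):
    """Return (threshold, weighted_gini) with the lowest weighted Gini for one feature."""
    values = data[:, feature]
    labels = data[:, -1]  # assumption: the label is the last column
    n = data.shape[0]
    best_thresh, best_score = None, float("inf")
    for thresh in torch.unique(values):
        left_mask = values < thresh  # logical index of the left partition
        right_mask = ~left_mask      # everything else goes to the right
        n_left = int(left_mask.sum())
        n_right = n - n_left
        # Weighted Gini: each side weighted by its share of observations.
        score = ((n_left / n) * gini(labels[left_mask])
                 + (n_right / n) * gini(labels[right_mask])).item()
        if score < best_score:
            best_thresh, best_score = thresh.item(), score
    return best_thresh, best_score


# Hypothetical data: one signal column and a room label.
toy = torch.tensor([[-70.0, 1.0], [-65.0, 1.0], [-50.0, 2.0], [-45.0, 2.0]],
                   dtype=torch.float64)
toy_thresh, toy_score = best_threshold_for_feature(toy, feature=0)
```

In this toy case the threshold -50 separates the two rooms perfectly, so the weighted Gini is 0. Repeating the search for every feature and keeping the overall minimum yields the feature and threshold for the node.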

In [4]:
class NodeCart:

    def __init__(self,
                 gini_entropy_function="GINI",
                 num_classes=4, num_features=7, ref_cart=None, current_depth=0):
        """
        Node of a binary CART.

        :param gini_entropy_function: "GINI" selects the Gini coefficient; any other value selects the entropy
        :param num_classes: number of classes K to discriminate between
        :param num_features: number of input features
        :param ref_cart: reference to the owning CART (child nodes store their parent node here)
        :param current_depth: depth of the node within the tree
        """
        self.ref_CART = ref_cart
        self.threshold_value = 0  # Threshold
        self.feature_num = 0
        self.node_right = None
        self.node_left = None
        self.data_torch_partition = None  # Reference to the data partition
        self.gini = 0  # Or entropy. Numeric function to use when the tree is built.
        self.dominant_class = None  # Class with the most observations in this partition.
        self.accuracy_dominant_class = None  # Hit rate of that dominant class
        self.num_classes = num_classes
        self.num_features = num_features
        self.current_depth = current_depth  # Depth
        self.leaf = False
        self.gini_function = gini_entropy_function
        self.hits = 0
        self.fails = 0

    def to_xml(self, current_str=""):
        """
        Recursive function to write the node content to an xml formatted string
        param current_str : the xml content so far in the whole tree
        return the string with the node content
        """
        str_node = (f"<node>"
                    f"<thresh>{self.threshold_value}</thresh>"
                    f"<feature>{self.feature_num}</feature>"
                    f"<depth>{self.current_depth}</depth>"
                    f"<gini>{self.gini}</gini>")
        if self.node_left:
            str_left = self.node_left.to_xml(current_str)
            str_node += str_left
        if self.node_right:
            str_right = self.node_right.to_xml(current_str)
            str_node += str_right
        if self.is_leaf():
            str_node += (f"<dominant_class>{self.dominant_class}</dominant_class>"
                         f"<acc_dominant_class>{self.accuracy_dominant_class}</acc_dominant_class>")
        str_node += "</node>"
        return str_node

    def is_leaf(self):
        """
        Checks whether the node is a leaf
        :return:
        """
        return self.leaf

    def create_with_children(self, current_depth=0, list_selected_features=None, min_gini=0.000001, max_cart_depth=3,
                             min_observations=2, glob_list_selected_features=None):
        """
        Creates a node by selecting the best feature and threshold for self.data_torch_partition,
        and if needed, creating its children
        param current_depth: depth counter for the node
        param list_selected_features: list of selected features so far for the CART building process
        param min_gini: hyperparameter selected by the user defining the minimum tolerated Gini coefficient for a node
        param max_cart_depth: maximum depth of the tree
        param min_observations: a node with this many observations or fewer becomes a leaf
        param glob_list_selected_features: accumulator for the features selected in the subtrees
        return the list of selected features so far
        """

        if list_selected_features is None:
            list_selected_features = []

        if glob_list_selected_features is None:
            glob_list_selected_features = []

        min_thresh, min_feature, min_gini_thresh = (
            self.select_best_feature_and_thresh(data_torch=self.data_torch_partition,
                                                list_features_selected=list_selected_features))

        self.feature_num = min_feature
        self.threshold_value = min_thresh
        self.gini = min_gini_thresh
        self.current_depth = current_depth
        list_selected_features.append(self.feature_num)

        if (min_gini_thresh <= min_gini or len(list_selected_features) == self.num_features or
                current_depth == max_cart_depth or self.data_torch_partition.shape[0] <= min_observations):
            # This is a leaf
            self.leaf = True
            length = self.data_torch_partition.shape[1] - 1
            tag_values = self.data_torch_partition[:, length:length + 1].squeeze()
            tags, counts = tag_values.unique(return_counts=True)
            most_common_value = tags[counts.argmax()].item()
            self.dominant_class = most_common_value
            return list_selected_features

        left_idx = self.data_torch_partition[:, self.feature_num] < self.threshold_value
        right_idx = self.data_torch_partition[:, self.feature_num] >= self.threshold_value

        dataset_partition_left = self.data_torch_partition[left_idx]
        dataset_partition_right = self.data_torch_partition[right_idx]

        left_child = NodeCart(current_depth=current_depth, gini_entropy_function=self.gini_function)
        left_child.data_torch_partition = dataset_partition_left
        left_child.ref_CART = self

        right_child = NodeCart(current_depth=current_depth, gini_entropy_function=self.gini_function)
        right_child.data_torch_partition = dataset_partition_right
        right_child.gini_function = self.gini_function
        right_child.ref_CART = self

        current_depth += 1

        self.node_left = left_child
        self.node_right = right_child

        unique_features_left = list_selected_features.copy()
        unique_features_right = list_selected_features.copy()

        left_selected = self.node_left.create_with_children(current_depth, unique_features_left,
                                                            max_cart_depth=max_cart_depth,
                                                            min_gini=min_gini,
                                                            min_observations=min_observations)
        right_selected = self.node_right.create_with_children(current_depth, unique_features_right,
                                                              min_gini=min_gini,
                                                              max_cart_depth=max_cart_depth,
                                                              min_observations=min_observations)

        glob_list_selected_features.extend(left_selected)
        glob_list_selected_features.extend(right_selected)

        # TODO remove duplicates

        return glob_list_selected_features

    def select_best_feature_and_thresh(self, data_torch, list_features_selected=None, num_classes=4):
        """
        Selects the best feature and threshold that minimizes the Gini coefficient
        param data_torch: dataset partition to analyze
        param list_features_selected list of features selected so far, thus must be ignored
        param num_classes: number of K classes to discriminate from
        return min_thresh, min_feature, min_gini found for the dataset partition when
        selecting the found feature and threshold
        """
        def evaluate_feature(data, feature_num, gini_entropy_total_function):
            root_node = NodeCart()
            root_node.data_torch_partition = data
            root_node.feature_num = feature_num
            threshold_values = torch.unique(data[:, feature_num:feature_num + 1].squeeze())
            value_gini = {}
            for value in threshold_values:
                root_node.threshold_value = value
                left_idx = data[:, root_node.feature_num] < root_node.threshold_value
                right_idx = data[:, root_node.feature_num] >= root_node.threshold_value
                dataset_partition_left = data[left_idx]
                dataset_partition_right = data[right_idx]
                left_child = NodeCart()
                left_child.data_torch_partition = dataset_partition_left
                right_child = NodeCart()
                right_child.data_torch_partition = dataset_partition_right
                gini = gini_entropy_total_function(left_child, right_child)
                value_gini[value] = gini
            minimum_gini = min(value_gini, key=value_gini.get)
            return {minimum_gini.item(): value_gini[minimum_gini].item()}

        if list_features_selected is None:
            list_features_selected = []
        num_features = data_torch.shape[1] - 1
        if len(list_features_selected) == num_features:
            raise ValueError("All features have been selected")
        features_gini = {}
        for feature in range(num_features):
            if feature not in list_features_selected:
                features_gini[feature] = evaluate_feature(data_torch, feature, self.calculate_total_gini_entropy)
        min_key, min_inner_dict = min(features_gini.items(), key=lambda item: next(iter(item[1].values())))
        result = features_gini[min_key]
        min_feature = min_key
        min_thresh = list(result.keys())[0]
        min_gini = result[min_thresh]
        return min_thresh, min_feature, min_gini

    def calculate_gini(self, data_partition_torch, num_classes=4):
        """
        Calculates the Gini coefficient for a given partition with the given number of classes
        param data_partition_torch: current dataset partition as a tensor
        param num_classes: K number of classes to discriminate from
        returns the calculated Gini coefficient
        """
        def calculate_gini_impurity(partition):
            size = partition.shape[0]
            if size == 0:  # To handle the division by zero
                return torch.tensor(0)
            length = partition.shape[1] - 1
            _, counts = partition[:, length].unique(return_counts=True)
            gini = 1 - torch.sum((counts / size) ** 2)
            return gini
        return calculate_gini_impurity(data_partition_torch)

    def calculate_entropy(self, data_partition_torch, num_classes=4):
        """
        Calculates the entropy for a given partition with the given number of classes
        param data_partition_torch: current dataset partition as a tensor
        param num_classes: K number of classes to discriminate from
        returns the calculated entropy
        """
        def calculate_entropy_disorder(partition):
            size = partition.shape[0]
            if size == 0:  # To handle the division by zero
                return torch.tensor(0)
            length = partition.shape[1] - 1
            epsilon = torch.tensor(0.0001)  # Small epsilon to prevent probabilities equal to 0
            _, counts = partition[:, length].unique(return_counts=True)
            probabilities = (counts / size) + epsilon
            entropy = - torch.sum(probabilities * torch.log(probabilities))
            return entropy
        return calculate_entropy_disorder(data_partition_torch)

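    def calculate_total_gini_entropy(self, node_left, node_right):
        """
        Calculates the weighted Gini coefficient (or entropy) of a candidate split
        param node_left: node holding the left partition
        param node_right: node holding the right partition
        return the impurity of both partitions, each weighted by its share of observations
        """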
        selected_function = self.calculate_gini if self.gini_function == "GINI" else self.calculate_entropy
        size_left = node_left.data_torch_partition.shape[0]
        size_right = node_right.data_torch_partition.shape[0]
        size_total = size_left + size_right
        gini_entropy_left = selected_function(node_left.data_torch_partition)
        gini_entropy_right = selected_function(node_right.data_torch_partition)
        gini_entropy_total = (size_left / size_total) * gini_entropy_left + (size_right / size_total) * gini_entropy_right
        return gini_entropy_total

    def evaluate_node(self, input_torch):
        """
        Evaluates an input observation within the node.
        If it is not a leaf node, sends it to the corresponding child node
        return the predicted label and the leaf node that produced it
        """
        feature_val_input = input_torch[self.feature_num]
        if self.is_leaf():
            return self.dominant_class, self
        elif feature_val_input < self.threshold_value:
            return self.node_left.evaluate_node(input_torch)
        else:
            return self.node_right.evaluate_node(input_torch)

    def update_accuracy(self):
        self.accuracy_dominant_class = (self.hits / (self.hits + self.fails)) * 100
        self.accuracy_dominant_class = round(self.accuracy_dominant_class, 2)

In [5]:
class CART:  # This is the tree
    # Do not change default values or unit tests will be affected !!
    def __init__(self, dataset_torch, max_cart_depth=3, min_observations=2, gini_entropy_function="GINI", num_classes=4):
        """
        CART has only one root node
        """
        # min observations per node
        self.min_observations = min_observations
        self.root = NodeCart(num_classes=num_classes, ref_cart=self, gini_entropy_function=gini_entropy_function)
        self.root.data_torch_partition = dataset_torch
        self.max_CART_depth = max_cart_depth
        self.list_selected_features = []
        self.confusion_matrix = torch.zeros(num_classes, num_classes)

    def get_root(self):
        """
        Gets tree root
        """
        return self.root

    def get_min_observations(self):
        """
        return min observations per node
        """
        return self.min_observations

    def get_max_depth(self):
        """
        Gets the selected max depth of the tree
        """
        return self.max_CART_depth

    def build_cart(self):
        """
        Build CART from root
        """
        self.list_selected_features = self.root.create_with_children(max_cart_depth=self.max_CART_depth,
                                                                     min_observations=self.min_observations)

    def to_xml(self, xml_file_name):
        """
        write Xml file with tree content
        """
        str_nodes = self.root.to_xml()
        with open(xml_file_name, 'w') as file:
            file.write(str_nodes)
        return str_nodes

    def evaluate_input(self, input_torch):
        """
        Evaluate a specific input in the tree and get the predicted class
        """
        return self.root.evaluate_node(input_torch)

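    def update_confusion_matrix(self, estimated_class, real_class):
        """
        Adds one observation to the confusion matrix (rows: estimated class, columns: real class).
        Class labels start at 1, so they are shifted to 0-based indices.
        """
        self.confusion_matrix[int(estimated_class) - 1][int(real_class) - 1] += 1

    def get_f1_scores_per_class(self):
        """
        Computes the F1-score of each class from the confusion matrix
        return a dictionary mapping each class label (starting at 1) to its F1-score
        """
        def get_metrics(matrix, the_class):
            tp = matrix[the_class, the_class]
            fn = torch.sum(matrix[:, the_class]) - tp  # real class is the_class, estimated as another
            fp = torch.sum(matrix[the_class, :]) - tp  # estimated as the_class, real class is another
            tn = torch.sum(matrix) - tp - fp - fn
            return tp, tn, fp, fn
        f1_scores = {}
        for my_class in range(self.confusion_matrix.size(0)):
            tp, tn, fp, fn = get_metrics(self.confusion_matrix, my_class)
            # nan_to_num maps the 0/0 case to 0
            recall = torch.nan_to_num(tp / (fn + tp))
            precision = torch.nan_to_num(tp / (tp + fp))
            f1_score = torch.nan_to_num((2 * recall * precision) / (recall + precision))
            f1_scores[my_class + 1] = f1_score
        return f1_scores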

**e) (10 points)** Implement the function "test_CART", which evaluates a previously trained CART on a dataset $D$ represented as a tensor. Compute the accuracy, defined as:

$a=\frac{c}{n}$

where $c$ is the number of correct estimates for that dataset and $n$ is its total number of observations, and return it. Comment the implementation, detailing each function used in the external documentation.

1. Design and implement at least 2 unit tests for this function.
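The accuracy definition $a = c/n$ can be illustrated with a vectorized comparison of predicted and expected labels. The helper `accuracy_rate` and the label tensors are hypothetical and serve only as an example. Note that the `test_cart` function in this notebook reports the same ratio multiplied by 100.

```python
import torch


def accuracy_rate(predicted: torch.Tensor, expected: torch.Tensor) -> float:
    """Fraction of positions where the predicted label equals the expected one."""
    if expected.numel() == 0:
        raise ValueError("accuracy is undefined for an empty dataset")
    # The element-wise comparison yields a boolean tensor; its sum is c.
    correct = (predicted == expected).sum().item()
    return correct / expected.numel()


# Hypothetical labels: 4 of the 5 predictions are correct.
predicted = torch.tensor([1.0, 2.0, 3.0, 4.0, 1.0])
expected = torch.tensor([1.0, 2.0, 3.0, 3.0, 1.0])
acc = accuracy_rate(predicted, expected)
```

Two easy unit tests follow from this: identical tensors must give 1.0, and tensors that differ in every position must give 0.0.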

In [6]:
def train_cart(dataset_torch, name_xml="", max_cart_depth=3, min_obs_per_leaf=2, gini_entropy_function="GINI"):
    """
    Train CART model
    """
    tree = CART(dataset_torch=dataset_torch, max_cart_depth=max_cart_depth, min_observations=min_obs_per_leaf,
                gini_entropy_function=gini_entropy_function)
    tree.build_cart()
    if name_xml:
        tree.to_xml(name_xml)
    return tree

In [7]:
def test_cart(tree, testset_torch):
    """
    Test a previously built CART on the given dataset
    return the accuracy as a percentage (0 to 100)
    """
    # Entropy 14 vs 3 | 1490 vs 510
    # Gini 11 vs 6 | 1397 vs 603
    hits = 0
    fails = 0
    for observation in testset_torch:
        expected = observation[-1]
        result, leaf = tree.evaluate_input(observation)
        tree.update_confusion_matrix(result, expected)
        if expected == result:
            leaf.hits += 1
            leaf.update_accuracy()
            hits += 1
        else:
            leaf.fails += 1
            leaf.update_accuracy()
            fails += 1
    accuracy = (hits / (hits + fails)) * 100
    return accuracy

In [8]:
tree = train_cart(dataset_torch, name_xml="testCART.xml", max_cart_depth=3, min_obs_per_leaf=2, gini_entropy_function="GINI")

print("Tree accuracy is: ", test_cart(tree=tree, testset_torch=dataset_torch))

Tree accuracy is:  95.25


## **2. CART evaluation (40 points)**

1. (20 points) Evaluate the implemented CART with the full dataset provided at https://www.kaggle.com/yashsawarn/wifi-stretgth-for-rooms using it as both the training and the test set. Report the accuracy and the average F1-score over all classes, and include the evaluation code. Test with a maximum depth of 3 and 4 nodes, always with a minimum of 2 observations per leaf.

    a) Do the above using both the entropy and the Gini coefficient. Compare and comment on the results.
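The average F1-score over all classes can be obtained from a per-class dictionary such as the one returned by `get_f1_scores_per_class`. The sketch below assumes an unweighted (macro) average; the helper name `macro_f1` and the example scores are hypothetical and are not results from this notebook.

```python
import torch


def macro_f1(f1_scores: dict) -> float:
    """Unweighted mean of the per-class F1-scores stored in a dictionary."""
    # Each value may be a 0-d tensor or a plain number; stack them into one tensor.
    values = torch.stack([torch.as_tensor(v, dtype=torch.float64)
                          for v in f1_scores.values()])
    return values.mean().item()


# Hypothetical per-class scores, keyed by class label.
example_scores = {1: torch.tensor(1.0), 2: torch.tensor(0.5),
                  3: torch.tensor(0.75), 4: torch.tensor(0.75)}
example_macro_f1 = macro_f1(example_scores)
```

A macro average gives every class the same weight regardless of how many observations it has, which is one reasonable reading of "average F1-score over all classes".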

In [9]:
def evaluate_tree_full_dataset(max_cart_depth, gini_entropy_function, min_obs_per_leaf=2):
    dataset = read_dataset(csv_name="wifi_localization.txt")
    tree = train_cart(dataset, max_cart_depth=max_cart_depth, gini_entropy_function=gini_entropy_function,
                      min_obs_per_leaf=min_obs_per_leaf)
    overall_accuracy = test_cart(tree, dataset)
    f1_scores = tree.get_f1_scores_per_class()
    print(f"Overall accuracy: {overall_accuracy}")
    print(f"F1 scores: {f1_scores}")
    return overall_accuracy, f1_scores

In [10]:
evaluate_tree_full_dataset(3, "GINI")

Overall accuracy: 95.25
F1 scores: {1: tensor(0.9940), 2: tensor(0.9198), 3: tensor(0.9070), 4: tensor(0.9889)}


(95.25,
 {1: tensor(0.9940), 2: tensor(0.9198), 3: tensor(0.9070), 4: tensor(0.9889)})

In [11]:
evaluate_tree_full_dataset(4, "GINI")

Overall accuracy: 96.55
F1 scores: {1: tensor(0.9950), 2: tensor(0.9435), 3: tensor(0.9333), 4: tensor(0.9900)}


(96.55,
 {1: tensor(0.9950), 2: tensor(0.9435), 3: tensor(0.9333), 4: tensor(0.9900)})

In [12]:
evaluate_tree_full_dataset(3, "ENTROPY")

Overall accuracy: 94.85
F1 scores: {1: tensor(0.9940), 2: tensor(0.9140), 3: tensor(0.8961), 4: tensor(0.9890)}


(94.85,
 {1: tensor(0.9940), 2: tensor(0.9140), 3: tensor(0.8961), 4: tensor(0.9890)})

In [13]:
evaluate_tree_full_dataset(4, "ENTROPY")

Overall accuracy: 95.89999999999999
F1 scores: {1: tensor(0.9940), 2: tensor(0.9248), 3: tensor(0.9247), 4: tensor(0.9920)}


(95.89999999999999,
 {1: tensor(0.9940), 2: tensor(0.9248), 3: tensor(0.9247), 4: tensor(0.9920)})

2. (20 points) For a maximum depth of 3 and 4 nodes: evaluate the implemented CART using 10 random partitions of the dataset, with 70% of the data as the training set and the remaining 30% as the test set. Report a table with the accuracy and the average F1-score over all classes for each of the 10 runs, along with the mean and standard deviation over the 10 runs.

    a) Do the above using both the entropy and the Gini coefficient. Compare the results and discuss the possible advantages and disadvantages of each error function. In the comparison, include the time taken to train each model with each error function.
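The mean, standard deviation and training time requested above could be collected with helpers along these lines. This is a sketch using only the standard library; the names `summarize_runs` and `timed` are hypothetical, the accuracies are made-up values, and the sample standard deviation (denominator $n-1$) is an assumption about which estimator is wanted.

```python
import statistics
import time


def summarize_runs(values):
    """Return (mean, sample standard deviation) of a list of per-run metrics."""
    return statistics.mean(values), statistics.stdev(values)


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) measured with perf_counter."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical accuracies from three runs, for illustration only.
toy_accuracies = [90.0, 92.0, 94.0]
toy_mean, toy_std = summarize_runs(toy_accuracies)

# Timing any callable; here a trivial stand-in for a training call.
toy_result, toy_elapsed = timed(sum, range(1000))
```

Wrapping each training call in `timed` would give one duration per run, which can then be summarized with `summarize_runs` next to the accuracy and F1 columns.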

In [14]:
def split_dataset(csv_name='wifi_localization.txt'):
    data_frame = pandas.read_table(csv_name, sep=r'\s+', names=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'ROOM'),
                                   dtype={'A': np.int64, 'B': np.float64, 'C': np.float64, 'D': np.float64,
                                          'E': np.float64, 'F': np.float64, 'G': np.float64, 'ROOM': np.float64})
    shuffled_values = data_frame.sample(frac=1).reset_index(drop=True).values
    split_index = int(len(shuffled_values) * 0.7)
    train_set = shuffled_values[:split_index]
    test_set = shuffled_values[split_index:]
    return torch.tensor(train_set), torch.tensor(test_set)


def evaluate_train_test_dataset_tree(train_dataset, test_dataset, max_cart_depth, gini_entropy_function, min_obs_per_leaf=2):
    tree = train_cart(train_dataset, max_cart_depth=max_cart_depth, gini_entropy_function=gini_entropy_function,
                      min_obs_per_leaf=min_obs_per_leaf)
    overall_accuracy = test_cart(tree, test_dataset)
    f1_scores = tree.get_f1_scores_per_class()
    print(f"Overall accuracy: {overall_accuracy}")
    print(f"F1 scores: {f1_scores}")
    return overall_accuracy, f1_scores

In [15]:
partitions = [split_dataset() for x in range(10)]
results_gini = {}
results_entropy = {}

In [16]:
%%time
for index, partition in enumerate(partitions):
    print("Run",index + 1)
    results_gini[f"{index}, 3"] = evaluate_train_test_dataset_tree(partition[0], partition[1], max_cart_depth=3, gini_entropy_function="GINI")

Run 1


Overall accuracy: 94.16666666666667
F1 scores: {1: tensor(0.9899), 2: tensor(0.8919), 3: tensor(0.8842), 4: tensor(0.9938)}
Run 2
Overall accuracy: 93.33333333333333
F1 scores: {1: tensor(0.9931), 2: tensor(0.8867), 3: tensor(0.8721), 4: tensor(0.9865)}
Run 3
Overall accuracy: 91.5
F1 scores: {1: tensor(0.9772), 2: tensor(0.8529), 3: tensor(0.8554), 4: tensor(0.9730)}
Run 4
Overall accuracy: 93.33333333333333
F1 scores: {1: tensor(1.), 2: tensor(0.8776), 3: tensor(0.8765), 4: tensor(0.9865)}
Run 5
Overall accuracy: 93.83333333333333
F1 scores: {1: tensor(0.9866), 2: tensor(0.8940), 3: tensor(0.8647), 4: tensor(0.9940)}
Run 6
Overall accuracy: 95.5
F1 scores: {1: tensor(0.9936), 2: tensor(0.9172), 3: tensor(0.9091), 4: tensor(0.9936)}
Run 7
Overall accuracy: 93.83333333333333
F1 scores: {1: tensor(0.9905), 2: tensor(0.8926), 3: tensor(0.8837), 4: tensor(0.9859)}
Run 8
Overall accuracy: 95.16666666666667
F1 scores: {1: tensor(0.9969), 2: tensor(0.9153), 3: tensor(0.9030), 4: tensor(0.989

In [17]:
%%time
for index, partition in enumerate(partitions):
    print("Run",index+1)
    results_gini[f"{index}, 4"] = evaluate_train_test_dataset_tree(partition[0], partition[1], max_cart_depth=4, gini_entropy_function="GINI")

Run 1
Overall accuracy: 95.16666666666667
F1 scores: {1: tensor(0.9899), 2: tensor(0.9071), 3: tensor(0.9103), 4: tensor(0.9938)}
Run 2
Overall accuracy: 93.5
F1 scores: {1: tensor(0.9896), 2: tensor(0.8809), 3: tensor(0.8902), 4: tensor(0.9832)}
Run 3
Overall accuracy: 93.66666666666667
F1 scores: {1: tensor(0.9901), 2: tensor(0.9024), 3: tensor(0.8800), 4: tensor(0.9733)}
Run 4
Overall accuracy: 94.16666666666667
F1 scores: {1: tensor(1.), 2: tensor(0.9040), 3: tensor(0.8814), 4: tensor(0.9865)}
Run 5
Overall accuracy: 95.0
F1 scores: {1: tensor(0.9833), 2: tensor(0.9155), 3: tensor(0.8975), 4: tensor(0.9940)}
Run 6
Overall accuracy: 95.0
F1 scores: {1: tensor(0.9936), 2: tensor(0.9046), 3: tensor(0.9010), 4: tensor(0.9936)}
Run 7
Overall accuracy: 94.83333333333334
F1 scores: {1: tensor(0.9874), 2: tensor(0.9161), 3: tensor(0.9028), 4: tensor(0.9859)}
Run 8
Overall accuracy: 96.33333333333334
F1 scores: {1: tensor(0.9969), 2: tensor(0.9357), 3: tensor(0.9299), 4: tensor(0.9892)}
Run

In [18]:
%%time
for index, partition in enumerate(partitions):
    print("Run",index+1)
    results_entropy[f"{index}, 3"] = evaluate_train_test_dataset_tree(partition[0], partition[1], max_cart_depth=3, gini_entropy_function="ENTROPY")

Run 1
Overall accuracy: 94.16666666666667
F1 scores: {1: tensor(0.9899), 2: tensor(0.8919), 3: tensor(0.8842), 4: tensor(0.9938)}
Run 2
Overall accuracy: 93.33333333333333
F1 scores: {1: tensor(0.9931), 2: tensor(0.8867), 3: tensor(0.8721), 4: tensor(0.9865)}
Run 3
Overall accuracy: 93.66666666666667
F1 scores: {1: tensor(0.9740), 2: tensor(0.9060), 3: tensor(0.8800), 4: tensor(0.9864)}
Run 4
Overall accuracy: 93.0
F1 scores: {1: tensor(1.), 2: tensor(0.8834), 3: tensor(0.8562), 4: tensor(0.9865)}
Run 5
Overall accuracy: 93.83333333333333
F1 scores: {1: tensor(0.9866), 2: tensor(0.8940), 3: tensor(0.8647), 4: tensor(0.9940)}
Run 6
Overall accuracy: 95.5
F1 scores: {1: tensor(0.9936), 2: tensor(0.9172), 3: tensor(0.9091), 4: tensor(0.9936)}
Run 7
Overall accuracy: 94.83333333333334
F1 scores: {1: tensor(0.9842), 2: tensor(0.9196), 3: tensor(0.9028), 4: tensor(0.9859)}
Run 8
Overall accuracy: 95.16666666666667
F1 scores: {1: tensor(0.9969), 2: tensor(0.9153), 3: tensor(0.9030), 4: tensor

In [19]:
%%time
for index, partition in enumerate(partitions):
    print("Run",index+1)
    results_entropy[f"{index}, 4"] = evaluate_train_test_dataset_tree(partition[0], partition[1], max_cart_depth=4, gini_entropy_function="ENTROPY")

Run 1
Overall accuracy: 94.16666666666667
F1 scores: {1: tensor(0.9899), 2: tensor(0.8873), 3: tensor(0.8889), 4: tensor(0.9938)}
Run 2
Overall accuracy: 93.0
F1 scores: {1: tensor(0.9790), 2: tensor(0.8809), 3: tensor(0.8902), 4: tensor(0.9733)}
Run 3
Overall accuracy: 93.0
F1 scores: {1: tensor(0.9740), 2: tensor(0.8881), 3: tensor(0.8718), 4: tensor(0.9864)}
Run 4
Overall accuracy: 94.33333333333334
F1 scores: {1: tensor(1.), 2: tensor(0.8986), 3: tensor(0.8944), 4: tensor(0.9865)}
Run 5
Overall accuracy: 95.16666666666667
F1 scores: {1: tensor(0.9833), 2: tensor(0.9132), 3: tensor(0.9073), 4: tensor(0.9940)}
Run 6
Overall accuracy: 95.0
F1 scores: {1: tensor(0.9936), 2: tensor(0.9046), 3: tensor(0.9010), 4: tensor(0.9936)}
Run 7
Overall accuracy: 94.66666666666667
F1 scores: {1: tensor(0.9842), 2: tensor(0.9172), 3: tensor(0.8982), 4: tensor(0.9859)}
Run 8
Overall accuracy: 96.16666666666667
F1 scores: {1: tensor(0.9969), 2: tensor(0.9288), 3: tensor(0.9297), 4: tensor(0.9892)}
Run

3. (15 extra points) Using Optuna or Weights and Biases, optimize the maximum node depth. Show the plots of the optimization produced by the tool.

In [20]:
def objective(trial):
    dataset = read_dataset(csv_name="wifi_localization.txt")
    max_depth = trial.suggest_int('max_depth', 1, 8)
    tree = train_cart(dataset, max_cart_depth=max_depth, gini_entropy_function="ENTROPY")
    accuracy = test_cart(tree, dataset)
    return accuracy

In [21]:
%%time
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=36)

lowest_depth = None
for trial in study.best_trials:
    depth = trial.params['max_depth']
    lowest_depth = depth if lowest_depth is None or depth < lowest_depth else lowest_depth

print(f"Best parameters: {lowest_depth}")
print(f"Best accuracy: {study.best_value}%")

[I 2024-04-07 15:41:22,961] A new study created in memory with name: no-name-e4252b96-26b8-4d46-a67e-6ec06c506940
[I 2024-04-07 15:41:25,922] Trial 0 finished with value: 97.15 and parameters: {'max_depth': 6}. Best is trial 0 with value: 97.15.
[I 2024-04-07 15:41:28,748] Trial 1 finished with value: 96.89999999999999 and parameters: {'max_depth': 5}. Best is trial 0 with value: 97.15.
[I 2024-04-07 15:41:30,798] Trial 2 finished with value: 94.85 and parameters: {'max_depth': 3}. Best is trial 0 with value: 97.15.
[I 2024-04-07 15:41:33,689] Trial 3 finished with value: 97.15 and parameters: {'max_depth': 8}. Best is trial 0 with value: 97.15.
[I 2024-04-07 15:41:35,191] Trial 4 finished with value: 93.8 and parameters: {'max_depth': 2}. Best is trial 0 with value: 97.15.
[I 2024-04-07 15:41:37,773] Trial 5 finished with value: 95.89999999999999 and parameters: {'max_depth': 4}. Best is trial 0 with value: 97.15.
[I 2024-04-07 15:41:39,256] Trial 6 finished with value: 93.8 and param

Best parameters: 6
Best accuracy: 97.15%
CPU times: total: 3min 24s
Wall time: 1min 35s


In [22]:
import plotly

fig = optuna.visualization.plot_optimization_history(study)
fig.update_layout(xaxis_title='Trial', yaxis_title='Accuracy')
fig.show()