## Données
Les données à utiliser pour ce TP se trouvent dans les fichiers data.csv et data_test.csv.
Les 5 premières colonnes spécifient les variables indépendantes tandis que la dernière colonne correspond à la variable dépendante (label).
Le fichier data_test.csv sert à évaluer les performances des arbres développés à partir des données du fichier data.csv.

J'ai défini la fonction suivante afin de convertir un fichier csv en np.array:

In [1]:
import csv
import numpy as np
import random


def convert_csv2array(name: str) -> tuple[np.ndarray, np.ndarray]:
    """
    Convert a csv file into numpy array

    Parameters
    ----------
    name: string
        Path to the file

    Return value
    ------------
    header and data of csv
    """
    file = open(name, 'r')
    data = []
    reader = csv.reader(file)
    for line in reader:
        data.append(line)
    data = np.array(data)
    return data[0, :], data[1:, :].astype(int)

header, data = convert_csv2array('data.csv')

# Exercice 1: Entropie et gain d’information

1. Calculer l’entropie de la variable dépendante.

In [2]:
def entropy(data: np.ndarray, column: int) -> float:
    """
    Give the entropy of the given column of data

    Parameters
    ----------
    data: np.ndarray
        Data
    column: int
        column on which we want to compute entropy

    Return value
    ------------
    The entropy of the given column of data 
    """
    values, count = np.unique(data[:, column], return_counts=True)
    result = 0
    for i in range(len(values)):
        nb = count[i] / len(data)
        result += -nb * np.log2(nb)
    return result

print(f"Entropy of the independant variable: {entropy(data, 5)}")

Entropy of the independant variable: 0.9738003694382252


2. Calculer le gain d’information réalisé après l’application de trois critères de décision aléatoires.

In [3]:
def conditionnal_entropy(data: np.ndarray, column_1: int, column_2: int) -> float:
    """
    Compute conditionnal entropy H(X|Y)

    Parameters
    ----------
    data: np.ndarray
        Data used to compute conditionnal entropy
    column_1: int
        Column of X
    column_2: int
        Column of Y

    Return value
    ------------
    Return the result of H(X|Y)
    """
    values, count = np.unique(data[:, column_2], return_counts=True)
    result = 0
    for i in range(len(values)):
        newdata = data[data[:, column_2] == values[i]]
        nb = count[i] / len(data)
        result += nb * entropy(newdata, column_1)
    return result

def mutual_information(data: np.ndarray, column_1: int, column_2: int) -> float:
    """
    Compute mutual information I(X; Y)
    
    The gain in information is represented by the mutual information between 2 random variables

    Parameters
    ----------
    data: np.ndarray
        Data used to compute mutual information
    column_1: int
        column of X
    column_2: int
        column of Y

    Return value
    ------------
    Return the result of I(X; Y) = H(X) - H(X|Y)
    """
    return entropy(data, column_1) - conditionnal_entropy(data, column_1, column_2)

# Choose the criteria randomly
randnumber_1 = random.randint(0, 4)
randnumber_2 = random.randint(0, 4)
randnumber_3 = random.randint(0, 4)
while randnumber_1 == randnumber_2:
    randnumber_2 = random.randint(0, 4)
while randnumber_3 == randnumber_1 or randnumber_3 == randnumber_2:
    randnumber_3 = random.randint(0, 4)

# Compute information gain for each criteria
randnumbers = [randnumber_1, randnumber_2, randnumber_3]
gain_information = []
for i in range(3):
    gain_information.append(mutual_information(data, 5, randnumbers[i]))
    print(f"For the criteria {header[randnumbers[i]]}, the gain of information is: {gain_information[i]}")

For the criteria B, the gain of information is: 0.1520457522567905
For the criteria D, the gain of information is: 0.16681614778733878
For the criteria A, the gain of information is: 0.02894167191284036


3. Pour les mêmes critères de décision, calculer l’index Gini.

In [4]:
def gini(data: np.ndarray, column: int) -> float:
    """
    Compute the gini index

    Parameters
    ----------
    data: np.ndarray
        Data used to compute gini index
    column: int
        column used to compute gini index

    Return value
    ------------
    Return the index gini
    """
    values, count = np.unique(data[:, column], return_counts=True)
    result = 0
    for i in range(len(values)):
        nb = count[i] / len(data)
        result += nb * nb
    return 1 - result

gini_information = []
for i in range(3):
    gini_information.append(gini(data, randnumbers[i]))
    print(f"For the criteria {header[randnumbers[i]]}, the gini index is: {gini_information[i]}")

For the criteria B, the gini index is: 0.5886499999999999
For the criteria D, the gini index is: 0.7758499999999999
For the criteria A, the gini index is: 0.5733999999999999


4. Quel est le critère de décision préférable selon le gain d’information ? Selon l’index Gini ?

On cherche à maximiser le gain d'information ou à minimiser l'index Gini.


Le critère qui maximise le gain d'information n'est pas forcément égal au critère qui minimise l'index Gini.
On a les résultats suivants:

In [5]:
index_of_max_gain = gain_information.index(max(gain_information))
index_of_min_gini = gini_information.index(min(gini_information))
print(f"The criteria of decision which is preferable according to information gain is the criteria {header[randnumbers[index_of_max_gain]]}")
print(f"The criteria of decision which is preferable according to gini index is the criteria {header[randnumbers[index_of_min_gini]]}")

The criteria of decision which is preferable according to information gain is the criteria D
The criteria of decision which is preferable according to gini index is the criteria A


C'est pourquoi on verra dans la suite du travail pratique que un arbre construit avec l'index Gini est différent d'un arbre construit avec le gain d'information.

# Exercice 2: ID3

1. Implémenter l’algorithme ID3 avec comme critères possibles le gain d’information et l’index Gini.

In [6]:
def id3(data: np.ndarray, header: np.ndarray, index_data_to_train: int, use_gini: bool = False):
    """
    This function compute a decision tree from a set of data

    Parameters
    ----------
    data: np.ndarray
        Data used to compute the decision tree
    header: np.ndarray
        Header used to register nodes of the tree
    index_data_to_train: int
        Indicate which column of our data we want to train the decision tree for
    use_gini: bool
        Inidicate if we want to compute the tree thanks to index gini.
        Default computation is made with information gain/mutual information
        So the default value is set to False
    """
    # Get the number of column of data
    n = len(data[0])

    # Case 1 column: count number of each element and return priority element
    if n == 1:
        values, count = np.unique(data[:], return_counts=True)
        maximum = 0
        index = 0
        for i in range(len(values)):
            if count[i] > maximum:
                maximum = count[i]
                index = i
        return int(values[i])

    # Test if there is some different values for the column to train
    # If not, return the value
    test_unique = np.unique(data[:, index_data_to_train])
    if len(test_unique) == 1:
        return int(test_unique[0])

    # Compute mutual information between the column to train and each other column
    # The column choosen maximise the mutual information
    # In case we want to use gini index, we want to minimise the gini index.
    maximum = 0
    minimum = 1
    index = 0
    for i in range(n):
        if i != index_data_to_train:
            if use_gini:
                info = gini(data, i)
                if info < minimum:
                    minimum = info
                    index = i
            else:
                info = mutual_information(data, index_data_to_train, i)
                if info > maximum:
                    maximum = info
                    index = i

    # Get all the values of the column choosen.
    # For each values, compute id3 on a new data with just the line
    # where the value of the column choosen is equal to the selected value
    # and with the column choosen deleted.
    # Compute a tree thanks to the results of id3 for each new data
    values = np.unique(data[:, index])
    tree = {}
    for i in values:
        newdata = np.delete(data[data[:, index] == i], index, 1)
        newheader = np.delete(header, index, 0)
        if (index < index_data_to_train):
            tree[int(i)] = id3(newdata, newheader, index_data_to_train - 1, use_gini)
        else:
            tree[int(i)] = id3(newdata, newheader, index_data_to_train, use_gini)
    
    # Return the tree obtained
    return {header[index]: tree}

tree = id3(data, header, 5)
tree_gini = id3(data, header, 5, True)

assert tree == tree_gini, "The decision tree obtained thanks to information gain is different from decision tree obtained thanks to gini index"


AssertionError: The decision tree obtained thanks to information gain is different from decision tree obtained thanks to gini index

2. Comparer l’arbre obtenu à l’aide d’ID3 gain d’information avec celui produit par la démonstration.

Afin de savoir si les arbres obtenus sont identiques, je vais mesurer différentes informations comme la profondeur maximale de l'arbre, et choisir différents chemins pour voir si par hasard les chemins obtenus avec mon arbre sont identiques aux chemins obtenus avec l'arbre obtenu de la démonstration.

Tout d'abord, on a la profondeur maximale de l'arbre de la démonstration qui est donné par 4 critères pour arriver à une décision.
Pour mesurer la profondeur de mon arbre, j'ai implémenté les fonctions suivantes:

In [7]:
print("My decision tree")
print(tree)

def compute_max_path(tree: dict) -> dict[str, int]:
    """
    This function go throught the decision tree and return the maximum path from root node to leaf

    Parameters
    ----------
    tree: dictionary
        A decision tree
    
    Return value
    ------------
    The maximum path of the tree
    """
    if type(tree) != int and len(list(tree.keys())) != 0:
        item = list(tree.keys())[0]
        tree = tree[item]
        if type(tree) != int:
            keys = list(tree.keys())
            maximum = 0
            for i in range(len(keys)):
                mypath = compute_max_path(tree) + 1
                if mypath > maximum:
                    maximum = mypath
        else:
            return 1
    return maximum

print(f"Depth of my tree: {compute_max_path(tree)}")

assert compute_max_path(tree) == 4, "Depth of trees must be equal so that the trees are similar"


My decision tree
{'E': {0: {'C': {0: 0, 1: 1, 2: {'A': {0: 1, 1: {'B': {0: 0, 1: 1}}}}, 3: 1}}, 1: {'D': {0: {'A': {1: 0, 2: 1}}, 1: {'B': {0: {'A': {0: 0, 2: 1}}, 1: 0, 2: {'C': {0: 1, 1: 0}}}}, 2: {'A': {0: 0, 2: {'B': {0: 1, 1: 0}}}}, 3: 0, 4: 0}}, 2: {'D': {0: 1, 1: {'B': {0: 1, 1: 0, 2: {'C': {0: 1, 1: 0}}}}, 2: 0, 3: 0, 4: 0}}}}
Depth of my tree: 4


On peut constater que notre arbre a la même profondeur que l'arbre donné dans la démonstration.
Donc on est en bonne voie pour avoir deux arbres identiques.

Maintenant, on va prendre une dizaine de données et voir si les chemins dans les arbres sont les mêmes pour ces 10 données.
On choisit les données suivantes:

In [8]:
for i in range(20):
    print(data[i])

[0 1 3 1 1 0]
[0 1 2 1 2 0]
[0 0 3 4 0 1]
[0 1 1 3 1 0]
[0 0 1 3 0 1]
[0 0 2 3 1 0]
[1 1 1 3 0 1]
[0 0 2 4 1 0]
[2 1 2 2 1 0]
[1 2 1 3 0 1]
[2 1 2 1 2 0]
[0 1 1 1 2 0]
[0 1 3 3 1 0]
[2 2 3 0 1 1]
[0 1 1 4 2 0]
[2 2 3 0 1 1]
[0 1 2 2 2 0]
[0 1 2 4 1 0]
[0 0 1 4 1 0]
[2 2 2 0 1 1]


Voici les chemins obtenus pour chacune des données:

In [9]:
def get_path(tree: dict, data: list, header: list) -> list:
    """
    This function go throught the decision tree and return a random path from root node to leaf

    Parameters
    ----------
    tree: dictionary
        A decision tree
    
    Return value
    ------------
    A dictionnary with the random path created
    """
    result = []
    header = list(header)
    while type(tree) != int and len(list(tree.keys())) != 0:
        item = list(tree.keys())[0]
        result.append(item)
        tree = tree[item]
        if type(tree) != int:
            tree = tree[data[header.index(item)]]
    return result
mytree_paths = []
for i in range(20):
    mytree_paths.append(get_path(tree, data[i], header))
    print(f"{data[i]} -> {mytree_paths[i]}")

[0 1 3 1 1 0] -> ['E', 'D', 'B']
[0 1 2 1 2 0] -> ['E', 'D', 'B']
[0 0 3 4 0 1] -> ['E', 'C']
[0 1 1 3 1 0] -> ['E', 'D']
[0 0 1 3 0 1] -> ['E', 'C']
[0 0 2 3 1 0] -> ['E', 'D']
[1 1 1 3 0 1] -> ['E', 'C']
[0 0 2 4 1 0] -> ['E', 'D']
[2 1 2 2 1 0] -> ['E', 'D', 'A', 'B']
[1 2 1 3 0 1] -> ['E', 'C']
[2 1 2 1 2 0] -> ['E', 'D', 'B']
[0 1 1 1 2 0] -> ['E', 'D', 'B']
[0 1 3 3 1 0] -> ['E', 'D']
[2 2 3 0 1 1] -> ['E', 'D', 'A']
[0 1 1 4 2 0] -> ['E', 'D']
[2 2 3 0 1 1] -> ['E', 'D', 'A']
[0 1 2 2 2 0] -> ['E', 'D']
[0 1 2 4 1 0] -> ['E', 'D']
[0 0 1 4 1 0] -> ['E', 'D']
[2 2 2 0 1 1] -> ['E', 'D', 'A']


Voici les chemins obtenus pour chacune des données dans l'arbre de décision de démonstration:

In [10]:
demo_paths = [
    ['E', 'D', 'B'],
    ['E', 'D', 'B'],
    ['E', 'C'],
    ['E', 'D'],
    ['E', 'C'],
    ['E', 'D'],
    ['E', 'C'],
    ['E', 'D'],
    ['E', 'D', 'A', 'B'],
    ['E', 'C'],
    ['E', 'D', 'B'],
    ['E', 'D', 'B'],
    ['E', 'D'],
    ['E', 'D', 'A'],
    ['E', 'D'],
    ['E', 'D', 'A'],
    ['E', 'D'],
    ['E', 'D'],
    ['E', 'D'],
    ['E', 'D', 'A']
]

assert demo_paths == mytree_paths, "Error: paths are different for given values"

On peut donc remarquer que les chemins sont les mêmes pour les mêmes données de nos deux arbres.

De plus, si on compare les deux arbres, on a 'E' qui est le noeud racine dans les 2 arbres.
Si on a une valeur de 0, on entre dans le sous arbre de noeud racine 'C' pour les 2 arbres.
Puis, si on a une valeur de 0, dans les 2 arbres on obtient une décision de 0,
et si on a une valeur de 1, on obtient une décision de 1, et si on a une valeur de 2, on obtient une décision d'un sous-arbre de sommet 'A' dans les deux arbres, et si on a une valeur de 3, on obtient une valeur de 1 dans les 2 arbres.

Dans les deux arbres, le sous-arbre 'A' possède la valeur 0 qui ammène au choix 1, et la valeur 1 qui ammène au sous-arbre de sommet 'B'.

Le sous arbre de sommet 'B' possède 2 valeurs: une valeur 1 menant à un choix de 1, et une valeur 0 menant à un choix de 0 dans les deux arbres.

Donc la partie vérifiée de l'arbre de la démonstration est identique à la partie véfiriée de mon arbre.

Donc on a les deux arbres qui ont la même profondeur, qui ont les mêmes chemins emprunté pour les mêmes données, et qui possèdent une structure qui est en partie la même (je n'ai pas fait explicitement l'analyse de la structure des arbres pour le reste des arbres, mais c'est la même chose.)

Tout ces éléments nous laissent supposer avec très forte raison que les deux arbres sont identiques.

3. Implémenter une procédure de génération de données à partir d’un arbre de décision.

In [11]:
def gen_data(tree: dict, item_unknow: str) -> dict[str, int]:
    """
    This function go throught the decision tree and return a random path from root node to leaf

    Parameters
    ----------
    tree: dictionary
        A decision tree
    item_unknow: str
        Name of the unknow item
    
    Return value
    ------------
    A dictionnary with the random path created
    """
    data = {}
    mytree = tree
    while type(mytree) != int and len(list(mytree.keys())) != 0:
        item = list(mytree.keys())[0]
        mytree = mytree[item]
        if type(mytree) != int:
            keys = list(mytree.keys())
            randnumber = random.randint(0, len(keys) - 1)
            data[item] = keys[randnumber]
            mytree = mytree[keys[randnumber]]
        else:
            data[item_unknow] = mytree
    if type(mytree) == int:
        data[item_unknow] = mytree

    return data

def complete_gen_data(tree: dict, item_unknow: str, header: list, definition_domain: list) -> np.ndarray:
    """
    Allow to complete datas generated by a path in the tree by adding missing fields
    and random values taken from the definition domain of this fields for each missing field

    Parameters
    ----------
    tree: dictionary
        A decision tree
    item_unknow: str
        Name of the unknow item
    header: np.array
        Name of each criteria
    definition_domain: list
        Definition domain of each criteria

    Return value
    ------------
    Return a np.array with the value assignated for each criteria in the order given by the header
    """
    data = gen_data(tree, item_unknow)
    for i in range(len(header)):
        item = header[i]
        if item not in data:
            values = definition_domain[i]
            random_number = random.randint(0, len(definition_domain[i]) - 1)
            data[item] = values[random_number]
    newdata = []
    for i in range(len(header)):
        newdata.append(data[header[i]])
    return np.array(newdata)

def gen_multiple_datas(tree: dict, item_unknow: str, header: list, definition_domain: list, number: int) -> np.ndarray:
    """
    Allow to generate multiple datas from a tree

    Parameters
    ----------
    tree: dictionary
        A decision tree
    item_unknow:
        Name of the unknow item
    header: list
        Name of the criteria
    definition_domain: list
        Definition domain of each criteria
    number: int
        Number of data which would be generated

    Return value
    ------------
    Return a np.array of all the datas generated thanks to the decision tree
    """
    result = []
    for i in range(number):
        result.append(complete_gen_data(tree, item_unknow, header, definition_domain))
    return np.array(result)

# Create the definition domain of each criteria
definition_domain = []
for i in range(len(header)):
    definition_domain.append(np.unique(data[i]))

# Example of execution of the function to generate datas
print(gen_multiple_datas(tree, 'c', header, definition_domain, 100))


[[3 0 0 0 2 1]
 [1 1 4 0 2 1]
 [0 0 0 1 0 0]
 [0 2 2 0 0 1]
 [0 2 1 1 0 1]
 [1 1 1 1 0 1]
 [3 1 3 3 2 0]
 [0 1 2 3 0 1]
 [0 2 1 3 0 1]
 [0 2 1 1 1 0]
 [0 2 0 2 2 0]
 [0 0 0 2 1 0]
 [1 1 2 0 0 1]
 [0 2 2 1 0 1]
 [1 0 2 3 0 0]
 [0 1 1 4 2 0]
 [0 1 0 3 0 0]
 [1 1 0 4 1 0]
 [0 2 3 0 0 1]
 [0 1 1 2 2 0]
 [3 2 3 2 2 0]
 [1 1 3 1 0 1]
 [3 2 4 0 2 1]
 [1 0 4 3 1 0]
 [0 1 1 1 1 0]
 [1 2 0 3 2 0]
 [0 0 4 2 1 0]
 [0 2 1 4 1 0]
 [0 2 2 1 0 1]
 [0 2 2 1 0 1]
 [0 1 4 2 1 0]
 [2 0 4 2 1 1]
 [3 0 4 3 1 0]
 [3 0 3 1 2 1]
 [2 0 3 0 1 1]
 [1 2 3 3 2 0]
 [1 2 0 0 0 0]
 [0 2 0 4 2 0]
 [1 1 0 1 0 0]
 [3 1 1 1 1 0]
 [0 0 0 3 2 0]
 [0 2 4 2 1 0]
 [1 1 3 4 1 0]
 [0 2 1 0 2 1]
 [1 0 3 0 0 1]
 [0 0 0 1 1 0]
 [0 2 4 3 2 0]
 [3 0 0 1 2 1]
 [0 1 3 1 0 1]
 [0 0 0 3 1 0]
 [1 0 4 2 2 0]
 [1 2 4 0 1 0]
 [3 0 0 0 2 1]
 [0 2 2 0 0 1]
 [1 1 2 0 0 1]
 [1 0 3 1 0 1]
 [2 2 4 0 1 1]
 [1 0 2 1 0 0]
 [3 0 1 0 0 1]
 [1 1 0 0 0 0]
 [1 0 0 2 2 0]
 [0 1 3 2 1 0]
 [1 0 4 1 2 1]
 [1 2 0 0 0 0]
 [3 0 0 4 1 0]
 [1 1 0 3 2 0]
 [0 2 2 0 

4. À l’aide d’ID3 gain d’information, construire 5 arbres à partir d’échantillons aléatoires de 80% des données et utiliser comme prédiction finale un vote de majorité.

In [17]:
def get_percentage_of_data(data: np.ndarray, percentage: int) -> np.ndarray:
    """
    Allow to get random sampling of a percentage of datas

    Parameters
    ----------
    data: np.ndarray
        Datas we want a percentage of
    percentage: int
        Percentage of datas we want

    Return value
    ------------
    Return a tuple with the np.array with the right percentage of the datas we want,
    and with the np.array of the datas which weren't taken
    """
    indices = np.arange(len(data))
    indices = np.unique((indices * percentage/100).round().astype(int))
    last_indices = np.arange(len(data))
    mask = np.isin(last_indices, indices).astype(int)
    mask -= 1
    mask[mask < 0] = 1
    last_indices = np.unique(last_indices * mask).astype(int)
    newdata = data.copy()
    np.random.shuffle(newdata)
    return newdata[indices], newdata[last_indices[1:]]

# Construct the five decision trees here
trees = []
training = []
tests = []
for i in range(5):
    data_training, data_tests = get_percentage_of_data(data, 80)
    training.append(data_training)
    tests.append(data_tests)
    trees.append(id3(data_training, header, 5))

5. Comparer les performances du premier arbre obtenu avec celles de l’ensemble de 5 arbres selon les métriques suivantes: accuracy, precision, recall, F1 score.

In [18]:
# Define the labels here used to evaluation
FN, FP, TN, TP = range(4)

def eval(tree: dict, data: np.ndarray, header: np.ndarray, target: int) -> int:
    """
    Allow to eval a specific data, with a target value, thanks to a given decision tree

    Parameters
    ----------
    tree: dictionary
        Decision tree
    data: np.ndarray
        Data we want to evaluate
    header: np.ndarray
        Header of data
    target: int
        target column of decision value of data

    Return value
    ------------
    Return a label in {FN, FP, TN, TP}
    """
    data = np.array(data).astype(int)
    while type(tree) != int:
        keys = list(tree.keys())
        index = np.where(header == keys[0])[0][0]
        tree = tree[header[index]]
        keys = list(tree.keys())
        if data[index] not in keys:
            if data[target] == 0:
                return TN
            return FP
        tree = tree[data[index]]
    if data[target] == 1 and tree == 1:
        return TP
    if data[target] == 0 and tree == 1:
        return FN
    if data[target] == 0 and tree == 0:
        return TN
    if data[target] == 1 and tree == 0:
        return FP

def evaluation(tree: dict, datas: np.ndarray, header: np.ndarray, target: int):
    """
    Evaluate a set of datas for a specific decision tree

    Parameters
    ----------
    tree: dictionary
        Decision tree
    datas: np.ndarray
        Datas we want to evaluate
    header: np.ndarray
        Header of datas
    target: int
        target column of decision value of datas

    Return value
    ------------
    Return a list of the ampiric probability to have respectively FN, FP, TN and TP.
    """
    n = len(datas)
    FN_count = 0
    FP_count = 0
    TN_count = 0
    TP_count = 0
    for i in datas:
        result = eval(tree, i, header, target)
        if result == FN: FN_count += 1
        if result == FP: FP_count += 1
        if result == TN: TN_count += 1
        if result == TP: TP_count += 1
    return FN_count/n, FP_count/n, TN_count/n, TP_count/n

def accuracy(tree: dict, test_datas: np.ndarray, header: np.ndarray, target: int) -> float:
    """
    Return the error rate of the decision tree for a given set of test datas

    Parameters
    ----------
    tree: dictionary
        The decision tree
    test_datas: np.ndarray
        The set of tests data
    header: np.ndarray
        Header of test datas
    target: int
        target column of decision value of test datas

    Return value
    ------------
    Return the probability of making an error in our prediction
    """
    fn, fp, tn, tp = evaluation(tree, test_datas, header, target)
    return (fp + fn)/(tp + tn + fp + fn)

def precision(tree, test_datas: np.ndarray, header: np.ndarray, target: int):
    """
    Return the precision of the decision tree for a given set of test datas

    Parameters
    ----------
    tree: dictionary
        The decision tree
    test_datas: np.ndarray
        The set of tests data
    header: np.ndarray
        Header of test datas
    target: int
        target column of decision value of test datas

    Return value TODO !!!
    ------------
    Return the precision and the recall of the tree
    """
    fn, fp, tn, tp = evaluation(tree, test_datas, header, target)
    return tp / (tp + fp), tp / (tp + fn)

def f1_score(tree, test_datas: np.ndarray, header: np.ndarray, target: int) -> float:
    """
    Return the f1 score of the decision tree for a given set of test datas

    Parameters
    ----------
    tree: dictionary
        The decision tree
    test_datas: np.ndarray
        The set of tests data
    header: np.ndarray
        Header of test datas
    target: int
        target column of decision value of test datas
    
    Return value
    ------------
    Return the f1 score of the tree
    """
    p, r = precision(tree, test_datas, header, target)
    return (2 * p * r) / (p + r)


# Get data test
_, data_test = convert_csv2array('data_test.csv')

# Get metrics of the tree computed with 100% of the datas
accuracy_target = accuracy(tree, data_test, header, 5)
precision_target_p, precision_target_r = precision(tree, data_test, header, 5)
f1_score_target = f1_score(tree, data_test, header, 5)

# Get average metrics of the 5 threes computed with 80% of the datas
average_accuracy = 0
average_precision_p = 0
average_precision_r = 0
average_f1_score = 0
for i in range(len(trees)):
    average_accuracy += accuracy(trees[i], data_test, header, 5)
    p, r = precision(trees[i], data_test, header, 5)
    average_precision_p += p
    average_precision_r += r
    average_f1_score += f1_score(trees[i], data_test, header, 5)

average_accuracy /= len(trees)
average_precision_p /= len(trees)
average_precision_r /= len(trees)
average_f1_score /= len(trees)

# Print results
print("Tree")
print("accuracy: " + str(accuracy_target))
print("precision (p, r): (" + str(precision_target_p) + ", " + str(precision_target_r) + ")")
print(f"F1 score: {f1_score_target}")
print("")

print("Average of 5 trees")
print(f"accuracy: {average_accuracy}")
print(f"precision (p, r): ({average_precision_p}, {average_precision_r})")
print(f"F1 score: {average_f1_score}")


Tree
accuracy: 0.22
precision (p, r): (0.6504065040650407, 0.9876543209876543)
F1 score: 0.7843137254901962

Average of 5 trees
accuracy: 0.265
precision (p, r): (0.6520325203252033, 0.8927274852485038)
F1 score: 0.7515041947305535


On peut donc observer que le taux d'erreur (acurracy) de l'arbre entraîné avec 100% des données est plus faible que la moyenne du taux d'erreur des 5 arbres entraînés avec 80% des données.

Cependant, on peut observer que la précision p de l'arbre est plus petite que la précision p de la moyenne des arbres, ce qui signifie que
$$p_1 = \frac{tp_1}{tp_1 + fp_1} \lt p_2 = \frac{tp_2}{tp_2 + fp_2}$$

Mais on peut observer que l'on a également le recall de l'arbre qui est plus grand que le recall de la moyenne des arbres.
$$r_1 = \frac{tp_1}{tp_1 + fn_1} \gt r_2 = \frac{tp_2}{tp_2 + fn_2}$$

Le $F_1$ score permet alors de comparer les deux métriques obtenues:

$$F_1 = \frac{2}{(\frac{1}{p} + \frac{1}{r})}$$
$$= \frac{2}{\frac{r+p}{rp}}$$
$$= \frac{2rp}{r + p}$$

Dans notre cas, le $F_1$ score de l'arbre entraîné avec toutes les données est plus grand que le $F_1$ score moyen des autres arbres obtenus avec 80% des données.

6. Selon le F1 score, quel modèle devrait être privilégié ?

Notre $F_1$ score peut s'écrire sous la forme:

$$F_1 = \frac{2}{(\frac{1}{p} + \frac{1}{r})}$$
$$= \frac{2}{\frac{tp + fn}{tp} + \frac{tp + fp}{tp}}$$
$$= \frac{tp}{\frac{1}{2}(tp + fn + tp + fp)}$$
$$= \frac{tp}{tp + \frac{1}{2}(fp + fn)}$$

Comme notre but est de minimiser les `false positive (fp)` et les `false negative (fn)`, le meilleur $F_1$ score est le $F_1$ score le plus grand, car plus `(fp + fn)` est grand, plus le $F_1$ score est petit.

Dans notre cas, le $F_1$ score obtenu est plus grand pour l'arbre entraîné avec 100% des données, donc je pense que ce modèle devrait être privilégié.