# Implement Entropy Calculation for Decision Tree

The entropy formula used for a decision tree is given by:
$$E(T) = - \sum_{i=1}^{c} p_i \cdot \log_2(p_i)$$

- $E(T)$ is the entropy of the set $T$ (the data at a node of the tree).
- $c$ is the number of classes.
- $p_i$ is the probability of class $i$ (i.e., the number of elements in class $c_i$ divided by the total number of elements).


**Objective**: Implement a function in Python that calculates the entropy for a given dataset represented as a DataFrame.
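
As a quick worked example of the formula, assume a hypothetical set of 14 elements split into two classes of 9 and 5 elements (these counts are chosen for illustration only). Then $p_1 = 9/14$ and $p_2 = 5/14$, and:
$$E(T) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.4098 + 0.5305 = 0.9403$$

A set containing a single class has entropy $0$, and two equally likely classes give exactly $1$ bit.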

In [1]:
# import modules  
import pandas as pd  
import math

Open the CSV file (_PlayTennis.csv_), store it in a dataframe (_df_), and print its contents.

In [2]:
# making dataframe  
df = pd.read_csv("PlayTennis.csv")  
   
# output the dataframe 
print(df)

   Temperature Humidity   Outlook    Wind Play Tennis
0          Hot     High     Sunny    Weak          No
1          Hot     High     Sunny  Strong          No
2          Hot     High  Overcast    Weak         Yes
3         Mild     High      Rain    Weak         Yes
4         Cool   Normal      Rain    Weak         Yes
5         Cool   Normal      Rain  Strong          No
6         Cool   Normal  Overcast  Strong         Yes
7         Mild     High     Sunny    Weak          No
8         Cool   Normal     Sunny    Weak         Yes
9         Mild   Normal      Rain    Weak         Yes
10        Mild   Normal     Sunny  Strong         Yes
11        Mild     High  Overcast  Strong         Yes
12         Hot   Normal  Overcast    Weak         Yes
13        Mild     High      Rain  Strong          No


**Exercise 1:** Write a Python function called `count_classes` that, given a vector as input, counts the number of unique classes and the occurrences of each class. A dictionary can be used to store the counts.

In [3]:
def count_classes(vector):
    """
    Count the number of unique classes and occurrences in the input vector.

    Parameters:
    - vector: A list or array containing class labels.

    Returns:
    - class_counts: A dictionary where keys are unique classes, and values are their occurrences.
    - num_classes: The total number of unique classes.
    """
    class_counts = {}
    
    for label in vector:
        if label in class_counts:
            class_counts[label] += 1
        else:
            class_counts[label] = 1

    num_classes = len(class_counts)

    return class_counts, num_classes
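
The same counting can be sketched with the standard library's `collections.Counter`. This is only an alternative for comparison, not a replacement for the exercise solution; the name `count_classes_counter` is hypothetical.

```python
from collections import Counter


def count_classes_counter(vector):
    """Return (class_counts, num_classes) using collections.Counter."""
    # Counter tallies each distinct label in a single pass over the vector
    class_counts = dict(Counter(vector))
    return class_counts, len(class_counts)


counts, n = count_classes_counter([1, 2, 1, 3, 2, 1, 3, 4, 4, 4, 5])
print(counts)  # {1: 3, 2: 2, 3: 2, 4: 3, 5: 1}
print(n)       # 5
```

The explicit loop in `count_classes` makes the logic visible, while `Counter` is shorter once the idea is understood.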

Test your function using a sample vector. _Ensure that it is correctly implemented and provides meaningful results._

In [4]:
# Example usage with a sample vector of class labels
test_vector = [1, 2, 1, 3, 2, 1, 3, 4, 4, 4, 5]
class_counts, num_classes = count_classes(test_vector)

print("Class Counts:", class_counts)
print("Number of Unique Classes:", num_classes)

Class Counts: {1: 3, 2: 2, 3: 2, 4: 3, 5: 1}
Number of Unique Classes: 5


**Exercise 2:** Implement the function `calculate_entropy`, which takes a vector as input and computes its entropy using the formula given earlier. Use the `count_classes` function in this implementation.

In [5]:
def calculate_entropy(vector):
    """
    Calculate the entropy of a given vector using the provided formula.

    Parameters:
    - vector: A list or array containing class labels.

    Returns:
    - entropy: The entropy value for the input vector.
    """
    class_counts, num_classes = count_classes(vector)

    # Calculate the probabilities of each class
    class_probabilities = [count / len(vector) for count in class_counts.values()]

    # Calculate entropy using the formula
    entropy = -sum(p * math.log2(p) for p in class_probabilities)

    return entropy
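
One way to sanity-check an entropy implementation is to try inputs whose entropy is known in advance: a single class gives 0, and $c$ equally likely classes give $\log_2(c)$. The sketch below uses a compact, self-contained helper (the name `entropy_bits` is hypothetical) so that it runs on its own.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy in bits of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


pure = ["Yes"] * 6                   # a single class
balanced = ["Yes", "No"] * 3         # two equally likely classes
four_way = ["a", "b", "c", "d"] * 2  # four equally likely classes

print(entropy_bits(pure))      # zero (displayed as -0.0 because of the leading minus sign)
print(entropy_bits(balanced))  # 1.0
print(entropy_bits(four_way))  # 2.0
```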

Test your function using a sample vector. _Ensure that it is correctly implemented and provides meaningful results._

In [6]:
entropy_value = calculate_entropy(test_vector)
print("Entropy of the Vector:", entropy_value)

Entropy of the Vector: 2.2312702546075758


**Exercise 3:** Write a function `calculate_entropy_for_features` that applies `calculate_entropy` to each _feature_ (i.e., each column vector) of the original dataframe (_df_).

In [7]:
def calculate_entropy_for_features(df):
    """
    Calculate entropy for each feature (column) in the DataFrame.

    Parameters:
    - df: The DataFrame containing the dataset.

    Returns:
    - feature_entropies: A dictionary where keys are feature names, and values are their corresponding entropies.
    """
    feature_entropies = {}

    for column in df.columns[:-1]:  # Exclude the last column assuming it's the target variable
        feature_vector = df[column]
        entropy_value = calculate_entropy(feature_vector)
        feature_entropies[column] = entropy_value

    return feature_entropies

Test your function on the dataframe. _Ensure that it is correctly implemented and provides meaningful results._

In [8]:
# Example usage:
feature_entropies = calculate_entropy_for_features(df)
print("Feature Entropies:", feature_entropies)

Feature Entropies: {'Temperature': 1.5566567074628228, 'Humidity': 1.0, 'Outlook': 1.5774062828523454, 'Wind': 0.9852281360342516}
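
These values can be cross-checked by hand from the value counts in the PlayTennis table above (for example, Humidity has 7 High and 7 Normal rows). The helper below is a minimal sketch; its name, `entropy_from_counts`, is hypothetical.

```python
import math


def entropy_from_counts(counts):
    """Entropy in bits from a list of per-class occurrence counts."""
    total = sum(counts)
    # Classes with zero occurrences contribute nothing and are skipped
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


# Value counts read off the PlayTennis table
print(round(entropy_from_counts([7, 7]), 4))     # Humidity (High, Normal) -> 1.0
print(round(entropy_from_counts([8, 6]), 4))     # Wind (Weak, Strong) -> 0.9852
print(round(entropy_from_counts([5, 4, 5]), 4))  # Outlook (Sunny, Overcast, Rain) -> 1.5774
print(round(entropy_from_counts([4, 6, 4]), 4))  # Temperature (Hot, Mild, Cool) -> 1.5567
```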


**Exercise 4:** Write a function `find_max_entropy_feature` that identifies the attribute with the maximum entropy and removes it from the initial dataframe (_df_), using the previously written function `calculate_entropy_for_features`. This function should return three values: the name of the attribute with the maximum entropy, that maximum entropy value, and the new dataframe without the attribute.

In [9]:
def find_max_entropy_feature(df):
    """
    Find the feature with the maximum entropy in the DataFrame and remove it.

    Parameters:
    - df: The DataFrame containing the dataset (the last column is assumed to be the target).

    Returns:
    - max_entropy_feature: The name of the feature with the maximum entropy.
    - max_entropy_value: The entropy value of that feature.
    - df_after_removal: A copy of the DataFrame without the maximum entropy feature.
    """
    # Reuse the per-feature entropies; the target column is excluded there
    feature_entropies = calculate_entropy_for_features(df)

    # Pick the feature with the largest entropy (ties go to the first column)
    max_entropy_feature = max(feature_entropies, key=feature_entropies.get)
    max_entropy_value = feature_entropies[max_entropy_feature]

    # Remove the column with the maximum entropy feature
    df_after_removal = df.drop(columns=[max_entropy_feature])

    return max_entropy_feature, max_entropy_value, df_after_removal

In [10]:
# Example usage:
max_entropy_feature, max_entropy_value, df_after_removal = find_max_entropy_feature(df)
print("Max Entropy Feature:", max_entropy_feature)
print("Max Entropy Value:", max_entropy_value)
print("DataFrame After Removal:\n", df_after_removal)

Max Entropy Feature: Outlook
Max Entropy Value: 1.5774062828523454
DataFrame After Removal:
    Temperature Humidity    Wind Play Tennis
0          Hot     High    Weak          No
1          Hot     High  Strong          No
2          Hot     High    Weak         Yes
3         Mild     High    Weak         Yes
4         Cool   Normal    Weak         Yes
5         Cool   Normal  Strong          No
6         Cool   Normal  Strong         Yes
7         Mild     High    Weak          No
8         Cool   Normal    Weak         Yes
9         Mild   Normal    Weak         Yes
10        Mild   Normal  Strong         Yes
11        Mild     High  Strong         Yes
12         Hot   Normal    Weak         Yes
13        Mild     High  Strong          No


**Exercise 5:** Write a function `build_decision_tree` that repeats this operation (recursively or with a loop) by calling `find_max_entropy_feature` until a single column remains; that last column is appended without any calculation. The resulting sequence of attributes gives the construction order of your decision tree, from the most important attribute (the root) to the least significant ones (the leaves).

In [11]:
def build_decision_tree(df):
    """
    Determine the attribute order of the decision tree by repeatedly removing
    the maximum-entropy feature until a single column is left.

    Parameters:
    - df: The DataFrame containing the dataset.

    Returns:
    - decision_tree_order: The order of attributes for constructing the decision tree.
    """
    decision_tree_order = []

    while df.shape[1] > 1:  # Continue until only one attribute (column) is left
        max_entropy_feature, _, df = find_max_entropy_feature(df)
        decision_tree_order.append(max_entropy_feature)

    # Add the last remaining attribute without calculating its entropy
    decision_tree_order.append(df.columns[0])

    return decision_tree_order
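
The function above uses a `while` loop. Since the exercise wording mentions recursion, here is a sketch of a recursive variant of the same ordering. To stay self-contained it works on a plain dict of columns rather than a DataFrame, and the names `entropy_bits`, `order_by_entropy`, and the `toy` data are all hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy in bits of a sequence of labels."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def order_by_entropy(columns):
    """
    Recursively order column names by decreasing entropy.

    `columns` is a dict {name: list of values}; its last key is assumed
    to be the target column.
    """
    names = list(columns)
    if len(names) <= 1:
        # Base case: only the target is left, so append it without any calculation
        return names

    # Choose the feature (target excluded) with the largest entropy
    best = max(names[:-1], key=lambda name: entropy_bits(columns[name]))

    # Recurse on the remaining columns
    remaining = {name: vals for name, vals in columns.items() if name != best}
    return [best] + order_by_entropy(remaining)


toy = {
    "A": ["x", "y", "z", "x"],  # counts 2, 1, 1 -> entropy 1.5
    "B": ["u", "u", "v", "v"],  # two balanced values -> entropy 1.0
    "C": ["k", "k", "k", "k"],  # constant column -> entropy 0
    "Target": ["Yes", "No", "Yes", "Yes"],
}
print(order_by_entropy(toy))  # ['A', 'B', 'C', 'Target']
```

Each call removes one column, so the recursion depth equals the number of columns, which is small for datasets like this one.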

Test your function on the dataframe. _Ensure that it is correctly implemented and provides meaningful results._

In [12]:
# Example usage:
decision_tree_order = build_decision_tree(df)
print("Decision Tree Order:", decision_tree_order)

Decision Tree Order: ['Outlook', 'Temperature', 'Humidity', 'Wind', 'Play Tennis']
