# Information Theory in the world of Machine Learning

## Entropy

The entropy of a random variable is the average level of uncertainty associated with the variable's potential states\
It measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states

In [27]:
from typing import List
from math import log2
import numpy as np

def entropy(probabilities: List[float]) -> float:
    
    H = -sum(p * log2(p) for p in probabilities if p > 0)
    
    return np.round(H, 3)
    

In [28]:
probabilities: List[float] = [0.25, 0.25, 0.25,0.25]

# A comparison never raises, so test the sum explicitly (with a float tolerance)
if not np.isclose(sum(probabilities), 1.0):
    print("Error: The probabilities are not valid")
    
print(entropy(probabilities=probabilities))
    
    

2.0
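
As a quick sanity check on the definition, the sketch below compares the uniform distribution used above with a skewed one and a fully certain one. It re-implements the formula under the hypothetical name `entropy_bits` (not part of this notebook) so that it runs on its own; the two extra distributions are made-up illustrations.

```python
from math import log2
from typing import Sequence


def entropy_bits(probabilities: Sequence[float]) -> float:
    """Entropy in bits; states with p == 0 contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]  # every state equally likely
skewed = [0.7, 0.1, 0.1, 0.1]       # one state dominates
certain = [1.0, 0.0, 0.0, 0.0]      # the outcome is known in advance

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(f"{name}: {entropy_bits(dist):.3f} bits")
```

With four states the maximum is log2(4) = 2 bits, reached only by the uniform distribution. The more the probability mass concentrates on one state, the lower the entropy, down to 0 when the outcome is certain.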


## Shannon Entropy

This is the measure of the average amount of information contained in a message\
It quantifies the unpredictability of the information content

In [29]:
from typing import Union

def shannon_entropy(data: Union[List[Union[float, str, int]], str])->float:
    
    chars, counts = np.unique(data, return_counts=True)

    # Count of the unique characters in the message
    char_counts: list = list(zip(chars, counts))
    print("Count of the unique characters in the message:")
    for char, count in char_counts:
        print(f"('{char}', {count})")

    # Compute Shannon entropy from the relative frequency of each symbol
    probabilities: np.ndarray = counts / len(data)
    
    return np.round(-np.sum(probabilities * np.log2(probabilities)), 3)

# Example: Calculate Shannon entropy for a text message
message1 = "Hello world"
# Example: Calculate Shannon entropy for a binary message
message2 = [1,0,1,1,0,1,1,0]

print(f"Shannon entropy of '{message1}': {shannon_entropy(list(message1))} bits")
print(f"Shannon entropy of {message2}: {shannon_entropy(list(message2))} bits")


Count of the unique characters in the message:
(' ', 1)
('H', 1)
('d', 1)
('e', 1)
('l', 3)
('o', 2)
('r', 1)
('w', 1)
Shannon entropy of 'Hello world': 2.845 bits
Count of the unique characters in the message:
('0', 3)
('1', 5)
Shannon entropy of [1, 0, 1, 1, 0, 1, 1, 0]: 0.954 bits
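
To make "average amount of information" more concrete, here is a minimal sketch that multiplies the per-symbol entropy by the message length. It assumes the observed symbol frequencies are the true source distribution, which is a simplification; `shannon_entropy_bits` is a hypothetical helper built on `collections.Counter` rather than the `np.unique` version above.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def shannon_entropy_bits(message: Sequence[Hashable]) -> float:
    """Average information per symbol, in bits, from the symbol frequencies."""
    total = len(message)
    return sum(-(c / total) * log2(c / total) for c in Counter(message).values())


text = "Hello world"
bits_per_char = shannon_entropy_bits(text)

print(f"{bits_per_char:.3f} bits per character")
print(f"{bits_per_char * len(text):.1f} bits for the whole message")
print(f"{8 * len(text)} bits when every character is stored on 8 bits")
```

Under that assumption, the entropy is a lower bound on the average number of bits per symbol that a symbol-by-symbol code can reach, so a predictable message can be stored in far fewer bits than its raw encoding.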


## Entropy in Machine Learning

Entropy measures uncertainty, and the objective of ML is to reduce the uncertainty about a prediction, so the two are closely linked

### Information gain

Information gain is the reduction in entropy achieved by splitting a dataset according to a particular feature (tree algorithms use it to select the feature to split on)\
It is the amount of information a feature provides about the class

Example:\
We have a dataset with cancerous (C) and non cancerous (NC) cells


In [30]:
import pandas as pd
import numpy as np
from typing import Dict

# Data for the DataFrame (each column is a list of sample names or 0/1 flags)
data: Dict[str, List[Union[str, int]]] = {
    'Samples': ['C1', 'C2', 'C3', 'C4', 'NC1', 'NC2', 'NC3'],
    'Mutation 1': [1, 1, 1, 0, 0, 0, 1],
    'Mutation 2': [1, 1, 0, 1, 0, 1, 1],
    'Mutation 3': [1, 0, 1, 1, 0, 0, 0],
    'Mutation 4': [0, 1, 1, 0, 0, 0, 0]
}

# Create the DataFrame
df: pd.DataFrame = pd.DataFrame(data, index=None)

# Print the DataFrame
print(df)

  Samples  Mutation 1  Mutation 2  Mutation 3  Mutation 4
0      C1           1           1           1           0
1      C2           1           1           0           1
2      C3           1           0           1           1
3      C4           0           1           1           0
4     NC1           0           0           0           0
5     NC2           0           1           0           0
6     NC3           1           1           0           0


We can create a very simple decision tree with 1 parent node, which is highly impure because it contains all the samples, and 2 pure child nodes, one with just the cancerous cells and the other with all the non cancerous cells\
We then want to know how to split the data so that future samples are classified as well as possible, which means that child nodes 1 and 2 must be as pure as possible

* **Parent Node:** high impurity (4 C + 3 NC)
* **Child Node Left:** pure node with only cancerous cells (P = 4/7)
* **Child Node Right:** pure node with only non cancerous cells (P = 3/7)

In [31]:
# Count the samples with (1) and without (0) Mutation 1
sum_elements_mut1: int = df['Mutation 1'].shape[0]
sum_zeros_in_mut1: int = (df['Mutation 1'] == 0).sum()
sum_ones_in_mut1: int = (df['Mutation 1'] == 1).sum()

# Fractions of zeros and ones in the Mutation 1 column. For this feature they happen to
# equal the class shares of the parent node (3 NC and 4 C out of 7 samples)
prob_NC_mut1: float  = sum_zeros_in_mut1 / sum_elements_mut1
prob_C_mut1: float = sum_ones_in_mut1 / sum_elements_mut1

# Display the two probabilities
print(f"Probability of a 1 in {df['Mutation 1'].name}: {np.round(prob_C_mut1, 3)}")
print(f"Probability of a 0 in {df['Mutation 1'].name}: {np.round(prob_NC_mut1, 3)}")

# A comparison never raises, so test the sum explicitly (with a float tolerance)
if not np.isclose(prob_C_mut1 + prob_NC_mut1, 1.0):
    print("Error: The probabilities do not add up to 1")
    
# Calculate the entropy of the parent node
feature_nodes_mut1: List[float] = [prob_C_mut1, prob_NC_mut1]
H_parent_node_mut1: float = entropy(feature_nodes_mut1)
print(f"\nThe entropy of the parent node is {np.round(H_parent_node_mut1, 3)}") 

Probability of a 1 in Mutation 1: 0.571
Probability of a 0 in Mutation 1: 0.429

The entropy of the parent node is 0.985


In [32]:
# Create a function to calculate the entropy of a binary feature over the samples of a node
def calculate_entropy_mutations(dataframe: pd.DataFrame, feature: str, verbose: bool = False)-> float:
    # Count the samples with (1) and without (0) the mutation
    sum_elements_mut: int = dataframe[feature].shape[0]
    sum_zeros_in_mut: int = (dataframe[feature] == 0).sum()
    sum_ones_in_mut: int = (dataframe[feature] == 1).sum()
    
    if verbose:
        print(f"Sum of elements in {feature}: {sum_elements_mut}")
        print(f"Sum of zeros in {feature}: {sum_zeros_in_mut}")
        print(f"Sum of ones in {feature}: {sum_ones_in_mut}")
    
    prob_NC_mut: float  = sum_zeros_in_mut / sum_elements_mut
    prob_C_mut: float = sum_ones_in_mut / sum_elements_mut
    
    # Display the probabilities (the local values, not the globals of the previous cell)
    if verbose:
        print(f"Probability of a 1 in {feature}: {prob_C_mut}")
        print(f"Probability of a 0 in {feature}: {prob_NC_mut}")
    
    # A comparison never raises, so test the sum explicitly (with a float tolerance)
    if not np.isclose(prob_C_mut + prob_NC_mut, 1.0):
        print("Error: The probabilities do not add up to 1")
    
    # Calculate the entropy of the feature inside this node
    feature_nodes_mut: List[float] = [prob_C_mut, prob_NC_mut]
    H_parent_node_mut: float = entropy(feature_nodes_mut)
    
    return np.round(H_parent_node_mut, 3)

mut_1 = "Mutation 1"
mut_2 = "Mutation 2"
mut_3 = "Mutation 3"
mut_4 = "Mutation 4"

H_parent_node_mut1 = calculate_entropy_mutations(df, mut_1)
H_parent_node_mut2 = calculate_entropy_mutations(df, mut_2)
H_parent_node_mut3 = calculate_entropy_mutations(df, mut_3)
H_parent_node_mut4 = calculate_entropy_mutations(df, mut_4)

print(f"The entropy of {mut_1} in the parent node is {H_parent_node_mut1}") 
print(f"The entropy of {mut_2} in the parent node is {H_parent_node_mut2}") 
print(f"The entropy of {mut_3} in the parent node is {H_parent_node_mut3}") 
print(f"The entropy of {mut_4} in the parent node is {H_parent_node_mut4}") 

The entropy of Mutation 1 in the parent node is 0.985
The entropy of Mutation 2 in the parent node is 0.863
The entropy of Mutation 3 in the parent node is 0.985
The entropy of Mutation 4 in the parent node is 0.863


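The entropy of the parent node (0.985 bits) is very close to 1, the maximum for two classes, which means that the cancerous and non cancerous samples in the parent node are highly mixed. Such an impure node is where a good split is most useful, because there is a lot of uncertainty left to remove. Note that `calculate_entropy_mutations` measures how the zeros and ones of a feature are distributed, not the class labels: for Mutation 1 and Mutation 3 the 4/3 split of the feature values matches the 4 C / 3 NC class split, hence the same 0.985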

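Now we can calculate the entropy of the child nodes for the 4 mutation features. Taking Mutation 1 as an example, a split on this feature gives:
* **Child Node Left:** exclusively the cells with Mutation 1, that is the first 3 cancerous cells (C1, C2, C3) and the last non cancerous cell (NC3)
* **Child Node Right:** exclusively the cells without Mutation 1, that is the last cancerous cell (C4) and the first two non cancerous cells (NC1, NC2)

The cells below reach the same number from the other direction: they group the samples by class (cancerous on the left, non cancerous on the right) and measure the entropy of each mutation inside each group. Because mutual information is symmetric, H(feature) - H(feature | class) = H(class) - H(class | feature), so the result equals the information gain of splitting on the feature.

Either way, we need the weighted average entropy of the child nodes, which is one of the terms of the Information Gain formula:

*Information Gain* = *Entropy of the parent node* - *Weighted average entropy of the child nodes* 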
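
A minimal sketch of this formula, splitting on each feature and measuring the entropy of the class labels in every child. The helper names `class_entropy` and `information_gain` are hypothetical, and the data is copied by hand from the table above (1 = cancerous, 0 = non cancerous).

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def class_entropy(labels: Sequence[int]) -> float:
    """Entropy (bits) of the class labels found in one node."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(labels)
    weighted_children = 0.0
    for value in set(feature):
        # Class labels of the samples that fall into this child node
        child = [y for x, y in zip(feature, labels) if x == value]
        weighted_children += len(child) / total * class_entropy(child)
    return class_entropy(labels) - weighted_children


labels: List[int] = [1, 1, 1, 1, 0, 0, 0]  # C1..C4, NC1..NC3
mutations: Dict[str, List[int]] = {
    "Mutation 1": [1, 1, 1, 0, 0, 0, 1],
    "Mutation 2": [1, 1, 0, 1, 0, 1, 1],
    "Mutation 3": [1, 0, 1, 1, 0, 0, 0],
    "Mutation 4": [0, 1, 1, 0, 0, 0, 0],
}

gains = {name: information_gain(column, labels) for name, column in mutations.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

Each child is weighted by its share of the parent's samples, so a large impure child counts more than a small one.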

In [33]:
# Filter the rows of the dataframe to keep only the cancerous cells
cancerous_cells_df: pd.DataFrame = df[df['Samples'].str.startswith('C')]
cancerous_cells_df.head(4)

  Samples  Mutation 1  Mutation 2  Mutation 3  Mutation 4
0      C1           1           1           1           0
1      C2           1           1           0           1
2      C3           1           0           1           1
3      C4           0           1           1           0


Calculate the entropy of Child Node Left with only cancerous cells for all the mutation features

In [34]:
mut_1: str = "Mutation 1"
mut_2: str = "Mutation 2"
mut_3: str = "Mutation 3"
mut_4: str = "Mutation 4"

H_child_left_mut1 = calculate_entropy_mutations(cancerous_cells_df, mut_1)
H_child_left_mut2 = calculate_entropy_mutations(cancerous_cells_df, mut_2)
H_child_left_mut3 = calculate_entropy_mutations(cancerous_cells_df, mut_3)
H_child_left_mut4 = calculate_entropy_mutations(cancerous_cells_df, mut_4)

print(f"The entropy of the Child Node Left for Mutation 1 feature is {H_child_left_mut1}") 
print(f"The entropy of the Child Node Left for Mutation 2 feature is {H_child_left_mut2}") 
print(f"The entropy of the Child Node Left for Mutation 3 feature is {H_child_left_mut3}") 
print(f"The entropy of the Child Node Left for Mutation 4 feature is {H_child_left_mut4}") 

The entropy of the Child Node Left for Mutation 1 feature is 0.811
The entropy of the Child Node Left for Mutation 2 feature is 0.811
The entropy of the Child Node Left for Mutation 3 feature is 0.811
The entropy of the Child Node Left for Mutation 4 feature is 1.0


Calculate the entropy of Child Node Right with only non cancerous cells for all the mutation features

In [35]:
# Filter the rows of the dataframe to keep only the non cancerous cells
non_cancerous_cells_df = df[df['Samples'].str.startswith('NC')]
non_cancerous_cells_df.head(3)


mut_1 = "Mutation 1"
mut_2 = "Mutation 2"
mut_3 = "Mutation 3"
mut_4 = "Mutation 4"

H_child_right_mut1 = calculate_entropy_mutations(non_cancerous_cells_df, mut_1)
H_child_right_mut2 = calculate_entropy_mutations(non_cancerous_cells_df, mut_2)
H_child_right_mut3 = calculate_entropy_mutations(non_cancerous_cells_df, mut_3)
H_child_right_mut4 = calculate_entropy_mutations(non_cancerous_cells_df, mut_4)

print(f"The entropy of the Child Node Right for Mutation 1 feature is {H_child_right_mut1}") 
print(f"The entropy of the Child Node Right for Mutation 2 feature is {H_child_right_mut2}") 
print(f"The entropy of the Child Node Right for Mutation 3 feature is {H_child_right_mut3}") 
print(f"The entropy of the Child Node Right for Mutation 4 feature is {H_child_right_mut4}") 

The entropy of the Child Node Right for Mutation 1 feature is 0.918
The entropy of the Child Node Right for Mutation 2 feature is 0.918
The entropy of the Child Node Right for Mutation 3 feature is -0.0
The entropy of the Child Node Right for Mutation 4 feature is -0.0


Calculate the weighted average entropy of the two child nodes

In [36]:
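def calc_avg_entropy(entropy_node1: float, entropy_node2: float, verbose: bool = False,
                     weight_node1: float = 4/7, weight_node2: float = 3/7)->float:
    # The default weights are the class shares of the full dataset (4 C and 3 NC out of 7);
    # pass the shares of the node being split when its composition is different
    avg_entropy: float = np.round((weight_node1 * entropy_node1) + (weight_node2 * entropy_node2), 3)
    
    if verbose:
        print(f"Average entropy: {avg_entropy}")

    return avg_entropy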

In [37]:
avg_entropy_mut1 = calc_avg_entropy(H_child_left_mut1, H_child_right_mut1)
avg_entropy_mut2 = calc_avg_entropy(H_child_left_mut2, H_child_right_mut2)
avg_entropy_mut3 = calc_avg_entropy(H_child_left_mut3, H_child_right_mut3)
avg_entropy_mut4 = calc_avg_entropy(H_child_left_mut4, H_child_right_mut4)
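
The 4/7 and 3/7 weights are the shares of the 7 samples that fall into each group. A hedged sketch of the same average with the weights derived from the group sizes, so that it also applies to nodes with a different composition; `weighted_child_entropy` is a hypothetical name and the entropy values are illustrative inputs.

```python
from typing import Sequence


def weighted_child_entropy(child_sizes: Sequence[int], child_entropies: Sequence[float]) -> float:
    """Average the child entropies, weighting each child by its share of the samples."""
    total = sum(child_sizes)
    return sum(size / total * h for size, h in zip(child_sizes, child_entropies))


# Groups of 4 and 3 samples: the weights are 4/7 and 3/7
first_split = weighted_child_entropy([4, 3], [0.8113, 0.9183])
# Groups of 1 and 3 samples: the weights become 1/4 and 3/4
smaller_node = weighted_child_entropy([1, 3], [0.0, 0.9183])

print(round(first_split, 3), round(smaller_node, 3))
```

Deriving the weights from the sizes avoids reusing the proportions of the root node once the tree is split further.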


Calculate the information gain

In [38]:
def calc_inf_gain(entropy_parent: float, avg_entropy_childs: float, verbose: bool = False)->float:
    
    information_gain_mut: float = np.round(entropy_parent - avg_entropy_childs, 3)
    
    if verbose:
        print(f"Information gain: {information_gain_mut}")

    return information_gain_mut

In [39]:
information_gain_mut1 = calc_inf_gain(H_parent_node_mut1, avg_entropy_mut1, True)
information_gain_mut2 = calc_inf_gain(H_parent_node_mut2, avg_entropy_mut2, True)
information_gain_mut3 = calc_inf_gain(H_parent_node_mut3, avg_entropy_mut3, True)
information_gain_mut4 = calc_inf_gain(H_parent_node_mut4, avg_entropy_mut4, True)

Information gain: 0.128
Information gain: 0.006
Information gain: 0.522
Information gain: 0.292


Mutation 3 has the greatest information gain, which means that splitting the dataframe on this feature gives the purest possible child nodes

In [40]:
def split_dataframe(df: pd.DataFrame, mutation_column: str, verbose: bool = False) -> tuple:

    left_child: pd.DataFrame = df[df[mutation_column] == 1]
    right_child: pd.DataFrame = df[df[mutation_column] == 0]
    
    if verbose:
        print("Left Child:")
        print(left_child)
        print("\nRight Child:")
        print(right_child)

    return left_child, right_child



In [41]:
# Split the DataFrame based on 'Mutation 3'
left_child, right_child = split_dataframe(df, 'Mutation 3', True)

Left Child:
  Samples  Mutation 1  Mutation 2  Mutation 3  Mutation 4
0      C1           1           1           1           0
2      C3           1           0           1           1
3      C4           0           1           1           0

Right Child:
  Samples  Mutation 1  Mutation 2  Mutation 3  Mutation 4
1      C2           1           1           0           1
4     NC1           0           0           0           0
5     NC2           0           1           0           0
6     NC3           1           1           0           0


As we can see, the Right child still contains one cancerous sample (C2), which means that we can repeat the calculations on this dataframe to find a new optimal split of the child node\
To resolve this misclassification, we will create a new split below the Right Node
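
Repeating "pick the feature with the highest information gain, split, then recurse into the impure children" is essentially what ID3-style tree builders do. The block below is a minimal sketch of that recursion under simplifying assumptions (binary features, no pruning, majority vote when no feature is left); `build_tree` and the nested-dict tree format are hypothetical choices for illustration, and the data is copied by hand from the table above.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Union

# A tree is either a leaf (class label) or {"feature": name, 0: subtree, 1: subtree}
Tree = Union[int, Dict]


def class_entropy(labels: List[int]) -> float:
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows: List[Dict[str, int]], feature: str, target: str) -> float:
    gain = class_entropy([row[target] for row in rows])
    for value in {row[feature] for row in rows}:
        child = [row[target] for row in rows if row[feature] == value]
        gain -= len(child) / len(rows) * class_entropy(child)
    return gain


def build_tree(rows: List[Dict[str, int]], features: List[str], target: str = "label") -> Tree:
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no feature is left: predict the majority class
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, f, target))
    remaining = [f for f in features if f != best]
    tree: Dict = {"feature": best}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        tree[value] = build_tree(subset, remaining, target)
    return tree


columns = {
    "Mutation 1": [1, 1, 1, 0, 0, 0, 1],
    "Mutation 2": [1, 1, 0, 1, 0, 1, 1],
    "Mutation 3": [1, 0, 1, 1, 0, 0, 0],
    "Mutation 4": [0, 1, 1, 0, 0, 0, 0],
    "label": [1, 1, 1, 1, 0, 0, 0],  # 1 = cancerous, 0 = non cancerous
}
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]
tree = build_tree(rows, [name for name in columns if name != "label"])
print(tree)
```

On this dataset the sketch splits on Mutation 3 first and then on Mutation 4 inside the branch without Mutation 3, which is the branch that still mixes C2 with the non cancerous samples.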

In [42]:
# Every sample in the right child has Mutation 3 == 0, so this feature cannot split the node any further and we drop it
right_child = right_child.drop('Mutation 3', axis=1)

In [43]:
# We will calculate the new parent entropy

mut_1 = "Mutation 1"
mut_2 = "Mutation 2"
mut_4 = "Mutation 4"

H_parent_node_mut1 = calculate_entropy_mutations(right_child, mut_1)
H_parent_node_mut2 = calculate_entropy_mutations(right_child, mut_2)
H_parent_node_mut4 = calculate_entropy_mutations(right_child, mut_4)

print(f"The entropy of the parent node is {H_parent_node_mut1}") 
print(f"The entropy of the parent node is {H_parent_node_mut2}") 
print(f"The entropy of the parent node is {H_parent_node_mut4}") 

The entropy of the parent node is 1.0
The entropy of the parent node is 0.811
The entropy of the parent node is 0.811


In [44]:
def filter_with_name(dataframe: pd.DataFrame, cat_name: str, name_to_filter: str)-> pd.DataFrame:
    # Keep the rows whose value in cat_name starts with name_to_filter
    filtered_df: pd.DataFrame = dataframe[dataframe[cat_name].str.startswith(name_to_filter)]
        
    return filtered_df


In [45]:
# Filter the rows of the dataframe to keep only the cancerous cells
cancerous_cells_df: pd.DataFrame = filter_with_name(right_child, "Samples", "C")
cancerous_cells_df.head(4)

  Samples  Mutation 1  Mutation 2  Mutation 4
1      C2           1           1           1


In [46]:
# Filter the rows of the dataframe to keep only the non cancerous cells
non_cancerous_cells_df: pd.DataFrame = filter_with_name(right_child, "Samples", "NC")
non_cancerous_cells_df.head(4)

  Samples  Mutation 1  Mutation 2  Mutation 4
4     NC1           0           0           0
5     NC2           0           1           0
6     NC3           1           1           0


In [47]:
# Calculate the new left node
mut_1: str = "Mutation 1"
mut_2: str = "Mutation 2"
mut_4: str = "Mutation 4"

H_child_left_mut1 = calculate_entropy_mutations(cancerous_cells_df, mut_1)
H_child_left_mut2 = calculate_entropy_mutations(cancerous_cells_df, mut_2)
H_child_left_mut4 = calculate_entropy_mutations(cancerous_cells_df, mut_4)

print(f"The entropy of the Child Node Left for Mutation 1 feature is {H_child_left_mut1}") 
print(f"The entropy of the Child Node Left for Mutation 2 feature is {H_child_left_mut2}") 
print(f"The entropy of the Child Node Left for Mutation 4 feature is {H_child_left_mut4}") 

The entropy of the Child Node Left for Mutation 1 feature is -0.0
The entropy of the Child Node Left for Mutation 2 feature is -0.0
The entropy of the Child Node Left for Mutation 4 feature is -0.0


In [48]:
# Calculate the new right node
mut_1: str = "Mutation 1"
mut_2: str = "Mutation 2"
mut_4: str = "Mutation 4"

H_child_right_mut1 = calculate_entropy_mutations(non_cancerous_cells_df, mut_1)
H_child_right_mut2 = calculate_entropy_mutations(non_cancerous_cells_df, mut_2)
H_child_right_mut4 = calculate_entropy_mutations(non_cancerous_cells_df, mut_4)

print(f"The entropy of the Child Node Right for Mutation 1 feature is {H_child_right_mut1}") 
print(f"The entropy of the Child Node Right for Mutation 2 feature is {H_child_right_mut2}") 
print(f"The entropy of the Child Node Right for Mutation 4 feature is {H_child_right_mut4}") 

The entropy of the Child Node Right for Mutation 1 feature is 0.918
The entropy of the Child Node Right for Mutation 2 feature is 0.918
The entropy of the Child Node Right for Mutation 4 feature is -0.0


In [49]:
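# calculate the new average entropies
# The right child holds 1 cancerous and 3 non cancerous samples, so the weights are 1/4 and 3/4
# here, not the 4/7 and 3/7 that calc_avg_entropy applies for the first split
weight_left: float = len(cancerous_cells_df) / len(right_child)
weight_right: float = len(non_cancerous_cells_df) / len(right_child)

avg_entropy_mut1 = np.round(weight_left * H_child_left_mut1 + weight_right * H_child_right_mut1, 3)
avg_entropy_mut2 = np.round(weight_left * H_child_left_mut2 + weight_right * H_child_right_mut2, 3)
avg_entropy_mut4 = np.round(weight_left * H_child_left_mut4 + weight_right * H_child_right_mut4, 3)

for mut, avg in [(mut_1, avg_entropy_mut1), (mut_2, avg_entropy_mut2), (mut_4, avg_entropy_mut4)]:
    print(f"Average entropy for {mut}: {avg}")

Average entropy for Mutation 1: 0.688
Average entropy for Mutation 2: 0.688
Average entropy for Mutation 4: -0.0


In [50]:
# Calculate the new information gains
information_gain_mut1 = calc_inf_gain(H_parent_node_mut1, avg_entropy_mut1)
information_gain_mut2 = calc_inf_gain(H_parent_node_mut2, avg_entropy_mut2)
information_gain_mut4 = calc_inf_gain(H_parent_node_mut4, avg_entropy_mut4)

for mut, gain in [(mut_1, information_gain_mut1), (mut_2, information_gain_mut2), (mut_4, information_gain_mut4)]:
    print(f"Information gain for {mut}: {gain}")

Information gain for Mutation 1: 0.312
Information gain for Mutation 2: 0.123
Information gain for Mutation 4: 0.811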


In [51]:
# Mutation 4 has the highest information gain, so we split the right node on it
new_left_child, new_right_child = split_dataframe(right_child, 'Mutation 4', True)

Left Child:
  Samples  Mutation 1  Mutation 2  Mutation 4
1      C2           1           1           1

Right Child:
  Samples  Mutation 1  Mutation 2  Mutation 4
4     NC1           0           0           0
5     NC2           0           1           0
6     NC3           1           1           0


We get an accuracy of 100% with our very simple dataset, but this is a case of overfitting: the model learns too many specifics of the dataset and will struggle to generalize to new, unseen data.

**Solution:** We can choose the attribute with the highest information gain ratio among the attributes whose information gain is average or higher. The gain ratio biases the tree against attributes with a large number of distinct values, and the threshold on the information gain avoids giving an unfair advantage to attributes with very few values.
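
The selection rule above can be sketched as follows. This assumes the usual definition of the gain ratio as the information gain divided by the split information (the entropy of the feature itself); the helper names are hypothetical and the data is copied by hand from the table above.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def entropy_bits(values: Sequence[int]) -> float:
    """Entropy (bits) of the empirical distribution of `values`."""
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Class entropy minus the size-weighted class entropy of each feature value."""
    gain = entropy_bits(labels)
    for value in set(feature):
        child = [y for x, y in zip(feature, labels) if x == value]
        gain -= len(child) / len(labels) * entropy_bits(child)
    return gain


def gain_ratio(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Information gain divided by the split information (entropy of the feature)."""
    split_info = entropy_bits(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


labels: List[int] = [1, 1, 1, 1, 0, 0, 0]  # 1 = cancerous, 0 = non cancerous
mutations: Dict[str, List[int]] = {
    "Mutation 1": [1, 1, 1, 0, 0, 0, 1],
    "Mutation 2": [1, 1, 0, 1, 0, 1, 1],
    "Mutation 3": [1, 0, 1, 1, 0, 0, 0],
    "Mutation 4": [0, 1, 1, 0, 0, 0, 0],
}

gains = {name: information_gain(col, labels) for name, col in mutations.items()}
average_gain = sum(gains.values()) / len(gains)

# Step 1: keep the attributes whose information gain is average or higher
candidates = [name for name, gain in gains.items() if gain >= average_gain]
# Step 2: among them, pick the one with the highest gain ratio
best = max(candidates, key=lambda name: gain_ratio(mutations[name], labels))

print(candidates, best)
```

All four features here are binary, so the split information barely changes the ranking. The correction matters most when some attributes have many distinct values, such as an identifier column.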

## Code for feature selection

We can use existing library functions that efficiently perform all of our previous calculations

In [60]:
from sklearn.feature_selection import mutual_info_classif

# Data for the DataFrame
data: Dict[str, List[Union[str, int]]] = {
    'Samples': ['C1', 'C2', 'C3', 'C4', 'NC1', 'NC2', 'NC3'],
    'Mutation 1': [1, 1, 1, 0, 0, 0, 1],
    'Mutation 2': [1, 1, 0, 1, 0, 1, 1],
    'Mutation 3': [1, 0, 1, 1, 0, 0, 0],
    'Mutation 4': [0, 1, 1, 0, 0, 0, 0]
}

# Modify the 'Samples' column
data['Samples'] = [1 if s.startswith('C') else 0 for s in data['Samples']]

# Create the DataFrame
df: pd.DataFrame = pd.DataFrame(data, index=None)

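# Calculate information gain
# discrete_features=True makes scikit-learn treat the 0/1 mutations as categories; by default
# dense features are handled as continuous values with a randomized nearest-neighbour estimator
X = df[['Mutation 1', 'Mutation 2', "Mutation 3", "Mutation 4"]]
y = df['Samples']
info_gain = mutual_info_classif(X, y, discrete_features=True)

for feature, gain in zip(X.columns, info_gain):
    # The result is expressed in nats (natural logarithm), not in bits
    print(f"Information gain for {feature}: {gain:.4f}")


Information gain for Mutation 1: 0.0888
Information gain for Mutation 2: 0.0041
Information gain for Mutation 3: 0.3616
Information gain for Mutation 4: 0.2022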


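## Code for Mutual Information

Mutual information measures the amount of information obtained about one random variable by observing another random variable\
It quantifies the dependency between two variables X and Y\
It is closely linked to the entropy of a random variable: I(X; Y) = H(X) - H(X | Y), which is the information gain computed earlier. Note that `mutual_info_score` uses the natural logarithm, so its results are in nats rather than bits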

In [63]:
from sklearn.metrics import mutual_info_score

def mutual_information(x: List[int], y: List[int])->float:
    
    return np.round(mutual_info_score(x, y), 3)

# Example: Calculate mutual information between two variables
target = np.array(data['Samples'])
feature_1 = np.array(data['Mutation 1'])
feature_2 = np.array(data['Mutation 2'])
feature_3 = np.array(data['Mutation 3'])
feature_4 = np.array(data['Mutation 4'])

print(f"Mutual information between the target and feature_1: {mutual_information(feature_1, target)}")
print(f"Mutual information between the target and feature_2: {mutual_information(feature_2, target)}")
print(f"Mutual information between the target and feature_3: {mutual_information(feature_3, target)}")
print(f"Mutual information between the target and feature_4: {mutual_information(feature_4, target)}")


Mutual information between the target and feature_1: 0.089
Mutual information between the target and feature_2: 0.004
Mutual information between the target and feature_3: 0.362
Mutual information between the target and feature_4: 0.202
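
The values above are in nats, which is why they are about 0.693 (that is, ln 2) times the information gains computed in bits earlier. A minimal from-scratch sketch for Mutation 3 makes the conversion explicit; `mutual_information_bits` is a hypothetical helper and the two lists are copied by hand from the dataset.

```python
from collections import Counter
from math import log, log2
from typing import Sequence


def mutual_information_bits(x: Sequence[int], y: Sequence[int]) -> float:
    """I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y)))."""
    n = len(x)
    joint = Counter(zip(x, y))      # counts of each (x, y) pair
    px, py = Counter(x), Counter(y)  # marginal counts
    return sum(
        (c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


target = [1, 1, 1, 1, 0, 0, 0]      # 1 = cancerous, 0 = non cancerous
mutation_3 = [1, 0, 1, 1, 0, 0, 0]

mi_bits = mutual_information_bits(mutation_3, target)
mi_nats = mi_bits * log(2)  # 1 bit = ln(2) nats

print(f"{mi_bits:.3f} bits = {mi_nats:.3f} nats")
```

Swapping the two arguments gives the same value, since mutual information is symmetric.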


## Code for Kullback-Leibler Divergence (Relative Entropy)

This is a measure of how much a probability distribution P differs from a reference distribution Q. It is not symmetric: D_KL(P || Q) generally differs from D_KL(Q || P)

In [66]:
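import numpy as np
# Import under an alias so that it does not shadow the entropy() function defined at the top of the notebook
from scipy.stats import entropy as scipy_entropy

def kl_divergence(p: List[float], q: List[float])->float:
    # With two arguments, scipy's entropy returns the KL divergence D(p || q) in nats
    return np.round(scipy_entropy(p, q), 3)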

# Example: Compare two probability distributions
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.4, 0.5])

print(f"KL divergence: {kl_divergence(p, q)}")


KL divergence: 0.097
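
The same number can be cross-checked by hand. The sketch below writes out the sum directly and also evaluates the reverse direction; `kl_divergence_nats` is a hypothetical helper that assumes both inputs are already normalized and that `q` is non-zero wherever `p` is.

```python
from math import log
from typing import Sequence


def kl_divergence_nats(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) = sum_i p_i * ln(p_i / q_i); terms with p_i == 0 are skipped."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.2, 0.5, 0.3]
q = [0.1, 0.4, 0.5]

print(f"D_KL(P || Q) = {kl_divergence_nats(p, q):.4f} nats")
print(f"D_KL(Q || P) = {kl_divergence_nats(q, p):.4f} nats")
print(f"D_KL(P || P) = {kl_divergence_nats(p, p):.4f} nats")
```

For this pair the two directions are close (about 0.0970 and 0.0968 nats) but not equal, and the divergence of a distribution from itself is 0.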
