# Entropy
![image.png](attachment:8938cd58-2f2a-4d74-9c2d-17afff773d3d.png)

<span style="font-size: 11pt; color: steelblue; font-weight: bold">Entropy</span> is a fundamental concept in various fields, including information theory, statistics, thermodynamics and **Machine Learning**. It plays a crucial role in the study of data compression, coding theory, and understanding the complexity of data. 
***
<span style="font-size: 18pt; color: turquoise; font-weight: bold">Entropy is a measure of uncertainty or randomness.</span> 
***
In the context of information theory, <u>it quantifies the average amount of information needed to describe an event drawn from a probability distribution</u>.   

In layman's terms, entropy tells us how much surprise or unpredictability there is in the data.

The formula to calculate entropy for a discrete random variable with probability distribution $P$ is:

$$ H(P) = - \sum_{i} P(i) \log_{2}(P(i)) $$

Where:
- $P(i)$ is the probability of the $i$-th event occurring.
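
As a worked example (the numbers are illustrative and not taken from any dataset), consider a biased coin with $P(\text{heads}) = 0.9$ and $P(\text{tails}) = 0.1$:

$$ H(P) = -\left(0.9 \log_{2}(0.9) + 0.1 \log_{2}(0.1)\right) \approx 0.137 + 0.332 \approx 0.469 \text{ bits} $$

The result is well below the 1 bit of a fair coin, because the outcome of the biased coin is mostly predictable.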

### Interpreting the values of Entropy 

<span style="font-size: 11pt; color: steelblue; font-weight: bold">The range of entropy values is between $0$ and $\log_2(n)$, inclusive, where $n$ is the number of possible events.</span>

The value that entropy takes depends on the number of events and on their probabilities. In information theory, the entropy of a discrete random variable is never negative, and for a variable with $n$ possible events it cannot exceed $\log_2(n)$.

For a discrete random variable with $n$ events, the minimum entropy occurs when one event has a probability of $1$ (**certainty**) and all other events have a probability of $0$ (**impossibility**). In this case, the entropy will be $0$, meaning there is no uncertainty or surprise because we already know the outcome.

On the other hand, the maximum entropy occurs when all events are equally likely. In this scenario, the entropy reaches its highest value, $\log_2(n)$. Each event contributes equally to the uncertainty, and there is maximum surprise.

For example:
- If there are two equally likely events (each with a probability of 0.5), the maximum entropy will be $\log_2(2) = 1$.
- If there are four equally likely events (each with a probability of 0.25), the maximum entropy will be $\log_2(4) = 2$.

In general, the less uniform the probabilities are in the distribution, the lower the entropy will be. Conversely, if the probabilities are more evenly distributed, the entropy will be higher.
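
The bounds described above can be checked numerically. The following is a minimal sketch using only the standard library; the helper name `shannon_entropy` and the three example distributions are hypothetical, and the helper adopts the usual convention that zero-probability events contribute nothing to the sum.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits; zero-probability events are skipped (0 * log 0 is treated as 0)."""
    # log2(1 / p) equals -log2(p), so this is the same sum as in the formula above
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

n = 4
certain = [1.0, 0.0, 0.0, 0.0]   # one outcome is guaranteed
skewed = [0.7, 0.1, 0.1, 0.1]    # one outcome dominates
uniform = [1 / n] * n            # all outcomes equally likely

for name, dist in [("certain", certain), ("skewed", skewed), ("uniform", uniform)]:
    print(f"{name:8s} H = {shannon_entropy(dist):.4f} bits (upper bound {math.log2(n):.4f})")
```

The certain distribution sits at the lower bound of 0, the uniform one at the upper bound of $\log_2(4) = 2$, and the skewed one falls in between.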

### <span style="font-size: 18pt; color: goldenrod; font-weight: bold">Entropy in Machine Learning</span>
Entropy plays a crucial role in Machine Learning, particularly in Decision Tree models and information gain calculations.

Here are some key aspects of the importance of entropy in machine learning:

1. **Decision Trees**: Decision trees are a popular type of machine learning model used for both classification and regression tasks. <u>They work by recursively splitting the data based on the most informative features</u>. <span style="font-size: 11pt; color: steelblue; font-weight: bold">In classification trees, entropy is used as a criterion to measure the impurity or randomness of a dataset at a particular node</span>. By choosing splits that minimize the weighted entropy of the resulting child nodes (that is, maximize information gain), decision trees can efficiently partition the data into homogeneous subsets, leading to effective predictions.

2. **Information Gain**: Information gain is a concept used to evaluate the usefulness of a feature in decision tree algorithms. <span style="font-size: 11pt; color: steelblue; font-weight: bold">It quantifies the reduction in entropy (or increase in homogeneity)</span> achieved by splitting the data based on a specific feature. Features with higher information gain are considered more informative and are favored for creating the decision tree's branches.

3. **Feature Selection**: <u>Entropy can be employed in feature selection techniques to identify the most relevant features for a particular task</u>. By computing the information gain for each feature, we can rank them based on their contribution to reducing uncertainty in the target variable. This process aids in choosing the subset of features that best represent the data, leading to simpler and more accurate models.

4. **Ensemble Learning**: Ensemble methods, such as Random Forests, combine multiple decision trees to make more robust predictions. <u>The choice of split at each node of each tree can be guided by entropy and information gain</u>, although other impurity measures such as Gini impurity are also common. The diversity introduced by different trees helps to reduce overfitting and improve the overall performance of the model.

5. **Clustering**: In unsupervised learning, entropy can be used to evaluate the quality of clusters. Entropy-based cluster evaluation metrics, like the **Normalized Mutual Information (NMI)**, compare the cluster assignments to true labels (if available) and assess the clustering accuracy.

6. **Model Evaluation**: Entropy-based metrics can be employed to evaluate the performance of probabilistic models. For instance, in classification tasks, the cross-entropy loss function measures the dissimilarity between predicted probabilities and true labels, guiding the model towards more accurate predictions.
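
The information gain idea from the Decision Trees and Information Gain items can be sketched as follows. This is an illustrative example, not a full decision tree implementation: the function names `label_entropy` and `information_gain` and the toy labels are hypothetical, and information gain is taken to be the parent entropy minus the size-weighted average entropy of the child subsets.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy (in bits) of the empirical class distribution of `labels`."""
    total = len(labels)
    # For a class with count c, P = c / total and log2(1 / P) = log2(total / c)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * label_entropy(child) for child in children)
    return label_entropy(parent) - weighted

# Hypothetical toy node: 4 "yes" and 4 "no" labels
parent = ["yes"] * 4 + ["no"] * 4

# Candidate split A separates the classes perfectly
split_a = [["yes"] * 4, ["no"] * 4]
# Candidate split B leaves both children as mixed as the parent
split_b = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(f"Gain of split A: {information_gain(parent, split_a):.4f}")
print(f"Gain of split B: {information_gain(parent, split_b):.4f}")
```

Split A removes all uncertainty and gains the full 1 bit of the parent node, while split B gains nothing, so a tree using this criterion would prefer split A.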

Overall, Entropy is a fundamental concept in Machine Learning that aids in:
- Decision making, 
- Feature selection, 
- Model evaluation,
- Creating robust models. 

Entropy provides a principled way to measure uncertainty and information gain, allowing Machine Learning algorithms to make more informed and effective decisions based on the available data.
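
To make the cross-entropy loss mentioned under Model Evaluation concrete, here is a minimal sketch for a single example with a one-hot true label. The function name `cross_entropy`, the `eps` guard against taking the logarithm of zero, and the three predicted distributions are assumptions made for illustration. The sketch uses base-2 logarithms to stay consistent with the entropy formula in this notebook, whereas machine learning libraries typically use the natural logarithm.

```python
import math

def cross_entropy(true_dist, predicted_dist, eps=1e-12):
    """Cross-entropy H(P, Q) = -sum_i P(i) * log2(Q(i)), in bits."""
    return sum(
        p * math.log2(1 / max(q, eps))  # max(q, eps) guards against log of zero
        for p, q in zip(true_dist, predicted_dist)
        if p > 0                        # terms with P(i) = 0 contribute nothing
    )

true_label = [0.0, 1.0, 0.0]            # one-hot: the second class is correct

predictions = {
    "confident and right": [0.05, 0.90, 0.05],
    "unsure": [0.30, 0.40, 0.30],
    "confident and wrong": [0.90, 0.05, 0.05],
}

for name, predicted in predictions.items():
    print(f"{name:20s} loss = {cross_entropy(true_label, predicted):.4f} bits")
```

The loss grows as the probability assigned to the correct class shrinks, which is what pushes a model toward more accurate predictions during training.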

## Computational Examples:

Probabilities must sum to 1 for the entropy computation to be valid, and each probability must be non-negative. If we also want to allow zero probabilities in our dataset, we need a technique that avoids evaluating $\log_2(0)$, which is undefined.

In [1]:
import numpy as np

def entropy(probabilities):
    """
    Calculate the entropy of a probability distribution.

    Parameters:
        probabilities (array-like): A 1-D array or list containing the probabilities of events.
                                   The probabilities must sum up to 1.
    Returns:
        float: The calculated entropy value.

    Raises:
        AssertionError: If any probability is negative or the probabilities do not sum up to 1.
    """
    # Convert probabilities to a NumPy float array for easier computation
    probabilities = np.asarray(probabilities, dtype=float)

    # Ensure the input is a valid distribution: non-negative and summing to 1
    assert np.all(probabilities >= 0), "Probabilities must be non-negative"
    assert np.isclose(np.sum(probabilities), 1.0), "Probabilities must sum to 1"

    # Clip probabilities to at least epsilon so that log2(0) is never evaluated;
    # each clipped term adds only a negligible amount (about 4e-11) to the sum
    epsilon = 1e-12
    nonzero_probs = np.maximum(probabilities, epsilon)

    # Calculate the entropy in bits (base-2 logarithm)
    entropy_val = -np.sum(nonzero_probs * np.log2(nonzero_probs))

    return entropy_val


In [2]:
# Example: A random variable with two equally probable outcomes (1/2 each)
probabilities = [0.5, 0.5]
result_entropy = entropy(probabilities)

print(f"Entropy: {result_entropy:.4f}")

Entropy: 1.0000


In [3]:
# Example: A random variable with one certain outcome (probability 1) and one impossible outcome (probability 0)
probabilities = [1.0, 0.0]
result_entropy = entropy(probabilities)

print(f"Entropy: {result_entropy:.4f}")

Entropy: 0.0000


In [4]:
# Example: A random variable with four equally probable outcomes (1/4 each)
probabilities = [0.25, 0.25, 0.25, 0.25]
result_entropy = entropy(probabilities)

print(f"Entropy: {result_entropy:.4f}")

Entropy: 2.0000
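
The epsilon clipping used in `entropy` slightly perturbs the result whenever zero probabilities are present. An alternative, sketched here under the convention that $0 \log_2 0 = 0$, is to drop zero-probability events before taking the logarithm. The name `entropy_exact` is hypothetical and not part of the function defined above, and raising `ValueError` instead of using assertions is a design choice of this sketch.

```python
import numpy as np

def entropy_exact(probabilities):
    """Entropy in bits, skipping zero-probability events instead of clipping them."""
    p = np.asarray(probabilities, dtype=float)
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("Probabilities must be non-negative and sum to 1")
    p = p[p > 0]  # by convention, events with P(i) = 0 contribute nothing
    # log2(1 / p) equals -log2(p) and keeps every term non-negative
    return float(np.sum(p * np.log2(1.0 / p)))

# Example: a certain outcome, and a distribution with one impossible event
print(f"Entropy: {entropy_exact([1.0, 0.0]):.4f}")
print(f"Entropy: {entropy_exact([0.5, 0.25, 0.25, 0.0]):.4f}")
```

With this approach the certain distribution yields exactly 0, and the second distribution yields 1.5 bits, the same as if the impossible event were not listed at all.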
