# Spare/not dealt with material (2025)

In [1]:
# Python
## Load libraries
import numpy as np


Plundering done from here:
https://colab.research.google.com/github/d2l-ai/d2l-en-colab/blob/master/chapter_appendix-mathematics-for-deep-learning/information-theory.ipynb#scrollTo=608e73cb 

We aim to make these ideas precise using information theory: entropy, maximum entropy methods, mutual information and transfer entropy. These tools quantify complexity: how information exists, is stored and is transferred in a system, and what that tells us about the system.

**Self-information:** The self-information measures the amount of information gained, or 'surprise', associated with the occurrence of an event. The higher the self-information, the more informative the event, that is, the more surprising it is that the event occurred. It is calculated as $-\log_2(p)$, where $p$ is the probability of the event, and is measured in bits.

In [None]:
def self_information(p):
    """
    Calculates the self-information of an event with probability p.

    INPUTS:
        p (float): The probability of the event.

    OUTPUTS:
        float: The self-information of the event.
    """
    return -np.log2(p)

#Example usage:
self_information(1 / 64)

**Entropy**

Entropy is a key concept in thermodynamics, statistical mechanics and information theory.

It is a measure of the amount of 'disorder' and information in a system.

In information theory, entropy describes how much randomness is present in a signal or a random event.

Entropy of the degree distribution provides an average measurement of the heterogeneity of the network:
\begin{equation*}
  H = -\sum\limits_{k} P(k) \log P(k),
\end{equation*}
where $P(k)$ is the degree distribution.

The maximum value of the entropy is obtained for a uniform degree distribution.

The minimum value $H_{\mathrm{min}}=0$ is achieved whenever all vertices have the same degree ($k$-regular network).
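
A minimal sketch of the degree-distribution entropy, assuming the network is summarised by a plain list of vertex degrees and that the logarithm is taken in base 2 (the formula above leaves the base open). The function name and the two example degree lists are hypothetical.

```python
import numpy as np

def degree_distribution_entropy(degrees):
    """Entropy (in bits) of the empirical degree distribution P(k)."""
    # Count how many vertices have each distinct degree k
    _, counts = np.unique(np.asarray(degrees), return_counts=True)
    p_k = counts / counts.sum()
    # H = sum_k P(k) * log2(1 / P(k)), equivalent to -sum_k P(k) log2 P(k)
    return float(np.sum(p_k * np.log2(1.0 / p_k)))

regular_degrees = [3] * 10               # hypothetical 3-regular network
mixed_degrees = [1, 1, 2, 2, 3, 3, 4, 4]  # four degrees, equally represented

print(degree_distribution_entropy(regular_degrees))
print(degree_distribution_entropy(mixed_degrees))
```

The regular network gives zero entropy, while the list with four equally common degrees gives $\log_2 4 = 2$ bits, the largest value possible for four distinct degrees.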

**Entropy**

Network entropy has been related to the robustness of networks, i.e., their resilience to attacks. In biological applications, the contribution of vertices to the network entropy is correlated with lethality in protein interaction networks.


**Information entropy:** measures the amount of uncertainty/randomness/unpredictability in a set of data or a probability distribution. It quantifies the average amount of information required to describe the outcomes of a random variable. The higher the entropy the more uncertain we are about the outcomes.

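A fair die has a uniform distribution of outcomes, so its entropy is the largest possible for six outcomes; a loaded die is more predictable and has lower entropy.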

In [None]:
def information_entropy(p):
    """
    Calculate the information entropy of a probability distribution.

    INPUTS:
        p (array-like): Input array representing a probability distribution.

    OUTPUTS:
        float: Entropy value of the probability distribution.

    """
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]  # events with probability 0 contribute nothing to the sum
    out = -np.sum(nonzero * np.log2(nonzero))
    return out

fair_die_entropy = information_entropy(np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])) #a fair die
loaded_die_entropy = information_entropy(np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])) #a loaded die

print(fair_die_entropy, loaded_die_entropy)

# Thermodynamics
Thermodynamics deals only with large systems, containing many constituents.

Phase space has a huge number of dimensions.

The system may or may not be complex (this is irrelevant for the truth of the second law).

## Entropy
Entropy is simply a fancy word for “disorder”.

It is a quantitative measure.

The second law of thermodynamics is about order compared with disorder. One easy-to-grasp statement, which is an accurate one, is the following. It concerns the time-evolution of an isolated system, which means a system lacking any kind of interaction or connection with the rest of the universe. The spontaneous evolution of an isolated system can never lead to a decrease of its entropy (= disorder). The entropy is always increasing as long as the system evolves. If the system eventually reaches equilibrium and stops evolving, its entropy becomes constant.

Everybody knows the second law intuitively already.
## 'Entropy'
Dimensionless entropy, which measures our lack of knowledge, is a purely subjective quantity.

If entropy is really totally subjective, why does it appear so objective to so many people? Ask any physical chemist. You will be told that entropy is a fundamental, permanent property of matter in bulk. You can measure it, you can calculate it accurately, you can look it up in tables, etc. And the answer to that new paradox is: large numbers. It is well-known that very large numbers have a way of making probabilities turn into absolute certainty.
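
The "large numbers" point can be illustrated with a small simulation. This is only a sketch under assumed settings (fair coins standing in for the constituents of a system, 2,000 repeats, a fixed random seed; none of these come from the original notes). It shows that the fraction of heads fluctuates less and less around 1/2 as the number of coins grows, roughly like $0.5/\sqrt{N}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_repeats = 2000                # hypothetical number of repeated experiments

spreads = {}
for n_coins in [100, 10_000, 1_000_000]:
    # Fraction of heads in each repeated experiment with n_coins fair coins
    heads_fraction = rng.binomial(n_coins, 0.5, size=n_repeats) / n_coins
    spreads[n_coins] = float(heads_fraction.std())
    # Compare the observed spread with the binomial prediction 0.5 / sqrt(N)
    print(n_coins, spreads[n_coins], 0.5 / np.sqrt(n_coins))
```

For a macroscopic system, with a number of constituents vastly larger than anything simulated here, the relative fluctuations become so small that the outcome looks certain.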

The fundamental building block of Information Theory

### “the paradox of the arrow of time”
Entropy may stay constant, but this happens only under ideal conditions never realized in practice; in the real world, it is always increasing. Therefore the evolution of such a system is always irreversible: if you have observed a certain evolution of your system, you know that the backwards evolution, with the same things happening in reverse order of time, can never be possible, because it would make entropy decrease with time. And this comes as a big shock because, in mechanics, any possible motion is also possible in the reverse direction. The second law says that in thermodynamics this is never true: for an isolated system, no thermodynamic motion is ever reversible. It is the opposite of what happens in mechanics! How could this ever come about, since we are told that thermodynamics is just mechanics plus statistics?

In [None]:
def joint_entropy(p_xy):
    joint_ent = -p_xy * np.log2(p_xy)
    # Operator `nansum` will sum up the non-nan number
    out = np.nansum(joint_ent)
    return out

joint_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]))

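def conditional_entropy(p_xy, p_x):
    # Convention: columns of p_xy index x, rows index y, so p_x broadcasts across columns
    p_y_given_x = p_xy / p_x
    cond_ent = -p_xy * np.log2(p_y_given_x)
    # `nansum` sums the non-nan entries, ignoring zero-probability events
    out = np.nansum(cond_ent)
    return out

# p_x holds the column sums of p_xy, so the inputs form a consistent distribution
conditional_entropy(np.array([[0.1, 0.5], [0.1, 0.3]]), np.array([0.2, 0.8]))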

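def mutual_information(p_xy, p_x, p_y):
    # p_x * p_y must broadcast to the shape of p_xy (product of the marginals)
    p = p_xy / (p_x * p_y)
    mutual = p_xy * np.log2(p)
    # `nansum` sums the non-nan entries, ignoring zero-probability events
    out = np.nansum(mutual)
    return out

# Columns index x (marginal = column sums), rows index y (marginal = row sums)
mutual_information(np.array([[0.1, 0.5], [0.1, 0.3]]),
                   np.array([0.2, 0.8]), np.array([[0.6], [0.4]]))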

def cross_entropy(y_hat, y):
    ce = -np.log(y_hat[range(len(y_hat)), y])
    return ce.mean()

labels = np.array([0, 2])
preds = np.array([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])

cross_entropy(preds, labels)

Relative entropy

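The relative entropy, also called the Kullback-Leibler (KL) divergence, measures how much a distribution $P$ differs from a reference distribution $Q$. It is not a true distance, since it is not symmetric in $P$ and $Q$. It is given by:
\begin{equation*}
  D_{\mathrm{KL}}(P \,\|\, Q) = \sum\limits_{x} P(x) \log \frac{P(x)}{Q(x)}.
\end{equation*}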
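
As a minimal sketch, the relative entropy of two discrete distributions, $\sum_x P(x)\log_2\frac{P(x)}{Q(x)}$, can be computed as follows. The function name is hypothetical, base 2 is an assumption (giving bits), and the code assumes $Q(x) > 0$ wherever $P(x) > 0$. The fair and loaded dice reuse the probabilities from the entropy example earlier in these notes.

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D(P || Q) in bits for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

fair_die = np.full(6, 1 / 6)
loaded_die = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

print(kl_divergence(loaded_die, fair_die))
print(kl_divergence(fair_die, loaded_die))
print(kl_divergence(fair_die, fair_die))
```

The two orderings give different values, which is why the quantity is not a distance in the usual sense; it is zero only when the two distributions coincide.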

In [None]:
# For 6 possible outcomes...
# import numpy as np
import math

for i in range(1,6):
    H = -(i/6)*math.log2(i/6)-(1-i/6)*math.log2(1-i/6)
    print(H)

An example with maximally random situation to avoid confusion

Connection of entropy to complexity

Primary resource for this slide set: https://necsi.edu/chaos-complexity-and-entropy

In Thermodynamics entropy is simply a fancy word for “disorder”

You should consider the connection

Include better description of bit cutting down space in half:
    
<center>
<img src="3B1B_bits.png" width="500"/>
</center>



In [None]:
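import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm  # provides the normal PDF used below

# Generate 1,000 random samples from a normal distribution with mean 0 and standard deviation 1
data_normal = np.random.normal(0, 1, 1000)

mean_normal = 0
stddev_normal = 1
x_normal = np.linspace(-3, 3, 1000)  # Range of values for PDF calculation
pdf_normal = norm.pdf(x_normal, loc=mean_normal, scale=stddev_normal)
# Normalise the PDF values into probabilities that sum to 1 before taking the entropy
probabilities_normal = pdf_normal / pdf_normal.sum()
entropy_normal = -np.sum(probabilities_normal * np.log2(probabilities_normal))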

# Generate random samples from a uniform distribution between 0 and 1
data_uniform = np.random.uniform(0, 1, 1000)

# Generate random samples from an exponential distribution with rate parameter (lambda) of 0.5
data_exponential = np.random.exponential(1/0.5, 1000)

# Generate random samples from a Poisson distribution with a mean rate of 3
data_poisson = np.random.poisson(3, 1000)

# Generate random samples from a binomial distribution with 10 trials and a success probability of 0.3
data_binomial = np.random.binomial(10, 0.3, 1000)

# Generate random samples from a geometric distribution with a success probability of 0.2
data_geometric = np.random.geometric(0.2, 1000)

# Create a histogram with 20 bins
plt.hist(data_normal, bins=20, density=True, alpha=0.6, color='b')




In [None]:

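from scipy.stats import norm, expon  # PDFs used below

# NOTE: this cell uses `shannon_entropy`, defined in a later cell; run that cell first

# Create a figure with 1 row and 3 columns of subplots
fig, axs = plt.subplots(1, 3, figsize=(15, 5))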

# Define the range and number of outcomes
lower_bound = 0
upper_bound = 7
num_outcomes = 1000

# Generate random values from the uniform distribution
samples_uniform = np.random.uniform(lower_bound, upper_bound, num_outcomes)

# Create a histogram
hist, bin_edges = np.histogram(samples_uniform, bins=20, range=(lower_bound, upper_bound))

# Calculate the probability associated with each bin
probabilities_uniform = hist / num_outcomes  # Normalize by the total number of outcomes

# Calculate entropy using the estimated probabilities
entropy_uniform = shannon_entropy(probabilities_uniform)
axs[0].hist(samples_uniform, bins=20, density=True, color='powderblue')
# axs[1].plot(x_uniform, pdf_normal, 'r-', lw=2)
# axs[0].set_title(f'Uniform Distribution\nEntropy: {entropy_uniform:.2f}')
axs[0].set_xticks([])  # Remove x-axis ticks and labels
axs[0].set_yticks([])  # Remove y-axis ticks and labels

# Normal (Gaussian) Distribution
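mean_normal = 0
stddev_normal = 1
x_normal = np.linspace(-3, 3, 1000)
# Generate random values from the normal distribution
samples_normal = np.random.normal(mean_normal, stddev_normal, num_outcomes)
# Create a histogram to estimate probabilities
hist, bin_edges = np.histogram(samples_normal, bins=20, range=(-3, 3), density=True)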

# Calculate the probability associated with each bin
probabilities_normal = hist * (bin_edges[1] - bin_edges[0])


pdf_normal = norm.pdf(x_normal, loc=mean_normal, scale=stddev_normal)
entropy_normal = shannon_entropy(probabilities_normal)
axs[1].hist(samples_normal, bins=20, density=True, color='pink')
# axs[1].plot(x_normal, pdf_normal, 'r-', lw=2)
# axs[1].set_title(f'Normal Distribution\nEntropy: {entropy_normal:.2f}')
axs[1].set_xticks([])  # Remove x-axis ticks and labels
axs[1].set_yticks([])  # Remove y-axis ticks and labels

# Exponential Distribution
rate_exponential = 0.1
x_exponential = np.linspace(0, 10, 1000)
samples_exponential = np.random.exponential(1/rate_exponential, 1000)

pdf_exponential = expon.pdf(x_exponential, scale=1/rate_exponential)
# Create a histogram to estimate probabilities
hist, bin_edges = np.histogram(samples_exponential, bins=20, range=(0, 10), density=True)

# Calculate the probability associated with each bin
probabilities_exponential = hist * (bin_edges[1] - bin_edges[0])

entropy_exponential = shannon_entropy(probabilities_exponential)
axs[2].hist(samples_exponential, bins=20, density=True, color='navajowhite')
# axs[2].plot(x_exponential, pdf_exponential, 'g-', lw=2)
# axs[2].set_title(f'Exponential Distribution\nEntropy: {entropy_exponential:.2f}')
axs[2].set_xticks([])  # Remove x-axis ticks and labels
axs[2].set_yticks([])  # Remove y-axis ticks and labels

# Adjust spacing between subplots
plt.tight_layout()

# Save the figure to an image file (e.g., PNG, PDF, etc.)
plt.savefig("entropy_figure.png")  # Provide the desired file name and format

# Display the figure
plt.show()



In [None]:
# Define a function to calculate Shannon entropy (in bits)
def shannon_entropy(probabilities):
    probabilities = np.asarray(probabilities, dtype=float)
    # Skip empty bins: 0 * log2(0) is taken as 0 (it would otherwise give nan)
    nonzero = probabilities[probabilities > 0]
    entropy = -np.sum(nonzero * np.log2(nonzero))
    return entropy


# Define the range and number of outcomes
lower_bound = 0
upper_bound = 7
num_outcomes = 1000

# Generate random values from the uniform distribution
samples_uniform = np.random.uniform(lower_bound, upper_bound, num_outcomes)

# Create a histogram
hist, bin_edges = np.histogram(samples_uniform, bins=20, range=(lower_bound, upper_bound))

# Calculate the probability associated with each bin
probabilities_uniform = hist / num_outcomes  # Normalize by the total number of outcomes

# Calculate entropy using the estimated probabilities
entropy_uniform = shannon_entropy(probabilities_uniform)
hist

In [None]:
# For inclusion in 2024
"A new method of estimating the entropy and redundancy of a language is described." - Shannon's original paper looked at the redundancy of a language too
https://www.princeton.edu/~wbialek/rome/refs/shannon_51.pdf

If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy is the average number of binary digits required per letter of the original language.
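
A rough sketch of the "binary digits per letter" idea, assuming only single-letter frequencies are used. This is not the estimation method of the paper linked above: it ignores correlations between neighbouring letters, so it gives an upper bound on the per-letter entropy rather than the entropy of the language itself. The function name and the sample sentence are hypothetical.

```python
from collections import Counter

import numpy as np

def letter_entropy(text):
    """Entropy (bits per letter) of the single-letter frequencies in `text`."""
    # Keep alphabetic characters only and ignore case
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = np.array(list(Counter(letters).values()), dtype=float)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

sample = "the quick brown fox jumps over the lazy dog"  # hypothetical sample text
print(letter_entropy(sample))
print(np.log2(26))  # maximum for 26 equally likely letters
```

Any real text gives a value below $\log_2 26 \approx 4.70$ bits, because some letters are much more common than others; that gap is one source of the redundancy of the language.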

<!-- multi-scale: modelling entire structure is essentially impossible and the ability to think critically about the distinction between 'wholes' and 'parts' as well as understanding when it is possible to reduce the dimensionality with things like coarse-graining is key  -->


Having defined $h(x)$...With the base of the log set to 2, each 'bit' of information you receive/surprise you experience corresponds to the total amount of possibilities being cut in half...

... And this motivates a second way to view entropy.

### B: Entropy as required information 

On average, what's the minimum number of yes/no questions required to determine the state of $X$?

For the fair die: 
- Outcomes $1,2,5,6$: 3 questions each
- Outcomes $3,4$: 2 questions each

We can see this with a binary decision tree:

![](images/BinaryDecisionTree_FairDie.png)

The expected number of questions is:

$$
E[\text{questions}] = \left( \frac{1}{6} \times 3 \right) + \left( \frac{1}{6} \times 3 \right) + \left( \frac{1}{6} \times 2 \right) + \left( \frac{1}{6} \times 3 \right) + \left( \frac{1}{6} \times 3 \right) + \left( \frac{1}{6} \times 2 \right) \approx 2.67
$$

Entropy bound:

$$
H(X) = -\sum_x \tfrac{1}{6}\log_2 \tfrac{1}{6} 
= \log_2 6 \approx 2.58
$$


For the loaded die: 
- asking "is it 2", you'll be correct two times in three
- if no ($1/3$ of the time): you still need to distinguish between the 5 remaining outcomes, each with probability $1/15$.
    - this needs \~$\log_2 5 \approx 2.32$ extra questions

$$
E[\text{questions}] 
= \tfrac{2}{3}(1) + \tfrac{1}{3}(1+2.32) 
\approx 1.77
$$

To minimise the number of questions, we might structure the decision tree as:

![](images/BinaryDecisionTree_LoadedDie.png)

The expected number of questions is then:

$$
E[\text{questions}] = \left(\tfrac{2}{3} \times 1 \right) + \left(\tfrac{1}{3} \times 3.32 \right) \approx 0.667 + 1.107 \approx 1.77
$$

Entropy bound:

$$
H(X) = -\Big[\tfrac{2}{3}\log_2\tfrac{2}{3} + 5\cdot \tfrac{1}{15}\log_2\tfrac{1}{15}\Big] 
\approx 1.69
$$

Efficient questioning strategies approach the entropy of the distribution. Entropy is the *ideal lower bound* (best possible average).
In the case of the fair die, you need to ask more questions (that is to say, you need more information) to uniquely specify the state of the die, while in the case of the loaded one, since you already know that it's more likely to be two, you need less information.

The number of unique yes/no questions required is approximately given by the entropy function $H(X)$, when the logarithm base is 2.

Note that inefficiencies in the questioning strategy will cause the average to be slightly greater than the entropy of the system.
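
The comparison between question counts and entropy can be sketched in code. As an assumption not made in the notes, Huffman coding stands in here for the best strategy that identifies one roll at a time with whole yes/no questions; the function names are hypothetical. The loaded die uses $p(2) = 2/3$ and $1/15$ for every other face, as above.

```python
import heapq

import numpy as np

def huffman_code_lengths(probs):
    """Number of yes/no questions per outcome in a Huffman tree."""
    # Heap entries: (probability, tie-breaker, outcomes in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        # Merge the two least likely subtrees; every outcome inside them
        # ends up one question deeper in the tree
        p1, _, ids1 = heapq.heappop(heap)
        p2, _, ids2 = heapq.heappop(heap)
        for i in ids1 + ids2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, ids1 + ids2))
        counter += 1
    return lengths

def expected_questions(probs):
    return float(np.dot(probs, huffman_code_lengths(probs)))

def entropy_bits(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

fair_die = [1 / 6] * 6
loaded_die = [1 / 15, 2 / 3, 1 / 15, 1 / 15, 1 / 15, 1 / 15]

for name, probs in [("fair", fair_die), ("loaded", loaded_die)]:
    print(f"{name}: {expected_questions(probs):.2f} questions, "
          f"entropy {entropy_bits(probs):.2f} bits")
# fair: 2.67 questions, entropy 2.58 bits
# loaded: 1.80 questions, entropy 1.69 bits
```

In both cases the average number of whole questions sits a little above the entropy. For the loaded die it is also slightly above the estimate obtained by allowing a fractional $\log_2 5$ questions, since real questions come in whole numbers.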
#### An alternative perspective...

On average, what's the minimum number of yes/no questions required to determine the state of $X$?

For the fair die: 
- on average, it will take you $\approx2.58$ questions to uniquely determine the state of the die
    - $I(x) = -\log_2\left(\frac{1}{6}\right) \approx 2.58 \text{ bits}$
    - $H(X) = -6 \cdot \frac{1}{6}\log_2\left(\frac{1}{6}\right) \approx 2.58 \text{ bits}$

For the loaded die with probability of rolling a 2 $p(2) = \frac{2}{3}$:
- if you start with "is it 2", you'll be correct two times in three
    - $I(2) = -\log_2\left(\frac{2}{3}\right) = \log_2\left(\frac{3}{2}\right) \approx 0.585 \text{ bits}$
- each of the outcomes 1, 3, 4, 5, 6 has a probability of $p(x) = \frac{1}{15}$
    - $I(x) = -\log_2\left(\frac{1}{15}\right) = \log_2(15) \approx 3.907 \text{ bits}$
- on average, it will only take $\approx1.7$ questions, i.e. the **expected information content** (or Shannon entropy) is the weighted average of the information content over all possible outcomes:
    - $H(X) = \frac{2}{3} \cdot 0.585 + \frac{5}{15} \cdot 3.907 \approx 0.390 + 1.302 = 1.692 \text{ bits}$

In the case of the fair die, you need to ask more questions (that is to say, you need more information) to uniquely specify the state of the die, while in the case of the loaded one, since you already know that it's more likely to be two, you need less information.

This (i.e. the number of unique yes/no questions required) is given by the entropy function $H(X)$, when the logarithm base is 2.

<!-- This is linked with the idea of entropy as a measure of the “capacity” of a random variable to disclose information. You can’t really communicate that much information in a coin, since it can only be heads or tails. In contrast, you can communicate a huge amount of information in the English alphabet with its 26 characters (and correspondingly higher entropy). -->

The entropy represents the best-case scenario, while the average number of questions reflects real-world questioning dynamics.

When the probabilities are powers of $1/2$ (like a fair coin, or a fair 4- or 8-sided die), the binary tree is exact: an outcome with probability $p$ takes exactly $-\log_2 p$ questions, so the expected number of questions equals the entropy.

When they are not (like 5-sided or loaded dice), some outcomes take more questions, some fewer, and the average is what matters.
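
A small check of the power-of-$1/2$ case, as a sketch. The distribution `dyadic` and the function name are hypothetical examples, not taken from the notes. For each outcome the code computes $-\log_2 p$, which is a whole number of questions only when $p$ is a power of $1/2$.

```python
import numpy as np

def questions_per_outcome(probs):
    """Ideal number of yes/no questions for each outcome, -log2(p)."""
    return np.log2(1.0 / np.asarray(probs, dtype=float))

dyadic = [1 / 2, 1 / 4, 1 / 8, 1 / 8]  # hypothetical distribution, all powers of 1/2
five_sided = [1 / 5] * 5                # fair 5-sided die

for name, probs in [("dyadic", dyadic), ("five-sided", five_sided)]:
    q = questions_per_outcome(probs)
    whole = bool(np.allclose(q, np.round(q)))  # are all question counts whole numbers?
    entropy = float(np.dot(probs, q))          # weighted average of -log2(p)
    print(name, q, "whole numbers:", whole, "entropy:", entropy)
```

For `dyadic` the question counts are 1, 2, 3 and 3, and their weighted average, 1.75, is exactly the entropy. For the 5-sided die each outcome would need $\log_2 5 \approx 2.32$ questions, which no real tree can deliver, so the achievable average lies above the entropy.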