# Entropy and Information

:::{admonition} **What you will learn**

- Entropy, expressed with base-2 logarithms, quantifies the amount of information (in bits) associated with a macrostate.
- The information-theoretic interpretation of entropy, given a probability distribution over microstates.
- The Maximum Entropy principle as the most unbiased way to infer distributions given empirical constraints.

:::

### Surprise

**Which of these two statements conveys the most information?**

- I will eat some food tomorrow.
- I will see a giraffe walking by my apartment. 

**A measure of information (whatever it may be) is closely related to the element of... surprise!**

- The first statement has very high probability and so conveys little information.
- The second statement has very low probability and so conveys much information.

> If we quantify surprise, we quantify information.

### Additivity of Information

**Knowledge leads to gaining information**

Which is more surprising (contains more information)?

- $E_1$: The card is a heart. $P(E_1) = \frac{1}{4}$

- $E_2$: The card is a queen. $P(E_2) = \frac{4}{52} = \frac{1}{13}$

- $E_3$: The card is the queen of hearts. $P(E_1 \text{ and } E_2) = \frac{1}{52}$

**Learning that an event occurred can only add to our information, $I(E) \geq 0$, and information from independent events should add up:**

1. We learn the card is a heart: $I(E_1)$

2. We learn the card is a queen: $I(E_2)$

3. Since suit and rank are independent, $I(E_1 \text{ and } E_2) = I(E_1) + I(E_2)$



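**A logarithm of probability is a good candidate function for information!**

$$\log_2 [P(E_1) P(E_2)] = \log_2 P(E_1) + \log_2 P(E_2)$$

- What about the sign? Since $p_i \leq 1$, we have $\log_2 p_i \leq 0$, so a minus sign is needed to make information non-negative:

$$I_i = -\log_2 p_i$$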
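As a quick numerical check of additivity, the sketch below evaluates $-\log_2 p$ for the three card events above. The helper name `surprisal_bits` is chosen here for illustration only.

```python
import math

def surprisal_bits(p):
    """Information content -log2(p), in bits, of an event with probability p."""
    return -math.log2(p)

I_heart = surprisal_bits(1 / 4)             # E1: the card is a heart
I_queen = surprisal_bits(1 / 13)            # E2: the card is a queen
I_queen_of_hearts = surprisal_bits(1 / 52)  # E1 and E2 together

print(f"I(heart)            = {I_heart:.3f} bits")
print(f"I(queen)            = {I_queen:.3f} bits")
print(f"I(queen of hearts)  = {I_queen_of_hearts:.3f} bits")
print(f"I(heart) + I(queen) = {I_heart + I_queen:.3f} bits")
```

The last two printed values agree because suit and rank are independent, so their probabilities multiply and their logarithms add.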

### Why bit (base two)

- Consider a symmetric 1D random walk with equal jump probabilities. We can view a **random walk as a string of Yes/No questions**.
- Imagine driving to a location: how many left/right turn instructions do you need to reach the destination?

- You gain one bit of information when you are told the answer to a Yes/No question with equally likely outcomes:

$$I(X=0) = I(X=1) = -\log_2 \frac{1}{2} = 1$$

- To specify an $N$-step random walk trajectory we need $N$ bits.

$$(x_1, x_2, \ldots, x_N) = 10111101001010100100$$
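The correspondence between a trajectory and a bit string can be sketched in a few lines. The encoding used here (1 for a step right, 0 for a step left) and the fixed seed are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
N = 20          # number of steps

# Each step is one fair yes/no answer: 1 = step right, 0 = step left.
bits = [random.randint(0, 1) for _ in range(N)]
message = "".join(str(b) for b in bits)

# Decoding: rebuild the walker's positions from the bit string alone.
position = 0
trajectory = [position]
for b in message:
    position += 1 if b == "1" else -1
    trajectory.append(position)

print(message)         # N characters, one bit per step
print(trajectory[-1])  # final position of the walker
```

The whole trajectory is recovered from exactly $N$ binary symbols, one per step.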

### Shannon Measure of Information


:::{admonition} **Shannon Entropy**
:class: important 

$$H = -\sum_i p_i \log_2 p_i$$

- $H$: entropy (information) measured in bits
- $p_i$: probability of microstate $i$, e.g. a coin flip or die roll outcome

:::

- John von Neumann's advice to Claude Shannon: "You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name. In the second place, and more importantly, no one knows what entropy really is, so in a debate you will always have the advantage."

- The entropy that appears here and the entropy of thermodynamics are one and the same quantity, expressed in different units! We reserve the letter $S$ for entropy in units of the Boltzmann constant $k_B$, which sets the units of thermal energy. For now let's just roll with $k_B=1$; we won't be doing any thermodynamics here.

- Note that when all $\Omega$ microstates of a macrostate are equally likely, e.g. a fair coin ($\Omega=2$) or a fair die ($\Omega=6$), entropy is maximized and becomes the log of the number of microstates in the macrostate: $H = \log_2 \Omega$.
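A minimal sketch of the Shannon formula, checking that the uniform case gives $\log_2 \Omega$. The function name and the loaded-die probabilities are hypothetical, chosen only to show that a non-uniform distribution has lower entropy.

```python
import math

def shannon_entropy_bits(probs):
    """H = -sum_i p_i log2 p_i, using the convention 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = [1 / 2, 1 / 2]
fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical loaded die

H_coin = shannon_entropy_bits(fair_coin)
H_die = shannon_entropy_bits(fair_die)
H_loaded = shannon_entropy_bits(loaded_die)

print(f"fair coin : {H_coin:.3f} bits (log2 2 = {math.log2(2):.3f})")
print(f"fair die  : {H_die:.3f} bits (log2 6 = {math.log2(6):.3f})")
print(f"loaded die: {H_loaded:.3f} bits")
```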

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def binary_entropy(p):
    """
    Compute the binary Shannon entropy (in bits) for a given probability p.
    The caller must keep p strictly between 0 and 1, since log2(0) is undefined.
    """
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Generate probability values, avoiding the endpoints to prevent log(0)
p_vals = np.linspace(0.001, 0.999, 1000)
H_vals = binary_entropy(p_vals)

# Create a figure with two subplots side-by-side
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Binary Shannon Entropy Function
ax[0].plot(p_vals, H_vals, lw=2, color='midnightblue',
           label=r"$H(p)=-p\log_2(p)-(1-p)\log_2(1-p)$")
ax[0].set_xlabel(r"Probability $p$", fontsize=14)
ax[0].set_ylabel("Entropy (bits)", fontsize=14)
ax[0].set_title("Binary Shannon Entropy", fontsize=16)
ax[0].legend(fontsize=12)
ax[0].grid(True)

# Plot 2: Example Distributions and Their Entropy
# Define a few example two-outcome distributions:
distributions = {
    "Uniform (0.5, 0.5)": [0.5, 0.5],
    "Skewed (0.8, 0.2)": [0.8, 0.2],
    "Extreme (0.99, 0.01)": [0.99, 0.01]
}

# Colors for each distribution
colors = ["skyblue", "salmon", "lightgreen"]

# For visual separation, use offsets for the bars
width = 0.25
x_ticks = np.arange(2)  # positions for the two outcomes

for i, (label, probs) in enumerate(distributions.items()):
    # Compute the Shannon entropy for the distribution
    entropy_val = -np.sum(np.array(probs) * np.log2(probs))
    # Offset x positions for clarity
    x_positions = x_ticks + i * width - width
    ax[1].bar(x_positions, probs, width=width, color=colors[i],
              label=f"{label}\nEntropy = {entropy_val:.2f} bits")
    
# Set labels and title for the bar plot
ax[1].set_xticks(x_ticks)
ax[1].set_xticklabels(["Outcome 1", "Outcome 2"], fontsize=12)
ax[1].set_ylabel("Probability", fontsize=14)
ax[1].set_title("Example Distributions", fontsize=16)
ax[1].legend(fontsize=12)
ax[1].grid(True, axis='y')

plt.tight_layout()
plt.show()


::::{admonition} **Exercise: Information per letter $I(m)$ to decode the message** 
:class: note, dropdown

- Let $m$  represent the letters in an alphabet. For example:

  - **Korean:** 24 letters
  - **English:** 26 letters
  - **Russian:** 33 letters

- The information content associated with these alphabets satisfies:
  
  $$
  I(\text{Russian}) > I(\text{English}) > I(\text{Korean})
  $$

- The information of a sequence of letters is additive, regardless of the order in which they are transmitted:

  $$
  I(m_1, m_2) = I(m_1) + I(m_2)
  $$

- **Question:** If the symbols of the English alphabet (plus a blank) appeared with equal probability, what information would a single symbol carry? It would be $\log_2(26 + 1) = 4.755$ bits, but for actual English sentences it is known to be about **$1.3$ bits. Why?**

:::{dropdown} **Solution**

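- Not every letter has equal probability or frequency of appearing in a sentence! In addition, neighboring letters are correlated, so knowing the preceding text makes the next letter easier to guess.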

:::

::::
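The effect of unequal letter frequencies can be illustrated with a rough sketch that counts symbols in a short text. The sample sentence is hypothetical and far too short for a reliable estimate; note also that single-symbol frequencies ignore correlations between letters, so this count alone is not expected to reproduce the $1.3$ bits figure.

```python
import math
from collections import Counter

# Hypothetical sample text, used only to illustrate unequal symbol frequencies.
sample = ("the entropy of a message depends on how often each symbol appears "
          "and on how strongly neighboring symbols are correlated")

# Keep only letters and blanks, matching the 26 + 1 symbol alphabet.
symbols = [c for c in sample if c.isalpha() or c == " "]
counts = Counter(symbols)
total = sum(counts.values())

H_uniform = math.log2(27)  # 26 letters + blank, all equally likely
H_sample = -sum((n / total) * math.log2(n / total) for n in counts.values())

print(f"uniform alphabet  : {H_uniform:.3f} bits per symbol")
print(f"sample frequencies: {H_sample:.3f} bits per symbol")
```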


::::{admonition} **Exercise: entropy of die rolls** 
:class: note, dropdown

- How much information do we need to find out the outcome of a fair die roll?

- We are told the die shows a number higher than 2 (3, 4, 5 or 6). How much information does this message carry? 


:::{dropdown} **Solution**

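 - Before the message, all six outcomes are equally likely: $H(E_1) = \log_2 6 = 2.585$ bits.

 - After the message, four outcomes remain, so $H(E_2) = \log_2 4 = 2$ bits. The message carries $H(E_1) - H(E_2) = \log_2 6 - \log_2 4 = 0.585$ bits.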

:::

::::


::::{admonition} **Exercise: Two cats** 
:class: note, dropdown

There are two kittens. We are told that at least one of them is a male. What is the information we get from this message?

:::{dropdown} **Solution**

$$E_1 = \{mm, mf, fm, ff\}$$

$$E_2 = \{mm, mf, fm\}$$

$$H(E_1) - H(E_2) = \log_2 4 - \log_2 3 = 0.415 \text{ bits}$$

:::

::::

::::{admonition} **Exercise: Monty Hall problem** 
:class: note, dropdown

There are five boxes, of which one contains a prize. A game participant is asked to choose one box. After they choose one of the five boxes, the “coordinator” of the game identifies as empty three of the four unchosen boxes. What is the information of this message? 

:::{dropdown} **Solution**

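- Before the message, each of the five boxes is equally likely to hold the prize: $H(E_1) = \log_2 5 = 2.322$


- After the message, the chosen box holds the prize with probability $\frac{1}{5}$ and the one remaining unopened box with probability $\frac{4}{5}$: $H(E_2) = -\frac{1}{5} \log_2 \frac{1}{5} - \frac{4}{5} \log_2 \frac{4}{5} = 0.722$

- $H(E_1)-H(E_2) = 1.6$ bits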

:::

::::
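The numbers in the last three exercises (die, kittens, five boxes) can be checked with a few lines of arithmetic. This is only a verification sketch; the variable names are chosen here for illustration.

```python
import math

# Die: learning "the roll is higher than 2" shrinks 6 outcomes to 4.
info_die = math.log2(6) - math.log2(4)

# Kittens: "at least one is male" shrinks {mm, mf, fm, ff} to {mm, mf, fm}.
info_kittens = math.log2(4) - math.log2(3)

# Five boxes: before the message every box has probability 1/5; afterwards
# the chosen box keeps 1/5 and the single remaining box has 4/5.
H_before = math.log2(5)
H_after = -(1 / 5) * math.log2(1 / 5) - (4 / 5) * math.log2(4 / 5)
info_boxes = H_before - H_after

print(f"die     : {info_die:.3f} bits")
print(f"kittens : {info_kittens:.3f} bits")
print(f"boxes   : {H_before:.3f} - {H_after:.3f} = {info_boxes:.3f} bits")
```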

::::{admonition} **Exercise: Why is the number of YES/NO questions non-integer?** 
:class: note, dropdown

Explain the origin of non-integer information. Why can it take less than one bit to encode information? 

:::{dropdown} **Solution**


- We have encountered a fraction of a bit of information several times now. What does it imply in terms of the number of YES/NO questions? A fractional value arises because in some cases a single YES/NO question can rule out more than one elementary event.

- In other words, we can ask clever questions that get us to the answer faster than asking YES/NO about every single possibility.

- Example: 999 blue balls and 1 red ball. How many questions do we need to ask to determine the colors of all balls? $S = \log_2 1000 = 9.97$ bits, or about 0.01 bit per ball. Divide the container into two halves of 500 balls each and ask which half holds the red ball: one question rules out 500 balls at once.

:::

::::
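The halving strategy for the red ball can be sketched as a bisection search. The function below is an illustration written for this example, assuming the balls are lined up and indexed from 0 to 999.

```python
import math

def questions_to_find(red_index, n_balls):
    """Count the yes/no questions needed to locate the red ball by halving."""
    lo, hi = 0, n_balls  # the red ball sits somewhere in the range [lo, hi)
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1   # "Is the red ball in the first half?"
        if red_index < mid:
            hi = mid
        else:
            lo = mid
    return questions

n_balls = 1000
worst_case = max(questions_to_find(i, n_balls) for i in range(n_balls))
bits_per_ball = math.log2(n_balls) / n_balls

print(f"worst-case questions: {worst_case}")
print(f"log2(1000) = {math.log2(n_balls):.2f} bits, {bits_per_ball:.4f} bits per ball")
```

The worst case is the smallest integer not below $\log_2 1000$, which is how a non-integer entropy translates into a whole number of questions.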

### Entropy, micro and macro states probability

:::{admonition} **Entropy for equally probable microstates**
:class: important 

- When all $\Omega$ microstates in the sample space are equally likely, $p_i=\frac{1}{\Omega}$, we obtain a simple expression for entropy first derived by Boltzmann:

$$S(\Omega) = -\sum_i \frac{1}{\Omega} \log \frac{1}{\Omega} = \log \Omega$$

- **A macrostate** $A$ containing $\Omega(A)$ microstates has probability $P(A) = \frac{\Omega(A)}{\Omega}$, and its entropy is given by:

$$S(A) = \log \Omega(A) = \log P(A) + \text{const}, \qquad \text{const} = \log \Omega$$

- Entropy of a macrostate **quantifies how probable that macrostate is!**

:::


::: {admonition} **Flashback to Random Walk, Binomial, and Large Deviation Theorem**  
:class: tip, dropdown  

- [Where have we seen the entropy expression before?](https://dpotoyan.github.io/Statmech4ChemBio/1_stats/Probabilities_Counting.html#large-deviation-theory)
- When we took the **log of the binomial distribution**! But why did we call it entropy?  

$$
S(n) = \log \frac{N!}{n! (N-n)!} \approx N \left[ - f \log f - (1-f) \log (1-f) \right] = N s(f)
$$

- Here, $ f = n/N $ represents the **fraction (empirical probability) of steps to the right** in a random walk, and the approximation follows from Stirling's formula for large $N$.  
- $ s(f) = S/N $ is the **entropy per particle (or per step)**, while $ S $ is the **total entropy**.  
- **Different macrostates have different entropy, depending on the number of microstates they contain!**  

  $$
  P(n) = \frac{\Omega(n)}{\Omega_{\text{total}}} =  \frac{N!}{n! (N-n)!} \cdot \frac{1}{2^N}
  $$

- where $ \Omega_{\text{total}} = 2^N $ is the **total number of microstates**.  
- Once again, we can think of **the entropy of a macrostate as being related to its probability:**  

  $$
  S(n) = \log P(n) + N \log 2
  $$

**Connection to the Large Deviation Theorem**  
When we express probability in terms of entropy, we recover the **Large Deviation Theorem**, which states that **fluctuations away from the most likely macrostate are exponentially suppressed**:

$$
P(f) \sim e^{-N \left[ \log 2 - s(f) \right]}, \qquad s(f) \leq \log 2 \ \text{with equality only at } f = \tfrac{1}{2}
$$

This result highlights how entropy naturally governs the likelihood of macrostates in statistical mechanics.  

:::
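To see how well $N s(f)$ tracks the exact log of the binomial coefficient, here is a small sketch using `math.lgamma` for the factorials. The choice $N = 1000$ and the three values of $n$ are arbitrary.

```python
import math

def log_multiplicity(N, n):
    """Exact log of the binomial coefficient, log[N! / (n! (N-n)!)]."""
    return math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)

def entropy_per_step(f):
    """s(f) = -f log f - (1 - f) log(1 - f), with natural logarithms."""
    return -f * math.log(f) - (1 - f) * math.log(1 - f)

N = 1000
rows = []
for n in (500, 600, 800):
    f = n / N
    exact = log_multiplicity(N, n)       # log Omega(n)
    stirling = N * entropy_per_step(f)   # N s(f)
    log_prob = exact - N * math.log(2)   # log P(n) = log Omega(n) - N log 2
    rows.append((n, exact, stirling, log_prob))
    print(f"n={n}: log Omega = {exact:.2f}, N s(f) = {stirling:.2f}, log P(n) = {log_prob:.2f}")
```

The two entropy columns differ only by a correction that grows logarithmically with $N$, while $\log P(n)$ drops quickly as $n$ moves away from $N/2$.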


#### Descriptions of Entropy

- **Measure of Required Information:**  
  - Entropy quantifies the number of yes/no questions needed to precisely identify the microstate of a system. For example, determining the exact trajectory of an N-step random walk or the detailed molecular distribution in a container both require specifying a particular microstate.

- **Indicator of Diversity and Uncertainty:**  
  - Entropy reflects the diversity of microstates in a macrostate available to a system. A higher entropy implies a larger number of possible microstates, leading to greater uncertainty about the system's actual state.
  - When all microstates are equally likely, the entropy of a macrostate is just the log of its probability (up to an additive constant)! Thus we may expect a system, if given a choice, to spontaneously evolve toward a more likely macrostate rather than a less likely one.

- **Implications:**  
  - In systems with high entropy, the vast diversity of microstates means that more "work" is needed—in an informational sense—to pinpoint a specific microstate.
  - If we want to reduce the entropy, we must carry out physical work! E.g. if we want to ask fewer yes/no questions about the positions of gas atoms, we must compress the gas.
  - Hence we expect systems that evolve spontaneously and irreversibly to increase their entropy, and not the other way around!

### Is Information Physical?

:::{figure-md} markdown-fig  

<img src="./figs/max-dem.png" alt="diffflux" style="width:35%">

Maxwell’s demon controlling the door that allows the passage of single molecules from one side to the other. The initial hot gas gets hotter at the end of the process while the cold gas gets colder.
:::  

- **Wheeler's "It from Bit":**  
  - Every "it" — every particle, every field, every force, and even the fabric of space-time — derives its function, meaning, and very existence from binary answers to yes-or-no questions. In essence, Wheeler's idea of "It from Bit" posits that at the most fundamental level, the physical universe is rooted in information. This perspective implies that all aspects of reality are ultimately information-theoretic in origin.

- **Maxwell's Demon:**  
  - Maxwell's Demon is a thought experiment that challenges the second law of thermodynamics by envisioning a tiny being capable of sorting molecules based on their speeds. By selectively allowing faster or slower molecules to pass through a gate, the demon appears to reduce entropy without expending energy. However, the act of gathering and processing information incurs a thermodynamic cost, ensuring that the overall entropy balance is maintained. This paradox underscores that information is a physical quantity with measurable effects on energy and entropy.

### Jaynes' Maximum Entropy (MaxEnt) Principle

- Probability is an expression of incomplete information. Given that we have some information, how should we construct a probability distribution that reflects that knowledge, but is otherwise unbiased? 
- The best general procedure, known as Jaynes' Maximum Entropy (MaxEnt) Principle, is to choose the probabilities $ p_k $ to **maximize the Shannon entropy** of the distribution, **subject to constraints** that express what we do know.
- Maximizing entropy in this way ensures that we select the least biased distribution possible, given the constraints.

::: {admonition} **MaxEnt**
:class: important 

- **MaxEnt: Maximize entropy subject to constraints on given observables: $x^a$, $x^b$, etc.**

$$
S = - \sum_k p_k \log p_k,
$$

$$
\sum_k p_k \, x_k^a = \langle x^a \rangle, \quad \sum_k p_k \, x_k^b = \langle x^b \rangle, \quad \text{etc.}
$$

$$
J[p] = S(p) - \lambda_0 \left( \sum_k p_k - 1 \right) - \lambda_1 \left( \sum_k p_k \, x_k^a - \langle x^a \rangle \right) - \cdots,
$$

- $J[p]$ is the function we optimize, and the $\lambda_i$ are **Lagrange multipliers** enforcing the respective constraints.

:::


### Applications of MaxEnt

#### 1. Fair Die Example

For a fair $ N $-sided die, we have no prior information favoring one outcome over another. The only constraint is the normalization condition:

$$ \sum_{i=1}^{N} p_i = 1. $$

We maximize:

$$ J[p_1, p_2, \ldots] = - \sum_i p_i \log p_i - \lambda \left( \sum_i p_i - 1 \right). $$

Taking the derivative and solving for $ p_i $, we obtain:

$$ p_1 = p_2 = ... = p_N = \frac{1}{N}. $$

This confirms our intuition that all outcomes are equally probable.
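
The derivative step for the fair die can be written out explicitly. Setting the derivative of $J$ with respect to each $p_i$ to zero gives the same equation for every face:

$$ \frac{\partial J}{\partial p_i} = -\log p_i - 1 - \lambda = 0 \quad \Rightarrow \quad p_i = e^{-1-\lambda}. $$

Since the right-hand side does not depend on $i$, all probabilities are equal, and normalization fixes the multiplier:

$$ \sum_{i=1}^{N} p_i = N e^{-1-\lambda} = 1 \quad \Rightarrow \quad p_i = \frac{1}{N}. $$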


#### 2. Biased Die Example

Suppose we have additional information: the average outcome of rolling a six-sided die is $ \langle x \rangle = 5.5 $. The function to maximize becomes:

$$ J[p_1, p_2, \ldots] = - \sum_i p_i \log p_i - \lambda \left( \sum_i p_i - 1 \right) - B \left( \sum_i p_i x_i - 5.5 \right). $$

Solving the variational equation, we find that the optimal probability distribution follows an exponential form:

$$ p_i = \frac{e^{- B x_i}}{Z}, $$

where $ Z $ is the partition function ensuring normalization:

$$ Z = \sum_{i=1}^{6} e^{- B x_i}. $$

To determine $ B $, we use the constraint $ \langle x \rangle = 5.5 $:

$$ \sum_{i=1}^{6} x_i \frac{e^{- B x_i}}{Z} = 5.5. $$

This equation can be solved numerically for $ B $, for example with Newton's method or another root-finding technique. The distribution has the same exponential form as the Boltzmann factor in statistical mechanics. Because the target mean $5.5$ exceeds the fair-die mean $3.5$, the solution has $ B < 0 $, so higher outcomes are exponentially more probable; a target mean below $3.5$ would give $ B > 0 $ and favor low outcomes.
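
As a rough sketch of solving the constraint equation for $B$, the snippet below uses plain bisection (swapped in here for simplicity; Newton's method would also work). The bracket $[-10, 10]$ and the helper name `mean_for` are assumptions made for illustration.

```python
import numpy as np

x = np.arange(1, 7)  # die faces 1..6
target_mean = 5.5

def mean_for(B):
    """Mean of the MaxEnt distribution p_i = exp(-B x_i) / Z."""
    w = np.exp(-B * (x - x.mean()))  # shifting x only rescales Z and avoids overflow
    p = w / w.sum()
    return float(np.sum(x * p))

# The mean decreases monotonically as B grows, so bisection on a bracket works.
lo, hi = -10.0, 10.0  # assumed bracket: mean_for(lo) > 5.5 > mean_for(hi)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) > target_mean:
        lo = mid  # mean still too large, so B must increase
    else:
        hi = mid
B = 0.5 * (lo + hi)

w = np.exp(-B * x)
p = w / w.sum()
print(f"B = {B:.4f}")
print("p =", np.round(p, 4), " mean =", round(float(np.sum(x * p)), 6))
```

For this target mean the root comes out negative, so the weights $e^{-B x_i}$ grow with the face value.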

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Define die outcomes
x = np.arange(1, 7)  # Possible outcomes {1,2,3,4,5,6}

# Define the target mean constraint
target_mean = 5.5  # Expected value constraint

# Define the objective: the negative entropy, so that minimizing it maximizes the entropy
def entropy(logp):
    """
    Compute the negative entropy, sum_i p_i log p_i (minimized by the optimizer).
    logp contains log-probabilities to ensure numerical stability.
    """
    p = np.exp(logp)  # Convert log-probabilities back to probabilities
    p /= np.sum(p)  # Ensure proper normalization
    return np.sum(p * np.log(p))  # Negative entropy: minimizing this maximizes entropy

# Define the constraint function to enforce the expected value condition
def constraint(logp):
    """
    Constraint function to enforce the mean constraint ⟨x⟩ = 5.5.
    """
    p = np.exp(logp)  # Convert log-probabilities back to probabilities
    p /= np.sum(p)  # Ensure proper normalization
    return np.sum(x * p) - target_mean  # Difference from the desired mean

# Initial guess: uniform log-probabilities (log of 1/6 for each face)
logp0 = np.log(np.ones(6) / 6)

# Define optimization constraints
constraints = {"type": "eq", "fun": constraint}

# Perform numerical optimization using Sequential Least Squares Programming (SLSQP)
result = minimize(entropy, logp0, constraints=constraints, method="SLSQP")

# Extract the optimized probability distribution
optimal_p = np.exp(result.x)
optimal_p /= np.sum(optimal_p)  # Ensure normalization

# Plot the resulting probability distribution
plt.figure(figsize=(8, 5))
plt.bar(x, optimal_p, color='royalblue', alpha=0.7, edgecolor='black')
plt.xlabel("Die Outcome", fontsize=14)
plt.ylabel("Probability", fontsize=14)
plt.title("Optimized MaxEnt Probability Distribution for a Biased Die", fontsize=16)
plt.xticks(x)
plt.grid(axis="y", linestyle="--", alpha=0.6)

# Display the plot
plt.show()

# Print the optimized probabilities
optimal_p


:::{admonition} **Physical constraints on energy, particle number, volume**
:class: tip, dropdown

**1. Microcanonical Ensemble (Fixed Energy, Volume, and Particle Number)**

- For an isolated gas with a fixed energy $ E $, volume $ V $, and particle number $ N $, we maximize entropy subject to the constraint that only microstates with energy $ E $ are accessible:

$$ J = -\sum_k p_k \log p_k - \lambda \left( \sum_k p_k - 1 \right). $$

- Solving for $ p_k $, we obtain:

$$ p_k = \frac{1}{\Omega}, $$

- where $ \Omega $ is the number of accessible microstates. This is the microcanonical ensemble, the starting point of statistical mechanics, with entropy given by Boltzmann's formula $ S = k_B \log \Omega $.

**2. Canonical Ensemble (Fixed Temperature, Volume, and Particle Number)**

- If the system is in thermal contact with a heat bath at temperature $ T $, energy is allowed to fluctuate. The constraint now involves the mean energy $ \langle E \rangle = U $:

$$ J = -\sum_k p_k \log p_k - \lambda \left( \sum_k p_k - 1 \right) - \beta \left( \sum_k p_k E_k - U \right). $$

- Solving, we obtain the Boltzmann distribution:

$$ p_k = \frac{e^{-\beta E_k}}{Z}, $$

- where $ \beta = 1 / k_B T $ and $ Z = \sum_k e^{-\beta E_k} $ is the partition function. This distribution governs systems in thermal equilibrium.
:::
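
A minimal sketch of the Boltzmann distribution for a small system. The three energy levels are hypothetical and given in arbitrary units; the function name `boltzmann` is chosen for this example.

```python
import numpy as np

def boltzmann(energies, beta):
    """Return p_k = exp(-beta E_k) / Z for a list of energy levels."""
    E = np.asarray(energies, dtype=float)
    weights = np.exp(-beta * (E - E.min()))  # subtracting E.min() leaves p_k unchanged
    return weights / weights.sum()

levels = [0.0, 1.0, 2.0]  # hypothetical energy levels, arbitrary units

p_hot = boltzmann(levels, beta=0.0)   # infinite temperature: all levels equally likely
p_mid = boltzmann(levels, beta=1.0)
p_cold = boltzmann(levels, beta=5.0)  # low temperature: the ground state dominates

for name, p in (("beta=0", p_hot), ("beta=1", p_mid), ("beta=5", p_cold)):
    print(name, np.round(p, 4))
```

At $\beta = 0$ the distribution reduces to the uniform microcanonical result, and increasing $\beta$ concentrates probability on the lowest energy level.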

### Relative Entropy

**Entropy change due to Diffusion**  

- In one-dimensional diffusion (Brownian motion), if a particle starts at $x_0 = 0$ and diffuses freely, its probability distribution after time $t$ is a Gaussian with variance $\sigma^2 = 2Dt$:  

$$
p(x, t) = \frac{1}{\sqrt{4\pi D t}} e^{-\frac{x^2}{4Dt}}
$$


- The **Shannon entropy** of this continuous distribution (the differential entropy) is given by:

$$
S(t) = -\int p(x, t) \log p(x, t) \, dx.
$$

- Evaluating the integral yields a nice compact formula showing that entropy grows with time as molecules diffuse and spread all over the container. 

$$
S(t) = \frac{1}{2} \log (4\pi e Dt).
$$

- But if we tried to evaluate this entropy numerically, by discretizing $x$ on a grid and summing over bins, we would run into a serious problem!

 **Problem: Grid Dependence of Shannon Entropy**  

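- A major issue with the Shannon entropy of a continuous variable is that it **depends on the choice of units and of grid spacing**. With bins of width $\Delta x$, the bin probabilities are $p_i \approx p(x_i, t)\, \Delta x$, so the discrete entropy behaves as $S_{\Delta x} \approx S(t) - \log \Delta x$. If we refine the grid by choosing a smaller $\Delta x$, the computed entropy does not converge to a well-defined value; it diverges! This makes it unsuitable for studying entropy change in diffusion.  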

- To avoid this issue, one can instead use **relative entropy** (Kullback-Leibler divergence), which remains well-defined and independent of discretization.
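
The grid dependence can be made concrete with a short numerical sketch. It bins the diffusion Gaussian at three resolutions and compares the discrete entropy, with and without a $\log \Delta x$ correction, against the closed-form value. The grid range of $\pm 10\sigma$ and the bin counts are arbitrary choices.

```python
import numpy as np

D, t = 1.0, 1.0
sigma = np.sqrt(2 * D * t)
S_diff = 0.5 * np.log(4 * np.pi * np.e * D * t)  # closed-form differential entropy

results = []
for n_bins in (100, 1000, 10000):
    edges = np.linspace(-10 * sigma, 10 * sigma, n_bins + 1)
    dx = edges[1] - edges[0]
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Probability mass per bin, approximated as density at the bin center times dx
    p = np.exp(-centers**2 / (4 * D * t)) / np.sqrt(4 * np.pi * D * t) * dx
    p /= p.sum()
    p = p[p > 0]  # drop empty bins so that log is defined
    S_grid = -np.sum(p * np.log(p))
    results.append((n_bins, S_grid, S_grid + np.log(dx)))
    print(f"bins={n_bins:6d}  S_grid={S_grid:.4f}  S_grid + log(dx)={S_grid + np.log(dx):.4f}")

print(f"differential entropy = {S_diff:.4f}")
```

The raw discrete entropy keeps growing as the bins shrink, while the corrected value stays at the differential entropy.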


:::{admonition} **Relative Entropy**
:class: important

$$
D_{\text{KL}}(P || Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}
$$

$$
D_{\text{KL}}(P || Q) = \int P(x) \ln \frac{P(x)}{Q(x)} \, dx.
$$

- $Q$: **reference probability** distribution
- $P$: **true probability** distribution, or the one we are using/observing. 

:::

- The Kullback-Leibler (KL) divergence, or relative entropy, measures **how much information is lost when a reference distribution $Q$ is used to approximate $P$**.
- **KL is non-negative and equals zero if and only if $P = Q$ everywhere.** 
- KL divergence is widely used in statistical mechanics, information theory, machine learning and thermodynamics as a measure of information loss when approximating one distribution with another.
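
A minimal sketch of the discrete formula, checking the properties listed above on two small distributions. The probabilities in `P` are hypothetical, and `Q` is taken to be uniform so that the result can be compared with $\log \Omega - S(P)$.

```python
import math

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum_x P(x) ln[P(x) / Q(x)] for discrete distributions.

    Terms with P(x) = 0 contribute zero; Q(x) must be positive wherever P(x) > 0.
    """
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]        # hypothetical "true" distribution
Q = [1 / 3, 1 / 3, 1 / 3]  # uniform reference distribution

D_PQ = kl_divergence(P, Q)
D_QP = kl_divergence(Q, P)
D_PP = kl_divergence(P, P)
S_P = -sum(p * math.log(p) for p in P)  # Shannon entropy of P in nats

print(f"D(P||Q) = {D_PQ:.4f}, D(Q||P) = {D_QP:.4f}, D(P||P) = {D_PP:.4f}")
print(f"log(3) - S(P) = {math.log(3) - S_P:.4f}")
```

With a uniform reference over $\Omega$ states, the KL divergence equals $\log \Omega - S(P)$, the entropy deficit of $P$ relative to the uniform distribution.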

### Asymmetry of relative entropy is important

#### Asymmetry of KL and irreversibility 

- Instead of using entropy alone, let’s compute the **KL divergence** between the Gaussian distributions of a diffusing particle at two times $ t_1 $ and $ t_2 $, whose variances are $ \sigma_1^2 = 2D t_1 $ and $ \sigma_2^2 = 2D t_2 $:

$$
D(p_1 \| p_2) = \frac{1}{2} \left( \frac{t_1}{t_2} - 1 + \log \frac{t_2}{t_1} \right)
$$


- Note the asymmetry of relative entropy: $D(p_1 \| p_2) \neq D(p_2 \| p_1)$

- For a **diffusion process**, if we compare the forward evolution of a Gaussian spreading over time with the reversed process (contracting into a peak), we see that:

$$
D_{\text{KL}}(P_{\text{forward}} || P_{\text{backward}}) > 0.
$$

- This indicates that diffusion is an **irreversible** process in the absence of external driving forces (since it tends to increase entropy). In contrast, a time-reversed diffusion process (all particles contracting back into the initial state) would violate the second law of thermodynamics.
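
The closed-form expression for the KL divergence between the two diffusion Gaussians can be checked against direct numerical integration. The sketch below uses a simple Riemann sum on a wide grid; the grid limits and the helper names are assumptions made for illustration.

```python
import numpy as np

def gaussian(x, var):
    """Zero-mean Gaussian density with variance var."""
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def kl_numeric(var_p, var_q, x):
    """Riemann-sum estimate of D_KL(p || q) on the uniform grid x."""
    p, q = gaussian(x, var_p), gaussian(x, var_q)
    dx = x[1] - x[0]
    return float(np.sum(p * np.log(p / q)) * dx)

def kl_closed_form(t_p, t_q):
    """D(p || q) = (1/2) [t_p / t_q - 1 + log(t_q / t_p)] for variances 2 D t."""
    return 0.5 * (t_p / t_q - 1 + np.log(t_q / t_p))

D, t1, t2 = 1.0, 1.0, 2.0
x = np.linspace(-30, 30, 20001)  # wide enough that both tails are negligible

forward = kl_numeric(2 * D * t1, 2 * D * t2, x)   # D(p_1 || p_2)
backward = kl_numeric(2 * D * t2, 2 * D * t1, x)  # D(p_2 || p_1)

print(f"D(p1||p2): numeric {forward:.5f}, closed form {kl_closed_form(t1, t2):.5f}")
print(f"D(p2||p1): numeric {backward:.5f}, closed form {kl_closed_form(t2, t1):.5f}")
```

Both directions are positive but unequal, which is the asymmetry noted above.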

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Define parameters
D = 1.0  # Diffusion coefficient
t1 = 1.0  # Time for first Gaussian
t2 = 2.0  # Time for second Gaussian
sigma1 = np.sqrt(2 * D * t1)  # Standard deviation at t1
sigma2 = np.sqrt(2 * D * t2)  # Standard deviation at t2

# Define grid for discretization
x_min, x_max = -5, 5  # Range of x values
num_bins_list = [10, 50, 100, 500]  # Different resolutions

# Compute Shannon entropy for discretized Gaussian
shannon_entropies = []

for num_bins in num_bins_list:
    x_bins = np.linspace(x_min, x_max, num_bins + 1)  # Bin edges
    x_centers = (x_bins[:-1] + x_bins[1:]) / 2  # Bin centers
    dx = x_bins[1] - x_bins[0]  # Bin width
    
    # Compute probability mass function for discretized Gaussian
    p_i = stats.norm.pdf(x_centers, 0, sigma1) * dx
    p_i = p_i / np.sum(p_i)  # Normalize

    # Compute Shannon entropy
    S_shannon = -np.sum(p_i * np.log(p_i + 1e-10))  # Avoid log(0)
    shannon_entropies.append(S_shannon)

# Compute differential entropy for continuous Gaussian
S_diff = 0.5 * np.log(4 * np.pi * np.e * D * t1)

# Compute KL divergence between Gaussians at t1 and t2 (closed form, independent of the grid)
KL_div = 0.5 * ((t1 / t2) - 1 + np.log(t2 / t1))
print(f"D_KL(p_1 || p_2) = {KL_div:.4f}")

# Plot Shannon entropy vs. grid resolution
plt.figure(figsize=(6,4))
plt.plot(num_bins_list, shannon_entropies, 'o-', label='Shannon Entropy')
plt.axhline(y=S_diff, color='r', linestyle='--', label='Differential Entropy')
plt.xlabel('Number of Bins')
plt.ylabel('Entropy')
plt.title('Shannon Entropy vs. Grid Resolution')
plt.legend()
plt.xscale('log')
plt.grid()
plt.show()


#### Asymmetry of KL in Machine learning

1. If $ Q(x) $ **assigns very low probability** to a region where $ P(x) $ is high, the term $ \log \frac{P(x)}{Q(x)} $ becomes large, **strongly penalizing $ Q $ for underestimating $ P $**.  
2. If $ Q(x) $ **is broader than $ P(x) $, assigning extra probability mass to unlikely regions**, this does not significantly affect $ D_{\text{KL}}(P || Q) $, because $ P(x) $ is small in those regions.  

- This asymmetry explains why **KL divergence is not a true distance metric**. It penalizes **underestimation** of true probability mass much more than **overestimation**, making it particularly useful in **machine learning** where models are trained to avoid assigning near-zero probabilities to observed data.

In [None]:
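# Define two Gaussian distributions with different means and variances
from scipy.stats import norm
from scipy.integrate import trapezoid  # trapezoidal rule (np.trapz is deprecated in NumPy 2.x)
import numpy as np
import matplotlib.pyplot as plt

mu1, sigma1 = 0, 1  # Mean and standard deviation for P
mu2, sigma2 = 1, 2  # Mean and standard deviation for Q

x_values = np.linspace(-5, 5, 1000)  # Define spatial grid for plotting
P = norm.pdf(x_values, loc=mu1, scale=sigma1)  # First Gaussian
Q = norm.pdf(x_values, loc=mu2, scale=sigma2)  # Second Gaussian

# Compute KL divergences on a wider grid so the tails of Q are not truncated
x_int = np.linspace(-20, 20, 4001)
P_int = norm.pdf(x_int, loc=mu1, scale=sigma1)
Q_int = norm.pdf(x_int, loc=mu2, scale=sigma2)
D_KL_PQ = trapezoid(P_int * np.log(P_int / Q_int), x_int)  # D_KL(P || Q)
D_KL_QP = trapezoid(Q_int * np.log(Q_int / P_int), x_int)  # D_KL(Q || P)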

# Plot the distributions
plt.figure(figsize=(8, 6))
plt.plot(x_values, P, label=r'$P(x) \sim \mathcal{N}(0,1)$', linewidth=2)
plt.plot(x_values, Q, label=r'$Q(x) \sim \mathcal{N}(1,2^2)$', linewidth=2, linestyle='dashed')
plt.fill_between(x_values, P, Q, color='gray', alpha=0.3, label=r'Difference between $P$ and $Q$')

# Annotate KL divergences
plt.text(-4, 0.15, rf'$D_{{KL}}(P || Q) = {D_KL_PQ:.3f}$', fontsize=12, color='blue')
plt.text(-4, 0.12, rf'$D_{{KL}}(Q || P) = {D_KL_QP:.3f}$', fontsize=12, color='red')

# Labels and legend
plt.xlabel('$x$', fontsize=14)
plt.ylabel('Probability Density', fontsize=14)
plt.title('Illustration of KL Asymmetry Between Two Gaussians', fontsize=14)
plt.legend(fontsize=12)
plt.grid(True)

plt.show()


### Problems

- Compute the entropy of a Gaussian distribution. Plot the entropy as a function of variance.

- Using the MaxEnt approach, find the probability distribution with mean and variance equal to $\mu$ and $\sigma^2$, respectively. 

- Simulate a 1D random walk and compute its entropy by first computing the probability distribution $p_N(n)$.