# Information Theory  

## What is Information?  
In information theory, information is closely tied to the concept of *surprise*.  

- If an event is **unlikely**, learning that it happened carries **more information** because it is more surprising.  
- On the other hand, if an event is **certain** (i.e., it was guaranteed to happen), it carries **zero information** because it is not surprising at all.  

Mathematically, we measure the information content of an event $x$ using the formula:  
$$
h(x) = -\log p(x),
$$
where:  
- $p(x)$ is the probability of the event $x$,  
- $h(x)$ is the amount of information (or surprise) associated with $x$.  

The base of the logarithm sets the unit: base 2 gives bits, and the natural logarithm gives nats.
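
The formula can be tried out directly. The sketch below is a minimal illustration, assuming base-2 logarithms (bits); the function name `information_content` is made up for this example.

```python
import math

def information_content(p):
    """Return h(x) = -log2 p(x) in bits for an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # log2(1/p) equals -log2(p) and avoids returning -0.0 when p == 1.
    return math.log2(1.0 / p)

print(information_content(1.0))    # certain event: no surprise
print(information_content(0.5))    # fair coin flip
print(information_content(0.001))  # rare event: large surprise
```

Swapping `math.log2` for `math.log` would report the same quantity in nats.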

---

## Connection to Entropy  

The entropy of a discrete probability distribution represents the *expected information content*.  
It measures the **average amount of surprise** we can expect when observing outcomes from that distribution.  

Mathematically, entropy is defined as:  
$$
H(X) = \mathbb{E}[h(x)] = - \sum_{x} p(x) \log p(x),
$$
where:  
- $H(X)$ is the entropy,  
- $p(x)$ is the probability of outcome $x$,  
- the summation is taken over all possible outcomes $x$.  
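
As an illustrative sketch (not code from the original notes), the sum can be evaluated for a small discrete distribution. The function name `entropy` and the choice of bits as the unit are assumptions; terms with $p(x) = 0$ are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p log2 p in bits; zero-probability terms contribute 0."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log2(p) for p in probs if p > 0.0)

fair_coin = [0.5, 0.5]
fair_die = [1.0 / 6.0] * 6

print(entropy(fair_coin))  # one bit per flip
print(entropy(fair_die))   # log2(6) bits per roll
```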

---

## Entropy and Probability Distribution  

Entropy depends on how the probability is distributed across different outcomes:  

- If the probability distribution is **sharply peaked** (i.e., most of the probability is concentrated around a few values), the entropy is **low**.  
- If the probability distribution is more **spread out** (i.e., all outcomes are roughly equally likely), the entropy is **high**.  
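
To make the contrast concrete, the sketch below evaluates the entropy of a two-outcome distribution $(p, 1 - p)$ as it moves from evenly spread ($p = 0.5$) to sharply peaked ($p$ close to 1). The helper `binary_entropy` is a name chosen for this example, and bits are an assumed unit.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    return sum(-q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

# Entropy shrinks as the probability mass concentrates on one outcome.
for p in (0.5, 0.7, 0.9, 0.99, 1.0):
    print(f"p = {p:<4}  H = {binary_entropy(p):.4f} bits")
```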

---


![image5](image5.png)

---

## Maximum Entropy  
The maximum possible entropy occurs for a **uniform distribution**, where all outcomes are equally likely.  
This is because there is the **most uncertainty** in predicting the outcome when each event has the same probability.  


# Finding the Maximum Entropy Configuration

The maximum entropy configuration can be found by maximizing the entropy $H$ subject to the normalization constraint on the probabilities $p(x_i)$. Let's solve this step by step.

## Step 1: Define the Entropy Function
The entropy $H$ is defined as:
$$
H = - \sum_{i} p(x_i) \ln p(x_i),
$$
where $p(x_i)$ represents the probability of the $i$-th event.

## Step 2: Define the Constraint
We have the following constraint:
$$
\sum_{i} p(x_i) = 1.
$$
This means the total probability across all possible outcomes must equal 1.

## Step 3: Set Up the Lagrange Function
To maximize the entropy $H$ under the constraint $\sum_{i} p(x_i) = 1$, we use the method of **Lagrange multipliers**.  
We define the Lagrange function $\mathcal{L}$ as:
$$
\mathcal{L}(p(x_i), \lambda) = - \sum_{i} p(x_i) \ln p(x_i) + \lambda \left( \sum_{i} p(x_i) - 1 \right),
$$
where $ \lambda $ is the Lagrange multiplier that enforces the constraint $ \sum_{i} p(x_i) = 1 $.

## Step 4: Differentiate the Lagrange Function
Now, we want to maximize $ \mathcal{L} $. To do this, we take the partial derivative of $ \mathcal{L} $ with respect to each $ p(x_i) $ and set it equal to 0.

### Differentiate:
$$
\frac{\partial \mathcal{L}}{\partial p(x_i)} = - \left( \ln p(x_i) + 1 \right) + \lambda.
$$

Set this equal to 0:

$$
- \left( \ln p(x_i) + 1 \right) + \lambda = 0.
$$

This simplifies to:

$$
\ln p(x_i) = \lambda - 1.
$$

## Step 5: Solve for $ p(x_i) $
To solve for $p(x_i)$, take the exponential of both sides:
$$
p(x_i) = e^{\lambda - 1}.
$$
Since the right-hand side does not depend on $i$, this holds for all $i$, and we have:
$$
p(x_1) = p(x_2) = \dots = p(x_n) = e^{\lambda - 1}.
$$
In other words, all the probabilities are equal.

## Step 6: Use the Normalization Condition
From the normalization condition $\sum_{i} p(x_i) = 1$, and since all $n$ possible outcomes share the same probability:
$$
n \cdot p(x_i) = 1.
$$

Thus:

$$
p(x_i) = \frac{1}{n}.
$$
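
As a short worked step (added here as a sketch, not part of the original derivation), combining this with Step 5 fixes the multiplier, and substituting the result into the entropy gives its maximum value:

$$
e^{\lambda - 1} = \frac{1}{n} \quad \Rightarrow \quad \lambda = 1 - \ln n,
$$

$$
H_{\max} = - \sum_{i=1}^{n} \frac{1}{n} \ln \frac{1}{n} = \ln n.
$$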

## Step 7: Conclusion - Maximum Entropy Configuration
The maximum entropy configuration occurs when all probabilities are equal, i.e., when:
$$
p(x_i) = \frac{1}{n}.
$$
This corresponds to a **uniform distribution**.

## Final Interpretation:
When all outcomes are equally likely, we have the greatest uncertainty (or the highest entropy). This is why the uniform distribution maximizes entropy under the given constraint.
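
The result can be sanity-checked numerically. The sketch below is an illustration only: it draws random distributions over $n = 5$ outcomes (both the value of `n` and the sampling scheme are arbitrary choices made for this example) and confirms that none of them exceeds the entropy of the uniform distribution, which equals $\ln n$.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def random_distribution(n, rng):
    """Normalize n random positive weights into a probability vector."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

n = 5
rng = random.Random(0)  # fixed seed so the run is reproducible
uniform_entropy = entropy([1.0 / n] * n)
best_random = max(entropy(random_distribution(n, rng)) for _ in range(10_000))

print(f"uniform:     {uniform_entropy:.6f} nats (ln n = {math.log(n):.6f})")
print(f"best random: {best_random:.6f} nats")
```

A random search like this cannot prove the result, but it is a quick way to catch a sign or algebra slip in the derivation.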


# Differential Entropy and Maximum Entropy Configuration

### Differential Entropy
For continuous probability distributions, the **differential entropy** is given by:

$$
H[x] = - \int_{-\infty}^{\infty} p(x) \ln p(x) \, dx. \tag{1.104}
$$
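
The integral can be approximated numerically. The sketch below is a minimal example, assuming a Gaussian density and a simple midpoint rule on a truncated interval; the function names are invented for illustration. It compares the result with the known closed form for a Gaussian, $\frac{1}{2} \ln(2 \pi e \sigma^2)$.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Natural log of the Gaussian density, computed directly to avoid log(0)."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu)**2 / (2.0 * sigma**2)

def differential_entropy(log_pdf, lower, upper, steps=100_000):
    """Midpoint-rule approximation of -integral p(x) ln p(x) dx over [lower, upper]."""
    dx = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * dx
        log_p = log_pdf(x)
        total -= math.exp(log_p) * log_p * dx
    return total

mu, sigma = 0.0, 2.0
numeric = differential_entropy(
    lambda x: gaussian_log_pdf(x, mu, sigma), mu - 12.0 * sigma, mu + 12.0 * sigma
)
closed_form = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

print(f"numeric:     {numeric:.6f} nats")
print(f"closed form: {closed_form:.6f} nats")
```

Unlike the discrete entropy, the differential entropy can be negative: the closed form above drops below zero once $\sigma$ is small enough.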

### Maximum Entropy for Continuous Variables
In the case of discrete distributions, we saw that the maximum entropy configuration corresponds to an equal distribution of probabilities across the possible states of the variable.

Now, let us consider the maximum entropy configuration for a continuous variable.  
To ensure that the maximum is well-defined, it is necessary to constrain the **first and second moments** of $p(x)$, while also preserving the **normalization constraint**.

Thus, we maximize the differential entropy subject to the following constraints:

1. **Normalization Constraint:**
   $$
   \int_{-\infty}^{\infty} p(x) \, dx = 1 \tag{1.105}
   $$

2. **Constraint on the Mean:**
   $$
   \int_{-\infty}^{\infty} x p(x) \, dx = \mu \tag{1.106}
   $$

3. **Constraint on the Variance:**
   $$
   \int_{-\infty}^{\infty} (x - \mu)^2 p(x) \, dx = \sigma^2 \tag{1.107}
   $$

### Constrained Maximization with Lagrange Multipliers
To perform the constrained maximization, we introduce **Lagrange multipliers** and maximize the following functional with respect to $p(x)$:

$$
\mathcal{L} = - \int_{-\infty}^{\infty} p(x) \ln p(x) \, dx 
+ \lambda_1 \left( \int_{-\infty}^{\infty} p(x) \, dx - 1 \right)
+ \lambda_2 \left( \int_{-\infty}^{\infty} x p(x) \, dx - \mu \right)
+ \lambda_3 \left( \int_{-\infty}^{\infty} (x - \mu)^2 p(x) \, dx - \sigma^2 \right).
$$




## Step 1: Functional Derivative

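To maximize $\mathcal{L}$, we take the functional derivative $\frac{\delta \mathcal{L}}{\delta p(x)}$ and set it equal to 0. We differentiate each term in turn; the constants $-1$, $-\mu$, and $-\sigma^2$ inside the constraint terms do not depend on $p(x)$ and drop out.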

### 1. First Term:  
$$
\frac{\delta}{\delta p(x)} \left( - \int_{-\infty}^{\infty} p(x) \ln p(x) dx \right)
= -( \ln p(x) + 1 )
$$

### 2. Second Term (Normalization constraint):  
$$
\frac{\delta}{\delta p(x)} \left( \lambda_1 \left( \int_{-\infty}^{\infty} p(x) dx - 1 \right) \right)
= \lambda_1
$$

### 3. Third Term (Mean constraint):  
$$
\frac{\delta}{\delta p(x)} \left( \lambda_2 \int_{-\infty}^{\infty} x p(x) dx \right)
= \lambda_2 x
$$

### 4. Fourth Term (Variance constraint):  
The integrand is linear in $p(x)$, so the derivative is simply the weight $(x - \mu)^2$ scaled by $\lambda_3$:
$$
\frac{\delta}{\delta p(x)} \left( \lambda_3 \int_{-\infty}^{\infty} (x - \mu)^2 p(x) dx \right)
= \lambda_3 (x - \mu)^2
$$

---

## Step 2: Setting up the Functional Derivative

Summing the four functional derivatives, we get:

$$
\frac{\delta \mathcal{L}}{\delta p(x)} = - \left( \ln p(x) + 1 \right) + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2
$$

Set this equal to 0:

$$
- \left( \ln p(x) + 1 \right) + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0
$$

---

## Step 3: Solving for $p(x)$

Rearrange the equation:

$$
\ln p(x) = - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2
$$

Simplify:

$$
\ln p(x) = C + \lambda_2 x + \lambda_3 (x - \mu)^2
$$
where $C = -1 + \lambda_1$.

---

## Step 4: Exponentiate Both Sides

To isolate $p(x)$, exponentiate both sides:

$$
p(x) = \exp \left( C + \lambda_2 x + \lambda_3 (x - \mu)^2 \right)
$$

Since $ \exp(A + B) = \exp(A) \cdot \exp(B) $, we can write:

$$
p(x) = A \cdot \exp \left( \lambda_2 x + \lambda_3 (x - \mu)^2 \right)
$$
where $ A = e^C $.

---

## Step 5: Identifying the Solution as a Gaussian Distribution

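Notice that the exponent is a quadratic function of $x$, which is characteristic of a **Gaussian distribution**.

The multipliers are fixed by substituting this form back into the three constraints. For $p(x)$ to be normalizable we need $\lambda_3 < 0$, and completing the square shows that the density is centered at $\mu - \lambda_2 / (2 \lambda_3)$, so the mean constraint gives $\lambda_2 = 0$. The variance constraint then gives $\lambda_3 = -\frac{1}{2 \sigma^2}$, and normalization gives $A = \frac{1}{\sqrt{2 \pi \sigma^2}}$. The result is the standard form of the Gaussian: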
$$
p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{(x - \mu)^2}{2 \sigma^2} \right)
$$
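
As a numerical sanity check (a sketch under stated assumptions, not part of the original derivation), the code below plugs the multiplier values $\lambda_1 = 1 - \frac{1}{2} \ln(2 \pi \sigma^2)$, $\lambda_2 = 0$, $\lambda_3 = -\frac{1}{2 \sigma^2}$ into the exponential form $p(x) = \exp(-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2)$ and verifies the three constraints by midpoint-rule integration. The values of `mu` and `sigma` are arbitrary examples.

```python
import math

mu, sigma = 1.5, 0.8  # arbitrary example values

# Multiplier values that turn the exponential form into a Gaussian.
lam1 = 1.0 - 0.5 * math.log(2.0 * math.pi * sigma**2)
lam2 = 0.0
lam3 = -1.0 / (2.0 * sigma**2)

def p(x):
    """Stationary solution p(x) = exp(-1 + lam1 + lam2 x + lam3 (x - mu)^2)."""
    return math.exp(-1.0 + lam1 + lam2 * x + lam3 * (x - mu)**2)

def integrate(f, lower, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    dx = (upper - lower) / steps
    return sum(f(lower + (k + 0.5) * dx) for k in range(steps)) * dx

# Truncate the real line at 12 standard deviations; the tails are negligible.
lower, upper = mu - 12.0 * sigma, mu + 12.0 * sigma
total = integrate(p, lower, upper)
mean = integrate(lambda x: x * p(x), lower, upper)
variance = integrate(lambda x: (x - mu)**2 * p(x), lower, upper)

print(f"normalization: {total:.6f} (target 1)")
print(f"mean:          {mean:.6f} (target {mu})")
print(f"variance:      {variance:.6f} (target {sigma**2})")
```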

---

## Final Answer:
The maximum entropy distribution, subject to the constraints on mean and variance, is a **Gaussian distribution**:

$$
p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{(x - \mu)^2}{2 \sigma^2} \right)
$$
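
To illustrate the claim, the sketch below compares the differential entropy of a Gaussian with two other distributions matched to the same variance, using their standard closed-form entropies. The choice of comparison distributions (uniform and Laplace) is an assumption made for this example.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian: 0.5 * ln(2 pi e sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def uniform_entropy(sigma):
    """Uniform of width w has variance w^2 / 12 and entropy ln w."""
    width = sigma * math.sqrt(12.0)
    return math.log(width)

def laplace_entropy(sigma):
    """Laplace with scale b has variance 2 b^2 and entropy 1 + ln(2 b)."""
    scale = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * scale)

for sigma in (0.5, 1.0, 3.0):
    print(
        f"sigma = {sigma}: "
        f"gaussian = {gaussian_entropy(sigma):.4f}, "
        f"laplace = {laplace_entropy(sigma):.4f}, "
        f"uniform = {uniform_entropy(sigma):.4f} nats"
    )
```

For every value of `sigma` the Gaussian comes out on top, consistent with the derivation.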


The entropy of a probability distribution measures the uncertainty or randomness associated with that distribution. A higher entropy means greater uncertainty in predicting the outcome.

By maximizing entropy, you are essentially selecting the least biased distribution that satisfies the constraints you impose.