# [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory))

## Definition: Entropy (Average Uncertainty)

Entropy measures how uncertain we are about the outcome of a random variable before we observe it. 

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and $X:\Omega \to \mathcal{X} \subset \mathbb{R}^n$ a discrete random vector with alphabet $\mathcal{X}$. Let $\mathbb{P}_X$ be the probability measure (law) induced by $X$.

Define the **(joint) entropy of $X$** as the expectation of its information content:

$$
\begin{aligned}
H_{\mathbb{P}}(X)
&:= \mathbb{E}\!\left[ I_{\mathbb{P}_X}(X) \right] \\
&= \sum_{x \in \mathcal{X}} \mathbb{P}_X(x)\, I_{\mathbb{P}_X}(x) \\
&= \sum_{x \in \mathcal{X}} \mathbb{P}_X(x)\,(-\log \mathbb{P}_X(x)) \\
&= -\sum_{x \in \mathcal{X}} \mathbb{P}_X(x)\log \mathbb{P}_X(x).
\end{aligned}
$$

**Note:**  
* When no ambiguity arises, the dependence on the probability measure is omitted, and we write  
  $$ H(X) \quad \text{instead of} \quad H_{\mathbb{P}}(X). $$
* When no ambiguity arises, we also write $p$ instead of $\mathbb{P}_X$, so that
  $$
  H(X) = -\sum_{x \in \mathcal{X}} p(x)\log p(x).
  $$

**Interpretation:**
* High entropy means the outcome of the random variable is (on average) highly unpredictable.
* Low entropy means the outcome of the random variable is (on average) highly predictable.
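
The definition can be made concrete with a minimal sketch, assuming the law of $X$ is given as a plain list of probabilities. The function name `entropy`, its `base` parameter and the three example distributions are illustrative choices, not part of the definition above.

```python
import math

def entropy(pmf, base=math.e):
    """Shannon entropy of a pmf given as a list of probabilities.

    Terms with p == 0 are skipped, which implements the convention 0 log 0 = 0.
    """
    if abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(-p * math.log(p, base) for p in pmf if p > 0)

# Hypothetical example distributions on a two-letter alphabet.
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
certain = [1.0, 0.0]

print(entropy(fair_coin, base=2))    # most unpredictable of the three
print(entropy(biased_coin, base=2))  # in between
print(entropy(certain, base=2))      # fully predictable outcome
```

With `base=2` the result is measured in bits; the default natural logarithm gives nats.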

### Properties

1. $H(X) \ge 0$
2. Let $X$ be a discrete random variable whose alphabet has size $n$. Then
    $$
    H(X) \le \log n,
    $$
    with equality if and only if $X$ is uniform.

3. $X$ is almost surely constant $\Longleftrightarrow$ $H(X)=0$

4. The entropy is concave as a function of the induced probability $\mathbb{P}_X$.

    More precisely, let $\mathcal{M}_1(\mathcal{X})$ denote the set of all probability
    measures on the measurable space $(\mathcal{X}, \mathcal{P}(\mathcal{X}))$.
    Define
    $$
    \begin{aligned}
    H : \mathcal{M}_1(\mathcal{X}) &\longrightarrow [0,\infty) \\
    H(\mu) &:= -\sum_{x \in \mathcal{X}} \mu(x)\,\log \mu(x),
    \end{aligned}
    $$
    with the convention $0 \log 0 := 0$.

    Then $H$ is a strictly concave functional on $\mathcal{M}_1(\mathcal{X})$,
    i.e., for all $\mu_1,\mu_2 \in \mathcal{M}_1(\mathcal{X})$ with $\mu_1 \neq \mu_2$ and all $\lambda \in (0,1)$,
    $$
    H\big( \lambda \mu_1 + (1-\lambda)\mu_2 \big)
    \;>\;
    \lambda H(\mu_1) + (1-\lambda)H(\mu_2).
    $$


    Notice that for a discrete random variable $X$ with law $\mathbb{P}_X$,
    $$
    H_{\mathbb{P}}(X)
    = -\sum_{x \in \mathcal{X}} \mathbb{P}_X(x)\log \mathbb{P}_X(x)
    = H(\mathbb{P}_X).
    $$

    $H(\mathbb{P}_X)$ is called the entropy of the distribution $\mathbb{P}_X$. Because its argument is a probability measure rather than a random vector, the notation cannot be confused with the entropy $H(X)$ of a random vector.

5. $H$ is permutation invariant. Let $X=(X_1,X_2,\dots,X_n)$ be a random vector, let $\pi_n$ be a permutation of the indices $\{1,\dots,n\}$, and denote
    the corresponding permuted vector by $\pi_n(X)=(X_{\pi_n(1)},X_{\pi_n(2)},\dots,X_{\pi_n(n)})$. Then we have
    $$H(\pi_n(X))=H(X).$$
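
Properties 1, 2 and 4 can be checked numerically on random examples. The sketch below is only a sanity check under stated assumptions (a finite alphabet of size `n`, natural logarithms, a small floating-point tolerance `tol`); it is not a proof, and the helper names `random_pmf` and `check_properties` are hypothetical.

```python
import math
import random

def entropy(pmf):
    """Entropy in nats of a pmf given as a list; zero terms are skipped (0 log 0 = 0)."""
    return sum(-p * math.log(p) for p in pmf if p > 0)

def random_pmf(n, rng):
    """Draw an arbitrary pmf on n points by normalizing positive random weights."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def check_properties(n=5, trials=1000, seed=0, tol=1e-12):
    """Return True if properties 1, 2 and 4 hold (up to tol) on random examples."""
    rng = random.Random(seed)
    for _ in range(trials):
        mu1, mu2 = random_pmf(n, rng), random_pmf(n, rng)
        lam = rng.uniform(0.01, 0.99)
        mix = [lam * a + (1 - lam) * b for a, b in zip(mu1, mu2)]
        if entropy(mu1) < -tol:                   # property 1: H >= 0
            return False
        if entropy(mu1) > math.log(n) + tol:      # property 2: H <= log n
            return False
        lower = lam * entropy(mu1) + (1 - lam) * entropy(mu2)
        if entropy(mix) < lower - tol:            # property 4: concavity
            return False
    # Equality case of property 2: the uniform pmf attains log n.
    return abs(entropy([1.0 / n] * n) - math.log(n)) < tol

print(check_properties())
```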

#### **Proof (1.):**

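This follows directly from the non-negativity of the information content: since $\mathbb{P}_X(x) \le 1$, we have $I_{\mathbb{P}_X}(x) = -\log \mathbb{P}_X(x) \ge 0$, so $H(X)$ is the expectation of a non-negative quantity.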

#### **Proof (2.):**

Let $\varphi(t)=-\log t$, which is convex on $(0,\infty)$, and take a convex combination
$\sum_{i=1}^n \alpha_i v_i$ with $v_i,\alpha_i>0$ and $\sum_{i=1}^n \alpha_i=1$.

By [Jensen’s inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality),
$$
\varphi\left(\sum_{i=1}^n \alpha_i v_i\right)
\le
\sum_{i=1}^n \alpha_i \varphi\left(v_i\right).
$$

Denote the alphabet of $X$ by $\mathcal{X}=\{x_1,x_2,\dots,x_n\}$, assuming $p(x_i)>0$ for every $i$. Applying the previous inequality with
$\alpha_i=p(x_i)$ and $v_i=\frac{1}{p(x_i)}$, for $i=1,2,\dots,n$, we obtain

$$
\begin{aligned}
-\log\left(\sum_{i=1}^n p(x_i)\frac{1}{p(x_i)}\right)
&\le
-\sum_{i=1}^n p(x_i) \log\left(\frac{1}{p(x_i)}\right),\\
-\log(n)
&\le
\sum_{i=1}^n p(x_i)\log(p(x_i)),\\
-\sum_{i=1}^n p(x_i)\log(p(x_i))
&\le
\log(n),\\
H(X)
&\le
\log(n).
\end{aligned}
$$

Equality holds in [Jensen’s inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality)  if and only if
$$
v_1 = v_2 = \cdots = v_n
\quad\text{for all $i$ with } \alpha_i > 0,
$$
that is,
$$
\frac{1}{p(x_1)} = \frac{1}{p(x_2)} = \cdots = \frac{1}{p(x_n)}.
$$

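Hence, when every outcome has positive probability, all probabilities are equal:
$$
p(x_1) = p(x_2) = \cdots = p(x_n) = \frac{1}{n},
$$
so $X$ is uniform on $\mathcal{X}$. If instead some outcomes have probability zero, the same argument applied to the $m<n$ outcomes with positive probability gives $H(X)\le \log m < \log n$, so the bound still holds, with strict inequality.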

Therefore,
$$
H(X) \le \log n,
$$
with equality if and only if $X$ is uniformly distributed on $\{x_1,\dots,x_n\}$.

#### **Proof (3.):**

If $X=c$ is constant, then
$$
H(X) = \sum_{x \in \mathcal{X}} p(x) (-\log p(x)) = 1(-\log 1) = 0.
$$
On the other hand, if $H(X)=0$ we get
$$
\sum_{x \in \mathcal{X}} p(x) (-\log p(x)) = 0.
$$
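Every term of this sum is non-negative, so each term must vanish, which forces $p(x)\in \{0,1\}$ for every $x$ (any $p(x)\in(0,1)$ would contribute a strictly positive term). Since the probabilities sum to $1$, exactly one $x$ has $p(x)=1$, and $X$ is almost surely constant.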


#### **Proof (4.):**

We first prove that the function $h(t)= -t\log(t)$ is strictly concave on its domain $t>0$. Differentiating twice, we get
$$
\begin{aligned}
h(t) &= -t\log(t),\\
h'(t)&= -\log(t)-1,\\
h''(t)&= -\frac{1}{t} < 0, \quad  t>0.
\end{aligned}
$$
Hence, for any $\lambda\in (0,1)$ and any $s,t>0$ with $s\neq t$, we have the inequality
$$h(\lambda t + (1-\lambda) s)> \lambda h(t) + (1-\lambda) h(s).$$

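We now use the concavity of $h$ to prove the concavity of $H$. Extend $h$ to $[0,\infty)$ by setting $h(0):=0$ (the convention $0\log 0 := 0$); the extended function is continuous and still strictly concave. Let $\lambda\in (0,1)$ and $\mu_1,\mu_2 \in \mathcal{M}_1(\mathcal{X})$ with $\mu_1\neq\mu_2$. For every $x$,
$$
h(\lambda \mu_1(x)+(1-\lambda) \mu_2(x)) \ge \lambda h(\mu_1(x))+(1-\lambda) h(\mu_2(x)),
$$
with strict inequality whenever $\mu_1(x)\neq\mu_2(x)$, which happens for at least one $x$. Summing over $x$, we obtain
$$
\begin{aligned}
H(\lambda \mu_1+(1-\lambda) \mu_2)&=\sum_{x\in\mathcal{X}}h(\lambda \mu_1(x)+(1-\lambda) \mu_2(x))\\
&> \sum_{x\in\mathcal{X}}\big(\lambda h(\mu_1(x))+(1-\lambda) h(\mu_2(x))\big)\\
&=\lambda \sum_{x\in\mathcal{X}}h(\mu_1(x))+(1-\lambda)  \sum_{x\in\mathcal{X}}h(\mu_2(x))\\
&=\lambda H(\mu_1)+(1-\lambda) H(\mu_2).
\end{aligned}
$$
So $H$ is strictly concave.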

#### **Proof (5.):**

Notice that the alphabet of $\pi_n(X)$ is $\pi_n(\mathcal{X})$, where $\mathcal{X}$ is the alphabet of $X$. Then we have
$$
\begin{align*}
H(\pi_n(X))&=- \sum_{y\in \pi_n(\mathcal{X})} \mathbb{P}_{\pi_n(X)}(y)\log\left( \mathbb{P}_{\pi_n(X)}(y)\right),\\
&=- \sum_{y\in \pi_n(\mathcal{X})} \mathbb{P}(\pi_n(X)=y)\log\left(\mathbb{P}(\pi_n(X)=y)\right),\\
&=- \sum_{y\in \pi_n(\mathcal{X})} \mathbb{P}\left(\cap_{i=1}^nX^{-1}_{\pi_n(i)}(y_i)\right)\log\left(\mathbb{P}\left(\cap_{i=1}^nX^{-1}_{\pi_n(i)}(y_i)\right)\right),\\
&=- \sum_{y\in \pi_n(\mathcal{X})} \mathbb{P}\left(\cap_{i=1}^nX^{-1}_{i}(y_{\pi_n^{-1}(i)})\right)\log\left(\mathbb{P}\left(\cap_{i=1}^nX^{-1}_{i}(y_{\pi_n^{-1}(i)})\right)\right),\\
&=- \sum_{y\in \pi_n(\mathcal{X})} \mathbb{P}(X=\pi^{-1}_n(y))\log\left(\mathbb{P}(X=\pi^{-1}_n(y))\right),\\
&=- \sum_{x\in \mathcal{X}} \mathbb{P}(X=x)\log\left(\mathbb{P}(X=x)\right),\\
&=- \sum_{x\in \mathcal{X}} \mathbb{P}_{X}(x)\log\left( \mathbb{P}_{X}(x)\right),\\
&= H(X)
\end{align*}
$$
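
The relabeling argument in this proof can be illustrated with a small sketch, assuming the law of $X$ is stored as a dictionary from outcome tuples to probabilities. The helper names `entropy` and `permute` and the example law are hypothetical, and indices are 0-based as usual in Python.

```python
import math

def entropy(pmf):
    """Entropy in nats of a pmf stored as a dict {outcome: probability}."""
    return sum(-p * math.log(p) for p in pmf.values() if p > 0)

def permute(pmf, perm):
    """Law of (X_{perm[0]}, ..., X_{perm[n-1]}) given the law of (X_0, ..., X_{n-1})."""
    return {tuple(x[i] for i in perm): p for x, p in pmf.items()}

# Hypothetical joint law of a random vector X = (X_0, X_1, X_2).
law_x = {(0, 0, 1): 0.5, (0, 1, 0): 0.25, (1, 1, 0): 0.125, (1, 0, 1): 0.125}
law_pi_x = permute(law_x, (2, 0, 1))

# Permuting coordinates only relabels outcomes, so the multiset of
# probabilities, and therefore the entropy, is unchanged.
print(entropy(law_x), entropy(law_pi_x))
```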

## Definition: Conditional Entropy (Average Remaining Uncertainty)

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and let 
$X:\Omega\to\mathcal{X}\subset\mathbb{R}^n$ and $Y:\Omega\to\mathcal{Y}\subset\mathbb{R}^m$ be discrete random vectors
with joint probability mass function
$$
p(x,y) := \mathbb{P}(X=x,\,Y=y),
$$
and conditional probability
$$
p(y \mid x) := \mathbb{P}(Y=y \mid X=x),
\qquad \text{for } p(x)>0,
$$
where
$$
p(x) := \mathbb{P}(X=x) = \sum_{y \in \mathcal{Y}} p(x,y).
$$

The **conditional entropy of $Y$ given $X$** is defined as the average of the
entropy of $Y$ conditioned on each value of $X$:
$$
\begin{aligned}
H_{\mathbb{P}}(Y \mid X)
&:= \sum_{x \in \mathcal{X}} \mathbb{P}(X=x)\, H_{\mathbb{P}}(Y \mid X=x) \\[4pt]
&= \sum_{x \in \mathcal{X}} p(x)
\left(
-\sum_{y \in \mathcal{Y}} p(y \mid x)\log p(y \mid x)
\right) \\[4pt]
&= - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}}
p(x,y)\, \log p(y \mid x).
\end{aligned}
$$

**Note:**
* When no ambiguity arises, the dependence on the probability measure is omitted and we simply write $H(Y \mid X)$
  instead of $H_{\mathbb{P}}(Y \mid X)$.
* When no ambiguity arises, we also write $p(x,y)$, $p(x)$ and $p(y \mid x)$ instead of
  $\mathbb{P}(X=x,Y=y)$, $\mathbb{P}(X=x)$ and $\mathbb{P}(Y=y \mid X=x)$ respectively.

**Interpretation:**
* $H_{\mathbb{P}}(Y \mid X)$ is the average remaining uncertainty in $Y$ after observing $X$.
* $H_{\mathbb{P}}(Y \mid X) = 0$ if and only if $Y$ is completely determined by $X$ (i.e. $Y = f(X)$ almost surely).
* $H_{\mathbb{P}}(Y \mid X) = H_{\mathbb{P}}(Y)$ if and only if $X$ and $Y$ are independent.
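
The last line of the definition translates directly into code. The following is a minimal sketch, assuming the joint pmf is stored as a dictionary keyed by pairs `(x, y)`; the function name `conditional_entropy` and the example (a fair bit $X$ that $Y$ copies with probability $0.75$) are illustrative assumptions.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(Y | X) in nats for a joint pmf given as a dict {(x, y): p(x, y)}."""
    # Marginal p(x) = sum_y p(x, y).
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p
    # H(Y | X) = -sum_{x,y} p(x, y) log p(y | x), with p(y | x) = p(x, y) / p(x).
    return sum(-p * math.log(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

# Hypothetical example: X is a fair bit, Y equals X with probability 0.75.
joint = {(0, 0): 0.375, (0, 1): 0.125, (1, 0): 0.125, (1, 1): 0.375}
print(conditional_entropy(joint))
```

In this example every conditional law $p(\cdot \mid x)$ is a $(0.75, 0.25)$ distribution, so $H(Y \mid X)$ equals the entropy of that two-point law.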


### Property: Conditional entropy as subspace entropy

Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and let 
$X:\Omega\to\mathcal{X}\subset\mathbb{R}^n$ and
$Y:\Omega\to\mathcal{Y}\subset\mathbb{R}^m$ be discrete random vectors.
Fix $x \in \mathcal{X}$ with $p(x)>0$ and define
$$
A := \{\omega \in \Omega : X(\omega)=x\}.
$$
Then
$$
H_{\mathbb{P}}(Y \mid X = x)
= H_{\mathbb{P}(\cdot \mid A)}(Y).
$$

That is, the conditional entropy of $Y$ given $X=x$ is the entropy of $Y$
with respect to the conditional probability space
$$
\big( A,\ \mathcal{F}\!\mid_A,\ \mathbb{P}(\,\cdot \mid A) \big).
$$

#### **Proof:**

By definition,
$$
\begin{aligned}
H_{\mathbb{P}}(Y \mid X = x)
&:= -\sum_{y \in \mathcal{Y}}
\mathbb{P}(Y=y \mid X=x)\,\log \mathbb{P}(Y=y \mid X=x).
\end{aligned}
$$

But for every $y \in \mathcal{Y}$,
$$
\mathbb{P}(Y=y \mid X=x)
= \mathbb{P}(Y=y \mid A)
= \mathbb{P}(\cdot \mid A)\big(Y^{-1}(\{y\})\big).
$$

Therefore,
$$
\begin{aligned}
H_{\mathbb{P}}(Y \mid X = x)
&= -\sum_{y \in \mathcal{Y}}
\mathbb{P}(\cdot \mid A)\big(Y=y\big)\,
\log \mathbb{P}(\cdot \mid A)\big(Y=y\big) \\
&= H_{\mathbb{P}(\cdot \mid A)}(Y),
\end{aligned}
$$
which is exactly the entropy of $Y$ in the probability space
$\big( A,\ \mathcal{F}\!\mid_A,\ \mathbb{P}(\,\cdot \mid A) \big)$.

### Property: Positivity


For all $x \in \mathcal{X}$ with $p(x)>0$,
$$
H_{\mathbb{P}}(Y \mid X = x) \ge 0,
\qquad
H_{\mathbb{P}}(Y \mid X) \ge 0.
$$

#### **Proof:**

By the previous property, $H_{\mathbb{P}}(Y \mid X=x)$ is the entropy of $Y$ in the conditional probability space $\big(\{X=x\},\ \mathcal{F}\!\mid_{\{X=x\}},\ \mathbb{P}(\,\cdot \mid X=x)\big)$. The non-negativity of entropy therefore gives, for every $x \in \mathcal{X}$ with $p(x)>0$,
$$
H_{\mathbb{P}}(Y \mid X=x) = -\sum_{y \in \mathcal{Y}} p(y \mid x)\log p(y \mid x)=H_{\mathbb{P}(\cdot \mid X=x)}(Y) \ge 0.
$$

Consequently, $H_{\mathbb{P}}(Y \mid X)$ is a sum of non-negative terms:
$$
H_{\mathbb{P}}(Y \mid X) = \sum_{x \in \mathcal{X}} \mathbb{P}(X=x)\, H_{\mathbb{P}}(Y \mid X=x)\geq 0.
$$

### Property: Functional dependency
Let $X,Y$ be discrete random vectors with finite alphabets. Then
$$
H(Y \mid X)=0
\quad\Longleftrightarrow\quad
\exists \, f \text{ such that } Y = f(X) \text{ almost surely.}
$$

#### **Proof:** 

Recall that
$$
H(Y\mid X=x) \ge 0 \quad \text{for all } x \text{ with } p(x)>0.
$$

Suppose first that $H(Y\mid X)=0$. Then
$$
0 = \sum_{x}p(x)\,H(Y\mid X=x)
$$
is a convex combination of nonnegative numbers. Therefore every term with positive weight must be zero:
$$
p(x)>0 \;\Longrightarrow\; H_{\mathbb{P}(\cdot\mid X=x)}(Y)= H(Y\mid X=x)=0.
$$

By property 3 of entropy, applied in the conditional probability space, for every $x$ with $p(x)>0$ the random vector $Y$ is almost surely constant on the set $\{X=x\}$, so there exists a value $f(x)$ such that
$$
\mathbb{P}(Y = f(x) \mid X=x) = 1.
$$

Then
$$
\mathbb{P}\big(Y = f(X)\big)
= \sum_{x} \mathbb{P}\big(X=x,\,Y=f(x)\big)
= \sum_{x} \mathbb{P}(X=x)\,\mathbb{P}\big(Y=f(x)\mid X=x\big)
= \sum_{x} \mathbb{P}(X=x)\cdot 1
= 1.
$$

So $Y = f(X)$ almost surely.

Conversely, suppose $Y=f(X)$ a.s. Then for any $x$ with $\mathbb{P}(X=x)>0$,
$$
\mathbb{P}(Y=f(x)\mid X=x) = 1,
$$
so $Y$ is almost surely constant on $\{X=x\}$ (the conditional distribution of $Y$ given $X=x$ is a point mass), hence
$$
H_{\mathbb{P}(\cdot\mid X=x)}(Y) = H(Y\mid X=x)=0 \quad \text{for all } x \text{ with } p(x)>0.
$$
Therefore
$$
H(Y\mid X) = \sum_{x} \mathbb{P}(X=x)\,H(Y\mid X=x) = 0.
$$
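
Both directions of the equivalence can be seen on a toy example. The sketch below assumes a joint pmf stored as a dictionary keyed by `(x, y)`; the helper `recover_function` and the two example laws are hypothetical and only illustrate the statement.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(Y | X) in nats for a joint pmf given as a dict {(x, y): p(x, y)}."""
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p
    return sum(-p * math.log(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

def recover_function(joint):
    """Return f as a dict if each x with p(x) > 0 has a single possible y, else None."""
    f = {}
    for (x, y), p in joint.items():
        if p > 0:
            if x in f and f[x] != y:
                return None  # two different values of y are possible for this x
            f[x] = y
    return f

# Hypothetical example: X uniform on {0, 1, 2, 3} and Y = X mod 2.
deterministic = {(x, x % 2): 0.25 for x in range(4)}
# Perturbed example: for x = 0, Y is no longer determined by X.
noisy = {(0, 0): 0.125, (0, 1): 0.125, (1, 1): 0.25, (2, 0): 0.25, (3, 1): 0.25}

print(conditional_entropy(deterministic), recover_function(deterministic))
print(conditional_entropy(noisy), recover_function(noisy))
```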


### Property: Conditioning cannot increase entropy
For discrete random vectors $X$ and $Y$ with finite alphabets,
$$
H(Y | X) \le H(Y),
$$
with equality if and only if $Y$ is independent of $X$.

**Proof:**

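Notice that $H(Y)=H(\mathbb{P}_Y)$ and $H(Y\mid X=x)=H(\mathbb{P}_{Y\mid X=x})$, and that, by the law of total probability, $\mathbb{P}_Y=\sum_{x}\mathbb{P}_X(x)\,\mathbb{P}_{Y\mid X=x}$, where all sums run over the $x$ with $\mathbb{P}_X(x)>0$. Using the concavity of $H$ (extended by induction from two-point to finite convex combinations), we obtain

$$
\begin{aligned}
H_{\mathbb{P}}(Y)&=H(\mathbb{P}_Y)=H\left(\sum_{x}\mathbb{P}_X(x)\,\mathbb{P}_{Y\mid X=x}\right)\\
&\ge \sum_{x}\mathbb{P}_X(x)\,H\left(\mathbb{P}_{Y\mid X=x}\right)\\
&= \sum_{x}\mathbb{P}_X(x)\,H(Y\mid X=x)\\
&=H(Y\mid X).
\end{aligned}
$$
Since the concavity is strict, equality holds only when all the conditional laws coincide, that is, when there is some $x^*\in\mathcal{X}$ such that $\mathbb{P}_{Y\mid X=x}=\mathbb{P}_{Y\mid X=x^*}$ for every $x$ with $\mathbb{P}_X(x)>0$. In that case, for every $y$ and every such $x$,
$$
\mathbb{P}(Y=y)=\sum_{x'}\mathbb{P}_X(x')\,\mathbb{P}_{Y\mid X=x'}(y)=\mathbb{P}_{Y\mid X=x^*}(y)\sum_{x'}\mathbb{P}_X(x')=\mathbb{P}_{Y\mid X=x^*}(y)=\mathbb{P}(Y=y\mid X=x),
$$
so $Y$ and $X$ are independent. Conversely, if $X$ and $Y$ are independent, then $\mathbb{P}_{Y\mid X=x}=\mathbb{P}_Y$ for every such $x$ and the inequality above is an equality.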
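
A quick numerical comparison illustrates both the inequality and its equality case. This is a sketch under the assumption that joint pmfs are dictionaries keyed by `(x, y)`; the two example laws (`dependent`, `independent`) are hypothetical and share the same uniform marginals.

```python
import math
from collections import defaultdict

def marginals(joint):
    """Return the marginal pmfs (p_x, p_y) of a joint pmf {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return dict(p_x), dict(p_y)

def entropy(pmf):
    """Entropy in nats of a pmf stored as a dict {outcome: probability}."""
    return sum(-p * math.log(p) for p in pmf.values() if p > 0)

def conditional_entropy(joint):
    """H(Y | X) = -sum p(x, y) log(p(x, y) / p(x))."""
    p_x, _ = marginals(joint)
    return sum(-p * math.log(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

# Hypothetical dependent pair: Y tends to copy X.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
# Independent pair with the same marginals: p(x, y) = p(x) p(y).
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

for name, joint in [("dependent", dependent), ("independent", independent)]:
    _, p_y = marginals(joint)
    print(name, "H(Y|X) =", conditional_entropy(joint), "H(Y) =", entropy(p_y))
```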

### Property: Alternative form

$$
H(Y|X) = H(X,Y) - H(X)
$$

**Proof:**

$$
\begin{aligned}
H(Y| X) 
&= \sum_{x} p(x)\left( - \sum_{y} p(y| x)\log p(y| x)\right)\\
&= -\sum_{x,y} p(y| x)p(x) \log p(y| x)\\
&= -\sum_{x,y} p(x,y)\log p(y| x)\\
&= -\sum_{x,y} p(x,y)\log \left(\frac{p(x,y)}{p(x)}\right)\\
&= -\sum_{x,y} p(x,y)\left(\log p(x,y) - \log p(x)\right)\\
&= -\sum_{x,y} p(x,y)\log p(x,y) +\sum_{x,y} p(x,y)\log p(x)\\
&= -\sum_{x,y} p(x,y)\log p(x,y) +\sum_{x} p(x)\log p(x)\\
&= H(X,Y) - H(X).
\end{aligned}
$$
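
The identity can be verified numerically by computing both sides independently. The sketch below assumes a joint pmf stored as a dictionary keyed by `(x, y)`; the example law and the variable names `h_y_given_x` and `rhs` are illustrative.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """Entropy in nats of a pmf stored as a dict {outcome: probability}."""
    return sum(-p * math.log(p) for p in pmf.values() if p > 0)

# Hypothetical joint pmf p(x, y).
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 0): 0.125, ("b", 1): 0.125}

# Marginal p(x) = sum_y p(x, y).
p_x = defaultdict(float)
for (x, _), p in joint.items():
    p_x[x] += p

# Left-hand side, from the definition: -sum p(x, y) log p(y | x).
h_y_given_x = sum(-p * math.log(p / p_x[x]) for (x, _), p in joint.items() if p > 0)
# Right-hand side: H(X, Y) - H(X).
rhs = entropy(joint) - entropy(p_x)

print(h_y_given_x, rhs)
```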

### Property: Basic inequality
For two discrete random vectors $X$ and $Y$,
$$
H(X,Y) \le H(X) + H(Y)
$$
with equality if and only if $X$ and $Y$ are independent.

**Proof:** 

From the previous identity we have
$$
H(X,Y) = H(X) + H(Y | X).
$$

Thus,
$$
H(X,Y) \le H(X) + H(Y)
\quad \Longleftrightarrow \quad
H(Y | X) \le H(Y).
$$

Conditioning cannot increase entropy, so the right-hand inequality holds, with equality if and only if $X$ and $Y$ are independent. This gives the result.

## Definition: Mutual Information
Mutual information measures how much two random vectors depend on each other. 

$$
I(Y;X) = H(Y) - H(Y|X)
$$
It quantifies the amount of uncertainty in one random vector that is removed by knowing the other random vector. 

### Properties: 
1. $I(Y;X)\geq 0$


2. $X$ and $Y$ are independent if and only if
    $$
    I(Y;X) = 0.
    $$
    Intuitively, two random vectors are independent when knowing one of them tells us nothing about the other.
3. Let $X,Y$ be discrete random vectors with finite alphabets. Then
$$
I(Y;X) = H(Y)
\quad\Longleftrightarrow\quad
\exists \, f \text{ such that } Y = f(X) \text{ almost surely.}
$$
4. Equivalent form and symmetry
$$I(Y;X) = H(X) + H(Y) - H(X,Y)=I(X;Y)$$
5. KL-divergence form  
$$I(Y;X) = \sum_{x,y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}$$

#### **Proof (1.) and (2.)**

Both statements follow directly from the "conditioning cannot increase entropy" property, including its equality case.

#### **Proof (3.)**

By the definition of mutual information, $I(Y;X)=H(Y)$ if and only if $H(Y\mid X)=0$, so the claim follows directly from the "functional dependency" property of conditional entropy.

#### **Proof (4.)**

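Substituting the alternative form $H(Y\mid X)=H(X,Y)-H(X)$ into the definition gives $I(Y;X)=H(X)+H(Y)-H(X,Y)$. This expression is symmetric in $X$ and $Y$, so $I(Y;X)=I(X;Y)$.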

#### **Proof (5.)**

Expanding the KL-divergence form, we have
$$
\begin{align*}
\sum_{x,y} p(x,y)\log \frac{p(x,y)}{p(x)p(y)}&= \sum_{x,y} p(x,y)\left(\log p(x,y) -\log p(x) - \log p(y) \right),\\
&= \sum_{x,y} p(x,y)\log p(x,y) - \sum_{x,y} p(x,y)\log p(x)-  \sum_{x,y} p(x,y)\log p(y),\\
&= - \sum_{x} p(x)\log p(x)-  \sum_{y} p(y)\log p(y) + \sum_{x,y} p(x,y)\log p(x,y) ,\\
&=H(X) + H(Y) - H(X,Y),\\
&= I(Y;X).
\end{align*}
$$
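
Properties 4 and 5 give two independent ways to compute mutual information, which makes them easy to cross-check. The sketch below assumes joint pmfs stored as dictionaries keyed by `(x, y)`; the function names `mi_entropy_form` and `mi_kl_form` and the three example laws are hypothetical.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """Entropy in nats of a pmf stored as a dict {outcome: probability}."""
    return sum(-p * math.log(p) for p in pmf.values() if p > 0)

def marginals(joint):
    """Return the marginal pmfs (p_x, p_y) of a joint pmf {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return p_x, p_y

def mi_entropy_form(joint):
    """Property 4: I = H(X) + H(Y) - H(X, Y)."""
    p_x, p_y = marginals(joint)
    return entropy(p_x) + entropy(p_y) - entropy(joint)

def mi_kl_form(joint):
    """Property 5: I = sum p(x, y) log(p(x, y) / (p(x) p(y)))."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
copy = {(0, 0): 0.5, (1, 1): 0.5}  # Y = X, so I(Y;X) should equal H(Y)

for name, joint in [("dependent", dependent), ("independent", independent), ("copy", copy)]:
    print(name, mi_entropy_form(joint), mi_kl_form(joint))
```

The three examples correspond to properties 1 to 3: positive information for the dependent pair, zero for the independent pair, and $H(Y)$ when $Y$ is a function of $X$.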

## Definition: Continuous case

If $X$ and $Y$ are continuous random vectors with joint density $p(x,y)$ and marginal densities $p(x)$ and $p(y)$, the sums become integrals:

$$
I(X;Y) = \int\int p(x,y)\log\left(\frac{p(x,y)}{p(x)p(y)}\right)dxdy
$$

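In the discrete case, $I(X;Y) \le \min\{H(X),H(Y)\}$, which is finite for finite alphabets. In the continuous case this bound no longer applies and $I(X;Y)$ can be arbitrarily large, but it is still non-negative.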
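
As a worked illustration (a standard example, not derived in these notes), assume $(X,Y)$ is bivariate Gaussian with zero means, unit variances and correlation coefficient $\rho \in (-1,1)$; the symbol $\rho$ is introduced only for this example. Evaluating the integral gives

$$
I(X;Y) = -\frac{1}{2}\log\left(1-\rho^2\right),
$$

which equals $0$ when $\rho = 0$ (independence) and grows without bound as $|\rho| \to 1$.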

## Definition: Cross-Entropy (Average Surprise Under a Model)

Let $\mathcal{M}_1(\mathcal{X})$ denote the set of all probability measures on the measurable space $(\mathcal{X},\,\mathcal{P}(\mathcal{X}))$.

We define the **cross-entropy** as the functional

$$
\begin{aligned}
H : \mathcal{M}_1(\mathcal{X})^2 &\longrightarrow [0,\infty] \\
H(p \Vert q) 
&:= \sum_{x \in \mathcal{X}} p(x)\,\big(-\log q(x)\big),
\end{aligned}
$$

with the convention $0\cdot\big(-\log q(x)\big) := 0$ whenever $p(x)=0$, even if $q(x)=0$.

**Remark:**
If there exists $x \in \mathcal{X}$ such that
$$
p(x) > 0 \quad \text{and} \quad q(x) = 0,
$$
then
$$
H(p \Vert q) = \infty.
$$
This reflects the fact that outcomes that actually occur under $p$ are assigned zero probability under the model $q$.

**Interpretation:**
* High cross-entropy $H(p\Vert q)$ means that outcomes generated by the true distribution $p$ are, on average, very hard to predict when using the model $q$.
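* Low cross-entropy $H(p\Vert q)$ means that outcomes generated by $p$ are, on average, easy to predict using $q$. For a fixed $p$, the cross-entropy is smallest when $q=p$, where it equals the entropy $H(p)$, so a value close to $H(p)$ indicates that $q$ is close to the true distribution.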
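
The definition, the convention for zero-probability outcomes and the remark about infinite cross-entropy can be sketched as follows, assuming $p$ and $q$ are lists of probabilities over the same ordered alphabet. The function name `cross_entropy` and the example models are hypothetical.

```python
import math

def cross_entropy(p, q):
    """H(p || q) in nats for pmfs given as equal-length lists over the same alphabet."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same alphabet")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue         # convention: outcomes with p(x) = 0 contribute nothing
        if q_x == 0:
            return math.inf  # an outcome possible under p is impossible under q
        total += -p_x * math.log(q_x)
    return total

p = [0.5, 0.25, 0.25]                # "true" distribution
good_model = [0.4, 0.3, 0.3]         # close to p
bad_model = [0.05, 0.05, 0.9]        # far from p
impossible_model = [0.5, 0.5, 0.0]   # assigns zero mass to a possible outcome

for name, q in [("p itself", p), ("good", good_model),
                ("bad", bad_model), ("impossible", impossible_model)]:
    print(name, cross_entropy(p, q))
```

Taking $q = p$ recovers the entropy $H(p)$, and the other models give larger values, in line with the interpretation above.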

