# Mutual Information 

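Mutual information generalizes [information content](https://en.wikipedia.org/wiki/Information_content): the information content of $X$ is recovered as the pointwise mutual information of $X$ with itself.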

## Alphabet

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and
$X : \Omega \to \mathbb{R}^n$ be a discrete random vector.
Define its alphabet by
$$
\text{supp}(\mathbb{P}_{X}) = \{x \in \mathbb{R}^n : \mathbb{P}(X = x) > 0\},
$$
where $\mathbb{P}_{X}$ is the probability measure induced by $X$, i.e. the probability measure obtained by the Carathéodory extension theorem applied to

$$
\mathbb{P}_{X}\!\left(\prod_{i=1}^{n} B_i\right)
= \mathbb{P}\!\left(\bigcap_{i=1}^{n} X_i^{-1}(B_i)\right),
$$

for any Borel set $\prod_{i=1}^{n} B_i \subset \mathbb{R}^n$.

**Note:** In measure theory, $\mathbb{P}_{X}$ is called the *distribution* or *law* of $X$.  
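
As a minimal sketch, assume the law of a discrete random vector is stored as a dictionary mapping points to probabilities. The names `law` and `alphabet` are illustrative and do not come from the text. The alphabet is then the set of points with strictly positive mass:

```python
# Hypothetical law of a discrete random vector in R^2, stored as {point: probability}.
law = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.25, (1, 1): 0.25}

def alphabet(law):
    """Return the support of the law: the points x with P(X = x) > 0."""
    return {x for x, p in law.items() if p > 0}

# The point (0, 1) has zero mass, so it is excluded from the alphabet.
print(alphabet(law))
```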

## Probability Mass Function (pmf)

Let
* $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space
* $X : \Omega \to \mathbb{R}^n$ be a discrete random vector with (finite or countable) alphabet $\mathcal{X} \subset \mathbb{R}^n$.

Define the probability measure (law) induced by $X$ as
$$
\begin{aligned}
\mathbb{P}_{X} : \mathcal{B}(\mathbb{R}^n) &\to [0,1],\\
\mathbb{P}_{X}(B)  &= \mathbb{P}\big(X^{-1}(B)\big).
\end{aligned}
$$

The **probability mass function (pmf)** of $X$ is then the function
$$
\begin{aligned}
p_X : \mathcal{X} &\to [0,1],\\
p_X(x) &= \mathbb{P}_{X}(\{x\}) = \mathbb{P}(X = x).
\end{aligned}
$$
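
The construction $p_X(x) = \mathbb{P}(X^{-1}(\{x\}))$ can be sketched on a small finite probability space. The example below assumes two fair dice with the uniform measure and takes $X$ to be their sum; the space, the variable, and the helper `pmf` are illustrative choices, not part of the definitions above.

```python
from fractions import Fraction
from itertools import product

# Finite probability space: two fair dice with the uniform measure (illustrative assumption).
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

def pmf(X, P):
    """p_X(x) = P(X^{-1}({x})): add the mass of every outcome that X maps to x."""
    p = {}
    for w, mass in P.items():
        p[X(w)] = p.get(X(w), 0) + mass
    return p

p_X = pmf(X, P)
print(p_X[7])  # mass of the event {X = 7}
```

Exact fractions are used so that the masses sum to exactly 1 without floating-point rounding.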

**Note:**  

The fundamental question of Information Theory is:

How much uncertainty is in $X$ before observing it, and how much information is gained after observing it?

This is not variance: it is a different notion of uncertainty, one that depends only on the probabilities of the outcomes (through their logarithms) and not on the numerical values taken by $X$.

## Definition: Pointwise Mutual Information (PMI)

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and
$X=(X_1,\dots,X_n)$ and $Y=(Y_1,\dots,Y_m)$ be two discrete random vectors with alphabets $\mathcal{X}$ and $\mathcal{Y}$.

We define the **pointwise mutual information** between $X$ and $Y$ as

$$
\begin{align*}
\operatorname{pmi}_{\mathbb{P}_{(X,Y)}} &: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\\
\operatorname{pmi}_{\mathbb{P}_{(X,Y)}}(x,y)
&:= \log\!\left( \frac{p_{(X,Y)}(x,y)}{p_X(x)\,p_Y(y)} \right),\\
&= \log\!\left( \frac{\mathbb{P}(X=x,Y=y)}{\mathbb{P}(X=x)\mathbb{P}(Y=y)} \right).
\end{align*}
$$

The associated **pointwise mutual-information random variable** is the composition
$$
\begin{align*}
I_{\mathbb{P}_{(X,Y)}}(X \Cap Y) : \Omega &\to \mathbb{R},\\
I_{\mathbb{P}_{(X,Y)}}(X \Cap Y)(\omega)
&:= \operatorname{pmi}_{\mathbb{P}_{(X,Y)}}\big( X(\omega),\,Y(\omega)\big).
\end{align*}
$$

**Note:** We omit explicit reference to $\mathbb{P}_{(X,Y)}$ whenever the context is clear. When no ambiguity arises, we simply write $\operatorname{pmi}$ and $I(X \Cap Y)$ in place of $\operatorname{pmi}_{\mathbb{P}_{(X,Y)}}$ and $I_{\mathbb{P}_{(X,Y)}}(X \Cap Y)$ respectively.
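
A minimal sketch of the pmi function for a joint pmf stored as a matrix (rows indexed by $x$, columns by $y$). It assumes every entry is strictly positive and uses base-2 logarithms, so values are in bits; the definition above leaves the base open. The function name `pmi` and the example matrix are illustrative.

```python
import numpy as np

def pmi(P_XY):
    """Pointwise mutual information (in bits) for a joint pmf matrix with positive entries."""
    P_XY = np.asarray(P_XY, dtype=float)
    P_X = P_XY.sum(axis=1, keepdims=True)  # marginal of X (row sums), shape (n, 1)
    P_Y = P_XY.sum(axis=0, keepdims=True)  # marginal of Y (column sums), shape (1, m)
    return np.log2(P_XY / (P_X * P_Y))

# Illustrative joint pmf: the diagonal pairs are more likely than under independence.
P_XY = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(pmi(P_XY))
```

Positive entries mark pairs that occur together more often than independence would predict, and negative entries mark pairs that occur together less often.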

## Property: PMI Symmetry
$$
\begin{align*}
\operatorname{pmi}_{\mathbb{P}_{(X,Y)}} (x,y)&= \operatorname{pmi}_{\mathbb{P}_{(Y,X)}} (y,x) \\
I_{\mathbb{P}_{(X,Y)}}(X \Cap Y) &= I_{\mathbb{P}_{(Y,X)}}(Y \Cap X)
\end{align*}
$$


**Proof:** 
$$
\begin{align*}
\operatorname{pmi}_{\mathbb{P}_{(X,Y)}} (x,y)&=\log\!\left( \frac{\mathbb{P}(X=x,Y=y)}{\mathbb{P}(X=x)\mathbb{P}(Y=y)} \right),\\
&=\log\!\left( \frac{\mathbb{P}(Y=y,X=x)}{\mathbb{P}(Y=y)\mathbb{P}(X=x)} \right),\\
&= \operatorname{pmi}_{\mathbb{P}_{(Y,X)}} (y,x) \\
I_{\mathbb{P}_{(X,Y)}}(X \Cap Y)(\omega) &= \operatorname{pmi}_{\mathbb{P}_{(X,Y)}}\big( X(\omega),\,Y(\omega)\big),\\
&= \operatorname{pmi}_{\mathbb{P}_{(Y,X)}}\big(Y(\omega), X(\omega)\,\big),\\
&=I_{\mathbb{P}_{(Y,X)}}(Y \Cap X)(\omega).
\end{align*}
$$
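
The symmetry can also be checked numerically. In the sketch below the joint pmf of $(Y,X)$ is taken to be the transpose of the joint pmf matrix of $(X,Y)$; the helper `pmi` and the base-2 logarithm are assumptions made for illustration.

```python
import numpy as np

def pmi(P):
    """PMI in bits; rows index the first vector, columns the second."""
    P = np.asarray(P, dtype=float)
    return np.log2(P / np.outer(P.sum(axis=1), P.sum(axis=0)))

P_XY = np.array([[0.05, 0.15],
                 [0.20, 0.10],
                 [0.30, 0.20]])
P_YX = P_XY.T  # joint pmf of (Y, X)

# pmi_(X,Y)(x, y) should equal pmi_(Y,X)(y, x) for every pair (x, y).
symmetric = np.allclose(pmi(P_XY), pmi(P_YX).T)
print(symmetric)
```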

## Property: PMI Independence
$$
\begin{align*}
X\perp Y &\iff \operatorname{pmi}_{\mathbb{P}_{(X,Y)}}(x,y)=0,\quad (x,y)\in\mathcal{(X, Y)} \\
X\perp Y &\iff I_{\mathbb{P}_{(X,Y)}}(X \Cap Y)=0.
\end{align*}
$$


**Proof:**
Let us denote the alphabet of $(X,Y)$ by $\mathcal{(X, Y)}$. Then we have
$$
\begin{align*}
X\perp Y &\iff \mathbb{P}(X=x,Y=y)=\mathbb{P}(X=x)\mathbb{P}(Y=y),\quad (x,y)\in\mathcal{(X, Y)}\\
&\iff \frac{\mathbb{P}(X=x,Y=y)}{\mathbb{P}(X=x)\mathbb{P}(Y=y)}=1,\quad (x,y)\in\mathcal{(X, Y)}\\
&\iff \log\left(\frac{\mathbb{P}(X=x,Y=y)}{\mathbb{P}(X=x)\mathbb{P}(Y=y)}\right)=0,\quad (x,y)\in\mathcal{(X, Y)}\\
&\iff \operatorname{pmi}_{\mathbb{P}_{(X,Y)}}(x,y)=0,\quad (x,y)\in\mathcal{(X, Y)} \\
&\iff I_{\mathbb{P}_{(X,Y)}}(X \Cap Y)=0.
\end{align*}
$$
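
A small numerical illustration of this equivalence, assuming a joint pmf built as an outer product of two marginals for the independent case and an arbitrary non-product matrix for the dependent case. All names and numbers are illustrative.

```python
import numpy as np

def pmi(P):
    """PMI in bits for a joint pmf matrix with positive entries."""
    P = np.asarray(P, dtype=float)
    return np.log2(P / np.outer(P.sum(axis=1), P.sum(axis=0)))

# Independent case: the joint pmf is the product of its marginals.
p_x = np.array([0.2, 0.5, 0.3])
p_y = np.array([0.4, 0.6])
P_indep = np.outer(p_x, p_y)

# Dependent case: the joint pmf does not factorize.
P_dep = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

print(np.allclose(pmi(P_indep), 0.0))  # pmi vanishes up to rounding
print(np.allclose(pmi(P_dep), 0.0))    # pmi is nonzero somewhere
```

The comparison uses `np.allclose` rather than exact equality because the marginals are recomputed in floating point.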

## Definition: Pointwise Information Content

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and
$X=(X_1,\dots,X_n)$ a discrete random vector.

We define the **pointwise information content of $X$** as

$$
\begin{align*}
I_{\mathbb{P}_{X}} &: \mathcal{X} \to \mathbb{R}_+\\
I_{\mathbb{P}_{X}}(x) &:= \operatorname{pmi}_{\mathbb{P}_{(X,X)}}(x,x),\\
&= \log\!\left( \frac{p_{(X,X)}(x,x)}{p_X(x)\,p_X(x)} \right),\\
&= \log\!\left( \frac{\mathbb{P}(X=x,X=x)}{\mathbb{P}(X=x)\mathbb{P}(X=x)} \right),\\
&= \log\!\left( \frac{\mathbb{P}(X=x)}{\mathbb{P}(X=x)\mathbb{P}(X=x)} \right),\\
&= -\log\!\left(\mathbb{P}(X=x) \right).
\end{align*}
$$

The associated **pointwise information-content random variable** is the composition

$$
\begin{align*}
I_{\mathbb{P}_{X}}(X) : \Omega &\to \mathbb{R}_+,\\
I_{\mathbb{P}_{X}}(X)(\omega)
&:= I_{\mathbb{P}_{X}}\big( X(\omega)\big).
\end{align*}
$$



**Interpretation:**  
- If an outcome $x$ has **small probability**, then $I_{\mathbb{P}_X}(x)$ is **large** → the outcome is *very surprising*.  
- If an outcome $x$ has **large probability**, then $I_{\mathbb{P}_X}(x)$ is **small** → the outcome is *barely surprising*.  
- In other words, **information content measures how surprising a specific realization is.**

**Note:** 
* We omit explicit reference to $\mathbb{P}_X$ whenever the context is clear. When no ambiguity arises, we simply write $I(x)$ in place of $I_{\mathbb{P}_X}(x)$. 
* $I_{\mathbb{P}_{X}}$ is a particular case of the information content defined on the probability space 
  $(\mathbb{R}^n,\, \mathcal{B}(\mathbb{R}^n),\, \mathbb{P}_{X})$ 
  (or equivalently on $(\mathcal{X}, \mathcal{P}(\mathcal{X}), \mathbb{P}_{X})$ in the purely discrete case, where $\mathcal{X}$ is the alphabet of $X$). 
  Here we identify the singleton $\{x\}$ with its element $x$ via
  $$
  I_{\mathbb{P}_X}(x) := I_{\mathbb{P}_X}(\{x\}).
  $$
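
To make the interpretation concrete, here is a minimal sketch that evaluates $-\log_2 p$ for a few probabilities. The base-2 logarithm (bits) and the sample probabilities are assumptions chosen for illustration.

```python
import numpy as np

def inf_cont(p):
    """Information content in bits, -log2(p), for probabilities p in (0, 1]."""
    return -np.log2(p)

probs = np.array([0.5, 0.25, 0.125])
bits = inf_cont(probs)
print(bits)  # halving the probability adds one bit of surprise
```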

### Property: Equivalent Form

$$
\begin{align*}
\operatorname{pmi}_{\mathbb{P}_{(X,Y)}}(x,y)&=I_{\mathbb{P}_{X}}(x)+I_{\mathbb{P}_{Y}}(y)-I_{\mathbb{P}_{(X,Y)}}(x,y),\\
I_{\mathbb{P}_{(X,Y)}}(X\Cap Y)&=I_{\mathbb{P}_{X}}(X)+I_{\mathbb{P}_{Y}}(Y)-I_{\mathbb{P}_{(X,Y)}}(X,Y).
\end{align*}
$$

**Proof:** For the pmi function we have
$$
\begin{align*}
\operatorname{pmi}_{\mathbb{P}_{(X,Y)}}(x,y)
&= \log\!\left( \frac{\mathbb{P}(X=x,Y=y)}{\mathbb{P}(X=x)\mathbb{P}(Y=y)} \right),\\
&= -\log\!\left(\mathbb{P}(X=x) \right)-\log\!\left( \mathbb{P}(Y=y) \right)+\log\!\left(\mathbb{P}(X=x,Y=y)\right),\\
&=I_{\mathbb{P}_{X}}(x)+I_{\mathbb{P}_{Y}}(y)-I_{\mathbb{P}_{(X,Y)}}(x,y).
\end{align*}
$$
and for the random variable we only need the result for the function together with the definitions:

$$
\begin{align*}
I_{\mathbb{P}_{(X,Y)}}(X \Cap Y)(\omega)
&:= \operatorname{pmi}_{\mathbb{P}_{(X,Y)}}\big( X(\omega),\,Y(\omega)\big),\\
&=I_{\mathbb{P}_{X}}(X(\omega))+I_{\mathbb{P}_{Y}}(Y(\omega))-I_{\mathbb{P}_{(X,Y)}}(X(\omega),\,Y(\omega)),\\
&=I_{\mathbb{P}_{X}}(X(\omega))+I_{\mathbb{P}_{Y}}(Y(\omega))-I_{\mathbb{P}_{(X,Y)}}((X,Y)(\omega)),\\
&=I_{\mathbb{P}_{X}}(X)(\omega)+I_{\mathbb{P}_{Y}}(Y)(\omega)-I_{\mathbb{P}_{(X,Y)}}(X,Y)(\omega),\\
&=(I_{\mathbb{P}_{X}}(X)+I_{\mathbb{P}_{Y}}(Y)-I_{\mathbb{P}_{(X,Y)}}(X,Y))(\omega).
\end{align*}
$$
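
The identity can be verified numerically on any joint pmf with positive entries. The sketch below assumes base-2 logarithms and an illustrative $3\times 2$ joint matrix; broadcasting adds $I(x)$ along rows and $I(y)$ along columns.

```python
import numpy as np

def inf_cont(P):
    """Information content in bits."""
    return -np.log2(P)

P_XY = np.array([[0.05, 0.15],
                 [0.20, 0.10],
                 [0.30, 0.20]])
P_X = P_XY.sum(axis=1)  # marginal of X
P_Y = P_XY.sum(axis=0)  # marginal of Y

# Left-hand side: pmi computed directly from its definition.
lhs = np.log2(P_XY / np.outer(P_X, P_Y))
# Right-hand side: I(x) + I(y) - I(x, y).
rhs = inf_cont(P_X)[:, None] + inf_cont(P_Y)[None, :] - inf_cont(P_XY)

print(np.allclose(lhs, rhs))
```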


### Property: Information contents of independent random vectors add

$$
\begin{align*}
X\perp Y &\iff I_{\mathbb{P}_{(X,Y)}}(x,y) = I_{\mathbb{P}_{X}}(x)+I_{\mathbb{P}_{Y}}(y),\quad (x,y)\in\mathcal{X}\times\mathcal{Y}\\
X\perp Y &\iff I_{\mathbb{P}_{(X,Y)}}(X, Y) = I_{\mathbb{P}_{X}}(X) + I_{\mathbb{P}_{Y}}(Y).
\end{align*}
$$


**Proof:** It is a direct consequence of the PMI independence property and the equivalent form above: $\operatorname{pmi}(x,y)=0$ if and only if $I(x,y)=I(x)+I(y)$.

## Code: Information Content

In [1]:
import numpy as np
from collections import Counter

In [2]:
def inf_cont(P):
    return -np.log2(P)

### Arbitrary example



#### Define discrete probability vector

Consider the alphabets
$$
\begin{align*}
\mathcal{X}&=\{0,1,2\},\\
\mathcal{Y}&=\{0,1\}.
\end{align*}
$$

Define the joint probability mass function $p:\mathcal{X}\times\mathcal{Y}\to[0,1]$ by
$$
\begin{aligned}
p(0,0)&=0.05, & p(0,1)&=0.15,\\
p(1,0)&=0.20, & p(1,1)&=0.10,\\
p(2,0)&=0.30, & p(2,1)&=0.20.
\end{aligned}
$$




In [3]:
# Joint probability matrix p(x,y)
#    Y=0   Y=1
P_XY = np.array([
    [0.05, 0.15],   # X = 0
    [0.20, 0.10],   # X = 1
    [0.30, 0.20]    # X = 2
])

# Generate (i, j) pairs in the same order as flatten()
values = np.array([(i, j) for i in range(P_XY.shape[0])
                          for j in range(P_XY.shape[1])])

#### Check variable independence

In [4]:
P_X = P_XY.sum(axis=1)  # Marginal probabilities for X
P_Y = P_XY.sum(axis=0)  # Marginal probabilities for Y
P_XP_Y = np.outer(P_X, P_Y)  # Product of marginals
print(f"P_(X)P(Y)(x,y)=\n{P_XP_Y}")
print(f"P_(X,Y)(x,y)=\n{P_XY}")

P_(X)P(Y)(x,y)=
[[0.11  0.09 ]
 [0.165 0.135]
 [0.275 0.225]]
P_(X,Y)(x,y)=
[[0.05 0.15]
 [0.2  0.1 ]
 [0.3  0.2 ]]
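
The two printed matrices differ, so $X$ and $Y$ are not independent in this example. A programmatic version of the same comparison could look like the sketch below; the variable name `is_independent` is illustrative, and `np.allclose` is used instead of exact equality to tolerate floating-point rounding.

```python
import numpy as np

P_XY = np.array([[0.05, 0.15],
                 [0.20, 0.10],
                 [0.30, 0.20]])
P_X = P_XY.sum(axis=1)  # marginal of X
P_Y = P_XY.sum(axis=0)  # marginal of Y

# X and Y are independent exactly when the joint equals the product of marginals.
is_independent = np.allclose(P_XY, np.outer(P_X, P_Y))
print(is_independent)
```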


#### Information content

$$I(x,y) = -\log_2\big(p(x,y)\big)$$

In [5]:
print(f"I(x,y)=\n{inf_cont(P_XY)}")

I(x,y)=
[[4.32192809 2.73696559]
 [2.32192809 3.32192809]
 [1.73696559 2.32192809]]


Notice that less likely (less predictable) outcomes have higher information content.

#### Information content random vector

$$I(X,Y) = -\log_2\big(p(X,Y)\big)$$

In [6]:
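def I_sample(P, values, n):
    """Draw n samples from `values`, where values[k] has probability P[k]."""
    idx = np.random.choice(len(P), size=n, p=P)
    return values[idx]

I_P = P_XY.flatten()                 # probabilities of the six (x, y) pairs
I_values = inf_cont(P_XY).flatten()  # information content of each pair, same order
samples = I_sample(I_P, I_values, n=10_000)
counts = Counter(samples)

# Empirical pmf of I(X,Y); pairs with equal probability share a single value
I_pmf = {val.item(): count / len(samples) for val, count in counts.items()}
print(f"PMF of I(X,Y) samples:\n{I_pmf}")
print(f"\n{inf_cont(P_XY)=}")
print(f"{P_XY=}")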

PMF of I(X,Y) samples:
{3.321928094887362: 0.0964, 2.736965594166206: 0.1549, 1.7369655941662063: 0.2956, 2.321928094887362: 0.4038, 4.321928094887363: 0.0493}

inf_cont(P_XY)=array([[4.32192809, 2.73696559],
       [2.32192809, 3.32192809],
       [1.73696559, 2.32192809]])
P_XY=array([[0.05, 0.15],
       [0.2 , 0.1 ],
       [0.3 , 0.2 ]])
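
The empirical frequencies can be compared with the exact pmf of $I(X,Y)$, which is obtained by adding the probabilities of all pairs that share the same information content. The sketch below is one way to do this with `np.unique` and `np.bincount`; the variable names are illustrative.

```python
import numpy as np

P_XY = np.array([[0.05, 0.15],
                 [0.20, 0.10],
                 [0.30, 0.20]])
p = P_XY.flatten()
info = -np.log2(p)  # information content of each (x, y) pair, in bits

# Group pairs with the same information content and add their probabilities.
vals, inverse = np.unique(info, return_inverse=True)
exact_pmf = np.bincount(inverse, weights=p)

for v, q in zip(vals, exact_pmf):
    print(f"I = {v:.4f} bits with probability {q:.2f}")
```

The pairs $(1,0)$ and $(2,1)$ both have probability $0.20$, so they merge into a single value of $I(X,Y)$ with probability $0.40$, consistent with the sampled frequency shown in the output.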


### Independent example

#### Define discrete probability vector

Consider the alphabets
$$
\begin{align*}
\mathcal{X}&=\{0,1,2\},\\
\mathcal{Y}&=\{0,1\}.
\end{align*}
$$

Define the joint probability mass function $p:\mathcal{X}\times\mathcal{Y}\to[0,1]$ by
$$
\begin{aligned}
p(0,0)&=0.08, & p(0,1)&=0.12,\\
p(1,0)&=0.20, & p(1,1)&=0.30,\\
p(2,0)&=0.12, & p(2,1)&=0.18.
\end{aligned}
$$

In [7]:
# Joint probability matrix p(x,y)
#    Y=0   Y=1
P_XY = np.array([
    [0.08, 0.12],   # X = 0
    [0.20, 0.30],   # X = 1
    [0.12, 0.18]    # X = 2
])

# Check if it's a valid probability distribution
if np.isclose(P_XY.sum(), 1.0):  # tolerate floating-point rounding in the sum
    print(f"P_(X,Y)(x,y)=\n{P_XY}")
else:
    print("Error: The probabilities do not sum to 1.")

P_(X,Y)(x,y)=
[[0.08 0.12]
 [0.2  0.3 ]
 [0.12 0.18]]


#### Check variable independence

In [8]:
P_X = P_XY.sum(axis=1)  # Marginal probabilities for X
P_Y = P_XY.sum(axis=0)  # Marginal probabilities for Y
P_XP_Y = np.outer(P_X, P_Y)  # Product of marginals
print(f"P_(X)P(Y)(x,y)=\n{P_XP_Y}")
print(f"P_(X,Y)(x,y)=\n{P_XY}")

P_(X)P(Y)(x,y)=
[[0.08 0.12]
 [0.2  0.3 ]
 [0.12 0.18]]
P_(X,Y)(x,y)=
[[0.08 0.12]
 [0.2  0.3 ]
 [0.12 0.18]]


### Property: Information contents of independent random vectors add

If $X$ and $Y$ are independent random vectors, then
$$
\begin{align*}
I(x,y) &= I(x)+I(y).
\end{align*}
$$


In [9]:
print(f"I(x,y)=\n{inf_cont(P_XY)}")

I(x,y)=
[[3.64385619 3.05889369]
 [2.32192809 1.73696559]
 [3.05889369 2.47393119]]


In [10]:
print(f"I(x)+I(y)=\n{inf_cont(P_X)[:,None]+inf_cont(P_Y)[None,:]}")

I(x)+I(y)=
[[3.64385619 3.05889369]
 [2.32192809 1.73696559]
 [3.05889369 2.47393119]]
