# [Information content](https://en.wikipedia.org/wiki/Information_content)


## Definition: Finite Random Variable

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.

* A **random variable** is a measurable function  $X : (\Omega, \mathcal{F}) \to (\mathcal{X}, \mathcal{P}(\mathcal{X}))$, where $(\mathcal{X}, \mathcal{P}(\mathcal{X}))$ is a measurable space.

* A **finite random variable** is a random variable whose codomain $\mathcal{X}$ is a finite set. We call $\mathcal{X}$ the *alphabet* (or *state space*) of $X$.

* The **support** of a finite random variable $X$ is defined as
$$
\operatorname{supp}(X)
= \{\, x \in \mathcal{X} : \mathbb{P}(X = x) > 0 \,\}.
$$

**Convention.**
* All random variables are assumed to be **finite** (their alphabets are finite sets).
* The alphabet of a random variable $X$ is always denoted by the corresponding calligraphic symbol $\mathcal{X}$.
* Unless explicitly stated otherwise, all random variables are defined on the same probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

## Probability Mass Function (pmf)

Let $X$ be a finite random variable. The **probability mass function (pmf)** of $X$ is defined as
$$
\begin{aligned}
p_X : \mathcal{X} &\to [0,1],\\
p_X(x) &= \mathbb{P}(X = x).
\end{aligned}
$$

**Convention.** When no ambiguity arises, we omit the subscript and write $p(x)$ instead of $p_X(x)$.
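
As a small illustration of the definitions so far, the sketch below stores a pmf as a NumPy array and reads off the support. The alphabet and the probabilities are hypothetical values chosen for this example only.

```python
import numpy as np

# Hypothetical finite random variable X with alphabet {0, 1, 2, 3}.
alphabet = np.array([0, 1, 2, 3])
p_X = np.array([0.5, 0.25, 0.25, 0.0])  # p_X(x) = P(X = x); the symbol 3 has probability zero

# A pmf takes values in [0, 1] and sums to 1 (compare with a tolerance, not with ==).
assert np.all((p_X >= 0) & (p_X <= 1))
assert np.isclose(p_X.sum(), 1.0)

# supp(X) = {x in the alphabet : p_X(x) > 0}
support = alphabet[p_X > 0]
print(support)  # [0 1 2]
```

The support can be strictly smaller than the alphabet, as it is here.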

## Information content of events

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. We define the information content function on events by

$$
\begin{aligned}
I_{\mathbb{P}}: \mathcal{F} &\to [0,\infty] \\
I_{\mathbb{P}}(A) &:= -\log \mathbb{P}(A).
\end{aligned}
$$

**Interpretation:**  
The information content of an event measures how *surprising* or *unpredictable* the event is, assuming the probability measure $\mathbb{P}$ is known:
- If $\mathbb{P}(A)$ is small, then $I_{\mathbb{P}}(A)$ is large (the event is very surprising).
- $\mathbb{P}(A) = 1 \;\Longleftrightarrow\; I_{\mathbb{P}}(A) = 0$ (the event occurs almost surely, so it carries no surprise).
- $\mathbb{P}(A)=0 \;\Longleftrightarrow\; I_{\mathbb{P}}(A)=\infty$ (the event has probability zero under $\mathbb{P}$), with the convention $-\log 0 = \infty$.

**Note:** 
* When no ambiguity arises, the dependence on the probability measure is omitted, and we write $I(A)$ instead of $I_{\mathbb{P}}(A)$.
* Since $\mathbb{P}$ is a probability measure, $\mathbb{P}(A)\in[0,1]$ and therefore $I_{\mathbb{P}}(A) \in [0,\infty]$.
* The logarithm base determines the units:
    - base $2$ → **bits**  
    - base $e$ → **nats**  
    - base $10$ → **hartleys**

    From here on we use base $2$, so $\log$ means $\log_2$ and information is measured in bits.
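
The definition and the unit conventions above can be sketched in a few lines. The helper name `information_content` and its `base` parameter are assumptions made for this illustration. It uses the identity $-\log p = \log(1/p)$ so that a certain event gives $0$ and an event of probability zero gives $\infty$.

```python
import numpy as np

def information_content(prob, base=2.0):
    """Return -log_base(prob): base 2 gives bits, e gives nats, 10 gives hartleys."""
    prob = np.asarray(prob, dtype=float)
    with np.errstate(divide="ignore"):  # 1/0 -> inf without a RuntimeWarning
        # -log(p) = log(1/p); dividing by log2(base) changes the unit
        return np.log2(1.0 / prob) / np.log2(base)

p_A = 0.125  # hypothetical event with probability 1/8
print(information_content(p_A))             # 3 bits
print(information_content(p_A, base=np.e))  # about 2.079 nats
print(information_content(p_A, base=10))    # about 0.903 hartleys
print(information_content(1.0))             # 0.0 for a certain event
print(information_content(0.0))             # inf for an event of probability zero
```

Changing the base only rescales the result by a constant factor, which is why the choice of unit does not affect any of the properties below.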


### Property: Independent events add their information

If $A$ and $B$ are independent events, then
$$
I(A \cap B) = I(A) + I(B).
$$


**Proof:** If $A$ and $B$ are independent, then $\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B)$, so
$$
\begin{align*}
I(A \cap B) &= -\log \mathbb{P}(A \cap B) = -\log\left(\mathbb{P}(A)\,\mathbb{P}(B)\right)\\
&= -\left(\log \mathbb{P}(A) + \log \mathbb{P}(B)\right)\\
&= I(A) + I(B).
\end{align*}
$$
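
A quick numerical check of this property, as a sketch on a hypothetical sample space: two fair six-sided dice, with $A$ = "the first die shows 6" and $B$ = "the second die is even". The helper names `prob` and `info` are introduced here for illustration.

```python
import numpy as np
from itertools import product

# Sample space: ordered pairs from two fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under the uniform measure; the event is a predicate on outcomes."""
    return sum(1 for w in omega if event(w)) / len(omega)

def info(p):
    """Information content in bits."""
    return -np.log2(p)

def A(w):
    return w[0] == 6          # first die shows 6

def B(w):
    return w[1] % 2 == 0      # second die is even

def A_and_B(w):
    return A(w) and B(w)

# Independence: P(A ∩ B) = P(A) P(B)
assert np.isclose(prob(A_and_B), prob(A) * prob(B))

print(info(prob(A_and_B)))            # about 3.585 bits
print(info(prob(A)) + info(prob(B)))  # about 3.585 bits
```

Both lines print $\log_2 12 \approx 3.585$, since $\mathbb{P}(A \cap B) = \tfrac{1}{6}\cdot\tfrac{1}{2} = \tfrac{1}{12}$.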


## Definition: Information Content of Random Vectors (Joint Random Variables)

Information answers the question: **How unexpected was this outcome, assuming we know the probabilistic model?**

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and
$X=(X_1,X_2,\dots,X_n)$ a discrete random vector defined on it.
Let $\mathbb{P}_{X}$ denote the probability measure (law) induced by $X$ and
$$
p_X(x) := \mathbb{P}_X(\{x\}) = \mathbb{P}(X = x).
$$

We define the **information content function associated with $X$** by

$$
\begin{aligned}
I_{\mathbb{P}_{X}}: \mathbb{R}^n &\to [0,\infty] \\
I_{\mathbb{P}_{X}}(x)
&:= -\log p_{X}(x) \\
&= -\log \mathbb{P}(X_1 = x_1,\dots,X_n = x_n).
\end{aligned}
$$

The associated **information-content random variable** is then the composition

$$
\begin{aligned}
I_{\mathbb{P}_{X}}(X) : \Omega &\to [0,\infty] \\
I_{\mathbb{P}_{X}}(X)(\omega)
&:= I_{\mathbb{P}_{X}}\big( X(\omega) \big) \\
&= -\log p_{X}\big( X(\omega) \big).
\end{aligned}
$$

**Interpretation:**  
- If an outcome $x$ has **small probability**, then $I_{\mathbb{P}_X}(x)$ is **large** → the outcome is *very surprising*.  
- If an outcome $x$ has **large probability**, then $I_{\mathbb{P}_X}(x)$ is **small** → the outcome is *barely surprising*.  
- In other words, **information content measures how surprising a specific realization is.**
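
To make the composition $\omega \mapsto I_{\mathbb{P}_X}(X(\omega))$ concrete, here is a minimal sketch on a hypothetical model: $\Omega$ is the set of faces of a fair die and $X(\omega)=1$ if the face is 6, else $0$. The names `omega_space`, `I_X` and `I_of_X` are assumptions made for this example.

```python
import numpy as np

# Hypothetical model: Omega = faces of a fair die, uniform probability measure.
omega_space = [1, 2, 3, 4, 5, 6]
P = {w: 1 / 6 for w in omega_space}

def X(w):
    """Random variable: indicator that the face is 6."""
    return 1 if w == 6 else 0

# Induced pmf p_X(x) = P(X = x), obtained by summing P over the preimage of x.
p_X = {}
for w in omega_space:
    p_X[X(w)] = p_X.get(X(w), 0.0) + P[w]

def I_X(x):
    """Information content function x -> -log2 p_X(x)."""
    return -np.log2(p_X[x])

def I_of_X(w):
    """Information-content random variable omega -> I_X(X(omega))."""
    return I_X(X(w))

for w in omega_space:
    print(w, X(w), round(float(I_of_X(w)), 3))
```

Faces 1 to 5 all map to the likely value $X=0$ and carry $\log_2(6/5) \approx 0.263$ bits, while face 6 maps to the rare value $X=1$ and carries $\log_2 6 \approx 2.585$ bits.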

**Note:** 
* We omit explicit reference to $\mathbb{P}_X$ whenever the context is clear. When no ambiguity arises, we simply write $I(x)$ in place of $I_{\mathbb{P}_X}(x)$. 
* $I_{\mathbb{P}_{X}}$ is a particular case of the information content defined on the probability space 
  $(\mathbb{R}^n,\, \mathcal{B}(\mathbb{R}^n),\, \mathbb{P}_{X})$ 
  (or equivalently on $(\mathcal{X}, \mathcal{P}(\mathcal{X}), \mathbb{P}_{X})$ in the purely discrete case, where $\mathcal{X}$ is the alphabet of $X$). 
  Here we identify the singleton $\{x\}$ with its element $x$ via
  $$
  I_{\mathbb{P}_X}(x) := I_{\mathbb{P}_X}(\{x\}).
  $$

### Property: Independent random vectors add their information

If $X$ and $Y$ are independent random vectors, then
$$
\begin{align*}
I_{\mathbb{P}_{(X,Y)}}(x,y) &= I_{\mathbb{P}_{X}}(x)+I_{\mathbb{P}_{Y}}(y),\\
I_{\mathbb{P}_{(X,Y)}}(X, Y) &= I_{\mathbb{P}_{X}}(X) + I_{\mathbb{P}_{Y}}(Y).
\end{align*}
$$


**Proof:** If $X$ and $Y$ are independent, then $\{X=x\}$ and $\{Y=y\}$ are independent events for all values $x$ of $X$ and $y$ of $Y$. Hence
$$
\begin{align*}
I_{\mathbb{P}_{(X,Y)}}(x,y) &=  -\log \mathbb{P}(X= x,Y = y),\\
&=  -\log \left(\mathbb{P}(X= x)\mathbb{P}(Y = y)\right),\\
&=-\log\left(\mathbb{P}(X= x)\right)-\log\left(\mathbb{P}(Y = y)\right),\\
&=I_{\mathbb{P}_{X}}(x)+I_{\mathbb{P}_{Y}}(y).
\end{align*}
$$
The identity for the information-content random variables follows pointwise. For any $\omega\in\Omega$,
$$
\begin{align*}
I_{\mathbb{P}_{(X,Y)}}(X, Y)(\omega)
&= I_{\mathbb{P}_{(X,Y)}}\big( X(\omega),Y(\omega) \big) \\
&=I_{\mathbb{P}_{X}}(X(\omega))+I_{\mathbb{P}_{Y}}(Y(\omega)),\\
&=I_{\mathbb{P}_{X}}(X)(\omega) + I_{\mathbb{P}_{Y}}(Y)(\omega),\\
&=(I_{\mathbb{P}_{X}}(X) + I_{\mathbb{P}_{Y}}(Y))(\omega).
\end{align*}
$$
Since $\omega\in\Omega$ is arbitrary, we conclude that $I_{\mathbb{P}_{(X,Y)}}(X, Y)=I_{\mathbb{P}_{X}}(X) + I_{\mathbb{P}_{Y}}(Y)$.

## Code:

In [1]:
import numpy as np
from collections import Counter

In [2]:
def inf_cont(P):
    return -np.log2(P)

### Arbitrary example



#### Define discrete probability vector

Consider the alphabets
$$
\begin{align*}
\mathcal{X}&=\{0,1,2\},\\
\mathcal{Y}&=\{0,1\}.
\end{align*}
$$

Define the joint probability mass function $p:\mathcal{X}\times\mathcal{Y}\to[0,1]$ by
$$
\begin{aligned}
p(0,0)&=0.05, & p(0,1)&=0.15,\\
p(1,0)&=0.20, & p(1,1)&=0.10,\\
p(2,0)&=0.30, & p(2,1)&=0.20.
\end{aligned}
$$




In [3]:
# Joint probability matrix p(x,y)
#    Y=0   Y=1
P_XY = np.array([
    [0.05, 0.15],   # X = 0
    [0.20, 0.10],   # X = 1
    [0.30, 0.20]    # X = 2
])

# Generate (i, j) pairs in the same order as flatten()
values = np.array([(i, j) for i in range(P_XY.shape[0])
                          for j in range(P_XY.shape[1])])

#### Check variable independence

In [4]:
P_X = P_XY.sum(axis=1)  # Marginal probabilities for X
P_Y = P_XY.sum(axis=0)  # Marginal probabilities for Y
P_XP_Y = np.outer(P_X, P_Y)  # Product of marginals
print(f"P_(X)P(Y)(x,y)=\n{P_XP_Y}")
print(f"P_(X,Y)(x,y)=\n{P_XY}")

P_(X)P(Y)(x,y)=
[[0.11  0.09 ]
 [0.165 0.135]
 [0.275 0.225]]
P_(X,Y)(x,y)=
[[0.05 0.15]
 [0.2  0.1 ]
 [0.3  0.2 ]]
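
The two matrices printed above differ (for example $0.11 \neq 0.05$ in the first cell), so $X$ and $Y$ are not independent in this example. Rather than comparing the printed arrays by eye, the check can be automated. The following is a sketch that uses `np.allclose` with its default tolerance, and the variable name `is_independent` is an assumption of this example.

```python
import numpy as np

# Joint pmf from the example above (rows: X = 0, 1, 2; columns: Y = 0, 1).
P_XY = np.array([[0.05, 0.15],
                 [0.20, 0.10],
                 [0.30, 0.20]])

P_X = P_XY.sum(axis=1)  # marginal of X
P_Y = P_XY.sum(axis=0)  # marginal of Y

# X and Y are independent iff p(x, y) = p(x) p(y) for every pair (x, y).
is_independent = np.allclose(P_XY, np.outer(P_X, P_Y))
print(is_independent)  # False
```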


#### Information content

$$I(x,y) = -\log_2\big(p(x,y)\big)$$

In [5]:
print(f"I(x,y)=\n{inf_cont(P_XY)}")

I(x,y)=
[[4.32192809 2.73696559]
 [2.32192809 3.32192809]
 [1.73696559 2.32192809]]


Notice that less likely (more surprising) outcomes have higher information content.

#### Information-content random variable

$$I(X,Y) = -\log_2\big(p(X,Y)\big)$$

In [6]:
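def I_sample(P,values,n):
    """Draw n i.i.d. samples from `values`, where values[k] occurs with probability P[k]."""
    idx = np.random.choice(len(P), size=n, p=P)
    return values[idx]

I_P = P_XY.flatten()                       # probabilities p(x,y), one per cell
I_values = inf_cont(P_XY).flatten()        # matching values I(x,y), in the same order
samples = I_sample(I_P,I_values,n=10_000)  # realizations of I(X,Y)
counts = Counter(samples)                  # cells with equal probability share one key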

I_pmf = {val.item():count/len(samples) for val,count in counts.items()}
print(f"PMF of I(X,Y) samples:\n{I_pmf}")
print(f"\n{inf_cont(P_XY)=}")
print(f"{P_XY=}")

PMF of I(X,Y) samples:
{1.7369655941662063: 0.2982, 2.736965594166206: 0.1479, 2.321928094887362: 0.398, 3.321928094887362: 0.102, 4.321928094887363: 0.0539}

inf_cont(P_XY)=array([[4.32192809, 2.73696559],
       [2.32192809, 3.32192809],
       [1.73696559, 2.32192809]])
P_XY=array([[0.05, 0.15],
       [0.2 , 0.1 ],
       [0.3 , 0.2 ]])
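
The empirical frequencies above can be compared with the exact pmf of $I(X,Y)$, which is obtained by adding up $p(x,y)$ over all cells with the same information content. The sketch below groups the values with `np.unique`. Rounding to 12 decimals before grouping is a choice made for this example, to guard against floating-point noise.

```python
import numpy as np

P_XY = np.array([[0.05, 0.15],
                 [0.20, 0.10],
                 [0.30, 0.20]])

p = P_XY.flatten()
info = -np.log2(p)

# Round before grouping so that cells with equal probability map to the same key
# even if their logarithms were to differ in the last floating-point digit.
keys, inverse = np.unique(np.round(info, 12), return_inverse=True)
exact_pmf = np.bincount(inverse, weights=p)  # total probability per distinct value

for k, q in zip(keys, exact_pmf):
    print(f"I = {k:.4f} bits with probability {q:.2f}")
```

The cells $(1,0)$ and $(2,1)$ both have probability $0.20$, so the value $\approx 2.3219$ bits has probability $0.40$. That explains why the sampled pmf has five keys rather than six and why the frequency of that key is close to $0.4$.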


### Independent example

#### Define discrete probability vector

Consider the alphabets
$$
\begin{align*}
\mathcal{X}&=\{0,1,2\},\\
\mathcal{Y}&=\{0,1\}.
\end{align*}
$$

Define the joint probability mass function $p:\mathcal{X}\times\mathcal{Y}\to[0,1]$ by
$$
\begin{aligned}
p(0,0)&=0.08, & p(0,1)&=0.12,\\
p(1,0)&=0.20, & p(1,1)&=0.30,\\
p(2,0)&=0.12, & p(2,1)&=0.18.
\end{aligned}
$$

In [7]:
# Joint probability matrix p(x,y)
#    Y=0   Y=1
P_XY = np.array([
    [0.08, 0.12],   # X = 0
    [0.20, 0.30],   # X = 1
    [0.12, 0.18]    # X = 2
])

# Check that it is a valid probability distribution
# (compare with a tolerance: floating-point sums need not equal 1 exactly)
if np.isclose(P_XY.sum(), 1.0):
    print(f"P_(X,Y)(x,y)=\n{P_XY}")
else:
    print("Error: The probabilities do not sum to 1.")

P_(X,Y)(x,y)=
[[0.08 0.12]
 [0.2  0.3 ]
 [0.12 0.18]]


#### Check variable independence

In [8]:
P_X = P_XY.sum(axis=1)  # Marginal probabilities for X
P_Y = P_XY.sum(axis=0)  # Marginal probabilities for Y
P_XP_Y = np.outer(P_X, P_Y)  # Product of marginals
print(f"P_(X)P(Y)(x,y)=\n{P_XP_Y}")
print(f"P_(X,Y)(x,y)=\n{P_XY}")

P_(X)P(Y)(x,y)=
[[0.08 0.12]
 [0.2  0.3 ]
 [0.12 0.18]]
P_(X,Y)(x,y)=
[[0.08 0.12]
 [0.2  0.3 ]
 [0.12 0.18]]


### Property: Independent random vectors add their information

If $X$ and $Y$ are independent random vectors, then
$$
I(x,y) = I(x)+I(y).
$$


In [9]:
print(f"I(x,y)=\n{inf_cont(P_XY)}")

I(x,y)=
[[3.64385619 3.05889369]
 [2.32192809 1.73696559]
 [3.05889369 2.47393119]]


In [10]:
print(f"I(x)+I(y)=\n{inf_cont(P_X)[:,None]+inf_cont(P_Y)[None,:]}")

I(x)+I(y)=
[[3.64385619 3.05889369]
 [2.32192809 1.73696559]
 [3.05889369 2.47393119]]
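
The two printed arrays agree cell by cell. As a sketch, the comparison can also be done programmatically and contrasted with the dependent joint pmf from the first example. The helper `additivity_holds` is a name introduced here for illustration.

```python
import numpy as np

def inf_cont(P):
    return -np.log2(P)

def additivity_holds(P_XY):
    """Check I(x, y) = I(x) + I(y) for every cell of a joint pmf matrix."""
    P_X = P_XY.sum(axis=1)
    P_Y = P_XY.sum(axis=0)
    lhs = inf_cont(P_XY)
    rhs = inf_cont(P_X)[:, None] + inf_cont(P_Y)[None, :]  # broadcast to the joint shape
    return np.allclose(lhs, rhs)

independent = np.array([[0.08, 0.12], [0.20, 0.30], [0.12, 0.18]])
dependent = np.array([[0.05, 0.15], [0.20, 0.10], [0.30, 0.20]])

print(additivity_holds(independent))  # True
print(additivity_holds(dependent))    # False
```

Additivity fails for the dependent pmf because there $p(x,y) \neq p(x)\,p(y)$, so the logarithm of the joint probability no longer splits into a sum.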
