# Information Measure

Created mainly from Chapter 6 in the book: 

    Raymond W. Yeung (auth.) - A First Course in Information Theory (2002, Springer)

and my own thoughts

## Definition: Finite Random Variable

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space.

* A **random variable** is a measurable function  $X : (\Omega, \mathcal{F}) \to (\mathcal{X}, \mathcal{P}(\mathcal{X}))$, where $(\mathcal{X}, \mathcal{P}(\mathcal{X}))$ is a measurable space.

* A **finite random variable** is a random variable whose alphabet $\mathcal{X}$ is a finite set. We call $\mathcal{X}$ the *alphabet* (or *state space*) of $X$.

* The **support** of a finite random variable $X$ is defined as
$$
\operatorname{supp}(X)
= \{\, x \in \mathcal{X} : \mathbb{P}(X = x) > 0 \,\}.
$$

**Convention.**
* All random variables are assumed to be **finite** (their alphabets are finite sets).
* The alphabet of a random variable $X$ is always denoted by the corresponding calligraphic symbol $\mathcal{X}$.
* Unless explicitly stated otherwise, all random variables are defined on the same probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

## Definition: Probability Mass Function (pmf)

Let $X$ be a finite random variable. The **probability mass function (pmf)** of $X$ is defined as
$$
\begin{aligned}
p_X : \mathcal{X} &\to [0,1],\\
p_X(x) &= \mathbb{P}(X = x).
\end{aligned}
$$

**Convention.** When no ambiguity arises, we omit the subscript and write $p(x)$ instead of $p_X(x)$.
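As a minimal sketch of the two definitions above, the following computes a pmf and its support on a small probability space. The names `pmf`, `support`, and the two-coin example are illustrative assumptions, not part of the source text.

```python
def pmf(prob, X, alphabet):
    """Push a probability measure on a finite Omega forward through X."""
    p = {x: 0.0 for x in alphabet}          # every symbol of the alphabet gets a value
    for omega, mass in prob.items():
        p[X(omega)] += mass
    return p

def support(p):
    """Symbols of the alphabet that occur with positive probability."""
    return {x for x, mass in p.items() if mass > 0}

# Omega = two fair coin flips; X counts the heads.
prob = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
# The alphabet is deliberately larger than the support (3 never occurs).
p_X = pmf(prob, lambda w: w[0] + w[1], alphabet={0, 1, 2, 3})
supp_X = support(p_X)
```

Here the alphabet is $\{0,1,2,3\}$ while the support is only $\{0,1,2\}$, which is the distinction the definition of $\operatorname{supp}(X)$ draws.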

## Definition: Field Generated by Sets

Let $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ be sets. The **field** $\mathcal{F}_n$ generated by these sets is the collection of all sets that can be obtained by applying any finite sequence of the usual set operations (union, intersection, complement, and difference) to  $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$.

## Definition: Atoms of the Field

The **atoms** of $\mathcal{F}_n$ are the sets of the form  
$$
\bigcap_{i=1}^{n} Y_i,
$$
where each $Y_i$ is either $\tilde{X}_i$ or $\tilde{X}_i^{\,c}$ (the complement of $\tilde{X}_i$).

There are $2^n$ atoms and $2^{2^n}$ sets in $\mathcal{F}_n$. All atoms in $\mathcal{F}_n$ are disjoint, and every set in $\mathcal{F}_n$ can be expressed uniquely as a union of some subset of these atoms. Unless otherwise stated, we assume that the sets $\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n$ intersect generically, meaning that all atoms of $\mathcal{F}_n$ are nonempty.
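The atoms can be enumerated directly for concrete sets. The sketch below assumes three hypothetical sets inside a universe of eight points, chosen so that they intersect generically; the function name `atoms` is an illustrative choice.

```python
from itertools import product

def atoms(sets, universe):
    """Map each sign pattern to its atom.

    pattern[i] is True when Y_i = X_i and False when Y_i is the complement of X_i.
    """
    result = {}
    for pattern in product([True, False], repeat=len(sets)):
        atom = set(universe)
        for keep, s in zip(pattern, sets):
            atom &= s if keep else (universe - s)
        result[pattern] = frozenset(atom)
    return result

universe = set(range(8))
# Hypothetical sets: X_k contains the points whose k-th binary digit is 1.
sets = [{1, 3, 5, 7}, {2, 3, 6, 7}, {4, 5, 6, 7}]
A = atoms(sets, universe)

n_atoms = len(A)                                   # 2**3 atoms
all_nonempty = all(len(a) > 0 for a in A.values()) # generic position
covers = set().union(*A.values()) == universe      # atoms partition the universe
```

Because the atoms are pairwise disjoint and cover the universe, any set of the field is the union of the atoms it contains, which is the uniqueness statement above.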

## Definition: Signed Measure

A real-valued function $\mu$ defined on $\mathcal{F}_n$ is called a **signed measure**
if it is **set-additive**; that is, for any disjoint sets $A$ and $B$ in $\mathcal{F}_n$,
$$
\mu(A \cup B) = \mu(A) + \mu(B).
$$

**Note** A signed measure $\mu$ on $\mathcal{F}_n$ is completely specified by its values on the atoms of $\mathcal{F}_n$. The values of $\mu$ on the other sets in $\mathcal{F}_n$ can be obtained via set-additivity.
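A small illustration of the note, assuming arbitrary (hypothetical) real values on the four atoms of $\mathcal{F}_2$: once the atoms are assigned values, every other set gets its value by summation, and negative values are allowed.

```python
# Hypothetical values on the four atoms of F_2, keyed by (in X1, in X2).
atom_value = {
    (True, True): -0.5,
    (True, False): 1.0,
    (False, True): 2.0,
    (False, False): 0.25,
}

def mu(region):
    """Signed measure of a set of the field, given as a collection of atoms."""
    return sum(atom_value[a] for a in region)

X1 = {(True, True), (True, False)}
X2 = {(True, True), (False, True)}
only_X1 = X1 - X2

additive_on_disjoint = mu(only_X1 | X2) == mu(only_X1) + mu(X2)
# For overlapping sets the intersection must be subtracted once.
two_set_rule = mu(X1 | X2) == mu(X1) + mu(X2) - mu(X1 & X2)
```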

## Theorem: Inclusion–Exclusion for a Set-Additive Function
Let $\mu$ be a signed measure. Then for any sets $A_1,\ldots,A_n$ and $B$,
$$
\begin{align*}
\mu\!\left( \bigcap_{k=1}^{n} A_k - B \right)
&=\sum_{1 \le i \le n} \mu(A_i - B)-\sum_{1 \le i < j \le n} \mu(A_i \cup A_j - B)
+\cdots\;
+(-1)^{\,n+1}\mu(A_1 \cup \cdots \cup A_n - B),\\
&=\sum_{m=1}^n\;\sum_{S\subseteq\mathcal{N}_n,\,|S|=m}
(-1)^{m+1}
\mu\!\left(\bigcup_{i\in S}A_i - B\right),\\
&=\sum_{S\subseteq \mathcal{N}_n,\,S\neq\varnothing}
(-1)^{|S|+1}
\mu\!\left(\bigcup_{i\in S}A_i - B\right). \tag{1}
\end{align*}
$$
where $\mathcal{N}_n=\{1,2,\dots,n\}$.

#### **Proof:** We use induction on $n$.

##### Base case

For $n=1$:
$$
\mu(A_1 - B) = \mu(A_1 - B),
$$
so the statement is trivially true.

Assume the formula holds for some $n \ge 1$. We prove it for $n+1$.

##### Induction step

1. Rewrite the $(n+1)$-fold intersection:
    $$
    \mu\!\left( \bigcap_{k=1}^{n+1} A_k - B \right)
    =
    \mu\!\left( \left( \bigcap_{k=1}^{n} A_k \right) \cap A_{n+1} - B \right).
    \tag{2}
    $$

2. Apply the set-additivity identity:

    Using
    $$
    \mu(E \cap F - B)
    =
    \mu(E - B) + \mu(F - B) - \mu(E \cup F - B),
    \tag{3}
    $$
    with $E=\bigcap_{k=1}^{n} A_k$ and $F=A_{n+1}$, we obtain
    $$
    \mu\!\left( \bigcap_{k=1}^{n+1} A_k - B \right)
    =
    \mu\!\left( \bigcap_{k=1}^n A_k - B \right)
    + \mu(A_{n+1} - B)
    -
    \mu\!\left( \left( \bigcap_{k=1}^n A_k \right)\cup A_{n+1} - B \right).
    \tag{4}
    $$

3. Distribute the union:

    The following identity is used on the last term in (4):

    $$
    \left( \bigcap_{k=1}^n A_k \right)\cup A_{n+1}
    =
    \bigcap_{k=1}^{n} (A_k \cup A_{n+1}).
    $$

    Thus:

    $$
    \mu\!\left( \bigcap_{k=1}^{n+1} A_k - B \right)
    =
    \mu\!\left( \bigcap_{k=1}^n A_k - B \right)
    + \mu(A_{n+1} - B)
    -\mu\!\left( \bigcap_{k=1}^{n} (A_k \cup A_{n+1}) - B \right)
    .\tag{5}
    $$

4. Apply the induction hypothesis twice:

    - First to  
    $\displaystyle \mu\!\left( \bigcap_{k=1}^{n} A_k - B \right)$  
    - Then to  
    $\displaystyle \mu\!\left( \bigcap_{k=1}^{n} (A_k\cup A_{n+1}) - B \right)$

    We substitute both expansions into (5).

5. Full derivation in aligned form:

    $$
    \begin{aligned}
    \mu\!\left( \bigcap_{k=1}^{n+1} A_k - B \right)
    &=
    \Bigg[
    \sum_{1\le i\le n} \mu(A_i - B)
    -
    \sum_{1\le i<j\le n} \mu(A_i\cup A_j - B)
    +\cdots
    +(-1)^{n+1}\mu(A_1\cup\cdots\cup A_n - B)
    \Bigg]
    \\[4pt]
    &\qquad
    + \mu(A_{n+1} - B)
    \\[6pt]
    &\qquad
    -\Bigg[
    \sum_{1\le i\le n} \mu(A_i\cup A_{n+1} - B)
    -
    \sum_{1\le i<j\le n} \mu(A_i\cup A_j\cup A_{n+1} - B)
    +\cdots \\
    &\qquad\qquad\qquad
    +(-1)^{n+1}\mu(A_1\cup\cdots\cup A_n\cup A_{n+1} - B)
    \Bigg]
    \\[8pt]
    &=
    \sum_{1\le i\le n+1} \mu(A_i - B)
    -
    \sum_{1\le i<j\le n+1} \mu(A_i\cup A_j - B)
    +\cdots
    +(-1)^{n+2}\mu(A_1\cup\cdots\cup A_{n+1} - B).
    \end{aligned}
    \tag{7}
    $$
Expression **(7)** is the formula stated in the theorem with $n$ replaced by $n+1$. Thus the identity holds for all $n\ge 1$ by induction.
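The identity can also be checked numerically. The sketch below is not a proof; it assumes a universe of 16 points (the atoms of four generic sets) carrying arbitrary integer weights drawn with a fixed seed, and compares both sides for $n=3$.

```python
from itertools import combinations
import random

rng = random.Random(0)
universe = range(16)                                  # 16 points = atoms of four generic sets
weight = {p: rng.randint(-5, 5) for p in universe}    # arbitrary signed values

def mu(region):
    """Set-additive function: sum of the weights of the points in the region."""
    return sum(weight[p] for p in region)

def has_bit(k):
    """Points whose k-th binary digit is 1."""
    return {p for p in universe if (p >> k) & 1}

A = [has_bit(0), has_bit(1), has_bit(2)]
B = has_bit(3)

# Left-hand side: mu of the intersection of the A_k, minus B.
lhs = mu(set.intersection(*A) - B)

# Right-hand side: alternating sum over all nonempty index sets S.
rhs = sum(
    (-1) ** (m + 1) * mu(set().union(*(A[i] for i in S)) - B)
    for m in range(1, len(A) + 1)
    for S in combinations(range(len(A)), m)
)
```

Integer weights are used so that the comparison `lhs == rhs` is exact rather than subject to floating-point rounding.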

## Theorem: Parametrization of Signed Measures on $\mathcal{F}_n$

Let $\tilde{X}_1,\ldots,\tilde{X}_n$ be sets in a universal set $\Omega_n$, and let  $\mathcal{F}_n$ be the field they generate. The atoms of $\mathcal{F}_n$ are the sets
$$
\bigcap_{i=1}^n Y_i, \qquad \text{ where }\qquad Y_i\in\{\tilde{X}_i,\tilde{X}_i^{\,c}\},
$$
and we assume that $A_0=\bigcap_{i=1}^n \tilde{X}_i^{\,c}$ is empty (that is, $\Omega_n=\bigcup_{i=1}^n\tilde{X}_i$) while all other atoms are nonempty.

For any nonempty $G\subseteq\{1,\ldots,n\}$ define  
$$
\tilde{X}_G=\bigcup_{i\in G}\tilde{X}_i,
\qquad
\text{ and }\qquad \mathcal{B}=\{\tilde{X}_G : G\neq\varnothing\}.
$$

Then a signed measure $\mu$ on $\mathcal{F}_n$ is completely determined by the values  
$$
\{\mu(B):B\in\mathcal{B}\},
$$  
which may be chosen arbitrarily. Every such assignment extends uniquely to a signed measure on $\mathcal{F}_n$.

#### **Proof:**

##### Step 1: Preliminaries

Let $\mathcal{A}$ be the set of all **nonempty atoms** of $\mathcal{F}_n$.  
Each atom corresponds to a nonempty subset of $\{1,\ldots,n\}$, so
$$
|\mathcal{A}| = 2^n - 1.
$$
By construction,
$$
\mathcal{B} = \{\tilde{X}_G : \varnothing \neq G \subseteq \{1,\ldots,n\}\}
$$
also has one element for each nonempty subset of $\{1,\ldots,n\}$, hence
$$
|\mathcal{B}| = 2^n - 1.
$$
Thus
$$
|\mathcal{A}| = |\mathcal{B}| =: k.
$$

Let
$$
\mathbf{u} =
\begin{bmatrix}
\mu(A)
\end{bmatrix}_{A\in\mathcal{A}}
\in \mathbb{R}^k,
\qquad
\mathbf{h} =
\begin{bmatrix}
\mu(B)
\end{bmatrix}_{B\in\mathcal{B}}
\in \mathbb{R}^k
$$
be column vectors indexed by $\mathcal{A}$ and $\mathcal{B}$, respectively.

##### Step 2: From atoms to unions

Every $B\in\mathcal{B}$ is a **disjoint union** of some nonempty atoms in $\mathcal{A}$. By set-additivity of $\mu$, each $\mu(B)$ is the sum of the $\mu(A)$ over those atoms:
$$
\mu(B) = \sum_{A\in\mathcal{A}} c_{B,A}\,\mu(A),
$$
where $c_{B,A}\in\{0,1\}$ indicates whether the atom $A$ is contained in $B$.

Collecting these relations for all $B\in\mathcal{B}$, we obtain a linear system
$$
\mathbf{h} = C_n \mathbf{u},
\tag{1}
$$
where $C_n$ is a **unique** $k\times k$ matrix (its entries are the $c_{B,A}$).

##### Step 3: From unions back to atoms

Let $A\in\mathcal{A}$ be a nonempty atom of $\mathcal{F}_n$, then $A$ has the form
$$
A=\bigcap_{i\in P}\tilde{X}_i
  - \bigcup_{j\in N}\tilde{X}_j.
$$
where
$$
P=\{i : Y_i=\tilde{X}_i\},
\qquad
N=\{j : Y_j=\tilde{X}_j^{\,c}\},
$$
so that $P\neq\varnothing$, $P\cup N=\{1,\ldots,n\}$, and $P\cap N=\varnothing$.

Set
$$
A_k=\tilde{X}_{i_k},\quad i_k\in P,\qquad
B=\bigcup_{j\in N}\tilde{X}_j,
$$
so that $A=\bigcap_{k}A_k - B$ (with $B=\varnothing$ when $N=\varnothing$).

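By the inclusion–exclusion theorem above and the identity $\mu(E - B)=\mu(E\cup B)-\mu(B)$, we have
$$
\begin{align*}
\mu(A)
&=
\mu\!\left(\bigcap_{k}A_k - B\right),\\
&=
\sum_{S\subseteq P,\,S\neq\varnothing}
(-1)^{|S|+1}
\mu\!\left(\bigcup_{i\in S}\tilde{X}_i - B\right),\\
&=\sum_{S\subseteq P,\,S\neq\varnothing}
(-1)^{|S|+1}\left(
\mu\!\left(\bigcup_{i\in S}\tilde{X}_i \cup B\right) - \mu\!\left(B\right) \right),\\
&=\sum_{S\subseteq P,\,S\neq\varnothing}
(-1)^{|S|+1}
\mu\!\left(\bigcup_{i\in S}\tilde{X}_i \cup B\right) - \sum_{S\subseteq P,\,S\neq\varnothing}
(-1)^{|S|+1}\mu\!\left(B\right),\\
&=\sum_{S\subseteq P,\,S\neq\varnothing}
(-1)^{|S|+1}
\mu\!\left(\tilde{X}_{S\cup N}\right) - \mu\!\left(B\right),
\end{align*}
$$
with $B=\bigcup_{j\in N}\tilde{X}_j$, which equals $\tilde{X}_N$ when $N\neq\varnothing$ and has $\mu(B)=\mu(\varnothing)=0$ when $N=\varnothing$. The last step uses $\sum_{S\subseteq P,\,S\neq\varnothing}(-1)^{|S|+1}=1$. Every set on the right-hand side is therefore a union $\tilde{X}_G\in\mathcal{B}$.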

Thus, for each $A\in\mathcal{A}$,
$$
\mu(A) = \sum_{B\in\mathcal{B}} d_{A,B}\,\mu(B),
$$
for suitable real coefficients $d_{A,B}$. In vector form,
$$
\mathbf{u} = D_n \mathbf{h},
\tag{2}
$$
for some $k\times k$ matrix $D_n$.

##### Step 4: Invertibility and parametrization


Substituting (1) into (2) gives
$$
\mathbf{u} = D_n \mathbf{h} = D_n (C_n \mathbf{u}) = (D_n C_n)\,\mathbf{u}.
\tag{5}
$$
Since this must hold for **every** possible choice of the vector $\mathbf{u}$ (i.e., for
every signed measure $\mu$), we must have
$$
D_n C_n = I_k,
$$
so $D_n$ is the inverse of $C_n$. In particular, $C_n$ is invertible and the mapping
$$
\mathbf{u} \longleftrightarrow \mathbf{h}
$$
is bijective.

Thus specifying the values $\mu(B)$ for all $B\in\mathcal{B}$ (i.e., choosing any
$\mathbf{h}\in\mathbb{R}^k$) uniquely determines $\mathbf{u}$ and hence the signed
measure $\mu$ on all atoms, and therefore on all of $\mathcal{F}_n$.
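The two matrices in the proof can be built explicitly for small $n$. The sketch below is an illustration under one indexing assumption: nonempty atoms are labelled by their index set $P$ and unions by their index set $G$, both enumerated in the same order. `build_C` encodes "atom $P$ lies inside $\tilde{X}_G$ iff $P\cap G\neq\varnothing$", and `build_D` encodes the inclusion–exclusion expansion of Step 3.

```python
from itertools import combinations

def nonempty_subsets(items):
    items = sorted(items)
    return [frozenset(c) for m in range(1, len(items) + 1)
            for c in combinations(items, m)]

def build_C(n):
    """Row G, column P: 1 if the atom indexed by P is contained in the union X_G."""
    idx = nonempty_subsets(range(n))
    return [[1 if G & P else 0 for P in idx] for G in idx]

def build_D(n):
    """Row P, column G: coefficient of mu(X_G) in the expansion of mu(atom P)."""
    idx = nonempty_subsets(range(n))
    pos = {G: k for k, G in enumerate(idx)}
    full = frozenset(range(n))
    D = [[0] * len(idx) for _ in idx]
    for r, P in enumerate(idx):
        N = full - P
        for S in nonempty_subsets(P):
            D[r][pos[S | N]] += (-1) ** (len(S) + 1)
        if N:                       # mu(X_N) is subtracted once; mu(empty set) = 0
            D[r][pos[N]] -= 1
    return D

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def identity(k):
    return [[1 if i == j else 0 for j in range(k)] for i in range(k)]

# Recover atom values from hypothetical union values for n = 2:
# h = (mu(X1), mu(X2), mu(X1 u X2)).
h = [3, 4, 5]
u = [sum(d * v for d, v in zip(row, h)) for row in build_D(2)]
```

For $n=2$ this yields the atom values $(1, 2, 2)$ for $\tilde{X}_1-\tilde{X}_2$, $\tilde{X}_2-\tilde{X}_1$, and $\tilde{X}_1\cap\tilde{X}_2$, and multiplying back by $C_2$ returns the original $(3,4,5)$.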

## Definition: Construction of the I-Measure $\mu^*$

Consider $n$ finite random variables $X_1,\ldots,X_n$, and for each $X_i$ let
$\tilde{X}_i$ be the associated set.  Let
$$
\mathcal{N}_n=\{1,\ldots,n\},
\qquad
\Omega_n=\bigcup_{i\in\mathcal{N}_n}\tilde{X}_i,
$$
and let $\mathcal{F}_n$ be the field generated by
$\tilde{X}_1,\ldots,\tilde{X}_n$.

For every nonempty $G\subseteq\mathcal{N}_n$ define
$$
\tilde{X}_G=\bigcup_{i\in G}\tilde{X}_i.
$$
Since a signed measure on $\mathcal{F}_n$ is uniquely determined by its
values on all such unions $\tilde{X}_G$, we define the $I$-measure $\mu^*$ on
$\mathcal{F}_n$ by first specifying
$$
\mu^*(\tilde{X}_G)
=
-\sum_{x\in\mathcal{X}_G} p_{X_G}(x)\,\log p_{X_G}(x),
\qquad
\varnothing\neq G\subseteq\mathcal{N}_n,
$$
where $p_{X_G}$ is the pmf of the joint random variable
$X_G=(X_i)_{i\in G}$, with the convention $0 \log 0 := 0$.

By the parametrization theorem for signed measures on $\mathcal{F}_n$, this
assignment extends uniquely to a signed measure $\mu^*$ on all sets in
$\mathcal{F}_n$.
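The defining values $\mu^*(\tilde{X}_G)=H(X_G)$ can be tabulated from a joint pmf. The sketch below assumes base-2 logarithms and a hypothetical pair of independent variables ($X_1$ uniform on two symbols, $X_2$ uniform on four); the helper names are illustrative.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def marginal(joint, G):
    """pmf of X_G, obtained by summing the joint pmf over the other coordinates."""
    p = defaultdict(float)
    for outcome, mass in joint.items():
        p[tuple(outcome[i] for i in G)] += mass
    return p

def entropy(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(m * log2(m) for m in p.values() if m > 0)

def union_values(joint, n):
    """mu*(X_G) for every nonempty G, keyed by the sorted tuple of indices."""
    return {G: entropy(marginal(joint, G))
            for m in range(1, n + 1)
            for G in combinations(range(n), m)}

# Hypothetical example: X1 uniform on {0,1}, X2 uniform on {0,1,2,3}, independent.
joint = {(a, b): 1 / 8 for a in range(2) for b in range(4)}
values = union_values(joint, 2)
```

These $2^n-1$ numbers are exactly the vector $\mathbf{h}$ of the parametrization theorem, so they fix $\mu^*$ on the whole field.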

## I-measure extends Entropy

### Lemma

For any sets $A,B,C$ and any set-additive function $\mu$,
$$
\mu(A \cap B - C)
=
\mu(A \cup C)
+
\mu(B \cup C)
-
\mu(A \cup B \cup C)
-
\mu(C).
$$


#### **Proof**

From the identities
$$
\begin{align*}
\mu(A \cap B - C) &= \mu(A - C) + \mu(B - C) - \mu(A \cup B - C),\\
\mu(E - C) &= \mu(E \cup C) - \mu(C),
\end{align*}
$$
we obtain
$$
\begin{aligned}
\mu(A \cap B - C)
&= \mu(A - C) + \mu(B - C) - \mu(A \cup B - C) \\[2pt]
&= \big(\mu(A \cup C) - \mu(C)\big)
   + \big(\mu(B \cup C) - \mu(C)\big)
   - \big(\mu(A \cup B \cup C) - \mu(C)\big) \\[2pt]
&= \mu(A \cup C) + \mu(B \cup C) - \mu(A \cup B \cup C) - \mu(C),
\end{aligned}
$$
which is exactly the claim.

### Lemma

For random variables $X,Y,Z$, write $H(X\Cap Y\mid Z)$ for the conditional mutual information $I(X;Y\mid Z)$. Then
$$
H(X\Cap Y\mid Z)
=
H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
\tag{1}
$$

#### **Proof**

Using the alternative form of the conditional entropy and its conditioned version
$$
H(X\mid Z)   = H(X,Z)   - H(Z), 
\qquad
H(X\mid Y,Z) = H(X,Y,Z) - H(Y,Z),
\tag{3}
$$
we obtain
$$
\begin{aligned}
H(X\Cap Y\mid Z)&= H(X\mid Z) - H(X\mid Y,Z),\\
&= \big(H(X,Z) - H(Z)\big)
   - \big(H(X,Y,Z) - H(Y,Z)\big) \\[2pt]
&= H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z),
\end{aligned}
$$
which is exactly the claim.
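As a numerical sanity check of the lemma (not a substitute for the proof), the sketch below draws a random joint pmf on $\{0,1\}^3$ with a fixed seed and compares the four-entropy expression with the conditional mutual information computed from its usual definition $\sum p(x,y,z)\log\frac{p(x,y,z)\,p(z)}{p(x,z)\,p(y,z)}$. All names are illustrative.

```python
from collections import defaultdict
from itertools import product
from math import log2
import random

rng = random.Random(1)
outcomes = list(product((0, 1), repeat=3))            # coordinates are (x, y, z)
raw = [rng.random() for _ in outcomes]
joint = {o: r / sum(raw) for o, r in zip(outcomes, raw)}

def marginal(idx):
    p = defaultdict(float)
    for o, m in joint.items():
        p[tuple(o[i] for i in idx)] += m
    return p

def H(idx):
    return -sum(m * log2(m) for m in marginal(idx).values() if m > 0)

# Right-hand side of the lemma: H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).
via_entropies = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))

# Conditional mutual information from its definition.
p_xz, p_yz, p_z = marginal((0, 2)), marginal((1, 2)), marginal((2,))
via_definition = sum(
    m * log2(m * p_z[(z,)] / (p_xz[(x, z)] * p_yz[(y, z)]))
    for (x, y, z), m in joint.items() if m > 0
)
```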

## Theorem: $\mu^*$ is the Unique Signed Measure Consistent with Shannon Quantities

The signed measure $\mu^*$ on $\mathcal{F}_n$ is the unique signed measure whose
values agree with all Shannon information measures.


#### **Proof**

Consider nonempty index sets $G,G',G''\subseteq\mathcal{N}_n$.  
Using the identities for signed measures on $\mathcal{F}_n$, we compute:
$$
\begin{aligned}
\mu^*(\tilde{X}_G \cap \tilde{X}_{G'} - \tilde{X}_{G''})
&= \mu^*(\tilde{X}_{G\cup G''})
 +  \mu^*(\tilde{X}_{G'\cup G''})
 -  \mu^*(\tilde{X}_{G\cup G'\cup G''})
 -  \mu^*(\tilde{X}_{G''}) \\[4pt]
&= H(X_{G\cup G''})
 +  H(X_{G'\cup G''})
 -  H(X_{G\cup G'\cup G''})
 -  H(X_{G''}) \\[4pt]
&= I(X_G; X_{G'} \mid X_{G''}).
\end{aligned}
$$

Thus every region of the field corresponds exactly to the appropriate
Shannon information quantity. Since a signed measure on $\mathcal{F}_n$ is
uniquely determined by its values on all such regions, $\mu^*$ is the unique
measure consistent with these quantities.  $\square$

## Definition: Information Synergy

In contrast to the basic quantities
$$
\mu^{*}(X)=H(X),\quad \mu^{*}(X\cup Y)=H(X,Y),\quad \mu^{*}(X\cap Y)=H(X\Cap Y),
$$
which are always non-negative, the information measure $\mu^{*}(X\cap Y\cap Z)$ may be **negative**, as in the following example.

### Example: Negative triple disjoint-entropy 
Consider $X,Y\sim \text{Bernoulli}(\frac{1}{2})$ with $X\perp Y$ and $Z=(X+Y) \bmod 2$ (XOR). What is $\mu^*(X\cap Y\cap Z)$?

#### Joint distribution

Then, the joint distribution of $(X,Y,Z)$ is:

$$
\begin{align*}
p_{(X,Y,Z)}(0,0,0)&=1/4\\
p_{(X,Y,Z)}(0,1,1)&=1/4\\
p_{(X,Y,Z)}(1,0,1)&=1/4\\
p_{(X,Y,Z)}(1,1,0)&=1/4\\
\end{align*}
$$

Then $p_Z(0)=p_Z(1)=\frac{1}{4}+\frac{1}{4}=\frac{1}{2}$, so $Z \sim \text{Bernoulli}(\frac{1}{2})$.

#### Joint 1

Notice that if $W\sim \text{Bernoulli}(p)$ we have
$$
\begin{align*} 
\mu^*(W)&=-p_{W}(0)\log(p_{W}(0))-p_{W}(1)\log(p_{W}(1)),\\
&=-(1-p)\log(1-p)-p\log(p).
\end{align*}
$$
So, for $p=\frac{1}{2}$ and logarithms in base 2, we get $\mu^*(W)=-\frac{1}{2}\log(2^{-1})-\frac{1}{2}\log(2^{-1})=\frac{1}{2}+\frac{1}{2}=1$.

Then, we have $\mu^*(X)=\mu^*(Y)=\mu^*(Z)=1$.

#### Joint 2

The marginals with $Z$ are
$$
\begin{aligned}
p_{X,Z}(0,0)=\tfrac14, \qquad & p_{Y,Z}(0,0)=\tfrac14,\\
p_{X,Z}(0,1)=\tfrac14, \qquad & p_{Y,Z}(0,1)=\tfrac14,\\
p_{X,Z}(1,0)=\tfrac14, \qquad & p_{Y,Z}(1,0)=\tfrac14,\\
p_{X,Z}(1,1)=\tfrac14, \qquad & p_{Y,Z}(1,1)=\tfrac14.
\end{aligned}
$$

Thus,
$$
p_{X,Z}(x,z)=p_X(x)\,p_Z(z), \qquad 
p_{Y,Z}(y,z)=p_Y(y)\,p_Z(z),
$$
so, in addition to $X \perp Y$, we also have $X \perp Z$ and $Y \perp Z$.  
Then, we have
$\mu^*(X\cap Y)=\mu^*(Y\cap Z)=\mu^*(Z\cap X)=0$.


Hence $X$, $Y$, and $Z$ are pairwise independent, but not mutually independent, since  
$$
Z = X + Y \pmod 2.
$$
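The claim "pairwise independent but not mutually independent" can be checked mechanically from the joint pmf of the XOR example. This is a sketch with illustrative helper names; the tolerance `1e-12` is an arbitrary choice.

```python
from collections import defaultdict
from itertools import product

# Joint pmf of (X, Y, Z) with Z = X xor Y.
joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}

def marginal(idx):
    p = defaultdict(float)
    for o, m in joint.items():
        p[tuple(o[i] for i in idx)] += m
    return p

def independent(i, j):
    """True if the pair (i, j) factorises: p(a, b) = p(a) p(b) for all a, b."""
    p_i, p_j, p_ij = marginal((i,)), marginal((j,)), marginal((i, j))
    return all(abs(p_ij[(a, b)] - p_i[(a,)] * p_j[(b,)]) < 1e-12
               for a in (0, 1) for b in (0, 1))

pairwise = all(independent(i, j) for i, j in [(0, 1), (0, 2), (1, 2)])

# Mutual independence would need p(x, y, z) = p(x) p(y) p(z) for every triple.
p_x, p_y, p_z = marginal((0,)), marginal((1,)), marginal((2,))
mutual = all(
    abs(joint.get((x, y, z), 0.0) - p_x[(x,)] * p_y[(y,)] * p_z[(z,)]) < 1e-12
    for x, y, z in product((0, 1), repeat=3)
)
```

The triple $(0,0,1)$ already breaks the product rule: it has probability $0$ under the joint pmf but $\frac{1}{8}$ under the product of the marginals.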




This phenomenon highlights a fundamental difference between the *content* of sets and the *information* of random variables.  
In classical set theory, the relation
$$
X \cap Y = \varnothing
$$
forces
$$
X \cap Y \cap Z = \varnothing,
$$
because
$$
\varnothing \subseteq X \cap Y \cap Z \subseteq X \cap Y = \varnothing.
$$

In information theory, however, the analogous regions of the I-diagram do not need to obey non-negativity.  
The atom $\tilde X \cap \tilde Y \cap \tilde Z$ may have *negative* I-measure.  
Thus the information diagram admits "less-than-empty" regions: sets whose informational measure is negative even though no such notion exists in ordinary set theory.  
This signed nature of the I-measure is what allows pairwise independence to coexist with higher-order dependence, as in the XOR example.

#### Joint 3

From the dependency $Z=f(X,Y):=(X+Y) \bmod 2$ and the independence $X\perp Y$, we have
$$
\begin{align*}
\mu^*(X\cup Y \cup Z)&=\mu^*(Z-(X\cup Y)) + \mu^*(X\cup Y),\\
&=H(Z|X,Y) + \mu^*(X) + \mu^*(Y),\\
&=0+1+1,\\
&=2.
\end{align*}
$$

#### Computation

Then we can compute:
$$
\begin{align*}
\mu^*(X\cap Y\cap Z)&=\mu^*(X)+\mu^*(Y)+\mu^*(Z)
 - \mu^*(X\cup Y) - \mu^*(X\cup Z) - \mu^*(Y\cup Z) + \mu^*(X\cup Y\cup Z),\\
&=1+1+1 -2 -2 -2+2,\\
&= -1.
\end{align*}
$$
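The same computation extends to all seven nonempty atoms of the I-diagram. The sketch below applies the inclusion–exclusion expansion from the parametrization theorem to the XOR joint pmf, assuming base-2 logarithms; indices 0, 1, 2 stand for $X$, $Y$, $Z$, and the helper names are illustrative.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def H(joint, G):
    """Joint entropy (bits) of the coordinates listed in G."""
    p = defaultdict(float)
    for outcome, mass in joint.items():
        p[tuple(outcome[i] for i in sorted(G))] += mass
    return -sum(m * log2(m) for m in p.values() if m > 0)

def subsets(items):
    items = sorted(items)
    return [frozenset(c) for m in range(1, len(items) + 1)
            for c in combinations(items, m)]

def atom_values(joint, n):
    """mu* of each nonempty atom, keyed by the set P of variables it lies inside."""
    full = frozenset(range(n))
    values = {}
    for P in subsets(full):
        N = full - P
        total = sum((-1) ** (len(S) + 1) * H(joint, S | N) for S in subsets(P))
        if N:
            total -= H(joint, N)
        values[P] = total
    return values

xor = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
mu_star = atom_values(xor, 3)
```

For this example the three "single-variable" atoms are $0$, the three "pair-only" atoms are $1$ (for instance $\mu^*(\tilde X\cap\tilde Y-\tilde Z)=I(X;Y\mid Z)=1$), and the central atom is $-1$; the seven values sum to $H(X,Y,Z)=2$.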

## Code: I-Measure

In [None]:
import numpy as np
from pyitlib import discrete_random_variable as drv

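def H(X, base=2):
    # Marginal entropy of each row of X (one row per variable, one column per sample).
    return tuple(float(h) for h in drv.entropy(X, base=base))

def H_joint(variables, base=2):
    # Joint entropy of all rows of `variables`.
    return float(drv.entropy_joint(variables, base=base))

def H_cond(Y_vars, X_vars, base=2):
    # Conditional entropy H(Y | X) = H(X, Y) - H(X).
    XY = np.concatenate([X_vars, Y_vars])
    return H_joint(XY, base=base) - H_joint(X_vars, base=base)

def H_mutual(variables, base=2):
    # Co-information of the rows; for two rows this is the mutual information.
    return float(drv.information_co(variables, base=base))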

In [None]:
X = np.array([[0,0,1,1]])
Y = np.array([[0,1,0,1]])
Z = X ^ Y  # XOR of X and Y

XYZ = np.concatenate([X, Y, Z])          # shape (3, n_samples)
XY = np.concatenate([X, Y])              # shape (2, n_samples)
YZ = np.concatenate([Y, Z])              # shape (2, n_samples)
XZ = np.concatenate([X, Z])              # shape (2, n_samples)
XYZ

array([[0, 0, 1, 1],
       [0, 1, 0, 1],
       [0, 1, 1, 0]])

In [None]:
H_cond(Z,XY)

0.0

In [None]:
H_mutual(XY)

0.0

In [None]:
Hx,Hy,Hz = H(XYZ)  # marginal entropies H(X), H(Y), H(Z), in bits by default
print(f"{Hx,Hy,Hz=}")
Hxy = H_joint(XY) 
Hyz = H_joint(YZ)
Hxz = H_joint(XZ)
print(f"{Hxy,Hyz,Hxz=}")
Hxyz = H_joint(XYZ)
print(f"{Hxyz=}")

H_XintYintZ = Hx + Hy + Hz - Hxy - Hyz - Hxz + Hxyz
print(f"{H_XintYintZ=}")


Hx,Hy,Hz=(1.0, 1.0, 1.0)
Hxy,Hyz,Hxz=(2.0, 2.0, 2.0)
Hxyz=2.0
H_XintYintZ=-1.0


In [None]:
H_mutual(XYZ, base=2)  # co-information; matches H_XintYintZ above

-1.0

## Machine Learning interpretation

The sign of $\mu^{*}(X\cap Y\cap Z)$ has a natural interpretation:

- If $\mu^{*}(X\cap Y\cap Z) < 0$, we say that $X,Y,Z$ exhibit **information synergy**: none of the pairs carries enough information, but the three variables *together* reveal extra structure that is not present in any pair alone.
- If $\mu^{*}(X\cap Y\cap Z) > 0$, we speak of **information redundancy**: the information that the third variable adds about the other two is already (partly) contained in the pairs.


This can be seen more clearly when one variable is interpreted as a “target” and the other two as “features”. For a target $Y$ and two features $X_1,X_2$ we have the decomposition
$$
\begin{align*}
H(Y\mid X_1,X_2) &= \mu^*\big(Y - (X_1\cup X_2)\big),\\
&= \mu^*(Y)-\mu^*\big(Y\cap (X_1\cup X_2)\big),\\
&= \mu^*(Y)-\mu^*\big((Y\cap X_1)\cup(Y\cap X_2)\big),\\
&= \mu^*(Y)-\big(\mu^*(Y\cap X_1) + \mu^*(Y\cap X_2) - \mu^*(Y\cap X_1\cap X_2)\big),\\
&= \mu^*(Y)-\mu^*(Y\cap X_1) - \mu^*(Y\cap X_2) + \mu^*(Y\cap X_1\cap X_2).
\end{align*}
$$

- If $\mu^*(Y\cap X_1\cap X_2) < 0$, then
  $$
  \mu^*\big(Y\cap (X_1\cup X_2)\big)
  > \mu^*(Y\cap X_1) + \mu^*(Y\cap X_2),
  $$
  so the pair $(X_1,X_2)$ together carries **more** disjoint-entropy with $Y$ than the sum of their individual contributions: this is **synergy**. In this case, $H(Y\mid X_1,X_2)$ is smaller ($Y$ is more predictable) than it would be if the triple term $\mu^*(Y\cap X_1\cap X_2)$ were zero.

- If $\mu^*(Y\cap X_1\cap X_2) > 0$, then the combined information of $(X_1,X_2)$ about $Y$ is **less** than the sum $\mu^*(Y\cap X_1)+\mu^*(Y\cap X_2)$, indicating **redundancy** between the features. In this case, $H(Y\mid X_1,X_2)$ is larger ($Y$ is less predictable) than it would be if the triple term $\mu^*(Y\cap X_1\cap X_2)$ were zero.
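To contrast the two cases, the sketch below evaluates the individual and joint contributions for two hypothetical feature/target setups: the XOR target from the example (synergy) and a target that both features simply copy (redundancy). Outcomes are ordered as $(x_1, x_2, y)$; the function names are illustrative.

```python
from collections import defaultdict
from math import log2

def H(joint, idx):
    """Joint entropy (bits) of the coordinates listed in idx."""
    p = defaultdict(float)
    for o, m in joint.items():
        p[tuple(o[i] for i in idx)] += m
    return -sum(m * log2(m) for m in p.values() if m > 0)

def report(joint):
    """Return I(Y;X1), I(Y;X2), I(Y;X1,X2) and the triple term mu*(Y n X1 n X2)."""
    i1 = H(joint, (2,)) + H(joint, (0,)) - H(joint, (0, 2))
    i2 = H(joint, (2,)) + H(joint, (1,)) - H(joint, (1, 2))
    i12 = H(joint, (2,)) + H(joint, (0, 1)) - H(joint, (0, 1, 2))
    # From the decomposition above: I(Y;X1,X2) = I(Y;X1) + I(Y;X2) - triple.
    return i1, i2, i12, i1 + i2 - i12

xor_target = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}   # Y = X1 xor X2
copied_feature = {(b, b, b): 0.5 for b in (0, 1)}                   # X1 = X2 = Y

synergy = report(xor_target)
redundancy = report(copied_feature)
```

In the XOR case each feature alone carries $0$ bits about $Y$ while the pair carries $1$ bit, so the triple term is $-1$. In the copied case each feature alone already carries the full $1$ bit, the pair adds nothing, and the triple term is $+1$.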

In [None]:
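X1 = X.copy()   # feature 1
X2 = Y.copy()   # feature 2
Y = Z.copy()    # target (rebinds Y to the XOR of the two features)

X1X2 = np.vstack([X1, X2])
H_cond(Y, X1X2, base=2)  # H(Y | X1, X2), i.e. H(Z | X, Y) in the earlier notation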

0.0

If $Y$, $X_1$, and $X_2$ were mutually independent, we would instead have
$$
\begin{align*}
H(Y\mid X_1,X_2) &= \mu^*(Y) = H(Y).
\end{align*}
$$

In [None]:
H_joint(Y)

1.0