This notebook aims to provide a gentle introduction to Shannon's entropy and mutual information.

**Author: Peishi Jiang**

# What is information theory?
Information theory is a mathematical framework for quantifying uncertainty and the (possibly nonlinear) dependencies in a multivariate system. It originated in electrical engineering as a way to study the storage, quantification, and communication of information. Claude Shannon pioneered the field and proposed the concept of Shannon's entropy in the 1940s. Since then, it has seen extensive development and application in various fields, including but not limited to electrical engineering, mathematics, computer science, physics, neurobiology, and earth science/hydrology.

**References**:
- Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379-423.
- Cover, T., & Thomas, J. (2006). Elements of information theory (2nd ed.). New York: Wiley.

# Entropy
We use Shannon's entropy here, denoted by $H$. Given a discrete variable $X$, $H$ quantifies the overall uncertainty (or variability) of $X$ as:
\begin{align}
 H(X) &= - \sum_{x \in X} p(x) \log(p(x)) \tag{1},
\end{align}
where $p(x)$ is the probability that $X$ takes the value $x$. Terms with $p(x)=0$ contribute zero, by the convention $0 \log 0 = 0$.

For two variables $X$ and $Y$, their overall uncertainty can be calculated through the joint entropy as below:
\begin{align}
 H(X,Y) &= - \sum_{x \in X} \sum_{y \in Y} p(x,y) \log(p(x,y)) \tag{2}.
\end{align}
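
As a minimal sketch of Eq. (2), the snippet below computes the joint entropy of two small joint probability tables. The helper name `joint_entropy` and both tables are hypothetical and chosen only for illustration; zero cells are skipped, following the convention $0 \log 0 = 0$.

```python
import numpy as np

def joint_entropy(pxy, base=np.e):
    """Joint entropy H(X,Y) of a 2-D table of joint probabilities (Eq. 2)."""
    pxy = np.asarray(pxy, dtype=float)
    nonzero = pxy[pxy > 0]  # skip zero cells: 0*log(0) is taken as 0
    return -np.sum(nonzero * np.log(nonzero)) / np.log(base)

# Hypothetical joint tables: rows are states of X, columns are states of Y
pxy_indep = np.array([[0.25, 0.25],
                      [0.25, 0.25]])  # X and Y independent and uniform
pxy_dep = np.array([[0.5, 0.0],
                    [0.0, 0.5]])      # Y is fully determined by X

H_indep = joint_entropy(pxy_indep)
H_dep = joint_entropy(pxy_dep)
print(f"H(X,Y), independent: {H_indep:.4f} nats")
print(f"H(X,Y), dependent:   {H_dep:.4f} nats")
```

The dependent table has the lower joint entropy: once $X$ is known, $Y$ adds no further uncertainty.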

## Entropy's properties
Shannon's entropy has the following nice properties (**Prove them!**):
1. Nonnegativity: $H \geq 0$. In other words, the entropy is always nonnegative.
2. Chain rule: $H(X,Y) = H(X) + H(Y|X)$, where $H(Y|X)=-\sum_{x \in X} \sum_{y \in Y} p(x,y) \log(p(y|x))$ is the conditional entropy. This property states that the joint uncertainty of $X$ and $Y$ is the sum of $X$'s uncertainty ($H(X)$) and the remaining uncertainty of $Y$ given the knowledge of $X$ ($H(Y|X)$).
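3. Conditioning reduces entropy: $H(X|Y) \leq H(X)$. In other words, knowing $Y$ cannot increase the uncertainty of $X$ on average; equality holds when $X$ and $Y$ are independent.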
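
A numerical check is not a proof, but it can help build intuition for the properties above. The sketch below uses a hypothetical $2 \times 2$ joint table and an illustrative helper `entropy` (neither comes from the notebook) to check nonnegativity, the chain rule, and the third property, applied here with the roles of $X$ and $Y$ swapped.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of an array of probabilities (any shape)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0*log(0) is taken as 0
    return -np.sum(p * np.log(p))

# Hypothetical joint table p(x, y): rows index x, columns index y
pxy = np.array([[0.30, 0.10],
                [0.15, 0.45]])
px = pxy.sum(axis=1)  # marginal p(x)
py = pxy.sum(axis=0)  # marginal p(y)

# Conditional entropy H(Y|X) = -sum p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)
py_given_x = pxy / px[:, None]
H_y_given_x = -np.sum(pxy * np.log(py_given_x))

H_xy, H_x, H_y = entropy(pxy), entropy(px), entropy(py)
print(f"H(X,Y)        = {H_xy:.4f} nats")
print(f"H(X) + H(Y|X) = {H_x + H_y_given_x:.4f} nats")
print(f"H(Y|X) = {H_y_given_x:.4f} <= H(Y) = {H_y:.4f}")
```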

## Maximum and minimum entropy (univariate case)
Now, let's move on to the calculation of the entropy $H$ and see under what conditions it reaches its maximum and minimum.

Say a variable $X$ has two states $X=a$ and $X=b$, with probabilities $P(X=a)$ and $P(X=b)$, respectively, such that $P(X=a) + P(X=b) = 1$. Instead of calculating the entropy manually (which you can!), we are going to use Python to compute it.

Let's first define a function to calculate the entropy:

In [24]:
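import numpy as np


def computeEntropy(pdf, base=np.e):
    """
    Compute the Shannon entropy of a discrete distribution.
    pdf: a numpy array of probabilities such that sum(pdf)=1
    base: the logarithm base (np.e gives nats, 2 gives bits)
    """
    # Mask zero probabilities before taking the log, then fill the masked
    # entries with 0 so that they contribute nothing (0*log(0) is taken as 0)
    log_pdf = np.ma.filled(np.log(np.ma.masked_equal(pdf, 0)), 0)
    # Dividing by log(base) converts the natural log to the requested base
    return -np.sum(pdf*log_pdf / np.log(base))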


Scenario 1: $X$ is deterministic such that $P(X=a)=1$ and $P(X=b)=0$. What is $H(X)$?

In [22]:
P1 = np.array([1, 0])
H1 = computeEntropy(P1)
print(f"H1: {H1} nats")

H1: -0.0 nats




$H(X) = 0$! (The printed `-0.0` is floating-point negative zero, which equals 0.) This means there is no variability or uncertainty associated with $X$, which makes sense since $X$ is deterministic. Let's consider two additional scenarios:

Scenario 2: $P(X=a)=0.1$ and $P(X=b)=0.9$

Scenario 3: $P(X=a)=0.5$ and $P(X=b)=0.5$

In [23]:
P2, P3 = np.array([0.1, 0.9]), np.array([0.5, 0.5])
H2 = computeEntropy(P2)
H3 = computeEntropy(P3)
print(f"H2: {H2} nats")
print(f"H3: {H3} nats")

H2: 0.3250829733914482 nats
H3: 0.6931471805599453 nats


Why is $H_2 < H_3$?

Intuitively, this is because scenario 2 is more deterministic than scenario 3 since there is a much higher chance that $X=b$. On the other hand, scenario 3 gives equal probability to $X=a$ and $X=b$.

In fact, one can prove that the entropy is maximized when the distribution is uniform! Note that scenario 3 is exactly a discrete uniform distribution with only two realizations. For a general discrete uniform distribution with $N$ realizations, $H = -\log(\frac{1}{N}) = \log(N)$.
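
The maximum-entropy claim can be probed numerically. The sketch below is not a proof: it assumes $N=4$ states and draws 1000 random distributions from a flat Dirichlet distribution (an arbitrary choice made here for illustration), then compares their entropies with that of the uniform distribution, which equals $\log(N)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a 1-D array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0*log(0) is taken as 0
    return -np.sum(p * np.log(p))

N = 4  # hypothetical number of states
uniform = np.full(N, 1.0 / N)
H_uniform = entropy(uniform)

# Draw random distributions over N states; each Dirichlet sample sums to 1
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(N), size=1000)
H_samples = np.array([entropy(p) for p in samples])

print(f"Uniform entropy: {H_uniform:.4f} nats (log N = {np.log(N):.4f})")
print(f"Largest entropy among random distributions: {H_samples.max():.4f} nats")
```

None of the sampled distributions should exceed the uniform entropy.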

## What are the units of entropy?
It depends on the base of the logarithm! With the natural logarithm (base $e$, as used in the previous cases), the units are "nats". With base 2, the units are "bits"!

Now, let's revisit the three scenarios.


In [25]:
base = 2.
H1 = computeEntropy(P1, base)
H2 = computeEntropy(P2, base)
H3 = computeEntropy(P3, base)
print(f"H1: {H1} bits")
print(f"H2: {H2} bits")
print(f"H3: {H3} bits")

H1: -0.0 bits
H2: 0.4689955935892812 bits
H3: 1.0 bits




Now the values differ from those in nats by a constant factor of $1/\ln 2$. Bits are often more intuitive for binary cases. Here, $H_3=1$ since $-\log_2(1/2) = 1$.
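
Because changing the logarithm base only rescales the entropy, the values in bits can also be obtained directly from the values in nats. A minimal sketch, using the three values printed earlier for scenarios 1 to 3 (the array names are illustrative):

```python
import numpy as np

# Entropies of scenarios 1-3 in nats, copied from the outputs above
H_nats = np.array([0.0, 0.3250829733914482, 0.6931471805599453])

# log_2(p) = ln(p) / ln(2), so H in bits = H in nats / ln(2)
H_bits = H_nats / np.log(2)
print(H_bits)
```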

## Differential entropy

We have covered Shannon's entropy for discrete probability distributions. What about continuous probability distributions? This is where differential entropy applies (denoted by lowercase $h$):
\begin{align}
 h(X) &= - \int_{x \in X} f(x) \log(f(x)) \mathrm{d}x\tag{3},
\end{align}
where $f$ is the probability density function of $X$.

Differential entropy is very useful when the functional form of $f$ is known and an analytical solution for $h$ can be derived. Take the uniform distribution defined over $[a, b]$ as an example. The corresponding $h$ can be derived as:
\begin{align}
 h(X) = \log(b-a)\tag{4}.
\end{align}
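
As a short worked derivation of Eq. (4), substitute the uniform density $f(x) = \frac{1}{b-a}$ for $x \in [a, b]$ (and $f(x) = 0$ elsewhere) into Eq. (3):
\begin{align}
 h(X) &= - \int_{a}^{b} \frac{1}{b-a} \log\left(\frac{1}{b-a}\right) \mathrm{d}x \\
      &= \log(b-a) \cdot \frac{1}{b-a} \int_{a}^{b} \mathrm{d}x \\
      &= \log(b-a).
\end{align}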

For a zero-mean Gaussian distribution with variance $\sigma^2$, $f(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{\frac{-x^2}{2\sigma^2}}$, the corresponding $h$ is given as:
\begin{align}
 h(X) = \frac{1}{2}\log(2\pi e \sigma^2)\tag{5}.
\end{align}

**Note that differential entropy can be negative!** For example, Eq. (4) gives $h(X) < 0$ whenever $b - a < 1$.
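
The sketch below checks Eq. (5) numerically by approximating the integral in Eq. (3) with a simple Riemann sum on a fine grid. The value $\sigma = 0.1$, the grid limits of $\pm 10\sigma$, and the interval $[0, 0.5]$ are assumptions chosen for illustration; they also happen to give negative differential entropies.

```python
import numpy as np

sigma = 0.1  # hypothetical standard deviation
x = np.linspace(-10 * sigma, 10 * sigma, 200001)
dx = x[1] - x[0]

# Zero-mean Gaussian density evaluated on the grid
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Riemann-sum approximation of Eq. (3), in nats
h_numeric = -np.sum(f * np.log(f)) * dx
# Analytical value from Eq. (5)
h_analytic = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Uniform distribution on [a, b] with b - a < 1, using Eq. (4)
a, b = 0.0, 0.5
h_uniform = np.log(b - a)

print(f"Gaussian, numeric:  {h_numeric:.6f} nats")
print(f"Gaussian, analytic: {h_analytic:.6f} nats")
print(f"Uniform on [{a}, {b}]: {h_uniform:.6f} nats")
```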

# Mutual Information

In [None]:
#

In [None]:
# Analytical solution for normal distribution

In [None]:
# Homework:

# Flux tower data

In [None]:
# Binning methods

In [None]:
# Exploratory analysis


In [None]:
# Homework: lagged mutual information


# Homework

## Entropy

**Ex1**: Given a variable $X$ that has four states $a$, $b$, $c$, and $d$, what is its entropy if their probabilities are $P(X=a)=0$, $P(X=b)=0$, $P(X=c)=0$, and $P(X=d)=1$? Please use the natural logarithm (base $e$).

**Ex2**: In **Ex1**, if the probabilities follow a discrete uniform distribution, what is the entropy? Is it equal to $H_3$ in scenario 3? If not, why?

**Ex3**: Please prove the chain rule of the entropy $H(X,Y) = H(X) + H(Y|X)$.

**Ex4**: Please prove Eq.(5) that $h(X) = \frac{1}{2}\log(2\pi e \sigma^2)$ when $f(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{\frac{-x^2}{2\sigma^2}}$.

**Ex5**: TBD, related to the data used combined with the binning methods...

## Mutual information

# Further reading
- Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423.
- Cover, T., & Thomas, J. (2006). Chapter 2: Entropy, Relative Entropy, and Mutual Information. Elements of information theory (2nd ed.). New York: Wiley.
- Cover, T., & Thomas, J. (2006). Chapter 8: Differential Entropy. Elements of information theory (2nd ed.). New York: Wiley.
- Gong, W., Yang, D., Gupta, H. V., & Nearing, G. (2014). Estimating information entropy for hydrological data: One‐dimensional case. Water Resources Research, 50(6), 5003-5018.