## Mutual Information

Recall the notion of **dependence** and **independence** of events.

Events $A$ and $B$ are independent if $P(A|B) = P(A)$ (assuming $P(B) > 0$).

Equivalently, $A$ and $B$ are independent if $P(A\text{ and }B) = P(A)\cdot P(B)$.
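
To see why the two conditions agree, recall the definition of conditional probability. This is a short worked step, assuming $P(B) > 0$:

$$P(A|B) = \frac{P(A\text{ and }B)}{P(B)}, \quad\text{so}\quad P(A|B) = P(A) \iff P(A\text{ and }B) = P(A)\cdot P(B)$$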

For two events, we can calculate their **pointwise mutual information (PMI)** as $$\text{PMI}(A,B) = \ln\left(\frac{P(A\text{ and }B)}{P(A)\cdot P(B)}\right)$$

Note that if $A$ and $B$ are independent, then PMI$(A,B) = \ln(1) = 0$.
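
The sign of the PMI shows whether two events co-occur more or less often than independence would predict. The sketch below illustrates this with made-up probabilities; the helper `pmi` and the numbers passed to it are hypothetical and are not taken from any dataset.

```python
import math

def pmi(p_a_and_b, p_a, p_b):
    """Pointwise mutual information (natural log) of two events.

    Hypothetical helper for illustration; assumes all three
    probabilities are strictly positive.
    """
    return math.log(p_a_and_b / (p_a * p_b))

# Illustrative (made-up) marginals: P(A) = 0.5, P(B) = 0.4.
# Independent case: P(A and B) = P(A) * P(B), so PMI is ln(1) = 0.
pmi_independent = pmi(0.5 * 0.4, 0.5, 0.4)

# A and B co-occur more often than independence predicts: PMI > 0.
pmi_positive = pmi(0.3, 0.5, 0.4)

# A and B co-occur less often than independence predicts: PMI < 0.
pmi_negative = pmi(0.1, 0.5, 0.4)

print(pmi_independent, pmi_positive, pmi_negative)
```

A positive value means the joint probability exceeds the product of the marginals, and a negative value means it falls short.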

In [27]:
import pandas as pd
import numpy as np

Let's look at an example using the squirrels data.

In [14]:
squirrels = (
    pd.read_csv('../data/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv')
    .dropna(subset = ['Primary Fur Color', 'Runs from', 'Climbing'])
    .reset_index(drop = True)
)

In [16]:
pd.crosstab(squirrels['Primary Fur Color'], squirrels['Runs from'], normalize = 'index')

| Primary Fur Color | Runs from = False | Runs from = True |
| --- | --- | --- |
| Black | 0.68932 | 0.31068 |
| Cinnamon | 0.778061 | 0.221939 |
| Gray | 0.777194 | 0.222806 |


Let's consider the events $A$: Primary Fur Color = Black and $B$: Runs from = True.

In [20]:
# P(A)
prob_a = squirrels['Primary Fur Color'].value_counts(normalize = True)['Black']
prob_a

0.034703504043126686

In [22]:
# P(B)
prob_b = squirrels['Runs from'].value_counts(normalize = True)[True]
prob_b

0.22574123989218328

In [23]:
prob_a * prob_b

0.007834012031298814

In [24]:
pd.crosstab(squirrels['Primary Fur Color'], squirrels['Runs from'], normalize = True)

| Primary Fur Color | Runs from = False | Runs from = True |
| --- | --- | --- |
| Black | 0.023922 | 0.010782 |
| Cinnamon | 0.102763 | 0.029313 |
| Gray | 0.647574 | 0.185647 |


In [26]:
prob_a_and_b = pd.crosstab(squirrels['Primary Fur Color'], squirrels['Runs from'], normalize = True).loc['Black', True]
prob_a_and_b

0.01078167115902965

In [29]:
pmi = np.log(prob_a_and_b / (prob_a * prob_b))
pmi

0.31937280647235017

Interpretation: Since PMI$(A,B) > 0$, we observed more black squirrels that ran away than would be expected if primary fur color and running away were independent.

For discrete variables $X$ and $Y$, the **mutual information** between $X$ and $Y$ is given by

$$\text{MI}(X, Y) = \sum_{x \in X}\sum_{y \in Y} P(X = x\text{ and }Y = y)\cdot \ln\left(\frac{P(X = x\text{ and }Y = y)}{P(X = x)\cdot P(Y = y)}\right)$$

$$ = \sum_{x \in X}\sum_{y \in Y} P(X = x\text{ and }Y = y)\cdot \text{PMI}(x, y)$$

(You can also define mutual information for continuous variables, but it requires using an integral.)

If $X$ and $Y$ are independent, then MI$(X,Y) = 0$, since every PMI term in the sum is $\ln(1) = 0$.

It is also true (but harder to prove) that if MI$(X,Y) = 0$, then $X$ and $Y$ are independent (see [these notes](https://mathweb.ucsd.edu/~lrothsch/information.pdf) for a proof).
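
The double sum in the definition can be evaluated directly from a table of joint probabilities. The sketch below does this with only the standard library, using the rounded joint proportions printed in the normalized crosstab of fur color and running away shown earlier. The helper name `mutual_information` is hypothetical, and cells with zero probability are skipped under the usual convention that $0 \cdot \ln 0 = 0$.

```python
import math

# Joint proportions P(X = x and Y = y), copied (rounded) from the
# normalized crosstab of Primary Fur Color vs. Runs from.
joint = {
    ("Black", False): 0.023922, ("Black", True): 0.010782,
    ("Cinnamon", False): 0.102763, ("Cinnamon", True): 0.029313,
    ("Gray", False): 0.647574, ("Gray", True): 0.185647,
}

def mutual_information(joint_probs):
    """Sum P(x, y) * PMI(x, y) over all cells (natural log).

    Hypothetical helper for illustration; expects a dict that maps
    (x, y) pairs to joint probabilities.
    """
    # Marginals P(X = x) and P(Y = y) are row and column sums.
    p_x, p_y = {}, {}
    for (x, y), p in joint_probs.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p

    mi = 0.0
    for (x, y), p in joint_probs.items():
        if p > 0:  # skip empty cells: 0 * ln(0) is treated as 0
            mi += p * math.log(p / (p_x[x] * p_y[y]))
    return mi

mi_fur_runs = mutual_information(joint)
print(mi_fur_runs)
```

Because the inputs are rounded to six decimal places, the result only approximates the value computed from the raw counts.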

In [30]:
from sklearn.metrics import mutual_info_score

In [31]:
mutual_info_score(labels_true = None, labels_pred = None, 
                  contingency = pd.crosstab(squirrels['Primary Fur Color'], squirrels['Runs from']))

0.000689736693261167

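On its own, a mutual information score is hard to interpret (`mutual_info_score` uses the natural logarithm, so the value is in nats), but it can be used to compare how strongly different variables are associated with a given variable.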

In [38]:
pd.crosstab(squirrels['Primary Fur Color'], squirrels['Climbing'], normalize = 'index')

| Primary Fur Color | Climbing = False | Climbing = True |
| --- | --- | --- |
| Black | 0.757282 | 0.242718 |
| Cinnamon | 0.790816 | 0.209184 |
| Gray | 0.784472 | 0.215528 |


In [39]:
mutual_info_score(labels_true = None, labels_pred = None, 
                  contingency = pd.crosstab(squirrels['Primary Fur Color'], squirrels['Climbing']))

8.951238402454126e-05

In [44]:
mutual_info_score(labels_true = None, labels_pred = None, 
                  contingency = pd.crosstab(squirrels['Primary Fur Color'], squirrels['Approaches']))

0.003212864957886726

In [43]:
pd.crosstab(squirrels['Primary Fur Color'], squirrels['Approaches'], normalize = 'index')

| Primary Fur Color | Approaches = False | Approaches = True |
| --- | --- | --- |
| Black | 0.941748 | 0.058252 |
| Cinnamon | 0.887755 | 0.112245 |
| Gray | 0.94905 | 0.05095 |
