# Mutual information

Mutual information is an intuitive way of telling whether two variables are related.
How much does the sex of a Titanic passenger tell us about their likelihood of survival?
(And if we know a passenger survived, does that tell us anything about their sex?)

In many cases, you won't need to calculate mutual information by hand, but it's good to know how it works.
This notebook gives some examples of how to calculate mutual information, particularly when you need to take a weighted average.

As usual, we first load the necessary libraries.

In [1]:
import pandas as pd
from sklearn import metrics

## 1. Mutual information with equal groups of `Survived` values

To demonstrate mutual information, this notebook uses artificial data related to the Titanic passengers.
We focus on two variables—sex and age—because we know that women and younger passengers were more likely to survive.

In [2]:
data = [
    ( 'Male', '18+', True ), ( 'Male', '18+', False ), ( 'Male', '18-', False ), ( 'Male', '18-', False ),
    ( 'Male', '18+', False ), ( 'Male', '18+', False ), ( 'Male', '18+', False ), ( 'Male', '18+', False ),
    ( 'Male', '18+', False ), ( 'Male', '18-', False ),
    ( 'Female', '18-', False ), ( 'Female', '18+', True ), ( 'Female', '18+', True ), ( 'Female', '18+', True ),
    ( 'Female', '18-', True ), ( 'Female', '18-', True ), ( 'Female', '18-', True ), ( 'Female', '18-', True ),
    ( 'Female', '18-', True ), ( 'Female', '18-', True )
]
df = pd.DataFrame(data, columns=[ 'Sex', 'Age', 'Survived' ])
df

Unnamed: 0,Sex,Age,Survived
0,Male,18+,True
1,Male,18+,False
2,Male,18-,False
3,Male,18-,False
4,Male,18+,False
5,Male,18+,False
6,Male,18+,False
7,Male,18+,False
8,Male,18+,False
9,Male,18-,False


The above tells us two important pieces of information:

1. The number of passengers who survived and did not survive is the same: 10 passengers survived and 10 others did not
2. 90% of female passengers survived and 90% of male passengers did not survive.

Remember that mutual information can be described as the entropy of the `Survived` column (the label) minus the entropy that remains once we know the `Sex` column (or another feature):

$I(Survived; Sex) = H(Survived) - H(Survived|Sex)$

First up, let's calculate the mutual information between sex and whether a passenger survived.
To do so, we isolate these two columns.
Then, we group and aggregate passengers:

- Group passengers based on their `Sex`
- Aggregate passengers, describing how many survived within each group.

In [3]:
_df = df[[ 'Sex', 'Survived' ]] # isolate the `Sex` and `Survived` columns
_df.groupby([ 'Sex', 'Survived' ]).size() # get the counts of the passengers who survived, split by `Sex`

Sex     Survived
Female  False       1
        True        9
Male    False       9
        True        1
dtype: int64

### Calculating entropy of the `Survived` feature

There are two components in the above equation.
First up: $H(Survived)$, or the entropy of the `Survived` label.
Remember the equation of entropy:

$H(Survived) = - p(survived) log(p(survived)) - p(!survived) log(p(!survived))$

We know that $p(survived)$ is $0.5$ and $p(!survived)$ is $0.5$ from the above DataFrame, which gives us:

$H(Survived) = - 0.5 log(0.5) - 0.5 log(0.5)$

> Note: $p(!survived)$ means the probability that a passenger did *not* survive.
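
As a quick numerical check, the entropy formula can be sketched in plain Python. The helper name `entropy` is an assumption made for illustration (it is not part of the notebook), and it uses the natural logarithm, so the result is in nats.

```python
import math

def entropy(probabilities):
    """Entropy (in nats) of a discrete distribution given as a list of probabilities."""
    # Terms with p == 0 contribute nothing, so skip them to avoid log(0)
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# p(survived) = 0.5 and p(!survived) = 0.5
h_survived = entropy([0.5, 0.5])
print(h_survived)  # equals ln(2), roughly 0.693
```

A 50/50 split is the most uncertain case for a two-valued label, so this is the largest value that $H(Survived)$ can take.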

### Calculating entropy when using the `Sex` feature

Don't let the conditional entropy confuse you.
$H(Survived|Sex)$ is just telling us: how much uncertainty about survival remains once we know whether `Sex` is `Male` or `Female`?
To answer that, we look at the probability of a passenger surviving within each group.

Both groups are split similarly: 90% of female passengers survived and 90% of male passengers did not survive.
Therefore piping this information, again, into the entropy formula gives us:

$H(Survived|Male) = - p(Survived|Male) log(p(Survived|Male)) - p(!Survived|Male) log(p(!Survived|Male))$

$H(Survived|Male) = - 0.1 log(0.1) - 0.9 log(0.9)$

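Strictly speaking, $H(Survived|Sex)$ is the average of $H(Survived|Male)$ and $H(Survived|Female)$, weighted by the size of each group.
Since both groups are split similarly, the two entropies are equal, so we could just as well have used $H(Survived|Female)$: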

$H(Survived|Female) = - p(Survived|Female) log(p(Survived|Female)) - p(!Survived|Female) log(p(!Survived|Female))$

$H(Survived|Female) = - 0.9 log(0.9) - 0.1 log(0.1)$
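
The conditional entropy can also be computed directly from the group counts. The following is a minimal sketch, assuming the counts from the `groupby` output above are typed in by hand; the names `counts`, `entropy` and `h_conditional` are hypothetical and not part of the notebook.

```python
import math

def entropy(probabilities):
    """Entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Counts from the groupby output above: (survived, did not survive) per group
counts = {"Female": (9, 1), "Male": (1, 9)}
total = sum(sum(group) for group in counts.values())

h_conditional = 0.0
for sex, group in counts.items():
    size = sum(group)
    h_group = entropy([count / size for count in group])  # H(Survived|Sex = sex)
    h_conditional += (size / total) * h_group              # weight by p(Sex = sex)

print(h_conditional)  # roughly 0.325
```

Because both groups have the same size and the same 90-10 split, the weighted average is equal to the entropy of either group.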

### Calculating mutual information

What's next?
Calculating mutual information by combining $H(Survived)$ and $H(Survived|Sex)$:

$I(Survived; Sex) = H(Survived) - H(Survived|Sex)$

$I(Survived; Sex) = - 0.5 log(0.5) - 0.5 log(0.5) - H(Survived|Sex)$

$I(Survived; Sex) = (- 0.5 log(0.5) - 0.5 log(0.5)) - (- 0.1 log(0.1) -0.9 log(0.9))$

To confirm our calculation, we can use the `sklearn.metrics` module's `mutual_info_score` to calculate mutual information automatically.
This function uses a logarithm with base `e` ([Euler's number](https://en.wikipedia.org/wiki/E_(mathematical_constant))); this type of logarithm is normally referred to as `ln`.
Therefore, instead of $log$ in the previous cell, we just use $ln$:

$I(Survived; Sex) = (- 0.5 ln(0.5) - 0.5 ln(0.5)) - (- 0.1 ln(0.1) - 0.9 ln(0.9))$

In [4]:
# equivalent to (- 0.5 ln(0.5) - 0.5 ln(0.5)) - (- 0.1 ln(0.1) - 0.9 ln(0.9))
metrics.mutual_info_score(df.Survived, df.Sex)

0.36806420716849675
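
Mutual information has an equivalent definition in terms of the joint distribution, $I(X; Y) = \sum_{x,y} p(x,y) ln \frac{p(x,y)}{p(x) p(y)}$, which gives an independent way to check the value above. The sketch below uses only the standard library and rebuilds the (`Sex`, `Survived`) pairs from the counts shown earlier; the variable names are assumptions made for illustration.

```python
import math
from collections import Counter

# Rebuild the (Sex, Survived) pairs from the counts in the groupby output
pairs = (
    [("Female", True)] * 9 + [("Female", False)] * 1
    + [("Male", True)] * 1 + [("Male", False)] * 9
)
n = len(pairs)

joint = Counter(pairs)                                    # counts of each (sex, survived) pair
sex_counts = Counter(sex for sex, _ in pairs)             # marginal counts of Sex
survived_counts = Counter(survived for _, survived in pairs)  # marginal counts of Survived

mutual_information = 0.0
for (sex, survived), count in joint.items():
    p_xy = count / n
    p_x = sex_counts[sex] / n
    p_y = survived_counts[survived] / n
    mutual_information += p_xy * math.log(p_xy / (p_x * p_y))

print(mutual_information)  # roughly 0.368, in line with the output above
```

This form also makes it plain that mutual information is symmetric: swapping the roles of `Sex` and `Survived` leaves every term unchanged.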

## Mutual information of the `Age` feature

Let's do the same with the `Age` feature.
Again, there are 10 passengers who are 18+ and 10 passengers who are 18-.
And again, the two groups mirror each other: 60% of passengers who are 18+ did not survive, and 60% of passengers who are 18- survived.

In [5]:
_df = df[[ 'Age', 'Survived' ]] # isolate the `Age` and `Survived` columns
_df.groupby([ 'Age', 'Survived' ]).size() # get the counts of the passengers who survived, split by `Age`

Age  Survived
18+  False       6
     True        4
18-  False       4
     True        6
dtype: int64

### Calculating entropy when using the `Age` feature

The equation of Mutual Information is given below:

$I(Survived; Age) = H(Survived) - H(Survived|Age)$

We have already calculated $H(Survived)$, so we don't have to calculate it again.
We only need to calculate $H(Survived|Age)$.

Since 40% of 18+ passengers survived and 60% of 18+ passengers did not survive, we can substitute these values just like before.

$H(Survived|18+) = - p(Survived|18+) log(p(Survived|18+)) - p(!Survived|18+) log(p(!Survived|18+))$

$H(Survived|18+) = - 0.4 log(0.4) - 0.6 log(0.6)$

We could also have used $H(Survived|18-)$ because that group also has a 60/40 split:

$H(Survived|18-) = - p(Survived|18-) log(p(Survived|18-)) - p(!Survived|18-) log(p(!Survived|18-))$

$H(Survived|18-) = - 0.6 log(0.6) - 0.4 log(0.4)$

### Calculating mutual information

The last step, like before, is combining $H(Survived)$ and $H(Survived|Age)$.
We use the $H(Survived)$ value that we calculated previously:

$I(Survived; Age) = H(Survived) - H(Survived|Age)$

$I(Survived; Age) = - 0.5 log(0.5) - 0.5 log(0.5) - H(Survived|Age)$

$I(Survived; Age) = (- 0.5 log(0.5) - 0.5 log(0.5)) - (- 0.6 log(0.6) - 0.4 log(0.4))$

scikit-learn's `mutual_info_score` uses the same calculation, as you can see in the next code cell.

In [6]:
# equivalent to (-0.5 ln(0.5) -0.5 ln(0.5)) - (-0.6 ln(0.6) -0.4 ln(0.4))
metrics.mutual_info_score(df.Age, df.Survived)

0.02013551355068821
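
Because `mutual_info_score` works with natural logarithms, both scores so far are expressed in nats. If you prefer bits (base-2 logarithms), dividing by $ln(2)$ converts them. The sketch below simply reuses the two values printed above; the variable names are hypothetical.

```python
import math

# Values reported by mutual_info_score above, in nats (natural logarithm)
mi_sex_nats = 0.36806420716849675
mi_age_nats = 0.02013551355068821

# Dividing by ln(2) changes the logarithm base from e to 2
mi_sex_bits = mi_sex_nats / math.log(2)
mi_age_bits = mi_age_nats / math.log(2)

print(round(mi_sex_bits, 3), round(mi_age_bits, 3))
print(mi_sex_nats / mi_age_nats)  # the ratio does not depend on the base
```

Whichever unit is used, `Sex` tells us far more about survival than `Age` in this artificial data.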

## 2. Mutual information when the groups do not have the same proportions of survival

The next example is slightly different from the one before it.
It still has two predictor variables, `Sex` and `Age`, but the groups have different proportions of survival.
Before, 90% of `Male` passengers did not survive and 90% of `Female` passengers survived: two mirrored 90-10 splits.
This time, the splits are different.

First, we create the data.

In [7]:
data = [
    ( 'Male', '18+', True ), ( 'Male', '18+', True ), ( 'Male', '18-', True ), ( 'Male', '18-', False ),
    ( 'Male', '18+', False ), ( 'Male', '18+', False ), ( 'Male', '18+', False ), ( 'Male', '18+', False ),
    ( 'Male', '18+', False ), ( 'Male', '18-', False ),
    ( 'Female', '18-', False ), ( 'Female', '18+', True ), ( 'Female', '18+', True ), ( 'Female', '18+', True ),
    ( 'Female', '18-', True ), ( 'Female', '18-', True ), ( 'Female', '18-', True ), ( 'Female', '18-', True ),
    ( 'Female', '18-', True ), ( 'Female', '18-', True )
]
df = pd.DataFrame(data, columns=[ 'Sex', 'Age', 'Survived' ])
df

Unnamed: 0,Sex,Age,Survived
0,Male,18+,True
1,Male,18+,True
2,Male,18-,True
3,Male,18-,False
4,Male,18+,False
5,Male,18+,False
6,Male,18+,False
7,Male,18+,False
8,Male,18+,False
9,Male,18-,False


As you can see below, 90% of `Female` passengers survived and 10% did not, whereas 30% of `Male` passengers survived and 70% did not.
However, each group still has 10 passengers.

In [8]:
_df = df[[ 'Sex', 'Survived' ]] # isolate the `Sex` and `Survived` columns
_df.groupby([ 'Sex', 'Survived' ]).size() # get the counts of the passengers who survived, split by `Sex`

Sex     Survived
Female  False       1
        True        9
Male    False       7
        True        3
dtype: int64

The only thing that changes is that now we have to calculate entropy and mutual information for the `Male` passengers and the `Female` passengers separately.

### Calculating entropy when using the `Sex` variable

The entropy calculation is very similar to before.
This time, we'll just have different values for the entropy with the `Male` and `Female` values.
The value of the entropy for the `Male` passengers, $H(Survived|Male)$, is given below:

$H(Survived|Male) = - p(Survived|Male) log(p(Survived|Male)) - p(!Survived|Male) log(p(!Survived|Male))$

$H(Survived|Male) = - 0.3 log(0.3) - 0.7 log(0.7)$

The value of the entropy for the `Female` passengers, $H(Survived|Female)$, is given below:

$H(Survived|Female) = - p(Survived|Female) log(p(Survived|Female)) - p(!Survived|Female) log(p(!Survived|Female))$

$H(Survived|Female) = - 0.9 log(0.9) - 0.1 log(0.1)$

### Calculating entropy of the `Survived` feature

Before calculating the mutual information, we need to re-calculate the entropy of the `Survived` label.
Now, we have 60% of all passengers who survived and 40% who did not, which gives us:

$H(Survived) = - p(survived) log(p(survived)) - p(!survived) log(p(!survived))$

$H(Survived) = - 0.6 log(0.6) - 0.4 log(0.4)$

### Calculating mutual information

Last stop: calculating the mutual information by combining $H(Survived)$ and $H(Survived|Sex)$.
This time, we need to calculate the mutual information for male passengers and female passengers separately because $H(Survived|Male)$ is not the same as $H(Survived|Female)$.

$I(Survived; Male) = H(Survived) - H(Survived|Male)$

$I(Survived; Male) = - 0.6 log(0.6) - 0.4 log(0.4) - H(Survived|Male)$

$I(Survived; Male) = - 0.6 log(0.6) - 0.4 log(0.4) - (-0.3 log(0.3) - 0.7 log(0.7))$

---

$I(Survived; Female) = H(Survived) - H(Survived|Female)$

$I(Survived; Female) = - 0.6 log(0.6) - 0.4 log(0.4) - H(Survived|Female)$

$I(Survived; Female) = - 0.6 log(0.6) - 0.4 log(0.4) - (-0.9 log(0.9) - 0.1 log(0.1))$

---

Finally, we take the average of $I(Survived; Male)$ and $I(Survived; Female)$ to get the mutual information of the `Sex` variable.

In [9]:
# Male: (- 0.6 ln(0.6) - 0.4 ln(0.4)) - (- 0.3 ln(0.3) - 0.7 ln(0.7))
# Female: (- 0.6 ln(0.6) - 0.4 ln(0.4)) - (- 0.9 ln(0.9) - 0.1 ln(0.1))
# The next line is equivalent to the average of the above because male and female passengers are split equally
metrics.mutual_info_score(df.Survived, df.Sex)

0.20503802928608575
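
The same result can be reproduced by hand. The following is a minimal sketch using only the standard library; the helper `entropy` and the variable names are assumptions made for illustration.

```python
import math

def entropy(probabilities):
    """Entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

h_survived = entropy([0.6, 0.4])                # H(Survived)
gain_male = h_survived - entropy([0.3, 0.7])    # I(Survived; Male)
gain_female = h_survived - entropy([0.9, 0.1])  # I(Survived; Female)

# Both groups hold 10 of the 20 passengers, so a plain average is enough
mutual_information = (gain_male + gain_female) / 2
print(mutual_information)  # roughly 0.205, in line with the output above
```

Note how much smaller the contribution of the `Male` group is: a 30-70 split leaves nearly as much uncertainty as the overall 60-40 split.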

You can calculate the mutual information for the `Age` similarly.

## 3. Mutual information when the groups have different sizes

The next example is a bit different from before.
This time, we have 9 `Female` passengers and 11 `Male` passengers instead of two groups of 10 passengers.

In [10]:
# Same data as in the previous example, except that one surviving 18+ female passenger is now male
data = [
    ( 'Male', '18+', True ), ( 'Male', '18+', True ), ( 'Male', '18-', True ), ( 'Male', '18-', False ),
    ( 'Male', '18+', False ), ( 'Male', '18+', False ), ( 'Male', '18+', False ), ( 'Male', '18+', False ),
    ( 'Male', '18+', False ), ( 'Male', '18-', False ),
    ( 'Female', '18-', False ), ( 'Male', '18+', True ), ( 'Female', '18+', True ), ( 'Female', '18+', True ),
    ( 'Female', '18-', True ), ( 'Female', '18-', True ), ( 'Female', '18-', True ), ( 'Female', '18-', True ),
    ( 'Female', '18-', True ), ( 'Female', '18-', True )
]
df = pd.DataFrame(data, columns=[ 'Sex', 'Age', 'Survived' ])
df

Unnamed: 0,Sex,Age,Survived
0,Male,18+,True
1,Male,18+,True
2,Male,18-,True
3,Male,18-,False
4,Male,18+,False
5,Male,18+,False
6,Male,18+,False
7,Male,18+,False
8,Male,18+,False
9,Male,18-,False


Now there are two differences from the first example:

1. 12 passengers survived and 8 did not, so the two outcomes are not equally likely (unlike the first example, where 10 survived and 10 did not).
2. The groups have different sizes: there are 9 women and 11 men.

Still, the process is very similar to before.

In [11]:
_df = df[[ 'Sex', 'Survived' ]] # isolate the `Sex` and `Survived` columns
_df.groupby([ 'Sex', 'Survived' ]).size() # get the counts of the passengers who survived, split by `Sex`

Sex     Survived
Female  False       1
        True        8
Male    False       7
        True        4
dtype: int64

### Calculating entropy when using the `Sex` variable

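The entropy calculation is very similar to before: we calculate the entropy for men and women separately, this time using the counts within each group.
The value of the entropy for the `Male` passengers, $H(Survived|Male)$, is given below: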

$H(Survived|Male) = - p(Survived|Male) log(p(Survived|Male)) - p(!Survived|Male) log(p(!Survived|Male))$

$H(Survived|Male) = - 4/11 log(4/11) - 7/11 log(7/11)$

The value of the entropy for the `Female` passengers, $H(Survived|Female)$, is given below:

$H(Survived|Female) = - p(Survived|Female) log(p(Survived|Female)) - p(!Survived|Female) log(p(!Survived|Female))$

$H(Survived|Female) = - 8/9 log(8/9) - 1/9 log(1/9)$

### Calculating entropy of the `Survived` feature

Calculating the entropy of the `Survived` feature is identical to before.
12 of the 20 passengers (60%) survived and 8 (40%) did not, which gives us:

$H(Survived) = - p(survived) log(p(survived)) - p(!survived) log(p(!survived))$

$H(Survived) = - 0.6 log(0.6) - 0.4 log(0.4)$

### Calculating mutual information

Last stop: calculating the mutual information by combining $H(Survived)$ and $H(Survived|Sex)$.
This time, we need to calculate the mutual information for male passengers and female passengers separately because $H(Survived|Male)$ is not the same as $H(Survived|Female)$.

$I(Survived; Male) = H(Survived) - H(Survived|Male)$

$I(Survived; Male) = - 0.6 log(0.6) - 0.4 log(0.4) - H(Survived|Male)$

$I(Survived; Male) = - 0.6 log(0.6) - 0.4 log(0.4) - (- 4/11 log(4/11) - 7/11 log(7/11))$

---

$I(Survived; Female) = H(Survived) - H(Survived|Female)$

$I(Survived; Female) = - 0.6 log(0.6) - 0.4 log(0.4) - H(Survived|Female)$

$I(Survived; Female) = - 0.6 log(0.6) - 0.4 log(0.4) - (- 8/9 log(8/9) - 1/9 log(1/9))$

The only difference from the previous example is here.
Since the groups of women and men are not equal, this time we take the **weighted average** instead of the average.

$I(Survived; Sex) = \frac{|Sex = Male|}{|Passengers|} I(Survived; Male) + \frac{|Sex = Female|}{|Passengers|} I(Survived; Female)$
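
To see why the weighted average is the right choice, it helps to expand it. The short derivation below is a sketch that writes $p(Male)$ for $\frac{|Sex = Male|}{|Passengers|}$ and $p(Female)$ for $\frac{|Sex = Female|}{|Passengers|}$ (shorthand introduced here, not used elsewhere in the notebook) and relies on $p(Male) + p(Female) = 1$:

$p(Male) I(Survived; Male) + p(Female) I(Survived; Female)$

$= p(Male) (H(Survived) - H(Survived|Male)) + p(Female) (H(Survived) - H(Survived|Female))$

$= H(Survived) - (p(Male) H(Survived|Male) + p(Female) H(Survived|Female))$

$= H(Survived) - H(Survived|Sex)$

The last step uses the definition of conditional entropy as the group entropies weighted by group size, so the weighted average brings us back to the formula from the first example.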

In [12]:
# Male: (-0.6 ln(0.6) -0.4 ln(0.4)) - (-4/11 ln(4/11) -7/11 ln(7/11))
# Female: (-0.6 ln(0.6) -0.4 ln(0.4)) - (-8/9 ln(8/9) -1/9 ln(1/9))
# Weighted average: 0.0175298931078637 * 11/20 + 0.3241795711662245 * 9/20
metrics.mutual_info_score(df.Survived, df.Sex)

0.15552224823412555
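
The weighted average can be checked by hand as well. The following is a minimal sketch using only the standard library, assuming the counts from the `groupby` output above are typed in by hand; all names are hypothetical. It also computes the plain average to show that the two differ once the groups have different sizes.

```python
import math

def entropy(probabilities):
    """Entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Counts from the groupby output above: (survived, did not survive) per group
counts = {"Male": (4, 7), "Female": (8, 1)}
total = sum(sum(group) for group in counts.values())
survived = sum(group[0] for group in counts.values())

h_survived = entropy([survived / total, 1 - survived / total])  # H(Survived)

gains, weights = {}, {}
for sex, group in counts.items():
    size = sum(group)
    gains[sex] = h_survived - entropy([count / size for count in group])
    weights[sex] = size / total  # share of all passengers in this group

weighted = sum(weights[sex] * gains[sex] for sex in counts)
unweighted = sum(gains.values()) / len(gains)

print(weighted)    # roughly 0.1555, in line with the output above
print(unweighted)  # roughly 0.1709, a different value because it ignores group sizes
```

Only the weighted version agrees with `mutual_info_score`; the plain average gives the smaller `Female` group more influence than its share of the passengers warrants.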

You can work out the mutual information of the `Age` similarly.