# Mutual Information

## Overview:
**Mutual Information (MI)** is a concept from *Information Theory* that measures the amount of information that is shared between two random variables. <u>Mutual Information quantifies the degree to which knowledge of one variable reduces uncertainty about another variable</u>. In simpler terms, Mutual Information tells us how much knowing the value of one variable helps us predict the value of another variable.

<span style="font-size: 11pt; color: steelblue; font-weight: bold">Mutual Information is crucial in Machine Learning for feature selection, dimensionality reduction, and building effective models.</span> 
- <span style="font-size: 11pt; color: mediumseagreen; font-weight: bold">It helps in identifying the most informative features that contribute to predictive accuracy.</span>

***
Mutual Information was first introduced by <span style="font-size: 14pt; color: goldenrod; font-weight: bold">Claude Shannon</span>, the father of information theory, in his landmark paper "*A Mathematical Theory of Communication*" published in 1948. It has since become a fundamental concept in various fields including statistics, information theory, machine learning, and more.
***

## Things to note:
- MI can be used to detect dependencies or relationships between variables, but <span style="font-size: 11pt; color: orange; font-weight: bold">MI doesn't imply causation</span>. Just as with correlation, a high MI value does not tell us that one variable causes the other.
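- <span style="font-size: 11pt; color: orange; font-weight: bold">MI estimates for continuous variables are sensitive to how the data is scaled and binned</span>. MI itself is unchanged by invertible transformations of each variable, but binning-based estimates are not, so preprocessing and normalization may be necessary.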
- The formula below assumes discrete variables. For continuous variables MI is defined with probability densities instead of PMFs, and in practice it is often approximated using discretization techniques.
- <span style="font-size: 11pt; color: orange; font-weight: bold">Pairwise MI might not capture more complex relationships</span> that involve several variables at once, such as a target that depends on the interaction of two features. Measures like conditional mutual information are designed for those cases.
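
A minimal sketch of that last limitation, using only the standard library. The helper `mutual_information` and the XOR toy data are assumptions made for this illustration, not part of the original notebook. Each input on its own shares no information with the XOR output, while the two inputs taken together determine it completely.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Plug-in MI estimate (in bits) for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Only observed (x, y) pairs are summed, so zero-probability cells add nothing
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# All four equally likely combinations of two fair binary inputs
pairs = list(product([0, 1], repeat=2))
x = [a for a, b in pairs]
y = [b for a, b in pairs]
z = [a ^ b for a, b in pairs]  # z is the XOR of x and y

mi_x_z = mutual_information(x, z)
mi_y_z = mutual_information(y, z)
mi_xy_z = mutual_information(pairs, z)  # treat (x, y) as one joint variable

print(mi_x_z, mi_y_z, mi_xy_z)  # 0.0 0.0 1.0
```

Ranking features one at a time by MI would discard both `x` and `y` here, even though together they predict `z` perfectly.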

## Formulas:
The Mutual Information between two discrete random variables X and Y is typically denoted as MI(X; Y) and is calculated using the following formula:

$$\large MI(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log\left(\frac{p(x, y)}{p(x) \cdot p(y)}\right) $$

Where:
- $p(x, y)$ is the joint PMF of X and Y.
- $p(x)$ and $p(y)$ are the marginal PMFs of X and Y respectively.
- The logarithm is usually taken to base 2 (MI is then measured in bits) or base $e$ (the natural logarithm, MI is then measured in nats).
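
The formula can be sketched as a small function that estimates the PMFs from observed counts. This is an illustrative plug-in estimate rather than a reference implementation; the function name `mutual_information`, its `base` parameter, and the sample lists are assumptions introduced here.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys, base=2):
    """Estimate MI(X; Y) from paired observations of two discrete variables."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))  # counts for p(x, y)
    x_counts = Counter(xs)               # counts for p(x)
    y_counts = Counter(ys)               # counts for p(y)

    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        # Pairs that never occur are absent from joint_counts,
        # which matches the convention 0 * log(0) = 0
        mi += p_xy * log(p_xy / (p_x * p_y), base)
    return mi


# Hypothetical sample data
x = [0, 0, 1, 1]
y = [0, 1, 1, 1]
mi_bits = mutual_information(x, y)          # base 2, result in bits
mi_nats = mutual_information(x, y, base=2.718281828459045)  # approximately nats
print(f"{mi_bits:.4f} bits")
```

Estimating probabilities from counts is only reliable when every combination of values is observed often enough; with few samples the estimate tends to be biased upward.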

## Interpreting MI values:
Understanding the values of Mutual Information and their implications can provide insights into the relationships between variables. The upper end of the range depends on the data: for discrete variables, MI cannot exceed the entropy of either variable.

- **MI = 0**: This indicates that the variables are independent; knowing one provides no information about the other.
- **MI > 0**: Positive MI values indicate some level of dependence or association between the variables. The higher the MI value, the stronger the association.
- **MI < 0**: In theory this cannot happen, because MI is never negative. A negative value can only be an artifact of the estimator (for example, some bias-corrected or continuous-variable estimators) or of numerical noise, and is usually treated as zero.

For binary features, with a base-2 logarithm, MI values fall in the range $[0,1]$ bits, where 0 corresponds to complete independence and 1 to perfect dependence between two variables that each take both values equally often.
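
As a quick check of these two endpoints, consider two fair binary variables (an assumption made only for this illustration) and a base-2 logarithm. If $Y$ is an exact copy of $X$, only the outcomes $(0,0)$ and $(1,1)$ occur, each with probability $\frac{1}{2}$:

$$\large MI(X; Y) = 2 \cdot \frac{1}{2} \log_2\left(\frac{1/2}{1/2 \cdot 1/2}\right) = \log_2 2 = 1 $$

If instead $X$ and $Y$ are independent, all four outcomes have probability $\frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2}$, so every logarithm vanishes:

$$\large MI(X; Y) = 4 \cdot \frac{1}{4} \log_2\left(\frac{1/4}{1/2 \cdot 1/2}\right) = \log_2 1 = 0 $$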

## Applications:
Mutual Information has various use cases across different fields, including, but not limited to:
- **Feature Selection**: MI can be used to select relevant features in machine learning by measuring the amount of information that a feature provides about the target variable.
- **Clustering**: MI can help in determining the similarity between clusters in unsupervised learning.
- **Natural Language Processing**: It's used to measure the relationship between words in text corpora.
- **Information Retrieval**: It's used to measure the similarity between documents and query terms in information retrieval systems.
- **Image Registration**: In medical imaging, MI can be used to align and register images from different modalities.
- **Neuroscience**: It's used to measure the relationship between neural activities and stimuli.
- **Bioinformatics**: MI is used in genetics to understand the relationships between different genetic markers.

# Worked Example

#### Manual computation

In [1]:
from sklearn.metrics import mutual_info_score
import pandas as pd
import numpy as np

In [2]:
# Where '0' stands for 'no' and '1' stands for 'yes'
index = ["James", "Michael", "David", "John", "Robert", "William", "Benjamin", "Christopher"]
cols = {'drinks_soda':   [0,1,0,0,1,0,1,1],
        'eats_fastfood': [1,0,1,1,0,0,1,0],
        'does_sports':   [0,0,1,0,0,1,0,0],
        'is_overweight': [1,0,0,1,0,0,1,0]}

df = pd.DataFrame(cols, index=index)
df

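| | drinks_soda | eats_fastfood | does_sports | is_overweight |
|---|---|---|---|---|
| James | 0 | 1 | 0 | 1 |
| Michael | 1 | 0 | 0 | 0 |
| David | 0 | 1 | 1 | 0 |
| John | 0 | 1 | 0 | 1 |
| Robert | 1 | 0 | 0 | 0 |
| William | 0 | 0 | 1 | 0 |
| Benjamin | 1 | 1 | 0 | 1 |
| Christopher | 1 | 0 | 0 | 0 |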


Now let us calculate the Mutual Information between the feature `'eats_fastfood'` and the target variable `'is_overweight'`.   
In other words, let's find out how much knowing whether a person eats fast food helps us predict whether that person is overweight.

First, let's compute the marginal probabilities of `'eats_fastfood'` and `'is_overweight'`:

In [3]:
# out of 8 people, 4 of them eat fastfood, so
marginal_pmf_eats_fastfood = 4/8 

# out of 8 people, 3 of them are overweight, so
marginal_pmf_is_overweight = 3/8

print(f'Marginal PMF for "eats_fastfood" = 1: {marginal_pmf_eats_fastfood}')
print(f'Marginal PMF for "is_overweight" = 1: {marginal_pmf_is_overweight}')

Marginal PMF for "eats_fastfood" = 1: 0.5
Marginal PMF for "is_overweight" = 1: 0.375


Second, we compute joint PMFs for `'eats_fastfood'` and `'is_overweight'`.  
In total we will have to compute 4 joint PMFs:
1. eats_fastfood = 'yes' and is_overweight = 'yes'
2. eats_fastfood = 'yes' and is_overweight = 'no'
3. eats_fastfood = 'no' and is_overweight = 'yes'
4. eats_fastfood = 'no' and is_overweight = 'no'

In [4]:
# eats_fastfood = 'yes' and is_overweight = 'yes'
pmf_yes_yes = 3/8
print(f'Joint probability: eats_fastfood = "yes" and is_overweight = "yes" = {pmf_yes_yes}')

# eats_fastfood = 'yes' and is_overweight = 'no'
pmf_yes_no = 1/8
print(f'Joint probability: eats_fastfood = "yes" and is_overweight = "no" = {pmf_yes_no}')

# eats_fastfood = 'no' and is_overweight = 'yes'
pmf_no_yes = 0/8
print(f'Joint probability: eats_fastfood = "no" and is_overweight = "yes" = {pmf_no_yes}')

# eats_fastfood = 'no' and is_overweight = 'no'
# (Michael, Robert, William and Christopher: 4 out of 8 people)
pmf_no_no = 4/8
print(f'Joint probability: eats_fastfood = "no" and is_overweight = "no" = {pmf_no_no}')

Joint probability: eats_fastfood = "yes" and is_overweight = "yes" = 0.375
Joint probability: eats_fastfood = "yes" and is_overweight = "no" = 0.125
Joint probability: eats_fastfood = "no" and is_overweight = "yes" = 0.0
Joint probability: eats_fastfood = "no" and is_overweight = "no" = 0.5
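
Counting by hand is error-prone, so it can help to derive the same probabilities directly from the two columns and confirm that the four joint probabilities sum to 1. The following is a small standard-library sketch; the variable names are chosen for this illustration and the lists repeat the columns of the table above.

```python
from collections import Counter

# Columns copied from the table above (1 = 'yes', 0 = 'no')
eats_fastfood = [1, 0, 1, 1, 0, 0, 1, 0]
is_overweight = [1, 0, 0, 1, 0, 0, 1, 0]
n = len(eats_fastfood)

# Joint PMF: share of people for each (eats_fastfood, is_overweight) combination
joint_counts = Counter(zip(eats_fastfood, is_overweight))
joint_pmf = {(x, y): joint_counts[(x, y)] / n for x in (0, 1) for y in (0, 1)}

# Marginal PMFs: sum the joint PMF over the other variable
pmf_fastfood = {x: joint_pmf[(x, 0)] + joint_pmf[(x, 1)] for x in (0, 1)}
pmf_overweight = {y: joint_pmf[(0, y)] + joint_pmf[(1, y)] for y in (0, 1)}

for (x, y), p in joint_pmf.items():
    print(f"p(eats_fastfood={x}, is_overweight={y}) = {p}")

total = sum(joint_pmf.values())
print("Sum of joint probabilities:", total)  # must equal 1.0 for a valid PMF
```

If the sum differs from 1, at least one combination was miscounted.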


Now we plug in the numbers into Mutual Information formula:

$$\large MI(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log\left(\frac{p(x, y)}{p(x) \cdot p(y)}\right) $$

In [5]:
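# Note that we are using the logarithm of base 2, so the result is in bits
# The term for eats_fastfood = 'no' and is_overweight = 'yes' has p(x, y) = 0
# and is skipped, following the convention 0 * log(0) = 0
# Marginals: P(eats_fastfood = 'yes') = P(eats_fastfood = 'no') = 0.5,
#            P(is_overweight = 'yes') = 0.375, P(is_overweight = 'no') = 0.625
# Joint for 'no'/'no' is 4/8 = 0.5 (4 of the 8 people)
MI = 0.375 * np.log2(0.375 / (0.5 * 0.375)) + \
     0.125 * np.log2(0.125 / (0.5 * 0.625)) + \
     0.500 * np.log2(0.500 / (0.5 * 0.625))

print(f'Mutual Information between "eats_fastfood" and "is_overweight" = {MI:.4f} bits')

Mutual Information between "eats_fastfood" and "is_overweight" = 0.5488 bits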


So the Mutual Information between the two examined variables is clearly positive, which suggests that knowing the value of the `'eats_fastfood'` feature provides useful information about the target variable.  

Since for binary features (with a base-2 logarithm) MI takes values in the range $[0,1]$:  
`'eats_fastfood'` possesses moderate predictive power for determining whether a person is overweight. The target variable `'is_overweight'` moderately depends on the feature `'eats_fastfood'`.
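
To put this score in context, the same computation can be repeated for every feature and the results ranked, which is the basic idea behind MI-based feature selection. This is a standard-library sketch on the toy table above; the helper `mutual_information` is an assumption introduced for this illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in MI estimate (in bits) for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Feature columns and target copied from the table above
features = {
    "drinks_soda":   [0, 1, 0, 0, 1, 0, 1, 1],
    "eats_fastfood": [1, 0, 1, 1, 0, 0, 1, 0],
    "does_sports":   [0, 0, 1, 0, 0, 1, 0, 0],
}
is_overweight = [1, 0, 0, 1, 0, 0, 1, 0]

scores = {name: mutual_information(col, is_overweight) for name, col in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.4f} bits")
```

On this toy data `eats_fastfood` ranks first, followed by `does_sports` and `drinks_soda`. With only 8 rows the ranking is illustrative, not a reliable estimate.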

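#### Via scikit-learn's `mutual_info_score`

The same quantity can be computed in just one line of code with the help of the scikit-learn library. Note that `mutual_info_score` uses the natural logarithm, so its result is expressed in nats rather than bits; dividing it by $\ln 2$ converts it to bits for comparison with a base-2 manual computation: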

In [6]:
mi_score = mutual_info_score(df['eats_fastfood'], df['is_overweight'])
print("Mutual Information:", mi_score)

Mutual Information: 0.38039566584857787
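
The score above is in nats because `mutual_info_score` works with the natural logarithm. A short sketch of the unit conversion, with the printed value pasted in as a constant so the snippet runs without scikit-learn:

```python
from math import log

mi_nats = 0.38039566584857787  # value printed by mutual_info_score above
mi_bits = mi_nats / log(2)     # 1 nat = 1 / ln(2) bits

print(f"{mi_bits:.4f} bits")  # 0.5488 bits
```

Expressed in bits, the scikit-learn score can be compared directly with a manual computation that uses `np.log2`.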
