## Mutual Information

First encountering a new dataset can sometimes feel overwhelming. One might be presented with hundreds or thousands of features without even a description to go by. Where to even begin?

A good first step is to construct a **ranking** with a **feature utility metric**, a function that measures the association between a feature and the target. With that ranking, one can choose a smaller set of the most useful features to develop first and have more confidence that the time will be well spent.

One such metric is [mutual information](https://medium.com/swlh/a-deep-conceptual-guide-to-mutual-information-a5021031fad0) (MI).

Advantages compared to Pearson correlation:
- It can detect any kind of relationship, while correlation only detects linear relationships;
- It is easy to use and interpret;
- It is computationally efficient;
- It is theoretically well-founded;
- It is resistant to overfitting.
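
The first point can be checked on a toy case. The sketch below uses made-up data and two hypothetical helpers (`pearson` and `mutual_information`, written here for illustration only) with a target that is fully determined by the feature, but not linearly. It estimates MI from empirical frequencies, which is only reasonable for discrete values.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mutual_information(xs, ys):
    """Mutual information in bits, from empirical frequencies of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


xs = [-2, -1, 0, 1, 2] * 20  # toy feature, symmetric around zero
ys = [x ** 2 for x in xs]    # target fully determined by x, but not linearly

r = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} bits")
```

On this data the correlation is zero because the positive and negative halves cancel, while the MI is clearly positive: knowing `x` removes all uncertainty about `y`.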

### What does it measure?

Mutual information describes relationships in terms of uncertainty. The mutual information between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. **If you knew the value of a feature, how much more confident would you be about the target?**

Uncertainty is measured using **entropy**. Roughly, the entropy of a variable is:
- **How many yes-or-no questions would you need, on average, to describe an occurrence of that variable?**
    - The more questions you have to ask, the more **uncertain** you must be about the variable (more entropy).

Mutual information is:
- **How many of those questions do you expect the feature to answer about the target?**
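
The "questions" picture can be made concrete with a small sketch. The data and the helper names (`entropy`, `conditional_entropy`) are hypothetical, and the computation assumes discrete values counted by frequency. MI is taken as the entropy of the target minus the entropy that remains once the feature is known.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy in bits: average number of yes-or-no questions needed."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(target, feature):
    """Average entropy left in `target` once the value of `feature` is known."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    n = len(target)
    return sum(len(g) / n * entropy(g) for g in groups.values())


# Hypothetical toy data: four equally likely body styles (the target)
# and a feature saying whether the car has two doors.
body_style = ["sedan", "wagon", "coupe", "convertible"] * 25
two_doors = [style in ("coupe", "convertible") for style in body_style]

h_target = entropy(body_style)                        # questions needed without the feature
h_after = conditional_entropy(body_style, two_doors)  # questions still needed with it
mi = h_target - h_after                               # questions the feature answers

print(f"H(target) = {h_target:.1f}, H(target | feature) = {h_after:.1f}, MI = {mi:.1f} bits")
```

Here the target needs two questions, the feature answers one of them, and one remains.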


### Interpreting Mutual Information Scores

1. The least possible mutual information is 0.0;
    - Meaning the quantities are independent: neither can tell you anything about the other;
2. In theory there is no upper bound, but in practice values above 2.0 are uncommon;
3. Mutual information is a logarithmic quantity, so it increases very slowly.
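
The three points above follow from the standard definition for discrete variables. As a reference sketch (the symbols $X$ for the feature and $Y$ for the target are notation introduced here, not taken from the text):

$$
I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = H(Y) - H(Y \mid X),
\qquad
H(Y) = -\sum_{y} p(y)\,\log p(y)
$$

$$
0 \le I(X;Y) \le \min\{H(X),\, H(Y)\},
\qquad
I(X;Y) = 0 \iff p(x,y) = p(x)\,p(y)
$$

If the target takes $k$ equally likely values, then $H(Y) = \log k$, so even a feature that determines the target completely scores only $\log k$. Doubling that score requires squaring $k$, which is why large values are rare. The base of the logarithm sets the unit: base 2 gives bits, base $e$ gives nats.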


![](images/2022-06-09-18-04-31.png)

When applying MI, keep the following in mind:
1. MI can help you understand the relative potential of a feature as a predictor of the target, considered by itself;
2. It is possible for a feature to be very informative when interacting with other features, but not so informative on its own;
3. MI **can't detect interactions** between features: it is a **univariate metric**;
4. The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. A high MI score does not mean your model will be able to do anything with that information; you may need to transform the feature first to expose the association.
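
Points 2 and 3 can be illustrated with an extreme, made-up case: a target that is the XOR of two binary features. The sketch below uses hypothetical helpers and the identity I(X; Y) = H(X) + H(Y) - H(X, Y) on empirical frequencies. Each feature scores zero on its own, even though the pair determines the target completely.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(values):
    """Shannon entropy in bits, from empirical frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Toy data: every combination of two binary features, equally often.
rows = list(product([0, 1], repeat=2)) * 25
feature_a = [a for a, _ in rows]
feature_b = [b for _, b in rows]
target = [a ^ b for a, b in rows]  # XOR: depends only on the interaction

mi_a = mutual_information(feature_a, target)  # univariate score of feature_a
mi_b = mutual_information(feature_b, target)  # univariate score of feature_b
mi_pair = mutual_information(rows, target)    # both features taken jointly

print(f"MI(a) = {mi_a:.1f}, MI(b) = {mi_b:.1f}, MI(a and b jointly) = {mi_pair:.1f} bits")
```

A ranking built from the univariate scores alone would discard both features here, so a low MI score is not proof that a feature is useless.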

### Example

In [8]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("seaborn-v0_8-whitegrid")  # this style was named "seaborn-whitegrid" before Matplotlib 3.6

df = pd.read_csv("data/autos.csv")
df.head()

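| Unnamed: 0 | symboling | make | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location | wheel_base | length | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | alfa-romero | gas | std | 2 | convertible | rwd | front | 88.6 | 168.8 | ... | 130 | mpfi | 3.47 | 2.68 | 9 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | alfa-romero | gas | std | 2 | convertible | rwd | front | 88.6 | 168.8 | ... | 130 | mpfi | 3.47 | 2.68 | 9 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | alfa-romero | gas | std | 2 | hatchback | rwd | front | 94.5 | 171.2 | ... | 152 | mpfi | 2.68 | 3.47 | 9 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | audi | gas | std | 4 | sedan | fwd | front | 99.8 | 176.6 | ... | 109 | mpfi | 3.19 | 3.4 | 10 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | audi | gas | std | 4 | sedan | 4wd | front | 99.4 | 176.6 | ... | 136 | mpfi | 3.19 | 3.4 | 8 | 115 | 5500 | 18 | 22 | 17450 |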
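
A natural next step with a table like this is to score every feature against a target column. The following is only a sketch under stated assumptions: it uses scikit-learn's `mutual_info_regression`, which these notes do not name. The helper `make_mi_scores` and the synthetic stand-in data are hypothetical. Treating every integer column as discrete is a simplification that should be reviewed per dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def make_mi_scores(X: pd.DataFrame, y: pd.Series, random_state: int = 0) -> pd.Series:
    """Rank the columns of X by estimated mutual information with y."""
    X = X.copy()
    # Label-encode non-numeric columns so the estimator can treat them as discrete.
    for col in X.columns:
        if not pd.api.types.is_numeric_dtype(X[col]):
            X[col], _ = X[col].factorize()
    # Assumption of this sketch: integer-typed columns are discrete features.
    discrete = [pd.api.types.is_integer_dtype(dtype) for dtype in X.dtypes]
    scores = mutual_info_regression(
        X, y, discrete_features=discrete, random_state=random_state
    )
    return pd.Series(scores, index=X.columns, name="MI Scores").sort_values(ascending=False)


# Synthetic stand-in for a real table: the target depends on `signal` only.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
    "group": rng.choice(["a", "b", "c"], size=500),
})
target = demo["signal"] ** 2 + 0.1 * rng.normal(size=500)

mi_scores = make_mi_scores(demo, target)
print(mi_scores)
```

With a real table, `X` would be the feature columns and `y` the target column (for example `price`). Scores from this estimator are in nats, and they vary slightly with `random_state` because the estimator adds a small amount of noise to continuous features.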

