### Mutual information technique

Mutual information (MI) is a feature utility metric: a function that measures the relationship between a feature and the target. By scoring every feature with such a metric we can construct a ranking, then pick a smaller set of the most useful features to develop our model with. This gives us more confidence that our time will be well spent.

Mutual information is similar to correlation in that it measures a relationship between two variables. However, mutual information can detect any kind of relationship, while correlation only detects linear relationships.
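
To make the contrast with correlation concrete, here is a minimal sketch on synthetic data (not the automobile dataset). It assumes scikit-learn's `mutual_info_regression` as the MI estimator, and the variable names are illustrative only. The target is a purely quadratic function of the feature, so the Pearson correlation should be essentially zero while the estimated MI is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic, symmetric feature and a purely quadratic (non-linear) target
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

# Pearson correlation only captures the linear part of the relationship
pearson_r = np.corrcoef(x, y)[0, 1]

# scikit-learn expects a 2-D feature matrix; random_state fixes the small
# noise the estimator adds to continuous variables
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r: {pearson_r:.3f}")
print(f"MI score:  {mi:.3f}")
```

MI scores from scikit-learn are estimates in nats, so the exact value depends on the sample and on the estimator settings; only the comparison with the correlation matters here.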

#### --------------------
Mutual information is a great general-purpose metric, and it is especially useful at the start of feature development, when we might not yet know which model we would like to use.
#### --------------------

Mutual information is easy to use and interpret, computationally efficient, theoretically well-founded, resistant to overfitting, and able to detect any kind of relationship.

### What does mutual information measure?

* Mutual information (MI) describes relationships in terms of uncertainty: the MI between two quantities measures how much knowing one of them reduces uncertainty about the other.
* MI helps us understand the relative potential of a feature as a predictor of the target, considered by itself.
* MI cannot detect interactions between features; it is a univariate metric. A feature can be very informative when interacting with other features, yet not so informative on its own.
* The actual usefulness of a feature depends on the model we use it with. A feature is only useful to the extent that its relationship with the target is one our model can learn. A high MI score does not guarantee that our model will be able to do anything with that information; we may need to transform the feature first to expose the relationship.
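
The first and third points can be illustrated with a small hand-rolled sketch. The helpers `entropy` and `mutual_information` below are illustrative functions written for this example (they are not a library API), and the XOR data is a hypothetical toy set. MI is computed as the reduction in uncertainty about the target, `H(y) - H(y | x)`, for discrete variables.

```python
import numpy as np
import pandas as pd

def entropy(series):
    """Shannon entropy (in nats) of a discrete pandas Series."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

def mutual_information(x, y):
    """MI(x; y) = H(y) - H(y | x) for two discrete Series."""
    h_y_given_x = sum(
        (len(group) / len(y)) * entropy(group)
        for _, group in y.groupby(x)
    )
    return entropy(y) - h_y_given_x

# Hypothetical toy data: the target is the XOR of two binary features
toy = pd.DataFrame({"a": [0, 0, 1, 1] * 25, "b": [0, 1, 0, 1] * 25})
toy["target"] = toy["a"] ^ toy["b"]

# Each feature alone tells us nothing about the target
mi_a = mutual_information(toy["a"], toy["target"])
mi_b = mutual_information(toy["b"], toy["target"])

# Treating the pair as one combined feature removes all uncertainty
pair = toy["a"].astype(str) + toy["b"].astype(str)
mi_ab = mutual_information(pair, toy["target"])

print(mi_a, mi_b, mi_ab)
```

Scored one at a time, both features would land at the bottom of an MI ranking even though together they determine the target completely, which is why a low univariate score is not by itself a reason to discard a feature.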

### Mutual information on the 1985 Automobiles dataset: predicting car prices

In [6]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

plt.style.use("seaborn-v0_8-whitegrid")

df = pd.read_csv('./data/Automobile_data.csv')
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


### NOTE
The scikit-learn algorithm for MI treats discrete features differently from continuous features, so we need to tell it which are which. As a rule of thumb, any feature that must have a float dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding.
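
As a rough sketch of how this information can be passed along, `mutual_info_regression` accepts a `discrete_features` argument that can be a boolean mask with one entry per column. The toy frame below is synthetic and its column names (`fuel`, `doors`, `weight`) are hypothetical; building the mask from integer dtypes is one possible convention, assumed here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 200

# Synthetic toy data: one categorical, one integer, one float column
toy = pd.DataFrame({
    "fuel": rng.choice(["gas", "diesel"], size=n),
    "doors": rng.choice([2, 4], size=n),
    "weight": rng.normal(2500.0, 400.0, size=n),
})
# The target depends only on the continuous column
target = 5.0 * toy["weight"] + rng.normal(0.0, 100.0, size=n)

# Label-encode the categorical column so it gets an integer dtype
toy["fuel"], _ = toy["fuel"].factorize()

# Boolean mask: True for integer (discrete) columns, False for floats
discrete_mask = np.array(
    [pd.api.types.is_integer_dtype(dtype) for dtype in toy.dtypes]
)

scores = mutual_info_regression(
    toy, target, discrete_features=discrete_mask, random_state=0
)
print(pd.Series(scores, index=toy.columns).sort_values(ascending=False))
```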

In [41]:
df.nunique()

symboling              6
normalized-losses     52
make                  22
fuel-type              2
aspiration             2
num-of-doors           3
body-style             5
drive-wheels           3
engine-location        2
wheel-base            53
length                75
width                 44
height                49
curb-weight          171
engine-type            7
num-of-cylinders       7
engine-size           44
fuel-system            8
bore                  39
stroke                37
compression-ratio     32
horsepower            60
peak-rpm              24
city-mpg              29
highway-mpg           30
price                187
dtype: int64

In [44]:
df.drop(columns='price').select_dtypes('object')

Unnamed: 0,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,engine-type,num-of-cylinders,fuel-system,bore,stroke,horsepower,peak-rpm
0,?,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi,3.47,2.68,111,5000
1,?,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi,3.47,2.68,111,5000
2,?,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi,2.68,3.47,154,5000
3,164,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi,3.19,3.4,102,5500
4,164,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi,3.19,3.4,115,5500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,95,volvo,gas,std,four,sedan,rwd,front,ohc,four,mpfi,3.78,3.15,114,5400
201,95,volvo,gas,turbo,four,sedan,rwd,front,ohc,four,mpfi,3.78,3.15,160,5300
202,95,volvo,gas,std,four,sedan,rwd,front,ohcv,six,mpfi,3.58,2.87,134,5500
203,95,volvo,diesel,turbo,four,sedan,rwd,front,ohc,six,idi,3.01,3.4,106,4800


In [42]:
df.drop(columns='price').select_dtypes('int')

Unnamed: 0,symboling,curb-weight,engine-size,city-mpg,highway-mpg
0,3,2548,130,21,27
1,3,2548,130,21,27
2,1,2823,152,19,26
3,2,2337,109,24,30
4,2,2824,136,18,22
...,...,...,...,...,...
200,-1,2952,141,23,28
201,-1,3049,141,19,25
202,-1,3012,173,18,23
203,-1,3217,145,26,27


In [43]:
df.drop(columns='price').select_dtypes('float')

Unnamed: 0,wheel-base,length,width,height,compression-ratio
0,88.6,168.8,64.1,48.8,9.0
1,88.6,168.8,64.1,48.8,9.0
2,94.5,171.2,65.5,52.4,9.0
3,99.8,176.6,66.2,54.3,10.0
4,99.4,176.6,66.4,54.3,8.0
...,...,...,...,...,...
200,109.1,188.8,68.9,55.5,9.5
201,109.1,188.8,68.8,55.5,8.7
202,109.1,188.8,68.9,55.5,8.8
203,109.1,188.8,68.9,55.5,23.0


In [20]:
df.drop(columns='price').select_dtypes('object').nunique()

normalized-losses    52
make                 22
fuel-type             2
aspiration            2
num-of-doors          3
body-style            5
drive-wheels          3
engine-location       2
engine-type           7
num-of-cylinders      7
fuel-system           8
bore                 39
stroke               37
horsepower           60
peak-rpm             24
dtype: int64
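
The unique counts above hint that several object columns (`normalized-losses`, `bore`, `stroke`, `horsepower`, `peak-rpm`) hold numbers stored as text, presumably because of placeholder values such as the `?` visible in `normalized-losses`. Label-encoding such columns would make MI treat them as unordered categories. One option, sketched here on toy rows that imitate the preview (the values, the column list, and the median fill are assumptions, not the preprocessing used in this notebook), is to convert them with `pd.to_numeric` first.

```python
import pandas as pd

# Toy rows imitating the preview above; not the real dataset
toy = pd.DataFrame({
    "normalized-losses": ["?", "164", "95"],
    "horsepower": ["111", "102", "?"],
    "make": ["alfa-romero", "audi", "volvo"],
})

# Assumed list of columns that are numeric despite their object dtype
numeric_like = ["normalized-losses", "horsepower"]

for col in numeric_like:
    # Entries that cannot be parsed, such as "?", become NaN
    toy[col] = pd.to_numeric(toy[col], errors="coerce")
    # Assumption: fill the gaps with the column median so no NaN remains
    toy[col] = toy[col].fillna(toy[col].median())

print(toy.dtypes)
```

After this step the converted columns have a float dtype and would be treated as continuous, while genuinely categorical columns such as `make` are left for label encoding.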

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

In [8]:
X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes('object'):
    X[colname], _ = X[colname].factorize()
    
# All discrete features should now have integer dtypes (double-check before using MI)
discrete_features = X.dtypes == int