# Introduction

When you build a model on tabular data, you often want to know how strongly the explanatory variables are related to the target variable.

**Mutual Information (MI)** is one way to measure such relationships, even when they are not linear.
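
To illustrate the non-linear case, here is a minimal sketch on synthetic data (not the competition data). It assumes a toy relationship `y = x ** 2`, where the Pearson correlation is close to zero although `y` is fully determined by `x`. The variable names and the sample size are made up for this illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2  # deterministic, but non-linear and symmetric around 0

# Pearson correlation only measures linear dependence.
pearson = np.corrcoef(x, y)[0, 1]

# MI is estimated with a nearest-neighbor method for continuous variables.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Mutual information:  {mi:.3f}")
```

The correlation should come out near zero, while the MI estimate is clearly positive.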

The MI of two discrete random variables X and Y is defined as follows.

$$
I(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
$$
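
As a small worked example of this definition, the sketch below evaluates the double sum directly on a joint probability table. The helper name `mutual_information` and the two toy tables are hypothetical and only serve as an illustration. The natural logarithm is used, so the result is in nats.

```python
import numpy as np

def mutual_information(joint):
    """Evaluate I(X;Y) from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()            # make sure the table sums to 1
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x), shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y), shape (1, n_y)
    product = px @ py                      # p(x) p(y) for every cell
    nonzero = joint > 0                    # terms with p(x, y) = 0 contribute 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))

# X and Y independent: p(x, y) = p(x) p(y) in every cell.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# X and Y always equal: knowing one reveals the other.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)

print(mi_independent)
print(mi_dependent)
```

For the independent table every log term is log(1) = 0, so the MI is 0. For the fully dependent table the MI equals log 2, which is the entropy of a fair binary variable.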


- An easy-to-understand introduction to MI  
[Kaggle course about MI](https://www.kaggle.com/ryanholbrook/mutual-information)

- For further information  
[Wikipedia](https://en.wikipedia.org/wiki/Mutual_information)


For this competition, I referred to  
[[TPS-May] Categorical EDA](https://www.kaggle.com/subinium/tps-may-categorical-eda)  
This notebook is very instructive for EDA (Exploratory Data Analysis), even if you are a Kaggle beginner (as I am!).

In [None]:
import numpy as np
import pandas as pd 
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-may-2021/train.csv")
train.head()

In [None]:
train = train.drop('id', axis=1)

In [None]:
train.info()

In [None]:
fig, ax = plt.subplots()
sns.countplot(x='target', data=train, order=sorted(train['target'].unique()), ax=ax)
ax.set_ylim(0, 63000)
ax.set_title('Target Distribution', weight='bold')
plt.show()

In [None]:
train.describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')\
                            .background_gradient(subset=['50%'], cmap='coolwarm')

In [None]:
X = train.copy()

# Label-encode the categorical columns.
# Here, only "target" is categorical.
X["target"], _ = X["target"].factorize()
X.head()

In [None]:
y = X.pop("target")

In [None]:
from sklearn.feature_selection import mutual_info_classif

def make_mi_scores(X, y):
    """Return the MI score of each column of X with y, sorted in descending order."""
    # "target" is categorical, so use the classification variant of MI.
    mi_scores = mutual_info_classif(X, y)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y)
mi_scores[::3]  # show a few features with their MI scores
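
One option worth knowing about: with the default `discrete_features='auto'`, `mutual_info_classif` treats the columns of a dense input as continuous and uses a nearest-neighbor estimate. If the features are integer-valued categories or counts (an assumption you should check against your own data), you can pass `discrete_features=True` so that MI is computed from the observed value counts instead. The sketch below uses synthetic data, and the `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 3000
toy_y = rng.integers(0, 4, size=n)  # 4 classes

toy_X = pd.DataFrame({
    # Depends on the class label, blurred by a random 0/1 offset.
    "informative": toy_y + rng.integers(0, 2, size=n),
    # Drawn independently of the class label.
    "noise": rng.integers(0, 5, size=n),
})

# Treat every column as discrete; random_state is fixed for reproducibility.
toy_scores = pd.Series(
    mutual_info_classif(toy_X, toy_y, discrete_features=True, random_state=0),
    index=toy_X.columns,
    name="MI Scores",
).sort_values(ascending=False)

print(toy_scores)
```

The `informative` column should rank first with a clearly positive score, while `noise` stays close to zero.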

In [None]:
def plot_mi_scores(scores):
    """Draw a horizontal bar chart of MI scores, highest score at the top."""
    scores = scores.sort_values(ascending=True)
    positions = np.arange(len(scores))  # y positions of the bars
    ticks = list(scores.index)
    plt.barh(positions, scores)
    plt.yticks(positions, ticks)
    plt.title("Mutual Information Scores")


plt.figure(dpi=100, figsize=(20, 15))
plot_mi_scores(mi_scores)

**feature_14** has a high MI score, which suggests that it has a comparatively strong relationship with **target**.

I hope these mutual information results are helpful for your predictions.

Thanks😊