## Table of Contents
* [Import and Basic EDA](#import)
* [Correlation of numerical features](#corr_num)
* ["Correlation" of all features](#corr_all)
* [Impact of age feature](#age)
* [Target vs Features](#target)

In [None]:
# packages

# standard
import numpy as np
import pandas as pd

# plot
import matplotlib.pyplot as plt
import seaborn as sns

# statistics
import phik

In [None]:
# show files
!ls -l '../input/playground-series-s3e2/'

<a id='import'></a>
# Import and Basic EDA

In [None]:
# import and overview
df_train = pd.read_csv('../input/playground-series-s3e2/train.csv')
df_test = pd.read_csv('../input/playground-series-s3e2/test.csv')
df_train.info()

In [None]:
# test set
df_test.info()

In [None]:
# preview
df_train.head()

In [None]:
# feature definition
features_num = ['age', 'avg_glucose_level', 'bmi']

features_cat = ['gender', 'hypertension', 'heart_disease', 
                'ever_married', 'work_type', 'Residence_type',
                'smoking_status']

features = features_num + features_cat

target = 'stroke'

In [None]:
# basic stats - train
df_train[features_num].describe().T

In [None]:
# basic stats - test
df_test[features_num].describe().T

In [None]:
# show distribution of categorical features
for f in features_cat:
    plt.figure(figsize=(12,3))
    ax1 = plt.subplot(1,2,1)
    foo = df_train[f].value_counts()
    ax1.bar(height=foo, x=foo.index, color='darkblue')
    plt.title(f + ' [Train]')
    plt.grid()
    ax2 = plt.subplot(1,2,2, sharex=ax1)
    foo = df_test[f].value_counts()
    ax2.bar(height=foo, x=foo.index, color='darkgreen')
    plt.title(f + ' [Test]')
    plt.grid()
    plt.show()

#### Distributions of the numerical features are shown in the pairplots below.

<a id='corr_num'></a>
# Correlation of numerical features

In [None]:
# pairplot of numerical features - train
g = sns.pairplot(data=df_train[features_num], 
                 plot_kws = {'alpha': 0.2, 's' : 15})
g.fig.suptitle('Pairplot - train', y=1.02)
plt.show()

In [None]:
# pairplot of numerical features - train
# including visualization of target via color
g = sns.pairplot(data=df_train[features_num+[target]],
                 hue = target,
                 plot_kws = {'alpha': 0.2, 's' : 15})
g.fig.suptitle('Pairplot - train - colored by target', y=1.02)
plt.show()

In [None]:
# calc correlation matrix - train
rho_mat_train = df_train[features_num].corr(method='pearson')
# and visualize
plt.figure(figsize=(5,4))
sns.heatmap(rho_mat_train, annot=True,
            fmt='.3f',
            linecolor='black', linewidths=1,
            cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Correlation Numerical Features - Train')
plt.show()

In [None]:
# pairplot of numerical features - test
g = sns.pairplot(data=df_test[features_num], 
                 plot_kws = {'alpha': 0.2, 's' : 15})
g.fig.suptitle('Pairplot - test', y=1.02)
plt.show()

In [None]:
# calc correlation matrix - test
rho_mat_test = df_test[features_num].corr(method='pearson')
# and visualize
plt.figure(figsize=(5,4))
sns.heatmap(rho_mat_test, annot=True,
            fmt='.3f',
            linecolor='black', linewidths=1,
            cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Correlation Numerical Features - Test')
plt.show()

<a id='corr_all'></a>
# "Correlation" of all features
### The Phi_K coefficient also lets us check associations between categorical and numerical features (see https://phik.readthedocs.io/en/latest/). Unlike Pearson's coefficient, it ranges from 0 to 1.
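
To build some intuition for what an association measure between categorical features does, here is a minimal sketch of Cramér's V, a simpler measure that, like Phi_K, is derived from the chi-square statistic of a contingency table. It is not the Phi_K computation itself, and both the `toy` frame and the `cramers_v` helper are hypothetical illustrations, not part of the competition data or the phik API.

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical series (illustrative helper, not phik)."""
    # observed counts
    ctab = pd.crosstab(x, y).to_numpy(dtype=float)
    n = ctab.sum()
    # expected counts under independence
    expected = np.outer(ctab.sum(axis=1), ctab.sum(axis=0)) / n
    chi2 = ((ctab - expected) ** 2 / expected).sum()
    # normalize by sample size and the smaller table dimension
    k = min(ctab.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# made-up data: b follows a exactly, c is independent of a
toy = pd.DataFrame({
    'a': ['x', 'x', 'y', 'y'] * 25,
    'b': ['p', 'p', 'q', 'q'] * 25,
    'c': ['u', 'v', 'u', 'v'] * 25,
})

v_same = cramers_v(toy['a'], toy['b'])
v_indep = cramers_v(toy['a'], toy['c'])
print(v_same, v_indep)
```

Like Phi_K, the value lies between 0 (no association) and 1 (perfect association), which is why no negative entries should be expected in the matrices below.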

In [None]:
# calc Phi_K matrix
phiK_mat_train = df_train[features].phik_matrix(interval_cols=features_num)

In [None]:
# visualize phi_K matrix
plt.figure(figsize=(9,7))
sns.heatmap(phiK_mat_train, annot=True,
            fmt='.3f',
            linecolor='black', linewidths=1,
            cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Phi_K correlation - Train')
plt.show()

In [None]:
# calc Phi_K matrix
phiK_mat_test = df_test[features].phik_matrix(interval_cols=features_num)
# and visualize phi_K matrix
plt.figure(figsize=(9,7))
sns.heatmap(phiK_mat_test, annot=True,
            fmt='.3f', 
            linecolor='black', linewidths=1,
            cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Phi_K correlation - Test')
plt.show()

### 💡 We observe that age in particular is strongly connected to several other features. The connection between age and ever_married is discussed here: https://www.kaggle.com/competitions/playground-series-s3e2/discussion/377253

<a id='age'></a>
# Impact of age feature on other features

### Let's evaluate the impact of age by introducing a binned version:

In [None]:
# first, create a discrete version of age
df_train['age_cat'] = pd.cut(df_train.age, [0,10,20,30,40,50,60,70,80,90])
plt.figure(figsize=(10,4))
df_train.age_cat.value_counts().sort_index().plot(kind='bar', color='darkblue')
plt.title('Binned version of age')
plt.grid()
plt.show()
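
A note on the binning above: with its default `right=True`, `pd.cut` builds right-closed intervals, so an age of exactly 10 falls into `(0, 10]` and an age of exactly 0 would fall outside all bins. The sketch below illustrates this on a few made-up ages (hypothetical values, not taken from the data set).

```python
import pandas as pd

# same bin edges as in the cell above
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

# made-up ages, chosen to hit the bin edges
toy_age = pd.Series([0.0, 0.08, 10.0, 10.5, 82.0])
toy_cat = pd.cut(toy_age, bins)

# values outside all bins become NaN
n_missing = int(toy_cat.isna().sum())

print(pd.DataFrame({'age': toy_age, 'age_cat': toy_cat}))
print('unbinned values:', n_missing)
```

If ages of exactly 0 could occur, passing `include_lowest=True` would make the first interval include its left edge.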

In [None]:
# plot bivariate distributions between age and categorical features;
# we normalize each age column here
for f in features_cat:
    # calc cross table
    ctab = pd.crosstab(df_train[f],df_train.age_cat)
    # ...and normalize by column
    ctab_norm = ctab / ctab.sum()
    # plot as heatmap
    plt.figure(figsize=(10,3))
    g = sns.heatmap(ctab_norm, annot=True,
                    fmt='.2%', linecolor='black',
                    linewidths=1,
                    cmap='Greens', 
                    vmin=0, vmax=+1)
    plt.title(f + ' vs age(cat) - train')
    plt.show()
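
The column normalization used above divides every cell by its column total, so each age bin sums to 100%. As a small sketch on made-up data (the `toy` frame is hypothetical), the same result can also be obtained with the `normalize='columns'` option of `pd.crosstab`.

```python
import numpy as np
import pandas as pd

# made-up data with two coarse age groups
toy = pd.DataFrame({
    'ever_married': ['No', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes'],
    'age_cat': ['young', 'young', 'young', 'young', 'old', 'old', 'old', 'old'],
})

# manual normalization, as in the cell above
ctab = pd.crosstab(toy['ever_married'], toy['age_cat'])
ctab_norm = ctab / ctab.sum()

# built-in alternative
ctab_norm_builtin = pd.crosstab(toy['ever_married'], toy['age_cat'],
                                normalize='columns')

print(ctab_norm)
```

Each column of `ctab_norm` then reads as the distribution of the categorical feature within one age bin.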

In [None]:
# plot also numerical features vs age groups
for f in features_num:
    if f != 'age':
        plt.figure(figsize=(10,5))
        sns.violinplot(data=df_train, x='age_cat', y=f)
        plt.title(f + ' vs age(cat) - train')
        plt.grid()
        plt.show()

### A different visualization approach without binning of age:

In [None]:
# create violinplots for all categorical features
for f in features_cat:
    plt.figure(figsize=(10,4))
    sns.violinplot(data=df_train, y=f, x='age',
                   orient='h')
    plt.title(f + ' vs age - train')
    plt.xlabel('age')
    plt.grid()
    plt.show()

<a id='target'></a>
# Target vs Features

### We can easily extend the Phi_K evaluation to also include our **target**:

In [None]:
# calc
phiK_mat_w_target = df_train[features+[target]].phik_matrix(interval_cols=features_num)
# reduce to relevant data only
phiK_target = phiK_mat_w_target[[target]] # keep "stroke" column only (as DataFrame)
phiK_target = phiK_target.drop(index=target) # remove "stroke" row
# and visualize
plt.figure(figsize=(2,5))
sns.heatmap(phiK_target, annot=True,
            fmt='.3f', 
            linecolor='black', linewidths=1,
            cmap='RdYlGn', vmin=-1, vmax=+1)
plt.title('Phi_K correlation - Target vs Features')
plt.show()

### 💡 As expected, age has the highest "correlation" with the target. At the other end, we can expect gender and Residence_type to have only a small impact on the stroke prediction. The following plots visualize these connections:

### Target vs **Categorical** Features (incl. age_cat):

In [None]:
for f in (['age_cat'] + features_cat):
    ctab = pd.crosstab(df_train[target],df_train[f])
    ctab_norm = ctab / ctab.sum()
    # plot as heatmap
    plt.figure(figsize=(10,3))
    g = sns.heatmap(ctab_norm, annot=True,
                    fmt='.2%', linecolor='black',
                    linewidths=1,
                    cmap='Greens', 
                    vmin=0, vmax=+1)
    plt.title('Target vs '+f)
    plt.show()

### Target vs **Numerical** Features:

In [None]:
# plot numerical features vs target
for f in features_num:
    plt.figure(figsize=(10,4))
    sns.violinplot(data=df_train, y=target, x=f,
                   orient='h')
    plt.title('target vs ' + f)
    plt.xlabel(f)
    plt.grid()
    plt.show()