<h1 style='background:#F0FFF0; border:2; color:black'><center>Simple and Easy EDA 💁</center></h1>

This notebook walks through a quick exploratory data analysis (EDA) of the Tabular Playground Series November 2021 competition data.

# **<span style="color:#228B22;">Data</span>**

For this competition, you will be predicting a binary target based on 100 feature columns given in the data. All columns are continuous.

The original dataset deals with identifying spam emails via various features extracted from each email. Although the features are anonymized, they have properties relating to real-world features.

The competition data is synthetically generated by a GAN that was trained on that real-world dataset.

**Files**
> - ```train.csv``` - the training data with the target column
> - ```test.csv``` - the test set
> - ```sample_submission.csv``` - a sample submission file in the correct format

# **<span style="color:#228B22;">Evaluation Metric</span>**
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
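
To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from labels and predicted probabilities. It uses the rank-sum (Mann-Whitney U) identity rather than tracing the curve; the function name `roc_auc` and the toy inputs are made up for illustration and are not part of the competition code.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    y_true holds 0/1 labels, y_score holds predicted probabilities.
    """
    # Sort by score so each position maps to a rank
    pairs = sorted(zip(y_score, y_true))
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks, giving tied scores their average rank
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, t in pairs if t == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # Sum of ranks of the positive samples, shifted and normalised
    rank_sum_pos = sum(r for r, (_, t) in zip(ranks, pairs) if t == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

The result is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity.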

# **<span style="color:#228B22;">Import Modules</span>**

In [None]:

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

SEED = 42
np.random.seed(SEED)

# **<span style="color:#228B22;">Load Data</span>**

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')

print('Quick view of training data: ')
train.head()

In [None]:
TARGET = 'target'
FEATURES = [col for col in train.columns if col not in ['id', TARGET]]
print(f'Training data:\n\t Number of rows: {train.shape[0]}, Number of columns: {train.shape[1]}')
print(f'Testing data:\n\t Number of rows: {test.shape[0]}, Number of columns: {test.shape[1]}')

In [None]:
print('Basic statistics of training data:')
train[FEATURES+[TARGET]].describe().style.background_gradient(cmap="Greens")

In [None]:
print('Basic statistics of testing data:')
test[FEATURES].describe().style.background_gradient(cmap="Greens")

In [None]:
print(f'Number of missing values in training data: {train.isna().sum().sum()}')
print(f'Number of missing values in testing data: {test.isna().sum().sum()}')

# **<span style="color:#228B22;">EDA</span>**

In [None]:
df = pd.concat([train[FEATURES], test[FEATURES]], axis=0)

cat_features = [col for col in FEATURES if df[col].nunique() < 10]
cont_features = [col for col in FEATURES if df[col].nunique() >= 10]

del df

print(f'Total number of features: {len(FEATURES)}')
print(f'Categorical features: {len(cat_features)}')
print(f'Continuous features: {len(cont_features)}')

plt.pie([len(cat_features), len(cont_features)], 
        labels=['Categorical', 'Continuous'],
        colors=['#2E8B57', '#8FBC8F'],
        textprops={'fontsize': 15},
        autopct='%1.1f%%')
plt.show()

In [None]:
print("Feature distribution of continuous features: ")
ncols = 5
# Ceiling division, so a partially filled last row still gets its own row of axes
nrows = -(-len(cont_features) // ncols)

fig, axes = plt.subplots(nrows, ncols, figsize=(18, 150), facecolor='#EAEAF2', squeeze=False)

# Pair each feature with one axis; this cannot index past the end of cont_features
for ax, col in zip(axes.flat, cont_features):
    sns.kdeplot(x=train[col], ax=ax, color='#58D68D', label='Train data')
    sns.kdeplot(x=test[col], ax=ax, color='#DE3163', label='Test data')
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=8, fontweight='bold')
    ax.tick_params(labelsize=5, width=0.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)
    ax.legend(fontsize=5)

# Hide any axes left over in the last row
for ax in axes.flat[len(cont_features):]:
    ax.set_visible(False)
plt.show()

In [None]:
print("Box plots of continuous features: ")
ncols = 5
# Ceiling division, so a partially filled last row still gets its own row of axes
nrows = -(-len(cont_features) // ncols)

fig, axes = plt.subplots(nrows, ncols, figsize=(18, 150), facecolor='#EAEAF2', squeeze=False)

for ax, col in zip(axes.flat, cont_features):
    # Stack train and test in long form so the two boxes sit side by side
    # instead of the test box being drawn on top of the train box
    data = pd.concat([train[[col]].assign(dataset='Train'),
                      test[[col]].assign(dataset='Test')])
    sns.boxplot(x='dataset', y=col, data=data, palette=['#58D68D', '#DE3163'], ax=ax)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=8, fontweight='bold')
    ax.tick_params(labelsize=5, width=0.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

# Hide any axes left over in the last row
for ax in axes.flat[len(cont_features):]:
    ax.set_visible(False)
plt.show()

In [None]:
print("Target Distribution: ")

target = pd.DataFrame(train[TARGET].value_counts()).reset_index()
target.columns = [TARGET, 'count']

fig, ax = plt.subplots(1, 1, figsize=(25, 8), facecolor='#EAEAF2')
sns.barplot(y=TARGET, x='count', data=target, palette=['#58D68D', '#DE3163'], ax=ax, orient='h')
ax.set_xlabel('Count', fontsize=16)
ax.set_ylabel('Target', fontsize=16)
plt.show()

In [None]:
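# As a test, draw a bivariate KDE with univariate KDEs on the margins.
# Pass the frame via `data` and the columns by name; mixing positional
# Series with the x/y keywords raises a TypeError.
sns.jointplot(data=train, x='target', y='f1', color='#ff355d', kind='kde')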
plt.show()

# **<U><span style="color:#F08080;">Hope you get a basic understanding of your data with a simple EDA 🙂</span></U>**