# Solutions I: Exploration

**Variables:**

1. `age` (numeric)
2. `job`: type of job (categorical)
3. `marital`: marital status (categorical)
4. `education`: level of education (categorical)
5. `default`: has credit in default? (categorical)
6. `housing`: has housing loan? (categorical)
7. `loan`: has personal loan? (categorical)

**Related to the last contact of the current campaign:**

8. `contact`: contact communication type (categorical)
9. `month`: last contact month of year (categorical)
10. `day_of_week`: last contact day of the week (categorical)
11. `duration`: last contact duration, in seconds (numeric)

*Note: `duration` is only known after the call has taken place, so it is not available when deciding whom to contact!*
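
A minimal sketch of leaving `duration` out of the model inputs for that reason. The rows below are made up for illustration, and the names `toy` and `features` are assumptions introduced here, not part of the dataset:

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "age": [30, 45, 52],
        "duration": [120, 0, 640],
        "campaign": [1, 3, 2],
        "y": ["no", "no", "yes"],
    }
)

# Drop the target and the post-contact column before building model inputs
features = toy.drop(columns=["duration", "y"])
print(list(features.columns))
```

Keeping `duration` is still fine for descriptive analysis, as in the rest of this notebook; the exclusion only matters once the goal is a realistic predictive model.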

**Other attributes:**

12. `campaign`: number of contacts performed during this campaign and for this client (numeric)
13. `pdays`: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14. `previous`: number of contacts performed before this campaign and for this client (numeric)
15. `poutcome`: outcome of the previous marketing campaign (categorical)

**Social and economic context attributes:**

16. `emp.var.rate`: employment variation rate - quarterly indicator (numeric)
17. `cons.price.idx`: consumer price index - monthly indicator (numeric)
18. `cons.conf.idx`: consumer confidence index - monthly indicator (numeric)
19. `euribor3m`: euribor 3 month rate - daily indicator (numeric)
20. `nr.employed`: number of employees - quarterly indicator (numeric)


In [None]:
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Load dataset

In [None]:
df = pd.read_csv("../../0_data/banking/bank-additional-full.csv", sep=";")
df.shape

In [None]:
# Columns in the dataset
df.columns

In [None]:
df.dtypes

In [None]:
# Number of missing values per column
df.isna().sum()

## Basic descriptives

In [None]:
# Target variable distribution
df["y"].value_counts()

In [None]:
# Or as percentage
df["y"].value_counts(normalize=True)

In [None]:
# Create numerical y (map avoids the implicit dtype downcast that replace relies on)
y = df["y"].map({"yes": 1, "no": 0})

### Numerical features

In [None]:
numerical = df.select_dtypes("number")

In [None]:
# Numerical features
numerical.describe()

In [None]:
# 999 is a sentinel value in pdays; treat it as missing
numerical = numerical.assign(pdays=lambda d: d["pdays"].replace({999: np.nan}))

# Mostly missing values...
(numerical["pdays"].value_counts(dropna=False, normalize=True) * 100).head(10)
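
Turning the `999` sentinel into `NaN` keeps it out of the summary statistics, but it also drops the information that a client was never contacted before. One possible way to keep that signal is an explicit indicator column. The sketch below uses made-up values, and the column name `previously_contacted` is an assumption introduced here for illustration:

```python
import numpy as np
import pandas as pd

# Toy pdays values (made up); 999 marks "not previously contacted"
toy = pd.DataFrame({"pdays": [999, 3, 999, 6, 999]})

toy = toy.assign(
    # Computed first, while pdays still holds the sentinel
    previously_contacted=lambda d: d["pdays"].ne(999),
    # Then the sentinel is replaced by NaN, as in the cell above
    pdays=lambda d: d["pdays"].replace({999: np.nan}),
)
print(toy)
```

The order of the keyword arguments matters here: `assign` evaluates them left to right, so the indicator has to be derived before `pdays` is overwritten.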

In [None]:
# Plot numeric distributions
plt.rc("font", size=7)  # set before plotting so it applies to this figure

ncols = 3
nrows = math.ceil(numerical.shape[1] / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols * 3, nrows * 2), squeeze=False)
axes_flat = axes.ravel()

for ax, col_name in zip(axes_flat, numerical.columns):
    # Drop NaNs (e.g. in pdays) before binning
    ax.hist(numerical[col_name].dropna(), bins=50, edgecolor="white")
    ax.set_title(col_name)

# Hide the unused axes in the last row of the grid
for ax in axes_flat[numerical.shape[1]:]:
    ax.set_visible(False)

plt.tight_layout()

In [None]:
# Correlations with target
(
    numerical
    .corrwith(y)
    .rename("correlation with target")
    .sort_values(ascending=False)
    .round(2)
    .to_frame()
)

### Categorical features

In [None]:
categorical = df.select_dtypes("object")

In [None]:
# Categorical features
categorical.describe()

In [None]:
# Plot categorical distributions
plt.rc("font", size=7)  # set before plotting so it applies to this figure

# The target is used as the pivot column, so plot only the other columns
cat_features = categorical.columns.drop("y")

ncols = 3
nrows = math.ceil(len(cat_features) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(ncols * 3, nrows * 2), squeeze=False)
axes_flat = axes.ravel()

for ax, col_name in zip(axes_flat, cat_features):
    (
        categorical
        .pivot_table(index=[col_name], columns="y", aggfunc="size", fill_value=0)
        .sort_values("no")
        .plot.barh(ax=ax)
    )

    ax.set_title(col_name)
    ax.legend(loc="lower right")

# Hide the unused axes in the last row of the grid
for ax in axes_flat[len(cat_features):]:
    ax.set_visible(False)

plt.tight_layout()
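
Since `no` is by far the larger class in this dataset, raw counts per level mostly show how common each level is, not how often it ends in a `yes`. A possible complement is the share of `yes` within each level. The sketch below uses a few made-up rows and `pd.crosstab` with `normalize="index"`; the names `toy`, `rates` and `yes_rate` are illustrative assumptions:

```python
import pandas as pd

# Toy rows (made up) standing in for one categorical column and the target
toy = pd.DataFrame(
    {
        "contact": ["cellular", "cellular", "telephone", "telephone", "cellular", "telephone"],
        "y": ["yes", "no", "no", "no", "yes", "no"],
    }
)

# Row-normalised crosstab: each row sums to 1, so the "yes" column is the rate
rates = pd.crosstab(toy["contact"], toy["y"], normalize="index")
yes_rate = rates["yes"].sort_values(ascending=False)
print(yes_rate)
```

On the real data the same call could be applied to each column of `categorical` in turn, for example inside the plotting loop above.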