In [None]:
%%html
<style>
table {float:left}

td {font-size: 14px}
</style>

# Bank Marketing

## Project Description

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). The data was downloaded from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing.
<br>
<br>
*Source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*
<br>
<br>
There are four datasets in this repository:
1. bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2. bank-additional.csv with 10% of the examples (4119), randomly selected from 1, and 20 inputs.
3. bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4. bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with fewer inputs).

<br>
In this project I will use the 'bank-additional-full.csv' dataset.


**Feature description:**

| Feature | Type | Description |
|:---------|:------|:-------------|
|*Bank client data*|
|age|numeric|age of the client|
|job|categorical|type of job ('admin', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')|
|marital|categorical|marital status ('divorced' (divorced or widowed), 'married', 'single', 'unknown')|
|education|categorical|'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown'|
|default|categorical|has credit in default? ('yes', 'no', 'unknown')|
|housing|categorical|has housing loan? ('yes', 'no', 'unknown')|
|loan|categorical|has personal loan? ('yes', 'no', 'unknown')|
|*Related with the last contact of the current campaign* |
|contact|categorical|contact communication type ('cellular', 'telephone')|
|month|categorical|last contact month of the year ('jan', 'feb', 'mar', ... ,'nov', 'dec')|
|day_of_week|categorical|last contact day of the week ('mon', 'tue', 'wed', 'thu', 'fri')|
|duration|numeric|last contact duration in seconds. Remark: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.|
|*Other attributes*|
|campaign|numeric|number of contacts performed during this campaign and for this client (includes last contact)|
|pdays|numeric|number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted)|
|previous|numeric|number of contacts performed before this campaign and for this client|
|poutcome|categorical|outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')|
|*Social and economic context attributes*|
|emp.var.rate|numeric| employment variation rate - quarterly indicator|
|cons.price.idx| numeric| consumer price index|
|cons.conf.idx|numeric| consumer confidence index - monthly indicator|
|euribor3m|numeric| euribor 3 month rate - daily indicator|
|nr.employed |numeric| number of employees - quarterly indicator|
|*Output variable (desired target)*|
|y|categorical| has the client subscribed to a term deposit? ('yes', 'no')|



## Get the data

In [None]:
# general imports
from IPython.display import display,HTML # for ipywidgets
import zipfile # to get data
from urllib.request import urlopen # to get data
from io import BytesIO # to get data
import numpy as np
import pandas as pd
from collections import Counter
from display_side_by_side import *

In [None]:
# visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# Cufflinks
import cufflinks as cf
cf.go_offline()

# Plotly
import plotly
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff


In [None]:
# user-defined functions
from dataset_overview import *
from eda_cat_vs_target_binary import *
from eda_num_vs_target_binary import *
from eda_num_univ_binary import *
from cross_tab_chi_sqrt_p_value import *

In [None]:
# ipywidgets
import ipywidgets as widgets

In [None]:
# pandas pipelines
import pdpipe as pdp

In [None]:
# sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit

In [None]:
#display(HTML("<style>.container { width:80% !important; }</style>"))

In [None]:
# function to highlight max values in table
def highlight_max(data, color='red'):
    '''
    highlight the maximum in a Series or DataFrame
    '''
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)

**Data**

In [None]:
z = urlopen('http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip')
myzip = zipfile.ZipFile(BytesIO(z.read())).extract('bank-additional/bank-additional-full.csv')
data = pd.read_csv(myzip,sep=';')

## Quick look at the data structure

In [None]:
data.head()

In [None]:
dataset_overview(data)

This dataset has 21 variables and 41188 observations, no missing values, and 12 duplicated rows (about 0.03%). From the descriptive statistics for the numerical variables we can see zeros as minimum values for `duration`, `pdays` and `previous`. `duration` is the last contact duration, so if the duration is 0 seconds, there was no last contact. `pdays` is the number of days that passed after the client was last contacted from a previous campaign, so pdays = 0 means that the client was contacted within the last 24 hours. This variable also has a maximum value of 999, which means that the client was not previously contacted. `previous` is the number of contacts performed before this campaign for this client, so previous = 0 means that the client was never contacted before. The presence of zeros is therefore not unexpected.
<br>
We also see some negative values for `emp.var.rate` and `cons.conf.idx`. Although I couldn't find how `emp.var.rate` is calculated, I assume that it is just the variation in the employment rate on a quarterly basis, so a negative number is not unexpected. For `cons.conf.idx`, which measures how consumers feel about jobs, the country and spending, negative values mean that pessimistic consumer opinions prevail.
<br>
<br>
Regarding the categorical variables, again we don't have any missing values (null values). 
<br>
Most of the clients:
- are admin, married and with a university degree;
- do not have credit in default, do not have a personal loan, but do have a housing loan;
- were contacted via cellular, in May and on Thursdays;
- had 'nonexistent' as the outcome of the previous marketing campaign (new clients?);
- didn't subscribe to a term deposit.


We also have a great amount of zeros (about 86.4%), which may come from a specific column. For the rest, regarding data quality, everything seems to be OK.
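To find out which column the zeros come from, the share of zeros can be computed per column. The following is a minimal sketch on a made-up toy frame (the values are illustrative assumptions, not the bank data); the same expression could be applied to the numeric columns of `data`.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    'previous': [0, 0, 0, 1, 0],
    'pdays': [999, 999, 999, 3, 999],
    'campaign': [1, 2, 1, 4, 3],
})

# Share of zeros per column, in percent, sorted from highest to lowest
zero_share = (toy == 0).mean().mul(100).sort_values(ascending=False)
print(zero_share)
```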

## Split the data

The reason I am going to split the data into train and test sets at this point of the analysis is to avoid the so-called `data snooping bias`. After all, if the test set is a proxy for new unseen data, then we should not look at it.
<br>
Reference: Hands-On Machine Learning with Scikit-Learn and TensorFlow, Aurélien Géron.

From the project description we know that we are dealing with a classification task, more precisely a binary classification. We also know from the descriptive statistics for the categorical variables that the majority of the clients didn't subscribe to a term deposit (y = no for 36548 out of 41188). Thus, before splitting the data, let's check how imbalanced the target is, so that we can maintain the same proportions of the response in the training and test sets after splitting:

In [None]:
counts = data['y'].value_counts()
counts_perc = 100*data['y'].value_counts()/len(data)
target_data_proportions = pd.DataFrame({'Target Counts':counts,'Target Counts (%)':counts_perc}).round(2)
target_data_proportions

OK, so we have a highly imbalanced dataset, with about 88.73% of the observations belonging to the class no (clients that did not subscribe to a term deposit). We need to take this into consideration while splitting the data:

In [None]:
split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=123)
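for train_index,test_index in split.split(data,data['y']):
    # split() yields positional indices, so iloc is the safe accessor
    # (loc would only work here because data has the default RangeIndex)
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]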
    
# we could use instead train_test_split with stratify and get the same result
# train_set, test_set = train_test_split(data,test_size=0.2,stratify=data['y'],random_state=123) 

Let's check if the proportions of the response in the train and test sets were maintained:

In [None]:
target_train = strat_train_set['y']
target_test = strat_test_set['y']
counter_train = Counter(target_train)
counter_test = Counter(target_test)

print('Train target:')
for k_train,v_train in counter_train.items():
    perc_train = round(v_train/len(target_train)*100,2)
    print('Class={}, Count={}, Percentage={}'.format(k_train,v_train,perc_train))
    
print('')

print('Test target:')
for k_test,v_test in counter_test.items():
    perc_test = round(v_test/len(target_test)*100,2)
    print('Class={}, Count={}, Percentage={}'.format(k_test,v_test,perc_test))
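The same check can be written more compactly with `value_counts(normalize=True)`. This is only a sketch on a synthetic target with roughly the same imbalance (the toy frame and its column names are assumptions, not the bank data); it puts the class shares of the full data, the train set and the test set side by side.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (89% 'no'); not the real bank data
toy = pd.DataFrame({'x': range(1000), 'y': ['no'] * 890 + ['yes'] * 110})

train, test = train_test_split(toy, test_size=0.2, stratify=toy['y'], random_state=123)

# Class shares (%) in the full data, the train set and the test set
proportions = pd.DataFrame({
    'full': toy['y'].value_counts(normalize=True),
    'train': train['y'].value_counts(normalize=True),
    'test': test['y'].value_counts(normalize=True),
}).mul(100).round(2)
print(proportions)
```

If the stratification worked, the three columns should be nearly identical.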

Now that we have a training set and a test set where the proportions of the target variable are maintained as they were in the whole dataset, let's proceed to EDA.

# Exploratory Data Analysis

In [None]:
# Let's give another name to the training set
df = strat_train_set.copy(deep=True)

Let's first drop 'duration' due to possible data leakage (see 'Feature description' above):

In [None]:
df = df.drop('duration',axis=1)

Let's check again the dataset stats:

In [None]:
dataset_overview(df)

By dropping the column 'duration', the number of duplicated rows increases greatly. However, since there is no unique ID to identify the clients, I don't think that we can treat these rows as true duplicates. For example, the table below shows duplicated rows for 28-year-old clients who are students and single (and so on), but there can be more than one client with that profile. So let's leave these rows as they are for now.

In [None]:
dup = df[df.duplicated(keep=False)]
dup.loc[(dup['age']==28) & (dup['job']=='student') & (dup['education']=='basic.9y')]
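Why dropping a column creates duplicates can be seen on a tiny example. The frame below is made up for illustration: two different calls to clients with the same profile differ only in `duration`, so they become indistinguishable once that column is removed.

```python
import pandas as pd

# Made-up rows: two distinct calls to clients with the same profile
toy = pd.DataFrame({
    'age': [28, 28, 35],
    'job': ['student', 'student', 'admin'],
    'duration': [120, 300, 95],
})

dups_before = toy.duplicated().sum()                          # all columns
dups_after = toy.drop('duration', axis=1).duplicated().sum()  # without 'duration'
print(dups_before, dups_after)
```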

## Numeric variables Summary 

Let's start by analysing each continuous variable (univariate analysis):

In [None]:
print('')
print('Figure 1: Numeric Variables Summary')
print('')
eda_num_univ_binary(df)

|Variables|  Observations      |
|:---------|:--------|
|age|- The clients of this bank have ages that range from 17 to 98 years old.|
|   |- The median age of the clients is 38 years, and about 75% of the clients are younger than 47.|
|   |- There appear to be some outliers. From the boxplot we can see that the extreme values of age range from 70 to 98 years old. However, in my opinion these are just extreme values, not errors: nothing prevents older people from being clients of a bank.|
|campaign|- The number of contacts performed during this campaign and for this client ranges from 1 to 43. This variable includes the last contact.|
|        |- About 75% of the clients were contacted 3 times or fewer. The most frequent number of contacts in this dataset is 1 (43.10%).|
|        |- (appears to be a good candidate for discretization)|
|pdays|- The great majority of the clients (96.38%) were not contacted previously (pdays = 999; are these new clients?)|
|     |- (appears to be a good candidate for discretization)|
|previous|- 86.37% of the values in this feature are zeros, which means that the great majority of the clients were not contacted before this campaign.|
|     |- (also a good candidate for discretization)|
|emp.var.rate|- The employment variation rate ranges from -3.4 to 1.4.|
|            |- 75% of the values are at or below 1.4.|
|            |- (good candidate for binning. From the histogram a natural approach would be (-inf,-1.8], (-1.8,-0.1], (-0.1,+inf))|
|cons.price.idx|- The consumer price index ranges from 92.20 to 94.77.|
|              |- 75% of the values are below 93.994.|
|              |- (this feature also appears to be more categorical in nature than continuous. Also a good candidate to be discretized)|
|cons.conf.idx|- The consumer confidence index ranges from -50.80 to -26.90.|
|             |- 75% of the values are below -36.4.|
|             |- (this feature also appears to be more categorical in nature than continuous. Also a good candidate to be discretized)|
|euribor3m|- The euribor 3 month rate ranges from 0.63 to 5.04.|
|          |- 75% of the values are below 4.961.|
|          |- (appears to be a good candidate for discretization)|
|nr.employed|- The number of employees ranges from 4963.60 to 5228.10.|
|           |- 75% of the values are at or below 5228.1.|
|           |- (appears to be a good candidate for discretization)|


Now we look at the numeric features vs the target:

In [None]:
print('')
print('Figure 2: Numeric Features vs Response Variable Summary')
print('')
eda_num_vs_target_binary(df)

|variables| Observations        |
|:---------|:---------|
|age      |- It appears that most clients who subscribe to a term deposit are somewhere between 28 and 38 years old. The boxplot does not show a significant difference between the ages of clients who subscribed and clients who didn't. However, it appears that clients in the 60+ age group are more likely to subscribe to a term deposit. To make the age groups more interpretable, I will slice the age into bins of 10 years.|
|campaign |- For a number of contacts equal to 1, 49.5% of the clients subscribed to a term deposit, and it appears that the more contacts, the fewer subscriptions. Furthermore, calling a client more than 10 times seems excessive.|
|pdays    |- Not much info from the plots. This feature will be encoded and further analysed in the 'categorical data analysis' section.|
|previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m and nr.employed|- These will also be discretized.|

Let's look now at the correlations between numeric variables:

In [None]:
print('')
print('Figure 3: Correlation matrix numeric features')
print('')
df2 = df.copy()
df2['y_num'] = df2['y'].map({'yes':1,'no':0})
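# restrict to numeric columns: pandas >= 2.0 raises on object columns instead of dropping them
df_corr = df2.select_dtypes(include='number').corr()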
fig = ff.create_annotated_heatmap(
    z=df_corr.values,
    x=list(df_corr.columns),
    y=list(df_corr.index),
    annotation_text=df_corr.round(2).values,
    showscale=False)
fig.update_layout(template='plotly_dark')
fig.show()
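Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below uses synthetic columns (not the bank data), and the 0.9 threshold is an arbitrary choice; it keeps the upper triangle of the correlation matrix so that each pair appears only once.

```python
import numpy as np
import pandas as pd

# Synthetic columns (not the bank data): b closely follows a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({'a': a, 'b': a + 0.1 * rng.normal(size=500), 'c': rng.normal(size=500)})

corr = toy.corr()

# Keep the upper triangle only (k=1 skips the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()                     # Series indexed by (row, column)
high_pairs = pairs[pairs.abs() > 0.9]     # threshold is an arbitrary choice
print(high_pairs)
```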

The correlations between the numeric features and the response are very low. However, there are three pairs of correlations above 0.9: r(emp.var.rate,nr.employed) = 0.91, r(emp.var.rate,euribor3m) = 0.97, and r(euribor3m,nr.employed) = 0.95. Let's take a closer look at these features:

In [None]:
print('')
print('Figure 4: Scatter plot highest correlated features')
print('')
fig1 = px.scatter(df,x='emp.var.rate',y='nr.employed',render_mode='webgl',trendline='ols',trendline_color_override='yellow')
fig2 = px.scatter(df,x='emp.var.rate',y='euribor3m',render_mode='webgl',trendline='ols',trendline_color_override='green')
fig3 = px.scatter(df,x='nr.employed',y='euribor3m',render_mode='webgl',trendline='ols',trendline_color_override='orange')

trace1 = fig1['data'][0]
trace1_trend = fig1['data'][1]
trace2 = fig2['data'][0]
trace2_trend = fig2['data'][1]
trace3 = fig3['data'][0]
trace3_trend = fig3['data'][1]


fig = make_subplots(rows=1,cols=3,subplot_titles=('r = 0.91','r = 0.97','r = 0.95'))
fig.add_trace(trace1,row=1,col=1)
fig.add_trace(trace1_trend,row=1,col=1)
fig.add_trace(trace2,row=1,col=2)
fig.add_trace(trace2_trend,row=1,col=2)
fig.add_trace(trace3,row=1,col=3)
fig.add_trace(trace3_trend,row=1,col=3)

fig['layout']['xaxis']['title']='emp.var.rate'
fig['layout']['yaxis']['title']='nr.employed'
fig['layout']['xaxis2']['title']='emp.var.rate'
fig['layout']['yaxis2']['title']='euribor3m'
# third panel is fig3: x='nr.employed', y='euribor3m'
fig['layout']['xaxis3']['title']='nr.employed'
fig['layout']['yaxis3']['title']='euribor3m'

fig.update_layout(template='plotly_dark')

fig.show()



So we see that these correlation coefficients are misleading, particularly for the scatter plots of (emp.var.rate vs euribor3m) and (euribor3m vs nr.employed). A scatter plot does not appear to be the best representation of this data: the points are packed into columns, and many of them overlap. The scatter plots again suggest that this data is more categorical in nature than continuous.
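One quick way to back up the impression that a numeric column behaves like a categorical one is to count its distinct values. This is a sketch on made-up values (the column names are hypothetical): an indicator published per period repeats a few values many times, whereas a per-client measurement does not.

```python
import pandas as pd

# Made-up values: 'indicator' repeats a few levels, 'measurement' is all distinct
toy = pd.DataFrame({
    'indicator': [1.1, 1.1, 1.4, 1.4, -1.8, -1.8, 1.1, 1.4],
    'measurement': [23, 57, 41, 38, 62, 29, 45, 33],
})

summary = pd.DataFrame({
    'n_unique': toy.nunique(),
    'unique_ratio': toy.nunique() / len(toy),   # close to 0 suggests a discrete variable
})
print(summary)
```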

## Categorical variables Summary

Let's start by discretizing the above mentioned variables with the help of pipelines (since all the transformations made in the train set have to be repeated in the test set):

In [None]:
# copy of df
df_cat = df.copy()

|Variables|Categorization|
|:---------|:--------------|
|campaign |categorized into (one,two,three,more than three), since about 81.6% of the data falls within the first three levels.          |
|pdays    |categorized into ('contacted', 'never contacted') where 999 = 'never contacted'.|
|previous |categorized into ('contacted', 'never contacted'), where 0 = 'never contacted'.|
|emp.var.rate| categorized into (-inf,-1.8],(-1.8,-0.1],(-0.1,inf), per observation of the histogram|
|cons.price.idx|equal-width binning into 3 bins, since this feature is highly multimodal|
|cons.conf.idx|quantile binning since this feature is highly multimodal|
|euribor3m|quantile binning since this feature is highly multimodal|
|nr.employed|quantile binning since this feature is highly multimodal|


In [None]:
# campaign
value_map_campaign = lambda x: 'one' if x==1 else 'two' if x==2 else 'three' if x==3 else 'more than three'
# pdays
value_map_pdays = lambda x: 'never contacted' if x==999 else 'contacted'
# previous
value_map_previous = lambda x: 'never contacted' if x==0 else 'contacted'
# emp.var.rate
bin_map_emp = [-np.inf,-1.8,-0.1,np.inf]
# cons.price.idx (equal-width bins)
result,bins_cons_price = pd.cut(df_cat['cons.price.idx'],bins=3,right=True,retbins=True)
bins_cons_price = np.round(bins_cons_price,3)
# cons.conf.idx
result,bins_cons_conf = pd.qcut(df_cat['cons.conf.idx'],q=5,retbins=True)
bins_cons_conf = np.round(bins_cons_conf,3)
# euribor3m
results,bins_euribor = pd.qcut(df_cat['euribor3m'],q=5,retbins=True)
bins_euribor = np.round(bins_euribor,3)
# nr.employed
results,bins_employed = pd.qcut(df_cat['nr.employed'],q=5,retbins=True,duplicates = 'drop')
bins_employed = np.round(bins_employed,3)
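The cell above mixes two binning strategies: `pd.cut` with an integer `bins` builds equal-width intervals, while `pd.qcut` aims for equally populated ones. The following sketch, on made-up values with many ties, shows how the edges are obtained with `retbins=True` and why `duplicates='drop'` can be needed for `pd.qcut`.

```python
import pandas as pd

# Made-up, skewed values with many ties (most observations share one value)
values = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 10, 20])

# Equal-width: 3 intervals of the same length over the observed range
_, width_edges = pd.cut(values, bins=3, retbins=True)

# Quantile-based: aims for equally populated bins; tied quantiles produce
# repeated edges, which duplicates='drop' removes (so fewer bins may come back)
_, quantile_edges = pd.qcut(values, q=5, retbins=True, duplicates='drop')

print(width_edges)
print(quantile_edges)
```

With heavily tied data the quantile edges collapse, so the number of bins actually returned should be checked before slicing the edge array.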

Let's also bin the 'age' column:

In [None]:
# the youngest clients are 17, so the first group is labelled '<30' rather than '[18,30)'
value_map_age = lambda x: '<30' if x<30 else '[30,40)' if x<40 else '[40,50)' if x<50 else '[50,60)' if x<60 else '60+'

In [None]:
# pipelines
# pdp.Bin receives the bin edges without the lowest one (hence the [1:...] slices)
pipeline = pdp.MapColVals('campaign',value_map_campaign,drop=True)
pipeline += pdp.MapColVals('pdays',value_map_pdays,drop=True)
pipeline += pdp.MapColVals('previous',value_map_previous,drop=True)
pipeline += pdp.Bin({'emp.var.rate':bin_map_emp[1:3]},drop=True)
pipeline += pdp.Bin({'cons.price.idx':bins_cons_price[1:5]},drop=True)
pipeline += pdp.Bin({'cons.conf.idx':bins_cons_conf[1:5]},drop=True)
pipeline += pdp.Bin({'euribor3m':bins_euribor[1:5]},drop=True)
pipeline += pdp.Bin({'nr.employed':bins_employed[1:5]},drop=True)
pipeline += pdp.MapColVals('age',value_map_age,drop=True)
df_cat = pipeline(df_cat)
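The point of fixing the edges on the training data is that exactly the same bins can later be applied to the test set. As an illustration of that idea only, here is a plain pandas sketch (not the pdpipe pipeline used above) on made-up values: the edges are learned on the train values, the outer edges are opened to infinity, and the test values are binned without refitting.

```python
import numpy as np
import pandas as pd

# Made-up train/test values; the test set contains values outside the train range
train_vals = pd.Series([0.6, 0.9, 1.3, 4.1, 4.9, 5.0])
test_vals = pd.Series([0.5, 1.3, 5.2])

# Learn the edges on the training data only
_, edges = pd.qcut(train_vals, q=3, retbins=True)
# Replace the outer edges with +/-inf so unseen extremes still fall into a bin
edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

train_binned = pd.cut(train_vals, bins=edges)
test_binned = pd.cut(test_vals, bins=edges)   # same edges, no refitting on test
print(test_binned)
```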

In [None]:
df_cat.head()

In [None]:
#df_cat.columns

In [None]:
print('')
print('Figure 5: Categorical Features vs Response Variable Summary')
print('')
print("Remark: the width of the bars in the 'Target proportions by Category (%)' bar plot is proportional to the size of the classes in the dataset")
print('')
eda_cat_vs_target_binary(df_cat)

In [None]:
data['default'].value_counts()


|Variables|Observations|
|:---------|:-------|
|`age`     |- Most of the clients are between 30 and 40 years old (41%), followed by clients between 40 and 50 (25.62%). However, the clients most likely to subscribe to a term deposit are aged 60 or above. It is important to note that clients aged 60 or above comprise only 2.91% of the observations (959 out of 32950);|
|          |- This feature appears to impact the target.|
|`job`      |- Most of the clients have jobs as admin (25.21%), blue-collar (22.66%) and technician (16.41%);|
|         |- The job categories that subscribed the most to a term deposit are students (32.75%) and retired (25.04%). Note that students and retired clients make up only 2.08% and 0.83% of the observations, respectively;|
|         |- This variable appears to impact the target.|
|`marital`  |- The majority of the clients are married (60.61%), followed by single clients (about 28%);|
|         |- The marital status does not appear to greatly affect the likelihood of a client subscribing to a term deposit.|
|`education`|- Most of the clients have a university degree (29.4%), followed by clients with a high-school education (23.17%) and basic.9y (14.72%);|
|         |- The level 'illiterate' has only 0.04% of the observations, thus this level will be dropped;|
|         |- This variable does not appear to impact the target.|
|`default`: has credit in default?  |- Only three clients have a credit in default;|
|         |- It appears that clients with no credit in default are more likely to subscribe to a term deposit than the ones that didn't want to share that information;|
|         |- This feature appears to impact the response.|
|`housing`: has housing loan?  |- The proportions of clients with and without a housing loan are very similar, respectively 52.5% and 45.1%;|
|         |- This feature does not appear to impact the target.|
|`loan`: has personal loan?     |- The majority of the clients do not have a personal loan (82.5%);|
|         |- This feature does not appear to impact the target.|
|`contact`  |- More than half of the clients were contacted via cellular (63.6%);|
|         |- The clients that were contacted by cellular appear to be more likely to subscribe to a term deposit (about 15%) compared to the clients that were contacted via telephone (about 5%), maybe because a client with a cell phone is easier to reach;|
|         |- This feature appears to impact the target.|
|`month`   |- Most of the clients were contacted in May (33.45%), July (17.39%) and August (about 15%);|
|         |- This feature appears to impact the target.|
|`day_of_week`|- The distribution of the days of the week when the client was contacted does not vary much;|
|           |- This feature does not appear to impact the target.|
|`campaign`  |- Most of the clients were contacted one time (43.1%) or two times (25.6%);|
|          |- The likelihood of the clients subscribing to a term deposit does not vary much with the number of contacts. For one contact 13% subscribed to a term deposit, and for two contacts 11% did. Still not sure about the impact of this variable on the target.|
|`pdays`: number of days that passed by after the client was last contacted from a previous campaign     |- Clients that were never contacted are about 93.4% of the observations. However, the chances of subscribing to a term deposit are much higher for clients that were recontacted (64%);|
|          |- This feature appears to impact the target.|
|`previous`: number of contacts performed before this campaign and for this client  |- 86.37% of the clients were never contacted. And again, 27% of the clients that were contacted before this campaign subscribed to a term deposit, in contrast to 9% of the never contacted clients;|
|          |- This feature appears to impact the target.|
|`poutcome`: outcome of the previous marketing campaign  |- For most of the clients the previous campaign was 'nonexistent' (86.37%);|
|                                                      |- Clearly, if the previous campaign was a success, the client will tend to subscribe again (about 66% did!);|
|                                                      |- This feature appears to impact the target.|
|`emp.var.rate`|- Although the great majority of the contacts were made when the 'emp.var.rate' was more than or equal to -0.1, the most subscriptions (39%) occurred when the 'emp.var.rate' was less than -1.8;|
|              |- This feature appears to impact the target.|
|`cons.price.idx`|- Although the great majority of the contacts were made for a 'cons.price.idx' between 93.912 and 94.767, the most subscriptions occurred for a 'cons.price.idx' less than 93.056;|
|                |- This feature appears to impact the target.|
|`cons.conf.idx`|- Most of the contacts were made when the 'cons.conf.idx' was less than or equal to -36.4 (37.8%), followed by when the 'cons.conf.idx' was between -46.2 and -42.0 (30.4%). However, most of the subscriptions occurred when the 'cons.conf.idx' was between -40.0 and -36.4;|
|               |- This feature appears to impact the target.|
|`euribor3m`|- Most of the subscriptions occurred when the 'euribor3m' was less than 1.299;|
|           |- This feature appears to impact the target.|
|`nr.employed`|- Most of the contacts were made when the 'nr.employed' was higher than or equal to 5228.1 (more people available to make the phone calls?). However, most of the clients subscribed to a term deposit when the 'nr.employed' was less than 5099.1 (fewer employees but more motivated? Fear of being fired?);|
|             |- This feature appears to impact the target.|


From the above figures we can also see that some features have a class named 'unknown'. Because this dataset is the result of phone calls, it is quite possible that some clients didn't want to answer some questions, so some (or all) of these missing values are not missing at random. To decide between imputation and deletion of the missing values, I will first build a contingency table of each feature vs the response and calculate the chi-squared statistic as well as the p-value, to check for independence. If a feature and the response are dependent, we can't ignore the missing values, and they have to be imputed using the rest of the dataset.

In [None]:
cross_tab_chi_sqrt_p_value(df_cat,'unknown',df_cat['y'],0.99)

All the expected frequencies are above 5, which means that it's acceptable to use the chi-squared test.
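For readers without access to the user-defined `cross_tab_chi_sqrt_p_value` helper, the underlying test can be sketched with `pd.crosstab` and `scipy.stats.chi2_contingency`. This is not the helper's implementation, only an illustration on made-up counts; the significance level of 0.01 is an assumption chosen to correspond to a 99% confidence level.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: 'known'/'unknown' feature values vs the response
toy = pd.DataFrame({
    'feature': ['known'] * 100 + ['unknown'] * 100,
    'y': ['no'] * 80 + ['yes'] * 20 + ['no'] * 95 + ['yes'] * 5,
})

table = pd.crosstab(toy['feature'], toy['y'])
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.01   # significance level; an assumption matching a 99% confidence level
print('chi2 = {:.2f}, p-value = {:.4f}, dof = {}'.format(chi2, p_value, dof))
print('all expected counts >= 5:', (expected >= 5).all())
print('dependent' if p_value < alpha else 'independent')
```

The `expected` array returned by the test is what the rule of thumb about frequencies above 5 refers to.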

The only features that are dependent on the target are 'education' and 'default'. Regarding 'education', there are only 4.2% missing values, so this feature will be imputed using the rest of the dataset. Regarding 'default', there are only 3 clients with credit in default (Figure 5) and quite a large number of 'unknown' values, so imputation does not seem to me the right way to go, and I will keep the 'unknown' class.

For the independent variables: 
- 'job': the class with the highest count is admin (25.21%) so I will add the 'unknown' to this category
- 'marital': the class with the highest count is married (60.61%) so I will add the 'unknown' to this category
- 'loan' : the class with the highest count is no (82.48%) so adding 2.4% of the 'unknown' here will hopefully not make a huge difference.
- 'housing': for this one I will use the whole dataset to impute the values, since there is no big difference between no (45.10%) and yes (52.20%)
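Adding the 'unknown' values to the most frequent class, as planned for 'job', 'marital' and 'loan', could look like the sketch below. The series is made up for illustration; in the project the most frequent level would be determined on the training set and then reused for the test set.

```python
import pandas as pd

# Made-up column with an 'unknown' level
toy = pd.Series(['married', 'single', 'married', 'unknown', 'married', 'divorced'],
                name='marital')

# Most frequent level among the known values
most_frequent = toy[toy != 'unknown'].mode()[0]
imputed = toy.replace('unknown', most_frequent)
print(imputed.value_counts())
```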

For now I will stop with EDA and move on to modelling (part II of this project can be found [here](./Bank-Marketing-Part-II-Modelling.ipynb)). 
<br>
But first let's do a quick summary of the dataset:
- The original dataset contains 41188 observations and 21 variables (10 numeric and 11 categorical);
- Before EDA the dataset was split into training data and test data, and only the training data was analyzed. The split took into consideration the fact that the dataset is highly imbalanced (88.73% of the bank clients didn't subscribe to a term deposit, while 11.27% did);
- The feature 'duration' was dropped before EDA due to data leakage;
- There are 1227 (3.72%) duplicated rows. These duplicates were not removed, but I will test during modelling whether removing them improves the result;
- There were some outliers. However, the dataset is now completely categorical, so there is no need to worry about them for now.