In [42]:
import numpy as np
import pandas as pd
import plotly.express as px

# Bank Marketing Data

## Problem Statement

The data relates to **direct marketing campaigns** of a Portuguese banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

The classification goal is to predict whether the client will subscribe to a term deposit (yes/no; variable y).



In [7]:
bank_data=pd.read_csv("D:/Dataset/bank-additional/bank-additional/bank-additional-full.csv", sep=";")

## Dataset Description

Input variables: there are 20 input variables in this dataset.

##### bank client data:
1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

##### related to the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
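
The note on `duration` can be turned into a small preprocessing step. The sketch below assumes the goal is a realistic predictive model, so the column is dropped before training; the helper name `drop_leaky_columns` and the toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def drop_leaky_columns(df: pd.DataFrame, leaky=("duration",)) -> pd.DataFrame:
    """Return a copy of df without columns that are unknown before a call is made."""
    present = [col for col in leaky if col in df.columns]
    return df.drop(columns=present)


# Tiny made-up frame, used only to illustrate the call.
sample = pd.DataFrame({"age": [30, 45], "duration": [120, 0], "y": ["yes", "no"]})
features = drop_leaky_columns(sample)
print(list(features.columns))
```

For benchmark runs, the original frame can be used unchanged, since `drop` returns a new frame and leaves `sample` intact.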

##### other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
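
Because `pdays` uses 999 as a sentinel for "not previously contacted", treating it as a plain number mixes real day counts with a placeholder. One possible way to separate the two is sketched below; the column names `previously_contacted` and `pdays_clean` are assumptions made for illustration, and the rows are made up.

```python
import numpy as np
import pandas as pd


def split_pdays(df: pd.DataFrame, sentinel: int = 999) -> pd.DataFrame:
    """Add a 0/1 contact flag and a pdays column with the sentinel replaced by NaN."""
    out = df.copy()
    contacted = out["pdays"] != sentinel
    out["previously_contacted"] = contacted.astype(int)
    # Keep real day counts, blank out the sentinel.
    out["pdays_clean"] = out["pdays"].where(contacted, np.nan)
    return out


# Made-up values for illustration only.
sample = pd.DataFrame({"pdays": [999, 3, 999, 6]})
result = split_pdays(sample)
print(result)
```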

##### social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
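
Since `y` is a binary target, it is usually worth checking how balanced the two classes are before modelling. A minimal sketch, using a made-up series rather than the real column (the helper name `class_balance` is hypothetical):

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.Series:
    """Return the share of each class in the target column."""
    return target.value_counts(normalize=True)


# Made-up labels for illustration; on the notebook data this would be bank_data['y'].
sample_y = pd.Series(["no", "no", "no", "yes"], name="y")
balance = class_balance(sample_y)
print(balance)
```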

Below are a few sample rows from the dataset.

In [8]:
bank_data.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


The dataset contains 41188 samples and 21 columns in total.

In [9]:
bank_data.shape

(41188, 21)

## Univariate Analysis 

We look at the statistical summary of each column in the dataset.

In [10]:
bank_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
age,41188,,,,40.0241,10.4212,17.0,32.0,38.0,47.0,98.0
job,41188,12.0,admin.,10422.0,,,,,,,
marital,41188,4.0,married,24928.0,,,,,,,
education,41188,8.0,university.degree,12168.0,,,,,,,
default,41188,3.0,no,32588.0,,,,,,,
housing,41188,3.0,yes,21576.0,,,,,,,
loan,41188,3.0,no,33950.0,,,,,,,
contact,41188,2.0,cellular,26144.0,,,,,,,
month,41188,10.0,may,13769.0,,,,,,,
day_of_week,41188,5.0,thu,8623.0,,,,,,,


From the above table we can conclude that:

1. Ages range from 17 to 98, with an average age of about 40.
2. job has 12 unique values, with 'admin.' the most frequent (10422 rows).
3. marital has 4 unique values, and most users are married (24928 of 41188).
4. education has 8 distinct values, and 'university.degree' is the most common level (12168 rows).
5. default has 3 unique values, and most users have no credit in default.
6. Slightly more than half of the users have a housing loan (21576 of 41188), while most have no personal loan (33950 answered 'no').
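
Several categorical columns include an 'unknown' level, which `describe()` counts as an ordinary value. The sketch below shows one way to measure how much of each column is 'unknown'; the helper name `unknown_share` and the toy rows are assumptions for illustration.

```python
import pandas as pd


def unknown_share(df: pd.DataFrame, columns, label: str = "unknown") -> pd.Series:
    """Return the fraction of rows equal to `label` for each listed column."""
    return (df[list(columns)] == label).mean().sort_values(ascending=False)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "default": ["no", "unknown", "no", "unknown"],
    "housing": ["yes", "no", "unknown", "yes"],
    "age": [30, 41, 52, 28],
})
shares = unknown_share(sample, ["default", "housing"])
print(shares)
```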

## Age

Age distribution of users

In [13]:
bank_data[['age']].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,41188.0,40.02406,10.42125,17.0,32.0,38.0,47.0,98.0


In [16]:
px.histogram(bank_data, x='age')

So, age is right-skewed: the largest concentration of users is in the 30-40 age group, and the average age is about 40 (median 38).
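
The skew can also be checked numerically instead of by eye. A mean above the median and a positive sample skewness both point to a longer right tail. The sketch below uses made-up ages; on the notebook data the input would be `bank_data['age']`, and the helper name `skew_summary` is hypothetical.

```python
import pandas as pd


def skew_summary(values: pd.Series) -> dict:
    """Collect mean, median and sample skewness of a numeric column."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),
    }


# Made-up ages with a long right tail, for illustration only.
sample_age = pd.Series([25, 30, 31, 33, 35, 38, 40, 45, 60, 88])
summary = skew_summary(sample_age)
print(summary)
```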


## Job

Job distribution in our dataset

In [17]:
bank_data[['job']].describe().T

Unnamed: 0,count,unique,top,freq
job,41188,12,admin.,10422


In [18]:
px.histogram(bank_data, x='job')

So, the two largest job groups are admin. and blue-collar.

## Marital Status

Marital status distribution in the dataset

In [19]:
bank_data[['marital']].describe().T

Unnamed: 0,count,unique,top,freq
marital,41188,4,married,24928


In [20]:
px.histogram(bank_data, x='marital')

We can observe that most users are married.

## Education

We look at the distribution of education among our targeted users.

In [21]:
bank_data[['education']].describe().T

Unnamed: 0,count,unique,top,freq
education,41188,8,university.degree,12168


In [22]:
px.histogram(bank_data, x='education')

We can see that the largest groups hold a university degree or a high school education, and very few users (around 18) are illiterate.
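
Rare levels such as 'illiterate' are hard to read off a histogram, so a small count table can help. This is a sketch on made-up values; the helper name `level_table` is hypothetical, and on the notebook data the input would be `bank_data['education']`.

```python
import pandas as pd


def level_table(column: pd.Series) -> pd.DataFrame:
    """Return the count and share of each level, most frequent first."""
    counts = column.value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(column)})


# Made-up values for illustration only.
sample_edu = pd.Series(
    ["university.degree"] * 5 + ["high.school"] * 4 + ["illiterate"]
)
table = level_table(sample_edu)
print(table)
```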


## Default, Housing Loan and Personal Loan

We look at the distributions of credit default, housing loan, and personal loan.

In [24]:
bank_data[['default']].describe().T

Unnamed: 0,count,unique,top,freq
default,41188,3,no,32588


In [27]:
px.histogram(bank_data, x='default')

In [25]:
bank_data[['housing']].describe().T

Unnamed: 0,count,unique,top,freq
housing,41188,3,yes,21576


In [28]:
px.histogram(bank_data, x='housing')

In [26]:
bank_data[['loan']].describe().T

Unnamed: 0,count,unique,top,freq
loan,41188,3,no,33950


In [29]:
px.histogram(bank_data, x='loan')

We can see that slightly more than half of the users have a housing loan, most have no personal loan, and most of the targeted users have no credit in default.
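
The separate histograms do not show how the two loan types combine. A cross-tabulation is one way to see, for example, what share of users has a housing loan but no personal loan. The sketch below uses made-up rows, and the helper name `loan_overlap` is hypothetical.

```python
import pandas as pd


def loan_overlap(df: pd.DataFrame) -> pd.DataFrame:
    """Share of all rows in each housing/loan combination."""
    return pd.crosstab(df["housing"], df["loan"], normalize="all")


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "housing": ["yes", "yes", "yes", "no", "no"],
    "loan": ["no", "no", "yes", "no", "yes"],
})
overlap = loan_overlap(sample)
print(overlap)
```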

## Data Preprocessing

We create dummy variables for the following columns:

1. marital
2. job
3. education
4. poutcome

In [30]:
bank_data_new = pd.get_dummies(bank_data , columns=['marital','job','education','poutcome'])

In [31]:
bank_data_new.head()

Unnamed: 0,age,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,...,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown,poutcome_failure,poutcome_nonexistent,poutcome_success
0,56,no,no,no,telephone,may,mon,261,1,999,...,0,0,0,0,0,0,0,0,1,0
1,57,unknown,no,no,telephone,may,mon,149,1,999,...,0,0,1,0,0,0,0,0,1,0
2,37,no,yes,no,telephone,may,mon,226,1,999,...,0,0,1,0,0,0,0,0,1,0
3,40,no,no,no,telephone,may,mon,151,1,999,...,1,0,0,0,0,0,0,0,1,0
4,56,no,no,yes,telephone,may,mon,307,1,999,...,0,0,1,0,0,0,0,0,1,0
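
To make the effect of `pd.get_dummies` easier to see, here is the same call on a tiny made-up frame. Each level becomes its own `<column>_<level>` indicator column and the original column is removed. Passing `dtype=int` is an addition here: recent pandas versions return booleans by default, and `dtype=int` gives the 0/1 values shown in the output above.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "age": [30, 41, 52],
    "marital": ["married", "single", "married"],
})

# One indicator column per level of 'marital'; 'age' is left untouched.
encoded = pd.get_dummies(sample, columns=["marital"], dtype=int)
print(encoded)
```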


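The remaining two-level variables are encoded as 0/1:

1. default
2. housing
3. loan
4. contact

For default, housing and loan, 'yes' becomes 1 and every other value (including 'unknown') becomes 0. For contact, 'cellular' becomes 1 and 'telephone' becomes 0. The target y is encoded the same way ('yes' becomes 1).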

In [37]:
# Encode from the untouched bank_data frame so that re-running this cell is safe.
# Comparing already-encoded integer columns with 'yes' raises the
# "elementwise comparison failed" warning and sets every value to 0.
bank_data_new['default'] = np.where(bank_data['default'] == 'yes', 1, 0)
bank_data_new['housing'] = np.where(bank_data['housing'] == 'yes', 1, 0)
bank_data_new['loan'] = np.where(bank_data['loan'] == 'yes', 1, 0)
bank_data_new['contact'] = np.where(bank_data['contact'] == 'cellular', 1, 0)
bank_data_new['y'] = np.where(bank_data['y'] == 'yes', 1, 0)

In [38]:
bank_data_new.head()

Unnamed: 0,age,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,...,education_basic.6y,education_basic.9y,education_high.school,education_illiterate,education_professional.course,education_university.degree,education_unknown,poutcome_failure,poutcome_nonexistent,poutcome_success
0,56,0,0,0,0,may,mon,261,1,999,...,0,0,0,0,0,0,0,0,1,0
1,57,0,0,0,0,may,mon,149,1,999,...,0,0,1,0,0,0,0,0,1,0
2,37,0,1,0,0,may,mon,226,1,999,...,0,0,1,0,0,0,0,0,1,0
3,40,0,0,0,0,may,mon,151,1,999,...,1,0,0,0,0,0,0,0,1,0
4,56,0,0,1,0,may,mon,307,1,999,...,0,0,1,0,0,0,0,0,1,0
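
The `np.where(... == 'yes', 1, 0)` encoding treats 'unknown' the same as 'no'. If that distinction matters, one alternative is to map only the known answers and leave everything else as missing. This is a sketch of that option, not the approach used in this notebook; the helper name `encode_yes_no` and the sample values are hypothetical.

```python
import pandas as pd


def encode_yes_no(column: pd.Series, keep_unknown: bool = True) -> pd.Series:
    """Map 'yes'/'no' to 1/0; other values become NaN unless collapsed to 0."""
    encoded = column.map({"yes": 1, "no": 0})  # values not in the dict become NaN
    if not keep_unknown:
        # Same result as the np.where encoding: everything that is not 'yes' is 0.
        encoded = encoded.fillna(0).astype(int)
    return encoded


# Made-up values for illustration only.
sample_default = pd.Series(["no", "unknown", "yes"])
with_nan = encode_yes_no(sample_default)
collapsed = encode_yes_no(sample_default, keep_unknown=False)
print(with_nan.tolist(), collapsed.tolist())
```

Keeping the missing marker leaves the choice of imputation, or a separate 'unknown' indicator, to a later step.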


In [39]:
bank_data_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 44 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41188 non-null  int64  
 1   default                        41188 non-null  int32  
 2   housing                        41188 non-null  int32  
 3   loan                           41188 non-null  int32  
 4   contact                        41188 non-null  int32  
 5   month                          41188 non-null  object 
 6   day_of_week                    41188 non-null  object 
 7   duration                       41188 non-null  int64  
 8   campaign                       41188 non-null  int64  
 9   pdays                          41188 non-null  int64  
 10  previous                       41188 non-null  int64  
 11  emp.var.rate                   41188 non-null  float64
 12  cons.price.idx                 41188 non-null 

In [41]:
import seaborn as sns

ModuleNotFoundError: No module named 'seaborn'