###  Bank Marketing (Campaign) -- Group Project

**Problem Statement:**

ABC Bank wants to sell its term deposit product to customers. Before launching the product, the bank wants to develop a model that helps it understand whether a particular customer will buy the product, based on the customer's past interactions with the bank or other financial institutions.

**Why ML Model:** The bank wants to use an ML model to shortlist the customers who are most likely to buy the product, so that its marketing channels (telemarketing, SMS/email marketing, etc.) can focus only on those customers.

This will save resources and time, both of which feed directly into cost (resource billing).

Develop the model both with and without the duration feature and report the performance of each.

Using the duration feature is not recommended: the result would be difficult to explain to the business, and it would be difficult for the business to plan a campaign based on duration.

**Task:**

Business Understanding

Data understanding

Exploratory Data Analysis

Data Preparation

Model Building (Logistic Regression, ensemble, boosting, etc.)

Model Selection

Performance reporting

Deploy the model

Converting ML metrics into business metrics and explaining the results to the business

Prepare a presentation for non-technical audiences.
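As a rough illustration of the "ML metrics into business metrics" task, the sketch below turns confusion-matrix counts into campaign-level figures. The function name `campaign_summary`, the counts, and the unit economics (`cost_per_call`, `revenue_per_subscription`) are all hypothetical; none of them come from the dataset or from the bank.

```python
# Hypothetical confusion-matrix counts and unit economics; none of these
# numbers come from the dataset or from ABC Bank.
def campaign_summary(tp, fp, fn, tn, cost_per_call, revenue_per_subscription):
    """Translate confusion-matrix counts into campaign-level figures."""
    total_customers = tp + fp + fn + tn
    calls_made = tp + fp  # only customers the model shortlists are called
    precision = tp / calls_made if calls_made else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "calls_made": calls_made,
        "calls_avoided": total_customers - calls_made,
        "conversion_rate": precision,   # share of calls that end in a sale
        "subscribers_reached": recall,  # share of all buyers the campaign reaches
        "profit": tp * revenue_per_subscription - calls_made * cost_per_call,
    }

summary = campaign_summary(tp=60, fp=140, fn=40, tn=760,
                           cost_per_call=5.0, revenue_per_subscription=100.0)
print(summary)
```

Framed this way, precision reads as the conversion rate of the calls that are actually made, and recall as the share of potential subscribers the campaign reaches, which is usually easier to present to a non-technical audience.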

**Data Set Information:**

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

# Attribute Information:

# Input variables:
**bank client data:**
1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

**related to the last contact of the current campaign:**
8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

**other attributes:**
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

**social and economic context attributes**
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

# Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

In [1]:
#Let's import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns 

In [2]:
#Read the dataset
df=pd.read_csv(r"C:\Users\Lenovo\Desktop\İntership Projects\Seventh week\bank.csv", sep=';')

In [15]:
#Drop the duration column: as noted in the dataset information, it is not known before a call is made
df.drop('duration', axis=1, inplace=True)
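Because the task also asks for a model trained with duration as a benchmark, it may be convenient to keep both versions of the frame instead of dropping the column in place. A minimal sketch on a made-up three-row frame (the values and the names `df_with_duration` and `df_without_duration` are illustrative only):

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, only the column names
# follow the dataset.
sample = pd.DataFrame({
    "age": [30, 33, 35],
    "duration": [79, 220, 185],
    "campaign": [1, 1, 1],
    "y": ["no", "no", "yes"],
})

# Non-destructive alternative to inplace=True: the original frame survives,
# so a benchmark model with duration can still be trained later.
df_with_duration = sample.copy()
df_without_duration = sample.drop(columns="duration")

print(df_with_duration.columns.tolist())
print(df_without_duration.columns.tolist())
```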

## Exploratory Data Analysis (EDA)

In [16]:
#First 5 rows of dataset
df.head()

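| | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 1 | -1 | 0 | unknown | no |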


In [17]:
#Last 5 rows of dataset
df.tail()

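| | age | job | marital | education | default | balance | housing | loan | contact | day | month | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 5 | -1 | 0 | unknown | no |
| 4517 | 57 | self-employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 1 | -1 | 0 | unknown | no |
| 4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 11 | -1 | 0 | unknown | no |
| 4519 | 28 | blue-collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 4 | 211 | 3 | other | no |
| 4520 | 44 | entrepreneur | single | tertiary | no | 1136 | yes | yes | cellular | 3 | apr | 2 | 249 | 7 | other | no |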


In [18]:
#Shape of dataset
df.shape

(4521, 16)

In [20]:
#Columns of dataset
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'campaign', 'pdays', 'previous',
       'poutcome', 'y'],
      dtype='object')

In [21]:
#Summary statistics of the numeric columns
df.describe()

| | age | balance | day | campaign | pdays | previous |
|---|---|---|---|---|---|---|
| count | 4521.0 | 4521.0 | 4521.0 | 4521.0 | 4521.0 | 4521.0 |
| mean | 41.170095 | 1422.657819 | 15.915284 | 2.79363 | 39.766645 | 0.542579 |
| std | 10.576211 | 3009.638142 | 8.247667 | 3.109807 | 100.121124 | 1.693562 |
| min | 19.0 | -3313.0 | 1.0 | 1.0 | -1.0 | 0.0 |
| 25% | 33.0 | 69.0 | 9.0 | 1.0 | -1.0 | 0.0 |
| 50% | 39.0 | 444.0 | 16.0 | 2.0 | -1.0 | 0.0 |
| 75% | 49.0 | 1480.0 | 21.0 | 3.0 | -1.0 | 0.0 |
| max | 87.0 | 71188.0 | 31.0 | 50.0 | 871.0 | 25.0 |
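In these statistics the minimum and all three quartiles of `pdays` are -1, whereas the attribute description names 999 as the marker for "not previously contacted". Assuming that -1 plays the same role in this file (an assumption, not something the source states), the sketch below separates the flag from the day count so the sentinel does not distort the numeric scale. The names `NOT_CONTACTED`, `previously_contacted`, and `days_since_contact` are illustrative.

```python
import numpy as np
import pandas as pd

# pdays values of the first five rows shown by df.head()
pdays = pd.Series([-1, 339, 330, -1, -1], name="pdays")

NOT_CONTACTED = -1  # assumption: sentinel for "no previous contact" in this file

# Binary flag: 1 if the client was contacted in an earlier campaign.
previously_contacted = (pdays != NOT_CONTACTED).astype(int)

# Day count with the sentinel replaced by NaN (dtype becomes float).
days_since_contact = pdays.where(pdays != NOT_CONTACTED, np.nan)

print(previously_contacted.tolist())
print(days_since_contact.tolist())
```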


In [22]:
#Information about columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   object
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   object
 7   loan       4521 non-null   object
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  campaign   4521 non-null   int64 
 12  pdays      4521 non-null   int64 
 13  previous   4521 non-null   int64 
 14  poutcome   4521 non-null   object
 15  y          4521 non-null   object
dtypes: int64(6), object(10)
memory usage: 565.2+ KB


In [23]:
#Check for duplicate rows
df.duplicated().sum()

0

In [24]:
#Check missing and null values 
df.isnull().sum()

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
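`isnull()` only detects real NaN values. In this dataset several categorical columns use the literal label 'unknown' instead, so a zero count here does not mean every value is known. A small sketch of counting that label, using the first five rows of three columns from the `df.head()` output:

```python
import pandas as pd

# First five rows of three categorical columns, copied from the df.head() output.
sample = pd.DataFrame({
    "job": ["unemployed", "services", "management", "management", "blue-collar"],
    "contact": ["cellular", "cellular", "cellular", "unknown", "unknown"],
    "poutcome": ["unknown", "failure", "failure", "unknown", "unknown"],
})

# No NaN values at all in the sample.
nan_total = int(sample.isnull().sum().sum())
print(nan_total)

# Count the literal 'unknown' label per column instead.
unknown_counts = (sample == "unknown").sum()
print(unknown_counts)
```

Whether 'unknown' should be kept as its own category or treated as missing is a modelling decision; the notebook keeps it as a category.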

In [25]:
#Types of columns
df.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [30]:
# Find the numerical columns in the dataset
numeric_columns = df.select_dtypes(include='number').columns
numeric_columns

Index(['age', 'balance', 'day', 'campaign', 'pdays', 'previous'], dtype='object')

In [29]:
# Find the categorical columns in the dataset
categorical_columns = df.select_dtypes(include='object').columns
categorical_columns

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y'],
      dtype='object')

## Categorical Variables

There are two kinds of categorical variables in ML.

1. Nominal variables, whose order carries no meaning, such as colors.

2. Ordinal variables, whose order matters, such as Bachelor's, Master's, and Doctorate degrees.
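The difference shows up in how each kind is encoded. The sketch below is only an illustration on the first five `education` and `marital` values from `df.head()`; treating education as ordered (primary < secondary < tertiary) is an assumption made for the example and is not how the notebook encodes it.

```python
import pandas as pd

education = pd.Series(["primary", "secondary", "tertiary", "tertiary", "secondary"])

# Ordinal: declare the order explicitly, then use the integer codes.
# Assumption: primary < secondary < tertiary is the intended ranking.
ordered = pd.Categorical(education,
                         categories=["primary", "secondary", "tertiary"],
                         ordered=True)
education_codes = pd.Series(ordered.codes)

# Nominal: one indicator column per category, no order implied.
marital = pd.Series(["married", "married", "single", "married", "married"])
marital_dummies = pd.get_dummies(marital, prefix="marital")

print(education_codes.tolist())
print(marital_dummies.columns.tolist())
```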

**Ordinal and binary variables**

**poutcome**

In [31]:
df['poutcome'].value_counts()

unknown    3705
failure     490
other       197
success     129
Name: poutcome, dtype: int64

In [43]:
df['poutcome']=df['poutcome'].map({'success' :1,'failure': -1, 'other' :0,'unknown' :0})

In [44]:
df['poutcome'].value_counts()

 0    3902
-1     490
 1     129
Name: poutcome, dtype: int64
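`Series.map` with a dictionary turns every value that is missing from the dictionary into NaN, so an unexpected label would silently become a missing value. A minimal guard, sketched on a made-up five-value series with the same mapping:

```python
import pandas as pd

poutcome = pd.Series(["unknown", "failure", "other", "success", "unknown"])
mapping = {"success": 1, "failure": -1, "other": 0, "unknown": 0}

# Labels present in the data but absent from the mapping would become NaN.
unmapped = set(poutcome.unique()) - set(mapping)
assert not unmapped, f"labels without a code: {unmapped}"

encoded = poutcome.map(mapping)
print(encoded.tolist())
```

Note that this mapping merges 'other' and 'unknown' into the single code 0, which is why the count for 0 is 3705 + 197 = 3902.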

**default**

In [32]:
df['default'].value_counts()

no     4445
yes      76
Name: default, dtype: int64

In [35]:
df['default']=df['default'].map({'yes' : 1, 'no' : 0})

In [36]:
df['default'].value_counts()

0    4445
1      76
Name: default, dtype: int64

**housing**

In [33]:
df['housing'].value_counts()

yes    2559
no     1962
Name: housing, dtype: int64

In [37]:
df['housing']=df['housing'].map({'yes' : 1, 'no' :0})

In [38]:
df['housing'].value_counts()

1    2559
0    1962
Name: housing, dtype: int64

**loan**

In [34]:
df['loan'].value_counts()

no     3830
yes     691
Name: loan, dtype: int64

In [39]:
df['loan']=df['loan'].map({'yes' :1, 'no' : 0})

In [40]:
df['loan'].value_counts()

0    3830
1     691
Name: loan, dtype: int64

## Nominal Variables

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        4521 non-null   int64 
 1   job        4521 non-null   object
 2   marital    4521 non-null   object
 3   education  4521 non-null   object
 4   default    4521 non-null   int64 
 5   balance    4521 non-null   int64 
 6   housing    4521 non-null   int64 
 7   loan       4521 non-null   int64 
 8   contact    4521 non-null   object
 9   day        4521 non-null   int64 
 10  month      4521 non-null   object
 11  campaign   4521 non-null   int64 
 12  pdays      4521 non-null   int64 
 13  previous   4521 non-null   int64 
 14  poutcome   4521 non-null   int64 
 15  y          4521 non-null   int64 
dtypes: int64(11), object(5)
memory usage: 565.2+ KB


In [53]:
nominal_variables=['job','marital','education','contact','month','day']

## Target Variable

In [45]:
df['y'].value_counts()

no     4000
yes     521
Name: y, dtype: int64

In [46]:
df['y']=df['y'].map({'yes' :1,'no' :0})

In [47]:
df['y'].value_counts()

0    4000
1     521
Name: y, dtype: int64

## One Hot Encoding

In [54]:
df=pd.get_dummies(df,columns=nominal_variables)

In [55]:
df.head()

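| | age | default | balance | housing | loan | campaign | pdays | previous | poutcome | y | ... | day_22 | day_23 | day_24 | day_25 | day_26 | day_27 | day_28 | day_29 | day_30 | day_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 0 | 1787 | 0 | 0 | 1 | -1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 33 | 0 | 4789 | 1 | 1 | 1 | 339 | 4 | -1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 35 | 0 | 1350 | 1 | 0 | 1 | 330 | 1 | -1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 30 | 0 | 1476 | 1 | 1 | 4 | -1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 59 | 0 | 0 | 1 | 0 | 1 | -1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |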


In [57]:
df.shape

(4521, 75)
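The jump from 16 to 75 columns can be checked by hand: each encoded column is replaced by one indicator per category. The category counts below are read off the `df.columns` listing and the day range 1 to 31 reported by `df.describe()`, assuming every day of the month occurs at least once.

```python
# Number of distinct categories per one-hot encoded column.
levels = {"job": 12, "marital": 3, "education": 4,
          "contact": 3, "month": 12, "day": 31}

columns_before = 16
untouched = columns_before - len(levels)  # columns that are not one-hot encoded
columns_after = untouched + sum(levels.values())
print(columns_after)  # 75
```

If the redundant indicator in each group is a concern (for example with logistic regression), `pd.get_dummies` also accepts `drop_first=True`, which would remove one column per encoded variable.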

In [58]:
df.columns

Index(['age', 'default', 'balance', 'housing', 'loan', 'campaign', 'pdays',
       'previous', 'poutcome', 'y', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student', 'job_technician',
       'job_unemployed', 'job_unknown', 'marital_divorced', 'marital_married',
       'marital_single', 'education_primary', 'education_secondary',
       'education_tertiary', 'education_unknown', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6', 'day_7', 'day_8',
       'day_9', 'day_10', 'day_11', 'day_12', 'day_13', 'day_14', 'day_15',
       'day_16', 'day_17', 'day_18', 'day_19', 'day_20', 'day_21', 'day_22',
       'day_23', 'day_24', 'day_25