## _Response Modeling of Bank Marketing Campaign_

<br />

<img src="AI.png" width = '400'><br>


### _Business Scenario_

A Portuguese bank has seen its revenue decline and wants to know what action to take. Investigation showed that the root cause is that clients are not depositing as frequently as before. Term deposits let a bank hold a deposit for a fixed period, so the bank can invest the money in higher-yield financial products to make a profit. The bank also has a better chance of persuading term-deposit clients to buy other products, such as funds or insurance, which further increases revenue. The bank would therefore like to identify existing clients who are more likely to subscribe to a term deposit and focus its marketing effort on them.


* The task is to build a proof of concept (POC) for this problem.

* The data relates to the direct marketing campaigns of a Portuguese banking institution.

* The marketing campaigns were based on phone calls.

* Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

## _Attributes Information_


### _Bank client data:_
1 - age (numeric)

2 - job : type of job 
(categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status 
(categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical:'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

### _Data Related to the last contact of the current campaign:_
8 - contact: contact communication type (categorical: 'cellular','telephone') 

9 - month: last contact month of year 
(categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week 
(categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
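
The note above can be made concrete with a minimal sketch. The toy values below are made up for illustration, and the names `realistic_features` and `benchmark_features` are hypothetical; only the column names come from the attribute list.

```python
import pandas as pd

# Toy frame with a few of the documented columns (values are invented)
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [0, 210, 640],
    "campaign": [1, 2, 1],
    "y": ["no", "no", "yes"],
})

# For a realistic model, drop `duration`: it is not known before the call.
realistic_features = toy.drop(columns=["duration", "y"])

# For a benchmark run only, `duration` may stay in the feature set.
benchmark_features = toy.drop(columns=["y"])

print(list(realistic_features.columns))
print(list(benchmark_features.columns))
```

Keeping the two feature sets side by side makes it easy to see how much of a benchmark score comes from `duration` alone.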

### _Other attributes:_

12 - campaign: number of contacts performed during this campaign and for this client 
(numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign 
(numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

### _Social and economic context attributes_
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric) 

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

In [1]:
# Set Directory
import os

## _Exploratory Analysis_

### _Import Libraries_

In [2]:
#! pip install imblearn

#if the above command does not work to install imblearn package run the following command in your terminal
# conda install -c glemaitre imbalanced-learn

In [3]:
# !pip install seaborn

In [None]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, recall_score, precision_score

import warnings
warnings.filterwarnings('ignore')

In [None]:
def convert_for_sklearn(label_list):
    return [1 if i == 'yes' else 0 for i in label_list]

def accuracy_precision_recall_metrics(y_true, y_pred):
    
    # Convert the 'yes'/'no' labels passed in (not the global y_test) to 1/0
    y_true_scoring = convert_for_sklearn(y_true)
    y_pred_scoring = convert_for_sklearn(y_pred)

    acc = accuracy_score(y_true=y_true_scoring, y_pred=y_pred_scoring)
    prec = precision_score(y_true=y_true_scoring, y_pred=y_pred_scoring)
    rec = recall_score(y_true=y_true_scoring, y_pred=y_pred_scoring)
    
    # Each label is printed with its matching metric
    print("Test Precision: ", prec)
    print("Test Recall: ", rec)
    print("Test Accuracy: ", acc)

### _Read in the data_

from google.colab import drive
drive.mount('/content/drive')

!ls "/content/drive/My Drive/Colab Notebooks/CSE7305c_ML_Architecture"

In [None]:
#### TYPE

### _Understand the dataset_

Understanding the dataset can be as thorough as you want it to be. You can start by looking at the variables and asking questions like the ones below.

In [None]:
# What are the names of the columns?
print(list(bank_data.columns))

In [None]:
# What are the data types?
bank_data.dtypes

In [None]:
# What is the distribution of numerical columns?
bank_data.describe()

In [None]:
# What about categorical variable levels count?
bank_data.describe(include=['object'])

In [None]:
bank_data.marital.value_counts()

### _Distribution of dependent variable_

In [None]:
# Plot Distribution
#### TYPE


# What are the counts?
print(bank_data.y.value_counts())

# What is the percentage?
count_yes = len(bank_data[bank_data.y == 'yes'])
count_no = len(bank_data[bank_data.y != 'yes'])

percent_success = (count_yes/(count_yes + count_no))*100

print('Percentage of clients who subscribed to a term deposit:', percent_success, "%")
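
As an alternative to counting each class by hand, the same percentage could be read from `value_counts(normalize=True)`. This is a minimal sketch on a made-up target column (2 'yes' out of 10 rows), not the notebook's data.

```python
import pandas as pd

# Hypothetical target column: 8 'no' and 2 'yes'
y = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

counts = y.value_counts()                       # absolute counts per class
shares = y.value_counts(normalize=True) * 100   # percentage per class

print(counts)
print(shares.round(1))

percent_yes = shares["yes"]
```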

### _Drawing trends toward the target variable_

In [None]:
#### TYPE

In [None]:
#### TYPE

### _Feature Engineering_

#### _Fix levels of categorical variable by domain_

In [None]:
#### TYPE

In [None]:
bank_data.education.value_counts()

#### _Drop Unnecessary variables_

In [None]:
#### TYPE

In [None]:
test_data.shape

#### _Type Casting_

In [None]:
for col in ['job', 'marital', 'education', 'credit_default', 'housing', 'loan', 'contact', 'contacted_month', 'day_of_week', 'poutcome', 'y']:
    bank_data[col] = bank_data[col].astype('category')

In [None]:
bank_data.dtypes

#### Split Numeric and Categorical Columns

In [None]:
#### TYPE

In [None]:
cat_attr

In [None]:
num_attr

#### _Handle Missing Values_

In [None]:
bank_data.isnull().sum()

## Sklearn pipelines

Pipelines streamline routine preprocessing by encapsulating small pieces of logic behind one function call, which lets you focus on modeling instead of writing boilerplate code.

Pipelines expose the same fit/transform/predict interface as individual estimators, so you can fit the whole pipeline on the training data and apply it to the test data in one call, instead of repeating each step by hand. Super convenient, right?
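
To illustrate the fit-on-train, transform-on-test idea with a single transformer, here is a small sketch using `StandardScaler` on made-up numbers. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # made-up training feature
test = np.array([[4.0]])                 # made-up test feature

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data only
train_scaled = scaler.fit_transform(train)

# transform reuses the training statistics, so no test information leaks in
test_scaled = scaler.transform(test)

print(scaler.mean_)
print(test_scaled)
```

A pipeline applies the same rule to every step it contains, which is why calling `fit` on the training data and `predict` on the test data is enough.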

Steps to follow to create a pipeline:

Step 1) Fetch the numerical and categorical columns

Step 2) Create a transformer/pipeline for numerical attributes

    Create a list of tuples where each tuple represents the operation to be performed on numerical attributes

Step 3) Create a transformer/pipeline for categorical attributes

    Create a list of tuples where each tuple represents the operation to be performed on categorical attributes

Step 4) Create a ColumnTransformer which merges both the numerical and categorical transformers

Step 5) Create a final pipeline object which includes the ColumnTransformer and an estimator (an algorithm to be built on the dataset)

Step 6) (optional) Create a GridSearchCV object with the pipeline as one of the inputs, along with a hyperparameter grid and a cross-validation object

Step 7) Apply fit() on train data and predict() on test data <br><br>

**TL;DR**
A pipeline is a collection of transformers chained together that operate sequentially, often ending with an estimator.
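
The steps above can be sketched roughly as follows. The rows are invented, the imputation strategies are assumptions, and the optional grid search step is left out; this is one possible layout, not necessarily the one the notebook's blank cells expect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows using a few of the documented column names
X = pd.DataFrame({
    "age": [25, 40, 58, 33, 47, 61],
    "campaign": [1, 3, 2, 1, 4, 2],
    "job": ["student", "admin.", "retired", "services", "admin.", "retired"],
    "marital": ["single", "married", "married", "single", "divorced", "married"],
})
y = pd.Series(["yes", "no", "yes", "no", "no", "yes"])

# Step 1: fetch the numerical and categorical columns
num_attr = ["age", "campaign"]
cat_attr = ["job", "marital"]

# Step 2: transformer for numerical attributes (median imputation is an assumption)
num_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                           ("scaler", StandardScaler())])

# Step 3: transformer for categorical attributes
cat_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                           ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Step 4: merge both transformers
preprocessor = ColumnTransformer(transformers=[("num", num_pipe, num_attr),
                                               ("cat", cat_pipe, cat_attr)])

# Step 5: final pipeline with an estimator at the end
clf = Pipeline(steps=[("preprocessor", preprocessor),
                      ("classifier", LogisticRegression())])

# Step 7: fit on training data, then predict (here on the same toy rows)
clf.fit(X, y)
pred = clf.predict(X)
print(pred)
```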

__Bird's view of sklearn pipeline__

<img src="Pipeline_broadview.png"><br><br>

__Train and Test dataflow inside the sklearn pipeline__
<img src="fit_tranform.jpg">

#### _Instantiate Pre-processing Objects for Pipeline_

In [None]:
#### TYPE

#### _Instantiate Pipeline Object_

In [None]:
#### TYPE

### _Train-Test Split_

In [None]:
X_train, y_train = bank_data.loc[:,bank_data.columns!='y'], bank_data.loc[:,'y']

X_test, y_test = test_data.loc[:,test_data.columns!='y'], test_data.loc[:,'y']

In [None]:
X_train.head(1)

### _Build Logistic Regression Model - 1_

In [None]:
#### TYPE

### _Evaluate Model_

In [None]:
#### TYPE

### _Build SVC Model - 2_

In [None]:
accuracy_precision_recall_metrics(y_true = y_test, y_pred = test_pred)

In [None]:
### A. SVM (linear kernel)

In [None]:
%%time
#### TYPE

In [None]:
train_pred =svc_lin.predict(X_train)
test_pred = svc_lin.predict(X_test)

In [None]:
# Rows are true labels, columns are predicted labels (sorted order: 'no', 'yes')
confusion_matrix_test = confusion_matrix(y_true=y_test, y_pred=test_pred)

Accuracy_test=(confusion_matrix_test[0,0]+confusion_matrix_test[1,1])/np.sum(confusion_matrix_test)

Precision_Test= confusion_matrix_test[1,1]/(confusion_matrix_test[1,1]+confusion_matrix_test[0,1])
Recall_Test= confusion_matrix_test[1,1]/(confusion_matrix_test[1,0]+confusion_matrix_test[1,1])

print("Test Precision: ",Precision_Test)
print("Test Recall: ",Recall_Test)
print("Test Accuracy: ",Accuracy_test)
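
The indexing used in the cell above relies on scikit-learn's confusion matrix layout: rows are true labels and columns are predicted labels. The following sketch, on made-up labels, checks the hand-computed formulas against scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels for illustration
y_true = ["no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "yes", "no", "yes", "no", "yes", "no"]

# Passing labels fixes the order: index 0 = 'no', index 1 = 'yes'
cm = confusion_matrix(y_true, y_pred, labels=["no", "yes"])
tn, fp, fn, tp = cm.ravel()  # cm[0,0], cm[0,1], cm[1,0], cm[1,1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted 'yes', how many were right
recall = tp / (tp + fn)      # of the actual 'yes', how many were found

print(cm)
print(accuracy, precision, recall)
```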

In [None]:
### B. SVM (rbf kernel)

In [None]:
%%time
#### TYPE

In [None]:
train_pred =svc_rbf.predict(X_train)
test_pred = svc_rbf.predict(X_test)

In [None]:
#### Using the function to calculate accuracy, precision and recall.

accuracy_precision_recall_metrics(y_true = y_test, y_pred = test_pred)

### C. SVM (Grid Search CV)

In [None]:
%%time
#### TYPE

In [None]:
svc_grid.best_params_
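
For reference, a grid search over a pipeline could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the grid values and the names `pipe` and `grid` are hypothetical; they are not the notebook's `svc_grid` settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed bank data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", SVC())])

# Keys use the '<step name>__<parameter>' convention; every value is a list
param_grid = {
    "classifier__C": [0.1, 1, 10],
    "classifier__gamma": [0.01, 0.1, 1],
    "classifier__kernel": ["rbf"],
}

# Every combination in the grid is evaluated with 3-fold cross-validation
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```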

In [None]:
train_pred = svc_grid.predict(X_train)
test_pred = svc_grid.predict(X_test)


In [None]:
#### Using the function to calculate accuracy, precision and recall.

accuracy_precision_recall_metrics(y_true = y_test, y_pred = test_pred)

### D. SVM (Random Search CV)

In [None]:
%%time
#### TYPE

In [None]:
svc_random.best_params_
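
A randomized search samples a fixed number of parameter settings instead of trying every combination. The sketch below assumes synthetic data and log-uniform sampling ranges chosen only for illustration; the names `pipe` and `search` are hypothetical.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed bank data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", SVC())])

# Distributions are sampled; lists are drawn from uniformly
param_distributions = {
    "classifier__C": loguniform(1e-2, 1e2),
    "classifier__gamma": loguniform(1e-3, 1e0),
    "classifier__kernel": ["rbf"],
}

# n_iter caps the number of sampled settings, regardless of the search space size
search = RandomizedSearchCV(pipe, param_distributions=param_distributions,
                            n_iter=8, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
```

Compared with a full grid, the cost is controlled by `n_iter` rather than by the number of candidate values per parameter.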

In [None]:
train_pred = svc_random.predict(X_train)
test_pred = svc_random.predict(X_test)

In [None]:
#### Using the function to calculate accuracy, precision and recall.

accuracy_precision_recall_metrics(y_true = y_test, y_pred = test_pred)

## Handling Imbalanced Data

### _Oversample Using SMOTE_

<img src="SMOTE.jpg" width = '400'><br>
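
The core idea behind SMOTE is to create synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that interpolation step only; the function `smote_like` is hypothetical and is not the `imblearn` implementation that this notebook imports.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = X_min[rng.integers(len(X_min))]
        # Distances from the base point to every minority point
        dist = np.linalg.norm(X_min - base, axis=1)
        # k nearest neighbours; position 0 of the sort is the base point itself
        # (assumes the minority rows contain no exact duplicates)
        neighbours = np.argsort(dist)[1:k + 1]
        neighbour = X_min[rng.choice(neighbours)]
        lam = rng.random()  # position along the segment from base to neighbour
        synthetic[i] = base + lam * (neighbour - base)
    return synthetic

# Made-up minority-class rows with two features
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_rows = smote_like(X_minority, n_new=6)
print(new_rows.shape)
```

Because each synthetic row lies on a segment between two real minority rows, it stays inside the region the minority class already occupies. Oversampling is normally applied to the training data only, so the test set keeps its original class balance.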

In [None]:
clf = Pipeline(steps=[('preprocessor', preprocessor)])

X_train_pp = pd.DataFrame(clf.fit_transform(X_train))
X_test_pp = pd.DataFrame(clf.transform(X_test))

In [None]:
###### from imblearn.over_sampling import SMOTE

#### TYPE


os_data_X = pd.DataFrame(data=os_data_X)
os_data_y= pd.DataFrame(data=os_data_y,columns=['y'])

# Check the class counts and proportions in the oversampled data
print("length of oversampled data is ",len(os_data_X))
print("Number of no subscription in oversampled data",len(os_data_y[os_data_y['y']=='no']))
print("Number of subscription",len(os_data_y[os_data_y['y']=='yes']))
print("Proportion of no subscription data in oversampled data is ",len(os_data_y[os_data_y['y']=='no'])/len(os_data_X))
print("Proportion of subscription data in oversampled data is ",len(os_data_y[os_data_y['y']=='yes'])/len(os_data_X))

In [None]:
%%time

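# os_data_X is already preprocessed, so this pipeline holds only the classifier
clf_svc = Pipeline(steps=[('classifier', SVC())])

# Grid keys need the 'classifier__' prefix, every value must be a list, and gamma must be > 0
# (gamma is ignored by the linear kernel; it is kept so other kernels can be added to the grid)
svc_param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
                  'classifier__gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
                  'classifier__kernel': ['linear']}


svc_grid_bal = GridSearchCV(clf_svc, param_grid=svc_param_grid, cv=5)

# Pass the target as a 1-D Series rather than a one-column DataFrame
svc_grid_bal.fit(os_data_X, os_data_y['y'])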


In [None]:
train_pred = svc_grid_bal.predict(os_data_X).reshape(1,-1)[0]
test_pred = svc_grid_bal.predict(X_test_pp).reshape(1,-1)[0]

In [None]:
confusion_matrix_train = confusion_matrix(y_true=os_data_y, y_pred = train_pred)
confusion_matrix_train

In [None]:
Accuracy_train=(confusion_matrix_train[0,0]+confusion_matrix_train[1,1])/(np.sum(confusion_matrix_train))

Precision_Train= confusion_matrix_train[1,1]/(confusion_matrix_train[1,1]+confusion_matrix_train[0,1])
Recall_Train= confusion_matrix_train[1,1]/(confusion_matrix_train[1,0]+confusion_matrix_train[1,1])

print("Train Precision: ",Precision_Train)
print("Train Recall: ",Recall_Train)
print("Train Accuracy: ",Accuracy_train)

In [None]:
accuracy_precision_recall_metrics(y_true = y_test, y_pred = test_pred)