<a href="https://colab.research.google.com/github/Pooja-Ramesh/WSU/blob/master/Bank_Marketing_Prediction_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **European Banking Institute | Case Study** 
Pooja Ramesh
---


---


**Introduction:**

A European banking institute conducted a marketing campaign for 18 months between 2008 & 2010. The direct marketing was conducted through phone calls. The financial instrument offered during the campaign was a term deposit. The bank has adopted advanced analytics, specifically predictive machine learning, to learn which customers are receptive to subscribing during a marketing campaign.


> *What is Direct Marketing?* \
Selected customers are contacted directly through a communication mode (personal contact, telephone, cellular, mail, and email) to advertise a new product/service or to give an offer on the existing suite of products.



> *What is Term Deposit?*\
A term deposit, popularly known as a fixed deposit, is an investment instrument in which a lump-sum amount is deposited at an agreed rate of interest for a fixed period of time, ranging from 1 month to 5 years.



> *Data Composition*\
The data shared by the bank comprises ~41,000 customer records with 22 features. Features are associated with the categories below:
1.   Customer Demographics
2.   Campaign - Recent & Historic information
3.   Customer Credit/Debt history
4.   Economic Indicators
5.   Bank Employee Information
6.   Current Model Prediction - probability that customer will subscribe



> Case Study Objective:
1.   Evaluate performance of the current predictive model
     - Is the model accurately identifying customers who subscribed to the campaign?
     - Is the model identifying customers as 'will subscribe' when they haven't?
2.   Exploratory Data Analysis - *Completed in Power BI*
3.   Opportunities for Future Modelling



> Assumptions Applied:
1.   The data gathered is representative of reality.
2.   Only one type of term deposit is offered in a marketing campaign to all customers.
3.   The historic campaign data relates to the same product type.
4.   The predictions available are values from the test dataset. The bank trained the model on another dataset, which is not shared for performance analysis.
5.   All data made available (all rows & columns, except for the dependent variable 'y' & the probability predictions) were used by the bank for testing purposes.
6.   'Unknown' values exist in the data table and are not missing at random; Assumption - Customers may be reluctant to share personal information during marketing campaigns.
7.   Discrepancies were identified between PDAYS, POUTCOME and PREVIOUS relating to historic campaigns; Assumption - Data was gathered from various sources and the data in column 'PDAYS' is correct. Overrides were applied wherever necessary.







# Objective 1: Evaluate Performance

A machine learning classification model can be used to predict the actual class of the data point directly or predict its probability of belonging to different classes. The latter gives us more control over the result. We can determine our own threshold to interpret the result of the classifier. This is sometimes more prudent than just building a completely new model.\
The predictions in the bank dataset are given as the probability that an instance is positive (the customer will subscribe to a term deposit in the marketing campaign). 
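
To illustrate how a threshold turns predicted probabilities into class labels, here is a minimal sketch. The probability values and the helper name `classify` are hypothetical and are not taken from the bank dataset.

```python
# Hypothetical predicted probabilities of subscribing (illustrative values only)
probabilities = [0.05, 0.20, 0.35, 0.60, 0.90]

def classify(probs, threshold=0.5):
    """Label an instance 1 (will subscribe) when its probability meets the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

labels_default = classify(probabilities)        # threshold of 0.5
labels_lenient = classify(probabilities, 0.3)   # lower threshold flags more customers

print(labels_default)
print(labels_lenient)
```

Lowering the threshold labels more customers as likely subscribers, which tends to raise both the number of true positives and the number of false positives.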

If we were to promote the current machine learning model for future campaign decisions, a stakeholder requirement could hypothetically fall into one of the two options below: 

1.   From a marketing manager's perspective, is the model correctly identifying customers who have subscribed to term deposits? The manager is interested in a higher True Positive Rate as a performance metric since they want a higher conversion rate from these marketing activities.
2.   From an operations manager's perspective, is the model incorrectly identifying customers as subscribers to term deposits when they haven't subscribed? The manager is interested in keeping the False Positive Rate low as a performance metric since they want to utilize the staff optimally for marketing activities.

As we don't have these stakeholders available to advise on a specific threshold for business purposes, we will use the AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve, a metric that evaluates the model's performance across varied thresholds.

> What is the AUC - ROC Curve?
The AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between customers who will subscribe and those who will not. 
How is this calculated? The ROC curve plots TPR against FPR at various threshold values.
1.   True Positive Rate (TPR) - proportion of the customers who subscribed that were correctly classified as subscribers.
2.   False Positive Rate (FPR) - proportion of the customers who didn't subscribe that were incorrectly classified as subscribers.\
\
How do we read AUC?
- When AUC = 1, the classifier perfectly distinguishes between all the Positive and the Negative class points.
- When 0.5 < AUC < 1, there is a high chance that the classifier will be able to distinguish the positive class values from the negative class values.
- When AUC = 0.5, the classifier is not able to distinguish between Positive and Negative class points; it performs no better than random guessing.
- When AUC < 0.5, the classifier ranks the classes worse than random guessing, i.e. it tends to score negative class points above positive ones.
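
The definitions above can be sketched on a small toy example. The labels and probabilities are made up for illustration, and the helper names `tpr_fpr` and `auc_by_ranking` are hypothetical. The AUC is computed here through its ranking interpretation: the share of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half.

```python
def tpr_fpr(y_true, y_prob, threshold):
    """Return (TPR, FPR) when probabilities >= threshold are labelled positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

def auc_by_ranking(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (illustrative only): 1 = subscribed, 0 = refused
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

tpr_05, fpr_05 = tpr_fpr(y_true, y_prob, 0.5)   # stricter threshold
tpr_03, fpr_03 = tpr_fpr(y_true, y_prob, 0.3)   # more lenient threshold
auc = auc_by_ranking(y_true, y_prob)

print(tpr_05, fpr_05)
print(tpr_03, fpr_03)
print(auc)
```

Each threshold gives one (FPR, TPR) point on the ROC curve; the AUC summarises all of those points in a single number.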





**Import necessary libraries and load the file needed for our Analysis**

Pandas, NumPy, Seaborn and Matplotlib have been imported to read/explore the dataset. \
Sklearn has been imported to complete the model performance analysis

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn import metrics  #Machine Learning Library
from sklearn.metrics import roc_curve

Read the bank data into a DataFrame and performed preliminary data wrangling:

- Replaced spaces in column names with _ to read and access columns easily
- Renamed columns to lowercase to run the analysis easily

In [None]:
# Read the data from CSV File onto a Pandas Dataframe.
bank = pd.read_csv('/DSA Data Set.csv', sep=',')

# Replace spaces in column names with '_' and convert them to lowercase
bank.columns = bank.columns.str.replace(' ', '_').str.lower()

#Preview of the first 5 rows of the data
bank.head()

In [None]:
#Encode the actual customer responses in 'y': 'no' (refused) -> 0, 'yes' (subscribed) -> 1
bank['y'] = bank['y'].map({'no': 0, 'yes': 1})

In [None]:
#Count of rows per each response category:
bank['y'].value_counts()
#Data is highly imbalanced, which is representative of the real-world scenario. The ratio of subscribed to refused is 1:8

In [None]:
#Save the actual customer responses & predicted probabilities separately for analysis
y_bank = bank['y']
pred_bank = bank['modelprediction']

In [None]:
#Calculate the ROC Curve to understand which threshold yields better results
fpr, tpr, threshold = metrics.roc_curve(y_bank, pred_bank)
roc_auc = metrics.auc(fpr, tpr)

In [None]:
fig, (ax) = plt.subplots(nrows = 1, ncols = 1, figsize = (15,8))
ax.plot(fpr, tpr, 'b', label='ROC curve (area = %.2f)' %roc_auc)
ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random guess')
ax.set_title('Receiver Operating Characteristic - Bank Model Evaluation ',fontsize=20)
ax.set_ylabel('True Positive Rate',fontsize=12)
ax.set_xlabel('False Positive Rate',fontsize=12)
ax.legend(loc = 'lower right', prop={'size': 15})

Interpretation: The ROC curve falls below the random-guess diagonal (AUC < 0.5), so the model performs worse than random guessing.

Next Steps:
- Check the wrong predictions in the testing data and find which “feature” makes the model predict correctly on the training data but incorrectly on the testing data.
- Check if train AUC > 0.5
- Understand if the model used for training was sensitive to highly imbalanced data.
- Understand if the training data was biased towards the negative class
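
One additional diagnostic, offered as a hedged sketch rather than as part of the original analysis: an AUC below 0.5 means the scores rank the classes in reverse, and flipping the scores (using 1 - p) yields an AUC of exactly 1 minus the original. Checking this can help rule out a label-encoding or column mix-up. The data and the helper name `auc_by_ranking` below are hypothetical.

```python
def auc_by_ranking(y_true, y_prob):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data (illustrative only) in which subscribers tend to get LOWER scores
y_true = [1, 1, 0, 0, 0]
y_prob = [0.2, 0.4, 0.3, 0.7, 0.9]

auc = auc_by_ranking(y_true, y_prob)
auc_flipped = auc_by_ranking(y_true, [1 - p for p in y_prob])

print(auc)          # below 0.5: worse than random
print(auc_flipped)  # equals 1 - auc
```

If the flipped AUC is well above 0.5, the scores carry signal but point the wrong way, which would suggest checking how the positive class was encoded before concluding that the model itself is uninformative.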


# Objective 2: Opportunities for Future Modelling
**Data Cleaning**
- **Missing Data Check** - Completed in Power BI, None Identified 

- **Duplicate Check** - Completed in Power BI, None Identified

- **Data Validation** Completed in Power BI
1. pdays <> previous
2. pdays <> poutcome

- **Data Visualizations** Completed in Power BI

**Data Pre-Processing - Features to drop:**
1. Current Campaign: The case study objective is to identify, in **advance**, the customers who will be receptive to a marketing campaign. All columns associated with the current campaign will be removed for this purpose. Columns include: contact, month, day_of_week, duration, campaign
2. Past campaign: Feature engineering has been performed on column 'pdays' to calculate a binary column representing whether or not the customer was contacted in historic campaigns. From the EDA in Power BI, we have also learnt that fewer customers subscribed when they had been contacted more than once. This needs to be addressed as part of the customer segment that the marketing campaign will focus on. The original columns pdays & previous will be removed from the analysis, since we are interested in the binary contact information & its respective response. 
3. Bank Employee Data: Assumption 1 - Column 'nr.employed' indicates the count of all employees the bank had in any given quarter. Assumption 2 - Not all customers of the bank are part of the marketing activity. Hence this column will be removed.

**Future Modelling**
Handling the imbalanced dataset: 
We understand that the data gathered is representative of reality and is highly imbalanced. How do we handle this?
1. Undersampling: Sampling from the majority class in order to keep only a part of its points.
2. Oversampling: Replicating some points from the minority class in order to increase its cardinality.
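
The two resampling options can be sketched with plain random sampling. This is a minimal illustration on made-up records, not the bank's data; the function names `random_undersample` and `random_oversample` are hypothetical. With a pandas DataFrame, `DataFrame.sample` (with `replace=True` for oversampling) could play the same role.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep only as many majority records as there are minority records."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))   # sampling without replacement
    return kept + minority

def random_oversample(majority, minority, seed=0):
    """Replicate minority records (with replacement) until the classes are balanced."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Toy records (illustrative only) at roughly a 1:8 subscribed-to-refused ratio
majority = [("no", i) for i in range(16)]    # refused
minority = [("yes", i) for i in range(2)]    # subscribed

under = random_undersample(majority, minority)
over = random_oversample(majority, minority)

print(len(under), sum(1 for label, _ in under if label == "yes"))
print(len(over), sum(1 for label, _ in over if label == "yes"))
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real-world class ratio.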


In [None]:
#A quick check on the number of missing values in each column:
bank.isnull().any()
missing_values_count = bank.isnull().sum()
print(missing_values_count[0:8])
total_cells = np.prod(bank.shape)  #Total number of cells (rows x columns)
total_missing = missing_values_count.sum()
print('percentage of data missing -',(total_missing/total_cells) * 100)

In [None]:
bank.shape

In [None]:
bank.columns

In [None]:
#If 'pdays'=999, create a column 'pcontacted' with values 0 representing 'Not contacted', else 1 representing 'Contacted'
bank['pcontacted'] = np.where(bank['pdays']==999, 0, 1)

In [None]:
#Sanity check of values in column 'poutcome' based on pcontacted = 0 (not contacted) 
bank1 = bank[bank['pcontacted']==0]
bank1['poutcome'].unique()
#'failure' outcome is available as values for customers who were not contacted.

In [None]:
#Set 'poutcome' to 'nonexistent' only for customers who were not previously contacted (pcontacted = 0)
bank.loc[bank['pcontacted'] == 0, 'poutcome'] = 'nonexistent'

In [None]:
#Remove columns that aren't useful for modelling purposes:
bank.drop(['contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'nr.employed', 'modelprediction'], axis=1, inplace=True)

In [None]:
#Count duplicated rows; duplicated() only flags them and does not modify the data
bank.duplicated(subset=None, keep='first').sum()

In [None]:
#Check the shape of the data; it is unchanged because no rows were dropped above
bank.shape

In [None]:
#A quick glance at the data to understand the data types & change them if necessary
bank.info()

In [None]:
#Change values to string for categorical variables
bank['y'] = bank['y'].astype(str)
bank['pcontacted'] = bank['pcontacted'].astype(str)

#Assumption - The algorithm selected for modelling is not sensitive to categorical variables

In [None]:
#Check data type after conversion
bank.info()