<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# The Bank Marketing Campaign (Report)



### Authors: Abdullah Al-Qithmi
---


## Dataset information

#### This study considers real data collected from direct telemarketing campaigns of a **Portuguese retail bank**
* The dataset contains **41,188** customer records
* It was collected from **May 2008** to **June 2013**
* Often, **more than one contact** with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
* Both **inbound** and **outbound** telemarketing techniques were used during the campaigns
    * Inbound telemarketing: a customer calls in to the call center.
    * Outbound telemarketing: the call center calls potential customers directly.



#### The dataset encompasses 3 main groups of features:
* **Demographic Information** — age, job, marital, education, default, housing, & loan
* **Time Characteristics of the Call** — day, month, & duration
* **Characteristics of the Campaign** — contact, campaign, pdays, previous, & poutcome

The dataset can be downloaded from <a href="http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing"><b>Here</b></a>

---

## Problem Statement


- The Portuguese bank has seen a decline in revenue and would like to know what actions to take. After investigation, we found that the root cause is that its clients are not depositing as frequently as before. Term deposits allow a bank to hold a deposit for a specific amount of time, so the bank can invest the money in higher-gain financial products to make a profit. As a result, the bank would like to identify existing clients who are more likely to subscribe to a term deposit and focus its marketing effort on them.

- Thus, we aim to improve the effectiveness of the campaign by analyzing the previous campaigns and building a classification model.

## Business Understanding
* **Term Deposit:** is a fixed-term deposit held at a financial institution.
* **Direct Marketing:** is a form of communicating an offer in which an organization communicates directly with a pre-selected group of customers.
* **Retail Banking:** is the provision of services by a bank to the general public, rather than to companies.
* **Default:** failure to repay a loan.


---
### Data Dictionary

```python
import pandas as pd

# Create a DataFrame that lists each dataset attribute's name, description, and type
df = pd.DataFrame({
    "Attribute Name": [
        "Age", "Job", "Marital", "Education", "Default", "Housing", "Loan",
        "Contact", "Month", "Day of Week", "Duration", "Campaign", "Pdays",
        "Previous", "Poutcome", "Emp.var.rate", "Cons.price.idx",
        "Cons.conf.idx", "Euribor3m", "Nr.employed", "y",
    ],
    "Description": [
        "Age of the client", "Type of the client's job",
        "Client's marital status", "Client's highest level of education",
        "Does the client have credit in default?",
        "Does the client have a housing loan?",
        "Does the client have a personal loan?",
        "Contact communication type",
        "Month of the last contact with the client",
        "Day of the week of the last contact with the client",
        "Duration of the last contact with the client",
        "Number of contacts performed during this campaign",
        "Number of days since the client was last contacted",
        "Number of contacts performed before this campaign",
        "Outcome of the previous marketing campaign",
        "Employment variation rate", "Consumer price index",
        "Consumer confidence index", "Euribor 3-month rate",
        "Number of employees", "Has the client subscribed to a term deposit?",
    ],
    "Type": ["Numeric"] + ["Categorical"] * 9 + ["Numeric"] * 4
            + ["Categorical"] + ["Numeric"] * 5 + ["Categorical"],
})

df.style.set_properties(**{'text-align': 'left'})
```

| Attribute Name | Description | Type |
|---|---|---|
| Age | Age of the client | Numeric |
| Job | Type of the client's job | Categorical |
| Marital | Client's marital status | Categorical |
| Education | Client's highest level of education | Categorical |
| Default | Does the client have credit in default? | Categorical |
| Housing | Does the client have a housing loan? | Categorical |
| Loan | Does the client have a personal loan? | Categorical |
| Contact | Contact communication type | Categorical |
| Month | Month of the last contact with the client | Categorical |
| Day of Week | Day of the week of the last contact with the client | Categorical |
| Duration | Duration of the last contact with the client | Numeric |
| Campaign | Number of contacts performed during this campaign | Numeric |
| Pdays | Number of days since the client was last contacted | Numeric |
| Previous | Number of contacts performed before this campaign | Numeric |
| Poutcome | Outcome of the previous marketing campaign | Categorical |
| Emp.var.rate | Employment variation rate | Numeric |
| Cons.price.idx | Consumer price index | Numeric |
| Cons.conf.idx | Consumer confidence index | Numeric |
| Euribor3m | Euribor 3-month rate | Numeric |
| Nr.employed | Number of employees | Numeric |
| y | Has the client subscribed to a term deposit? | Categorical |



## Exploratory Data Analysis

The exploration is done numerically first, and then the data is visualized to produce more concrete results and observations.

- During this stage, we also explored the missing values:
    * Null values
    * Zero values
    * Specific labels for missing values (e.g., unknown, N/A, 999)
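
A minimal sketch of these three checks, assuming the data is already loaded into a pandas DataFrame. The small `sample` frame and the `missing_summary` helper are hypothetical and only illustrate the idea; the sentinel labels are the ones listed above.

```python
import pandas as pd

# Hypothetical sample; the real dataset has 41,188 rows.
sample = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "services"],
    "default": ["no", "unknown", "unknown", "no"],
    "previous": [0, 1, 0, 0],
    "pdays": [999, 6, 999, 999],
})

def missing_summary(df, text_labels=("unknown", "N/A"), numeric_label=999):
    """Count nulls, zeros, and sentinel labels for each column."""
    return pd.DataFrame({
        "nulls": df.isnull().sum(),
        "zeros": (df == 0).sum(),
        "labels": df.isin(list(text_labels)).sum() + (df == numeric_label).sum(),
    })

summary = missing_summary(sample)
print(summary)
```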
---

## Observations

### Missing Values (Observations)
* **For the object attributes:**
    - A number of missing values are coded with the "unknown" label.
    - The **default attribute** contains a large share of missing values, about **20.9%** of its records.
* **For the numeric attributes:**
    - Some attributes have zero values; however, these zeros are meaningful, so they are not considered missing values.
    - In the **pdays attribute**, **"999"** means the client was not previously contacted, which is the case for the **majority** of records.
    
    
### EDA (Observations)
 
1. **Job attribute:** clients with more profitable jobs (admin, blue-collar, technician, management, services, and retired) tend to subscribe to the term deposit more than the others.
2. **Default attribute:** clients who have defaulted on credit do not subscribe to a term deposit.
3. **Loan attribute:** people without financial obligations such as loans are more likely to subscribe to a term deposit.
4. **Month attribute:** the month is strongly related to the number of subscribers; subscriptions appear to increase in the middle of the year and decrease at the end of the year.
5. **DAY_OF_WEEK, HOUSING, MARITAL, and EDUCATION** have no clear relation with the target attribute.
6. Most of the subscribers belong to the **23-60** age group, which is expected since it is considered the working age in most of the world.
    - Clients **younger than 23 years** usually cannot afford the deposit.
    - Clients **older than 60 years** are usually not interested in a term deposit since they are in retirement and prefer to enjoy their time and spend more money.
7. Most of the clients who subscribed to the term deposit made their decision within the first few contacts.
8. The greater the number of contacts, the lower the number of positive responses.
9. Most of the bank's customers in this dataset are aged between **30 and 40 years**.
10. The age attribute has some **outliers**.
11. nr.employed is highly correlated with euribor3m.
12. emp.var.rate is highly correlated with euribor3m.
13. emp.var.rate is highly correlated with nr.employed.
14. The target "y" has a positive correlation with previous and a negative correlation with nr.employed.
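
The correlation observations above can be checked along these lines. This is a sketch on a tiny hypothetical frame, not the original notebook code; it assumes the target `y` is encoded as 0/1 (the `y_num` column is an illustrative name) before calling `corr`.

```python
import pandas as pd

# Hypothetical rows; the values are for illustration only.
sample = pd.DataFrame({
    "previous": [0, 0, 1, 2, 0, 3],
    "nr.employed": [5228.1, 5191.0, 5099.1, 5017.5, 5228.1, 4991.6],
    "euribor3m": [4.96, 4.86, 1.41, 0.88, 4.96, 0.72],
    "y": ["no", "no", "yes", "yes", "no", "yes"],
})

# Encode the target as 0/1 so it can take part in the correlation matrix.
sample["y_num"] = (sample["y"] == "yes").astype(int)

# Pairwise Pearson correlations between the numeric attributes.
corr = sample.drop(columns="y").corr()
print(corr["y_num"].sort_values())
```

On the full dataset, the same `corr` matrix can be passed to a heatmap plot to spot highly correlated pairs at a glance.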
 
---

## Data Preparation & Cleansing
- **Drop columns**
    1. The **duration** of the call can only be known after the call has been made, at which point the outcome is already known. Thus, the duration attribute should be dropped; otherwise, the predictive model would not be realistic.
    2. The attributes **('euribor3m', 'nr.employed', 'cons.price.idx')** have high correlation with other attributes, which could violate the no-multicollinearity assumption of the algorithm, so we need to drop them.
    3. We need to drop **'pdays'**, since most of the clients (about 96%) were not previously contacted, so it carries little information.
- **Missing Values**
    1. For the **default** attribute, we will infer the unknown values from other attributes. Since there is a clear relation between default and our target (y), we will infer the unknown values of default based on the target (y).
    2. For the **loan and housing** attributes, we will use the KNN algorithm to fill the missing values, since we could not find any clear relation between loan and housing and the other attributes.
    3. For the **job and education** attributes, we will create a separate category named "other" for all the missing values.
    4. For the marital attribute, we will drop the rows with missing values since they represent only 0.19% of its values.
- **Outlier Values**
    1. Only two attributes contain outliers: age and duration. Duration will be dropped, and the age outliers do not need treatment since they are not expected to affect the model.
- **Reduce Categorical Attributes**
    1. For the marital attribute, convert divorced values to single.
    2. For the education attribute, group "basic.4y", "basic.6y", and "basic.9y" into a single category called "basic".
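
Several of the steps above can be sketched in pandas as follows. The `raw` frame and the `prepare` helper are hypothetical, and the sketch covers only the column drops, the "other" category, the marital row drop, and the category regrouping; inferring `default` and the KNN-based fill for loan and housing are not shown.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names and labels.
raw = pd.DataFrame({
    "age": [35, 42, 58, 29],
    "job": ["admin.", "unknown", "retired", "services"],
    "marital": ["married", "divorced", "unknown", "single"],
    "education": ["basic.4y", "unknown", "basic.9y", "university.degree"],
    "duration": [120, 300, 80, 45],
    "pdays": [999, 999, 6, 999],
    "euribor3m": [4.96, 4.86, 1.41, 0.88],
    "nr.employed": [5228.1, 5191.0, 5099.1, 5017.5],
    "cons.price.idx": [93.99, 93.92, 92.89, 94.47],
})

def prepare(df):
    """Apply the drop, fill, and regroup steps described above."""
    out = df.drop(columns=["duration", "euribor3m", "nr.employed",
                           "cons.price.idx", "pdays"])
    # Drop the few rows with an unknown marital status.
    out = out[out["marital"] != "unknown"].copy()
    # Missing job and education values become their own category.
    out[["job", "education"]] = out[["job", "education"]].replace("unknown", "other")
    # Merge divorced into single, and the three basic levels into one.
    out["marital"] = out["marital"].replace("divorced", "single")
    out["education"] = out["education"].replace(
        ["basic.4y", "basic.6y", "basic.9y"], "basic")
    return out.reset_index(drop=True)

clean = prepare(raw)
print(clean)
```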
---

## Feature Selection
- Based on the exploration above, the selected features are as follows:
    * age
    * campaign
    * previous
    * job
    * default
    * loan
    * month
    * poutcome
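
A short sketch of turning this selection into a model-ready matrix. The `clean` frame is hypothetical, and one-hot encoding with `pd.get_dummies` is an assumption here, since the report does not say how the categorical features were encoded.

```python
import pandas as pd

NUMERIC = ["age", "campaign", "previous"]
CATEGORICAL = ["job", "default", "loan", "month", "poutcome"]

# Hypothetical cleaned rows.
clean = pd.DataFrame({
    "age": [35, 42, 29],
    "campaign": [1, 3, 2],
    "previous": [0, 1, 0],
    "job": ["admin.", "retired", "services"],
    "default": ["no", "no", "no"],
    "loan": ["no", "yes", "no"],
    "month": ["may", "jun", "may"],
    "poutcome": ["nonexistent", "success", "nonexistent"],
    "marital": ["married", "single", "single"],  # not selected
})

# Keep the selected columns and one-hot encode the categorical ones.
X = pd.get_dummies(clean[NUMERIC + CATEGORICAL], columns=CATEGORICAL)
print(X.columns.tolist())
```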

## Model
Because this is mainly a classification problem, the chosen model is **logistic regression**.

#### Logistic Regression Assumptions
- Binary logistic regression requires the dependent variable to be binary.
- For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
- Only the meaningful variables should be included.
- The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
- The independent variables are linearly related to the log odds.
- Logistic regression requires quite large sample sizes.
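
A minimal sketch of fitting such a model with scikit-learn. The data below is synthetic, generated so that the log odds are linear in the features; the library choice, split ratio, and hyperparameters are assumptions, since the report does not state them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix (not the bank data).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# The true log odds are linear in the features, matching the assumption above.
log_odds = -1.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(400) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Stratify so both splits keep the same share of positive cases.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

Here class 1 plays the role of the desired outcome ("yes"), so a positive coefficient means the feature raises the predicted odds of a subscription.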

---
## Model Evaluation Metrics
Evaluation is based on the training and testing accuracy results. The testing set scores are also compared against the baseline accuracy of each approach. For the best-performing models on the test set, a classification report is displayed showing metrics such as the F1 score. Lastly, a confusion matrix is displayed for the best-performing model.
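
To make these metrics concrete, the sketch below computes the baseline accuracy, the confusion-matrix counts, and the F1 score by hand. The labels and predictions are hypothetical, and `baseline_accuracy` and `binary_report` are illustrative helper names, not functions from the original notebook.

```python
from collections import Counter

def baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)

def binary_report(y_true, y_pred, positive="yes"):
    """Confusion-matrix counts plus precision, recall, and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions.
y_true = ["no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "no"]
y_pred = ["no", "no", "no", "yes", "no", "no", "yes", "no", "yes", "no"]

baseline = baseline_accuracy(y_true)
report = binary_report(y_true, y_pred)
print(baseline, report)
```

Comparing `report["accuracy"]` with `baseline` shows whether the model beats always predicting the majority class, which matters here because most clients do not subscribe. In practice, `sklearn.metrics.classification_report` and `sklearn.metrics.confusion_matrix` compute the same quantities.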

## Future Work

**There is always room for improvement**, and here is a list of future work that could improve the project:

1. Remove the outlier values of the age attribute
2. Use different classification models
3. Try different methods for treating the imbalanced classes of the target
4. Come up with a different set of features
5. Try to reduce both false negatives and false positives