# Direct Marketing Prediction

In this notebook, we explore the direct marketing dataset. We start with Exploratory Data Analysis, followed by feature engineering and imputation, then apply Machine Learning models and interpret the results. We look at individual model performance and hyperparameter tuning, and compare different evaluation metrics. Later, we work through a cost-benefit analysis for our problem and finally deploy a few models via Airflow.

This notebook is solely for data, feature, and model exploration; it is not the final deployment code. For more, visit [Direct Marketing Prediction Homepage](https://github.com/AdiVarma27/DirectMarketingPrediction).

## TOC:

1. [Problem Statement & Data Import](#1)    
    1.1. [Problem Formulation](#1.1)    
    1.2. [Dataset import & columns exploration](#1.2)


2. [Exploratory Data Analysis](#2)            
    2.1. [Feature Exploration](#2.1)     
    2.2. [Feature label Analysis](#2.2)      


3. [Data Engineering](#3)     
    3.1. [Imputation & Cleaning](#3.1)     
    3.2. [Feature Engineering](#3.2)      
   
   
4. [Modeling](#4)    

---
## 1). Problem Statement & Data Import <a class="anchor" id="1"></a>

In this section, we define the task at hand and explore the questions we could answer with the dataset by looking at its columns.

Our dataset comes from the AWS SageMaker examples; the original source is the data used in the following paper:

**[A data-driven approach to predict the success of bank telemarketing](http://media.salford-systems.com/video/tutorial/2015/targeted_marketing.pdf)**

[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

### 1.1). Problem Formulation<a class="anchor" id="1.1"></a>

* From the above paper, we see that the source contains features of customers who are **targeted through telemarketing calls to sell long-term deposits**. The dataset also includes calls where the client contacts customer care for some other reason and a long-term deposit is upsold at the end of the conversation. The dataset also contains a **label**, which denotes whether the long-term deposit was sold (success) or not (no success).


* The dataset comes from a real Portuguese retail bank. The paper describes data from May 2008 to June 2013, with 52,944 phone calls in total; the file we load in Section 1.2 contains 41,188 rows. As expected with real-world datasets, the dataset is imbalanced (successful and unsuccessful outcomes occur in different numbers). The authors also include other interesting external variables, such as socio-economic factors (for example, unemployment variation). We could also look at other socio-economic metrics for Portugal in those years and see whether they might influence purchasing decisions.


* Given that we have previous campaign information (campaign, contact and outcome), we could potentially find clusters of customers (clustered by treatment and response). Hence, with respect to modeling, we have clear **Supervised Learning** & **Unsupervised Learning** tasks at hand. Beyond that, we have a ton of interesting questions to answer with the dataset through **Exploratory Analysis & Visualizations**. Let's start by importing the required libraries & dataset.

### 1.2). Dataset Import <a class="anchor" id="1.2"></a>

In [17]:
# importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

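# show all columns when displaying a DataFrame; the full option name is
# used because the short form 'max_columns' is ambiguous in recent pandas
pd.set_option('display.max_columns', None)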

# importing main dataset into DataFrame
df = pd.read_csv('dataset/bank-additional/bank-additional-full.csv')

# 41k rows, 21 columns
df.shape

(41188, 21)
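
With the data loaded, the class imbalance mentioned in Section 1.1 can be checked directly on the label column `y`. The snippet below is a minimal sketch, not part of the original notebook: the helper name `label_balance` is hypothetical, and the `toy` frame is a made-up stand-in so that the example runs on its own. In the notebook itself, one would presumably call `label_balance(df)` instead.

```python
import pandas as pd


def label_balance(frame: pd.DataFrame, label_col: str = "y") -> pd.DataFrame:
    """Return the count and share of each value in the label column.

    `label_balance` is a hypothetical helper introduced for illustration.
    """
    counts = frame[label_col].value_counts()
    shares = frame[label_col].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})


# Toy stand-in for `df` (NOT real data from the bank dataset)
toy = pd.DataFrame({"y": ["no", "no", "no", "yes"]})

balance = label_balance(toy)
print(balance)
```

If one class holds a much larger share than the other, plain accuracy becomes a misleading metric, which is one reason to compare several evaluation metrics later on.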

In [19]:
df.head()

| | age | job | marital | education | default | housing | loan | contact | month | day_of_week | duration | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | 261 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | 149 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | 226 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | 151 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | 307 | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
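
In the first rows, `pdays` is 999 wherever `poutcome` is `nonexistent`. In the UCI documentation for this dataset, 999 is a placeholder meaning the client was not previously contacted, so treating it as a real number of days would distort summaries of that column. The snippet below is a hedged sketch of one way to separate the placeholder from real values. The helper `split_pdays` and the new column names `previously_contacted` and `pdays_clean` are assumptions made for illustration, and the `toy` frame is made-up data rather than rows from the dataset.

```python
import numpy as np
import pandas as pd

# Placeholder used in `pdays` for clients with no previous contact
PDAYS_SENTINEL = 999


def split_pdays(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a contact flag and a copy of `pdays` with the placeholder as NaN.

    The helper and both new column names are hypothetical.
    """
    out = frame.copy()
    out["previously_contacted"] = out["pdays"] != PDAYS_SENTINEL
    # Keep real day counts; replace the placeholder with NaN
    out["pdays_clean"] = out["pdays"].where(out["previously_contacted"], np.nan)
    return out


# Toy stand-in for `df` (NOT real data from the bank dataset)
toy = pd.DataFrame({"pdays": [999, 999, 6, 3]})

result = split_pdays(toy)
print(result)
```

Keeping the flag as its own column preserves the "never contacted" information while letting `pdays_clean` be summarized as a genuine number of days.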
