# Create Model Dataset

## **Introduction:** 

This notebook extracts bank marketing data from the UCI Machine Learning Repository. The objective is to build a classification model that predicts term deposit subscriptions. The data relates to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

**Business Problem:** How can we build a classification model to predict which customers are expected to subscribe to a term deposit (variable y)?

In [1]:
# import required packages
import collections
import os
import random
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_profiling
import seaborn as sns

# import libraries required for preprocessing
import sklearn as sk
from scipy import stats
from sklearn import preprocessing
from sklearn.base import TransformerMixin

# novelty analysis
from sklearn.neighbors import LocalOutlierFactor
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

sns.set(style="ticks")

#import tensorflow as tf
#import tflearn



# set the working directory; in this example, the folder 'packt_exercises' is on the desktop
os.chdir("/Users/svk/Desktop/packt_exercises")

## Data Source

Read the input data, downloaded from the UCI Machine Learning Repository page for the Bank Marketing dataset: https://archive.ics.uci.edu/ml/datasets/bank+marketing

In [2]:
# read the input dataset as 'df' using pandas' read_csv function
df = pd.read_csv('bank.csv', sep=';')

# view the first 5 rows of the dataset using head function
df.head(5)

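| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |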


### Data Dictionary

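The attribute-level descriptions below appear to follow the UCI documentation for the extended (`bank-additional`) version of the dataset. The `bank.csv` file loaded above differs in places, as its first rows show: it has `balance` and `day` columns rather than `day_of_week` and the five economic indicators (attributes 16 to 20), `education` takes values such as 'primary', 'secondary' and 'tertiary', `contact` can be 'unknown', `pdays` is -1 rather than 999 where `previous` is 0, and `poutcome` uses 'unknown' rather than 'nonexistent'.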

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

8 - contact: contact communication type (categorical: 'cellular','telephone') 

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric) 

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
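The note on `duration` (attribute 11) implies that a realistic model should not see that column. A minimal sketch of separating features from the target while dropping `duration` might look like the following; the toy rows and the names `toy`, `features` and `target` are hypothetical and only stand in for the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few of the notebook's columns.
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "duration": [79, 220, 185],
    "campaign": [1, 1, 4],
    "y": ["no", "no", "yes"],
})

# Encode the target as 1 for 'yes' and 0 for 'no'.
target = (toy["y"] == "yes").astype(int)

# Drop the target and the leaky `duration` column from the features.
features = toy.drop(columns=["duration", "y"])

print(list(features.columns))
print(target.tolist())
```

Keeping `duration` is still reasonable for a benchmark run, as the dictionary notes; the point is to make the choice explicit.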

## Data Understanding

To understand the data at the attribute level, we can use functions like `info` and `describe`. However, `pandas_profiling` is a library that produces a much richer descriptive report from a single call, from which we can extract the following information:

At dataset level: 

1. Number of variables
2. Number of observations
3. Total Missing (%)
4. Total size in memory
5. Average record size in memory
6. Correlation Matrix
7. Sample Data

At attribute level:

1. Distinct count
2. Unique (%)
3. Missing (%)	
4. Missing (n)	
5. Infinite (%)
6. Infinite (n)
7. Histogram for distribution
8. Extreme Values
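Several of the statistics listed above can also be approximated with plain pandas, which helps clarify what a profiling report contains. The sketch below is not the `pandas_profiling` implementation; it runs on a hypothetical toy frame (`toy`) and the summary column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `df`.
toy = pd.DataFrame({
    "age": [30, 33, 35, 30, np.nan],
    "job": ["unemployed", "services", "management", "management", None],
})

# Dataset-level summary.
n_obs, n_vars = toy.shape
total_missing_pct = 100 * toy.isna().sum().sum() / toy.size
memory_bytes = toy.memory_usage(deep=True).sum()

# Attribute-level summary, one row per column.
attr_summary = pd.DataFrame({
    "distinct_count": toy.nunique(),           # NaN is not counted as a value
    "unique_pct": 100 * toy.nunique() / n_obs,
    "missing_n": toy.isna().sum(),
    "missing_pct": 100 * toy.isna().mean(),
})

print(f"variables={n_vars}, observations={n_obs}, "
      f"missing={total_missing_pct:.1f}%, memory={memory_bytes} bytes")
print(attr_summary)
```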

# Data Preprocessing

**Missing Value Treatment:** The data in this example has no missing values, so we introduce them artificially by setting a randomly chosen 10% of all cells to NaN, using the following method:

In [3]:
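# introducing missing values in the data

# track which columns have already been blanked in each row
replaced = collections.defaultdict(set)
# list every (row, col) cell position and shuffle it so cells are picked at random
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
# number of cells to blank: 10% of all cells
to_replace = int(round(.1*len(ix)))

# loop for generating missing values
for row, col in ix:
    # skip the cell if blanking it would leave the row with no observed values
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break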
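The same idea can be sketched in vectorized form with a random boolean mask. This is an alternative, not the notebook's method: each cell is blanked independently with probability 0.10, so the missing share is only approximately 10% and a row is not guaranteed to keep at least one observed value. The toy frame, the seed and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only

# Hypothetical toy frame standing in for the notebook's `df`.
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "balance": rng.integers(-500, 5000, size=1000),
    "job": rng.choice(["services", "management", "student"], size=1000),
})

# True marks a cell to blank; each cell is chosen independently with p = 0.10.
mask = rng.random(toy.shape) < 0.10

# DataFrame.mask replaces the cells where the condition is True with NaN.
toy_missing = toy.mask(mask)

missing_share = toy_missing.isna().to_numpy().mean()
print(f"share of missing cells: {missing_share:.3f}")
```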

In [4]:
# let's look at the number of missing values in each column
df.isna().sum()

age          442
job          424
marital      460
education    466
default      441
balance      405
housing      467
loan         474
contact      454
day          455
month        461
duration     458
campaign     484
pdays        419
previous     428
poutcome     481
y            467
dtype: int64

## Preprocessing Step 1: Missing Value Imputation 

Imputation means replacing missing values with estimates. A very simple approach is to fill numeric columns with the column mean and categorical columns with the most frequent value (the mode). Other methods, such as K-Nearest Neighbors or Random Forest, can also be used for missing value treatment.
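As a small worked illustration of mean and mode imputation, the sketch below fills a hypothetical two-column toy frame using pandas only. The frame and the names `toy`, `fill_values` and `toy_imputed` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [30.0, np.nan, 50.0],
    "job": ["services", "services", np.nan],
})

# Choose one fill value per column: mean for numeric, mode otherwise.
fill_values = {}
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_values[col] = toy[col].mean()          # NaN is ignored by mean()
    else:
        fill_values[col] = toy[col].mode().iloc[0]  # most frequent value

# fillna with a dict applies each fill value to its own column.
toy_imputed = toy.fillna(value=fill_values)
print(toy_imputed)
```

Here the missing `age` becomes 40.0 (the mean of 30 and 50) and the missing `job` becomes 'services'.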

In [5]:
# develop the class for missing value imputation

class DataFrameImputer(TransformerMixin):
    """Impute missing values.

    Columns of dtype object are imputed with the most frequent value
    in column.
    Columns of other types are imputed with mean of column.
    """

    def fit(self, X, y=None):
        # one fill value per column: mode for object columns, mean otherwise
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # fillna with a Series aligns the fill values on column names
        return X.fillna(self.fill)

### Apply the imputer to the dataset

In [6]:
# apply the DataFrameImputer developed above to df

df = DataFrameImputer().fit_transform(df)
df.isna().sum()

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

## Preprocessing Step 2: Outlier Treatment

An outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

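In this exercise, we will use the interquartile range (IQR) method for outlier treatment. With Q1 and Q3 as the 25th and 75th percentiles of a column and IQR = Q3 - Q1, a value is flagged as an outlier if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.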
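To make the fences concrete, here is a minimal worked sketch on a hypothetical five-value series. The final capping step (`clip`) is one common treatment and is an assumption here, not necessarily what this notebook applies.

```python
import pandas as pd

# Hypothetical toy values with one obvious extreme observation.
balance = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="balance")

q1 = balance.quantile(0.25)      # 2.0
q3 = balance.quantile(0.75)      # 4.0
iqr = q3 - q1                    # 2.0

lower_fence = q1 - 1.5 * iqr     # -1.0
upper_fence = q3 + 1.5 * iqr     # 7.0

# Flag values outside the fences.
is_outlier = (balance < lower_fence) | (balance > upper_fence)

# One possible treatment: cap values at the fences instead of dropping rows.
capped = balance.clip(lower=lower_fence, upper=upper_fence)

print(is_outlier.tolist())
print(capped.tolist())
```

Only the value 100.0 falls outside the interval [-1.0, 7.0], so it is the single flagged observation and is capped to 7.0.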

In [7]:
# compute the IQR fences for each numeric column and flag the outliers

num = df.select_dtypes(include=np.number)
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
# True where a value lies below the lower fence
print(num < (Q1 - 1.5 * IQR))
# True where a value lies above the upper fence
print(num > (Q3 + 1.5 * IQR))

        age  balance    day  duration  campaign  pdays  previous
0     False    False  False     False     False  False     False
1     False    False  False     False     False  False     False
2     False    False  False     False     False  False     False
3     False    False  False     False     False  False     False
4     False    False  False     False     False  False     False
5     False    False  False     False     False  False     False
6     False    False  False     False     False  False     False
7     False    False  False     False     False  False     False
8     False    False  False     False     False  False     False
9     False    False  False     False     False  False     False
10    False    False  False     False     False  False     False
11    False    False  False     False     False  False     False
12    False    False  False     False     False  False     False
13    False    False  False     False     False  False     False
14    False    False  Fal