# Project 4: Exploratory Data Analysis and Model Building - Bank Marketing
### Muhammed Albayati

## Overview
For the fourth project you’ll be applying your knowledge of supervised machine learning to build a classifier. Specifically, you’ll be attempting to classify whether potential customers will be persuaded to become customers of a bank. This project will fit well in your portfolio – prediction is a highly sought-after skill. Plus, you will build on your machine learning skills in the upcoming capstone project.
## Concepts covered:
- Cleaning and preparing data
- Exploring and visualizing data
- Model selection
- Improving machine learning model performance
### Bank Marketing Data Set
Source: https://archive.ics.uci.edu/ml/datasets/Bank%20Marketing

## Attribute Information:

Input variables:
### bank client data:
1 - age (numeric)\
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')\
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)\
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')\
5 - default: has credit in default? (categorical: 'no','yes','unknown')\
6 - housing: has housing loan? (categorical: 'no','yes','unknown')\
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
### related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')\
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')\
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')\
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
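
The note on `duration` can be acted on by dropping that column before training. Below is a minimal sketch on a toy frame; the values are made up, and the names `toy` and `realistic_features` are hypothetical and used only for illustration.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names (values are made up).
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "duration": [261, 151, 76],   # only known once the call has ended
    "campaign": [1, 1, 1],
    "y": ["no", "no", "yes"],
})

# Drop the target and the leaky `duration` column to get realistic features.
realistic_features = toy.drop(columns=["y", "duration"])
print(list(realistic_features.columns))
```

Keeping `duration` is still reasonable for a benchmark run, as the note says; the drop only matters when the model is meant to be used before a call is made.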
### other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)\
14 - previous: number of contacts performed before this campaign and for this client (numeric)\
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')\
### social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)\
17 - cons.price.idx: consumer price index - monthly indicator (numeric)\
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)\
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)\
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):\
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

# Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Data exploring, cleaning, and feature engineering 

## 1.1 Load the data

In [2]:
df=pd.read_csv('datasets/bank-full.csv',sep=';')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


## Checking for missing values

In [3]:
df.isnull().sum()

age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
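
`isnull()` only counts real missing values (NaN/None). The rows printed earlier contain the string `'unknown'` in several columns, which this check does not count. A small sketch of how such placeholders could be counted, using a made-up toy frame (`toy` and `unknown_counts` are hypothetical names):

```python
import pandas as pd

# Toy frame with 'unknown' used as a placeholder (values are made up).
toy = pd.DataFrame({
    "job": ["management", "unknown", "technician"],
    "education": ["tertiary", "unknown", "unknown"],
    "balance": [2143, 1, 29],
})

# isnull() reports no missing values here ...
print(toy.isnull().sum())

# ... but comparing against the placeholder string reveals the gaps.
unknown_counts = (toy == "unknown").sum()
print(unknown_counts)
```

Whether `'unknown'` should be treated as missing or kept as its own category is a modelling choice; this notebook keeps it as a category.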

## Count unique values per column to find categorical columns to convert to numeric values

In [4]:
df.nunique()

age            77
job            12
marital         3
education       4
default         2
balance      7168
housing         2
loan            2
contact         3
day            31
month          12
duration     1573
campaign       48
pdays         559
previous       41
poutcome        4
y               2
dtype: int64
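
The unique counts can be combined with the column types to decide how each column might be encoded. The sketch below uses a made-up toy frame; the rule "two unique values means binary encoding, otherwise one-hot" is an assumption for illustration, not a rule taken from this notebook.

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "marital": ["married", "single", "divorced", "married"],
    "housing": ["yes", "yes", "yes", "no"],
})

# Non-numeric columns are the ones that need encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Columns with exactly two values suit a 0/1 encoding;
# the remaining ones are candidates for one-hot encoding.
n_unique = toy[categorical_cols].nunique()
binary_cols = n_unique[n_unique == 2].index.tolist()
print(categorical_cols, binary_cols)
```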

## 1.2 Preprocessing the data
* Convert categorical columns to numeric values

In [5]:
# The target column is 'y'; convert it to a numeric value using LabelEncoder.
from sklearn import preprocessing
le=preprocessing.LabelEncoder()
# A simple example of how this works:
le.fit_transform(['yes','yes','no'])

array([1, 1, 0], dtype=int64)
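
To see which label received which integer, the fitted encoder can be inspected. A short sketch (the names `encoder`, `codes`, and `decoded` are only for illustration):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["yes", "yes", "no"])

# Classes are stored in sorted order, so 'no' -> 0 and 'yes' -> 1.
print(encoder.classes_)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Note that calling `fit_transform` again on another column refits the encoder, so `classes_` always reflects the most recently encoded column.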

## Convert columns "y", "housing", and "loan" to numeric values

In [6]:
df['y']=le.fit_transform(df['y'].astype(str))
df['housing']=le.fit_transform(df['housing'].astype(str))
df['loan']=le.fit_transform(df['loan'].astype(str))
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,1,0,unknown,5,may,261,1,-1,0,unknown,0
1,44,technician,single,secondary,no,29,1,0,unknown,5,may,151,1,-1,0,unknown,0
2,33,entrepreneur,married,secondary,no,2,1,1,unknown,5,may,76,1,-1,0,unknown,0
3,47,blue-collar,married,unknown,no,1506,1,0,unknown,5,may,92,1,-1,0,unknown,0
4,33,unknown,single,unknown,no,1,0,0,unknown,5,may,198,1,-1,0,unknown,0


## Convert some other categorical columns to numeric values

In [7]:
df=pd.get_dummies(df,columns=['job','marital','education','default','contact','poutcome','month'])
df.head()

Unnamed: 0,age,balance,housing,loan,day,duration,campaign,pdays,previous,y,...,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep
0,58,2143,1,0,5,261,1,-1,0,0,...,0,0,0,0,0,0,1,0,0,0
1,44,29,1,0,5,151,1,-1,0,0,...,0,0,0,0,0,0,1,0,0,0
2,33,2,1,1,5,76,1,-1,0,0,...,0,0,0,0,0,0,1,0,0,0
3,47,1506,1,0,5,92,1,-1,0,0,...,0,0,0,0,0,0,1,0,0,0
4,33,1,0,0,5,198,1,-1,0,0,...,0,0,0,0,0,0,1,0,0,0
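
What `get_dummies` does is easier to see on a tiny frame. The sketch below uses made-up values; `drop_first=True` is shown as an optional variant and is not used in this notebook.

```python
import pandas as pd

# Toy frame (values are made up).
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced"],
    "age": [58, 44, 33],
})

# One indicator column per category; `age` passes through unchanged.
encoded = pd.get_dummies(toy, columns=["marital"])
print(encoded.columns.tolist())

# drop_first=True removes one redundant indicator per encoded column.
encoded_reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True)
print(encoded_reduced.columns.tolist())
```

Dropping the first level can help linear models such as logistic regression, because the full set of indicators for one column always sums to 1.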


In [8]:
# We can save the processed data to a separate CSV file
# df.to_csv('datasets/bank_processed.csv')

# 2. Splitting data and model evaluation
Source: https://scikit-learn.org/stable/modules/model_evaluation.html

## 2.1 Splitting data

In [9]:
from sklearn.model_selection import train_test_split

x=df.drop('y',axis=1)
y=df['y']

# Split into training and test sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
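
When the positive class is rare, passing `stratify` to `train_test_split` keeps the class ratio the same in both splits. This is a sketch on synthetic data and a variant of the split above, not what the notebook ran; all names ending in `_toy`, `_tr`, and `_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 10% positives (made-up data).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy preserves the 10% positive rate in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```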

# cross_val_score

In [10]:
from sklearn.model_selection import cross_val_score

## Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
# Default parameters: penalty='l2', C=1.0. The default solver is 'lbfgs'
# in scikit-learn >= 0.22 and 'liblinear' in earlier versions.
logr=LogisticRegression()
score=cross_val_score(logr,x,y,cv=5) # Shift+Tab for more info
logr_test_acc=score.mean()
logr_test_acc

0.8806481153244509

## K-NN Classifier

In [12]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier()
score=cross_val_score(knn,x,y,cv=5) # Shift+Tab for more info
knn_test_acc=score.mean()
knn_test_acc

0.8722208936472338
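
K-NN relies on distances, so features on large scales (such as `balance` or `duration`) can dominate features on small scales. The models above are fit on unscaled features. The sketch below uses synthetic data to show how scaling could be added inside a pipeline so that it is refit on each training fold; it is an optional variant, not the notebook's method.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative small-scale feature, one large-scale noise feature.
rng = np.random.default_rng(1)
small = rng.normal(size=200)
large = rng.normal(scale=1000.0, size=200)
X_toy = np.column_stack([small, large])
y_toy = (small > 0).astype(int)

# Without scaling, distances are dominated by the noisy large-scale feature.
raw_scores = cross_val_score(KNeighborsClassifier(), X_toy, y_toy, cv=5)

# With scaling inside the pipeline, both features contribute equally.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_scores = cross_val_score(scaled_knn, X_toy, y_toy, cv=5)

print(raw_scores.mean(), scaled_scores.mean())
```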

## Gradient Boosting Classifier

In [13]:
from sklearn.ensemble import GradientBoostingClassifier
gbc=GradientBoostingClassifier()
score=cross_val_score(gbc,x,y,cv=5) # Shift+Tab for more info
gbc_test_acc=score.mean()
gbc_test_acc

0.7232938657283495

## Bagging Classifier

In [14]:
from sklearn.ensemble import BaggingClassifier
bc=BaggingClassifier()
score=cross_val_score(bc,x,y,cv=5) # Shift+Tab for more info
bc_test_acc=score.mean()
bc_test_acc

0.6049573515198821

## RandomForestClassifier

In [15]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
score=cross_val_score(rfc,x,y,cv=5) # Shift+Tab for more info
rfc_test_acc=score.mean()
rfc_test_acc

0.6148001892112552

## Let's summarize the mean 5-fold cross-validation accuracy of each model

In [16]:
print("Logistic Regression Test Accuracy:\t",logr_test_acc)
print("K-NN Classifier Test Accuracy:\t",knn_test_acc)
print("Gradient Boosting Classifier Test Accuracy:\t",gbc_test_acc)
print("Bagging Classifier Test Accuracy:\t",bc_test_acc)
print("RandomForestClassifier Test Accuracy:\t",rfc_test_acc)

Logistic Regression Test Accuracy:	 0.8806481153244509
K-NN Classifier Test Accuracy:	 0.8722208936472338
Gradient Boosting Classifier Test Accuracy:	 0.7232938657283495
Bagging Classifier Test Accuracy:	 0.6049573515198821
RandomForestClassifier Test Accuracy:	 0.6148001892112552
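
The tree-based ensembles score well below the other two models here. One possible cause, which is an assumption and not verified in this notebook, is row ordering: with `cv=5` and a classifier, `cross_val_score` uses stratified folds without shuffling, so any ordering in the rows carries over into the folds. A sketch of how shuffled folds could be passed instead, on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data (made up): the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Shuffle rows before splitting; random_state makes the folds reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=cv)

# One score per fold; report the mean and the spread.
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean also shows how much the folds disagree, which a single averaged number hides.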


# Hyperparameter Optimization 

In [17]:
from sklearn.model_selection import GridSearchCV
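parameters={'C':[0.1,0.4,0.8,1,2,5]
           ,'penalty':['l1','l2']}
# The 'l1' penalty needs a solver that supports it (e.g. 'liblinear'),
# so the solver is set explicitly instead of relying on the default.
grid_search=GridSearchCV(LogisticRegression(solver='liblinear'),parameters,cv=5)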
grid_search.fit(x_train,y_train)
grid_search.best_params_

{'C': 0.4, 'penalty': 'l2'}
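
Besides `best_params_`, a fitted `GridSearchCV` exposes the mean cross-validated score of the best combination (`best_score_`) and, with the default `refit=True`, an estimator refit on all the data passed to `fit` (`best_estimator_`). A small sketch on synthetic data; the solver is set to `'liblinear'` because the `'l1'` penalty is not supported by the `'lbfgs'` solver, and the grid values are chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, linearly separable data (made up).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

param_grid = {"C": [0.1, 1, 5], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_toy, y_toy)

# Best combination and its mean cross-validated accuracy.
print(search.best_params_, search.best_score_)

# Already refit on all of X_toy, y_toy; ready for predict().
best_model = search.best_estimator_
```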

# Now we will use these best parameters to test our model


In [18]:
# GridSearchCV refits the best parameter combination on x_train by default
# (refit=True), so the fitted model can be taken from the search directly.
logistic_model=grid_search.best_estimator_

In [19]:
y_pred=logistic_model.predict(x_test)

# accuracy_score

In [20]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def summarize_classification(y_test, y_pred):
    """Print accuracy, precision, and recall for binary predictions."""
    # Fraction and raw count of correctly classified samples
    acc = accuracy_score(y_test, y_pred, normalize=True)
    num_acc = accuracy_score(y_test, y_pred, normalize=False)

    # Precision and recall for the positive class (y == 1)
    prec = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)

    print("Test data count: ",len(y_test))
    print("accuracy_count : " , num_acc)
    print("accuracy_score : " , acc)
    print("precision_score : " , prec)
    print("recall_score : ", recall)
    print()

In [21]:
summarize_classification(y_test,y_pred)

Test data count:  9043
accuracy_count :  8036
accuracy_score :  0.8886431493973239
precision_score :  0.6029411764705882
recall_score :  0.22548120989917506
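
Accuracy is high while recall is low, which means many actual subscribers are predicted as non-subscribers. A confusion matrix makes this visible. The sketch below uses made-up labels, not the notebook's predictions, to show how precision and recall follow from the four counts:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = subscribed, 0 = did not subscribe.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
accuracy = (tp + tn) / len(y_true)

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

In this toy case accuracy is 0.6 while recall is only 0.25, the same pattern as in the results above: most errors are missed positives.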

