<h1><center>1.1 Bank Marketing Data Set</center></h1>

<p style='text-align: center;'> 
Jishnu Jeevan <br>
Department of Computer Science <br>
M.Tech Computer and Information Science <br>
jishnujeevan@cusat.ac.in <br>
</p>

<h2><center> Assignment Objective</center></h2>
<p style='text-align: justify;'>
The data relates to direct marketing campaigns of a Portuguese banking institution.<br>
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed to or not.<br>
There are two datasets:<br>
&emsp; 1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).<br>
&emsp; 2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.<br>
The smaller dataset is provided for testing more computationally demanding machine learning algorithms (e.g. SVM).<br>
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).<br>
Citation for the data set:<br>
<b>[Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.</b><br>
</p>

<h2>The data set contains the following columns, i.e. features</h2>
<p style='text-align: justify;'>
Input variables:<br>
Bank client data:<br>
1 - age (numeric)<br>
2 - type of job (categorical:"admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")<br>
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)<br>
4 - education (categorical: "unknown","secondary","primary","tertiary")<br>
5 - default: has credit in default? (binary: "yes","no")<br>
6 - balance: average yearly balance, in euros (numeric)<br>
7 - housing: has housing loan? (binary: "yes","no")<br>
8 - loan: has personal loan? (binary: "yes","no")<br>

Related to the last contact of the current campaign:<br>
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")<br>
10 - day: last contact day of the month (numeric)<br>
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")<br>
12 - duration: last contact duration, in seconds (numeric)<br>
   
Other attributes:<br>
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)<br>
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)<br>
15 - previous: number of contacts performed before this campaign and for this client (numeric)<br>
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")<br>

Output variable (desired target):<br>
17 - y - has the client subscribed to a term deposit? (binary: "yes","no")<br>
</p>

### 1. Importing the required libraries 

In [10]:
# For reading the data
import pandas as pd

# For training, testing and splitting of the data
from sklearn.model_selection import train_test_split

# For calculating accuracy, precision and recall and confusion matrix
from sklearn import metrics
from sklearn.metrics import confusion_matrix

#Classification Algorithms 
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

### 2. Importing and displaying the data

In [11]:
data = pd.read_csv("./bank-full.csv", delimiter=";")
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


### 3. No missing-value handling is needed, as the dataset contains no null values.
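
As a quick check of the statement above, the sketch below counts true nulls per column and, separately, the `"unknown"` placeholder that the column descriptions list as a category for several fields. The toy frame `sample` and its values are assumptions made up for illustration; on the real data the same two calls would be applied to `data`.

```python
import pandas as pd

# Toy rows mimicking a few columns of the bank data (illustrative values only).
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "unknown"],
    "education": ["tertiary", "secondary", "unknown"],
})

# Count true nulls (NaN/None) in each column.
null_counts = sample.isnull().sum()

# Count the "unknown" placeholder in each column; it is a regular string,
# so isnull() does not flag it.
unknown_counts = (sample == "unknown").sum()

print(null_counts)
print(unknown_counts)
```

Keeping `"unknown"` as its own category, as this notebook does, means it simply becomes one more one-hot column in the next step.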

### 4. Encoding categorical data in numeric form.

In [12]:
# Find out the data types of the columns
data.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [13]:
# Convert object-type columns into one-hot encoded columns using get_dummies.
data_new = pd.get_dummies(data, columns=['job','marital','education','default','housing','loan','contact','month','poutcome'])

# Convert the 'yes'/'no' target column to binary (1/0).
# Assigning the result back avoids relying on an in-place replace through attribute access.
data_new['y'] = data_new['y'].map({'yes': 1, 'no': 0})

# Confirm that all columns now have numeric data types
data_new.dtypes

age                    int64
balance                int64
day                    int64
duration               int64
campaign               int64
pdays                  int64
previous               int64
y                      int64
job_admin.             uint8
job_blue-collar        uint8
job_entrepreneur       uint8
job_housemaid          uint8
job_management         uint8
job_retired            uint8
job_self-employed      uint8
job_services           uint8
job_student            uint8
job_technician         uint8
job_unemployed         uint8
job_unknown            uint8
marital_divorced       uint8
marital_married        uint8
marital_single         uint8
education_primary      uint8
education_secondary    uint8
education_tertiary     uint8
education_unknown      uint8
default_no             uint8
default_yes            uint8
housing_no             uint8
housing_yes            uint8
loan_no                uint8
loan_yes               uint8
contact_cellular       uint8
contact_teleph

In [14]:
# Show the first few rows of the encoded data
data_new.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,y,job_admin.,job_blue-collar,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
0,58,2143,5,261,1,-1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
1,44,29,5,151,1,-1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
2,33,2,5,76,1,-1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
3,47,1506,5,92,1,-1,0,0,0,1,...,0,0,1,0,0,0,0,0,0,1
4,33,1,5,198,1,-1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
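
The encoding in step 4 can be illustrated on a toy frame. This is a minimal sketch, assuming made-up rows in a hypothetical `toy` DataFrame; it shows how `pd.get_dummies` expands one categorical column into indicator columns and how the yes/no target is mapped to 1/0.

```python
import pandas as pd

# Toy rows (illustrative values only, not taken from bank-full.csv).
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "y": ["no", "yes", "no"],
})

# One-hot encode the categorical column; untouched columns keep their
# position and the new indicator columns are appended at the end.
encoded = pd.get_dummies(toy, columns=["marital"])

# Map the yes/no target to 1/0.
encoded["y"] = encoded["y"].map({"yes": 1, "no": 0})

print(encoded)
```

Depending on the pandas version, the indicator columns may come out as `uint8` (as in the output above) or as `bool`; passing `dtype=int` to `get_dummies` gives integer columns either way.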


### 5. Separate the features and the output variable

In [15]:
# Output variable is stored in data_y
data_y = pd.DataFrame(data_new['y'])

# Features are stored in data_X
data_X = data_new.drop(['y'], axis=1)

print(data_X.columns)
print(data_y.columns)

Index(['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_apr', 'month_aug',
       'month_dec', 'month_feb', 'month_jan', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_failure', 'poutcome_other', 'poutcome_success',
       'poutcome_unknown'],
      dtype='object')
Index(['y'], dtype='object')


### 6. We are not going to reduce the number of features in the dataset. The original dataset has only 17 columns (16 input features plus the target). One-hot encoding expands this to 52 columns (51 features plus the target). Since the original feature count is small, we won't be using any feature extraction methods.

### 7. We split the dataset into training (70% of the dataset) and testing (30%) sets, stratified on the output variable.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.3, random_state=2, stratify=data_y)
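
The effect of `stratify` in the split above can be seen on a small imbalanced example. This is a sketch under stated assumptions: the frame `X` and series `y` are synthetic, with 20 positives in 100 rows chosen so that the arithmetic is exact.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 20 positives out of 100 rows (illustrative only).
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([1] * 20 + [0] * 80, name="y")

# Same settings as the notebook's split: 30% test, fixed seed, stratified on y.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

# With stratification, both parts keep the 20% positive rate of the full data.
print("train positive rate:", y_tr.mean())
print("test positive rate :", y_te.mean())
```

Without `stratify`, the positive rate in the test set would vary from seed to seed, which matters when the positive class is as rare as it is in this data set.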

### 8. We use several binary classification algorithms on the data and evaluate the accuracy of each.
### The algorithms we use are:
1. XGB Classifier
2. Logistic Regression
3. Random Forest Classifier
4. Decision Tree
5. Linear Discriminant Analysis
6. K Nearest Neighbour
7. Gaussian Naive Bayes 
8. AdaBoost

In [17]:
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

# Create a dictionary to find out the best classifier using accuracy score
ranking = {}

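# The classifiers used in this comparison study
classifiers = {
                # The keyword is n_estimators (plural). The recorded output below was produced
                # with the misspelled n_estimator, so it likely reflects the default number of trees.
                '1. XGB Classifier':XGBClassifier(n_estimators = 500, learning_rate = 0.5),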
                '2. Logistic Regression':LogisticRegression(),
                '3. Random Forest Classifier': RandomForestClassifier(),
                '4. Decision Tree Classifier':DecisionTreeClassifier(),
                '5. Linear Discriminant Analysis':LinearDiscriminantAnalysis(),               
                '6. K Nearest Neighbour':KNeighborsClassifier(8),                
                '7. Gaussian Naive Bayes Classifier':GaussianNB(),
                '8. Adaptive Boosting Classifier':AdaBoostClassifier()
               }

# Take each classifier from the dictionary
for Name, classifier in classifiers.items():
    # Fit the model using the training set (ravel turns the one-column frame into a 1-D array)
    classifier.fit(X_train, y_train.values.ravel())
    
    # Predict on the test set
    y_predicted = classifier.predict(X_test)
    
    # Compute the accuracy from the test labels and the predicted labels
    accuracy = metrics.accuracy_score(y_test,y_predicted)
    
    # Cross-validation score (3 folds on the training set)
    score = cross_val_score(classifier, X_train, y_train.values.ravel(), cv=3)
    
    # Compute macro-averaged precision
    precision = metrics.precision_score(y_test,y_predicted,average='macro')
    
    # Find out recall
    recall = metrics.recall_score(y_test,y_predicted,average='macro')
         
    # Confusion matrix
    confusion = confusion_matrix(y_test, y_predicted)
    
    # Print results
    print("\n")
    print("Name : ", Name)
    print("Accuracy : ", accuracy)
    print("Cross validation score : ", score)
    print("Precision : ", precision)
    print("Recall : ", recall)
    print("Confusion matrix : \n", confusion)
    
    # Add the name of classifier and accuracy score to dictionary
    ranking[Name] = accuracy



Name :  1. XGB Classifier
Accuracy :  0.9079180182836921
Cross validation score :  [0.90444592 0.90738459 0.90766897]
Precision :  0.7903802338431314
Recall :  0.7149955656975175
Confusion matrix : 
 [[11580   397]
 [  852   735]]


Name :  2. Logistic Regression
Accuracy :  0.9021675022117369
Cross validation score :  [0.89894777 0.90283439 0.90131766]
Precision :  0.7854482716016697
Recall :  0.6633628916671257
Confusion matrix : 
 [[11679   298]
 [ 1029   558]]


Name :  3. Random Forest Classifier
Accuracy :  0.8995871424358596
Cross validation score :  [0.8981894  0.9027396  0.89847379]
Precision :  0.7762352454198819
Recall :  0.6534290492399868
Confusion matrix : 
 [[11675   302]
 [ 1060   527]]


Name :  4. Decision Tree Classifier
Accuracy :  0.8762164553229136
Cross validation score :  [0.87742914 0.8739217  0.87733434]
Precision :  0.7018313687356116
Recall :  0.7074303936567352
Confusion matrix : 
 [[11112   865]
 [  814   773]]


Name :  5. Linear Discriminant Analysis
A
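
The accuracy, precision, and recall printed above follow directly from each confusion matrix. As a worked check, the sketch below recomputes them in plain Python from the XGB classifier's matrix, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are chosen for illustration.

```python
# Confusion matrix reported above for the XGB classifier.
# Rows: true class (0, 1). Columns: predicted class (0, 1).
confusion = [[11580, 397],
             [852, 735]]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

# Accuracy: share of all test rows that were classified correctly.
accuracy = (tn + tp) / total

# Per-class precision: correct predictions of a class / all predictions of that class.
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# Per-class recall: correct predictions of a class / all true members of that class.
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)

# average='macro' is the unweighted mean of the per-class values.
macro_precision = (precision_0 + precision_1) / 2
macro_recall = (recall_0 + recall_1) / 2

print(f"accuracy        : {accuracy:.4f}")
print(f"macro precision : {macro_precision:.4f}")
print(f"macro recall    : {macro_recall:.4f}")
print(f"recall (class 1): {recall_1:.4f}")
```

The macro averages hide a large gap between the classes: recall on the "yes" class (`recall_1`) is well below recall on the "no" class, which the single accuracy figure does not show.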

### 9. The algorithms, ranked by their accuracy scores from best to worst, are as follows:

In [18]:
# Sort the dictionary 'ranking' by accuracy, highest first
print("\n")
ranking_sorted = sorted(ranking.items(),  reverse = True, key=lambda x: x[1]) # This returns a list of tuples, not a dictionary
for k,v in ranking_sorted:
    print(k, ";", v)



1. XGB Classifier ; 0.9079180182836921
2. Logistic Regression ; 0.9021675022117369
8. Adaptive Boosting Classifier ; 0.8998083161309348
3. Random Forest Classifier ; 0.8995871424358596
5. Linear Discriminant Analysis ; 0.8995134178708346
6. K Nearest Neighbour ; 0.8852108522559717
4. Decision Tree Classifier ; 0.8762164553229136
7. Gaussian Naive Bayes Classifier ; 0.8526245945148924
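
One way to put the ranking above in context is to compare each score with a majority-class baseline, i.e. the accuracy of always predicting "no". The sketch below is a plain-Python illustration: the class counts are the row sums of the confusion matrices printed in step 8, the accuracies are copied from the ranking, and the variable names are assumptions.

```python
# Test-set class counts: row sums of the confusion matrices in step 8
# (e.g. 11580 + 397 negatives and 852 + 735 positives).
negatives = 11977
positives = 1587

# Accuracy obtained by predicting the majority class ("no") for every row.
baseline_accuracy = negatives / (negatives + positives)

# Accuracy scores copied from the ranking above.
ranking = {
    "1. XGB Classifier": 0.9079180182836921,
    "2. Logistic Regression": 0.9021675022117369,
    "8. Adaptive Boosting Classifier": 0.8998083161309348,
    "3. Random Forest Classifier": 0.8995871424358596,
    "5. Linear Discriminant Analysis": 0.8995134178708346,
    "6. K Nearest Neighbour": 0.8852108522559717,
    "4. Decision Tree Classifier": 0.8762164553229136,
    "7. Gaussian Naive Bayes Classifier": 0.8526245945148924,
}

print(f"majority-class baseline: {baseline_accuracy:.4f}")
for name, acc in ranking.items():
    # Positive difference: better than always predicting "no".
    print(f"{name:<36} {acc:.4f} ({acc - baseline_accuracy:+.4f})")

# Classifiers whose accuracy falls below the baseline.
below_baseline = [name for name, acc in ranking.items() if acc < baseline_accuracy]
print("below baseline:", below_baseline)
```

Because roughly 88% of the test rows are "no", accuracy alone separates the classifiers only weakly; the precision and recall values from step 8 are a useful complement.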


### NOTE: A support vector classifier (SVC) could also be applied to this dataset. I left it out because it takes far longer to run than the algorithms above.
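
One possible way to keep SVC affordable, in the spirit of the smaller bank.csv sample mentioned in the data set description, is to fit it on a stratified subsample of the training rows. The sketch below is not part of the original experiment: it uses synthetic data from `make_classification` as a stand-in for the encoded bank data, and the 10% subsample size and the `StandardScaler` step are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded bank data (not the real data set).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.88, 0.12], random_state=2
)

# Same 70/30 stratified split as in step 7.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2, stratify=y
)

# Keep a stratified 10% of the training rows to shorten SVC training time.
X_small, _, y_small, _ = train_test_split(
    X_train, y_train, train_size=0.1, random_state=2, stratify=y_train
)

# SVC is sensitive to feature scale, so standardise the features first.
svc = make_pipeline(StandardScaler(), SVC())
svc.fit(X_small, y_small)

svc_accuracy = svc.score(X_test, y_test)
print("rows used for training:", len(X_small))
print("SVC test accuracy     :", svc_accuracy)
```

Training on fewer rows trades some accuracy for speed, so a score obtained this way is not directly comparable with the full-data results above.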