## _Response Modeling of Bank Marketing Campaign_


### _Business Scenario_

The Portuguese bank's revenue has declined, and it would like to know what actions to take. After investigation, we found that the root cause is that clients are not depositing as frequently as before. Term deposits let a bank hold a deposit for a fixed period, so the bank can invest the money in higher-yield financial products to make a profit. Banks also have a better chance of persuading term-deposit clients to buy other products, such as funds or insurance, which further increases revenue. As a result, the bank would like to identify existing clients who are more likely to subscribe to a term deposit and focus its marketing effort on them.


* The task is to build a proof of concept (POC) for this problem.

* The data relates to the direct marketing campaigns of a Portuguese banking institution.

* The marketing campaigns were based on phone calls.

* Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

## _Attribute Information_


### _Bank client data:_
1 - age (numeric)

2 - job : type of job 
(categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status 
(categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

### _Data Related to the last contact of the current campaign:_
8 - contact: contact communication type (categorical: 'cellular','telephone') 

9 - month: last contact month of year 
(categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week 
(categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). 
Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
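
A minimal sketch of how the note above might be applied: `duration` is kept in a benchmark feature set and dropped from a realistic one. The toy DataFrame and the names `benchmark_features` and `realistic_features` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only)
df = pd.DataFrame({
    "age": [35, 22, 38],
    "duration": [178, 256, 42],   # known only after the call has ended
    "campaign": [2, 3, 5],
    "y": ["no", "yes", "no"],
})

# Benchmark feature set: keeps duration (optimistic, not usable before a call)
benchmark_features = df.drop(columns=["y"])

# Realistic feature set: duration removed because it is unknown at prediction time
realistic_features = df.drop(columns=["y", "duration"])

print(list(benchmark_features.columns))
print(list(realistic_features.columns))
```

Comparing a model trained on each feature set gives a rough idea of how much of the benchmark score depends on information that would not be available before a call.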

### _Other attributes:_

12 - campaign: number of contacts performed during this campaign and for this client 
(numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign 
(numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
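
The `pdays` sentinel described above (999 meaning the client was not previously contacted) can distort scale-sensitive models if it is read as a real number of days. One possible way to handle it, sketched here on toy values, is to split it into a flag plus a missing-aware column. The column names `previously_contacted` and `pdays_clean` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag: 1 if the client was contacted in a previous campaign, else 0
df["previously_contacted"] = (df["pdays"] != 999).astype(int)

# Replace the 999 sentinel with NaN so it is no longer read as "999 days"
df["pdays_clean"] = df["pdays"].replace(999, np.nan)

print(df)
```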

### _Social and economic context attributes_
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric) 

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed to a term deposit? (binary: 'yes','no')

## _Exploratory Analysis_

### _Import Libraries_

In [1]:
import os
import pickle 
import numpy as np
import pandas as pd
from sklearn import tree
import sklearn.metrics as metrics
from sklearn.metrics import classification_report

from sklearn import preprocessing
from sklearn.impute import SimpleImputer

#from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier  
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
# !pip install seaborn
import seaborn as sns

#!pip install imbalanced-learn
# if the above command does not work, run the following command in your terminal instead:
# conda install -c conda-forge imbalanced-learn
#from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Define custom function to print accuracy, precision and recall

def convert_for_sklearn(label_list):
    return [1 if i == 'yes' else 0 for i in label_list]


def accuracy_precision_recall_metrics(y_true, y_pred):
    
    y_test_scoring = convert_for_sklearn(y_true)
    test_pred_scoring = convert_for_sklearn(y_pred)

    acc = accuracy_score(y_true= y_test_scoring, y_pred = test_pred_scoring)
    prec = precision_score(y_true= y_test_scoring, y_pred = test_pred_scoring)
    rec = recall_score(y_true= y_test_scoring, y_pred = test_pred_scoring)
    
    print("Test Precision: ",prec)
    print("Test Recall: ",rec)
    print("Test Accuracy: ",acc)
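
As an alternative to converting the labels to 0/1 first, scikit-learn's `precision_score` and `recall_score` accept string labels directly when `pos_label` is given. A small sketch on made-up labels (the two lists are illustrative, not notebook output):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["no", "yes", "yes", "no", "yes"]
y_pred = ["no", "yes", "no", "no", "yes"]

# pos_label tells scikit-learn which string is the positive class
prec = precision_score(y_true, y_pred, pos_label="yes")
rec = recall_score(y_true, y_pred, pos_label="yes")
acc = accuracy_score(y_true, y_pred)

print("Precision:", prec)
print("Recall:", rec)
print("Accuracy:", acc)
```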

In [3]:
# Reading train and test data
bank_data = pd.read_csv("bank-additional-full.csv", sep=',', header=0, na_values='unknown')
test_data =  pd.read_csv("test_cases.csv", sep=',', header=0, na_values='unknown')

print(bank_data.shape)
print(test_data.shape)

bank_data.head()

(41188, 22)
(4119, 22)


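|  | customer_no | age | job | marital | education | credit_default | housing | loan | contact | contacted_month | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 2 | 57 | services | married | high.school | NaN | no | no | telephone | may | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 3 | 37 | services | married | high.school | no | yes | no | telephone | may | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 4 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 5 | 56 | services | married | high.school | no | no | yes | telephone | may | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |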


In [4]:
# Data types of train data
bank_data.dtypes

customer_no          int64
age                  int64
job                 object
marital             object
education           object
credit_default      object
housing             object
loan                object
contact             object
contacted_month     object
day_of_week         object
duration             int64
campaign             int64
pdays                int64
previous             int64
poutcome            object
emp_var_rate       float64
cons_price_idx     float64
cons_conf_idx      float64
euribor3m          float64
nr_employed        float64
y                   object
dtype: object

In [5]:
# Summary of the data
bank_data.describe(include='all')

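|  | customer_no | age | job | marital | education | credit_default | housing | loan | contact | contacted_month | ... | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 41188.0 | 41188.0 | 40858 | 41108 | 39457 | 32591 | 40198 | 40198 | 41188 | 41188 | ... | 41188.0 | 41188.0 | 41188.0 | 41188 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188.0 | 41188 |
| unique | NaN | NaN | 11 | 3 | 7 | 2 | 2 | 2 | 2 | 10 | ... | NaN | NaN | NaN | 3 | NaN | NaN | NaN | NaN | NaN | 2 |
| top | NaN | NaN | admin. | married | university.degree | no | yes | no | cellular | may | ... | NaN | NaN | NaN | nonexistent | NaN | NaN | NaN | NaN | NaN | no |
| freq | NaN | NaN | 10422 | 24928 | 12168 | 32588 | 21576 | 33950 | 26144 | 13769 | ... | NaN | NaN | NaN | 35563 | NaN | NaN | NaN | NaN | NaN | 36548 |
| mean | 20594.5 | 40.02406 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.567593 | 962.475454 | 0.172963 | NaN | 0.081886 | 93.575664 | -40.5026 | 3.621291 | 5167.035911 | NaN |
| std | 11890.09578 | 10.42125 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.770014 | 186.910907 | 0.494901 | NaN | 1.57096 | 0.57884 | 4.628198 | 1.734447 | 72.251528 | NaN |
| min | 1.0 | 17.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 0.0 | 0.0 | NaN | -3.4 | 92.201 | -50.8 | 0.634 | 4963.6 | NaN |
| 25% | 10297.75 | 32.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.0 | 999.0 | 0.0 | NaN | -1.8 | 93.075 | -42.7 | 1.344 | 5099.1 | NaN |
| 50% | 20594.5 | 38.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.0 | 999.0 | 0.0 | NaN | 1.1 | 93.749 | -41.8 | 4.857 | 5191.0 | NaN |
| 75% | 30891.25 | 47.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.0 | 999.0 | 0.0 | NaN | 1.4 | 93.994 | -36.4 | 4.961 | 5228.1 | NaN |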


In [6]:
# Merging the three 'basic.*' education levels into a single 'basic' category
print(bank_data.education.value_counts())
bank_data['education'] = bank_data['education'].replace(['basic.6y', 'basic.4y', 'basic.9y'], 'basic')

university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
illiterate                18
Name: education, dtype: int64


In [7]:
# Value counts of each level in education
bank_data.education.value_counts()

basic                  12513
university.degree      12168
high.school             9515
professional.course     5243
illiterate                18
Name: education, dtype: int64
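
The same merge of the `basic.*` levels would normally need to be applied to any other frame that will be scored later (for example `test_data`), otherwise the two frames end up with different category sets. A sketch, assuming a hypothetical helper `merge_basic_education` and toy frames:

```python
import pandas as pd

BASIC_LEVELS = ["basic.4y", "basic.6y", "basic.9y"]

def merge_basic_education(df):
    """Return a copy of df with the basic.* education levels merged into 'basic'."""
    out = df.copy()
    out["education"] = out["education"].replace(BASIC_LEVELS, "basic")
    return out

# Toy frames standing in for the training data and the held-out test cases
train = pd.DataFrame({"education": ["basic.4y", "high.school", "basic.9y"]})
test = pd.DataFrame({"education": ["basic.6y", "university.degree"]})

train, test = merge_basic_education(train), merge_basic_education(test)
print(train["education"].tolist())
print(test["education"].tolist())
```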

In [8]:
# bank_data.drop("customer_no", axis = 1, inplace= True)
# test_data.drop("customer_no", axis = 1, inplace= True)
# bank_data.head()

In [9]:
# Type casting columns to required data type
for col in ['job', 'marital', 'education', 'credit_default', 'housing', 'loan', 'contact', 'contacted_month', 'day_of_week', 'poutcome', 'y']:
    bank_data[col] = bank_data[col].astype('category')

In [10]:
# Splitting columns to numeric and categorical types
cat_attr = list(bank_data.select_dtypes("category").columns)
num_attr = list(bank_data.columns.difference(cat_attr))

# Removing target variable
cat_attr.pop()

'y'

In [11]:
# Printing categorical column names
cat_attr

['job',
 'marital',
 'education',
 'credit_default',
 'housing',
 'loan',
 'contact',
 'contacted_month',
 'day_of_week',
 'poutcome']

In [12]:
# Printing numeric column names
num_attr

['age',
 'campaign',
 'cons_conf_idx',
 'cons_price_idx',
 'customer_no',
 'duration',
 'emp_var_rate',
 'euribor3m',
 'nr_employed',
 'pdays',
 'previous']

In [13]:
# Checking for null values
bank_data.isnull().sum()

customer_no           0
age                   0
job                 330
marital              80
education          1731
credit_default     8597
housing             990
loan                990
contact               0
contacted_month       0
day_of_week           0
duration              0
campaign              0
pdays                 0
previous              0
poutcome              0
emp_var_rate          0
cons_price_idx        0
cons_conf_idx         0
euribor3m             0
nr_employed           0
y                     0
dtype: int64

In [14]:
# Dropping target variable before train-test split
X = bank_data.drop(["y"], axis = 1)

In [15]:
# Saving target variable for train-test split
y = bank_data["y"]

In [16]:
# Displaying columns
X.columns

Index(['customer_no', 'age', 'job', 'marital', 'education', 'credit_default',
       'housing', 'loan', 'contact', 'contacted_month', 'day_of_week',
       'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate',
       'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed'],
      dtype='object')

In [17]:
# Shape of data
print(X.shape, y.shape)

(41188, 21) (41188,)


In [18]:
# Splitting train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [19]:
# Shape of the data
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(32950, 21)
(8238, 21)
(32950,)
(8238,)


In [20]:
# Copying train and test data
import copy
X_train_bu = copy.deepcopy(X_train)
X_test_bu = copy.deepcopy(X_test)

In [21]:
# Dropping customer_no columns from both train and test data
X_train.drop("customer_no", axis = 1, inplace= True)
X_test.drop("customer_no", axis = 1, inplace= True)
X_train.head()

|  | age | job | marital | education | credit_default | housing | loan | contact | contacted_month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5402 | 35 | entrepreneur | married | high.school | no | no | no | telephone | may | fri | 178 | 2 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 40128 | 22 | services | single | professional.course | no | no | no | cellular | jul | tue | 256 | 3 | 999 | 1 | failure | -1.7 | 94.215 | -40.3 | 0.835 | 4991.6 |
| 11388 | 38 | blue-collar | married | NaN | no | yes | no | telephone | jun | fri | 42 | 5 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.959 | 5228.1 |
| 16361 | 25 | blue-collar | single | basic | NaN | yes | no | cellular | jul | wed | 442 | 2 | 999 | 0 | nonexistent | 1.4 | 93.918 | -42.7 | 4.963 | 5228.1 |
| 23389 | 37 | technician | married | university.degree | no | no | no | cellular | aug | wed | 107 | 4 | 999 | 0 | nonexistent | 1.4 | 93.444 | -36.1 | 4.964 | 5228.1 |


In [22]:
X_train.columns

Index(['age', 'job', 'marital', 'education', 'credit_default', 'housing',
       'loan', 'contact', 'contacted_month', 'day_of_week', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate',
       'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed'],
      dtype='object')

In [23]:
X_test.head()

|  | age | job | marital | education | credit_default | housing | loan | contact | contacted_month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8107 | 42 | services | married | high.school | no | no | no | telephone | jun | mon | 217 | 4 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.865 | 5228.1 |
| 38463 | 37 | blue-collar | married | basic | no | NaN | NaN | cellular | oct | fri | 93 | 1 | 999 | 0 | nonexistent | -3.4 | 92.431 | -26.9 | 0.73 | 5017.5 |
| 1933 | 41 | blue-collar | married | basic | NaN | no | no | telephone | may | fri | 407 | 4 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.855 | 5191.0 |
| 8352 | 42 | admin. | married | university.degree | NaN | yes | no | telephone | jun | tue | 215 | 3 | 999 | 0 | nonexistent | 1.4 | 94.465 | -41.8 | 4.864 | 5228.1 |
| 37164 | 56 | entrepreneur | married | high.school | no | yes | no | cellular | aug | wed | 131 | 5 | 999 | 1 | failure | -2.9 | 92.201 | -31.4 | 0.884 | 5076.2 |


In [24]:
# Count of levels in target variable in train data
print(y_train.value_counts())

no     29250
yes     3700
Name: y, dtype: int64


In [25]:
# Count of levels in target variable in test data
print(y_test.value_counts())

no     7298
yes     940
Name: y, dtype: int64
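
The counts above show that only about 11% of clients fall in the 'yes' class. With an imbalance like this, passing `stratify=y` to `train_test_split` keeps the class ratio the same in both splits. A sketch on synthetic labels (the data below is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 900 'no' and 100 'yes' labels (illustrative only)
X = np.arange(1000).reshape(-1, 1)
y = np.array(["no"] * 900 + ["yes"] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Share of 'yes' labels in each split
print((y_tr == "yes").mean(), (y_te == "yes").mean())
```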


In [26]:
# Checking for null values in train data
X_train.isna().sum()

age                   0
job                 261
marital              66
education          1372
credit_default     6926
housing             790
loan                790
contact               0
contacted_month       0
day_of_week           0
duration              0
campaign              0
pdays                 0
previous              0
poutcome              0
emp_var_rate          0
cons_price_idx        0
cons_conf_idx         0
euribor3m             0
nr_employed           0
dtype: int64

In [27]:
# Checking for null values in test data
X_test.isna().sum()

age                   0
job                  69
marital              14
education           359
credit_default     1671
housing             200
loan                200
contact               0
contacted_month       0
day_of_week           0
duration              0
campaign              0
pdays                 0
previous              0
poutcome              0
emp_var_rate          0
cons_price_idx        0
cons_conf_idx         0
euribor3m             0
nr_employed           0
dtype: int64

In [28]:
# Imputing categorical data with mode
cat_imputer = SimpleImputer(strategy='most_frequent')

In [29]:
# Fit and transform on train data
X_train[cat_attr] = cat_imputer.fit_transform(X_train[cat_attr])

In [30]:
# Impute test data with the modes learned from train data (no refitting)
X_test[cat_attr] = cat_imputer.transform(X_test[cat_attr])

In [31]:
# Removing customer_no from numerical attribute
num_attr.remove('customer_no')
num_attr

['age',
 'campaign',
 'cons_conf_idx',
 'cons_price_idx',
 'duration',
 'emp_var_rate',
 'euribor3m',
 'nr_employed',
 'pdays',
 'previous']

In [32]:
# Imputation on numeric attribute using median
num_imputer = SimpleImputer(strategy='median')
X_train[num_attr] = num_imputer.fit_transform(X_train[num_attr])

In [33]:
# Impute test data with the medians learned from train data (no refitting)
X_test[num_attr] = num_imputer.transform(X_test[num_attr])

In [34]:
# Checking if imputed correctly
X_train.isna().sum()

age                0
job                0
marital            0
education          0
credit_default     0
housing            0
loan               0
contact            0
contacted_month    0
day_of_week        0
duration           0
campaign           0
pdays              0
previous           0
poutcome           0
emp_var_rate       0
cons_price_idx     0
cons_conf_idx      0
euribor3m          0
nr_employed        0
dtype: int64

In [35]:
# Checking if imputed correctly
X_test.isna().sum()

age                0
job                0
marital            0
education          0
credit_default     0
housing            0
loan               0
contact            0
contacted_month    0
day_of_week        0
duration           0
campaign           0
pdays              0
previous           0
poutcome           0
emp_var_rate       0
cons_price_idx     0
cons_conf_idx      0
euribor3m          0
nr_employed        0
dtype: int64
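
A common way to impute without leaking test information is to learn the fill values with `fit` on the train split and reuse them with `transform` on the test split. The sketch below shows the pattern in isolation; the toy frames are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"job": ["admin.", "admin.", "services", np.nan]})
test = pd.DataFrame({"job": [np.nan, "services", "services"]})

imputer = SimpleImputer(strategy="most_frequent")

# Learn the mode from the train split only
train_imputed = imputer.fit_transform(train)

# Reuse the train mode on test, even though the test mode is different
test_imputed = imputer.transform(test)

print(train_imputed.ravel().tolist())
print(test_imputed.ravel().tolist())
```

Here the missing test value is filled with the train mode rather than the test mode, which is the behaviour a deployed model would see.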

In [36]:
# X_train= pd.get_dummies(X_train, drop_first=True)
# X_train.head()

In [37]:
# Label encoding categorical data: one encoder per column, fitted on train data only
label_encoders = {col: preprocessing.LabelEncoder() for col in cat_attr}

In [38]:
# Fit each encoder on its train column and encode that column
for col in cat_attr:
    X_train[col] = label_encoders[col].fit_transform(X_train[col])


In [39]:
# Reuse the train-fitted encoders so test codes match train codes
for col in cat_attr:
    X_test[col] = label_encoders[col].transform(X_test[col])


In [40]:
X_train.head()

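|  | age | job | marital | education | credit_default | housing | loan | contact | contacted_month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5402 | 35.0 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 6 | 0 | 178.0 | 2.0 | 999.0 | 0.0 | 1 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 |
| 40128 | 22.0 | 7 | 2 | 3 | 0 | 0 | 0 | 0 | 3 | 3 | 256.0 | 3.0 | 999.0 | 1.0 | 0 | -1.7 | 94.215 | -40.3 | 0.835 | 4991.6 |
| 11388 | 38.0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | 42.0 | 5.0 | 999.0 | 0.0 | 1 | 1.4 | 94.465 | -41.8 | 4.959 | 5228.1 |
| 16361 | 25.0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 3 | 4 | 442.0 | 2.0 | 999.0 | 0.0 | 1 | 1.4 | 93.918 | -42.7 | 4.963 | 5228.1 |
| 23389 | 37.0 | 9 | 1 | 4 | 0 | 0 | 0 | 0 | 1 | 4 | 107.0 | 4.0 | 999.0 | 0.0 | 1 | 1.4 | 93.444 | -36.1 | 4.964 | 5228.1 |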
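
Label encoding assigns arbitrary integer codes to nominal categories such as `job`, which linear models and SVMs will read as an ordering. One-hot encoding is a common alternative. The sketch below fits `OneHotEncoder` on a toy train column and reuses it on test, with `handle_unknown='ignore'` so that unseen categories become all-zero rows. The toy data is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"job": ["admin.", "services", "student"]})
test = pd.DataFrame({"job": ["services", "retired"]})  # 'retired' is unseen in train

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default sparse output is converted to a dense array for display
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_[0].tolist())
print(test_encoded.tolist())
```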


In [41]:
# Checking data types
X_train.dtypes

age                float64
job                  int32
marital              int32
education            int32
credit_default       int32
housing              int32
loan                 int32
contact              int32
contacted_month      int32
day_of_week          int32
duration           float64
campaign           float64
pdays              float64
previous           float64
poutcome             int32
emp_var_rate       float64
cons_price_idx     float64
cons_conf_idx      float64
euribor3m          float64
nr_employed        float64
dtype: object

In [42]:
# Standardising numerical data
scaler = StandardScaler()
scaler.fit(X_train[num_attr])

X_train[num_attr]=scaler.transform(X_train[num_attr])
X_test[num_attr]=scaler.transform(X_test[num_attr])

In [43]:
# Displaying top data after label encoding and standardising
X_train.head()

|  | age | job | marital | education | credit_default | housing | loan | contact | contacted_month | day_of_week | duration | campaign | pdays | previous | poutcome | emp_var_rate | cons_price_idx | cons_conf_idx | euribor3m | nr_employed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5402 | -0.481846 | 2 | 1 | 1 | 0 | 0 | 0 | 1 | 6 | 0 | -0.31057 | -0.20485 | 0.195582 | -0.349924 | 1 | 0.643045 | 0.720048 | 0.883266 | 0.707515 | 0.32621 |
| 40128 | -1.730916 | 7 | 2 | 3 | 0 | 0 | 0 | 0 | 3 | 3 | -0.009892 | 0.162686 | 0.195582 | 1.674657 | 0 | -1.14095 | 1.101751 | 0.039589 | -1.616536 | -2.444456 |
| 11388 | -0.193599 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 4 | 0 | -0.834829 | 0.897757 | 0.195582 | -0.349924 | 1 | 0.834187 | 1.533541 | -0.284902 | 0.766454 | 0.841715 |
| 16361 | -1.442669 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 3 | 4 | 0.707109 | -0.20485 | 0.195582 | -0.349924 | 1 | 0.834187 | 0.588783 | -0.479597 | 0.768765 | 0.841715 |
| 23389 | -0.289682 | 9 | 1 | 4 | 0 | 0 | 0 | 0 | 1 | 4 | -0.584264 | 0.530222 | 0.195582 | -0.349924 | 1 | 0.834187 | -0.229892 | 0.948164 | 0.769343 | 0.841715 |
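
`StandardScaler` rescales each column using statistics learned from the data it was fitted on: z = (x - mean) / std, where std is the population standard deviation. The sketch below checks this on toy numbers (illustrative values, not notebook data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[30.0], [40.0], [50.0]])
test = np.array([[40.0], [60.0]])

scaler = StandardScaler().fit(train)      # learns mean and std from train only
test_scaled = scaler.transform(test)      # applies the train statistics to test

# Manual computation with the population standard deviation (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```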


## _Model Building_

### _Logistic Regression_

In [44]:
logreg = LogisticRegression()
logreg.fit(X_train,y_train)

# Predictions on train data
y_pred1=logreg.predict(X_train)

# Predictions on test data
y_pred2=logreg.predict(X_test)

In [45]:
#Use accuracy_score function to get the accuracy
print("Logistic Accuracy Score on Train set -> ", accuracy_score(y_train, y_pred1)*100)
print("Logistic Accuracy Score on Validation set -> ", accuracy_score(y_test, y_pred2)*100)

Logistic Accuracy Score on Train set ->  90.99241274658574
Logistic Accuracy Score on Validation set ->  91.1143481427531


In [46]:
f1_score(y_train,y_pred1,average='macro')

0.7252518177573211

In [47]:
f1_score(y_test,y_pred2,average='macro')

0.7302656345499543
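
Accuracy alone is hard to interpret here because the classes are imbalanced. Using the test counts printed earlier (7298 'no', 940 'yes'), a model that always predicts 'no' would already reach roughly 88.6% accuracy, which is why the macro F1 score is a useful companion metric. A quick check of that arithmetic:

```python
# Test-set class counts as printed by y_test.value_counts() earlier in the notebook
n_no, n_yes = 7298, 940

# Accuracy of a trivial classifier that always predicts the majority class 'no'
baseline_accuracy = n_no / (n_no + n_yes)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```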

In [48]:
# filename = 'logreg_model.sav'
# pickle.dump(logreg, open(filename, 'wb'))

In [49]:
# # load the model from disk
# filename = 'logreg_model.sav'
# loaded_model1 = pickle.load(open(filename, 'rb'))
# result1 = loaded_model1.score(X_test, y_test)
# print(result1)

### _Logistic Regression Grid Search_

In [50]:
# liblinear supports both the 'l1' and 'l2' penalties used in the grid
logistic = LogisticRegression(solver='liblinear')

In [51]:
# Create regularization penalty space
penalty = ['l1', 'l2']

# Create regularization hyperparameter space
C = np.logspace(0, 4, 10)

# Create hyperparameter options
hyperparameters = dict(C=C, penalty=penalty)

In [52]:
clf1 = GridSearchCV(logistic, hyperparameters, cv=5, verbose=0)

In [53]:
best_model = clf1.fit(X_train, y_train)

In [54]:
print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])
print('Best C:', best_model.best_estimator_.get_params()['C'])

Best Penalty: l1
Best C: 1.0


In [55]:
ypred1 = best_model.predict(X_train)
ypred1

array(['no', 'no', 'no', ..., 'no', 'no', 'no'], dtype=object)

In [56]:
ypred2 = best_model.predict(X_test)
ypred2

array(['no', 'no', 'no', ..., 'no', 'no', 'no'], dtype=object)

In [57]:
#Use accuracy_score function to get the accuracy
print("Logistic Accuracy Score on Train set -> ", accuracy_score(y_train, ypred1)*100)
print("Logistic Accuracy Score on Validation set -> ", accuracy_score(y_test, ypred2)*100)

print(f1_score(y_train,ypred1,average='macro'))

print(f1_score(y_test,ypred2,average='macro'))

Logistic Accuracy Score on Train set ->  90.99241274658574
Logistic Accuracy Score on Validation set ->  91.10220927409566
0.7252518177573211
0.7300600425961659
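
By default `GridSearchCV` ranks candidates with the estimator's `score` method, which is accuracy for classifiers. On imbalanced data it may be worth selecting on macro F1 instead via the `scoring` argument. A sketch on synthetic data with a deliberately small grid; `make_classification` and its settings are assumptions used only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data (about 10% positives), for illustration only
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1],
                           random_state=0)

param_grid = {"C": np.logspace(0, 2, 3), "penalty": ["l1", "l2"]}

# liblinear supports both l1 and l2 penalties
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                      cv=5, scoring="f1_macro")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```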


In [58]:
# filename = 'logreg_model1.sav'
# pickle.dump(clf1, open(filename, 'wb'))

In [59]:
# # load the model from disk
# loaded_model2 = pickle.load(open(filename, 'rb'))
# result2 = loaded_model2.score(X_test, y_test)
# print(result2)

### _Naive Bayes_

In [60]:
from sklearn import model_selection, naive_bayes, svm
from sklearn.naive_bayes import GaussianNB

In [61]:
# fit the training dataset on the NB classifier
Naive = naive_bayes.GaussianNB()
Naive.fit(X_train,y_train)

# predict the labels on train dataset
y_pred3 = Naive.predict(X_train)
# predict the labels on validation dataset
y_pred4 = Naive.predict(X_test)

#Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score on Train set -> ", accuracy_score(y_train, y_pred3)*100)
print("Naive Bayes Accuracy Score on Validation set -> ", accuracy_score(y_test, y_pred4)*100)

print(f1_score(y_train,y_pred3,average='macro'))
print(f1_score(y_test,y_pred4,average='macro'))

Naive Bayes Accuracy Score on Train set ->  71.95447647951441
Naive Bayes Accuracy Score on Validation set ->  71.29157562515174
0.6068894149600741
0.6058875130361114


In [62]:
# filename = 'naivebayes_model.sav'
# pickle.dump(Naive, open(filename, 'wb'))

In [63]:
# # load the model from disk
# loaded_model3 = pickle.load(open(filename, 'rb'))
# result3 = loaded_model3.score(X_test, y_test)
# print(result3)

### _Decision Tree_

In [64]:
# set of parameters to test
param_grid = {"criterion": ["gini", "entropy"],
              "min_samples_split": [10, 20],
              "max_depth": [None, 5, 10],
              "min_samples_leaf": [5, 10],
              "max_leaf_nodes": [10, 20],
              }

In [65]:
dt = tree.DecisionTreeClassifier()
clf = GridSearchCV(dt, param_grid, cv=5)
clf.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=None,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 5, 10], 'max_leaf_nodes': [10, 2

In [66]:
clf.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=20,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [67]:
y_pred5 = clf.predict(X_train)
y_pred6 = clf.predict(X_test)

In [68]:
confusion_matrix_train = confusion_matrix(y_train, y_pred5)
confusion_matrix_test = confusion_matrix(y_test, y_pred6)

Accuracy_Train=(confusion_matrix_train[0,0]+confusion_matrix_train[1,1])/(confusion_matrix_train[0,0]+confusion_matrix_train[0,1]+confusion_matrix_train[1,0]+confusion_matrix_train[1,1])
TNR_Train= confusion_matrix_train[0,0]/(confusion_matrix_train[0,0]+confusion_matrix_train[0,1])
TPR_Train= confusion_matrix_train[1,1]/(confusion_matrix_train[1,0]+confusion_matrix_train[1,1])

print("Train TNR: ",TNR_Train)
print("Train TPR: ",TPR_Train)
print("Train Accuracy: ",Accuracy_Train)

Accuracy_Test=(confusion_matrix_test[0,0]+confusion_matrix_test[1,1])/(confusion_matrix_test[0,0]+confusion_matrix_test[0,1]+confusion_matrix_test[1,0]+confusion_matrix_test[1,1])
TNR_Test= confusion_matrix_test[0,0]/(confusion_matrix_test[0,0] +confusion_matrix_test[0,1])
TPR_Test= confusion_matrix_test[1,1]/(confusion_matrix_test[1,0] +confusion_matrix_test[1,1])

print("Test TNR: ",TNR_Test)
print("Test TPR: ",TPR_Test)
print("Test Accuracy: ",Accuracy_Test)

Train TNR:  0.9597606837606838
Train TPR:  0.5908108108108108
Train Accuracy:  0.9183308042488619
Test TNR:  0.9564264181967662
Test TPR:  0.5914893617021276
Test Accuracy:  0.9147851420247632
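
The TNR and TPR formulas above index the confusion matrix cell by cell. For a binary problem, the same quantities can be read off more compactly with `confusion_matrix(...).ravel()`; the `labels` argument pins the row and column order. The toy labels below are illustrative.

```python
from sklearn.metrics import confusion_matrix

y_true = ["no", "no", "no", "yes", "yes", "no"]
y_pred = ["no", "yes", "no", "yes", "no", "no"]

# With labels=['no', 'yes'], ravel() returns tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["no", "yes"]).ravel()

tnr = tn / (tn + fp)          # true negative rate (specificity)
tpr = tp / (tp + fn)          # true positive rate (recall)
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(tnr, tpr, accuracy)
```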


In [69]:
f1_score(y_train,y_pred5,average='macro')

0.7866319492066105

In [70]:
f1_score(y_test,y_pred6, average='macro')

0.7825655261363857

In [71]:
# filename = 'dt_model.sav'
# pickle.dump(clf, open(filename, 'wb'))

In [72]:
# # load the model from disk
# loaded_model4 = pickle.load(open(filename, 'rb'))
# result4 = loaded_model4.score(X_test, y_test)
# print(result4)

### _SVC (Linear Kernel)_

In [73]:
linear_svm = SVC(kernel='linear', C=1, random_state=0)

In [74]:
linear_svm.fit(X=X_train, y= y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=False, random_state=0,
    shrinking=True, tol=0.001, verbose=False)

In [75]:
y_pred7 = linear_svm.predict(X_train)
y_pred8 = linear_svm.predict(X_test)

In [76]:
### Train data accuracy
from sklearn.metrics import accuracy_score,f1_score

print("TRAIN Conf Matrix : \n", confusion_matrix(y_train, y_pred7))
print("\nTRAIN DATA ACCURACY",accuracy_score(y_train,y_pred7))

TRAIN Conf Matrix : 
 [[28654   596]
 [ 2664  1036]]

TRAIN DATA ACCURACY 0.9010622154779969


In [77]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred7,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred7,pos_label="no"))


Train data f1-score for class 'yes' 0.38859714928732186

Train data f1-score for class 'no' 0.9461761986527539


In [78]:
# recall_average = recall_score(y_train, y_pred7, average="binary", pos_label="yes")
# print(recall_average)

# recall_average1 = recall_score(y_test, y_pred8, average="binary", pos_label="yes")
# print(recall_average1)

# import sklearn.metrics as metrics
# from sklearn.metrics import precision_score
# print("Precision:",metrics.precision_score(y_test, y_pred8, pos_label='yes'))
#print("Precision:",metrics.precision_score(y_test, y_pred8, pos_label='yes'))

In [79]:
### Test data accuracy
print("TEST Conf Matrix : \n", confusion_matrix(y_test, y_pred8))
print("\nTEST DATA ACCURACY",accuracy_score(y_test,y_pred8))

TEST Conf Matrix : 
 [[7172  126]
 [ 669  271]]

TEST DATA ACCURACY 0.903495994173343


In [80]:
print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred8,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred8,pos_label="no"))


Test data f1-score for class 'yes' 0.4053851907255049

Test data f1-score for class 'no' 0.9474866239513838


In [81]:
# filename = 'svclinear_model.sav'
# pickle.dump(linear_svm, open(filename, 'wb'))

In [82]:
# # load the model from disk
# loaded_model5 = pickle.load(open(filename, 'rb'))
# result5 = loaded_model5.score(X_test, y_test)
# print(result5)

### _SVC (RBF Kernel)_

In [83]:
## Create an SVC object and print it to see the arguments
svc = SVC(kernel='rbf', random_state=0, gamma=0.01, C=1)
svc

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

In [84]:
## Train the model
svc.fit(X=X_train, y= y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

In [85]:
y_pred9 = svc.predict(X_train)
y_pred10 = svc.predict(X_test)

In [86]:
### Train data accuracy

print("TRAIN Conf Matrix : \n", confusion_matrix(y_train, y_pred9))
print("\nTRAIN DATA ACCURACY",accuracy_score(y_train,y_pred9))
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred9,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred9,pos_label="no"))

### Test data accuracy
print("\n\n--------------------------------------\n\n")

print("TEST Conf Matrix : \n", confusion_matrix(y_test, y_pred10))
print("\nTEST DATA ACCURACY",accuracy_score(y_test,y_pred10))
print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred10,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred10,pos_label="no"))

TRAIN Conf Matrix : 
 [[28764   486]
 [ 2570  1130]]

TRAIN DATA ACCURACY 0.9072534142640364

Train data f1-score for class 'yes' 0.4251316779533484

Train data f1-score for class 'no' 0.949557638980589


--------------------------------------


TEST Conf Matrix : 
 [[7187  111]
 [ 632  308]]

TEST DATA ACCURACY 0.9098082058752124

Test data f1-score for class 'yes' 0.4532744665194996

Test data f1-score for class 'no' 0.9508500363828803


In [87]:
# filename = 'svcrbf_model.sav'
# pickle.dump(svc, open(filename, 'wb'))

In [88]:
# # load the model from disk
# loaded_model6 = pickle.load(open(filename, 'rb'))
# result6 = loaded_model6.score(X_test, y_test)
# print(result6)

# SVC gridsearchCV

In [89]:
# svc_grid = SVC() 
# param_grid = { 
#                 'C': [0.1, 1],
#                 'gamma': [0.1, 1], 
#                 'kernel':['linear', 'rbf', 'poly' ]
#              }

# svc_cv_grid = GridSearchCV(estimator = svc_grid, param_grid = param_grid, cv = 5, verbose=3)

In [90]:
## Fit the grid search model
#svc_cv_grid.fit(X=X_train, y=y_train)

In [91]:
# Get the best parameters
#svc_cv_grid.best_params_

In [92]:
#svc_best = svc_cv_grid.best_estimator_

In [93]:
## Predict
# train_predictions = svc_best.predict(X_train)
# test_predictions = svc_best.predict(X_test)
# y_pred11 = svc_best.predict(X_train)
# y_pred12 = svc_best.predict(X_test)

In [94]:
# print("TRAIN DATA ACCURACY",accuracy_score(y_train,y_pred11))
# print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred11,pos_label="yes"))
# print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred11,pos_label="no"))

# ### Test data accuracy
# print("\n\n--------------------------------------\n\n")
# print("TEST DATA ACCURACY",accuracy_score(y_test,y_pred12))
# print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred12,pos_label="yes"))
# print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred12,pos_label="no"))

In [95]:
# filename = 'svcgridsearch_model.sav'
# pickle.dump(svc_cv_grid, open(filename, 'wb'))

In [96]:
# # load the model from disk
# loaded_model7 = pickle.load(open(filename, 'rb'))
# result7 = loaded_model7.score(X_test, y_test)
# print(result7)

# Random Forest

In [97]:
clf2 = RandomForestClassifier(n_estimators=10,max_depth=8)
clf2.fit(X=X_train, y=y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=8, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [98]:
y_pred13 = clf2.predict(X_train)
y_pred14 = clf2.predict(X_test)

In [99]:
print("TRAIN DATA ACCURACY",accuracy_score(y_train,y_pred13))
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred13,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred13,pos_label="no"))

### Test data accuracy
print("\n\n--------------------------------------\n\n")
print("TEST DATA ACCURACY",accuracy_score(y_test,y_pred14))
print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred14,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred14,pos_label="no"))

TRAIN DATA ACCURACY 0.9222154779969651

Train data f1-score for class 'yes' 0.5440313111545988

Train data f1-score for class 'no' 0.9574810464672607


--------------------------------------


TEST DATA ACCURACY 0.9107793153678078

Test data f1-score for class 'yes' 0.49760765550239233

Test data f1-score for class 'no' 0.9510424298940918
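
The forest above is built with `random_state=None`, so its scores can shift slightly from run to run. A minimal sketch of pinning the seed follows; it assumes scikit-learn and NumPy are installed and uses synthetic data from `make_classification` in place of the notebook's `X_train` and `y_train`. The helper `fit_and_predict` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the notebook's training data (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)


def fit_and_predict(seed):
    """Fit a small forest with a fixed seed and return its class probabilities."""
    model = RandomForestClassifier(n_estimators=10, max_depth=8, random_state=seed)
    return model.fit(X_demo, y_demo).predict_proba(X_demo)


# Two independent fits with the same seed give identical outputs.
same_seed_match = np.array_equal(fit_and_predict(42), fit_and_predict(42))
print("identical with fixed seed:", same_seed_match)
```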


In [100]:
# filename = 'rf_model.sav'
# pickle.dump(clf2, open(filename, 'wb'))

In [101]:
# # load the model from disk
# loaded_model8 = pickle.load(open(filename, 'rb'))
# result8 = loaded_model8.score(X_test, y_test)
# print(result8)

# Random forest gridsearch

In [102]:
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt') 
 
# Use a grid over parameters of interest
param_grid = {"n_estimators" : [9, 18],
              "max_depth" : [2,3],
              "min_samples_leaf" : [2, 4]
             }

scores = ['precision', 'recall']

In [103]:
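# Run one grid search per metric. clf3 is overwritten on each pass, so after the
# loop it holds the search tuned for the last metric in `scores` ('recall').
for score in scores: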
    print("# Tuning hyper-parameters for %s" % score)
    print("\n")

    clf3 = GridSearchCV(estimator=rfc, param_grid=param_grid,cv=5,
                       scoring='%s_macro' % score)
    clf3.fit(X_train, y_train)

    print("Best parameters set found on training set:")
    print("\n")
    print(clf3.best_params_)
    print("\n")
    
    print("Grid scores on training set:")
    print("\n")
    means = clf3.cv_results_['mean_test_score']
    for mean, params in zip(means, clf3.cv_results_['params']):
        print("%0.3f for %r" % (mean, params))

# Tuning hyper-parameters for precision


Best parameters set found on training set:


{'max_depth': 2, 'min_samples_leaf': 4, 'n_estimators': 18}


Grid scores on training set:


0.838 for {'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 9}
0.840 for {'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 18}
0.819 for {'max_depth': 2, 'min_samples_leaf': 4, 'n_estimators': 9}
0.854 for {'max_depth': 2, 'min_samples_leaf': 4, 'n_estimators': 18}
0.824 for {'max_depth': 3, 'min_samples_leaf': 2, 'n_estimators': 9}
0.844 for {'max_depth': 3, 'min_samples_leaf': 2, 'n_estimators': 18}
0.821 for {'max_depth': 3, 'min_samples_leaf': 4, 'n_estimators': 9}
0.835 for {'max_depth': 3, 'min_samples_leaf': 4, 'n_estimators': 18}
# Tuning hyper-parameters for recall


Best parameters set found on training set:


{'max_depth': 3, 'min_samples_leaf': 2, 'n_estimators': 9}


Grid scores on training set:


0.573 for {'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 9}
0.574 for {'max_dep

In [104]:
ypred15=clf3.predict(X_train)
ypred16=clf3.predict(X_test)

In [105]:
print(confusion_matrix(y_train, ypred15))

print(confusion_matrix(y_test, ypred16))

print(accuracy_score(y_train,ypred15))

print(accuracy_score(y_test,ypred16))

print("\nTrain data f1-score for class 'yes'",f1_score(y_train,ypred15,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,ypred15,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,ypred16,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,ypred16,pos_label="no"))

[[29082   168]
 [ 3138   562]]
[[7262   36]
 [ 782  158]]
0.8996661608497724
0.9007040543821316

Train data f1-score for class 'yes' 0.25372460496613997

Train data f1-score for class 'no' 0.9462176671547096

Test data f1-score for class 'yes' 0.27865961199294537

Test data f1-score for class 'no' 0.9466823099986964
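
The tuning loop above runs a separate search per metric and keeps only the last one. One possible alternative, sketched here under assumptions, is a single `GridSearchCV` with several scorers and an explicit `refit` metric. The data is synthetic (`make_classification`), and the `_demo` names and reduced grid are illustrative, not the notebook's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for X_train / y_train (an assumption).
X_demo, y_int = make_classification(n_samples=400, n_features=8,
                                    weights=[0.85, 0.15], random_state=0)
y_demo = np.where(y_int == 1, "yes", "no")

rfc_demo = RandomForestClassifier(max_features="sqrt", random_state=0)
param_grid_demo = {"n_estimators": [9, 18], "max_depth": [2, 3]}

# One search, two metrics; `refit` names the metric that picks best_estimator_.
search = GridSearchCV(
    estimator=rfc_demo,
    param_grid=param_grid_demo,
    cv=3,
    scoring={"precision": "precision_macro", "recall": "recall_macro"},
    refit="recall",
)
search.fit(X_demo, y_demo)

# Each metric gets its own mean_test_<name> column in cv_results_.
for params, prec, rec in zip(search.cv_results_["params"],
                             search.cv_results_["mean_test_precision"],
                             search.cv_results_["mean_test_recall"]):
    print("precision=%0.3f recall=%0.3f for %r" % (prec, rec, params))

print("selected by recall:", search.best_params_)
```

Both metrics are then available for every parameter combination from one fit, and it is explicit which metric chose the final model.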


In [106]:
# filename = 'rf_gridsearch_model.sav'
# pickle.dump(clf3, open(filename, 'wb'))

In [107]:
# # load the model from disk
# loaded_model9 = pickle.load(open(filename, 'rb'))
# result9 = loaded_model9.score(X_test, y_test)
# print(result9)

# Random forest gridsearch (default accuracy scoring)

In [108]:
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid,cv=5)
CV_rfc.fit(X=X_train, y=y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='sqrt',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=-1,
                                              oob_score=False,
                                              random_state=None, verbose=0,
                                              warm_start=False),
             iid='

In [109]:
print(CV_rfc.best_score_, CV_rfc.best_params_)

0.9007587253414264 {'max_depth': 3, 'min_samples_leaf': 2, 'n_estimators': 9}


In [110]:
y_pred15=CV_rfc.predict(X_train)
y_pred16=CV_rfc.predict(X_test)

In [111]:
print(confusion_matrix(y_train, y_pred15))

[[29051   199]
 [ 3079   621]]


In [112]:
print(confusion_matrix(y_test, y_pred16))

[[7250   48]
 [ 773  167]]


In [113]:
print(accuracy_score(y_train,y_pred15))

0.9005159332321699


In [114]:
print(accuracy_score(y_test,y_pred16))

0.9003398883224083


In [115]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred15,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred15,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred16,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred16,pos_label="no"))


Train data f1-score for class 'yes' 0.27477876106194693

Train data f1-score for class 'no' 0.9465949820788531

Test data f1-score for class 'yes' 0.2891774891774892

Test data f1-score for class 'no' 0.9464134194895895
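
This search selects on accuracy, which is dominated by the 'no' class. A hedged sketch of selecting on the F1 of the 'yes' class instead is shown below, using `make_scorer` with `pos_label`. It assumes scikit-learn and NumPy are installed; the data is synthetic and the grid values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for X_train / y_train (an assumption).
X_demo, y_int = make_classification(n_samples=400, n_features=8,
                                    weights=[0.85, 0.15], random_state=0)
y_demo = np.where(y_int == 1, "yes", "no")

# Scorer that measures F1 on the minority 'yes' class only.
f1_yes_scorer = make_scorer(f1_score, pos_label="yes")

search_f1 = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 8], "min_samples_leaf": [2, 4]},
    cv=3,
    scoring=f1_yes_scorer,
)
search_f1.fit(X_demo, y_demo)

print(search_f1.best_score_, search_f1.best_params_)
```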


In [116]:
# filename = 'rf_gridsearch_model1.sav'
# pickle.dump(CV_rfc, open(filename, 'wb'))

In [117]:
# # load the model from disk
# loaded_model10 = pickle.load(open(filename, 'rb'))
# result10 = loaded_model10.score(X_test, y_test)
# print(result10)

# GBM

In [118]:
GBM_model = GradientBoostingClassifier(n_estimators=50,
                                       learning_rate=0.3,
                                       subsample=0.8)
GBM_model.fit(X=X_train, y=y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.3, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=50,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=0.8, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [119]:
y_pred17 = GBM_model.predict(X_train)
y_pred18 = GBM_model.predict(X_test)

In [120]:
print("Accuracy for Train set:")
print(accuracy_score(y_train,y_pred17))

print("Accuracy for Test set:")
print(accuracy_score(y_test,y_pred18))

Accuracy for Train set:
0.925371775417299
Accuracy for Test set:
0.9155134741442098


In [121]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred17,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred17,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred18,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred18,pos_label="no"))


Train data f1-score for class 'yes' 0.6344581537089341

Train data f1-score for class 'no' 0.9584438848799284

Test data f1-score for class 'yes' 0.5929824561403508

Test data f1-score for class 'no' 0.9528646891507516


In [122]:
# filename = 'gbm_model.sav'
# pickle.dump(GBM_model, open(filename, 'wb'))

In [123]:
# # load the model from disk
# loaded_model11 = pickle.load(open(filename, 'rb'))
# result11 = loaded_model11.score(X_test, y_test)
# print(result11)

# GBM gridsearch

In [124]:
# Model in use
GBM = GradientBoostingClassifier() 
 
# Use a grid over parameters of interest
param_grid = {"n_estimators" : [100,150],
              "max_depth" : [5, 10],
              "learning_rate" : [0.1,0.2]
             } 

In [125]:
CV_GBM = GridSearchCV(estimator=GBM, param_grid=param_grid, cv=5)
CV_GBM.fit(X=X_train, y=y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=GradientBoostingClassifier(criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
                                                  presort=

In [126]:
# Find best model
best_gbm_model = CV_GBM.best_estimator_
print(CV_GBM.best_score_, CV_GBM.best_params_)

0.9169044006069803 {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 150}


In [127]:
y_pred19 =best_gbm_model.predict(X_train)
y_pred20=best_gbm_model.predict(X_test)

In [128]:
print(accuracy_score(y_train,y_pred19))

0.942701062215478


In [129]:
print(accuracy_score(y_test,y_pred20))

0.9159990288905074


In [130]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred19,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred19,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred20,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred20,pos_label="no"))


Train data f1-score for class 'yes' 0.72218952324897

Train data f1-score for class 'no' 0.9680563075257174

Test data f1-score for class 'yes' 0.605473204104903

Test data f1-score for class 'no' 0.952995516913463


In [131]:
# filename = 'gbm_gridsearch_model.sav'
# pickle.dump(CV_GBM, open(filename, 'wb'))

In [132]:
# # load the model from disk
# loaded_model12 = pickle.load(open(filename, 'rb'))
# result12 = loaded_model12.score(X_test, y_test)
# print(result12)

# XGBOOST

In [133]:
XGB_model = XGBClassifier(n_estimators=500, 
                          gamma=0.5,
                          learning_rate=0.1)
XGB_model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0.5,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [134]:
y_pred21 = XGB_model.predict(X_train)
y_pred22 = XGB_model.predict(X_test)

In [135]:
print("Accuracy for Train set:")
print(accuracy_score(y_train,y_pred21))

print("Accuracy for Test set:")
print(accuracy_score(y_test,y_pred22))

Accuracy for Train set:
0.9317450682852807
Accuracy for Test set:
0.9162418062636563


In [136]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred21,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred21,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred22,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred22,pos_label="no"))


Train data f1-score for class 'yes' 0.664378450977466

Train data f1-score for class 'no' 0.9620094934036048

Test data f1-score for class 'yes' 0.6039035591274396

Test data f1-score for class 'no' 0.9531695398398262


In [137]:
# filename = 'xgb_model.sav'
# pickle.dump(XGB_model, open(filename, 'wb'))

In [138]:
# # load the model from disk
# loaded_model13 = pickle.load(open(filename, 'rb'))
# result13 = loaded_model13.score(X_test, y_test)
# print(result13)

# XGB gridsearch

In [139]:
XGB = XGBClassifier(n_jobs=-1)
 
# Use a grid over parameters of interest
param_grid = {'colsample_bytree': [0.5, 0.9],
              'n_estimators':[100],
              'max_depth': [10, 15]
             } 

In [140]:
CV_XGB = GridSearchCV(estimator=XGB, param_grid=param_grid, cv= 10)
CV_XGB.fit(X = X_train, y=y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=-1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='warn', n_jobs=None,
             param_grid={'colsample_bytree': [0.5, 0.9], 'max_depth': [10, 15],
                         'n_estimators': [100]},
             pre_dispatch='2*n_jobs'

In [141]:
# Find best model
best_xgb_model = CV_XGB.best_estimator_
print(CV_XGB.best_score_, CV_XGB.best_params_)

0.9150227617602428 {'colsample_bytree': 0.9, 'max_depth': 10, 'n_estimators': 100}


In [142]:
y_pred23 =best_xgb_model.predict(X_train)
y_pred24=best_xgb_model.predict(X_test)

In [143]:
print(accuracy_score(y_train,y_pred23))
print(accuracy_score(y_test,y_pred24))

0.9797572078907435
0.9128429230395727


In [144]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred23,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred23,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred24,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred24,pos_label="no"))


Train data f1-score for class 'yes' 0.9052422219065208

Train data f1-score for class 'no' 0.9886682183449144

Test data f1-score for class 'yes' 0.5984340044742729

Test data f1-score for class 'no' 0.9511165577342048


In [145]:
# filename = 'XGB_gridsearch_model.sav'
# pickle.dump(CV_XGB, open(filename, 'wb'))

In [146]:
# # load the model from disk
# loaded_model14 = pickle.load(open(filename, 'rb'))
# result14 = loaded_model14.score(X_test, y_test)
# print(result14)

# Adaboost

In [147]:
Adaboost_model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                                    n_estimators = 600,
                                    learning_rate = 1)

Adaboost_model.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight=None,
                                                         criterion='gini',
                                                         max_depth=2,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=None,
                             

In [148]:
y_pred25 = Adaboost_model.predict(X_train)
y_pred26 = Adaboost_model.predict(X_test)

In [149]:
print("Accuracy for Train set:")
print(accuracy_score(y_train,y_pred25))

print("Accuracy for Test set:")
print(accuracy_score(y_test,y_pred26))

Accuracy for Train set:
0.9420333839150228
Accuracy for Test set:
0.9087157076960427


In [150]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred25,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred25,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred26,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred26,pos_label="no"))


Train data f1-score for class 'yes' 0.7236689814814814

Train data f1-score for class 'no' 0.9676205329897607

Test data f1-score for class 'yes' 0.5775280898876404

Test data f1-score for class 'no' 0.9488296135002722


In [151]:
# filename = 'Adaboost_model.sav'
# pickle.dump(Adaboost_model, open(filename, 'wb'))

In [152]:
# # load the model from disk
# loaded_model15 = pickle.load(open(filename, 'rb'))
# result15 = loaded_model15.score(X_test, y_test)
# print(result15)

# Adaboost gridsearch

In [153]:
# Model in use
ADB = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2))
 
# Use a grid over parameters of interest
param_grid = {'n_estimators' : [100, 150],
              'learning_rate' : [0.1, 0.5]
             } 

In [154]:
CV_ADB = GridSearchCV(estimator=ADB, 
                      param_grid=param_grid, 
                      cv=5,
                      n_jobs=-1
                     )

CV_ADB.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                          base_estimator=DecisionTreeClassifier(class_weight=None,
                                                                                criterion='gini',
                                                                                max_depth=2,
                                                                                max_features=None,
                                                                                max_leaf_nodes=None,
                                                                                min_impurity_decrease=0.0,
                                                                                min_impurity_split=None,
                                                                                min_samples_leaf=1,
                                                                                min

In [155]:
# Find best model
best_adb_model = CV_ADB.best_estimator_
print(CV_ADB.best_score_, CV_ADB.best_params_)

0.9150531107738998 {'learning_rate': 0.5, 'n_estimators': 150}


In [156]:
y_pred27 =best_adb_model.predict(X_train)
y_pred28 =best_adb_model.predict(X_test)

In [157]:
print(accuracy_score(y_train,y_pred27))
print(accuracy_score(y_test,y_pred28))

0.9229742033383915
0.9136926438455936


In [158]:
print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred27,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred27,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred28,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred28,pos_label="no"))


Train data f1-score for class 'yes' 0.6183458646616541

Train data f1-score for class 'no' 0.9571645569620253

Test data f1-score for class 'yes' 0.584938704028021

Test data f1-score for class 'no' 0.951839057102215
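
To compare how much each boosted model overfits, the train and test F1 scores for class 'yes' can be put side by side. The sketch below copies those scores from the outputs printed above (rounded to four decimals) and ranks the models by the train-minus-test gap. The dictionary layout is an illustration, not part of the notebook.

```python
# (train F1 'yes', test F1 'yes'), copied from the printed outputs and rounded.
f1_yes_scores = {
    "GBM":    (0.6345, 0.5930),
    "GBM_GS": (0.7222, 0.6055),
    "XGB":    (0.6644, 0.6039),
    "XGB_GS": (0.9052, 0.5984),
    "ADB":    (0.7237, 0.5775),
    "ADB_GS": (0.6183, 0.5849),
}

# A large positive gap means the model fits the training data much better
# than it generalises to the test data.
gaps = {name: train - test for name, (train, test) in f1_yes_scores.items()}

for name in sorted(gaps, key=gaps.get, reverse=True):
    train, test = f1_yes_scores[name]
    print(f"{name:7s} train={train:.4f} test={test:.4f} gap={gaps[name]:.4f}")
```

On these numbers the grid-searched XGBoost model shows the largest gap (about 0.31) and the grid-searched AdaBoost model the smallest (about 0.03), while their test F1 scores stay within a few hundredths of each other.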


In [159]:
# filename = 'ADB_gridsearch_model.sav'
# pickle.dump(CV_ADB, open(filename, 'wb'))

In [160]:
# # load the model from disk
# loaded_model16 = pickle.load(open(filename, 'rb'))
# result16 = loaded_model16.score(X_test, y_test)
# print(result16)

# Store and load models sample code

In [161]:
# model = LogisticRegression()
# model.fit(X_train, y_train)

# filename = 'finalized_model.sav'
# pickle.dump(model, open(filename, 'wb'))
 
# load the model from disk
# loaded_model = pickle.load(open(filename, 'rb'))
# result = loaded_model.score(X_test, y_test)
# print(result)

In [162]:
# import pickle 

# saved_model = pickle.dumps(knn) 
  
# Load the pickled model 
# knn_from_pickle = pickle.loads(saved_model) 
  
# # Use the loaded pickled model to make predictions 
# knn_from_pickle.predict(X_test) 
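
A runnable version of the save-and-load pattern above is sketched here using only the standard library. The `with` blocks make sure the file handles are closed. The dictionary `model_state` is a hypothetical stand-in for a fitted estimator; any picklable model object could be passed in its place. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model (an assumption for this sketch).
model_state = {"kind": "majority_class", "predicted_label": "no"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "finalized_model.sav")

    # Save: the context manager closes the file even if dump() raises.
    with open(path, "wb") as fh:
        pickle.dump(model_state, fh)

    # Load the object back from disk.
    with open(path, "rb") as fh:
        loaded_state = pickle.load(fh)

print(loaded_state)
```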

# Ridge classifier gridsearch

In [163]:
from sklearn.linear_model import RidgeClassifier
ridge = RidgeClassifier()

In [164]:
parameters = {"alpha" : [1e-2, 1, 2]}

In [165]:
ridge_regressor = GridSearchCV(ridge, parameters, cv= 5)

In [166]:
ridge_regressor.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RidgeClassifier(alpha=1.0, class_weight=None,
                                       copy_X=True, fit_intercept=True,
                                       max_iter=None, normalize=False,
                                       random_state=None, solver='auto',
                                       tol=0.001),
             iid='warn', n_jobs=None, param_grid={'alpha': [0.01, 1, 2]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [167]:
print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

{'alpha': 0.01}
0.9056449165402124


In [168]:
y_pred29 =ridge_regressor.predict(X_train)
y_pred30 =ridge_regressor.predict(X_test)

In [169]:
print(accuracy_score(y_train,y_pred29))
print(accuracy_score(y_test,y_pred30))

print("\nTrain data f1-score for class 'yes'",f1_score(y_train,y_pred29,pos_label="yes"))
print("\nTrain data f1-score for class 'no'",f1_score(y_train,y_pred29,pos_label="no"))

print("\nTest data f1-score for class 'yes'",f1_score(y_test,y_pred30,pos_label="yes"))
print("\nTest data f1-score for class 'no'",f1_score(y_test,y_pred30,pos_label="no"))

0.9059180576631259
0.9089584850691915

Train data f1-score for class 'yes' 0.42464736451373425

Train data f1-score for class 'no' 0.9487704918032788

Test data f1-score for class 'yes' 0.44690265486725667

Test data f1-score for class 'no' 0.9503968253968254


In [170]:
# filename = 'Ridge_gridsearch_model.sav'
# pickle.dump(ridge_regressor, open(filename, 'wb'))

In [171]:
# # load the model from disk
# loaded_model17 = pickle.load(open(filename, 'rb'))
# result17 = loaded_model17.score(X_test, y_test)
# print(result17)

# Ensembling

In [172]:
# from scipy.stats import mode

In [173]:
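# One row of predictions per base model: SVC (rbf), RF grid search, GBM,
# XGBoost and AdaBoost grid search. The frames are transposed in the next cell.
stack_train = pd.DataFrame([y_pred9,y_pred15,y_pred17,y_pred21,y_pred27])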
stack_test = pd.DataFrame([y_pred10,y_pred16,y_pred18,y_pred22,y_pred28])
#stacked_pred_train = mode(stack_train,axis=1)[0]

In [174]:
stack_train = stack_train.T
stack_test = stack_test.T

In [175]:
stack_train.columns = ['SVC','RF','GBM','XGB','ADB']
stack_test.columns = ['SVC','RF','GBM','XGB','ADB']

In [176]:
print(stack_train)
print(stack_test)

      SVC  RF  GBM  XGB  ADB
0      no  no   no   no   no
1      no  no  yes  yes  yes
2      no  no   no   no   no
3      no  no   no   no   no
4      no  no   no   no   no
5      no  no   no   no   no
6      no  no  yes  yes  yes
7      no  no   no   no   no
8      no  no   no   no   no
9      no  no   no   no   no
10     no  no   no   no   no
11     no  no  yes  yes  yes
12     no  no   no   no   no
13     no  no   no   no   no
14     no  no   no   no   no
15     no  no   no   no   no
16     no  no   no   no   no
17     no  no   no   no   no
18     no  no   no   no   no
19     no  no   no   no   no
20     no  no   no   no   no
21     no  no   no   no   no
22     no  no   no   no   no
23     no  no   no   no   no
24     no  no   no   no   no
25     no  no   no   no   no
26     no  no   no   no   no
27     no  no   no   no   no
28     no  no   no   no   no
29     no  no   no   no   no
...    ..  ..  ...  ...  ...
32920  no  no   no   no   no
32921  no  no   no   no   no
32922  no  no 

In [177]:
# Encode each base model's 'no'/'yes' predictions as 0/1, column by column
for col in ['SVC', 'RF', 'GBM', 'XGB', 'ADB']:
    stack_train[col] = label_encoder.fit_transform(stack_train[col])
    stack_test[col] = label_encoder.fit_transform(stack_test[col])

In [178]:
stack_train.head()

Unnamed: 0,SVC,RF,GBM,XGB,ADB
0,0,0,0,0,0
1,0,0,1,1,1
2,0,0,0,0,0
3,0,0,0,0,0
4,0,0,0,0,0
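
The commented-out `mode` call earlier points at a simpler way to combine the models: a majority vote across the five columns. A minimal sketch in plain Python follows, using the five encoded rows shown above as input. The function `majority_vote` is illustrative and not part of the notebook.

```python
from collections import Counter

# The five encoded rows from stack_train.head(), columns SVC, RF, GBM, XGB, ADB.
rows = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]


def majority_vote(row):
    """Return the label predicted by most base models in this row."""
    return Counter(row).most_common(1)[0][0]


# With five binary voters there can be no tie.
votes = [majority_vote(row) for row in rows]
print(votes)
```

Row 1 is the only one where the vote is split: three of the five models predict 1, so the ensemble predicts 1.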


In [179]:
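# Meta-learner: a logistic regression fitted on the base models' encoded
# predictions (the variable is named dtc, but it is not a decision tree).
dtc = LogisticRegression()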

dtc.fit(stack_train,y_train)

stacked_pred_train = dtc.predict(stack_train)
stacked_pred_test = dtc.predict(stack_test)

In [180]:
print(accuracy_score(y_train, stacked_pred_train))
print(accuracy_score(y_test, stacked_pred_test))

0.9299544764795145
0.9161204175770818


In [181]:
print(f1_score(y_train,stacked_pred_train,average='macro'))

print(f1_score(y_test,stacked_pred_test,average='macro'))

0.8052188858040593
0.7742045306745556
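
The meta-learner above is trained on predictions that the base models made on their own training rows, which can make the stacked train scores optimistic. One common alternative, sketched here under assumptions, builds the meta-features from out-of-fold predictions with `cross_val_predict`. The data is synthetic, only two base models are used to keep it short, and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic, imbalanced stand-in for the notebook's data (an assumption).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

base_models = {
    "RF": RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0),
    "GBM": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Out-of-fold predictions: every training row is predicted by a model
# that did not see that row during fitting.
oof_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=5) for model in base_models.values()
])

# For the test set, refit each base model on all training rows.
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).predict(X_te) for model in base_models.values()
])

meta_model = LogisticRegression()
meta_model.fit(oof_train, y_tr)
stacked_pred_test = meta_model.predict(meta_test)

print("stacked test accuracy:", (stacked_pred_test == y_te).mean())
```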


In [182]:
# filename = 'logreg_ens_model.sav'
# pickle.dump(dtc, open(filename, 'wb'))

In [183]:
# # load the model from disk
# loaded_model18 = pickle.load(open(filename, 'rb'))
# result18 = loaded_model18.score(stack_test, y_test)
# print(result18)

# Classification report

In [184]:
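# Metrics are hand-copied from the cells above. For ENSEMBLE, both f1 columns
# repeat the macro-averaged F1 from the ensembling cell, not per-class scores.
classification_error_report = {'Model':['LR_GS', 'NB', 'DT_GS', 'SVC_linear', 'SVC_rbf', 'RF', 'RF_GS', 'GBM', 'GBM_GS', 'XGB', 'XGB_GS', 'ADB', 'ADB_GS', 'RIDGE', 'ENSEMBLE'], 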
                               'Train_Accuracy':[0.9, 0.71, 0.918, 0.901, 0.907, 0.92, 0.902, 0.92, 0.94, 0.93, 0.97, 0.94, 0.92, 0.9, 0.92],
                               'Test_Accuracy':[0.91, 0.71, 0.914, 0.903, 0.909, 0.91, 0.903, 0.91, 0.91, 0.91, 0.91, 0.90, 0.91, 0.9, 0.91],
                               'Train_f1_score_class_yes':[0.72, 0.606, 0.786, 0.38, 0.42, 0.52, 0.32, 0.62, 0.72, 0.66, 0.9, 0.72, 0.61, 0.42, 0.8],
                               'Train_f1_score_class_no':[0.72, 0.606, 0.786, 0.94, 0.94, 0.95, 0.94, 0.95, 0.96, 0.96, 0.98, 0.96, 0.95, 0.94, 0.8],
                               'Test_f1_score_class_yes':[0.73, 0.605, 0.782, 0.4, 0.45, 0.49, 0.33, 0.59, 0.6, 0.6, 0.59, 0.57, 0.58, 0.44, 0.77],
                               'Test_f1_score_class_no':[0.73, 0.605, 0.782, 0.94, 0.95, 0.95, 0.94, 0.95, 0.95, 0.95, 0.95, 0.94, 0.95, 0.95, 0.77]}

In [185]:
report = pd.DataFrame(classification_error_report) 
report.head()

Unnamed: 0,Model,Train_Accuracy,Test_Accuracy,Train_f1_score_class_yes,Train_f1_score_class_no,Test_f1_score_class_yes,Test_f1_score_class_no
0,LR_GS,0.9,0.91,0.72,0.72,0.73,0.73
1,NB,0.71,0.71,0.606,0.606,0.605,0.605
2,DT_GS,0.918,0.914,0.786,0.786,0.782,0.782
3,SVC_linear,0.901,0.903,0.38,0.94,0.4,0.94
4,SVC_rbf,0.907,0.909,0.42,0.94,0.45,0.95


In [186]:
# Creating excel file of classification report
#report.to_excel("Classification_report.xlsx",engine='xlsxwriter')
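
Typing the report values by hand makes it easy for the table to drift from the printed outputs. A possible alternative, sketched in plain Python, computes each row directly from the true and predicted labels. The helpers `f1_for_class` and `report_row` and the toy label lists are hypothetical; in the notebook the inputs would be pairs such as `y_train` and `y_pred9`.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def report_row(name, y_train, pred_train, y_test, pred_test):
    """Build one report row with the same column names as the table above."""
    return {
        "Model": name,
        "Train_Accuracy": round(accuracy(y_train, pred_train), 3),
        "Test_Accuracy": round(accuracy(y_test, pred_test), 3),
        "Train_f1_score_class_yes": round(f1_for_class(y_train, pred_train, "yes"), 3),
        "Train_f1_score_class_no": round(f1_for_class(y_train, pred_train, "no"), 3),
        "Test_f1_score_class_yes": round(f1_for_class(y_test, pred_test, "yes"), 3),
        "Test_f1_score_class_no": round(f1_for_class(y_test, pred_test, "no"), 3),
    }


# Toy labels standing in for real predictions (an assumption).
y_train_demo = ["no", "no", "yes", "yes", "no", "no"]
pred_train_demo = ["no", "no", "yes", "no", "no", "yes"]
y_test_demo = ["no", "yes", "no", "no"]
pred_test_demo = ["no", "yes", "no", "yes"]

rows = [report_row("DEMO", y_train_demo, pred_train_demo, y_test_demo, pred_test_demo)]
print(rows[0])
```

A list of such rows can be passed straight to `pd.DataFrame(rows)` to get the same table layout as `report`.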