# Attribute Information:

## Input variables:
### bank client data:
1 - age (numeric)  
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')  
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)  
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')  
5 - default: has credit in default? (categorical: 'no','yes','unknown')  
6 - housing: has housing loan? (categorical: 'no','yes','unknown')  
7 - loan: has personal loan? (categorical: 'no','yes','unknown')  
### related to the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')  
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')  
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')  
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.  
### other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)  
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)  
14 - previous: number of contacts performed before this campaign and for this client (numeric)  
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')  
### social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)  
17 - cons.price.idx: consumer price index - monthly indicator (numeric)  
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)  
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)  
20 - nr.employed: number of employees - quarterly indicator (numeric)  

### Output variable (desired target):
21 - y - has the client subscribed to a term deposit? (binary: 'yes','no')  
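The note on attribute 11 says `duration` should be discarded for a realistic model, and attribute 13 uses 999 as a sentinel for "not previously contacted". A minimal sketch of both steps follows, on a tiny hypothetical frame; the `toy` data, the `prepare_realistic` helper, and the `previously_contacted` column are illustrative assumptions, and the real `bank-clean.csv` may already encode these columns differently.

```python
import pandas as pd

# Tiny hypothetical frame using a few of the documented column names.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "duration": [120, 0, 610],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})


def prepare_realistic(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop `duration` and expose the pdays sentinel as an explicit flag."""
    out = frame.drop(columns=["duration"])
    # 999 means the client was not previously contacted.
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    return out


realistic = prepare_realistic(toy)
print(realistic.columns.tolist())
```

Keeping `duration` is still reasonable for benchmark comparisons, as the attribute description points out.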

In [11]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [4]:
df= pd.read_csv('bank-clean.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)

In [7]:
df['y'] = (df['y'] == 'yes').astype(int)

In [37]:
x=df.drop('y', axis=1)
y=df['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
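Because the classes are unbalanced, a plain random split can leave slightly different class ratios in the training and test sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the bank data) with 12% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows, 3 features, 12% positive labels.
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 120 + [0] * 880)

# stratify=y_demo keeps the class ratio (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("positive rate, train:", round(y_tr.mean(), 3))
print("positive rate, test: ", round(y_te.mean(), 3))
```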

# Modeling

In [53]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

## Original unbalanced data

In [51]:
# Using DecisionTreeClassifier
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', DecisionTreeClassifier())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.93      0.92      0.93     13175
           1       0.45      0.47      0.46      1745

    accuracy                           0.87     14920
   macro avg       0.69      0.70      0.69     14920
weighted avg       0.87      0.87      0.87     14920

--------------------------------------
[[12160  1015]
 [  918   827]]
--------------------------------------
0.8704423592493298
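The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading the later results. The sketch below uses the decision tree matrix printed above and scikit-learn's layout (rows are true classes, columns are predicted classes). The `majority_baseline` line is an added comparison, not part of the original output: it is the accuracy of always predicting class 0.

```python
# Confusion matrix reported for the decision tree on the unbalanced data.
tn, fp = 12160, 1015   # true class 0: predicted 0, predicted 1
fn, tp = 918, 827      # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)          # of predicted positives, how many are right
recall_1 = tp / (tp + fn)             # of true positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy obtained by always predicting the majority class (0).
majority_baseline = (tn + fp) / total

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
print(round(majority_baseline, 4))
```

The recomputed values agree with the report (accuracy 0.8704, class 1 precision 0.45, recall 0.47, F1 0.46). The baseline comes out slightly higher than the tree's accuracy, which is why accuracy alone says little on data this unbalanced and the class 1 row of the report is the more informative one.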


In [49]:
# Using logistic regression
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.92      0.97      0.95     13175
           1       0.64      0.34      0.44      1745

    accuracy                           0.90     14920
   macro avg       0.78      0.66      0.69     14920
weighted avg       0.88      0.90      0.89     14920

--------------------------------------
[[12837   338]
 [ 1155   590]]
--------------------------------------
0.8999329758713137


In [52]:
# Using KNN
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', KNeighborsClassifier())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.91      0.97      0.94     13175
           1       0.59      0.32      0.41      1745

    accuracy                           0.89     14920
   macro avg       0.75      0.64      0.68     14920
weighted avg       0.88      0.89      0.88     14920

--------------------------------------
[[12785   390]
 [ 1191   554]]
--------------------------------------
0.8940348525469168


In [54]:
# Using Random forest
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', RandomForestClassifier())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.92      0.97      0.95     13175
           1       0.64      0.37      0.47      1745

    accuracy                           0.90     14920
   macro avg       0.78      0.67      0.71     14920
weighted avg       0.89      0.90      0.89     14920

--------------------------------------
[[12817   358]
 [ 1099   646]]
--------------------------------------
0.9023458445040214


## Using upsampling to balance the data

In [44]:
# Make the ratio between classes 1:1 using upsampling
from sklearn.utils import resample

df_majority = df[df['y']==0]
df_minority = df[df['y']==1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled['y'].value_counts()

0    39922
1    39922
Name: y, dtype: int64

In [56]:
x=df_upsampled.drop('y', axis=1)
y=df_upsampled['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
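The split above is taken from the already upsampled frame, so copies of one minority row can land in both the training and the test set. A common alternative, sketched here as an assumption rather than as what this notebook does, is to split first and upsample only the training part, leaving the test set with its original class ratio. The frame `demo` is a synthetic stand-in for `df`, and all names ending in `_bal` or `_orig` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for df: one feature, 12% positive labels.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "y": np.array([1] * 120 + [0] * 880),
})

# 1) Split first, so the test rows are never seen during resampling.
train_df, test_df = train_test_split(
    demo, test_size=0.33, random_state=42, stratify=demo["y"]
)

# 2) Upsample the minority class inside the training part only.
train_major = train_df[train_df["y"] == 0]
train_minor = train_df[train_df["y"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=123
)
train_balanced = pd.concat([train_major, train_minor_up])

x_train_bal = train_balanced.drop("y", axis=1)
y_train_bal = train_balanced["y"]
x_test_orig = test_df.drop("y", axis=1)
y_test_orig = test_df["y"]

print(y_train_bal.value_counts().to_dict())
print("rows shared by train and test:", len(set(train_balanced.index) & set(test_df.index)))
```

Models fitted on `x_train_bal` and scored on `x_test_orig` are then evaluated on data that still looks like the original, unbalanced population.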

In [57]:
# Using DecisionTreeClassifier
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', DecisionTreeClassifier())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      0.91      0.95     13165
           1       0.92      1.00      0.95     13184

    accuracy                           0.95     26349
   macro avg       0.96      0.95      0.95     26349
weighted avg       0.96      0.95      0.95     26349

--------------------------------------
[[11954  1211]
 [   29 13155]]
--------------------------------------
0.9529393904892026


In [58]:
# Using logistic regression
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', LogisticRegression())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.81      0.84      0.82     13165
           1       0.83      0.80      0.81     13184

    accuracy                           0.82     26349
   macro avg       0.82      0.82      0.82     26349
weighted avg       0.82      0.82      0.82     26349

--------------------------------------
[[11007  2158]
 [ 2651 10533]]
--------------------------------------
0.8174883297278834


In [59]:
# Using KNN
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', KNeighborsClassifier())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.96      0.82      0.89     13165
           1       0.85      0.97      0.90     13184

    accuracy                           0.89     26349
   macro avg       0.90      0.89      0.89     26349
weighted avg       0.90      0.89      0.89     26349

--------------------------------------
[[10845  2320]
 [  448 12736]]
--------------------------------------
0.8949485748984781


In [60]:
# Using Random forest
pipe=Pipeline([
    ('scale', StandardScaler()),
    ('model', RandomForestClassifier())
]) 

pipe.fit(x_train, y_train)

pred= pipe.predict(x_test)
print(classification_report(y_test, pred))
print('--------------------------------------')
print(confusion_matrix(y_test, pred))
print('--------------------------------------')
print(accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96     13165
           1       0.93      1.00      0.97     13184

    accuracy                           0.96     26349
   macro avg       0.97      0.96      0.96     26349
weighted avg       0.97      0.96      0.96     26349

--------------------------------------
[[12249   916]
 [   18 13166]]
--------------------------------------
0.9645527344491251


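Upsampling affected the algorithms differently. Accuracy rose for the decision tree (0.87 to 0.95) and the random forest (0.90 to 0.96), stayed at about 0.89 for KNN, and fell for logistic regression (0.90 to 0.82), while recall on class 1 improved for every model. These differences come from how each algorithm assigns labels to points. Note that the upsampling was applied before the train/test split, so copies of the same minority rows can appear in both sets. Models that can memorize individual points (decision tree, random forest, KNN) benefit from this, so their scores on the upsampled data are likely optimistic.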
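To compare how the four algorithms respond under identical conditions, the repeated cells can be folded into one loop. The following is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the bank data), reporting recall on the minority class for each model; the dictionary names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced stand-in data (roughly 88% / 12%).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42, stratify=y_syn
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(),
    "knn": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
}

# Same scaling + model pipeline as in the cells above, one pass per model.
minority_recall = {}
for name, model in models.items():
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipe.fit(X_tr, y_tr)
    minority_recall[name] = recall_score(y_te, pipe.predict(X_te), pos_label=1)

for name, score in minority_recall.items():
    print(f"{name}: minority recall = {score:.2f}")
```

Swapping `X_tr, y_tr` for an upsampled training set while keeping `X_te, y_te` fixed would show the effect of balancing on each algorithm with the same test rows.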