### Bank Marketing - Term Deposit Sale analysis and Modeling 

#### Objective

The goal of the project is to build multiple models that predict how likely a bank client is to subscribe to a term deposit.

#### Import the necessary libraries

In [None]:
import os

In [None]:
os.getcwd()

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # matplotlib.pyplot plots data
%matplotlib inline 
from sklearn.model_selection import train_test_split
import missingno as msno
import warnings
pd.options.display.max_columns = None
pd.options.display.max_rows = None
warnings.filterwarnings("ignore")
pd.options.display.float_format = '{:,.2f}'.format

#### Read the dataset 

In [None]:
data=pd.read_csv('bank-full.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe().transpose()

In [None]:
data.nunique()  ## Number of unique values in each column

#### Basic checks on the data prior to analysis 

In [None]:
def basic_checks(df):
    
    print('='*50)
    print('Shape of the dataframe is: \n',df.shape)
    print('='*50)
    print('Basic stats for the data: \n',df.describe())
    print('='*50)
    print('Data type and info :')
    print(df.info())
    print('='*50)
    print('Missing value information : \n',df.isnull().any())
    print('='*50)
    print('Sum of missing values if any : \n',df.isnull().sum())

In [None]:
basic_checks(data)

#### Missing values matrix

In [None]:
msno.matrix(data)

No missing values are seen in the data set 

###### Univariate and bivariate analysis of the data 

Plotting correlations between the variables 

In [None]:
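# Correlate numeric columns only; recent pandas versions raise an error on text columns
data.select_dtypes(include='number').corr()

In [None]:
plt.figure(figsize=(10,8))

sns.heatmap(data.select_dtypes(include='number').corr(),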
            annot=True,
            linewidths=.5,
            center=0,
            cbar=False,
            cmap="YlGnBu")

plt.show()

#### EDA Part 1 

Basic stats

1. The data set has 45211 records and 17 variables.
2. The average age of the clients is 40.9, with 18 the youngest and 95 the oldest.
3. The average account balance is 1362, with a minimum of -8019 and a maximum of 102127.
4. The last contact duration averages 4.3 minutes.
5. An average of 2.7 contacts were made with each customer during this campaign.
6. A majority of the clients seem either not to have been contacted for more than 900 days or never to have been contacted at all.
 
Data Types :

There is a good mix of categorical and continuous variables in this dataset. Some continuous variables, such as age, balance and pdays, need further processing to understand the data better.

Missing values:

We do not see any missing values in this dataset

Correlation between the variables :

1. campaign and day are very slightly correlated.
2. pdays and previous show some correlation, with a score of 0.45.
3. There is no significant correlation between the other variables in the dataset.
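
To read the strongest relationships off the correlation matrix without scanning the heatmap, the pairs can be ranked by absolute correlation. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are assumptions, not the bank data); only the column names mirror the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bank data; values are invented for illustration.
toy = pd.DataFrame({
    "pdays":    [-1, -1, 90, 180, 300, -1],
    "previous": [0, 0, 1, 3, 5, 0],
    "campaign": [1, 4, 2, 1, 3, 2],
    "job":      ["admin.", "technician", "services", "admin.", "retired", "student"],
})

# Keep numeric columns only; text columns cannot be correlated.
corr = toy.select_dtypes(include="number").corr()

# Keep the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Applied to the real dataframe, the same ranking would put the pdays/previous pair near the top if the 0.45 score reported above holds.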




### Univariate and bivariate plots - Continuous variables 

#### Age

In [None]:
def plots(variable):
    fig=plt.figure(figsize=(10,5))
    plt.subplot(221)
    sns.distplot(data[variable])
    plt.xticks(rotation=90)
    plt.subplot(222)
    sns.boxplot(x=data[variable])
    plt.xticks(rotation=90)
    plt.subplot(223)
    sns.boxplot(x=data['Target'],y=data[variable])
    plt.xticks(rotation=90)
    plt.subplot(224)
    sns.barplot(x='Target',y=data[variable],data=data)
    plt.xticks(rotation=90)

age, balance, day, duration, campaign, pdays and previous are int data types.

In [None]:
plots('age')

In [None]:
def age_cat(x):
    # Bucket age into four bands: 18-35, 36-50, 51-65 and over 65
    if 18 <= x <= 35:
        return 0
    elif 35 < x <= 50:
        return 1
    elif 50 < x <= 65:
        return 2
    elif x > 65:
        return 3

In [None]:
data['age_category']=data['age'].apply(age_cat)

In [None]:
data.head()

In [None]:
# Pass the column as x=...; recent seaborn versions no longer accept it positionally
sns.countplot(x='age_category',hue='Target',data=data)

In [None]:

sns.countplot(x='age_category',hue='Target',data=data[data['Target']=='yes']);

In [None]:
cats=[0,1,2,3]

for i in cats:
    a=data[(data['age_category']==i)&(data['Target']=='yes')].shape[0]
    b=data[data['age_category']==i].shape[0]
    print('Proportions for age category {} is: {}%'.format(i,np.round(a/b*100,2)))

### Balance

In [None]:
plots('balance')

In [None]:
fig=plt.figure(figsize=(15,5))
sns.distplot(data[(data['balance']>-9000) & (data['balance']<10000)]['balance']);


The data is extremely right skewed.

Exploring the balance variable further to see the spread of the data.
We first define a function that buckets the balance into categories.

In [None]:
def bal_cat(x):
    # Bucket balance into six bands; the lower bound sits just below the observed minimum of -8019
    if -8110 < x < -2000:
        return 0
    elif -2000 <= x < 0:
        return 1
    elif 0 <= x < 3000:
        return 2
    elif 3000 <= x < 6000:
        return 3
    elif 6000 <= x < 8000:
        return 4
    elif x >= 8000:
        return 5

In [None]:
data['balance_category']=data['balance'].map(bal_cat)

Balance distribution for those clients that subscribed

In [None]:

sns.countplot(x='balance_category',hue='Target',data=data[data['Target']=='yes']);

We see that balance category 2 (a balance from 0 up to 3000) is the most common,
followed by customers that have a balance of 6000-8000.

Further exploring how the target variable changes across these popular account balance bands.

In [None]:
sns.countplot(x='balance_category',hue='Target',data=data);

In [None]:
cats=[0,1,2,3,4,5]
for i in cats:
    a=data[(data['balance_category']==i) & (data['Target']=='yes')].shape[0]
    b=data[data['balance_category']==i].shape[0]
    print('Subscriptions % for balance category : {} is {}%'.format(i,np.round(a/b*100,2)))

As a percentage of the clients in the same category, the subscription rate is highest for category 3, followed by categories 5 and 4. So clients with an account balance of 3000 or more subscribed to the term deposit at a relatively higher rate.
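
The per-category loop above can also be written as a single `groupby`. This is a minimal sketch on invented data (`toy` is an assumption); it computes both the rate within each category and each category's share of all clients, which are the two percentages this section prints.

```python
import pandas as pd

# Invented example data; the column names mirror the notebook.
toy = pd.DataFrame({
    "balance_category": [2, 2, 2, 2, 3, 3, 5, 5],
    "Target": ["no", "no", "no", "yes", "yes", "no", "yes", "yes"],
})

# True where the client subscribed.
subscribed = toy["Target"].eq("yes")

# Subscription rate within each balance category, in percent.
rate_pct = subscribed.groupby(toy["balance_category"]).mean().mul(100).round(2)

# Subscribers in each category as a percentage of all clients.
share_pct = subscribed.groupby(toy["balance_category"]).sum().div(len(toy)).mul(100).round(2)

print(rate_pct)
print(share_pct)
```

The same pattern works for any of the binned or categorical columns by swapping the grouping column.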

In [None]:
cats=[0,1,2,3,4,5]
for i in cats:
    a=data[(data['balance_category']==i) & (data['Target']=='yes')].shape[0]
    b=data['balance_category'].count()
    print('Subscriptions % for balance category {}  against overall is: {}%'.format(i,np.round(a/b*100,2)))

#### Day

In [None]:
plots('day')

In [None]:
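def days_cat(x):
    # Bucket day of month into 1-7, 8-23 and 24-31.
    # Bounds are inclusive so that days 8 and 24 are not left without a category.
    if 1 <= x <= 7:
        return 0
    elif 8 <= x <= 23:
        return 1
    elif 24 <= x <= 31:
        return 2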

In [None]:
data['day_category']=data['day'].apply(days_cat)

In [None]:
sns.countplot(x='day_category',hue='Target',data=data)

In [None]:
sns.countplot(x='day_category',hue='Target',data=data[data['Target']=='yes'])

In [None]:
cats=[0,1,2]

for i in cats:
    a=data[(data['day_category']==i)&(data['Target']=='yes')].shape[0]
    b=data[data['day_category']==i].shape[0]
    print('Proportions for day category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Duration

In [None]:
plots('duration')

In [None]:
sns.distplot(data['duration'])

Distribution of the cube-root-transformed data 

In [None]:
data['duration']=np.cbrt(data['duration'])

In [None]:
sns.distplot(data['duration'])

The call duration follows a similar trend for clients who subscribed and those who did not. The majority of clients who subscribed have a last contact duration of less than 6 minutes.

#### Campaign

In [None]:
plots('campaign')

In [None]:
sns.distplot(data['campaign'])

In [None]:
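def camp_cat(x):
    # Bucket number of contacts into 1-2, 3-9 and 10 or more.
    # The last bound is inclusive so that exactly 10 contacts is not left without a category.
    if 0 < x < 3:
        return 0
    elif 3 <= x < 10:
        return 1
    elif x >= 10:
        return 2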

In [None]:
data['campaign_category']=data['campaign'].apply(camp_cat)

In [None]:
sns.countplot(x='campaign_category',hue='Target',data=data)

In [None]:
sns.countplot(x='campaign_category',hue='Target',data=data[data['Target']=='yes'])

It's clear that most customers have been contacted less than 10 times. The customers who have subscribed to the term deposit have been contacted less than 3 times.

#### Previous

In [None]:

fig=plt.figure(figsize=(10,5))
plt.subplot(131)
sns.distplot(data['previous'],kde=False)
plt.xticks(rotation=90)
plt.subplot(132)
sns.boxplot(x=data['previous'])
plt.xticks(rotation=90)
plt.subplot(133)
sns.boxplot(x=data['Target'],y=data['previous'])
plt.xticks(rotation=90)

In [None]:
sns.distplot(data['previous'],kde=False)
plt.xlim(0,100);

Binning the data into 3 categories as below 

In [None]:
def previous_cat(x):
    if (x>=0)&(x<5):
        return 0
    else:
        if (x>=5)&(x<20):
            return 1
        else:
            if (x>=20):
                return 2       
        

In [None]:
data['previous_category']=data['previous'].apply(previous_cat)

In [None]:
sns.countplot(x='previous_category',hue='Target',data=data)

In [None]:
sns.countplot(x='previous_category',hue='Target',data=data[data['Target']=='yes'])

Very clearly, a majority were contacted less than 5 times before this campaign.

#### pdays

In [None]:

fig=plt.figure(figsize=(10,5))
plt.subplot(131)
sns.distplot(data['pdays'],kde=False)
plt.xticks(rotation=90)
plt.subplot(132)
sns.boxplot(x=data['pdays'])
plt.xticks(rotation=90)
plt.subplot(133)
sns.boxplot(x=data['Target'],y=data['pdays'])
plt.xticks(rotation=90)

In [None]:
sns.distplot(data['pdays'],kde=False)
plt.xlim(-10,400)

Binning into categories

In [None]:
def pdays_cat(x):
    if (x<1):
        return 0
    else:
        if(x>=1)&(x<30):
            return 1
        else:
            if(x>=30)&(x<100):
                return 2
            else:
                if(x>=100):
                    return 3

In [None]:
data['pdays_category']=data['pdays'].apply(pdays_cat)

In [None]:
sns.countplot(x='pdays_category',hue='Target',data=data)

In [None]:
sns.countplot(x='pdays_category',hue='Target',data=data[data['Target']=='yes'])

In [None]:
cats=[0,1,2,3]

for i in cats:
    a=data[(data['pdays_category']==i)&(data['Target']=='yes')].shape[0]
    b=data[data['pdays_category']==i].shape[0]
    print('Proportions for pdays category {} is: {}%'.format(i,np.round(a/b*100,2)))

In [None]:
cats=[0,1,2,3]

for i in cats:
    a=data[(data['pdays_category']==i)&(data['Target']=='yes')].shape[0]
    b=data['pdays_category'].count()
    print('Proportions for pdays category  {} against overall is :{}%'.format(i,np.round(a/b*100,2)))

#### EDA - Part 2 

##### Univariate and bivariate analysis - Continuous variables

Breaking the analysis down by variable to understand the relationship between each variable and the target, we proceed to univariate and bivariate analysis.

Age:
1. Age is slightly right skewed, and most customers are between the ages of 20 and 60. There are a few outliers above age 70 in the overall data and above age 80 among the subscribers.
We will need to bin this data to generate meaningful insights, as the current distribution does not let us visualise them.
2. The ages of clients who subscribed to the term deposit range from 20 to above 80, with the median at 40. The number of subscribers is highest in the 18-35 age range, followed by 35-50. However, the percentage of subscribers is highest for clients over 65 years.
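
The age bands described here can also be produced with `pd.cut` instead of a hand-written function. The sketch below is one possible formulation, not the notebook's own method; the upper edge of 120 is an arbitrary cap assumed for illustration.

```python
import pandas as pd

# A few boundary ages to show which band each one falls into.
ages = pd.Series([18, 35, 36, 50, 51, 65, 66, 95], name="age")

# Right-closed bins (17, 35], (35, 50], (50, 65], (65, 120] mirror the age bands used here.
age_category = pd.cut(ages, bins=[17, 35, 50, 65, 120], labels=[0, 1, 2, 3])
print(age_category.tolist())
```

Because the bin edges sit in one list, it is easier to check that no value falls between two bands.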

Balance:
1. Balance data is extremely right skewed. There are a lot of outliers, and the distribution is hard to read from the graphs alone; the minimum balance is -8019 and the maximum is 102127. We will need to bin this variable to see how the data is distributed between those that subscribed and those that did not.
2. Among the clients that subscribed, balances from 0 up to 3000 are the most common, followed by balances of 6000-8000. We then explore how the target variable changes across these popular account balance bands.
3. The highest conversion to a deposit subscription is for account balances of 3000 and above.

Day:
1. The last contact day of the month doesn’t show any particular trend. Although it is stored as a continuous variable, it is useful to see whether binning helps in understanding the last contact day better.
2. We bin this variable into days 1-7, days 8-23, and the remaining days of the month.
3. Although there may not be a direct correlation between the target variable and the last contact day, the data shows that clients contacted in the 1st week of the month convert better than others. This could be combined with other factors to see why the conversion is better in the 1st week.

Duration:
1. Duration, again, may not directly impact the subscription conversion. It is a continuous variable and the data is extremely right skewed.
2. The call duration follows a similar trend for clients who subscribed and those who did not. The majority of clients who subscribed have a last contact duration of less than 6 minutes. Since the data is extremely skewed, we need a transformation to reduce the skew, so we apply the cube root transformation to this variable.
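
The effect of the cube root on skew can be checked numerically with `Series.skew()`. The sketch below uses synthetic log-normal values as a stand-in for call durations (an assumption, not the real column) and keeps the result in a separate variable, so that re-running the cell does not transform already-transformed values.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for the duration column.
rng = np.random.default_rng(0)
duration = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000))

# Cube root compresses the long right tail while keeping the ordering of values.
duration_cbrt = np.cbrt(duration)

print("skew before:", duration.skew())
print("skew after :", duration_cbrt.skew())
```

A skew closer to zero after the transform indicates a more symmetric distribution; it does not by itself make the data normal.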

Campaign : 
1. The data is extremely right skewed here, and a good way to see the distribution is to bin the data and look at the patterns.
2. It's clear that most customers have been contacted less than 10 times. The customers who have subscribed to the term deposit have been contacted less than 3 times.

Previous:
1. As with most of the other variables, the data in previous is clearly skewed and needs transformation.
2. The previous variable shows that a majority of clients were contacted less than 5 times before this campaign. Those that subscribed had also been contacted less than 5 times.

pdays:
1. This variable is skewed too, and we bin the days to see if there’s any pattern in the data.
2. A very large share of the customer base has either never been contacted or not been contacted in the last 900 days. Approximately 50% of the customers that subscribed have not been contacted in the past 100 days, and 19% of the customers that subscribed haven’t been contacted for more than 100 days. This highlights the potential in marketing to clients who haven’t subscribed in order to convert them.



 



#### Countplots for the categorical variables 

In [None]:
def catplot(variable):
    fig=plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    plt.xticks(rotation=90)
    sns.countplot(x=variable,data=data)
    plt.subplot(1,2,2)
    sns.countplot(x=variable, hue='Target', data=data)
    plt.xticks(rotation=90)

#### Job

In [None]:
catplot('job')

In [None]:
sns.countplot(x='job',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['management','technician','blue-collar','admin.']

for i in cats:
    a=data[(data['job']==i)&(data['Target']=='yes')].shape[0]
    b=data[data['job']==i].shape[0]
    print('Proportions for job category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Marital

In [None]:
catplot('marital')

In [None]:
sns.countplot(x='marital',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['married','single','divorced']

for i in cats:
    a=data[(data['marital']==i)&(data['Target']=='yes')].shape[0]
    b=data[data['marital']==i].shape[0]
    print('Proportions for marital category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Education

In [None]:
catplot('education')

In [None]:
sns.countplot(x='education',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['secondary','tertiary','unknown','primary']

for i in cats:
    a=data[(data['education']==i)&(data['Target']=='yes')].shape[0]
    b=data[data['education']==i].shape[0]
    print('Proportions for education category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Default

In [None]:
catplot('default')

In [None]:
sns.countplot(x='default',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['yes','no']

for i in cats:
    a=data[(data['default']==i) & (data['Target']=='yes')].shape[0]
    b=data[data['default']==i].shape[0]
    print('Subscriptions % for default category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Housing

In [None]:
catplot('housing')

In [None]:
sns.countplot(x='housing',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['yes','no']

for i in cats:
    a=data[(data['housing']==i) & (data['Target']=='yes')].shape[0]
    b=data[data['housing']==i].shape[0]
    print('Subscriptions % for housing category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Loan

In [None]:
catplot('loan')

In [None]:
sns.countplot(x='loan',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['yes','no']

for i in cats:
    a=data[(data['loan']==i) & (data['Target']=='yes')].shape[0]
    b=data[data['loan']==i].shape[0]
    print('Subscriptions % for loan category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Contact

In [None]:
catplot('contact')

In [None]:
sns.countplot(x='contact',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['unknown','cellular','telephone']

for i in cats:
    a=data[(data['contact']==i) & (data['Target']=='yes')].shape[0]
    b=data[data['contact']==i].shape[0]
    print('Subscriptions % for contact category {} is: {}%'.format(i,np.round(a/b*100,2)))

#### Month

In [None]:
catplot('month')

In [None]:
sns.countplot(x='month',hue='Target',data=data[data['Target']=='yes']);

In [None]:
def month_cat(x):
    if (x=='apr')|(x=='may')|(x=='jun'):
        return 'Q1'
    else:
        if (x=='jul')|(x=='aug')|(x=='sep'):
            return 'Q2'
        else:
            if (x=='oct')|(x=='nov')|(x=='dec'):
                return 'Q3'
            else:
                if (x=='jan')|(x=='feb')|(x=='mar'):
                    return 'Q4'
    

In [None]:
data['month_category']=data['month'].apply(month_cat)

In [None]:
sns.countplot(x='month_category',hue='Target',data=data)

#### poutcome

In [None]:
catplot('poutcome')

In [None]:
sns.countplot(x='poutcome',hue='Target',data=data[data['Target']=='yes'])
plt.xticks(rotation=90);

In [None]:
cats=['unknown','other','failure','success']

for i in cats:
    a=data[(data['poutcome']==i) & (data['Target']=='yes')].shape[0]
    b=data[data['poutcome']==i].shape[0]
    print('Subscriptions % for poutcome category : {} is {}%'.format(i,np.round(a/b*100,2)))

#### Distribution of the target variable

In [None]:
data['Target'].value_counts()

11.7% of the customers in the data subscribed to the term deposit. 88.3% of the customers had not subscribed to the term deposit.
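
Percentages like these can be read directly from `value_counts(normalize=True)` rather than worked out from the raw counts. A minimal sketch on an invented target column (the values below are not the bank data):

```python
import pandas as pd

# Invented target column: 7 non-subscribers and 1 subscriber.
target = pd.Series(["no"] * 7 + ["yes"] * 1, name="Target")

# normalize=True returns fractions of the total; scale to percent.
share = target.value_counts(normalize=True).mul(100).round(1)
print(share)
```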

Changes to other attributes with respect to the dependent variable

In [None]:
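# numeric_only=True skips text columns, which recent pandas versions refuse to average
data.groupby(['Target']).mean(numeric_only=True)

In [None]:
data.groupby(['Target']).median(numeric_only=True)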

#### EDA Part 2 - categorical variables & Target variable distribution

`Categorical variable distribution`

Job

1. In the overall numbers, blue-collar, management, technician, admin and services, in that order, are the most common jobs.
2. Among those that subscribed to the term deposit, the most common job categories are management, technician, blue-collar and admin, in that order. The subscription rate within each of these job categories is management 13.76%, technician 11.06%, admin 12.2% and blue-collar 7.27%.

Marital
1. Most subscribed customers are married, followed by the single and divorced categories.
2. Of the married customers, 10.12% have subscribed to the term deposit, against 14.95% of the single customers. Divorced customers make up the smallest group of subscribers.

Education
1. Secondary education is the most common level among the subscribed customers, followed by tertiary and primary.
2. The highest subscription rate is for the tertiary education category at 15%, followed by the unknown category at 13.5%.

Default
1.8% of the overall population has credit in default. Customers without credit in default have a conversion rate of 11.8%.

Housing
Approximately 55.6% have a housing loan. 16.7% of those that do not have a housing loan have subscribed to the term deposit.

Loan

84% of the customers do not have a personal loan. 12.6% of the customers without a personal loan subscribed to the term deposit, against 6.6% of those with a personal loan.

Contact
The most common communication type is cellular phone, followed by unknown and telephone. Those that subscribed were mostly contacted by cell phone.

Month
The most popular subscription months are May, August, July and April, followed by June. The data has been bucketed to make subscriptions easier to understand by quarter. Q1 (Apr, May, Jun) has the highest subscriptions, followed by Q2 (Jul, Aug, Sep).

poutcome
The outcome of the previous campaign is mostly unknown. Of those with the outcome category success, 64.73% subscribed to the term deposit.


`Target variable distribution:`

1. 11.7% of the customers in the data subscribed to the term deposit. 88.3% of the customers had not subscribed to the term deposit.
2. Average age of the client that subscribed to the term deposit is 38.
3. Average account balance is 1804.26 for those that opted for the subscription.
4. An average of 2-3 contacts were made during the campaign for both clients that subscribed and those that did not.
5. An average of 70 days passed after the last contact for the client to subscribe to the term deposit.
6. At least 1 contact was made before the campaign with the clients that subscribed, as opposed to those that did not.
7. There’s a clear imbalance in the distribution of the target variable.

###### Getting the data model ready (Deliverable 2)

In [None]:
data.dtypes

In [None]:
data['balance_category']=data['balance_category'].astype(str)

In [None]:
data['campaign_category']=data['campaign_category'].astype(str)

In [None]:
data['previous_category']=data['previous_category'].astype(str)

In [None]:
data['pdays_category']=data['pdays_category'].astype(str)

In [None]:
data['age_category']=data['age_category'].astype(str)

In [None]:
data['day_category']=data['day_category'].astype(str)

In [None]:
data.dtypes

#### Data Preparation for modeling 

1. In EDA, we created a few categorical variables from continuous variables that were not normally distributed and needed binning to give clearer insight.

2. We check the datatypes of the variables and convert the newly created binned variables to categorical where needed.

3. We drop the old variables, as we may not need some continuous variables after they have been binned.

4. We map the categories ‘yes’ and ‘no’ to 1 and 0 for the variables Target, default, loan and housing.

5. We then split the data into train and test sets and check that the split was done correctly.
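
The yes/no mapping in step 4 can be sketched with a dictionary and `Series.map`. This is an alternative formulation on invented rows, not the notebook's own function; any value missing from the dictionary becomes NaN, so counting NaN values afterwards is a cheap sanity check.

```python
import pandas as pd

# Invented rows; the column names mirror the notebook.
toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "no", "yes"],
    "Target": ["no", "yes", "no"],
})

yes_no = {"yes": 1, "no": 0}
binary_cols = ["default", "housing", "loan", "Target"]

# Apply the same mapping to every binary column.
for col in binary_cols:
    toy[col] = toy[col].map(yes_no)

# Values outside the mapping would show up as NaN here.
unmapped = int(toy[binary_cols].isna().sum().sum())
print(toy)
print("unmapped values:", unmapped)
```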

In [None]:
data2=data.copy()  # work on a copy so the original dataframe is left untouched

In [None]:
data2.head()

In [None]:
data2=data2.drop(['age','balance','day','month','campaign','pdays','previous'],axis=1)

In [None]:
data2.head()

In [None]:
data2.dtypes 

In [None]:
data2.shape

#### Mapping

In [None]:
#target_map={'yes':1,'no':0}

In [None]:
def cat(x):
    if (x=='yes'):
        return 1
    else:
        if(x=='no'):
            return 0

In [None]:
data2['Target']=data2['Target'].apply(cat)

In [None]:
data2['default']=data2['default'].apply(cat)

In [None]:
data2['loan']=data2['loan'].apply(cat)

In [None]:
data2['housing']=data2['housing'].apply(cat)

In [None]:
data2.head()

In [None]:
data2.dtypes

###### Split the data into train and test 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X=pd.get_dummies(data2.drop('Target',axis=1),drop_first=True)
Y=data2['Target']

x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=1)

In [None]:
x_train.shape

In [None]:
x_train.head()

In [None]:
print("{0:0.2f}% data is in training set".format((len(x_train)/len(data2.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(data2.index)) * 100))

In [None]:
print("Original Target True Values    : {0} ({1:0.2f}%)".format(len(data2.loc[data2['Target'] == 1]), (len(data2.loc[data2['Target'] == 1])/len(data2.index)) * 100))
print("Original Target False Values   : {0} ({1:0.2f}%)".format(len(data2.loc[data2['Target'] == 0]), (len(data2.loc[data2['Target'] == 0])/len(data2.index)) * 100))
print("")
print("Training Target True Values    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Target False Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Target True Values        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Target False Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
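
The class proportions printed above depend on how the random split happens to fall. One optional way to keep them equal across the sets is the `stratify` argument of `train_test_split`. The sketch below uses invented arrays with 10% positives; it is not part of the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 200 rows, 20 of them positive (10%).
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 180)

# stratify=y keeps the share of each class the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print("overall:", y.mean(), "train:", y_tr.mean(), "test:", y_te.mean())
```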

#### Metrics to focus on

`Business Insights`

True Positive (observed=1,predicted=1)

Predicted that the customer would subscribe and the customer did.

False Positive (observed=0,predicted=1)

Predicted that the customer would subscribe while the customer did not.

True Negative (observed=0,predicted=0)

Predicted that the customer would not subscribe and the customer did not.

False Negative(observed=1,predicted=0)

Predicted the customer would not subscribe when the customer did.

`From the points above we know that we should be focusing on`

Low False Negatives, as a false negative means a missed opportunity for the bank: a client who would have become a term deposit subscriber is predicted as a non-subscriber.

A high number of True Positives, which would mean the predictions of term deposit subscribers are accurate.

False Positives are relatively harmless in this case, as they would not cause the bank to lose money.
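
The four outcomes above map directly onto the cells of scikit-learn's confusion matrix, and recall, precision and F1 follow from them. A minimal sketch on invented labels:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented labels: 4 actual subscribers, 6 non-subscribers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)       # share of actual subscribers that were found
precision = tp / (tp + fp)    # share of predicted subscribers that were right
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(recall, precision, f1)
```

Recall is the metric that penalises false negatives, which is why it leads the list of metrics for this problem.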

`Metrics of main interest`

Our fundamental assumption here is that we want to identify the customers who convert to term deposit subscribers.

1. Recall  
2. ROC/AUC Score

While focusing on the above 2 metrics, we also want to keep the following at a decent level:

1. Accuracy 
2. F1 Score
3. Precision



### Modelling

To achieve good recall and ROC/AUC scores, we model the data using different models:
1. Logistic Regression 
2. Decision Trees 
3. Random Forest Classifier 
4. Adaboost Classifier 
5. Gradientboost Classifier

#### Logistic Regression

`Logistic Regression- Modeling steps`

1. Build a basic logistic regression with default parameters. Note the accuracy, recall, precision, F1 score and ROC AUC score.
2. Tweak the parameters C and solver to see any improvement in the model performance.
3. Balance the data to improve the performance measures.
4. Compare the performance measures, print the confusion matrix and pick the model that gives us the best recall and ROC AUC scores.
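
Steps 2 to 4 are done by hand in the cells that follow. As an alternative, the same search could be automated with `GridSearchCV` scored on recall. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_classification` with roughly 12% positives, and the grid values are chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for the bank features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Illustrative grid: regularisation strength and class weighting.
param_grid = {"C": [0.01, 0.1, 1], "class_weight": [None, "balanced"]}

# scoring="recall" selects the combination with the best cross-validated recall.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="recall", cv=5
)
search.fit(X_train, y_train)

test_recall = recall_score(y_test, search.predict(X_test))
print(search.best_params_)
print("test recall:", test_recall)
```

Cross-validated selection avoids choosing parameters by looking at the test set, which the manual loops in this notebook do.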

In [None]:
from sklearn import metrics

from sklearn.linear_model import LogisticRegression

## Fitting the model on training set

model=LogisticRegression()
model.fit(x_train,y_train)

## Predicting on test set

y_predict=model.predict(x_test)

*** Model Scores on Train & Test Data set ***

In [None]:
model_score=model.score(x_test,y_test)
print(model_score)

In [None]:
model_score=model.score(x_train,y_train)
print(model_score)

The model scores well on both the train and test datasets. Let us now print the confusion matrix.

###### Confusion Matrix

In [None]:
confusion_matrix=metrics.confusion_matrix(y_test,y_predict)
print('The confusion matrix is printed below')
print('')

print(confusion_matrix)

## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

In [None]:
print(metrics.classification_report(y_test,y_predict))

#### Accuracy,Precision,Recall,ROC_auc_score

In [None]:
from sklearn.metrics import confusion_matrix,recall_score,precision_score,f1_score,roc_auc_score,accuracy_score,roc_curve

In [None]:
## Accuracy

print('Training Accuracy is :',model.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',model.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC AUC score is :',metrics.roc_auc_score(y_test,y_predict))
print('')



#### ROC/AUC curve

In [None]:
roc_auc_score(y_test,y_predict)

In [None]:
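# Use predicted probabilities for both the AUC and the curve so the legend matches the plotted line
y_proba = model.predict_proba(x_test)[:,1]
logit_roc_auc = roc_auc_score(y_test, y_proba)
fpr, tpr, thresholds = roc_curve(y_test, y_proba)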
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Although the model accuracy is high, the recall and ROC/AUC score could definitely be improved. We will therefore try to improve the model by tuning it.
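
A note on how the ROC AUC is computed: `roc_auc_score` gives different numbers depending on whether it receives hard 0/1 predictions or predicted probabilities. With hard predictions the curve has a single threshold, and the score reduces to the average of the true positive rate and the true negative rate, that is, balanced accuracy. The sketch below shows this on synthetic data (an assumption, not the bank data).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank features.
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.88, 0.12], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

labels = clf.predict(X_test)                # hard 0/1 predictions at the 0.5 threshold
scores = clf.predict_proba(X_test)[:, 1]    # probability of the positive class

auc_from_labels = roc_auc_score(y_test, labels)
auc_from_scores = roc_auc_score(y_test, scores)
bal_acc = balanced_accuracy_score(y_test, labels)

print("AUC from hard labels  :", auc_from_labels)
print("balanced accuracy     :", bal_acc)
print("AUC from probabilities:", auc_from_scores)
```

The probability-based value is the one that corresponds to the area under a curve drawn with `roc_curve` on `predict_proba` output.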

#### Improving Model performance 

In [None]:
model.get_params()

Checking the model with different solvers,to see if there's an improvement in the scores 

In [None]:
train_score=[]
test_score=[]
solver=['newton-cg','lbfgs','liblinear','sag', 'saga']

for i in solver:
    model=LogisticRegression(random_state=None,penalty='l2', C = 1,solver=i)
    model.fit(x_train,y_train)
    y_predict=model.predict(x_test)
    train_score.append(round(model.score(x_train,y_train),2))
    test_score.append(round(model.score(x_test,y_test),2))
    
print(solver)
print()
print(train_score)
print()
print(test_score)

We pick the 'saga' solver and proceed with tweaking the C value to see if it improves the recall and precision.

In [None]:
### Model 

model=LogisticRegression(random_state=None,penalty='l2', C = 1,solver='saga')
model.fit(x_train,y_train)
y_predict=model.predict(x_test)


In [None]:
train_score=[]
test_score=[]

C=[0.01,0.1,0.25,0.5,0.75,1]

for i in C :
    model=LogisticRegression(random_state=None,penalty='l2', C = i ,solver='saga')
    model.fit(x_train,y_train)
    y_predict=model.predict(x_test)
    train_score.append(round(model.score(x_train,y_train),3))
    test_score.append(round(model.score(x_test,y_test),3))
  
print(C)
print()
print(train_score)
print()
print(test_score)

#print(metrics.f1_score(y_test,y_predict))

We will stick to C=1 which is the default value as there's no change in the accuracy of the model with other C values.

In [None]:
## Accuracy
model=LogisticRegression(random_state=None,penalty='l2',solver='saga',C=1)
model.fit(x_train,y_train)
y_predict=model.predict(x_test)

print('Training Accuracy is :',model.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',model.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')

##### Confusion Matrix

In [None]:
confusionmatrix=metrics.confusion_matrix(y_test,y_predict)
print('The Confusion Matrix is displayed below :')
print('')

print(confusionmatrix)
print('')


## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

The true positives show a slight improvement. We will therefore keep these parameters and additionally set class_weight='balanced', as our data seems to be highly imbalanced.

##### Treating the imbalance in the data by tweaking the class_weight parameter .
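As a rough illustration of what class_weight='balanced' does, the sketch below reproduces the weights scikit-learn derives from the class frequencies. The toy label vector and its 88/12 split are assumptions made for the example, not figures taken from this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with an assumed 88/12 split between the two classes.
y_toy = np.array([0] * 88 + [1] * 12)

# 'balanced' weights follow n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_toy)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights.round(3))))
print('Matches the manual formula:', np.allclose(weights, manual))
```

The minority class receives the larger weight, so misclassifying a subscriber costs more during training than misclassifying a non-subscriber.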

In [None]:
## Accuracy
model=LogisticRegression(random_state=None,penalty='l2',solver='saga',C=1,class_weight='balanced')
model.fit(x_train,y_train)
y_predict=model.predict(x_test)

print('Training Accuracy is :',model.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',model.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC AUC score is :',metrics.roc_auc_score(y_test,y_predict))
print('')

In [None]:
logreg_acc=model.score(x_test,y_test)
logreg_recall=metrics.recall_score(y_test,y_predict)
logreg_f1score=metrics.f1_score(y_test,y_predict)
logreg_ROCAUC=metrics.roc_auc_score(y_test,y_predict)
logreg_precision=metrics.precision_score(y_test,y_predict)

print(logreg_acc)
print(logreg_recall)
print(logreg_f1score)
print(logreg_ROCAUC)
print(logreg_precision)

In [None]:
confusionmatrix=metrics.confusion_matrix(y_test,y_predict)
print('The Confusion Matrix is displayed below :')
print('')

print(confusionmatrix)
print('')


## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

#### ROC/AUC Curve

In [None]:
# Use predicted probabilities so the reported area matches the plotted curve
logit_roc_auc = roc_auc_score(y_test, model.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Class weight balancing has significantly improved the Recall, F1 score and ROC AUC score. We will therefore retain this as our final model for Logistic Regression.
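A note on the ROC AUC figures: `roc_auc_score` can be fed either hard class labels or predicted probabilities, and the two generally give different numbers. The sketch below shows the difference on synthetic data; the dataset, the variable names and the `max_iter` setting are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data stands in for the bank dataset (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.88, 0.12], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=1, stratify=y_demo)

demo_model = LogisticRegression(max_iter=1000, class_weight='balanced').fit(Xtr, ytr)

# Hard labels describe a single operating point (the 0.5 threshold).
auc_from_labels = roc_auc_score(yte, demo_model.predict(Xte))
# Probabilities describe the ranking across all thresholds.
auc_from_scores = roc_auc_score(yte, demo_model.predict_proba(Xte)[:, 1])

print('AUC from hard labels  :', round(auc_from_labels, 3))
print('AUC from probabilities:', round(auc_from_scores, 3))
```

The probability-based value is the area under the curve drawn by `roc_curve`, so it is the one that belongs in a plot legend.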

###### Decision trees

`Decision Trees - Modeling Steps`

1. Build the decision tree with default parameters.
2. Since decision trees are prone to overfitting, check whether the model overfits the data.
3. If the model overfits, prune the tree by limiting its depth.
4. Check the train and test accuracy scores to see if the model is overfitting.
5. Print the metrics - Recall, ROC/AUC score, Accuracy, F1 Score, Precision.
6. The data is highly imbalanced, hence we use class weight balancing to achieve optimal results.
7. Print the metrics - Recall, ROC/AUC score, Accuracy, F1 Score, Precision. Print the confusion matrix. Print the ROC/AUC curve.
8. Store the model metrics in a data frame
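The metric-printing and storing steps above repeat for every model, so they could be wrapped in a small helper. The sketch below is one possible shape; `summarize_model` is a hypothetical name, and the synthetic data only stands in for the bank dataset.

```python
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def summarize_model(name, fitted_model, x_test, y_test):
    """Return a one-row DataFrame of test metrics (hypothetical helper)."""
    y_pred = fitted_model.predict(x_test)
    return pd.DataFrame({
        'Model': [name],
        'Accuracy': [metrics.accuracy_score(y_test, y_pred)],
        'Recall': [metrics.recall_score(y_test, y_pred)],
        'Precision': [metrics.precision_score(y_test, y_pred)],
        'F1 Score': [metrics.f1_score(y_test, y_pred)],
        'ROC_AUC Score': [metrics.roc_auc_score(y_test, y_pred)],
    })


# Synthetic imbalanced data (assumption) to exercise the helper.
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.88, 0.12], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=1, stratify=y_demo)

tree = DecisionTreeClassifier(max_depth=6, class_weight='balanced', random_state=1).fit(Xtr, ytr)
summary = summarize_model('Decision Tree', tree, Xte, yte)
print(summary)
```

Rows produced this way can be combined with `pd.concat`, which is how the results data frame is built up later in this notebook.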

In [None]:
X=pd.get_dummies(data2.drop('Target',axis=1),drop_first=True)
Y=data2['Target']

x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=1)
x_train.shape,x_test.shape

In [None]:
print("{0:0.2f}% data is in training set".format((len(x_train)/len(data2.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(data2.index)) * 100))

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model_gini=DecisionTreeClassifier(criterion='gini')

In [None]:
model_gini.fit(x_train,y_train)
y_predict=model_gini.predict(x_test)

In [None]:

### Training accuracy

modelgini_score_train=model_gini.score(x_train,y_train)
print('Training Accuracy is:',modelgini_score_train)

### Testing accuracy

modelgini_score=model_gini.score(x_test,y_test)
print('Test Accuracy is :',modelgini_score)

Since decision trees are prone to overfitting, and the training accuracy is clearly much higher than the test accuracy, our next step is to prune the decision tree.
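One way to choose a depth is to sweep `max_depth` and compare train and test accuracy at each value. The sketch below does this on synthetic data; the candidate depths and the data are assumptions, and on the real data `x_train`/`x_test` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data stands in for the bank dataset (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=15,
                                     weights=[0.88, 0.12], random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=1, stratify=y_demo)

depth_scores = {}
for depth in [2, 4, 6, 8, 10, None]:  # None lets the tree grow fully
    tree = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=1)
    tree.fit(Xtr, ytr)
    depth_scores[depth] = (tree.score(Xtr, ytr), tree.score(Xte, yte))

for depth, (train_acc, test_acc) in depth_scores.items():
    print('max_depth =', depth, '| train:', round(train_acc, 3), '| test:', round(test_acc, 3))
```

A widening gap between the two columns as depth grows is the overfitting signal that pruning is meant to remove.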

In [None]:
clf_pruned=DecisionTreeClassifier(criterion='gini',max_depth=6)
clf_pruned.fit(x_train,y_train)
y_predict=clf_pruned.predict(x_test)

In [None]:

### Training accuracy

clf_pruned_score_train=clf_pruned.score(x_train,y_train)
print('Training Accuracy is:',clf_pruned_score_train)

### Testing accuracy

clf_pruned_score_test=clf_pruned.score(x_test,y_test)
print('Test Accuracy is :',clf_pruned_score_test)

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')

1. A max depth of 6 seems to give us good accuracy for both train and test without overfitting the training data.

2. Next, we apply class weight balancing to see how the metrics change.

In [None]:
clf_pruned_bal=DecisionTreeClassifier(criterion='gini',max_depth=6,class_weight='balanced')
clf_pruned_bal.fit(x_train,y_train)
y_predict=clf_pruned_bal.predict(x_test)

In [None]:

### Training accuracy

clf_pruned_bal_score_train=clf_pruned_bal.score(x_train,y_train)
print('Training Accuracy is:',clf_pruned_bal_score_train)

### Testing accuracy

clf_pruned_bal_score_test=clf_pruned_bal.score(x_test,y_test)
print('Test Accuracy is :',clf_pruned_bal_score_test)


In [None]:
## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')

In [None]:
clf_pruned_bal_score_test=clf_pruned_bal.score(x_test,y_test)
clf_pruned_bal_recall=metrics.recall_score(y_test,y_predict)
clf_pruned_bal_f1score=metrics.f1_score(y_test,y_predict)
clf_pruned_bal_ROCAUC=metrics.roc_auc_score(y_test,y_predict)
clf_pruned_bal_precision=metrics.precision_score(y_test,y_predict)

print(clf_pruned_bal_score_test)
print(clf_pruned_bal_recall)
print(clf_pruned_bal_f1score)
print(clf_pruned_bal_ROCAUC)
print(clf_pruned_bal_precision)

##### Confusion Matrix

In [None]:
confusionmatrix=metrics.confusion_matrix(y_test,y_predict)
print('The Confusion Matrix is displayed below :')
print('')

print(confusionmatrix)
print('')


## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show()

In [None]:
# Use predicted probabilities so the reported area matches the plotted curve
dtree_roc_auc = roc_auc_score(y_test, clf_pruned_bal.predict_proba(x_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, clf_pruned_bal.predict_proba(x_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Decision Tree (area = %0.2f)' % dtree_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('DTree_ROC')  # separate file name so the Logistic Regression plot is not overwritten
plt.show()

In [None]:
# Visualize model performance with yellowbrick library

from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ROCAUC
viz = ClassificationReport(DecisionTreeClassifier(criterion='gini',max_depth=6,class_weight='balanced'))
viz.fit(x_train, y_train)
viz.score(x_test, y_test)
viz.show()



#### Visualizing the decision tree

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from IPython.display import Image  
import pydotplus
import graphviz

In [None]:
dot_data = StringIO()
export_graphviz(clf_pruned_bal, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('bankdata_pruned.png')
Image(graph.create_png())

Storing the Model output in a dataframe

In [None]:
results_DF=pd.DataFrame({'Model':['Decision Tree'],'Accuracy':clf_pruned_bal_score_test,'Recall':clf_pruned_bal_recall,'Precision':clf_pruned_bal_precision,'F1 Score':clf_pruned_bal_f1score,'ROC_AUC Score':clf_pruned_bal_ROCAUC})

In [None]:
results_DF

`Inferences`

We see from the above that the Precision and Accuracy have dropped, while the Recall, F1 score and ROC AUC score have significantly improved after class weight balancing.

###### Random Forest Model

`Random Forest Model - Modeling Steps`

1. Build the Random forest model with default parameters
2. Check if the model has over fitted and needs to be tuned to achieve optimal results 
3. Adjust the hyperparameters, in this case max depth, to see if this changes the results.
4. Print the metrics - Recall, ROC/AUC score, Accuracy, F1 Score, Precision.
5. Since the data is highly imbalanced, try class weight balancing on the model to see if this improves the metrics of the model.
6. If it improves the accuracy, print the metrics again.
7. Adjust other hyperparameters to see if there's improvement in the model results. If yes, print the model results.
8. Store the model results into the data frame by concatenating it.

In [None]:
X=pd.get_dummies(data2.drop('Target',axis=1),drop_first=True)
Y=data2['Target']

x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=1)
x_train.shape,x_test.shape

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfcl=RandomForestClassifier()

## Fit the model on the training set 
rfcl_model=rfcl.fit(x_train, y_train)

## Predicting on test set
y_predict=rfcl.predict(x_test)

In [None]:
rfcl.score(x_train,y_train)
print(rfcl.score(x_train,y_train))

In [None]:
rfcl.score(x_test,y_test)
print(rfcl.score(x_test,y_test))

The model seems to be overfitting the training data and requires some tuning.

In [None]:
rfcl.get_params()

In [None]:
rfcl_dep=RandomForestClassifier(max_depth=10)

## Fit the model on the training set 
rfcl_model_1=rfcl_dep.fit(x_train, y_train)

## Predicting on test set
y_predict=rfcl_dep.predict(x_test)

In [None]:
rfcl_dep.score(x_train,y_train)
print(rfcl_dep.score(x_train,y_train))

In [None]:
rfcl_dep.score(x_test,y_test)
print(rfcl_dep.score(x_test,y_test))

In [None]:
## Accuracy

print('Training Accuracy is :',rfcl_dep.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',rfcl_dep.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC AUC score is :',metrics.roc_auc_score(y_test,y_predict))
print('')



In [None]:
rfcl_bal=RandomForestClassifier(max_depth=10,class_weight='balanced')

## Fit the model on the training set 
rfcl_model_2=rfcl_bal.fit(x_train, y_train)

## Predicting on test set
y_predict=rfcl_bal.predict(x_test)

In [None]:
## Accuracy

print('Training Accuracy is :',rfcl_bal.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',rfcl_bal.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC AUC score is :',metrics.roc_auc_score(y_test,y_predict))
print('')



In [None]:
rfcl_est=RandomForestClassifier(max_depth=10,class_weight='balanced',n_estimators=1000,min_samples_leaf=6)

## Fit the model on the training set 
rfcl_model_3=rfcl_est.fit(x_train, y_train)

## Predicting on test set
y_predict=rfcl_est.predict(x_test)

In [None]:
## Accuracy

print('Training Accuracy is :',rfcl_est.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',rfcl_est.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC AUC score is :',metrics.roc_auc_score(y_test,y_predict))
print('')



In [None]:
confusion_matrix=metrics.confusion_matrix(y_test,y_predict)
print('The confusion matrix is printed below')
print('')

print(confusion_matrix)

## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show() 

In [None]:
### Tuning min_samples_leaf to see if it improves the model performance

In [None]:
train_score=[]
test_score=[]
recall=[]
f1score=[]
precision=[]
min_samples_leaf=[1,2,3,4,5,6]

for i in min_samples_leaf:
    model=RandomForestClassifier(max_depth=10,class_weight='balanced',n_estimators=1000,min_samples_leaf=i)
    model.fit(x_train,y_train)
    y_predict=model.predict(x_test)
    train_score.append(round(model.score(x_train,y_train),2))
    test_score.append(round(model.score(x_test,y_test),2))
    recall.append(round(metrics.recall_score(y_test,y_predict),2))
    f1score.append(round(metrics.f1_score(y_test,y_predict),2))
    precision.append(round(metrics.precision_score(y_test,y_predict),2))
    
print('min_samples_leaf')
print(min_samples_leaf)
print('Train Accuracy')
print(train_score)
print('Test Accuracy')
print(test_score)
print('Recall')
print(recall)
print('f1score')
print(f1score)
print('Precision')
print(precision)

We select min_samples_leaf=6, as its recall is much better than that of the other settings.
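The same comparison can be run with cross-validation on the training data, which avoids selecting a hyperparameter by looking at the test set. The sketch below uses `GridSearchCV` scored on recall; the synthetic data, the reduced `n_estimators` and the grid values are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic imbalanced dataset (assumption); use x_train, y_train on the real data.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.88, 0.12], random_state=1)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, max_depth=10,
                           class_weight='balanced', random_state=1),
    param_grid={'min_samples_leaf': [1, 2, 4, 6]},
    scoring='recall',  # the metric this notebook cares most about
    cv=3,
)
search.fit(X_demo, y_demo)

print('Best setting         :', search.best_params_)
print('Cross-validated recall:', round(search.best_score_, 3))
```

Fixing `random_state` on the forest also makes the comparison between settings repeatable from run to run.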

In [None]:
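# Recompute predictions from rfcl_est; the tuning loop above overwrote y_predict
y_predict=rfcl_est.predict(x_test)
rfcl_acc=rfcl_est.score(x_test,y_test)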
rfcl_recall=metrics.recall_score(y_test,y_predict)
rfcl_f1score=metrics.f1_score(y_test,y_predict)
rfcl_ROCAUC=metrics.roc_auc_score(y_test,y_predict)
rfcl_precision=metrics.precision_score(y_test,y_predict)

print(rfcl_acc)
print(rfcl_recall)
print(rfcl_f1score)
print(rfcl_ROCAUC)
print(rfcl_precision)

In [None]:
tempResultsDf = pd.DataFrame({'Model':['Random Forest'],'Accuracy':rfcl_acc,'Recall':rfcl_recall,'Precision':rfcl_precision,'F1 Score':rfcl_f1score,'ROC_AUC Score':rfcl_ROCAUC})
results_DF=pd.concat([results_DF,tempResultsDf])
results_DF


###### Adaboost Ensemble algorithm

`AdaBoost Ensemble algorithm - Modeling Steps` 

1. Build the AdaBoost classifier with preset parameters.
2. Check if the model overfits.
3. If it doesn't, print the metrics.
4. Visualise and print the confusion matrix.
5. Add the AdaBoost model results to the results data frame by concatenation.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
abcl=AdaBoostClassifier(n_estimators=100,random_state=22)
abcl.fit(x_train,y_train)

In [None]:
ab_predict=abcl.predict(x_test)

In [None]:
## Accuracy

print('Training Accuracy is :',abcl.score(x_train,y_train))
print('')
     
print('Testing Accuracy is:',abcl.score(x_test,y_test))
print('')

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,ab_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,ab_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,ab_predict))
print('')

##ROC Auc score 
print('The ROC AUC score is :',metrics.roc_auc_score(y_test,ab_predict))
print('')



In [None]:
abcl_acc=abcl.score(x_test,y_test)
abcl_recall=metrics.recall_score(y_test,ab_predict)
abcl_f1score=metrics.f1_score(y_test,ab_predict)
abcl_ROCAUC=metrics.roc_auc_score(y_test,ab_predict)
abcl_precision=metrics.precision_score(y_test,ab_predict)

print(abcl_acc)
print(abcl_recall)
print(abcl_f1score)
print(abcl_ROCAUC)
print(abcl_precision)

In [None]:
# Visualize model performance with yellowbrick library
viz = ClassificationReport(AdaBoostClassifier(n_estimators=100,random_state=22))
viz.fit(x_train, y_train)
viz.score(x_test, y_test)
viz.show()


In [None]:
# Confusion Matrix

In [None]:
confusion_matrix=metrics.confusion_matrix(y_test,ab_predict)
print('The confusion matrix is printed below')
print('')

print(confusion_matrix)

## Confusion Matrix

cm=metrics.confusion_matrix(y_test,ab_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show() 

In [None]:
print(metrics.classification_report(y_test,ab_predict))

In [None]:
roc_auc_score(y_test,ab_predict)

In [None]:
tempResultsDf = pd.DataFrame({'Model':['AdaBoost Classifier'],'Accuracy':abcl_acc,'Recall':abcl_recall,'Precision':abcl_precision,'F1 Score':abcl_f1score,'ROC_AUC Score':abcl_ROCAUC})

results_DF=pd.concat([results_DF,tempResultsDf])

results_DF

##### Bagging Classifier 

`Bagging Classifier Algorithm - Modeling Steps` 

1. Build the Bagging classifier with preset parameters.
2. Check if the model overfits.
3. If it doesn't, print the metrics.
4. Visualise and print the confusion matrix.
5. Add the Bagging classifier results to the results data frame by concatenation.

In [None]:
from sklearn.ensemble import BaggingClassifier
bgcl=BaggingClassifier(n_estimators=100, max_samples= .7, bootstrap=True, oob_score=True, random_state=22)
bgcl=bgcl.fit(x_train,y_train)
y_predict=bgcl.predict(x_test)

In [None]:

### Training accuracy

bgcl_score_train=bgcl.score(x_train,y_train)
print('Training Accuracy is:',bgcl_score_train)

### Testing accuracy

bgcl_score_test=bgcl.score(x_test,y_test)
print('Test Accuracy is :',bgcl_score_test)

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')


In [None]:
confusion_matrix=metrics.confusion_matrix(y_test,y_predict)
print('The confusion matrix is printed below')
print('')

print(confusion_matrix)

## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show() 

In [None]:
bgcl_score=bgcl.score(x_test,y_test)
bgcl_recall=metrics.recall_score(y_test,y_predict)
bgcl_f1score=metrics.f1_score(y_test,y_predict)
bgcl_ROCAUC=metrics.roc_auc_score(y_test,y_predict)
bgcl_precision=metrics.precision_score(y_test,y_predict)

print(bgcl_score)
print(bgcl_recall)
print(bgcl_f1score)
print(bgcl_ROCAUC)
print(bgcl_precision)

In [None]:
temporary_DF=pd.DataFrame({'Model':['Bagging Classifier'],'Accuracy':bgcl_score,'Recall':bgcl_recall,'Precision':bgcl_precision,'F1 Score':bgcl_f1score,'ROC_AUC Score':bgcl_ROCAUC})

results_DF=pd.concat([results_DF,temporary_DF])

results_DF

In [None]:
# Visualize model performance with yellowbrick library
viz = ClassificationReport(BaggingClassifier(n_estimators=100, max_samples= .7, bootstrap=True, oob_score=True, random_state=22))
viz.fit(x_train, y_train)
viz.score(x_test, y_test)
viz.show()

#### Gradient Boost Algorithm

`Gradient boost classifier - Modeling steps`

1. Build the Gradient Boost classifier with preset parameters.
2. Check if the model overfits.
3. If it doesn't, print the metrics.
4. Visualise and print the confusion matrix.
5. Add the Gradient Boost classifier results to the results data frame by concatenation.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl=GradientBoostingClassifier(n_estimators = 50, learning_rate = .1, random_state=22)
gbcl=gbcl.fit(x_train,y_train)
y_predict=gbcl.predict(x_test)

In [None]:

### Training accuracy

gbcl_score_train=gbcl.score(x_train,y_train)
print('Training Accuracy is:',gbcl_score_train)

### Testing accuracy

gbcl_score_test=gbcl.score(x_test,y_test)
print('Test Accuracy is :',gbcl_score_test)

## Precision
print('Precision of the model is :',metrics.precision_score(y_test,y_predict))
print('')

##Recall 
print('Recall of the model is:',metrics.recall_score(y_test,y_predict))
print('')

##F1 score 
print('The F1 score is:',metrics.f1_score(y_test,y_predict))
print('')

##ROC Auc score 
print('The ROC auc score is :',metrics.roc_auc_score(y_test,y_predict))
print('')


In [None]:
confusion_matrix=metrics.confusion_matrix(y_test,y_predict)
print('The confusion matrix is printed below')
print('')

print(confusion_matrix)

## Confusion Matrix

cm=metrics.confusion_matrix(y_test,y_predict)
sns.heatmap(cm,annot=True,  fmt='.2f', xticklabels = [0,1] , yticklabels = [0,1])
plt.ylabel('Observed')
plt.xlabel('Predicted')
plt.show() 

The model overfits and needs further hyperparameter tuning. Tuning min_samples_leaf, min_samples_split and max_depth has not changed the accuracy. This classifier also has no class_weight option for class balancing, which may contribute to the difference in the model performance.
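Although `GradientBoostingClassifier` has no `class_weight` parameter, its `fit` method accepts per-row `sample_weight`, which can approximate the same balancing. The sketch below is an untested idea for this dataset rather than something this notebook ran; the synthetic data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data stands in for the bank dataset (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.88, 0.12], random_state=22)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      random_state=1, stratify=y_demo)

# Per-row weights that mimic class_weight='balanced'.
row_weights = compute_sample_weight(class_weight='balanced', y=ytr)

plain = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=22)
plain.fit(Xtr, ytr)

weighted = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=22)
weighted.fit(Xtr, ytr, sample_weight=row_weights)

recall_plain = recall_score(yte, plain.predict(Xte))
recall_weighted = recall_score(yte, weighted.predict(Xte))
print('Recall without weights:', round(recall_plain, 3))
print('Recall with weights   :', round(recall_weighted, 3))
```

Whether the weighted fit helps on the bank data would need to be checked against the same metrics used for the other models.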

Hence we will go ahead and append this model result.

In [None]:
gbcl_score=gbcl.score(x_test,y_test)
gbcl_recall=metrics.recall_score(y_test,y_predict)
gbcl_f1score=metrics.f1_score(y_test,y_predict)
gbcl_ROCAUC=metrics.roc_auc_score(y_test,y_predict)
gbcl_precision=metrics.precision_score(y_test,y_predict)

print(gbcl_score)
print(gbcl_recall)
print(gbcl_f1score)
print(gbcl_ROCAUC)
print(gbcl_precision)

In [None]:
temporary_DF=pd.DataFrame({'Model':['Gradient Boost Algorithm'],'Accuracy':gbcl_score,'Recall':gbcl_recall,'Precision':gbcl_precision,'F1 Score':gbcl_f1score,'ROC_AUC Score':gbcl_ROCAUC})

results_DF=pd.concat([results_DF,temporary_DF])

results_DF

In [None]:
results_DF

#### Appending the Logistic regression model results too

In [None]:
temporary_DF=pd.DataFrame({'Model':['Logistic Regression'],'Accuracy':logreg_acc,'Recall':logreg_recall,'Precision':logreg_precision,'F1 Score':logreg_f1score,'ROC_AUC Score':logreg_ROCAUC})

In [None]:
results_DF=pd.concat([results_DF,temporary_DF])

results_DF

In [None]:
results_Final=results_DF
results_Final


`Business Insights`

True Positive (observed=1,predicted=1)

Predicted that the customer would subscribe and the customer did.

False Positive (observed=0,predicted=1)

Predicted that the customer would subscribe while the customer did not.

True Negative (observed=0,predicted=0)

Predicted that the customer would not subscribe and the customer did not.

False Negative (observed=1,predicted=0)

Predicted the customer would not subscribe when the customer did.

`From the points above we know that we should be focusing on`

Low False Negatives, as a false negative is a missed opportunity: the bank fails to identify a client who would have subscribed to a term deposit.

A high number of True Positives, which means the predictions for term deposit subscribers are accurate.

False Positives are relatively harmless in this case, as they do not mean that the bank loses money.
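The four outcomes can be read straight out of a scikit-learn confusion matrix, whose rows are observed classes and columns are predicted classes. The sketch below uses ten made-up labels purely to show the layout.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up observations and predictions, for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall = tp / (tp + fn)      # share of actual subscribers that were caught
precision = tp / (tp + fp)   # share of predicted subscribers that were right

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('Recall:', recall, '| Precision:', precision)
```

Recall is the metric that falls when false negatives rise, which is why it carries the most weight in the comparison of models.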


`Conclusions` 

1. Random Forest gives the best balance of Recall and ROC AUC score, with decent values for Accuracy, F1 score and Precision. The confusion matrices of the models show that the False Negatives are lowest for the Random Forest and Decision Tree classifiers.

2. Although the Decision Tree gives us a higher recall value, we also take the other metrics into account for the overall assessment.

3. It is important to note that the accuracy and precision are slightly lower than for the other models, but this is the price paid for the higher Recall. We also see a reassuringly high ROC AUC score for Random Forest compared to all other models.


4. Our focus is on low False Negatives and high True Positives. Given the metrics we defined, we therefore conclude that the RandomForest Classifier is the best model.


`N.B`: As per https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.


***Dropping 'Duration' from the dataframe before modeling, as the text in the link recommends, did not improve any of the model metrics; on the contrary, it brought the recall and accuracy scores down drastically. Since the problem statement document is not clear on this point, the analysis was carried out on the complete dataset without dropping duration (only transforming it).***
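For reference, a realistic (non-benchmark) feature matrix would exclude the duration column before encoding. The sketch below shows the step on a three-row toy frame; the column names `duration` and `marital` follow the UCI description and `Target` follows this notebook, but the values are invented.

```python
import pandas as pd

# Toy frame with invented values, for illustration only.
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['single', 'married', 'married'],
    'duration': [120, 0, 640],
    'Target': [0, 0, 1],
})

# Drop the target and the leakage-prone duration column, then one-hot encode.
X_realistic = pd.get_dummies(toy.drop(columns=['Target', 'duration']), drop_first=True)
print(list(X_realistic.columns))
```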