

# Logistic Regression - Project 2


##  Bank Marketing


**Abstract:** 
- The data relates to __direct marketing campaigns__ (phone calls) of a Portuguese banking institution.
- The classification goal is to predict whether the client will subscribe to a __term deposit (variable y)__.
 

### Importing the dataset


In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import profile_report

In [78]:

bank = pd.read_csv('https://raw.githubusercontent.com/KommuriRaju/Machine-Learning-Projects/main/Logistic%20Regression/Bank%20data%20-%20Term%20Deposit.csv')


### Checking the columns present in the dataset

In [79]:
bank.columns

Index(['Unnamed: 0', 'age', 'job', 'marital', 'education', 'default',
       'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration',
       'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')
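
The `'Unnamed: 0'` entry in the column list looks like a saved row index rather than a client attribute. A minimal sketch, assuming that reading is correct, of how `index_col=0` keeps such a column out of the features. The inline CSV text is made up for illustration and is not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics a file saved together with its row index.
csv_text = ",age,job,y\n0,46,blue-collar,no\n1,29,admin.,no\n"

# Default read: the unnamed first column is loaded as 'Unnamed: 0'.
with_index_col = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the row index instead of a feature.
without_index_col = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_col.columns))
print(list(without_index_col.columns))
```

If the column is kept, it ends up in `X` later on and the model treats the row number as a predictor.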

### Checking the shape of the dataset

In [80]:
bank.shape 

(10297, 22)

### Checking the descriptive statistics of the dataset

In [81]:
bank.describe() 

,Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,10297.0,10297.0,10297.0,10297.0,10297.0,10297.0,10297.0,10297.0,10297.0,10297.0,10297.0
mean,5148.0,40.080606,261.388268,2.561134,963.609692,0.171506,0.077256,93.574206,-40.44578,3.621923,5166.850442
std,2972.632195,10.47219,263.722874,2.827084,184.098592,0.496992,1.576732,0.577502,4.622221,1.73417,72.492914
min,0.0,18.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6
25%,2574.0,32.0,103.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1
50%,5148.0,38.0,180.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0
75%,7722.0,47.0,327.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1
max,10296.0,94.0,3643.0,56.0,999.0,6.0,1.4,94.767,-26.9,5.045,5228.1


### Checking the info 

In [82]:
bank.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10297 entries, 0 to 10296
Data columns (total 22 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      10297 non-null  int64  
 1   age             10297 non-null  int64  
 2   job             10297 non-null  object 
 3   marital         10297 non-null  object 
 4   education       10297 non-null  object 
 5   default         10297 non-null  object 
 6   housing         10297 non-null  object 
 7   loan            10297 non-null  object 
 8   contact         10297 non-null  object 
 9   month           10297 non-null  object 
 10  day_of_week     10297 non-null  object 
 11  duration        10297 non-null  int64  
 12  campaign        10297 non-null  int64  
 13  pdays           10297 non-null  int64  
 14  previous        10297 non-null  int64  
 15  poutcome        10297 non-null  object 
 16  emp.var.rate    10297 non-null  float64
 17  cons.price.idx  10297 non-null 

### Let's see the unique values of a couple of columns

In [83]:
def log():
    print('Job have these unique values:',bank['job'].unique())
    print('Marital have these unique values:',bank['marital'].unique())
log()

Job have these unique values: ['blue-collar' 'admin.' 'management' 'technician' 'retired' 'services'
 'entrepreneur' 'self-employed' 'unemployed' 'student' 'housemaid'
 'unknown']
Marital have these unique values: ['married' 'single' 'divorced' 'unknown']


### Checking the Min and Max age.

In [84]:
def log():
    print("The maximum age is: ", bank.age.max()) # print the max age
    print("The minimum age is: ", bank.age.min()) # print the min age
log()

The maximum age is:  94
The minimum age is:  18


### As `bank.info()` shows, there are no missing values in any of the columns.

### Count of Yes and No for the term deposit.

In [85]:
bank.y.value_counts() 

no     9137
yes    1160
Name: y, dtype: int64
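
The counts above show that the two classes are far from balanced. A short sketch that turns them into shares; the numbers are copied from the output above, and on the real data `bank.y.value_counts(normalize=True)` would give the same result.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"no": 9137, "yes": 1160}, name="y")

# Share of each class in the dataset.
shares = counts / counts.sum()
print(shares.round(4))
```

Roughly 11% of clients subscribed, so a model that always predicts "no" would already be right about 89% of the time. That is worth keeping in mind when reading the accuracy scores later.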

### Let's write a user-defined function that uses the interquartile range (IQR = Q3 - Q1) to filter outliers. The lower bound is the column minimum and the upper bound is `q3 + 1.5*iqr`; rows outside these bounds are removed rather than capped. Because both comparisons are strict, rows equal to the minimum are removed as well.


In [86]:
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25) # quantile 1 using quantile(0.25)
    q3 = df_in[col_name].quantile(0.75) # quantile 3
    iqr = q3-q1 #  IQR as difference of Quantile 3 and quantile 1
    lower_bound = df_in[col_name].min() #  the lower bound using the min() function
    upper_bound = q3+1.5*iqr # the upper bound as quantile3 + 1.5*IQR
    print(lower_bound) # Printing the lower and upper bound of the column
    print(upper_bound)
    df_out = df_in.loc[(df_in[col_name] > lower_bound) & (df_in[col_name] < upper_bound)] # Removing the values lying outside min and upper bound range
    return df_out
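
The function above uses strict comparisons, so rows equal to the column minimum are dropped along with the high outliers. A minimal sketch of a variant that keeps them, assuming only the upper tail should be filtered. The name `filter_upper_outliers` and the toy data are hypothetical.

```python
import pandas as pd

def filter_upper_outliers(df_in, col_name):
    """Keep rows whose value is at most Q3 + 1.5 * IQR (inclusive)."""
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    upper_bound = q3 + 1.5 * (q3 - q1)
    # <= keeps boundary values; no lower filter, so the minimum survives.
    return df_in.loc[df_in[col_name] <= upper_bound]

# Toy column with one extreme value (illustrative, not the bank data).
toy = pd.DataFrame({"campaign": [1, 1, 2, 2, 3, 3, 4, 56]})

# Assign the result; calling the function alone leaves `toy` unchanged.
toy_filtered = filter_upper_outliers(toy, "campaign")
print(toy_filtered["campaign"].tolist())
```

The result has to be assigned to a variable for the filtering to take effect in later steps.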

### Using the function created above to remove outliers from the 'age' variable. The result is displayed but not assigned, so `bank` itself is unchanged:

In [87]:
remove_outlier(bank, 'age')

18
69.5


,Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,0,46,blue-collar,married,basic.9y,no,no,yes,telephone,may,tue,329,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,1,29,admin.,single,university.degree,no,no,no,cellular,may,wed,132,2,999,0,nonexistent,-1.8,92.893,-46.2,1.281,5099.1,no
2,2,50,management,married,university.degree,no,yes,yes,cellular,apr,fri,206,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099.1,no
3,3,31,admin.,married,high.school,unknown,yes,no,telephone,may,thu,199,2,999,0,nonexistent,1.1,93.994,-36.4,4.860,5191.0,no
4,4,32,admin.,single,high.school,no,yes,no,cellular,jun,tue,350,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10292,10292,38,admin.,married,university.degree,no,no,no,cellular,nov,thu,126,1,7,4,failure,-3.4,92.649,-30.1,0.714,5017.5,yes
10293,10293,34,self-employed,married,university.degree,no,no,no,cellular,oct,wed,201,1,5,3,failure,-3.4,92.431,-26.9,0.740,5017.5,yes
10294,10294,52,blue-collar,married,professional.course,unknown,yes,no,cellular,nov,wed,442,2,999,1,failure,-0.1,93.200,-42.0,4.120,5195.8,no
10295,10295,35,technician,married,university.degree,no,yes,no,telephone,may,thu,330,3,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no


### Using the function created above to remove outliers from the 'campaign' variable:

In [88]:
remove_outlier(bank, 'campaign')# to remove the outlier from campaign

1
6.0


,Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,0,46,blue-collar,married,basic.9y,no,no,yes,telephone,may,tue,329,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,1,29,admin.,single,university.degree,no,no,no,cellular,may,wed,132,2,999,0,nonexistent,-1.8,92.893,-46.2,1.281,5099.1,no
3,3,31,admin.,married,high.school,unknown,yes,no,telephone,may,thu,199,2,999,0,nonexistent,1.1,93.994,-36.4,4.860,5191.0,no
5,5,33,admin.,single,university.degree,no,yes,yes,cellular,aug,mon,174,4,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,no
8,8,34,blue-collar,married,professional.course,unknown,no,no,cellular,may,wed,65,2,999,1,failure,-1.8,92.893,-46.2,1.281,5099.1,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10289,10289,36,technician,married,professional.course,no,yes,yes,cellular,apr,thu,266,2,2,2,success,-1.8,93.075,-47.1,1.365,5099.1,yes
10290,10290,43,unemployed,married,university.degree,unknown,unknown,unknown,telephone,may,tue,87,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
10291,10291,58,technician,married,high.school,no,no,no,telephone,jun,thu,246,3,999,0,nonexistent,1.4,94.465,-41.8,4.961,5228.1,no
10294,10294,52,blue-collar,married,professional.course,unknown,yes,no,cellular,nov,wed,442,2,999,1,failure,-0.1,93.200,-42.0,4.120,5195.8,no


### Using the function created above to remove outliers from the 'duration' variable:

In [89]:
remove_outlier(bank, 'duration')# to remove the outlier from duration

0
663.0


,Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,0,46,blue-collar,married,basic.9y,no,no,yes,telephone,may,tue,329,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,1,29,admin.,single,university.degree,no,no,no,cellular,may,wed,132,2,999,0,nonexistent,-1.8,92.893,-46.2,1.281,5099.1,no
2,2,50,management,married,university.degree,no,yes,yes,cellular,apr,fri,206,1,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099.1,no
3,3,31,admin.,married,high.school,unknown,yes,no,telephone,may,thu,199,2,999,0,nonexistent,1.1,93.994,-36.4,4.860,5191.0,no
4,4,32,admin.,single,high.school,no,yes,no,cellular,jun,tue,350,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10292,10292,38,admin.,married,university.degree,no,no,no,cellular,nov,thu,126,1,7,4,failure,-3.4,92.649,-30.1,0.714,5017.5,yes
10293,10293,34,self-employed,married,university.degree,no,no,no,cellular,oct,wed,201,1,5,3,failure,-3.4,92.431,-26.9,0.740,5017.5,yes
10294,10294,52,blue-collar,married,professional.course,unknown,yes,no,cellular,nov,wed,442,2,999,1,failure,-0.1,93.200,-42.0,4.120,5195.8,no
10295,10295,35,technician,married,university.degree,no,yes,no,telephone,may,thu,330,3,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no


### Dividing the dataset into two parts: categorical columns and numerical columns.

In [90]:
bank_cat=bank[['job', 'marital','default', 'education', 'loan', 'housing', 'contact', 'month', 'day_of_week', 'poutcome', 'y']]

In [91]:
bank_cont = bank.drop(['job', 'marital','default', 'education', 'loan', 'housing', 'contact', 'month', 'day_of_week', 'poutcome', 'y'], axis=1)

### Label-encode the categorical variables as numerical values.

- Because the dataset has many categorical features, one-hot encoding is not used here: it would add one column per category and could use a lot of memory.


In [92]:
from sklearn.preprocessing import LabelEncoder
def log(bank_cat):
    return bank_cat.apply(LabelEncoder().fit_transform) # fit a LabelEncoder on each column and return the encoded values
bank_cat = log(bank_cat)

In [93]:
bank_cat.head(1)

,job,marital,default,education,loan,housing,contact,month,day_of_week,poutcome,y
0,1,1,0,2,2,0,1,6,3,1,0
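
To read the encoded values it helps to know how the codes are assigned: `LabelEncoder` numbers the distinct labels in sorted order. A small sketch using the four `marital` values printed earlier; the variable names are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Unique 'marital' values as printed earlier in the notebook.
marital_values = ["married", "single", "divorced", "unknown"]

encoder = LabelEncoder()
encoder.fit(marital_values)

# classes_ holds the labels in sorted order; the position is the code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

This matches the first row above, where `married` is encoded as 1. The codes carry no real ordering, yet a linear model such as logistic regression treats them as ordered numbers, which is a trade-off of skipping one-hot encoding.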


### Combining the numerical and categorical dataset.

In [94]:
bank_final= pd.concat([bank_cont, bank_cat], axis = 1)

### Extracting the independent columns to prepare X


In [95]:
X = pd.DataFrame()
def log():
    X = bank_final.drop('y', axis=1) # DataFrame of the independent variables: every column except 'y'
    return X
X = log()

In [96]:
X.head(4)

,Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,job,marital,default,education,loan,housing,contact,month,day_of_week,poutcome
0,0,46,329,3,999,0,1.1,93.994,-36.4,4.857,5191.0,1,1,0,2,2,0,1,6,3,1
1,1,29,132,2,999,0,-1.8,92.893,-46.2,1.281,5099.1,0,2,0,6,0,0,0,6,4,1
2,2,50,206,1,999,0,-1.8,93.075,-47.1,1.405,5099.1,4,1,0,6,2,2,0,0,0,1
3,3,31,199,2,999,0,1.1,93.994,-36.4,4.86,5191.0,0,1,1,3,0,2,1,6,2,1


### Extracting the dependent variable into a Series 'y' for model prediction

In [97]:
y = pd.DataFrame()
def log():
    y = bank_final['y'] # Series containing only the dependent variable
    return y
y = log()

In [98]:
y.head()

0    0
1    0
2    0
3    0
4    1
Name: y, dtype: int32

In [99]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### Splitting X and y into train and test datasets

In [100]:
def log():
    return train_test_split(X, y, test_size=0.25, random_state=1)
X_train, X_test, y_train, y_test = log()
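
With an imbalanced target, a plain random split can leave the train and test sets with slightly different class ratios. A hedged sketch of the `stratify` option of `train_test_split`, shown on toy data rather than the bank dataset. The 12% positive rate only loosely mimics the imbalance seen earlier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 200 rows, 24 positives (12%), illustrative only.
X_toy = pd.DataFrame({"feature": np.arange(200)})
y_toy = pd.Series([1] * 24 + [0] * 176, name="y")

# stratify=y_toy keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```

The split in this notebook does not use `stratify`; whether it would change the results here has not been checked.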

### Checking the shape of X and y of the train dataset.

In [101]:
def log():
    print(X_train.shape)
    print(y_train.shape)
log()

(7722, 21)
(7722,)


### Checking the shape of X and y of test dataset.

In [102]:
def log():
    print(X_test.shape)
    print(y_test.shape)
log()

(2575, 21)
(2575,)


### Instantiate the Logistic Regression model using scikit-learn


In [103]:
from sklearn.linear_model import LogisticRegression
def log():
    logreg = LogisticRegression() # instantiate the logistic regression model with default settings
    return logreg
logreg = log()

### Fit the logistic model on X_train and y_train


In [104]:
def log():
    logreg.fit(X_train,y_train) # fit the model on the training data
log()    

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
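
The warning above suggests two remedies: raising `max_iter` or scaling the data. A minimal sketch of both together, assuming a `StandardScaler` placed in front of the model is acceptable. The data is synthetic, with one feature near 5000 and one near 0 to echo the scale gap between columns such as `nr.employed` and `emp.var.rate`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only).
X_demo = np.column_stack([
    rng.normal(5000.0, 70.0, size=300),
    rng.normal(0.0, 1.5, size=300),
])
y_demo = (X_demo[:, 1] + rng.normal(0.0, 1.0, size=300) > 0).astype(int)

# Scale first, then fit; max_iter is raised above the default as well.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

# Number of solver iterations actually used.
n_iter = model.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Putting the scaler inside a pipeline means it is fitted on the training data only and then reused unchanged at prediction time.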


### Using the model for prediction


In [105]:
y_pred_train = pd.DataFrame()
def log():
    y_pred_train = logreg.predict(X_train) # model predictions on the X_train data
    return y_pred_train
y_pred_train = log()

In [106]:
y_pred_test = pd.DataFrame()
def log():
    y_pred_test = logreg.predict(X_test) # model predictions on the X_test data
    return y_pred_test
y_pred_test = log()

### Model evaluation using accuracy classification score

In [107]:
from sklearn.metrics import accuracy_score
def log():
    accuracy_score(y_test,y_pred_test) # computes the accuracy score, but the value is neither printed nor returned; the next cell displays it
log()

In [108]:
accuracy_score(y_test,y_pred_test)

0.9064077669902912

### Model evaluation using Confusion matrix

In [109]:
from sklearn.metrics import confusion_matrix
# store the result under a new name so the imported confusion_matrix function is not shadowed
cm = pd.DataFrame(confusion_matrix(y_test, y_pred_test))
def log():
    cm.index = ['Actual No_deposit', 'Actual Deposit'] 
    cm.columns = ['Predicted No_Deposit','Predicted Deposit'] 
    print(cm)
log()

                   Predicted No_Deposit  Predicted Deposit
Actual No_deposit                  2236                 51
Actual Deposit                      190                 98


### Accuracy with the prediction threshold set to 0.75


In [110]:
import numpy as np
def log():
    pred1 = np.where(logreg.predict_proba(X_test)[:,1]> 0.75,1,0)
    print('Accuracy score for test data is:', accuracy_score(y_test,pred1)) 
log()

Accuracy score for test data is: 0.8959223300970873


### Accuracy with the prediction threshold set to 0.25

In [111]:
def log():
    pred2 = np.where(logreg.predict_proba(X_test)[:,1]> 0.25,1,0)
    print('Accuracy score for test data is:', accuracy_score(y_test,pred2))
log()

Accuracy score for test data is: 0.8966990291262136
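
The two thresholds give almost the same accuracy, which says little about how the predictions changed. A hedged sketch of a helper that reports precision and recall next to accuracy for each threshold. The function name `threshold_report` and the toy probabilities are hypothetical; on the real data `proba` would be `logreg.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def threshold_report(y_true, proba, thresholds):
    """Return accuracy, precision and recall for each threshold."""
    rows = []
    for t in thresholds:
        # Same rule as in the cells above: positive if probability > threshold.
        pred = np.where(proba > t, 1, 0)
        rows.append({
            "threshold": t,
            "accuracy": accuracy_score(y_true, pred),
            "precision": precision_score(y_true, pred, zero_division=0),
            "recall": recall_score(y_true, pred, zero_division=0),
        })
    return rows

# Toy labels and predicted probabilities (illustrative only).
y_true_toy = np.array([0, 0, 1, 1])
proba_toy = np.array([0.10, 0.40, 0.35, 0.80])

report = threshold_report(y_true_toy, proba_toy, [0.25, 0.75])
for row in report:
    print(row)
```

In the toy case both thresholds reach the same accuracy, while the lower one trades precision for recall and the higher one does the opposite.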
