## Agenda
   
    ♦ Problem Description
    ♦ Data Understanding and Exploration
    ♦ Split the data into Train and Validation sets
    ♦ Model Building - Logistic Regression

## Problem Description

A regional bank, XYZ, with 40,000+ customers would like to expand its business by predicting customer behavior in order to cross-sell products more effectively (e.g., selling term deposits to retail customers). The bank has approached us to assess this and has provided access to its customer campaign data.

The data relates to direct marketing campaigns that were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

Predict whether an existing customer would subscribe to a term deposit.

#### Attribute information:



Input variables:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                   "blue-collar","self-employed","retired","technician","services") 

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric) 

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no") 

##### Related with the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular") 

10 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

11 - duration: last contact duration, in seconds (numeric)

##### Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

##### Output variable (desired target):

16 - y: has the client subscribed to a term deposit? (binary: "yes","no")

 

In [1]:
import pandas as pd
import numpy as np

### Loading the data

In [2]:
df=pd.read_csv("Bank_Data.csv")

### Understanding the data

In [3]:
df.shape

(4521, 16)

In [4]:
df.dtypes

age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

In [5]:
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,may,226,1,-1,0,unknown,no


In [6]:
df.tail()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,month,duration,campaign,pdays,previous,poutcome,y
4516,33,services,married,secondary,no,-333,yes,no,cellular,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,feb,129,4,211,3,other,no
4520,44,entrepreneur,single,tertiary,no,1136,yes,yes,cellular,apr,345,2,249,7,other,no


### Summary statistics

In [7]:
df.describe()

Unnamed: 0,age,balance,duration,campaign,pdays,previous
count,4521.0,4521.0,4521.0,4521.0,4521.0,4521.0
mean,41.170095,1422.657819,263.961292,2.79363,39.766645,0.542579
std,10.576211,3009.638142,259.856633,3.109807,100.121124,1.693562
min,19.0,-3313.0,4.0,1.0,-1.0,0.0
25%,33.0,69.0,104.0,1.0,-1.0,0.0
50%,39.0,444.0,185.0,2.0,-1.0,0.0
75%,49.0,1480.0,329.0,3.0,-1.0,0.0
max,87.0,71188.0,3025.0,50.0,871.0,25.0


In [9]:
df.describe(include='O').T

Unnamed: 0,count,unique,top,freq
job,4521,12,management,969
marital,4521,3,married,2797
education,4521,4,secondary,2306
default,4521,2,no,4445
housing,4521,2,yes,2559
loan,4521,2,no,3830
contact,4521,3,cellular,2896
month,4521,12,may,1398
poutcome,4521,4,unknown,3705
y,4521,2,no,4000


In [10]:
df.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'month', 'duration', 'campaign', 'pdays', 'previous',
       'poutcome', 'y'],
      dtype='object')

In [13]:
df.y.value_counts(normalize=True)

no     0.88476
yes    0.11524
Name: y, dtype: float64

In [14]:
df.y.value_counts(normalize=True)*100 # seeing value counts in percentage

no     88.476001
yes    11.523999
Name: y, dtype: float64

### Recode the levels of the target: yes=1 and no=0


In [15]:
# df['y'] = df['y'].apply(lambda x: 0 if x.strip()=='no' else 1)

df['y']=np.where(df['y']== 'no',0,1)
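
Note that `np.where(df['y'] == 'no', 0, 1)` maps every value other than the exact string `'no'` to 1, including labels with stray whitespace or missing values. A stricter alternative is sketched below on a small hypothetical Series (`raw` and `mapping` are illustrative names, not part of the dataset): it uses an explicit mapping and fails loudly on unexpected labels.

```python
import pandas as pd

# Hypothetical sample standing in for df['y']; note the stray space in ' no'.
raw = pd.Series(["no", "yes", " no", "yes"])

# Explicit mapping: labels outside the dictionary become NaN
# instead of silently turning into 1.
mapping = {"no": 0, "yes": 1}
encoded = raw.str.strip().map(mapping)

# Fail early if any label was not recognised.
if encoded.isna().any():
    raise ValueError("unexpected target labels found")

encoded = encoded.astype(int)
print(encoded.tolist())
```

For this dataset both approaches should give the same result as long as `y` contains only `'yes'` and `'no'`, which the `describe(include='O')` output suggests (2 unique values).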

In [16]:
cat_attr=df.select_dtypes(include ='object').columns 

df[cat_attr]= df[cat_attr].astype('category')

In [19]:
data = pd.get_dummies(columns=cat_attr, data = df, prefix=cat_attr, prefix_sep="_", drop_first=True)
data.head()

Unnamed: 0,age,balance,duration,campaign,pdays,previous,y,job_blue-collar,job_entrepreneur,job_housemaid,...,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,79,1,-1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,33,4789,220,1,339,4,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,35,1350,185,1,330,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,30,1476,199,4,-1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
4,59,0,226,1,-1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
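
With `drop_first=True`, `get_dummies` drops the first level of each categorical column, and that level becomes the reference category. In the output above the `job` dummies start at `job_blue-collar`, so `admin.` is the baseline for `job`. A minimal sketch on a hypothetical three-row frame (`toy` is illustrative only) shows the effect:

```python
import pandas as pd

# Hypothetical frame with one categorical column.
toy = pd.DataFrame({"marital": ["married", "divorced", "single"]})

# One dummy per level.
full = pd.get_dummies(toy, columns=["marital"])

# First level (alphabetically 'divorced') is dropped and acts as the baseline.
reduced = pd.get_dummies(toy, columns=["marital"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear dummy columns once an intercept is included, which matters for the statsmodels fit later on.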


### Splitting the data into train and Validation sets

In [23]:
pd.set_option('display.max_columns', 50)

In [24]:
data

Unnamed: 0,age,balance,duration,campaign,pdays,previous,y,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,job_unemployed,job_unknown,marital_married,marital_single,education_secondary,education_tertiary,education_unknown,default_yes,housing_yes,loan_yes,contact_telephone,contact_unknown,month_aug,month_dec,month_feb,month_jan,month_jul,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_other,poutcome_success,poutcome_unknown
0,30,1787,79,1,-1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
1,33,4789,220,1,339,4,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,35,1350,185,1,330,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,30,1476,199,4,-1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1
4,59,0,226,1,-1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,-333,329,5,-1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
4517,57,-3313,153,1,-1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1
4518,57,295,151,11,-1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4519,28,1137,129,4,211,3,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0


In [26]:
X = data.drop(['y'],axis=1)
y = data['y']

In [29]:
# train test split 
from sklearn.model_selection import train_test_split

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify= y, test_size = 0.3, random_state=124)

print('X shapes (train, test)')
print(X_train.shape)
print(X_test.shape)
print('-------------')
print('y shapes (train, test)')
print(y_train.shape)
print(y_test.shape)

X shapes (train, test)
(3164, 41)
(1357, 41)
-------------
y shapes (train, test)
(3164,)
(1357,)
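
Because the split uses `stratify=y`, the share of positive cases should be nearly identical in both parts. The sketch below checks this with counts that appear elsewhere in this notebook (the overall count from `describe(include='O')`, the train and test counts from the classification report supports); the dictionary `counts` is an illustrative helper, not notebook code.

```python
# (positive cases, total rows) for each subset, taken from the notebook outputs.
counts = {
    "full": (521, 4521),    # 4521 rows, 4000 of them 'no'
    "train": (365, 3164),   # support of class 1 in the train report
    "test": (156, 1357),    # support of class 1 in the test report
}

# Positive rate per subset.
rates = {name: pos / total for name, (pos, total) in counts.items()}

for name, rate in rates.items():
    print(f"{name:5s} positive rate: {rate:.4f}")
```

In the notebook itself, `y_train.mean()` and `y_test.mean()` give the same rates directly, since `y` is coded as 0/1.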


### Scaling the numeric attributes in the train and test data (min-max)

In [34]:
from sklearn.preprocessing import MinMaxScaler

In [36]:
# Fit the scaler on the train data only, then apply the same transform to the test data to avoid data leakage
num_cols = ['age', 'balance', 'duration', 'pdays', 'previous', 'campaign']
scaler = MinMaxScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
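
Min-max scaling computes `(x - min) / (max - min)` per column, with `min` and `max` learned from the train data. The plain-Python sketch below illustrates the idea on hypothetical age values (`fit_min_max` and `transform_min_max` are illustrative helpers, not scikit-learn functions). It also shows that test values outside the train range land outside [0, 1], which is the default behavior of `MinMaxScaler` as well.

```python
def fit_min_max(values):
    """Return the (min, max) learned from the training values."""
    return min(values), max(values)


def transform_min_max(values, lo, hi):
    """Apply (x - min) / (max - min) using previously learned bounds."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages; the bounds come from the training values only.
train_age = [19, 33, 39, 49, 87]
test_age = [18, 40, 90]

lo, hi = fit_min_max(train_age)
train_scaled = transform_min_max(train_age, lo, hi)
test_scaled = transform_min_max(test_age, lo, hi)

print(train_scaled)  # spans exactly 0 to 1
print(test_scaled)   # may fall slightly below 0 or above 1
```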

In [38]:
# Logistic regression via statsmodels (R-style summary output)
import statsmodels.api as sm


In [40]:
# adding constant to both train and test data
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

### Model Building

In [41]:
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.239023
         Iterations 8
                          Results: Logit
Model:               Logit            Pseudo R-squared: 0.332      
Dependent Variable:  y                AIC:              1596.5377  
Date:                2023-03-03 11:24 BIC:              1851.0406  
No. Observations:    3164             Log-Likelihood:   -756.27    
Df Model:            41               LL-Null:          -1131.4    
Df Residuals:        3122             LLR p-value:      3.7969e-131
Converged:           1.0000           Scale:            1.0000     
No. Iterations:      8.0000                                        
-------------------------------------------------------------------
                     Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-------------------------------------------------------------------
const               -1.9699   0.5973 -3.2979 0.0010 -3.1406 -0.7992
age                 -0.0470   0.5894 -0.0
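
The coefficients in a logit summary are on the log-odds scale. As a sketch of how to read them, the `const` estimate from the output above can be converted to odds with `exp` and to a probability with the logistic function. Only the `const` row is used here because the rest of the table is cut off. The resulting probability applies to a row in which every predictor is zero, which, given the preprocessing above, means the train minimum for each scaled numeric column and the reference level for each categorical column.

```python
import math

const_coef = -1.9699  # 'const' row of the summary above

odds = math.exp(const_coef)   # log-odds -> odds
prob = odds / (1.0 + odds)    # odds -> probability (logistic function)

print(f"odds: {odds:.4f}, probability: {prob:.4f}")
```

The same conversion applies to any other coefficient: `exp(coef)` is the multiplicative change in the odds for a one-unit increase in that predictor, holding the others fixed.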

### Logistic Regression

In [42]:
# using sklearn package 
from sklearn.linear_model import LogisticRegression

In [43]:
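# Note: X_train still contains the 'const' column added for statsmodels;
# sklearn fits its own intercept by default, so that column is redundant here.
logistic_model = LogisticRegression()

logistic_model.fit(X_train,y_train)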

LogisticRegression()

### Generating predictions

In [44]:
train_preds = logistic_model.predict(X_train)
train_preds_prob=logistic_model.predict_proba(X_train)[:,1]  
test_preds = logistic_model.predict(X_test)
test_preds_prob=logistic_model.predict_proba(X_test)[:,1]

In [45]:
train_preds

array([0, 0, 0, ..., 0, 0, 0])

In [46]:
logistic_model.coef_

array([[-1.04334838e-03,  5.94262264e-02, -3.74925626e-01,
         8.63894571e+00, -6.99424286e-01, -4.83425369e-02,
         1.02710834e-01, -4.72220479e-01,  7.12089724e-03,
        -1.66935009e-01, -2.22062094e-01,  4.34210400e-01,
        -3.44240795e-01, -3.81876514e-01,  2.80553116e-01,
        -2.69694947e-01, -4.43655707e-01,  2.65157319e-01,
        -5.42116811e-01, -2.09487154e-01, -5.30025036e-02,
         1.47315579e-01, -2.57403869e-01,  9.86795175e-02,
        -2.56431816e-01, -7.80399287e-01, -8.59938795e-03,
        -1.12647836e+00, -2.74468057e-01,  5.91655060e-01,
         1.31640046e-02, -5.87621811e-01, -4.52945124e-01,
         4.19939405e-01,  1.12567177e+00, -4.38195428e-01,
        -7.73490677e-01,  1.06695716e+00,  7.20975508e-01,
         1.71229347e-01,  2.08253799e+00, -3.12909033e-01]])
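
The raw `coef_` array is hard to read without feature names. The sketch below pairs the first seven coefficients from the output above with column names and ranks them by magnitude. The column order is an assumption: `const` first (prepended by `sm.add_constant`), followed by the numeric columns in the order shown in `data`.

```python
# First seven entries of logistic_model.coef_[0], copied from the output above.
coefs = [-1.04334838e-03, 5.94262264e-02, -3.74925626e-01, 8.63894571e+00,
         -6.99424286e-01, -4.83425369e-02, 1.02710834e-01]

# Assumed column order (see the note above).
names = ["const", "age", "balance", "duration", "campaign", "pdays", "previous"]

# Rank by absolute size; the features are min-max scaled, so sizes are roughly comparable.
ranked = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)

for name, coef in ranked:
    print(f"{name:10s} {coef: .4f}")
```

In the notebook, `pd.Series(logistic_model.coef_[0], index=X_train.columns)` builds the full labelled vector for all 42 columns.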

In [48]:
# checking probability for train predictions
train_preds_prob

array([0.05759544, 0.02531042, 0.06054418, ..., 0.01790884, 0.01110849,
       0.06057148])
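
For a binary model, the labels returned by `predict` correspond to cutting the positive-class probability at 0.5. The sketch below reproduces that step on the six probabilities visible in the output above and shows how a lower cut-off would flag more customers. `to_labels` and the 0.06 threshold are hypothetical and chosen only to illustrate the mechanism.

```python
# Visible entries of train_preds_prob, copied from the output above.
probs = [0.05759544, 0.02531042, 0.06054418, 0.01790884, 0.01110849, 0.06057148]


def to_labels(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]


default_labels = to_labels(probs)                 # default 0.5 cut-off
lower_labels = to_labels(probs, threshold=0.06)   # hypothetical lower cut-off

print(default_labels)
print(lower_labels)
```

With imbalanced classes, most predicted probabilities sit well below 0.5, so lowering the threshold is one way to trade precision for recall on the `yes` class.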

### Error metrics 

In [52]:
from sklearn.metrics import accuracy_score,classification_report

In [54]:
train_accuracy_1= accuracy_score(y_train,train_preds)
print(f'train accuracy :{train_accuracy_1}')
test_accuracy_1= accuracy_score(y_test,test_preds)
print(f'test accuracy :{test_accuracy_1}')

train accuracy :0.9020227560050569
test accuracy :0.8975681650700074


In [55]:
# classification report for train data
print(classification_report(y_train,train_preds))

              precision    recall  f1-score   support

           0       0.91      0.98      0.95      2799
           1       0.70      0.27      0.38       365

    accuracy                           0.90      3164
   macro avg       0.80      0.63      0.67      3164
weighted avg       0.89      0.90      0.88      3164



In [56]:
# classification report for test data 
print(classification_report(y_test,test_preds))

              precision    recall  f1-score   support

           0       0.91      0.98      0.94      1201
           1       0.64      0.24      0.35       156

    accuracy                           0.90      1357
   macro avg       0.78      0.61      0.65      1357
weighted avg       0.88      0.90      0.88      1357
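
To see where the numbers in the test report come from, the sketch below recomputes precision, recall, F1, and accuracy for class 1 from confusion-matrix counts. The four counts are not notebook output: they are back-calculated from the rounded report values, the class supports (1201 and 156), and the test accuracy, so treat them as an inference.

```python
# Inferred test-set confusion counts (assumption, see the note above).
tn, fp, fn, tp = 1180, 21, 118, 38

precision = tp / (tp + fp)   # of predicted subscribers, how many really subscribed
recall = tp / (tp + fn)      # of real subscribers, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

In the notebook, `sklearn.metrics.confusion_matrix(y_test, test_preds)` would return the actual counts.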



## END  

- "Accuracy paradox": for imbalanced data, do not choose accuracy as the evaluation metric.
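
A quick way to see the problem is to compare the model with a baseline that always predicts "no". The sketch below uses the class counts implied by the `describe(include='O')` output (4521 rows, 4000 of them `no`) and the test accuracy printed earlier; the variable names are illustrative.

```python
n_no, n_yes = 4000, 521   # class counts in the full dataset

# Baseline: always predict "no".
baseline_accuracy = n_no / (n_no + n_yes)
baseline_recall_yes = 0 / n_yes   # the baseline never finds a subscriber

model_test_accuracy = 0.8975681650700074   # from the accuracy output above

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"model accuracy:    {model_test_accuracy:.4f}")
print(f"baseline recall for 'yes': {baseline_recall_yes:.2f}")
```

The model beats the trivial baseline by only about one percentage point in accuracy, which is why the per-class precision and recall in the classification reports are more informative here.
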
# Pros
* Logistic regression is easy to implement and interpret, and very efficient to train.
* It makes no assumptions about distributions of classes in feature space.
* It not only provides a measure of how relevant a predictor is (coefficient size), but also its direction of association (positive or negative).
* It is very fast at classifying unknown records.
* Its model coefficients can be interpreted as indicators of feature importance.

# Cons
* It can only be used to predict discrete outcomes; the dependent variable of logistic regression is restricted to a discrete set of classes.
* Logistic regression requires little or no multicollinearity between independent variables.
* If the number of observations is smaller than the number of features, logistic regression should not be used, as it may lead to overfitting.