# Problem Description :

* A relatively young bank is growing rapidly in terms of overall customer acquisition. The majority of its customers are liability customers (depositors) with relationships of varying size. The base of asset customers (borrowers) is quite small, and the bank wants to grow it rapidly to bring in more loan business.

* Specifically, it wants to explore ways of converting its liability customers to personal loan customers.

* A campaign the bank ran for liability customers last year showed a healthy conversion rate of over 9%. This has encouraged the Retail Marketing department to devise smarter campaigns with better target marketing.

## Analytics Objectives :

1)	While designing a new campaign, can we model customer behavior in the previous campaign to
	analyze which combination of parameters makes a customer more likely to
	accept a personal loan?

2)	The bank offers several special products and facilities, such as CD and securities accounts,
	online services, and credit cards. Can we spot any associations among these
	to find cross-selling opportunities?

## Data Set Description :

* ID:	Customer ID			
* Age:	Customer's age in completed years			
* Experience:	# of years of professional experience			
* Income:	Annual income of the customer in thousands of Dollars			
* ZIPCode:	Home address ZIP code (not used as a feature)
* Family:	Family size of the customer			
* CCAvg:	Avg. spending on credit cards per month in thousands of Dollars		
* Education:	Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional			
* Mortgage:	Value of house mortgage if any. (thousands of Dollars)			
* **PersonalLoan:	Did this customer accept the personal loan offered in the last campaign?**			
* SecuritiesAccount:	Does the customer have a securities account with the bank?			
* CDAccount:	Does the customer have a certificate of deposit (CD) account with the bank?			
* Online:	Does the customer use internet banking facilities?			
* CreditCard:	Does the customer use a credit card issued by UniversalBank?			

### Note:
* While reading the data set, treat the markers '?' and ',' as missing values (NA).
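
As a minimal sketch of what this note means in practice, the snippet below reads a small in-memory CSV and maps both markers to NaN through `na_values`. The three-column extract is hypothetical and stands in for the real file; only the `na_values` argument mirrors the notebook.

```python
import io

import pandas as pd

# Hypothetical four-row extract; the real file has 14 columns.
raw = io.StringIO(
    "ID,Age,Income\n"
    "1,25,49\n"
    "2,?,34\n"
    "3,39,?\n"
    '4,",",11\n'  # a quoted comma used as a missing-value marker
)

# Both '?' and ',' are converted to NaN while parsing.
sample = pd.read_csv(raw, na_values=["?", ","])

print(sample)
print(sample.isna().sum())
```

Because the markers become NaN at read time, the affected columns are parsed as numeric (`float64`) rather than as strings.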

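### Error Metric

Recall, also known as sensitivity or the true positive rate: the fraction of customers who actually accepted the loan that the model correctly identifies.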
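
To make the metric concrete, here is a small sketch of recall computed from confusion-matrix counts, with precision alongside for contrast. The helper names `recall` and `precision` and the counts are hypothetical, chosen only for illustration.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were predicted positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actual positives: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts: 40 acceptors found, 10 acceptors missed, 5 false alarms.
tp, fn, fp = 40, 10, 5

print("Recall:   ", recall(tp, fn))
print("Precision:", precision(tp, fp))
```

Recall ignores false positives entirely, which is why the notebook also reports precision and accuracy next to it.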

## Experiment :
* Build a BaggingClassifier to predict whether a person will take a personal loan or not

### Loading required libraries

In [1]:
import os
import numpy as np
import pandas as pd


from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.model_selection import train_test_split

### Handle Warnings

In [2]:
import warnings
warnings.filterwarnings('ignore')

###  Read Data & Check the dimensions

In [3]:
bank=pd.read_csv("C:/Users/gsk44/OneDrive/Desktop/Ensemble Bagging/UnionBank.csv",na_values=["?",","])
print("The number of rows in the bank data set =",(bank.shape[0]))
print("The number of columns in the bank data set =",(bank.shape[1]))

The number of rows in the bank data set = 5000
The number of columns in the bank data set = 14


In [4]:
bank.shape

(5000, 14)

### Print column names and check the datatypes of columns

In [5]:
print("The columns in the data set are : \n")
bank.columns

The columns in the data set are : 



Index(['ID', 'Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'PersonalLoan', 'SecuritiesAccount',
       'CDAccount', 'Online', 'CreditCard'],
      dtype='object')

In [6]:
print("The datatypes of the columns are :\n ")
bank.dtypes

The datatypes of the columns are :
 


ID                     int64
Age                  float64
Experience           float64
Income               float64
ZIPCode                int64
Family                 int64
CCAvg                float64
Education              int64
Mortgage               int64
PersonalLoan           int64
SecuritiesAccount      int64
CDAccount              int64
Online                 int64
CreditCard             int64
dtype: object

### Check the top 10 and bottom 10 rows to get a glance at the dataset

In [7]:
bank.head(10)

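| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25.0 | 1.0 | 49.0 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45.0 | 19.0 | 34.0 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39.0 | 15.0 | 11.0 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35.0 | 9.0 | 100.0 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35.0 | 8.0 | 45.0 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 6 | 37.0 | 13.0 | 29.0 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 |
| 6 | 7 | 53.0 | 27.0 | 72.0 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 8 | 50.0 | 24.0 | 22.0 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 9 | 35.0 | 10.0 | 81.0 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 |
| 9 | 10 | NaN | 9.0 | 180.0 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |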


In [8]:
bank.tail(10)

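| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4990 | 4991 | 55.0 | 25.0 | 58.0 | 95023 | 4 | 2.0 | 3 | 219 | 0 | 0 | 0 | 0 | 1 |
| 4991 | 4992 | 51.0 | 25.0 | 92.0 | 91330 | 1 | 1.9 | 2 | 100 | 0 | 0 | 0 | 0 | 1 |
| 4992 | 4993 | 30.0 | 5.0 | 13.0 | 90037 | 4 | 0.5 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4993 | 4994 | 45.0 | 21.0 | 218.0 | 91801 | 2 | 6.67 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4994 | 4995 | 64.0 | 40.0 | 75.0 | 94588 | 3 | 2.0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4995 | 4996 | 29.0 | 3.0 | 40.0 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30.0 | 4.0 | 15.0 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63.0 | 39.0 | 24.0 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65.0 | 40.0 | 49.0 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28.0 | 4.0 | 83.0 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |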


### Check the summary of the dataframe

In [9]:
bank.describe(include='all')

| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 4998.0 | 4998.0 | 4987.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 2500.5 | 45.336335 | 20.108043 | 73.807098 | 93152.503 | 2.3964 | 1.937938 | 1.881 | 56.4988 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 1443.520003 | 11.460241 | 11.468603 | 46.037325 | 2121.852197 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 1.0 | 23.0 | -3.0 | 8.0 | 9307.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1250.75 | 35.0 | 10.0 | 39.0 | 91911.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2500.5 | 45.0 | 20.0 | 64.0 | 93437.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 3750.25 | 55.0 | 30.0 | 98.0 | 94608.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 5000.0 | 67.0 | 43.0 | 224.0 | 96651.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


### Target Distribution with counts & percentage

In [10]:
print(bank["PersonalLoan"].value_counts())

0    4520
1     480
Name: PersonalLoan, dtype: int64


In [11]:
print(bank["PersonalLoan"].value_counts(normalize=True)*100)

0    90.4
1     9.6
Name: PersonalLoan, dtype: float64
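
With roughly 90% of customers in class 0, accuracy alone can look good for a useless model. The sketch below uses the class counts printed above (4520 and 480) and a hypothetical "always predict 0" baseline to show why recall is the more informative metric here.

```python
# Class counts taken from the value_counts() output above.
n_neg, n_pos = 4520, 480

y_true = [0] * n_neg + [1] * n_pos
y_pred = [0] * len(y_true)  # hypothetical baseline: nobody accepts the loan

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

baseline_accuracy = correct / len(y_true)
baseline_recall = true_pos / n_pos

print("Baseline accuracy:", baseline_accuracy)
print("Baseline recall:  ", baseline_recall)
```

The baseline reaches the majority-class share in accuracy while finding none of the loan acceptors, so any model should be judged against it.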


### Check the number of unique values for attributes
#### Check the unique values under ZIP code

In [12]:
print("The number of Unique ZIP Codes in the bank data set is",bank['ZIPCode'].nunique())
print("\n")
print(bank['ZIPCode'].value_counts())

The number of Unique ZIP Codes in the bank data set is 467


94720    169
94305    127
95616    116
90095     71
93106     57
        ... 
96145      1
94087      1
91024      1
9307       1
94598      1
Name: ZIPCode, Length: 467, dtype: int64


#### Check unique values of  family attribute

In [13]:
print("The number of customers for each family size in the bank data set:\n")
print(bank['Family'].value_counts())

The number of customers for each family size in the bank data set:

1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64


#### Check unique values of Education attribute

In [14]:
print("The number of customers for each Education level in the bank data set:\n")
print(bank['Education'].value_counts())

The number of customers for each Education level in the bank data set:

1    2096
3    1501
2    1403
Name: Education, dtype: int64


### TypeCasting of attributes

In [15]:
cat_cols = ['Education', 'CDAccount', 'Online','CreditCard',
            'SecuritiesAccount','Family','ZIPCode','PersonalLoan']
bank[cat_cols]=bank[cat_cols].astype('category')

In [16]:
bank.dtypes

ID                      int64
Age                   float64
Experience            float64
Income                float64
ZIPCode              category
Family               category
CCAvg                 float64
Education            category
Mortgage                int64
PersonalLoan         category
SecuritiesAccount    category
CDAccount            category
Online               category
CreditCard           category
dtype: object

### Remove the unnecessary columns

In [17]:
bank=bank.drop(["ID","ZIPCode"],axis=1)

In [18]:
bank.head()

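| | Age | Experience | Income | Family | CCAvg | Education | Mortgage | PersonalLoan | SecuritiesAccount | CDAccount | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | 1.0 | 49.0 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 45.0 | 19.0 | 34.0 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 39.0 | 15.0 | 11.0 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35.0 | 9.0 | 100.0 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35.0 | 8.0 | 45.0 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |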


###  Missing values 

In [19]:
bank.isnull().sum()

Age                   2
Experience            2
Income               13
Family                0
CCAvg                 0
Education             0
Mortgage              0
PersonalLoan          0
SecuritiesAccount     0
CDAccount             0
Online                0
CreditCard            0
dtype: int64

In [20]:
bank.isna().sum()

Age                   2
Experience            2
Income               13
Family                0
CCAvg                 0
Education             0
Mortgage              0
PersonalLoan          0
SecuritiesAccount     0
CDAccount             0
Online                0
CreditCard            0
dtype: int64

### Split the data into train and test

In [21]:
y=bank["PersonalLoan"]
X=bank.drop('PersonalLoan', axis=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20,random_state=123,stratify=y)  

In [22]:
print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)

(4000, 11)
(1000, 11)
(4000,)
(1000,)
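
The `stratify=y` argument is what keeps the 9.6% positive rate the same in both partitions. As a small check, assuming only the class counts reported earlier (the single placeholder feature is hypothetical), the sketch below repeats the split and compares the positive share on each side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts; the feature is a placeholder.
y_demo = np.array([0] * 4520 + [1] * 480)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123, stratify=y_demo
)

# With stratification both partitions keep the original class ratio.
print("Train positives:", y_tr.sum(), "share:", y_tr.mean())
print("Val positives:  ", y_va.sum(), "share:", y_va.mean())
```

Without `stratify`, the number of positives in the validation set would vary from split to split, which makes recall estimates on only 1000 rows noisier.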


In [23]:
print(type(X_train))
print(type(y_train))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


#### Split the attributes into numerical and categorical types

In [24]:
num_attr=X_train.select_dtypes(['int64','float64']).columns
num_attr

Index(['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage'], dtype='object')

In [25]:
cat_attr = X_train.select_dtypes('category').columns
cat_attr

Index(['Family', 'Education', 'SecuritiesAccount', 'CDAccount', 'Online',
       'CreditCard'],
      dtype='object')

### Imputation

In [26]:
X_train.isnull().sum()

Age                   2
Experience            2
Income               10
Family                0
CCAvg                 0
Education             0
Mortgage              0
SecuritiesAccount     0
CDAccount             0
Online                0
CreditCard            0
dtype: int64

In [27]:
X_val.isnull().sum()

Age                  0
Experience           0
Income               3
Family               0
CCAvg                0
Education            0
Mortgage             0
SecuritiesAccount    0
CDAccount            0
Online               0
CreditCard           0
dtype: int64

#### Imputing missing values with median

In [28]:
imputer = SimpleImputer(strategy='median')
imputer = imputer.fit(X_train[num_attr])

X_train[num_attr] = imputer.transform(X_train[num_attr])
X_val[num_attr] = imputer.transform(X_val[num_attr])
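
The imputer is fitted on the training data only, so the validation set is filled with medians learned from the training set rather than its own. A minimal sketch with a hypothetical one-column array shows where those learned values live (`statistics_`) and how they are reused.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single numeric column with gaps.
train_col = np.array([[49.0], [np.nan], [11.0], [100.0]])
val_col = np.array([[np.nan], [45.0]])

imp = SimpleImputer(strategy="median").fit(train_col)

# Median of the observed training values (11, 49, 100).
print("Learned median:", imp.statistics_)

# The validation gap is filled with the training median, not a validation one.
val_filled = imp.transform(val_col)
print(val_filled)
```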

In [29]:
X_train.isnull().sum()

Age                  0
Experience           0
Income               0
Family               0
CCAvg                0
Education            0
Mortgage             0
SecuritiesAccount    0
CDAccount            0
Online               0
CreditCard           0
dtype: int64

In [30]:
X_val.isnull().sum()

Age                  0
Experience           0
Income               0
Family               0
CCAvg                0
Education            0
Mortgage             0
SecuritiesAccount    0
CDAccount            0
Online               0
CreditCard           0
dtype: int64

### Standardize the data 


In [31]:
scaler = StandardScaler()
scaler.fit(X_train[num_attr])

StandardScaler()

In [32]:
num_attr

Index(['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage'], dtype='object')

In [33]:
scaler.mean_

array([45.30675  , 20.08275  , 74.03225  ,  1.9490125, 56.8095   ])

In [34]:
scaler.var_

array([1.29518154e+02, 1.29723402e+02, 2.15864758e+03, 3.11726960e+00,
       1.03160472e+04])

In [35]:
X_train_num = pd.DataFrame(scaler.transform(X_train[num_attr]), columns=num_attr)
X_val_num = pd.DataFrame(scaler.transform(X_val[num_attr]), columns=num_attr)

In [36]:
print(X_train_num.shape)
print(X_val_num.shape)

print(type(X_train_num))
print(type(X_val_num))

(4000, 5)
(1000, 5)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
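
The `mean_` and `var_` attributes inspected above fully determine the transformation: each value becomes `(x - mean_) / sqrt(var_)`. The sketch below verifies that on a hypothetical four-value column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

sc = StandardScaler().fit(x)

# StandardScaler uses the population variance (ddof=0).
manual = (x - sc.mean_) / np.sqrt(sc.var_)

print("mean_:", sc.mean_, "var_:", sc.var_)
print("Matches transform():", np.allclose(manual, sc.transform(x)))
```

Note that the rebuilt DataFrames in the cell above get a fresh 0..n-1 index, which is why they line up with the one-hot encoded frames created later.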


### One Hot Encoding of categorical attributes

In [37]:
ohe = OneHotEncoder(handle_unknown='error')

In [None]:
ohe.fit(X_train[cat_attr])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
columns_ohe = list(ohe.get_feature_names_out(cat_attr))
print(columns_ohe)

X_train_cat = ohe.transform(X_train[cat_attr])
X_val_cat = ohe.transform(X_val[cat_attr])

In [None]:
print(X_train_cat[0:5])
print(X_val_cat[0:5])

  (0, 1)	1.0
  (0, 4)	1.0
  (0, 7)	1.0
  (0, 9)	1.0
  (0, 11)	1.0
  (0, 13)	1.0
  (1, 0)	1.0
  (1, 4)	1.0
  (1, 7)	1.0
  (1, 9)	1.0
  (1, 11)	1.0
  (1, 13)	1.0
  (2, 2)	1.0
  (2, 5)	1.0
  (2, 7)	1.0
  (2, 9)	1.0
  (2, 11)	1.0
  (2, 13)	1.0
  (3, 1)	1.0
  (3, 5)	1.0
  (3, 7)	1.0
  (3, 9)	1.0
  (3, 12)	1.0
  (3, 13)	1.0
  (4, 3)	1.0
  (4, 5)	1.0
  (4, 7)	1.0
  (4, 9)	1.0
  (4, 12)	1.0
  (4, 13)	1.0
  (0, 0)	1.0
  (0, 4)	1.0
  (0, 7)	1.0
  (0, 9)	1.0
  (0, 11)	1.0
  (0, 13)	1.0
  (1, 2)	1.0
  (1, 4)	1.0
  (1, 7)	1.0
  (1, 9)	1.0
  (1, 12)	1.0
  (1, 14)	1.0
  (2, 1)	1.0
  (2, 6)	1.0
  (2, 7)	1.0
  (2, 9)	1.0
  (2, 11)	1.0
  (2, 14)	1.0
  (3, 3)	1.0
  (3, 6)	1.0
  (3, 7)	1.0
  (3, 9)	1.0
  (3, 12)	1.0
  (3, 13)	1.0
  (4, 2)	1.0
  (4, 5)	1.0
  (4, 7)	1.0
  (4, 9)	1.0
  (4, 11)	1.0
  (4, 13)	1.0


In [None]:
X_train_cat = pd.DataFrame(X_train_cat.toarray(), columns=columns_ohe)
X_val_cat = pd.DataFrame(X_val_cat.toarray(), columns=columns_ohe)

In [None]:
print(X_train_cat.head(4))

   Family_1  Family_2  Family_3  Family_4  Education_1  Education_2  \
0       0.0       1.0       0.0       0.0          1.0          0.0   
1       1.0       0.0       0.0       0.0          1.0          0.0   
2       0.0       0.0       1.0       0.0          0.0          1.0   
3       0.0       1.0       0.0       0.0          0.0          1.0   

   Education_3  SecuritiesAccount_0  SecuritiesAccount_1  CDAccount_0  \
0          0.0                  1.0                  0.0          1.0   
1          0.0                  1.0                  0.0          1.0   
2          0.0                  1.0                  0.0          1.0   
3          0.0                  1.0                  0.0          1.0   

   CDAccount_1  Online_0  Online_1  CreditCard_0  CreditCard_1  
0          0.0       1.0       0.0           1.0           0.0  
1          0.0       1.0       0.0           1.0           0.0  
2          0.0       1.0       0.0           1.0           0.0  
3          0.0       0.0       1.0           1.0           0.0  

In [None]:
print(X_train_cat.shape)
print(X_val_cat.shape)

print(type(X_train_cat))
print(type(X_val_cat))

(4000, 15)
(1000, 15)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


### Merging of Numerical and Categorical Dataframes

In [None]:
X_train_proc = pd.concat([X_train_num, X_train_cat], axis=1)
X_val_proc = pd.concat([X_val_num, X_val_cat], axis=1)

In [None]:
print(X_train_proc.shape)
print(y_train.shape)
print(X_val_proc.shape)
print(y_val.shape)

(4000, 20)
(4000,)
(1000, 20)
(1000,)
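
The imputation, scaling, encoding, and merging steps above can also be expressed as a single `ColumnTransformer`, which may be easier to reuse on new data. This is only a sketch of that alternative, not the notebook's method; the tiny frames and the reduced column lists are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature train/validation frames.
train_df = pd.DataFrame({
    "Income": [49.0, np.nan, 11.0, 100.0],
    "CCAvg": [1.6, 1.5, 1.0, 2.7],
    "Education": pd.Categorical([1, 1, 2, 3]),
})
val_df = pd.DataFrame({
    "Income": [np.nan, 45.0],
    "CCAvg": [0.4, 1.0],
    "Education": pd.Categorical([2, 3]),
})

num_cols = ["Income", "CCAvg"]
cat_cols = ["Education"]

preprocess = ColumnTransformer(
    [
        # Numeric columns: median imputation, then standardization.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), num_cols),
        # Categorical columns: one-hot encoding.
        ("cat", OneHotEncoder(handle_unknown="error"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on train only; reuse the learned statistics on validation.
train_proc = preprocess.fit_transform(train_df)
val_proc = preprocess.transform(val_df)

print(train_proc.shape, val_proc.shape)
```

Here two numeric columns plus three Education levels give five output columns, the same way 5 numeric and 15 one-hot columns give 20 in the notebook.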


# Model Building 

###  Build Bagging Tree Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier

In [None]:
clf = BaggingClassifier(n_estimators=10, random_state=0)
clf.fit(X=X_train_proc, y=y_train)

BaggingClassifier(random_state=0)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score

In [None]:
y_pred = clf.predict(X_train_proc)
print("Accuracy for Train set:")
print(accuracy_score(y_train,y_pred))

y_pred_val = clf.predict(X_val_proc)
print("Accuracy for Validation set:")
print(accuracy_score(y_val,y_pred_val))

print("\n")

print("Recall for Train set:")
print(recall_score(y_train,y_pred,pos_label=1))

print("Recall for Validation set:")
print(recall_score(y_val,y_pred_val,pos_label=1))

print("\n")

print("Precision for Train set:")
print(precision_score(y_train,y_pred,pos_label=1))

print("Precision for Validation set:")
print(precision_score(y_val,y_pred_val,pos_label=1))


Accuracy for Train set:
0.999
Accuracy for Validation set:
0.984


Recall for Train set:
0.9895833333333334
Recall for Validation set:
0.875


Precision for Train set:
1.0
Precision for Validation set:
0.9545454545454546
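
To illustrate what the classifier does internally, here is a simplified hand-rolled version of bagging: each tree is trained on a bootstrap sample (rows drawn with replacement) and the predictions are combined by majority vote. The synthetic data from `make_classification` and all variable names are assumptions for illustration; `BaggingClassifier` itself averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed bank data.
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(10):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_syn), size=len(X_syn))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_syn[idx], y_syn[idx]))

# One row of predictions per tree, shape (10, 500).
votes = np.stack([tree.predict(X_syn) for tree in trees])

# Majority vote across trees (ties go to class 1 in this sketch).
y_hat = (votes.mean(axis=0) >= 0.5).astype(int)

print("Training accuracy of the hand-rolled ensemble:", (y_hat == y_syn).mean())
```

As in the notebook, training scores of such an ensemble are close to perfect, so the validation scores are the ones to read.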


### Hyperparameter Tuning


In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# set of parameters to test
param_grid = {"n_estimators": [10, 15, 20, 25],
              "bootstrap": [False, True]
              }

In [None]:
bt = BaggingClassifier(random_state=0)
clf2 = GridSearchCV(bt, param_grid, cv=5, scoring='recall', n_jobs=-1)
clf2.fit(X_train_proc, y_train)

GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=0), n_jobs=-1,
             param_grid={'bootstrap': [False, True],
                         'n_estimators': [10, 15, 20, 25]},
             scoring='recall')

In [None]:
clf2.best_params_

{'bootstrap': True, 'n_estimators': 15}

In [None]:
clf2.best_estimator_

BaggingClassifier(n_estimators=15, random_state=0)
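
Beyond `best_params_`, the `cv_results_` attribute shows how close the competing settings were. The sketch below runs a smaller grid on synthetic, imbalanced data (an assumption for illustration) and tabulates the mean cross-validated recall per setting. One point worth knowing: with `bootstrap=False` and the default `max_samples=1.0`, every base estimator is trained on the full training set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the bank data.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

grid = {"n_estimators": [10, 15], "bootstrap": [False, True]}
search = GridSearchCV(BaggingClassifier(random_state=0), grid, cv=5, scoring="recall")
search.fit(X_syn, y_syn)

# One row per parameter combination, best rank first.
results = pd.DataFrame(search.cv_results_)[
    ["param_bootstrap", "param_n_estimators",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
```

If the top rows differ by less than their `std_test_score`, the choice between them is within the noise of the cross-validation.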

In [None]:
train_pred = clf2.predict(X_train_proc)
val_pred = clf2.predict(X_val_proc)

#### Calculate Accuracy and True Positive Rate

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix_train = confusion_matrix(y_train, train_pred)
confusion_matrix_val = confusion_matrix(y_val, val_pred)

In [None]:
print(confusion_matrix_train)
print("\n")
print(confusion_matrix_val)

[[3616    0]
 [   0  384]]


[[900   4]
 [ 11  85]]


In [None]:
# Accuracy = correct predictions (diagonal) / all predictions
Accuracy_Train = confusion_matrix_train.trace() / confusion_matrix_train.sum()
# TPR (recall) = TP / (FN + TP), i.e. the second row of the matrix
TPR_Train = confusion_matrix_train[1, 1] / confusion_matrix_train[1].sum()

print("Train TPR: ",TPR_Train)
print("Train Accuracy: ",Accuracy_Train)

Accuracy_val = confusion_matrix_val.trace() / confusion_matrix_val.sum()
TPR_val = confusion_matrix_val[1, 1] / confusion_matrix_val[1].sum()

print("Validation TPR: ",TPR_val)
print("Validation Accuracy: ",Accuracy_val)

Train TPR:  1.0
Train Accuracy:  1.0
Validation TPR:  0.8854166666666666
Validation Accuracy:  0.985
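
The manual calculation can be cross-checked against scikit-learn's own metric functions. As a sketch, the label vectors below are rebuilt from the validation confusion matrix printed above (900, 4, 11, 85); the row order within them is arbitrary and does not affect the metrics.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Rebuild labels from the validation confusion matrix [[900, 4], [11, 85]].
y_true = [0] * 904 + [1] * 96
y_pred = [0] * 900 + [1] * 4 + [0] * 11 + [1] * 85

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

val_recall = recall_score(y_true, y_pred)
val_accuracy = accuracy_score(y_true, y_pred)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Validation TPR:     ", val_recall)
print("Validation Accuracy:", val_accuracy)
```

Both values agree with the figures computed by hand: 85 of the 96 actual acceptors are found, and 985 of 1000 predictions are correct.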
