# WE04-Universal Bank


## Hema Sai Ari (U59528014)

### 1.0 Import the Python libraries we require


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import recall_score


np.random.seed(1)

### 2.0 Loading data 

In [2]:
df = pd.read_csv("UniversalBank.csv")

### Overview of data

In [3]:

df.head(3)

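| | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |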


### Summary of the data

In [4]:
df.info()
# .info() reports the non-null count and the data type of each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIP Code            5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal Loan       5000 non-null   int64  
 10  Securities Account  5000 non-null   int64  
 11  CD Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


 **All 14 columns are numeric (`int64` or `float64`), so there are no string-valued categorical variables to encode**
 

### Checking for missing values in the data

In [5]:
# Check the missing values by summing the total na's for each variable
df.isna().sum()

ID                    0
Age                   0
Experience            0
Income                0
ZIP Code              0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal Loan         0
Securities Account    0
CD Account            0
Online                0
CreditCard            0
dtype: int64

**Here we can clearly observe that there are no missing values in any column**

### Checking the count of unique data in the target variable

In [6]:
# count the 0s and 1s in the target variable
df['CD Account'].value_counts()


0    4698
1     302
Name: CD Account, dtype: int64

**The target is heavily imbalanced: only 302 of the 5,000 customers (about 6%) hold a CD account, which can affect the results**
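To put a number on the imbalance, the counts above can be turned into class shares. This is a minimal sketch in plain Python; the counts are copied from the `value_counts()` output, and the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {0: 4698, 1: 302}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {share:.2%}")

# A model that always predicts 0 would reach this accuracy with zero recall,
# which is why accuracy alone is a weak yardstick for this target.
majority_baseline_accuracy = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

This is one reason the models below are compared on recall rather than accuracy.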

## 3.0 Process the data

We don't have any categorical variables or missing values, so we don't need to convert anything to numeric or impute values.

### Dropping unnecessary columns

Dropping columns that carry no predictive information keeps the models simpler and faster to fit.

In [7]:
# drop ID and ZIP Code, as they are not expected to be useful predictors
df = df.drop(columns=['ID', 'ZIP Code'])

### Splitting the data into train and test sets

Let's split the data into training and test sets with a 70/30 ratio.

In [8]:
# split the data into training and test sets
train_df, test_df = train_test_split(df, test_size=0.3, random_state=1)

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'CD Account'
predictors = list(df.columns)
predictors.remove(target)

### Now let's standardize the variables

We standardize our variables to eliminate the differences in scale between the predictors/features.

We will use the `StandardScaler` from the sklearn library to accomplish this. We first fit the scaler on the training data only, then apply the fitted scaler to standardize both the training and test sets.

In [9]:
# create a standard scaler and fit it to the training set of predictors
scaler = preprocessing.StandardScaler()
scaler.fit(train_df[predictors])

# Transform the predictors of training and test sets
X_train = scaler.transform(train_df[predictors]) 
y_train = train_df[target] 

X_test = scaler.transform(test_df[predictors])
y_test = test_df[target] 
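To make the fit-on-train, apply-to-both idea concrete, here is a plain-Python sketch of the arithmetic for a single column. It mirrors what `StandardScaler` does (subtract the mean, divide by the population standard deviation), but the helper names and the toy values are invented for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply previously learned statistics to any column of values."""
    return [(x - mean) / std for x in column]

# Toy income values, invented for illustration only
train_income = [49.0, 34.0, 11.0, 100.0, 45.0]
test_income = [29.0, 72.0]

mean, std = fit_scaler(train_income)              # statistics come from train only
train_scaled = transform(train_income, mean, std)
test_scaled = transform(test_income, mean, std)   # the same statistics are reused

print(round(fmean(train_scaled), 10), round(pstdev(train_scaled), 10))
print(test_scaled)
```

Because the test values never influence `mean` or `std`, no information from the test set leaks into the preprocessing step.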

## 4.0 Modeling


Let's create a data frame to store the results of all our models.

In [10]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### 4.1 Logistic Regression model

### 4.1.1 Logistic Regression using RandomSearch and Grid Search

In [11]:
score_measure = "recall"
kfolds = 5

param_grid = {'C':[0.001,0.01,0.1,1,10], # C is the inverse of the regularization strength
               'penalty':['l1', 'l2','elasticnet','none'],
              'solver':['saga','liblinear'],
              'max_iter': np.arange(500,1000)
                  
}

lg = LogisticRegression()
rand_search = RandomizedSearchCV(estimator =lg, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1  # n_jobs=-1 will utilize all available CPUs 
                                )

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestlogestic = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.6708245243128964
... with parameters: {'solver': 'liblinear', 'penalty': 'l2', 'max_iter': 851, 'C': 0.1}


910 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
280 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1091, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 71, in _check_solver
    raise ValueError(
ValueError: Only 'saga' solver supports elasticnet penalty, got solver=liblinear.

------------------------
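Many of the failed fits come from sampling solver/penalty pairs that scikit-learn rejects, such as `liblinear` with `elasticnet`. One way to avoid them, sketched here under assumptions, is to describe the search space as a list of dictionaries so that each solver is only paired with penalties it supports; both `RandomizedSearchCV` (`param_distributions`) and `GridSearchCV` (`param_grid`) accept such a list. The value lists are illustrative, `l1_ratio` is added because elastic net requires it, and the `'none'` spelling follows the scikit-learn version used in this notebook (newer releases use `None`).

```python
from itertools import product

C_VALUES = [0.001, 0.01, 0.1, 1, 10]

# Each dictionary lists only combinations that its solver supports.
param_space = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_VALUES},
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": C_VALUES},
    # elastic net needs an l1_ratio; the values here are illustrative
    {"solver": ["saga"], "penalty": ["elasticnet"], "C": C_VALUES,
     "l1_ratio": [0.25, 0.5, 0.75]},
    # without a penalty C has no effect, so it is left out
    {"solver": ["saga"], "penalty": ["none"]},
]

def expand(space):
    """Enumerate every candidate an exhaustive search would try for this space."""
    for block in space:
        keys = sorted(block)
        for values in product(*(block[k] for k in keys)):
            yield dict(zip(keys, values))

candidates = list(expand(param_space))
print(len(candidates), "candidates")
```

The `expand` helper is only there to show that no invalid pair can be generated; the search classes do the equivalent expansion or sampling internally.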

**Now let's use these best parameters in the grid search to obtain the best results**

In [12]:
score_measure = "recall"
kfolds = 5
best_penality = rand_search.best_params_['penalty']
best_solver = rand_search.best_params_['solver']
min_regulization_strength=rand_search.best_params_['C']
min_iter = rand_search.best_params_['max_iter']

# Use the best parameters from the random search to center the ranges for the grid search
param_grid = {
    
    # note: np.arange's default step is 1, so this range holds a single value of C
    'C':np.arange(min_regulization_strength-0.05,min_regulization_strength+0.05), 
               'penalty':[best_penality],
              'solver':[best_solver],
              'max_iter': np.arange(min_iter-300,min_iter+300)
}

lgr =  LogisticRegression()
grid_search = GridSearchCV(estimator = lgr, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1 # n_jobs=-1 will utilize all available CPUs 
                )

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestlgr = grid_search.best_estimator_

Fitting 5 folds for each of 600 candidates, totalling 3000 fits
The best recall score is 0.6708245243128964
... with parameters: {'C': 0.05, 'max_iter': 551, 'penalty': 'l2', 'solver': 'liblinear'}
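One detail of the refinement step is worth spelling out: `np.arange(start, stop)` uses a step of 1 by default, so a window of ±0.05 around `C` contains a single value (`0.05`, as the output shows), and the 600 candidates differ only in `max_iter`. Below is a hedged sketch of a helper that spreads several candidates across such a window. The name `window` and its arguments are hypothetical, and `np.linspace` would serve the same purpose.

```python
def window(center, half_width, n_points, lower=None):
    """Return n_points evenly spaced values in [center - hw, center + hw].

    Values at or below `lower` (if given) are dropped, which keeps strictly
    positive parameters such as C valid.
    """
    if n_points < 2:
        return [center]
    start = center - half_width
    step = 2 * half_width / (n_points - 1)
    values = [round(start + i * step, 10) for i in range(n_points)]
    if lower is not None:
        values = [v for v in values if v > lower]
    return values

# Five candidates around the random-search winner C = 0.1
c_candidates = window(0.1, 0.05, 5, lower=0.0)
print(c_candidates)
```

With a grid like this the second stage actually explores `C`, instead of repeating the same value for every `max_iter`.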


The best recall score and best parameters above were obtained by cross-validation on the training set. Next we use the model with those best parameters to predict the test set, compare the predictions with the actual test-set target values, and build a confusion matrix. From the confusion matrix we can compute the scoring metrics.

In [13]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")
Recall_lr = TP/(TP+FN)  # plain float; wrapping the value in {} would create a set

Accuracy=0.9780000 Precision=1.0000000 Recall=0.6024096 F1=0.7518797


In [14]:
print(f"The recall score from the Logistic Regression model using Random Search and Grid Search is: {Recall_lr}")

The recall score from the Logistic Regression model using Random Search and Grid Search is: 0.6024096385542169
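The metric formulas used in the cell above can be collected into one function. The sketch below is illustrative: the function name is hypothetical, and the counts are not printed by the notebook but inferred from the reported metrics, assuming the 1,500-row test set implied by the 70/30 split of 5,000 rows (50 true positives, 33 false negatives, 0 false positives, 1,417 true negatives).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "Accuracy": (tp + tn) / total,
        # the guards avoid a ZeroDivisionError when a denominator is empty
        "Precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "Recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "F1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }

# Counts inferred from the reported metrics (an assumption, see above)
metrics = classification_metrics(tp=50, tn=1417, fp=0, fn=33)
for name, value in metrics.items():
    print(f"{name}={value:.7f}")
```

Read this way, the logistic model raises no false alarms on the test set but misses 33 of the 83 CD-account holders.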


In [15]:
performance = pd.concat([performance, pd.DataFrame({'model':"logistic using random & grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

### 4.1.2 Modeling the data using individual logistic regression models

#### 4.1.2.1 Fit and test a Logistic Regression model

In [16]:
log_reg_model = LogisticRegression(penalty='none', max_iter=900)
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [17]:
model_preds = log_reg_model.predict(X_test)
c_matrix_1 = confusion_matrix(y_test, model_preds)
TP = c_matrix_1[1][1]
TN = c_matrix_1[0][0]
FP = c_matrix_1[0][1]
FN = c_matrix_1[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |


#### 4.1.2.2 Change to liblinear solver

In [18]:
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(X_train, np.ravel(y_train))

In [19]:
model_preds = log_reg_liblin_model.predict(X_test)
c_matrix_2 = confusion_matrix(y_test, model_preds)
TP = c_matrix_2[1][1]
TN = c_matrix_2[0][0]
FP = c_matrix_2[0][1]
FN = c_matrix_2[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |


#### 4.1.2.3 L2 Regularization

In [20]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [21]:
model_preds_3 = log_reg_L2_model.predict(X_test)
c_matrix_3 = confusion_matrix(y_test, model_preds_3)  # evaluate the L2 model's own predictions
TP = c_matrix_3[1][1]
TN = c_matrix_3[0][0]
FP = c_matrix_3[0][1]
FN = c_matrix_3[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |


#### 4.1.2.4 L1 Regularization

In [22]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [23]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix_4 = confusion_matrix(y_test, model_preds)
TP = c_matrix_4[1][1]
TN = c_matrix_4[0][0]
FP = c_matrix_4[0][1]
FN = c_matrix_4[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |


#### 4.1.2.5 Elastic Net Regularization

In [24]:
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=1000)
_ = log_reg_elastic_model.fit(X_train, np.ravel(y_train))

In [25]:
model_preds = log_reg_elastic_model.predict(X_test)
c_matrix_5 = confusion_matrix(y_test, model_preds)
TP = c_matrix_5[1][1]
TN = c_matrix_5[0][0]
FP = c_matrix_5[0][1]
FN = c_matrix_5[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elestic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |


####  Summary for logistic model

In [26]:
performance.sort_values(by=['Recall'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |


### 4.2 Model the data using the SVM models

### 4.2.1 SVM using RandomSearch and Grid Search

In [27]:
score_measure = "recall"
kfolds = 3

param_grid = {'C':np.arange(0.1,100,10),  # regularization parameter
               'kernel':['linear', 'rbf','poly'],
              'gamma':['scale','auto'],
              'degree':np.arange(1,10), # degree is only used by the polynomial kernel
              'coef0':np.arange(1,10) # coef0 is used by the polynomial kernel
                  
}

svc = SVC()
rand_search = RandomizedSearchCV(estimator =svc, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1  # n_jobs=-1 will utilize all available CPUs 
                                )

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestsvc = rand_search.best_estimator_

Fitting 3 folds for each of 500 candidates, totalling 1500 fits
The best recall score is 0.7214611872146118
... with parameters: {'kernel': 'poly', 'gamma': 'scale', 'degree': 4, 'coef0': 5, 'C': 40.1}


In [28]:
score_measure = "recall"
kfolds = 3
best_kernel = rand_search.best_params_['kernel']
best_gamma = rand_search.best_params_['gamma']
min_regulization=rand_search.best_params_['C']
best_degree = rand_search.best_params_['degree']
best_coef0=rand_search.best_params_['coef0']

# Use the best parameters from the random search to center the ranges for the grid search
param_grid = {
    
    'C':np.arange(min_regulization-3,min_regulization+3), 
               'kernel':[best_kernel],
              'gamma':[best_gamma],
              'degree': np.arange(best_degree-1,best_degree+1),
            'coef0': np.arange(best_coef0-3,best_coef0+3)
}

svm_grid =  SVC()
grid_search = GridSearchCV(estimator = svm_grid, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1 # n_jobs=-1 will utilize all available CPUs 
                )

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

best_svm = grid_search.best_estimator_

Fitting 3 folds for each of 72 candidates, totalling 216 fits
The best recall score is 0.7260273972602739
... with parameters: {'C': 37.1, 'coef0': 7, 'degree': 4, 'gamma': 'scale', 'kernel': 'poly'}
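For reference, the polynomial kernel that `degree`, `coef0` and `gamma` parameterize in scikit-learn's `SVC` has the form `(gamma * <x, z> + coef0) ** degree`, where `<x, z>` is the dot product of two samples. A small plain-Python sketch of that formula follows; the function name and the toy vectors are illustrative, and `gamma` is passed explicitly here rather than derived from the data the way `gamma='scale'` does.

```python
def poly_kernel(x, z, gamma, coef0, degree):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** degree

x = [1.0, 2.0]
z = [3.0, 4.0]

# With degree=1 and coef0=0 the kernel reduces to a scaled dot product
print(poly_kernel(x, z, gamma=0.5, coef0=0, degree=1))

# coef0 and degree as chosen by the grid search above; gamma fixed at 0.5 for the demo
print(poly_kernel(x, z, gamma=0.5, coef0=7, degree=4))
```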


In [29]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")
Recall_svm = TP/(TP+FN)  # plain float; wrapping the value in {} would create a set

Accuracy=0.9553333 Precision=0.5869565 Recall=0.6506024 F1=0.6171429


In [30]:
print(f"The recall score from the SVM model using Random Search and Grid Search is {Recall_svm}")

The recall score from the SVM model using Random Search and Grid Search is 0.6506024096385542


In [31]:
performance = pd.concat([performance, pd.DataFrame({'model':"svm using Random & Grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])


### 4.2.2 Modeling the data using individual SVM models

### 4.2.2.1 Fit an SVM classification model using a linear kernel

In [32]:
svm_lin_model = SVC(kernel="linear")
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [33]:
model_preds = svm_lin_model.predict(X_test)
c_matrix_6 = confusion_matrix(y_test, model_preds)
TP = c_matrix_6[1][1]
TN = c_matrix_6[0][0]
FP = c_matrix_6[0][1]
FN = c_matrix_6[1][0]
performance= pd.concat([performance, pd.DataFrame({'model':"linear svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | svm using Random & Grid search | 0.955333 | 0.586957 | 0.650602 | 0.617143 |
| 0 | linear svm | 0.978 | 1.0 | 0.60241 | 0.75188 |


### 4.2.2.2 Fit an SVM classification model using an RBF kernel

In [34]:
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [35]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix_7 = confusion_matrix(y_test, model_preds)
TP = c_matrix_7[1][1]
TN = c_matrix_7[0][0]
FP = c_matrix_7[0][1]
FN = c_matrix_7[1][0]
performance= pd.concat([performance, pd.DataFrame({'model':"rbf svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | svm using Random & Grid search | 0.955333 | 0.586957 | 0.650602 | 0.617143 |
| 0 | linear svm | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | rbf svm | 0.974667 | 0.909091 | 0.60241 | 0.724638 |


### 4.2.2.3 Fit an SVM classification model using a polynomial kernel

In [36]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [37]:
model_preds = svm_poly_model.predict(X_test)
c_matrix_8 = confusion_matrix(y_test, model_preds)
TP = c_matrix_8[1][1]
TN = c_matrix_8[0][0]
FP = c_matrix_8[0][1]
FN = c_matrix_8[1][0]
performance= pd.concat([performance, pd.DataFrame({'model':"poly svm", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | svm using Random & Grid search | 0.955333 | 0.586957 | 0.650602 | 0.617143 |
| 0 | linear svm | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | rbf svm | 0.974667 | 0.909091 | 0.60241 | 0.724638 |
| 0 | poly svm | 0.97 | 0.806452 | 0.60241 | 0.689655 |


### Summary of the SVM models

In [38]:
performance.sort_values(by=['Recall'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | linear svm | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | rbf svm | 0.974667 | 0.909091 | 0.60241 | 0.724638 |
| 0 | poly svm | 0.97 | 0.806452 | 0.60241 | 0.689655 |
| 0 | svm using Random & Grid search | 0.955333 | 0.586957 | 0.650602 | 0.617143 |


### 4.3 Model the data using Decision Trees with RandomizedSearchCV combined with GridSearchCV

We first use the random search to find promising parameter values across wide ranges, then run a grid search over narrow ranges around those values to refine the result.
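The two-stage idea can be illustrated without any model at all. In the sketch below a made-up scoring function stands in for the cross-validated recall; everything in it (the function, the ranges, the number of draws) is an assumption chosen for the demonstration, not the notebook's actual search.

```python
import random

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at max_depth=8, min_samples_leaf=3."""
    return (1.0
            - 0.01 * (params["max_depth"] - 8) ** 2
            - 0.02 * (params["min_samples_leaf"] - 3) ** 2)

rng = random.Random(1)  # fixed seed so the run is repeatable

# Stage 1: random search, a limited number of draws from wide ranges
coarse = [
    {"max_depth": rng.randint(1, 24), "min_samples_leaf": rng.randint(1, 99)}
    for _ in range(50)
]
best_coarse = max(coarse, key=toy_score)

# Stage 2: exhaustive grid over a narrow window around the stage-1 winner
# (lower bounds are clipped at 1 so every candidate stays valid)
fine = [
    {"max_depth": d, "min_samples_leaf": leaf}
    for d in range(max(1, best_coarse["max_depth"] - 2), best_coarse["max_depth"] + 3)
    for leaf in range(max(1, best_coarse["min_samples_leaf"] - 2),
                      best_coarse["min_samples_leaf"] + 3)
]
best_fine = max(fine, key=toy_score)

print("random search:", best_coarse, round(toy_score(best_coarse), 4))
print("grid search:  ", best_fine, round(toy_score(best_fine), 4))
```

Because the stage-1 winner is itself part of the stage-2 grid, the refined score can never be worse than the random-search score on the same objective.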

In [39]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(1,100),  
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.0005),
    'max_leaf_nodes': np.arange(5, 100), 
    'max_depth': np.arange(1,25), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.7076109936575052
... with parameters: {'min_samples_split': 92, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 78, 'max_depth': 8, 'criterion': 'gini'}


15 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 969, in fit
    super().fit(
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 265, in fit
    check_scalar(
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 1480, in check_scalar
    raise ValueError(
ValueError: min_samples_split == 1, must be >= 2.


In [40]:
score_measure = "recall"
kfolds = 5
min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
min_impurity_decrease = rand_search.best_params_['min_impurity_decrease']
max_leaf_nodes = rand_search.best_params_['max_leaf_nodes']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']
# Use the best parameters from the random search to center the ranges for the grid search
param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'min_impurity_decrease': np.arange(min_impurity_decrease-0.0001, min_impurity_decrease+0.0001, 0.00005),
    'max_leaf_nodes': np.arange(max_leaf_nodes-2,max_leaf_nodes+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 1024 candidates, totalling 5120 fits
The best recall score is 0.7076109936575052
... with parameters: {'criterion': 'gini', 'max_depth': 7, 'max_leaf_nodes': 76, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2, 'min_samples_split': 91}


2560 fits failed out of a total of 5120.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1280 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 969, in fit
    super().fit(
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 247, in fit
    check_scalar(
  File "C:\Users\arihe\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 1480, in check_scalar
    raise ValueError(
ValueError: min_samples_leaf == -1, must be >= 1.

----
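Half of the fits in this grid fail because subtracting 2 from `min_samples_leaf = 1` produces the invalid values -1 and 0; the random search hit the same kind of problem with `min_samples_split = 1`. Here is a hedged sketch of a helper that builds the refinement window without going below a parameter's minimum. `int_window` is a hypothetical name, the window is inclusive on both ends (unlike the half-open `np.arange` call above), and the lower bounds are the ones quoted in the error messages.

```python
def int_window(center, radius, lower):
    """Integers from center - radius to center + radius, clipped at `lower`."""
    start = max(lower, center - radius)
    return list(range(start, center + radius + 1))

# Lower bounds taken from the scikit-learn error messages above
leaf_values = int_window(center=1, radius=2, lower=1)     # min_samples_leaf >= 1
split_values = int_window(center=92, radius=2, lower=2)   # min_samples_split >= 2

print(leaf_values)
print(split_values)
```

Passing lists like these to the grid keeps every candidate valid, so none of the cross-validation budget is spent on fits that can only raise an error.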

In [41]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")

Accuracy=0.9760000 Precision=0.9122807 Recall=0.6265060 F1=0.7428571


In [42]:
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])


### Summary

In [43]:
performance.sort_values(by=['Recall'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | logistic using random & grid search | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | default logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | liblinear logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L2 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | L1 logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | Elestic logistic | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | linear svm | 0.978 | 1.0 | 0.60241 | 0.75188 |
| 0 | rbf svm | 0.974667 | 0.909091 | 0.60241 | 0.724638 |
| 0 | poly svm | 0.97 | 0.806452 | 0.60241 | 0.689655 |
| 0 | Decision Tree | 0.976 | 0.912281 | 0.626506 | 0.742857 |


## Conclusion

Summary of the recall scores for the different models:

> Logistic Regression

    Logistic Regression using Random & Grid search - 0.60241
    Default Logistic Regression                    - 0.60241
    Liblinear Logistic Regression                  - 0.60241
    L2 Logistic Regression                         - 0.60241
    L1 Logistic Regression                         - 0.60241
    Elastic Net Logistic Regression                - 0.60241

   **All logistic regression models have the same recall score**

> Support Vector Machine

    SVM using Random & Grid search - 0.65060
    Linear SVM                     - 0.60241
    RBF SVM                        - 0.60241
    Poly SVM                       - 0.60241

   **The SVM tuned with Random Search and Grid Search has the highest recall score**

> Decision Tree

    Using Random & Grid search - 0.626506


The results show that tuning with Random Search followed by Grid Search raised the test recall for the SVM (0.65060 versus 0.60241 for the individually fitted SVMs), while for logistic regression it made no difference. All logistic regression models and all individually fitted SVMs share the same recall of 0.60241. This could be due to the imbalance in the data: as observed at the beginning, only about 6% of customers hold a CD account, which may be why the models produce almost the same recall score.

Comparing the scores, the tuned SVM has the highest recall, 0.65060. That is only modestly higher than the decision tree (0.626506) and logistic regression (0.60241), and it comes with lower precision (0.587), but it is the best of the three model families on the chosen metric. So the best model of those we trained and tested is the **Support Vector Machine**. The parameters that achieved the highest recall score are:

- C (regularization parameter): 37.1
- coef0 (independent term of the polynomial kernel): 7
- degree (degree of the polynomial kernel): 4
- gamma: 'scale'
- kernel: 'poly'

Universal Bank can use the Support Vector Machine with the polynomial kernel to find potential customers for its new CD account product.
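Recall is the selection criterion used throughout this notebook, and the tables above also show what the top-recall model trades away in precision. The sketch below re-ranks three rows using figures copied from those tables; the list layout and variable names are illustrative.

```python
# (model, precision, recall, F1) copied from the performance tables above
results = [
    ("logistic using random & grid search", 1.0, 0.60241, 0.75188),
    ("svm using Random & Grid search", 0.586957, 0.650602, 0.617143),
    ("Decision Tree", 0.912281, 0.626506, 0.742857),
]

by_recall = sorted(results, key=lambda row: row[2], reverse=True)
by_f1 = sorted(results, key=lambda row: row[3], reverse=True)

print("best by recall:", by_recall[0][0])
print("best by F1:    ", by_f1[0][0])
```

If contacting a customer who turns out not to be interested is cheap, ranking by recall as done here is a reasonable choice; if false alarms are costly, the F1 ranking may be worth a look as well.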