# Week07 - Prediction using Decision Tree (with Hyperparameter Tuning)



## Introduction and Overview


In this notebook, we will reuse the Universal Bank dataset.

This time, we are developing a model to predict whether a customer will accept a personal loan offer. The dataset contains 5000 observations and 14 variables. The data is available on one of my GitHub repos.

## Install and import necessary packages

In [62]:
# You may need to install xgboost (it's not part of the sklearn package)
# !conda install xgboost 

In [49]:
# import packages
import pandas as pd
from pandas import MultiIndex
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from xgboost import XGBClassifier

np.random.seed(1)

## Load data 

In [50]:
df = pd.read_csv('/Users/shambhavimishra/Downloads/DSP/UniversalBank.csv')
df.head(5)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


## Explore the dataset

In [51]:
# Explore the dataset
# show the first five rows, the column names, summary statistics, and column dtypes
print(df.head())
print(df.columns)
print(df.describe())
df.info()  # info() prints its report directly and returns None

   ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  \
0   1   25           1      49     91107       4    1.6          1         0   
1   2   45          19      34     90089       3    1.5          1         0   
2   3   39          15      11     94720       1    1.0          1         0   
3   4   35           9     100     94112       1    2.7          2         0   
4   5   35           8      45     91330       4    1.0          2         0   

   Personal Loan  Securities Account  CD Account  Online  CreditCard  
0              0                   1           0       0           0  
1              0                   1           0       0           0  
2              0                   0           0       0           0  
3              0                   0           0       0           0  
4              0                   0           0       0           1  
Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education'

## Clean/transform data (where necessary)

In [52]:
# based on findings from data exploration, we need to clean up column names, as some have leading whitespace characters
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

Drop the columns we are not using as predictors (see previous notebooks -- we are given a subset of input variables to consider)

In [53]:
df = df.drop(columns=['ID', 'ZIP Code'])

In [54]:
# translate education categories into dummy variables
df = df.join(pd.get_dummies(df['Education'], prefix='Edu', drop_first=True))
df.drop('Education', axis=1, inplace = True)

df.head(3)

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard,Edu_2,Edu_3
0,25,1,49,4,1.6,0,0,1,0,0,0,0,0
1,45,19,34,3,1.5,0,0,1,0,0,0,0,0
2,39,15,11,1,1.0,0,0,0,0,0,0,0,0
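
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up four-row frame. The rows are illustrative assumptions, not values from the dataset; only the `Education` codes 1, 2, and 3 mirror the notebook.

```python
import pandas as pd

# Toy frame; the rows are made up for illustration only.
toy = pd.DataFrame({"Education": [1, 2, 3, 2]})

# drop_first=True drops the first level (Edu_1), which becomes the baseline:
# a row with Edu_2 == 0 and Edu_3 == 0 is an Education == 1 row.
dummies = pd.get_dummies(toy["Education"], prefix="Edu", drop_first=True)
print(dummies.astype(int))
```

Dropping one level avoids a redundant column, since the third level can always be inferred from the other two.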


## Split data into training and validation sets

In [55]:
# construct datasets for analysis
target = 'Personal Loan'
predictors = list(df.columns)
predictors.remove(target)
X = df[predictors]
y = df[target]

In [56]:
# create the training set and the test set 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1)
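
When the positive class is rare, a purely random split can leave the training and test sets with slightly different positive rates. The sketch below shows the `stratify` argument of `train_test_split` on synthetic stand-in data. The names `X_demo` and `y_demo` and the roughly 10% positive rate are assumptions for illustration, not properties taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1000 rows, roughly 10% positives.
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.uniform(size=1000) < 0.1).astype(int)

# stratify=y_demo keeps the positive rate (almost) equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
print(f"positive rate: all={y_demo.mean():.3f} "
      f"train={y_tr.mean():.3f} test={y_te.mean():.3f}")
```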

## Prediction with Decision Tree (using default parameters)



You can find details about scikit-learn's DecisionTreeClassifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

Create a decision tree using all of the default parameters

In [57]:
dtree=DecisionTreeClassifier()

Fit the model to the training data

In [58]:
_ = dtree.fit(X_train, y_train)

Review the performance of the model on the test data

In [59]:
y_pred = dtree.predict(X_test)

In [60]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.9060402684563759
Accuracy Score:   0.9873333333333333
Precision Score:  0.9642857142857143
F1 Score:         0.9342560553633219


Save the recall result from this model

In [61]:
dtree_recall = recall_score(y_test, y_pred)
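
The four scores printed above all derive from the confusion matrix. As a minimal sketch on made-up labels (the lists `y_true` and `y_hat` are illustrative, not the notebook's predictions), recall and precision can be recomputed by hand and checked against scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Small made-up labels, only to show how the metrics relate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of predicted positives that were right
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Recall is the metric tracked throughout this notebook because a missed positive (a customer who would have accepted the loan) is a false negative, which lowers recall but not precision.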

## Prediction with RandomForest (using default parameters)

Like all our classifiers, RandomForestClassifier has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* n_estimators: The number of trees in the forest.
    - More trees generally make the predictions more stable, but increase training and prediction time.
    - The value must be an integer greater than 0. Default is 100.
* max_depth: The maximum depth per tree.
    - Deeper trees might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is None, which allows the tree to grow without constraint.
* See the scikit-learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
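
As a minimal sketch of the GridSearchCV suggestion, the block below searches a deliberately tiny grid over `n_estimators` and `max_depth`. It runs on synthetic data from `make_classification` rather than the bank dataset, and the grid values, `cv=3`, and `scoring="recall"` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=1
)

# A deliberately small grid so the sketch runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid=param_grid,
    scoring="recall",  # rank candidates by the metric this notebook reports
    cv=3,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```

With the real data, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.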


In [62]:
rforest = RandomForestClassifier()

In [63]:
_ = rforest.fit(X_train, y_train)

In [64]:
y_pred = rforest.predict(X_test)

In [65]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8456375838926175
Accuracy Score:   0.9833333333333333
Precision Score:  0.984375
F1 Score:         0.9097472924187726


Save the recall result from this model

In [66]:
rforest_recall = recall_score(y_test, y_pred)

## Prediction with AdaBoost (using default parameters)

Like all our classifiers, AdaBoostClassifier has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* estimator: The weak learner that is boosted.
    - Default is None, which uses a DecisionTreeClassifier with max_depth=1 (a decision stump).
    - AdaBoostClassifier has no max_depth parameter of its own; to boost deeper trees, pass a DecisionTreeClassifier with the desired max_depth as the estimator. Deeper trees might increase the performance, but also the complexity and chances to overfit.
* learning_rate: The weight applied to each weak learner at each boosting iteration.
    - A low learning rate shrinks the contribution of each weak learner, so more boosting rounds are needed to reach the same training error. There is a trade-off between learning_rate and n_estimators.
    - The value must be greater than 0. Default is 1.0.
* n_estimators: The maximum number of weak learners in our ensemble.
    - Equivalent to the maximum number of boosting rounds; boosting stops early if the training data is fit perfectly.
    - The value must be an integer greater than 0. Default is 50.
* See the scikit-learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
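
One way to see what the defaults actually do is to fit a default model and inspect it. The sketch below uses synthetic data (an assumption for illustration, not the bank dataset) and reads the depth of each fitted weak learner from `estimators_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data, used only to get a fitted model to inspect.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

model = AdaBoostClassifier(random_state=1).fit(X_demo, y_demo)

# Each fitted weak learner is a decision tree; with the defaults it is a stump.
depths = {est.get_depth() for est in model.estimators_}
print("tree depths used:", depths)
print("boosting rounds requested:", model.n_estimators)
print("learning rate:", model.learning_rate)
```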

In [67]:
aboost = AdaBoostClassifier()

In [68]:
_ = aboost.fit(X_train, y_train)

In [69]:
y_pred = aboost.predict(X_test)

In [70]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.7248322147651006
Accuracy Score:   0.9626666666666667
Precision Score:  0.8780487804878049
F1 Score:         0.7941176470588235


Save the recall result from this model

In [71]:
aboost_recall = recall_score(y_test, y_pred)

## Prediction with AdaBoost using RandomizedSearchCV

In this example, instead of accepting the defaults, we use RandomizedSearchCV to sample combinations of AdaBoostClassifier parameter values and keep the best one found by cross-validation. You may also want to use GridSearchCV to explore the ranges exhaustively.

* estimator: The weak learner that is boosted.
    - Here we pass a DecisionTreeClassifier with max_depth=1 explicitly, which matches the default weak learner.
    - Deeper trees might increase the performance, but also the complexity and chances to overfit.
* learning_rate: The weight applied to each weak learner at each boosting iteration.
    - A low learning rate shrinks the contribution of each weak learner, so more boosting rounds are needed to reach the same training error.
    - The value must be greater than 0. Default is 1.0.
* n_estimators: The maximum number of weak learners in our ensemble.
    - Equivalent to the maximum number of boosting rounds.
    - The value must be an integer greater than 0. Default is 50.
* See the scikit-learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [76]:
dt = DecisionTreeClassifier(max_depth=1)
aboost = AdaBoostClassifier(estimator=dt)

In [77]:
param_dist = {
    "n_estimators": randint(50, 200),
    "learning_rate": [0.01, 0.1, 1]
}

In [78]:
random_search = RandomizedSearchCV(aboost, param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X_train, y_train)

In [79]:
print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)

Best parameters: {'learning_rate': 1, 'n_estimators': 56}
Best score: 0.9694285714285714


In [81]:
y_pred = random_search.predict(X_test)

In [82]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.7315436241610739
Accuracy Score:   0.9626666666666667
Precision Score:  0.872
F1 Score:         0.7956204379562044


In [83]:
RandomSearch_aboost_recall = recall_score(y_test, y_pred)
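
Because no `scoring` argument is passed to `RandomizedSearchCV` above, candidates are ranked by the classifier's default score, which is accuracy, so the "Best score" printed above is a cross-validated accuracy rather than a recall. A minimal sketch of a recall-driven, repeatable search follows. It runs on synthetic data, and the narrower `n_estimators` range, `n_iter=4`, and `cv=3` are assumptions chosen only to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in data (not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=1
)

param_dist = {
    "n_estimators": randint(20, 60),  # integers from 20 to 59
    "learning_rate": [0.01, 0.1, 1],
}

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=4,          # kept small so the sketch runs quickly
    scoring="recall",  # rank candidates by recall instead of accuracy
    cv=3,
    random_state=1,    # makes the sampled candidates repeatable
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```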

## Prediction with GradientBoostingClassifier

Like all our classifiers, GradientBoostingClassifier has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree.
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0, or None. Default is 3.
* learning_rate: The learning rate shrinks the contribution of each tree.
    - A low learning rate requires more boosting rounds to achieve the same reduction in training error as a high learning rate, which makes computation slower, but it often generalizes better.
    - Larger learning rates may overshoot and fail to converge on a good solution.
    - The value must be non-negative. Default is 0.1.
* n_estimators: The number of trees in our ensemble.
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* See the scikit-learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
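
To illustrate how the number of boosting rounds affects the result, the sketch below uses `staged_predict`, which yields the predictions after each additional tree. It runs on synthetic data (an assumption for illustration), so the recall values it prints say nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# These are the documented defaults, written out explicitly.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=1
).fit(X_tr, y_tr)

# staged_predict yields the test predictions after 1, 2, ..., n_estimators trees.
recalls = [recall_score(y_te, pred) for pred in gb.staged_predict(X_te)]
for n in (1, 10, 50, 100):
    print(f"{n:>3} trees: recall = {recalls[n - 1]:.3f}")
```

Looking at the whole curve, rather than only the final value, is a cheap way to judge whether more rounds are still helping.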

In [84]:
gboost = GradientBoostingClassifier()

In [85]:
_ = gboost.fit(X_train, y_train)

In [86]:
y_pred = gboost.predict(X_test)

In [87]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8657718120805369
Accuracy Score:   0.9826666666666667
Precision Score:  0.9555555555555556
F1 Score:         0.9084507042253522


Save the recall result from this model

In [28]:
gboost_recall = recall_score(y_test, y_pred)

## Prediction with XGBoost

Like all our classifiers, XGBoost has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree.
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is 6.
* learning_rate: The learning rate determines the step size at each iteration while your model optimizes toward its objective.
    - A low learning rate makes computation slower and requires more rounds to achieve the same reduction in residual error as a model with a high learning rate, but it improves the chances of reaching a good optimum.
    - The value must be between 0 and 1. Default is 0.3.
* n_estimators: The number of trees in our ensemble.
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* colsample_bytree: The fraction of columns to be randomly sampled for each tree.
    - Values below 1 might reduce overfitting.
    - The value must be between 0 and 1. Default is 1.
* subsample: The fraction of observations to be sampled for each tree.
    - Lower values help prevent overfitting but might lead to underfitting.
    - The value must be between 0 and 1. Default is 1.
* See the XGBoost documentation for more details. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn
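
If you want to tune these parameters with a randomized search, one option is to describe each range as a distribution. The sketch below defines a hypothetical search space (the ranges are illustrative assumptions, not recommendations) and uses scikit-learn's `ParameterSampler` to draw a few candidates, so it runs without xgboost installed.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space for the parameters listed above.
# uniform(loc, scale) samples from the interval [loc, loc + scale].
xgb_param_dist = {
    "max_depth": randint(2, 9),             # integers 2..8
    "learning_rate": uniform(0.01, 0.29),   # 0.01 to 0.30
    "n_estimators": randint(50, 301),       # integers 50..300
    "colsample_bytree": uniform(0.5, 0.5),  # 0.5 to 1.0
    "subsample": uniform(0.5, 0.5),         # 0.5 to 1.0
}

# Draw a few candidates to check that the ranges look sensible.
candidates = list(ParameterSampler(xgb_param_dist, n_iter=5, random_state=1))
for params in candidates:
    print(params)
```

The same dictionary could be passed as `param_distributions` to `RandomizedSearchCV` together with an `XGBClassifier`.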

In [88]:
xgboost = XGBClassifier()

In [89]:
_ = xgboost.fit(X_train, y_train)





In [90]:
y_pred = xgboost.predict(X_test)

In [91]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8926174496644296
Accuracy Score:   0.9853333333333333
Precision Score:  0.9568345323741008
F1 Score:         0.9236111111111113


Save the recall result from this model

In [92]:
# store the score under its own name so the fitted xgboost model is not overwritten
xgboost_recall = recall_score(y_test, y_pred)

## Summarize results

As usual, in this section you provide a recap of your approach, results, and a discussion of findings.


In [93]:
print("Recall scores...")
print(f"{'Decision Tree:':18}{dtree_recall}")
print(f"{'Random Forest:':18}{rforest_recall}")
print(f"{'Ada Boosted Tree:':18}{aboost_recall}")
print(f"{'Gradient Tree:':18}{gboost_recall}")
print(f"{'XGBoost Tree:':18}{xgboost_recall}")
print(f"{'RandomSearch Ada Boosted Tree:':18}{RandomSearch_aboost_recall}")

Recall scores...
Decision Tree:    0.9060402684563759
Random Forest:    0.8456375838926175
Ada Boosted Tree: 0.7248322147651006
Gradient Tree:    0.8657718120805369
XGBoost Tree:     0.8926174496644296
RandomSearch Ada Boosted Tree:0.7315436241610739
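
For easier comparison, the recall scores can be collected into a single sorted table. This sketch copies the values from the output above, rounded to four decimals; in the notebook itself you would use the saved `*_recall` variables instead of literals.

```python
import pandas as pd

# Recall values copied from the output above (rounded to four decimals).
recall_scores = pd.Series(
    {
        "Decision Tree": 0.9060,
        "Random Forest": 0.8456,
        "AdaBoost": 0.7248,
        "AdaBoost (RandomizedSearchCV)": 0.7315,
        "Gradient Boosting": 0.8658,
        "XGBoost": 0.8926,
    },
    name="recall",
)

# Sort from best to worst recall.
ranked = recall_scores.sort_values(ascending=False)
print(ranked.to_string())
```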


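## Conclusion

The recall scores for the default AdaBoost model (72.5%) and the AdaBoost model tuned with RandomizedSearchCV (73.2%) are nearly identical, so the random search over n_estimators and learning_rate gave almost no improvement in recall. Note that the search used the default scoring (accuracy), so it was not selecting parameters for recall directly.

Increasing the number of boosting rounds or adjusting the learning rate may raise the recall further, but overfitting is a risk to watch for. Other models did better on this test set: Random Forest reached a recall of 84.6%, and the default Decision Tree had the highest recall of all (90.6%).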