# Week07 - Prediction using Decision Tree (with Hyperparameter Tuning)

This census data is just too tough... try using another dataset, such as Universal Bank?


## Introduction and Overview


In this notebook, we will reuse the census data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census+income).

Previously, we modeled this data using k-NN (with a search over k from 1 through root N) and an unpruned decision tree. As previously discussed, for this particular context (relatively equal costs/benefits for FN vs. FP, and imbalanced data classes) we argued that F1 was the best metric to maximize.

In previous attempts we found that a default decision tree (not pruned, and with no hyperparameter tuning applied) produced an F1 score of approximately 0.49, while k-NN produced a recall of approximately 0.70 at k=19 (see a3_template_approach.ipynb and the accompanying video).

In this notebook, we will explore whether hyperparameter tuning can produce a better-performing model.

## Install and import necessary packages

In [1]:
# You may need to install xgboost (it's not part of the sklearn package)
# !conda install xgboost 

In [2]:
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier



> NOTE: The current version of XGBoost (as of 10/09/2022) will display a deprecation warning. You can ignore it; it will most likely be addressed in a future XGBoost version.

In [3]:
random_seed = 1
np.random.seed(random_seed)

## Load data 

In [4]:
df = pd.read_csv('https://github.com/timcsmith/MIS536-Public/raw/master/Data/UniversalBank.csv')
df.head(5)

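|   | ID | Age | Experience | Income | ZIP Code | Family | CCAvg | Education | Mortgage | Personal Loan | Securities Account | CD Account | Online | CreditCard |
|---|----|-----|------------|--------|----------|--------|-------|-----------|----------|---------------|--------------------|------------|--------|------------|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |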


## Explore the dataset

In [5]:
# Explore the dataset
# print the first five rows, the column names, summary statistics, and column types
print(df.head())
print(df.columns)
print(df.describe())
df.info()  # info() prints its report directly and returns None

   ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  \
0   1   25           1      49     91107       4    1.6          1         0   
1   2   45          19      34     90089       3    1.5          1         0   
2   3   39          15      11     94720       1    1.0          1         0   
3   4   35           9     100     94112       1    2.7          2         0   
4   5   35           8      45     91330       4    1.0          2         0   

   Personal Loan  Securities Account  CD Account  Online  CreditCard  
0              0                   1           0       0           0  
1              0                   1           0       0           0  
2              0                   0           0       0           0  
3              0                   0           0       0           0  
4              0                   0           0       0           1  
Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education'

## Clean/transform data (where necessary)

In [6]:
# based on findings from data exploration, we need to clean up column names, as some have leading whitespace characters
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

Drop the columns we are not using as predictors (see previous notebooks -- we are given a subset of input variables to consider)

In [7]:
df = df.drop(columns=['ID', 'ZIP Code'])

In [8]:
# translate the Education categories into dummy variables
df['Education'] = df['Education'].astype('category')
df = pd.get_dummies(df, prefix_sep='_', drop_first=False)
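To make the dummy-variable step concrete, here is a minimal sketch on a small made-up frame. The `toy` frame and its values are illustrative assumptions, not rows from the dataset. Only columns with a `category` (or object) dtype are expanded; numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data: Education stored as integer codes
toy = pd.DataFrame({"Income": [49, 34, 100], "Education": [1, 1, 2]})

# Cast to category so get_dummies treats the codes as labels, not numbers
toy["Education"] = toy["Education"].astype("category")

# One indicator column per observed category; drop_first=False keeps them all
toy_dummies = pd.get_dummies(toy, prefix_sep="_", drop_first=False)
print(list(toy_dummies.columns))
```

Keeping every indicator column (`drop_first=False`) is harmless for tree-based models, which do not suffer from the collinearity that affects linear models.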

## Split data into training and validation sets

In [9]:
# construct datasets for analysis
target = 'Personal Loan'
predictors = list(df.columns)
predictors.remove(target)
X = df[predictors]
y = df[target]

In [10]:
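# create the training set and the validation/test set
# (valid_X and y_test are the predictors and target of the same 30% hold-out split)
train_X, valid_X, train_y, y_test = train_test_split(X, y, test_size=0.3, random_state=random_seed)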
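The introduction notes that the classes are imbalanced. One option, which this notebook does not use, is to pass `stratify` so that both splits keep the same positive rate. The sketch below uses a synthetic target (the `X_demo` and `y_demo` names and the 10% positive rate are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced target: 100 positives out of 1000 (illustrative only)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate the same in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
print(len(ytr), len(yte), ytr.mean(), yte.mean())
```

Without stratification the positive rate in a small hold-out set can drift from the overall rate, which makes recall estimates noisier.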

## Prediction with Decision Tree (using default parameters)



You can find details about scikit-learn's DecisionTreeClassifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

Create a decision tree using all of the default parameters

In [11]:
dtree=DecisionTreeClassifier(random_state=random_seed)

Fit the model to the training data

In [12]:
_ = dtree.fit(train_X, train_y)

Review the performance of the model on the validation/test data

In [13]:
y_pred = dtree.predict(valid_X)

In [14]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.87248322147651
Accuracy Score:   0.9766666666666667
Precision Score:  0.8904109589041096
F1 Score:         0.8813559322033899
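As a reminder of how these scores relate to the confusion matrix, here is a small worked sketch. The labels are made up for illustration and are not the model's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration (1 = accepted the loan)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(recall, precision, f1)
```

Recall ignores false positives entirely, which is why the notebook also prints precision and F1 alongside it.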


Save the recall result from this model

In [15]:
dtree_recall = recall_score(y_test, y_pred)

## Prediction with RandomForest (using default parameters)

Like all our classifiers, RandomForestClassifier has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* n_estimators: The number of trees in the forest.
    - More trees generally make the predictions more stable, but increase training and prediction time.
    - The value must be an integer greater than 0. Default is 100.  
* max_depth: The maximum depth per tree. 
    - Deeper trees might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is None, which allows each tree to grow without a depth constraint.
* The remaining tuning parameters are similar to DecisionTree's and were covered last class.
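A minimal sketch of exploring these two parameters with GridSearchCV follows. It runs on synthetic stand-in data from `make_classification` rather than the bank data, and the grid values are assumptions chosen to keep the run short, not recommended ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid=param_grid,
    scoring="recall",  # the metric this notebook tracks
    cv=3,              # 3-fold cross-validation on the training data
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)
```

On the real data you would fit the search on `train_X`, `train_y` and then score `grid.best_estimator_` on the hold-out split.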

In [16]:
rforest = RandomForestClassifier(random_state=random_seed)

In [17]:
_ = rforest.fit(train_X, train_y)

In [18]:
y_pred = rforest.predict(valid_X)

In [19]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8523489932885906
Accuracy Score:   0.9826666666666667
Precision Score:  0.9694656488549618
F1 Score:         0.9071428571428571


Save the recall result from this model

In [20]:
rforest_recall = recall_score(y_test, y_pred)

## Prediction with AdaBoost (using default parameters)

Like all our classifiers, AdaBoostClassifier has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth of each base tree. 
    - This is set on the base estimator rather than on AdaBoostClassifier itself.
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. The default base estimator is a DecisionTreeClassifier with max_depth=1 (a decision stump).
* learning_rate: The weight applied to each classifier at each boosting iteration. 
    - A low learning rate shrinks each classifier's contribution, so more boosting rounds are typically needed to reach the same training error as a model with a high learning rate.
    - There is a trade-off between learning_rate and n_estimators.
    - The value must be greater than 0. Default is 1.0.
* n_estimators: The maximum number of trees in our ensemble. 
    - Equivalent to the maximum number of boosting rounds; boosting stops early if a perfect fit is reached.
    - The value must be an integer greater than 0. Default is 50.
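For AdaBoost, tree depth is controlled through the base estimator. The sketch below boosts depth-3 trees on synthetic stand-in data; the depth, learning rate, and data are illustrative assumptions. The base learner is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

# First positional argument is the base learner
# (keyword base_estimator before scikit-learn 1.2, estimator from 1.2 onward)
deeper = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=3, random_state=1),
    n_estimators=50,
    learning_rate=0.5,
    random_state=1,
)
deeper.fit(X_demo, y_demo)

# Number of fitted trees and the depth actually reached by the first one
print(len(deeper.estimators_), deeper.estimators_[0].get_depth())
```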

In [21]:
aboost = AdaBoostClassifier(random_state=random_seed)

# as with most classifiers, you can experiment with the parameter values
# (note: in scikit-learn 1.2 and later, base_estimator is named estimator)
#aboost = AdaBoostClassifier(random_state=random_seed, base_estimator=DecisionTreeClassifier(max_depth=4, random_state=random_seed))
#aboost = AdaBoostClassifier(random_state=random_seed, n_estimators=1000)

In [22]:
_ = aboost.fit(train_X, train_y)

In [23]:
y_pred = aboost.predict(valid_X)

In [24]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.738255033557047
Accuracy Score:   0.9646666666666667
Precision Score:  0.8870967741935484
F1 Score:         0.8058608058608058


Save the recall result from this model

In [25]:
aboost_recall = recall_score(y_test, y_pred)

## Prediction with GradientBoostingClassifier

Like all our classifiers, GradientBoostingClassifier has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree. 
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is 3.
* learning_rate: The learning rate shrinks the contribution of each tree, which determines the step size at each iteration while your model optimizes toward its objective. 
    - A low learning rate makes computation slower, and requires more rounds to achieve the same reduction in residual error as a model with a high learning rate, but it often generalizes better.
    - Larger learning rates may not converge on a solution.
    - The value must be greater than 0. Default is 0.1.
* n_estimators: The number of trees in our ensemble. 
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* The other tuning parameters are similar to DecisionTree's and were covered last class.
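Because learning_rate, n_estimators, and max_depth interact, a full grid grows quickly. A minimal sketch using RandomizedSearchCV, which samples a fixed number of combinations, is shown below. The data is synthetic and the candidate values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Illustrative candidate values: 4 x 3 x 3 = 36 possible combinations
param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [25, 50, 100],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # evaluate only 5 of the 36 combinations
    scoring="recall",
    cv=3,
    random_state=1,    # makes the sampled combinations repeatable
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Raising `n_iter` trades run time for a more thorough search; with `n_iter` equal to the grid size the result matches an exhaustive GridSearchCV.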

In [26]:
gboost = GradientBoostingClassifier(random_state=random_seed)

In [27]:
_ = gboost.fit(train_X, train_y)

In [28]:
y_pred = gboost.predict(valid_X)

In [29]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8791946308724832
Accuracy Score:   0.9833333333333333
Precision Score:  0.9492753623188406
F1 Score:         0.9128919860627177


Save the recall result from this model

In [30]:
gboost_recall = recall_score(y_test, y_pred)

## Prediction with XGBoost

Like all our classifiers, XGBoost has a number of parameters that can be adjusted/tuned. In the example below, we simply accept the defaults. You may want to experiment with changing the default values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree. 
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is 6.
* learning_rate: The learning rate determines the step size at each iteration while your model optimizes toward its objective. 
    - A low learning rate makes computation slower, and requires more rounds to achieve the same reduction in residual error as a model with a high learning rate, but it often generalizes better.
    - The value must be between 0 and 1. Default is 0.3.
* n_estimators: The number of trees in our ensemble. 
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* colsample_bytree: Represents the fraction of columns to be randomly sampled for each tree. 
    - Values below 1 may reduce overfitting.
    - The value must be greater than 0 and at most 1. Default is 1.
* subsample: Represents the fraction of observations to be sampled for each tree. 
    - Lower values help prevent overfitting but might lead to underfitting.
    - The value must be greater than 0 and at most 1. Default is 1.

In [31]:
xgboost = XGBClassifier(random_state=random_seed)

In [32]:
_ = xgboost.fit(train_X, train_y)





In [33]:
y_pred = xgboost.predict(valid_X)

In [34]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8791946308724832
Accuracy Score:   0.9846666666666667
Precision Score:  0.9632352941176471
F1 Score:         0.9192982456140351


Save the recall result from this model

In [35]:
xgboost_recall = recall_score(y_test, y_pred)

## Summarize results

As usual, in this section you provide a recap of your approach, your results, and a discussion of findings.


In [36]:
print("Recall scores...")
print(f"{'Decision Tree:':18}{dtree_recall}")
print(f"{'Random Forest:':18}{rforest_recall}")
print(f"{'Ada Boosted Tree:':18}{aboost_recall}")
print(f"{'Gradient Tree:':18}{gboost_recall}")
print(f"{'XGBoost Tree:':18}{xgboost_recall}")


Recall scores...
Decision Tree:    0.87248322147651
Random Forest:    0.8523489932885906
Ada Boosted Tree: 0.738255033557047
Gradient Tree:    0.8791946308724832
XGBoost Tree:     0.8791946308724832
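One way to make the comparison easier to read is to rank the scores. The sketch below copies the recall values from the output above into a pandas Series (the `recall_scores` and `ranked` names are illustrative) and sorts them.

```python
import pandas as pd

# Recall values copied from the output above
recall_scores = pd.Series({
    "Decision Tree": 0.87248322147651,
    "Random Forest": 0.8523489932885906,
    "Ada Boosted Tree": 0.738255033557047,
    "Gradient Tree": 0.8791946308724832,
    "XGBoost Tree": 0.8791946308724832,
})

# Highest recall first
ranked = recall_scores.sort_values(ascending=False)
print(ranked.round(3))
```

On this split, the gradient boosted tree and XGBoost tie for the highest recall, and AdaBoost with default settings has the lowest.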
