## Bagging and Random Forest Models


Ensemble methods can greatly improve on the results achieved with weak machine learning algorithms, also called weak learners. They achieve better performance by aggregating the results of many statistically independent models. This process averages out the errors and produces a better final prediction.
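To make the averaging argument concrete, the sketch below simulates the errors of several independent models and compares the spread of a single model's error with the spread of the averaged error. The ensemble size, the number of trials and the unit-variance normal errors are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25       ## Hypothetical ensemble size
n_trials = 10000    ## Number of simulated predictions

## Each column holds the errors of one model: independent, zero mean, unit variance
errors = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

single_model_std = errors[:, 0].std()
ensemble_std = errors.mean(axis=1).std()

print('STD of a single model error = %4.3f' % single_model_std)
print('STD of the averaged error   = %4.3f' % ensemble_std)
print('Theoretical 1/sqrt(N)       = %4.3f' % (1.0 / np.sqrt(n_models)))
```

The reduction by a factor of roughly 1/sqrt(N) only holds when the errors are independent. Correlated learners gain less from averaging, which is why bagging trains each learner on a different subsample.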

In this lab you will work with a widely used ensemble method known as bootstrap aggregating or simply bagging. Bagging follows a simple procedure:

* N learners (machine learning models) are defined.
* N subsamples of the training data are created by random sampling with replacement (bootstrap sampling).
* The N learners are trained on the subsamples of the training data.
* The ensemble is scored by averaging, or taking a majority vote, of the predictions from the N learners.
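The procedure above can be sketched by hand with decision trees as the learners. This is a minimal illustration on synthetic data, not the implementation used by scikit-learn's random forest. The dataset, the number of learners and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## Synthetic two-class data, used only for illustration
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=200, random_state=0)

rng = np.random.default_rng(0)
n_learners = 25  ## An odd number avoids ties in the majority vote

learners = []
for i in range(n_learners):
    ## Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, X_tr.shape[0], size=X_tr.shape[0])
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    learners.append(tree)

## Majority vote over the N learners (labels are 0 or 1)
votes = np.stack([tree.predict(X_te) for tree in learners])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean(learners[0].predict(X_te) == y_te)
ensemble_acc = np.mean(ensemble_pred == y_te)
print('Accuracy of the first tree = %4.3f' % single_acc)
print('Accuracy of the ensemble   = %4.3f' % ensemble_acc)
```

A random forest adds one more source of randomness to this scheme: each split considers only a random subset of the features, which further decorrelates the trees.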

Classification and regression tree models are most typically used with bagging methods. The most common such algorithm is known as the random forest. The random forest method is highly scalable and generally produces good results, even for complex problems.

Classification and regression trees tend to be robust to noise or outliers in the training data. This is true for the random forest algorithm as well.

### Example: Credit card dataset


In this lab you will apply the random forest algorithm to a credit card information dataset in order to identify bad credit risks. You will use the prepared credit scoring data, which had the following preprocessing:

* Cleaning missing values.
* Aggregating categories of certain categorical variables.
* Encoding categorical variables as binary dummy variables.
* Standardizing numeric variables.

Execute the code in the cell below to load the features and labels as numpy arrays for the example.

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn import datasets ## Get dataset from sklearn
import sklearn.model_selection as ms
import sklearn.metrics as sklm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import numpy.random as nr

%matplotlib inline

In [2]:
Features = np.array(pd.read_csv('Credit_Features.csv'))
Labels = np.array(pd.read_csv('Credit_Labels.csv'))
Labels = Labels.reshape(Labels.shape[0],)
print(Features.shape)
print(Labels.shape)

(1000, 35)
(1000,)
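The preprocessing itself is not shown in this lab. As a rough sketch of the last two steps listed earlier (dummy variable encoding and standardization), the code below processes a tiny, made-up data frame. The column names and values are hypothetical and are not taken from the credit data.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

## Hypothetical raw data with one categorical and one numeric column
raw = pd.DataFrame({
    'purpose': ['car', 'furniture', 'car', 'education'],
    'loan_amount': [1200.0, 5400.0, 800.0, 3100.0],
})

## Encode the categorical variable as binary dummy variables
dummies = pd.get_dummies(raw['purpose']).to_numpy(dtype=float)

## Standardize the numeric variable to zero mean and unit variance
scaler = preprocessing.StandardScaler()
numeric = scaler.fit_transform(raw[['loan_amount']])

features_demo = np.concatenate([dummies, numeric], axis=1)
print(features_demo.shape)
```

In practice the scaler would normally be fit on the training cases only and then applied to the test cases, so that no information leaks from the test data.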


Nested cross validation is used to estimate the optimal hyperparameters and perform model selection for the random forest model. Since random forest models are efficient to train, 10 fold cross validation is used. Execute the code in the cell below to define inside and outside fold objects.

In [3]:
nr.seed(123)
inside = ms.KFold(n_splits=10, shuffle = True)
nr.seed(321)
outside = ms.KFold(n_splits=10, shuffle = True)
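To see what a fold object produces, the sketch below splits ten hypothetical cases into five shuffled folds and checks that every case lands in a test fold exactly once. It passes `random_state` to `KFold` directly, which is one alternative to seeding numpy's global generator. The toy array and the number of splits are assumptions for illustration.

```python
import numpy as np
import sklearn.model_selection as ms

X_demo = np.arange(20).reshape(10, 2)  ## Ten hypothetical cases

## random_state fixes the shuffle for this fold object
folds = ms.KFold(n_splits=5, shuffle=True, random_state=123)

test_counts = np.zeros(X_demo.shape[0], dtype=int)
for i, (train_idx, test_idx) in enumerate(folds.split(X_demo)):
    test_counts[test_idx] += 1
    print('Fold %d  train = %s  test = %s' % (i + 1, train_idx, test_idx))

## Count of how many times each case was held out
print(test_counts)
```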

The code in the cell below estimates the best hyperparameters using 10 fold cross validation. There are a few points to note here:

1. In this case, a grid of two hyperparameters is searched:

    * max_features determines the maximum number of features considered when searching for each split. Limiting the number of features can prevent over-fitting, but induces bias.

    * min_samples_leaf determines the minimum number of samples which must be on each terminal node (leaf) of the tree. Maintaining a minimum number of samples per terminal node is a regularization method. Having too few samples on terminal nodes allows the model training to memorize the data, leading to high variance. Forcing too many samples on the terminal nodes leads to biased predictions.

2. Since there is a class imbalance and a difference in the cost to the bank of misclassification of a bad credit risk customer, the "balanced" argument is used. The balanced argument weights each class inversely proportional to its frequency in the training data, so that errors on the minority class count for more.
3. The model is fit on each set of hyperparameters from the grid.
4. The best estimated hyperparameters are printed.

Notice that the model uses regularization rather than feature selection. The hyperparameter search is intended to optimize the level of regularization.
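As a rough illustration of what `class_weight = "balanced"` computes, the sketch below uses scikit-learn's `compute_class_weight` helper on a made-up label vector and compares the result with the formula n_samples / (n_classes * class_count). The 700/300 class split is an assumption for the example, not a count taken from the credit data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

## Hypothetical imbalanced labels: 700 cases of class 0 and 300 of class 1
y_demo = np.array([0] * 700 + [1] * 300)
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)

## The same weights computed directly from the class counts
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

for c, w in zip(classes, weights):
    print('Class %d weight = %4.3f' % (c, w))
```

The minority class receives the larger weight, so misclassifying one of its cases costs more during training.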

In [4]:
## Define the dictionary for the grid search and the model object to search on
param_grid = {"max_features": [2, 3, 5, 10, 15], "min_samples_leaf":[3, 5, 10, 20]}
## Define the random forest model
nr.seed(3456)
rf_clf = RandomForestClassifier(class_weight = "balanced") # class_weight = {0:0.33, 1:0.67}) 

## Perform the grid search over the parameters
nr.seed(4455)
rf_clf = ms.GridSearchCV(estimator = rf_clf, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      scoring = 'roc_auc',
                      return_train_score = True)
rf_clf.fit(Features, Labels)
print(rf_clf.best_estimator_.max_features)
print(rf_clf.best_estimator_.min_samples_leaf)

3
10
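Beyond the best values, a fitted `GridSearchCV` object keeps the scores for every combination in its `cv_results_` attribute. The sketch below shows one way to inspect them. It runs on synthetic data with a deliberately small grid and forest so that it finishes quickly. The data, the grid values and the variable names are assumptions.

```python
import pandas as pd
import sklearn.model_selection as ms
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

## Synthetic imbalanced data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=4,
                                     weights=[0.7, 0.3], random_state=0)

grid = {"max_features": [2, 5], "min_samples_leaf": [3, 10]}
forest = RandomForestClassifier(n_estimators=20, class_weight="balanced",
                                random_state=0)
search = ms.GridSearchCV(estimator=forest, param_grid=grid,
                         cv=ms.KFold(n_splits=3, shuffle=True, random_state=0),
                         scoring='roc_auc',
                         return_train_score=True)
search.fit(X_demo, y_demo)

## One row per hyperparameter combination
results = pd.DataFrame(search.cv_results_)
cols = ['param_max_features', 'param_min_samples_leaf',
        'mean_train_score', 'mean_test_score', 'std_test_score']
print(results[cols].sort_values('mean_test_score', ascending=False))
print(search.best_params_)
```

A large gap between `mean_train_score` and `mean_test_score` for a combination suggests over-fitting, which is what the regularization hyperparameters are meant to control.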


Now, you will run the code in the cell below to perform the outer cross validation of the model.



In [5]:
nr.seed(498)
cv_estimate = ms.cross_val_score(rf_clf, Features, Labels, 
                                 cv = outside) # Use the outside folds

print('Mean performance metric = %4.3f' % np.mean(cv_estimate))
print('STD of the metric       = %4.3f' % np.std(cv_estimate))
print('Outcomes by cv fold')
for i, x in enumerate(cv_estimate):
    print('Fold %2d    %4.3f' % (i+1, x))

Mean performance metric = 0.778
STD of the metric       = 0.045
Outcomes by cv fold
Fold  1    0.782
Fold  2    0.728
Fold  3    0.704
Fold  4    0.739
Fold  5    0.789
Fold  6    0.828
Fold  7    0.772
Fold  8    0.864
Fold  9    0.806
Fold 10    0.765


Examine these results. Notice that the standard deviation of the AUC across the folds is more than an order of magnitude smaller than the mean. This indicates that this model is likely to generalize well.

Now, you will build and test a model using the estimated optimal hyperparameters. As a first step, execute the code in the cell below to create training and test datasets.

In [6]:
## Randomly sample cases to create independent training and test data
nr.seed(1115)
indx = range(Features.shape[0])
indx = ms.train_test_split(indx, test_size = 300)
X_train = Features[indx[0],:]
y_train = np.ravel(Labels[indx[0]])
X_test = Features[indx[1],:]
y_test = np.ravel(Labels[indx[1]])
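The cell above splits an index range and then uses the indices to slice the arrays. As an alternative sketch, `train_test_split` can also split the feature and label arrays directly, and its `stratify` argument keeps the class proportions the same in both subsets. The random arrays below are stand-ins with the same shape as the credit data, and stratification is an assumption that the original cell does not use.

```python
import numpy as np
import sklearn.model_selection as ms

## Stand-in arrays with the same shapes as the lab data
rng = np.random.default_rng(0)
Features_demo = rng.normal(size=(1000, 35))
Labels_demo = (rng.random(1000) < 0.3).astype(int)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = ms.train_test_split(
    Features_demo, Labels_demo,
    test_size=300,          ## Number of test cases, as in the lab
    random_state=1115,      ## Fixes the split without seeding numpy globally
    stratify=Labels_demo)   ## Preserve the class proportions in both subsets

print(X_train_demo.shape, X_test_demo.shape)
print('Fraction of class 1 in train = %4.3f' % y_train_demo.mean())
print('Fraction of class 1 in test  = %4.3f' % y_test_demo.mean())
```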

The code in the cell below defines a random forest model object using the estimated optimal model hyperparameters and then fits the model to the training data. Execute this code.

In [8]:
nr.seed(1115)
rf_mod = RandomForestClassifier(class_weight = "balanced", 
                                max_features = rf_clf.best_estimator_.max_features, 
                                min_samples_leaf = rf_clf.best_estimator_.min_samples_leaf) 
rf_mod.fit(X_train, y_train)

RandomForestClassifier(class_weight='balanced', max_features=3,
                       min_samples_leaf=10)

As expected, the hyperparameters of the random forest model object reflect those specified.

The code in the cell below scores and prints evaluation metrics for the model, using the test data subset.

In [10]:
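def score_model(probs, threshold):
    ## Label a case 1 if the probability of class 1 exceeds the threshold, else 0
    return np.array([1 if x > threshold else 0 for x in probs[:,1]])

def print_metrics(labels, probs, threshold):
    ## Rows and columns of the confusion matrix are ordered by label value, so
    ## index 0 is label 0 (printed as positive) and index 1 is label 1 (printed as negative)
    scores = score_model(probs, threshold)
    metrics = sklm.precision_recall_fscore_support(labels, scores)
    conf = sklm.confusion_matrix(labels, scores)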
    print('                 Confusion matrix')
    print('                 Score positive    Score negative')
    print('Actual positive    %6d' % conf[0,0] + '             %5d' % conf[0,1])
    print('Actual negative    %6d' % conf[1,0] + '             %5d' % conf[1,1])
    print('')
    print('Accuracy        %0.2f' % sklm.accuracy_score(labels, scores))
    print('AUC             %0.2f' % sklm.roc_auc_score(labels, probs[:,1]))
    print('Macro precision %0.2f' % float((float(metrics[0][0]) + float(metrics[0][1]))/2.0))
    print('Macro recall    %0.2f' % float((float(metrics[1][0]) + float(metrics[1][1]))/2.0))
    print(' ')
    print('           Positive      Negative')
    print('Num case   %6d' % metrics[3][0] + '        %6d' % metrics[3][1])
    print('Precision  %6.2f' % metrics[0][0] + '        %6.2f' % metrics[0][1])
    print('Recall     %6.2f' % metrics[1][0] + '        %6.2f' % metrics[1][1])
    print('F1         %6.2f' % metrics[2][0] + '        %6.2f' % metrics[2][1])
    
probabilities = rf_mod.predict_proba(X_test)
print_metrics(y_test, probabilities, 0.5)     

                 Confusion matrix
                 Score positive    Score negative
Actual positive       152                60
Actual negative        21                67

Accuracy        0.73
AUC             0.80
Macro precision 0.70
Macro recall    0.74
 
           Positive      Negative
Num case      212            88
Precision    0.88          0.53
Recall       0.72          0.76
F1           0.79          0.62


Overall, these performance metrics look quite good. A large majority of negative (bad credit) cases are identified (67 of 88), at the expense of a significant number of positive (good credit) cases being scored as negative (60 of 212). The reported AUC is within a standard deviation of the figure obtained with cross validation, indicating that the model is generalizing well.
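The balance between the two kinds of error depends on the threshold passed to `score_model`. The sketch below reuses that function on made-up probabilities to show how recall for class 1 changes as the threshold moves. The probabilities and labels are simulated and are not outputs of the model above.

```python
import numpy as np

def score_model(probs, threshold):
    ## Same scoring rule as in the lab: label 1 if P(class 1) exceeds the threshold
    return np.array([1 if x > threshold else 0 for x in probs[:,1]])

## Simulated class 1 probabilities, with labels loosely tied to them
rng = np.random.default_rng(0)
p1 = rng.random(300)
probs_demo = np.column_stack([1.0 - p1, p1])
labels_demo = (rng.random(300) < p1).astype(int)

recalls = []
for threshold in [0.3, 0.4, 0.5, 0.6]:
    scores = score_model(probs_demo, threshold)
    ## Recall for class 1: fraction of actual class 1 cases scored as 1
    recall_1 = np.sum((scores == 1) & (labels_demo == 1)) / np.sum(labels_demo == 1)
    recalls.append(recall_1)
    print('Threshold %3.1f   scored as 1 = %3d   recall of class 1 = %4.2f'
          % (threshold, scores.sum(), recall_1))
```

Lowering the threshold scores more cases as class 1, so recall for that class can only rise, while more class 0 cases are misclassified. Where to set the threshold depends on the relative cost of the two errors to the bank.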