<a href="https://colab.research.google.com/github/IndraniMandal/CSC310-S20/blob/master/23a_ensemble.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/IndraniMandal/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)      # add home folder to module search path

Cloning into 'ds-assets'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 205 (delta 54), reused 50 (delta 50), pack-reused 147 (from 1)
Receiving objects: 100% (205/205), 12.58 MiB | 6.35 MiB/s, done.
Resolving deltas: 100% (80/80), done.


# Ensemble Techniques

We start by asking the following questions:

> What is better than one classifier? Answer: Two classifiers.

> What is better than two classifiers? Answer: Three classifiers.

> What is better than three classifiers? Answer: As many classifiers as you can computationally afford!

This gives rise to the notion of **ensemble techniques**, which combine multiple classifiers to form a single meta-classifier.  The idea is that each individual classifier captures a different part of the domain and then contributes to the overall solution.

Even though we started the discussion with classifiers this also extends to regressors. To show that the techniques discussed below extend to both classes of models we refer to the machine learning models as "learners".  In particular, we call them "weak learners" in the sense that we don't require them to be particularly complex.  It has been demonstrated that ensemble techniques work with extremely weak learners such as decision trees limited to a depth of one.

Two of the more popular approaches to ensemble techniques are **bagging** and **boosting**.

# Bagging

  In bagging the learners act in **parallel** as is demonstrated by the following figure.

  <center>

  <!-- ![figure](https://miro.medium.com/v2/resize:fit:1050/1*a6hnuJ8WM37mLimHfMORmQ.png) -->

<img src="https://miro.medium.com/v2/resize:fit:1050/1*a6hnuJ8WM37mLimHfMORmQ.png"  height="300" width="625">

  </center>

  [source](https://medium.com/@brijesh_soni/boost-your-machine-learning-models-with-bagging-a-powerful-ensemble-learning-technique-692bfc4d1a51)

Notice that in steps 1-3 each of the learners is trained on a different **bootstrap sample** of the original dataset.  Given our previous discussion we know that bootstrap samples have the ability to expose variability of a given training set.  With this in mind we see that each learner is trained on a slightly different dataset exposing different aspects of the original data.  Once each learner has been trained they make a prediction (step 4).  These predictions are then aggregated in step 5.  

For classification the aggregation is usually some form of **voting** and for regression problems it is common to take the **mean** of the various predictions.
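The five steps in the figure can be sketched by hand. The following is a minimal illustration, assuming synthetic two-class data from `make_classification`, depth-limited decision trees as learners, and 25 learners; none of these choices come from the figure itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic two-class data, used here only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n_learners = 25   # odd number, so a majority vote cannot tie
learners = []

# steps 1-3: train each learner on its own bootstrap sample
for _ in range(n_learners):
    idx = rng.integers(0, len(X), size=len(X))   # sample rows with replacement
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    learners.append(tree.fit(X[idx], y[idx]))

# step 4: every learner makes its own prediction
all_preds = np.array([t.predict(X) for t in learners])   # (n_learners, n_samples)

# step 5: aggregate by majority vote (labels are 0/1, so the mean acts as a vote share)
vote_share = all_preds.mean(axis=0)
bagged_pred = (vote_share > 0.5).astype(int)

print("ensemble training accuracy: {:3.2f}".format((bagged_pred == y).mean()))
```

For a regression problem the last step would simply be `all_preds.mean(axis=0)` without the threshold.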



## Random Forests

Perhaps the most well known machine learning model based on bagging is the Random Forest where each learner is either a classification or regression tree depending on whether we are looking at a classification or regression problem.

The interesting thing about random forests is that they not only bootstrap sample the data rows but they also perform something called **feature bagging**, which means that each tree considers only a **random sample of features** at each split instead of the entire feature set. This helps to improve the diversity of the trees in a similar way as creating the bootstrap samples of the data rows.

Because of feature bagging random forests tend to work really well with high-dimensional problems.
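To make feature bagging concrete, here is a small sketch that assumes a synthetic 64-feature problem and a forest of 10 trees. In scikit-learn's implementation the random feature sample is redrawn at every split, and its size is controlled by `max_features`; the fitted trees expose the resolved value as `max_features_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic data with 64 features, chosen only for illustration
X, y = make_classification(n_samples=200, n_features=64,
                           n_informative=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10,
                                max_features='sqrt',
                                random_state=0)
forest.fit(X, y)

# number of candidate features each tree inspects at a split: sqrt(64) = 8
n_candidates = forest.estimators_[0].max_features_
print("features considered per split: {}".format(n_candidates))

# features each tree actually ended up splitting on (leaves are marked with -2)
used = [{int(f) for f in t.tree_.feature if f >= 0} for t in forest.estimators_]
for i, features in enumerate(used):
    print("tree {} splits on {} distinct features".format(i, len(features)))
```

Because every split sees a different random subset, the trees end up relying on different features, which is the source of the added diversity.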


## Text Classification with Random Forests

From our work before we know that text classification tends to be very high-dimensional due to the vector model.

In [2]:
# data handling
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

# model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from confint import classification_confint

In [3]:
# get the newsgroup database
#newsgroups = pd.read_csv(home+"newsgroups.csv")
newsgroups = pd.read_csv(home+"newsgroups-noheaders.csv")
newsgroups.head(n=10)

Unnamed: 0,text,label
0,\nIn billions of dollars (%GNP):\nyear GNP ...,space
1,ajteel@dendrite.cs.Colorado.EDU (A.J. Teel) w...,space
2,\nMy opinion is this: In a society whose econ...,space
3,"Ahhh, remember the days of Yesterday? When we...",space
4,"\n""...a la Chrysler""?? Okay kids, to the near...",space
5,"\n As for advertising -- sure, why not? A N...",politics
6,"\n What, pray tell, does this mean? Just who ...",space
7,\nWhere does the shadow come from? There's no...,politics
8,^^^^^^^^^...,politics
9,"#Yet, when a law was proposed for Virginia tha...",space


In [4]:
# construct the docterm matrix

# build the stemmer object
stemmer = PorterStemmer()

# build a new default analyzer using CountVectorizer that only
# uses words, [a-zA-Z]+, and also eliminates stop words
analyzer= CountVectorizer(analyzer = "word",
                          stop_words = 'english',
                          token_pattern = "[a-zA-Z]+").build_analyzer()

# build a new analyzer that stems using the default analyzer to
# create the words to be stemmed
def stemmed_words(doc):
    return [stemmer.stem(w) for w in analyzer(doc)]

# build docarray
vectorizer = CountVectorizer(analyzer=stemmed_words,
                             binary=True,
                             min_df=2) # each word has to appear in at least two documents
docarray = vectorizer.fit_transform(newsgroups['text']).toarray()
docterm = pd.DataFrame(docarray, columns=list(vectorizer.get_feature_names_out()))
print("We have {} articles with {} features".format(docterm.shape[0],docterm.shape[1]))
docterm.head()

We have 1038 articles with 6045 features


Unnamed: 0,aa,abandon,abbey,abc,abil,abl,aboard,abolish,abort,abroad,...,yugoslavia,yup,z,zealand,zenit,zero,zeta,zip,zone,zoo
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [5]:
# set up train and test sets
X_train, X_test, y_train, y_test = \
  train_test_split(docterm,   # as X
                   newsgroups['label'],  # as y
                   train_size=0.8,
                   test_size=0.2,
                   random_state=2)

The tricky part with random forest is to figure out how many learners/estimators to incorporate into the meta-model.  Here is a simple rule of thumb:

$$
\mbox{n_estimators} = \frac{2*\mbox{n_features}}{\sqrt{\mbox{n_features}}}
$$

In our text classification case we have $\mbox{n_features}\approx 6000$ therefore,

$$
\mbox{n_estimators} = \frac{2*6000}{\sqrt{\mbox{6000}}} = 155 \approx 200
$$

Given that this is just a rule of thumb, it is a good idea to round the result up to a convenient round number, here 200.
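The arithmetic behind the rule of thumb is easy to check. The helper name `rule_of_thumb_estimators` below is hypothetical and only wraps the formula given above; note that the expression simplifies to $2\sqrt{\mbox{n_features}}$.

```python
import math

def rule_of_thumb_estimators(n_features):
    # 2*n/sqrt(n), which is the same as 2*sqrt(n)
    return 2 * n_features / math.sqrt(n_features)

for n in (6000, 64):
    print(n, round(rule_of_thumb_estimators(n)))
# prints:
# 6000 155
# 64 16
```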

In [6]:
# train model

# model object
model = RandomForestClassifier(
    max_features = 'sqrt', # sqrt(6000) ~ 80
    bootstrap = True,      # use bootstrap samples
    n_estimators = 200,    # 80*200 ~ 16000 -- good feature coverage
    criterion = 'gini',    # assume this for now cutting down on search
    random_state = 0
)

# grid search
param_grid = {
    'max_depth':list(range(20,41))  # NLP problem, deep trees!
}
grid = GridSearchCV(model, param_grid, cv=3, verbose=10, n_jobs=-1)
grid.fit(X_train, y_train)
print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Fitting 3 folds for each of 21 candidates, totalling 63 fits
Grid Search: best parameters: {'max_depth': 35}


In [7]:
# Evaluate the best model
predict_y = best_model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.90 (0.86,0.94)


In [8]:
# build the confusion matrix
labels = ['politics','space']
cm = confusion_matrix(y_test, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df

Unnamed: 0,politics,space
politics,107,2
space,18,81


**Observation**: The random forest performs significantly better than the decision tree on this data set.  Consider the following results,

**decision tree**: Accuracy: 0.74 (0.68,0.80)

**random forest**: Accuracy: 0.90 (0.86,0.94)

**naive bayes**: 0.96 (0.93,0.98)

Given these results we can see that the performance difference between decision trees and random forests is statistically significant because their confidence intervals do not overlap.  The intervals for random forests and naive bayes do overlap, so the performance difference between them is **not** statistically significant.
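The significance argument rests on whether two confidence intervals overlap, which can be checked mechanically. The sketch below is a minimal illustration: `normal_approx_confint` and `intervals_overlap` are hypothetical helpers introduced here, and the normal approximation is an assumption, since the internals of `classification_confint` are not shown in this notebook.

```python
import math

def normal_approx_confint(acc, n, z=1.96):
    # assumed 95% normal-approximation interval for an accuracy on n test points
    half = z * math.sqrt(acc * (1 - acc) / n)
    return (max(0.0, acc - half), min(1.0, acc + half))

def intervals_overlap(a, b):
    # two closed intervals overlap if each one starts before the other ends
    return a[0] <= b[1] and b[0] <= a[1]

# accuracy of the random forest taken from the confusion matrix above
n_test = 107 + 2 + 18 + 81
acc = (107 + 81) / n_test
lb, ub = normal_approx_confint(acc, n_test)
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc, lb, ub))

# intervals as reported in the text
tree   = (0.68, 0.80)
forest = (0.86, 0.94)
bayes  = (0.93, 0.98)

print(intervals_overlap(tree, forest))   # False: significant difference
print(intervals_overlap(forest, bayes))  # True: not significant
```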


## Handwritten Digit Classification

Another high-dimensional problem we encountered before is the handwritten digit classification problem, which lives in a 64-dimensional feature space.

In [9]:
# we need UCI repo access
!pip install ucimlrepo

import numpy as np # we need numpy arrays
from ucimlrepo import fetch_ucirepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [10]:
# fetch dataset
digits = fetch_ucirepo(id=80)

# data (as pandas dataframes)
X = digits.data.features
X.head()

Unnamed: 0,Attribute1,Attribute2,Attribute3,Attribute4,Attribute5,Attribute6,Attribute7,Attribute8,Attribute9,Attribute10,...,Attribute55,Attribute56,Attribute57,Attribute58,Attribute59,Attribute60,Attribute61,Attribute62,Attribute63,Attribute64
0,0,1,6,15,12,1,0,0,0,7,...,0,0,0,0,6,14,7,1,0,0
1,0,0,10,16,6,0,0,0,0,7,...,3,0,0,0,10,16,15,3,0,0
2,0,0,8,15,16,13,0,0,0,1,...,0,0,0,0,9,14,0,0,0,0
3,0,0,0,3,11,16,0,0,0,0,...,0,0,0,0,0,1,15,2,0,0
4,0,0,5,14,4,0,0,0,0,0,...,12,0,0,0,4,12,14,7,0,0


In [11]:
y = digits.data.targets
y.head()

Unnamed: 0,class
0,0
1,0
2,7
3,4
4,6


In [12]:
# setting up training/testing data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y['class'], # we want a series as target
    train_size=0.8,
    test_size=0.2,
    random_state=1
)

Let's do the rule-of-thumb calculation again to give us the number of estimators,

$$
\mbox{n_estimators} = \frac{2*64}{\sqrt{\mbox{64}}} = 16 \approx 20
$$


In [None]:
# train model

# model object
model = RandomForestClassifier(
    max_features = 'sqrt', # sqrt(64) = 8
    bootstrap = True,      # use bootstrap samples
    n_estimators = 20,     # 8*20 ~ 160 -- good feature coverage
    criterion = 'entropy', # assume this for now cutting down on search
    random_state = 0
)

# grid search
param_grid = {
    'max_depth':list(range(10,21))
}
grid = GridSearchCV(model, param_grid, cv=3, verbose=10, n_jobs=-1)
grid.fit(X_train, y_train)
print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Fitting 3 folds for each of 11 candidates, totalling 33 fits


In [None]:
# Evaluate the best model
predict_y = best_model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

In [None]:
# build the confusion matrix
cm = confusion_matrix(y_test, predict_y)
cm_df = pd.DataFrame(cm)
cm_df

**Observation**: Very few mistakes and no digit stands out in terms of mistakes.  Furthermore, the performance increase over decision trees is statistically significant!

**decision tree**: 0.92 (0.90, 0.93)

**random forest**: 0.98 (0.97,0.99)

The confidence intervals do not overlap!

## Takeaway

Random forests demonstrate that **bagging significantly improves the performance** of learners such as decision trees via an ensemble technique based on bootstrap samples and feature bagging.

# Boosting

The kind of boosting we are talking about here is **gradient boosting**, where weak learners act in series, each trying to "rectify" the mistakes that the previous learner made.  This gives rise to the following figure.


<center>

<!-- ![figure](https://miro.medium.com/v2/resize:fit:1358/1*4XuD6oRrgVqtaSwH-cu6SA.png) -->

<img src="https://miro.medium.com/v2/resize:fit:1358/1*4XuD6oRrgVqtaSwH-cu6SA.png"  height="300" width="625">


</center>

[source](https://medium.com/@brijeshsoni121272/understanding-boosting-in-machine-learning-a-comprehensive-guide-bdeaa1167a6)

Here we see a gradient boosted model with m stages.  After training and testing the initial stage (steps 1-3), the mistakes of the first stage are incorporated into the training of the second stage (steps 4,5).  This pattern is repeated until the last stage can provide the overall prediction.  It is considered an ensemble technique because the boosted model consists of many learning models.

The name "gradient boosting" comes from the fact that step 4 can be interpreted as a **gradient descent optimization** of the loss function (a function that describes the errors a learner makes).
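The gradient descent interpretation is easiest to see for regression with squared loss, where the negative gradient of the loss with respect to the current prediction is just the residual. The following is a minimal sketch under that assumption, with synthetic data, depth-one trees as weak learners, and an illustrative learning rate of 0.1; it is not how sklearn implements its boosted models internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic one-dimensional regression data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 100

# initial stage: a constant prediction
prediction = np.full_like(y, y.mean())
mse_start = np.mean((y - prediction) ** 2)
stages = []

for _ in range(n_stages):
    # negative gradient of 1/2*(y - F)^2 with respect to F is the residual
    residuals = y - prediction
    # the next weak learner is trained on the mistakes of the model so far
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # take a small step in the direction that reduces the loss
    prediction += learning_rate * stump.predict(X)
    stages.append(stump)

mse_end = np.mean((y - prediction) ** 2)
print("MSE before boosting: {:5.3f}, after: {:5.3f}".format(mse_start, mse_end))
```

Each stage on its own is a very weak model; the quality comes from adding up many small corrections.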

Here we look at gradient boosting as implemented in sklearn.  We use the same two examples as before in order to study gradient boosting.

## Text Classification with Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
# set up train and test sets -- use the docterm matrix from above
X_train, X_test, y_train, y_test = \
  train_test_split(docterm,   # as X
                   newsgroups['label'],  # as y
                   train_size=0.8,
                   test_size=0.2,
                   random_state=2)

In [None]:
# train model

# model object
model = GradientBoostingClassifier(
    max_depth = 3,       # tree complexity has almost no impact
    n_estimators = 400,  # the more stages the better the performance
    random_state = 0
)

model.fit(X_train, y_train)

In [None]:
# Evaluate the best model
predict_y = model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

In [None]:
# build the confusion matrix
labels = ['politics','space']
cm = confusion_matrix(y_test, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df

**Observation**: Gradient boosting performs significantly better than the decision tree on this data set and is comparable to the random forest.  Consider the following results,

**decision tree**: Accuracy: 0.74 (0.68,0.80)

**random forest**: Accuracy: 0.90 (0.86,0.94)

**gradient boosting**: Accuracy: 0.91 (0.87,0.95)

**naive bayes**: 0.96 (0.93,0.98)

Given these results we can see that the performance difference between decision trees and gradient boosting is statistically significant.  We also see that the performance differences between gradient boosting and both random forests and naive bayes are **not** statistically significant, since the respective confidence intervals overlap.


## Handwritten Digit Classification with Gradient Boosting

In [None]:
# fetch dataset
digits = fetch_ucirepo(id=80)

# data (as pandas dataframes)
X = digits.data.features
X.head()

In [None]:
y = digits.data.targets
y.head()

In [None]:
# setting up training/testing data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y['class'], # we want a series as target
    train_size=0.8,
    test_size=0.2,
    random_state=1
)

In [None]:
# model object
model = GradientBoostingClassifier(
    max_depth = 3,       # tree complexity has almost no impact
    n_estimators = 400,  # the more stages the better the performance
    random_state = 0
)

model.fit(X_train, y_train)

In [None]:
# Evaluate the best model
predict_y = model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

In [None]:
# build the confusion matrix
cm = confusion_matrix(y_test, predict_y)
cm_df = pd.DataFrame(cm)
cm_df

**Observation**: Very few mistakes. The digit seven is being misclassified as three in 2 instances.  

**decision tree**: 0.92 (0.90, 0.93)

**random forest**: 0.98 (0.97,0.99)

**gradient boosting**: 0.99 (0.98,0.99)



## Takeaway

The performances of random forests and gradient boosting are comparable.  Both ensemble strategies significantly improve performance over decision trees in both of our high-dimensional domains.

# Ensemble Techniques and Regression

All the ensemble models discussed here are also available as regression models in sklearn: **RandomForestRegressor** and **GradientBoostingRegressor**.
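As a brief illustration, both regressors follow the same fit/predict pattern as the classifiers above. The sketch below assumes synthetic data from `make_regression` and 100 estimators for each model; `score` reports the coefficient of determination $R^2$ on the test set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# synthetic regression data, used here only for illustration
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.8,
    test_size=0.2,
    random_state=1
)

models = {
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # R^2 on the held-out data
    print("{}: R^2 = {:3.2f}".format(name, scores[name]))
```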