# Ensemble Methods

In this topic, we will learn how to combine (or simply ensemble) the models we have tried so that the combination predicts better than any of the individual models.

Commonly, the "weak" learners we combine are decision trees. In fact, a decision tree is the default base learner for most ensemble methods in sklearn. However, we can replace it with any of the models we have seen so far.
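
As a minimal sketch of swapping the base learner, the cell below bags logistic regressions instead of trees. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `bag_of_logits` are illustrative assumptions, not part of the spam example. The base learner is passed positionally because its keyword name differs across scikit-learn releases (`base_estimator` in older ones, `estimator` in newer ones).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (assumption, not the spam data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# First positional argument is the base learner; without it, sklearn
# falls back to a decision tree.
bag_of_logits = BaggingClassifier(
    LogisticRegression(max_iter=1000), n_estimators=5, random_state=0
)
bag_of_logits.fit(X_demo, y_demo)

# Each fitted member of the ensemble is a clone of the base learner.
base_learner_type = type(bag_of_logits.estimators_[0]).__name__
print(base_learner_type, len(bag_of_logits.estimators_))
```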

## Why do we need to ensemble learners?

Two competing sources of error matter when looking for a well-fitting machine learning model: **Bias** and **Variance**.

**Bias**: When a model has high bias, it doesn't do a good job of bending to the data. An example of an algorithm that usually has high bias is linear regression: even on very different datasets, it can only ever fit a straight line. High bias is a problem because the model underfits and misses real structure in the data.

**Variance**: When a model has high variance, it changes drastically to meet the needs of every point in our dataset. Linear models like the one above are low variance but high bias. An example of an algorithm that tends to have high variance and low bias is a decision tree (especially a decision tree with no early stopping parameters). A decision tree, as a high variance algorithm, will attempt to split every point into its own branch if possible. This is a trait of high variance, low bias algorithms: they are flexible enough to fit exactly whatever data they see, including its noise.

Ensemble methods combine many learners so that their individual errors partly cancel out, aiming for a better balance of bias and variance than any single model achieves.
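
To make the trade-off concrete, the sketch below refits three models on many noisy samples of a sine curve and measures how far the average prediction is from the truth (squared bias) and how much predictions move between samples (variance). The sine data, the noise level, and the helper `bias_variance` are assumptions chosen for illustration. It uses regressors rather than classifiers because bias and variance are easiest to measure on a numeric target.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_grid = np.linspace(0.5, 5.5, 40).reshape(-1, 1)  # fixed evaluation points
true_y = np.sin(x_grid).ravel()                    # noise-free target


def bias_variance(make_model, n_repeats=100, n_samples=50, noise=0.3):
    """Refit a fresh model on many noisy samples and summarize its predictions."""
    preds = np.empty((n_repeats, len(x_grid)))
    for i in range(n_repeats):
        x = rng.uniform(0, 2 * np.pi, size=(n_samples, 1))
        y = np.sin(x).ravel() + rng.normal(0, noise, size=n_samples)
        preds[i] = make_model().fit(x, y).predict(x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - true_y) ** 2)  # average fit vs truth
    variance = np.mean(preds.var(axis=0))                  # spread across refits
    return bias_sq, variance


results = {
    "linear regression": bias_variance(LinearRegression),
    "single deep tree": bias_variance(DecisionTreeRegressor),
    "bagged trees": bias_variance(
        lambda: BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=0)
    ),
}
for name, (bias_sq, variance) in results.items():
    print(f"{name:>18}: bias^2={bias_sq:.3f}  variance={variance:.3f}")
```

On this toy problem the straight line should show the largest squared bias, the single unpruned tree the largest variance, and the bagged trees a lower variance than the single tree. That reduction in variance is the motivation for ensembling.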

## Ensemble Methods in Scikit-Learn

In [None]:
from time import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

In [None]:
spam = pd.read_csv("../../data/spam.tsv", sep='\t', names=["label", "message"])
spam.head()

In [None]:
print(spam.shape)
spam.info()

In [None]:
vectorizer = CountVectorizer(stop_words='english')
spam_vector = vectorizer.fit_transform(spam["message"])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
spam_features = vectorizer.get_feature_names_out()

In [None]:
spam_vector.toarray()

In [None]:
data = pd.DataFrame(spam_vector.toarray(), columns=spam_features)

In [None]:
data.head()

In [None]:
spam['label'] = spam.label.map({'ham': 0, 'spam': 1})

In [None]:
X_train, X_test, y_train, y_test = train_test_split(spam_vector.toarray(), spam['label'], test_size=.2, random_state=111)
X_train.shape, X_test.shape

In [None]:
model = RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=1111)

In [None]:
model.fit(X_train, y_train)

In [None]:
accuracy_score(y_train, model.predict(X_train))

In [None]:
accuracy_score(y_test, model.predict(X_test))

In [None]:
print(f"training accuracy: {accuracy_score(y_train, model.predict(X_train))}")
print(f"test accuracy: {accuracy_score(y_test, model.predict(X_test))}")

print(f"training recall: {recall_score(y_train, model.predict(X_train))}")
print(f"test recall: {recall_score(y_test, model.predict(X_test))}")

print(f"training precision: {precision_score(y_train, model.predict(X_train))}")
print(f"test precision: {precision_score(y_test, model.predict(X_test))}")

print(f"training f1 score: {f1_score(y_train, model.predict(X_train))}")
print(f"test f1 score: {f1_score(y_test, model.predict(X_test))}")

In [None]:
# A single decision tree, fitted on the same split as the ensembles
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=1111)
dt_model.fit(X_train, y_train)

In [None]:
bagging_clf = BaggingClassifier(n_estimators=10, random_state=111)

In [None]:
bagging_clf.fit(X_train, y_train)

In [None]:
print(f"training accuracy: {accuracy_score(y_train, bagging_clf.predict(X_train))}")
print(f"test accuracy: {accuracy_score(y_test, bagging_clf.predict(X_test))}\n")

print(f"training recall: {recall_score(y_train, bagging_clf.predict(X_train))}")
print(f"test recall: {recall_score(y_test, bagging_clf.predict(X_test))}\n")

print(f"training precision: {precision_score(y_train, bagging_clf.predict(X_train))}")
print(f"test precision: {precision_score(y_test, bagging_clf.predict(X_test))}\n")

print(f"training f1 score: {f1_score(y_train, bagging_clf.predict(X_train))}")
print(f"test f1 score: {f1_score(y_test, bagging_clf.predict(X_test))}")

In [None]:
ada_clf = AdaBoostClassifier(n_estimators=50, learning_rate=.5)
t0 = time()
ada_clf.fit(X_train, y_train)
print(f"finish training in {time()-t0:.3f}s")

In [None]:
print(f"training accuracy: {accuracy_score(y_train, ada_clf.predict(X_train))}")
print(f"test accuracy: {accuracy_score(y_test, ada_clf.predict(X_test))}\n")

print(f"training recall: {recall_score(y_train, ada_clf.predict(X_train))}")
print(f"test recall: {recall_score(y_test, ada_clf.predict(X_test))}\n")

print(f"training precision: {precision_score(y_train, ada_clf.predict(X_train))}")
print(f"test precision: {precision_score(y_test, ada_clf.predict(X_test))}\n")

print(f"training f1 score: {f1_score(y_train, ada_clf.predict(X_train))}")
print(f"test f1 score: {f1_score(y_test, ada_clf.predict(X_test))}")
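
The metric cells above repeat the same calls for each model and recompute predictions for every metric. One possible way to tidy this up is a small helper that predicts once per split and returns all metrics as a row, so the models can be compared side by side. The helper `evaluate`, the `METRICS` mapping, and the synthetic stand-in data are hypothetical names introduced here. With the spam data, you would pass the notebook's own `X_train`, `y_train`, `X_test`, and `y_test` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

METRICS = {
    "accuracy": accuracy_score,
    "recall": recall_score,
    "precision": precision_score,
    "f1": f1_score,
}


def evaluate(fitted_model, X_train, y_train, X_test, y_test):
    """Return train/test metrics for a fitted model, predicting once per split."""
    train_pred = fitted_model.predict(X_train)
    test_pred = fitted_model.predict(X_test)
    row = {}
    for name, metric in METRICS.items():
        row[f"train_{name}"] = metric(y_train, train_pred)
        row[f"test_{name}"] = metric(y_test, test_pred)
    return row


# Synthetic stand-in for the spam features (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=10, random_state=0),
    "bagging": BaggingClassifier(n_estimators=10, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0),
}

# One row per model, one column per metric and split.
comparison = pd.DataFrame(
    {name: evaluate(clf.fit(Xtr, ytr), Xtr, ytr, Xte, yte) for name, clf in candidates.items()}
).T
print(comparison.round(3))
```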