# Ensemble

Ensemble methods combine the predictions of several base estimators to improve generalizability and robustness over a single estimator.

## Methods

### Averaging Method
1. Several estimators are built independently and then their predictions are averaged
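1. The combined estimator is usually better than any single base estimator because its variance is reduced
1. Works best with strong and complex models (e.g. fully developed decision trees)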
1. e.g. [Bagging Methods](http://scikit-learn.org/stable/modules/ensemble.html#bagging-meta-estimator), Random Forest
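The variance argument above can be checked with a small simulation. This is only a sketch under a strong assumption: the estimators are simulated as independent, unbiased, noisy predictions (`TRUE_VALUE`, `NOISE_SD` and the helper functions are hypothetical names, not part of scikit-learn). Real bagged models are trained on overlapping data, so their errors are correlated and the reduction is smaller.

```python
import random
import statistics

TRUE_VALUE = 1.0  # quantity every estimator tries to predict (hypothetical)
NOISE_SD = 1.0    # standard deviation of each estimator's error (hypothetical)


def single_prediction(rng):
    """One noisy but unbiased prediction of TRUE_VALUE."""
    return rng.gauss(TRUE_VALUE, NOISE_SD)


def averaged_prediction(rng, n_estimators):
    """Average the predictions of n_estimators independent estimators."""
    return statistics.fmean(single_prediction(rng) for _ in range(n_estimators))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_trials = 2000

single = [single_prediction(rng) for _ in range(n_trials)]
averaged = [averaged_prediction(rng, n_estimators=10) for _ in range(n_trials)]

var_single = statistics.variance(single)
var_averaged = statistics.variance(averaged)

print(f"variance of a single estimator: {var_single:.3f}")
print(f"variance of a 10-model average: {var_averaged:.3f}")
```

With independent errors, the variance of the average shrinks roughly in proportion to the number of estimators, while the mean prediction stays the same.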

### Boosting Method
1. Base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator
1. Works best with weak models (e.g. shallow decision trees)
1. e.g. [AdaBoost](http://scikit-learn.org/stable/modules/ensemble.html#adaboost), Gradient Tree Boosting

> http://scikit-learn.org/stable/modules/ensemble.html#ensemble-methods
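As a rough illustration of boosting with weak models, the sketch below compares a single decision stump with AdaBoost on scikit-learn's bundled iris data (`load_iris`, used here instead of the CSV file so the block is self-contained). The names `X_demo`, `y_demo`, `stump` and `boosted` are illustrative, and the choice of 50 estimators is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# A decision stump (depth-1 tree) is a typical weak model: with only two
# leaves it can never predict all three iris classes.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)

# AdaBoost's default base estimator is also a depth-1 decision tree.
# The stumps are fitted one after another, and samples that were
# misclassified so far get a larger weight in the next round.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

stump_score = cross_val_score(stump, X_demo, y_demo, cv=5).mean()
boosted_score = cross_val_score(boosted, X_demo, y_demo, cv=5).mean()

print(f"single stump:      {stump_score:.3f}")
print(f"50 boosted stumps: {boosted_score:.3f}")
```

The single stump is heavily biased, and combining many of them sequentially is what removes most of that bias.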

# Random Forest
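1. Widely used
1. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others
1. Less prone to overfitting than a single decision tree, because averaging many trees reduces variance
1. For regression, the output is the average of the predictions of all trees
1. For classification, the output is based on soft voting: the class probabilities predicted by the trees are averaged
1. Provides feature importances
1. Training can be parallelized, since the trees are built independently of each other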

**Disadvantages**
1. Random forests don’t tend to perform well on very high-dimensional, sparse data, such as text
1. A forest of randomly generated trees is difficult to interpret; when interpretability matters, a single decision tree can be used instead
1. Random forests require more memory and are slower to train and to predict than linear models

> http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees
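The soft voting, feature importance and parallelization points above can be sketched as follows. The block uses scikit-learn's bundled iris data (`load_iris`) so it runs on its own; `X_demo`, `y_demo` and the choice of 25 trees are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# n_jobs=-1 asks scikit-learn to build the trees on all available cores.
forest = RandomForestClassifier(n_estimators=25, n_jobs=-1, random_state=0)
forest.fit(X_demo, y_demo)

# Soft voting by hand: average the class probabilities of the individual trees.
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)
manual_labels = forest.classes_[np.argmax(tree_probas, axis=1)]

same_proba = np.allclose(tree_probas, forest.predict_proba(X_demo))
same_labels = np.array_equal(manual_labels, forest.predict(X_demo))
print("matches forest.predict_proba:", same_proba)
print("matches forest.predict:      ", same_labels)

# Impurity-based feature importances, normalized to sum to 1.
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The individual trees are available through `forest.estimators_`, which is also the starting point if a single tree needs to be inspected.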

In [1]:
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

import pandas as pd

In [2]:
iris_df = pd.read_csv('../data/iris.csv', dtype = {'species': 'category'})
iris_df.head(3)

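|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | setosa  |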


In [3]:
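# Features: every column except the last one; target: the species label
X = iris_df.iloc[:, :-1]
y = iris_df.species

# Stratify so that the train and test sets keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)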

In [4]:
clf = DecisionTreeClassifier(max_depth = 2)
cross_val_score(clf, X, y).mean()

0.94730392156862742

In [5]:
clf = RandomForestClassifier(n_estimators = 10, max_depth = None, min_samples_split = 2, max_features = 4)
cross_val_score(clf, X, y).mean()

0.9673202614379085

# Gradient Tree Boosting
1. Used in web search ranking
1. Widely used
1. Unlike a random forest, gradient boosting builds trees serially, where each tree tries to correct the mistakes of the previous one
1. There's no randomization by default
1. Strong pre-pruning is used instead
1. Uses very shallow trees, of depth one to five, which makes the model smaller in terms of memory and makes predictions faster
1. `learning_rate` parameter: controls how strongly each tree corrects the mistakes of the previous trees; a higher value means stronger corrections and a more complex model
1. Outputs feature importances as well

**Advantages**
1. Can handle heterogeneous features
1. High predictive power
1. Robustness to outliers in output space (via robust loss functions)
1. Frequently used in winning competition entries

**Disadvantages**
1. Limited scalability: due to the sequential nature of boosting, it can hardly be parallelized
1. Sensitive to parameter settings, so it requires careful tuning and may take a long time to train
1. As with other tree-based models, it often does not work well on high-dimensional sparse data

> http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting
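To get a feel for the `learning_rate` parameter described above, the sketch below fits the same shallow-tree model with three different rates and records train and test accuracy. It assumes scikit-learn's bundled iris data (`load_iris`) and a fixed `random_state`; the variable names and the three rates are illustrative choices. The exact numbers depend on the split, so none are quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

results = {}
for learning_rate in (0.01, 0.1, 1.0):
    # Shallow trees (max_depth=2) act as the strongly pre-pruned weak learners.
    gbc = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=learning_rate,
        max_depth=2,
        random_state=0,
    )
    gbc.fit(X_tr, y_tr)
    results[learning_rate] = (gbc.score(X_tr, y_tr), gbc.score(X_te, y_te))
    print(
        f"learning_rate={learning_rate}: "
        f"train={results[learning_rate][0]:.3f}, "
        f"test={results[learning_rate][1]:.3f}"
    )

# Feature importances of the last fitted model, normalized to sum to 1.
print(gbc.feature_importances_)
```

Comparing train and test accuracy across the rates is a simple way to see when stronger corrections start to fit the training data more closely than they help on held-out data.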

In [6]:
clf = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 2)
clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=1.0, loss='deviance', max_depth=2,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [7]:
clf.score(X_test, y_test)

1.0

# Voting Classifier
1. Combines conceptually different machine learning classifiers and predicts the class label by majority vote (hard voting) or by averaging the predicted probabilities (soft voting)
1. Useful for a set of equally well performing models, to balance out their individual weaknesses

> http://scikit-learn.org/stable/modules/ensemble.html#voting-classifier
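The difference between a majority vote and averaged probabilities can be worked through by hand. The sketch below uses made-up probabilities for a single sample and three hypothetical classifiers (`clf_a`, `clf_b`, `clf_c`); it does not call scikit-learn and only mirrors the two voting rules.

```python
from collections import Counter

# Hypothetical predicted probabilities for ONE sample and two classes (0 and 1),
# one entry per classifier. The numbers are made up for illustration.
probabilities = {
    "clf_a": [0.90, 0.10],
    "clf_b": [0.40, 0.60],
    "clf_c": [0.45, 0.55],
}


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Hard voting: each classifier casts one vote for its most probable class.
votes = [argmax(p) for p in probabilities.values()]
hard_label = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest.
n_classes = 2
mean_proba = [
    sum(p[c] for p in probabilities.values()) / len(probabilities)
    for c in range(n_classes)
]
soft_label = argmax(mean_proba)

print("votes:", votes)                      # [0, 1, 1]
print("hard voting predicts:", hard_label)  # 1
print("soft voting predicts:", soft_label)  # 0
```

Here two classifiers lean slightly towards class 1 while one is very confident about class 0, so the majority vote picks class 1 but the averaged probabilities favor class 0.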

In [8]:
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = GaussianNB()

# voting='soft' averages the class probabilities predicted by the three models
vclf = VotingClassifier(estimators = [('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')

for clf, label in zip([clf1, clf2, clf3, vclf], ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv = 5, scoring = 'accuracy')
    print("Accuracy: {:.2f} (+/- {:.2f}) [{}]".format(scores.mean(), scores.std(), label))

Accuracy: 0.96 (+/- 0.04) [Logistic Regression]
Accuracy: 0.95 (+/- 0.03) [Random Forest]
Accuracy: 0.95 (+/- 0.03) [Naive Bayes]
Accuracy: 0.96 (+/- 0.02) [Ensemble]
