## Ensemble learning
### Random Forest
---

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Fast algorithms such as decision trees are commonly used in ensemble methods (for example Random Forest), although slower algorithms can benefit from ensemble techniques as well.
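To see why combining models can help, here is a minimal sketch under a strong simplifying assumption that is not part of the text above: the ensemble is an odd number of binary classifiers that err independently and share the same accuracy. The function name `majority_vote_accuracy` is hypothetical. Under that assumption, the accuracy of a majority vote is a binomial tail probability.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that a strict majority of n_models independent classifiers,
    each correct with probability p_correct, votes for the right class."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    majority = n_models // 2 + 1
    # Sum the binomial probabilities of k correct votes for every winning k.
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(majority, n_models + 1)
    )

for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Real ensemble members are correlated, so the gain in practice is smaller than this idealized calculation suggests.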

Ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or anomaly detection.

An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions.

Ensembles tend to yield better results when there is significant diversity among the models. However, using a variety of strong learning algorithms has been shown to be more effective than using techniques that deliberately weaken the models in order to promote diversity.

https://en.wikipedia.org/wiki/Ensemble_learning



---
**Out of Bag Error**

Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging) to sub-sample the data used for training. The OOB error is the mean prediction error on each training sample $x_i$, using only the trees that did not have $x_i$ in their bootstrap sample.

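Each data point is left out of roughly one third of the bootstrap samples (about $1/e \approx 36.8\%$ for large $n$), so it can be tested with about 1/3 of the trees. Aggregating those trees' predictions for each point and computing the fraction predicted correctly gives the out-of-bag accuracy; the out-of-bag error is one minus that fraction.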
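As a minimal sketch of reading the out-of-bag estimate from scikit-learn, the snippet below fits a forest with `oob_score=True` on synthetic data. The data set from `make_classification` and the parameter values are assumptions chosen for illustration. Note that `oob_score_` is an accuracy, so the OOB error is one minus it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True asks the forest to score each training point using only
# the trees whose bootstrap sample did not contain that point.
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_      # fraction of OOB predictions that are correct
oob_error = 1.0 - oob_accuracy    # out-of-bag error
print('OOB accuracy: {:.3f}, OOB error: {:.3f}'.format(oob_accuracy, oob_error))
```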

In stochastic gradient boosting, subsampling likewise allows one to define an out-of-bag estimate of the improvement in prediction performance, by evaluating predictions on the observations that were not used to build the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate the actual performance improvement and the optimal number of iterations.



---
**Bootstrap aggregating | Bagging**

Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.

Sampling with replacement means that we can repeat data points. In the basic random forest, we will always use a sample size that is the same as the size of the original data set. Many data points will not be included in each sample and many will be repeated.


Given a standard training set $D$ of size $n$, bagging generates $m$ new training sets $D_{i}$, each of size $n'$, by sampling from $D$ uniformly and with replacement. By sampling with replacement, some observations may be repeated in each $D_{i}$. If $n'=n$, then for large $n$ the set $D_{i}$ is expected to have the fraction $(1 - 1/e)$ (≈63.2%) of the unique examples of $D$, the rest being duplicates. This kind of sample is known as a bootstrap sample. The $m$ models are fitted using the above $m$ bootstrap samples and combined by averaging the output (for regression) or voting (for classification).
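The 63.2% figure can be checked empirically. The sketch below draws one bootstrap sample of indices and measures the fraction of distinct examples it contains. The sample size and seed are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(0)
n = 10000

# One bootstrap sample: n draws from range(n), uniformly and with replacement.
bootstrap_indices = [rng.randrange(n) for _ in range(n)]

unique_fraction = len(set(bootstrap_indices)) / n   # examples that made it in
out_of_bag_fraction = 1.0 - unique_fraction         # examples left out

print('unique: {:.3f} (theory {:.3f})'.format(unique_fraction, 1 - math.exp(-1)))
print('out-of-bag: {:.3f} (theory {:.3f})'.format(out_of_bag_fraction, math.exp(-1)))
```

The left-out fraction, roughly $1/e$, is the set of points available for out-of-bag evaluation of the model trained on this sample.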


Bagging leads to "improvements for unstable procedures", which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression. Bagging has also been reported to improve preimage learning. On the other hand, it can mildly degrade the performance of stable methods such as k-nearest neighbors.

The goal is for each bootstrap sample to differ from the original data set, yet resemble it in distribution and variability.

For example, by averaging many models, each fitted to a different bootstrap sample of the original data set, we arrive at one bagged predictor. The averaged prediction is more stable than any single fit and overfits less.
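A minimal sketch of such a bagged predictor for regression follows. The base learner (a polynomial fit via `np.polyfit`), the noisy sine data, and the function name `bagged_polyfit_predict` are all assumptions chosen for illustration, not the setup used in the source.

```python
import numpy as np

def bagged_polyfit_predict(x, y, x_new, n_models=100, degree=5, seed=0):
    """Average the predictions of polynomial fits trained on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(x)
    predictions = np.empty((n_models, len(x_new)))
    for i in range(n_models):
        idx = rng.integers(0, n, size=n)            # n draws with replacement
        coeffs = np.polyfit(x[idx], y[idx], deg=degree)
        predictions[i] = np.polyval(coeffs, x_new)  # this model's prediction
    return predictions.mean(axis=0)                 # the bagged predictor

# Hypothetical data: a noisy sine curve.
data_rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + data_rng.normal(scale=0.3, size=x.size)

x_new = np.linspace(0.1, 0.9, 9)
y_bagged = bagged_polyfit_predict(x, y, x_new)
print(np.round(y_bagged, 2))
```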

This bootstrapping procedure leads to better model performance because it decreases the variance of the model, without increasing the bias. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets.
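The role of correlation can be illustrated with a small simulation. Assume (for illustration only) that each tree's prediction is a random variable with variance $\sigma^2 = 1$ and that every pair of trees has correlation $\rho$; the average of $B$ such predictions then has variance $\rho\sigma^2 + (1-\rho)\sigma^2/B$. The symbols $\rho$, $\sigma^2$, $B$ and the function name `variance_of_average` are introduced here and do not appear in the source.

```python
import numpy as np

def variance_of_average(rho, n_trees=50, n_trials=20000, seed=0):
    """Empirical variance of the mean of n_trees unit-variance predictions
    whose pairwise correlation is rho."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_trials, 1))         # noise common to all trees
    private = rng.standard_normal((n_trials, n_trees))  # noise unique to each tree
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private
    return preds.mean(axis=1).var()

for rho in (0.0, 0.5, 0.9):
    theory = rho + (1 - rho) / 50
    print('rho={}: simulated {:.3f}, theory {:.3f}'.format(
        rho, variance_of_average(rho), theory))
```

With uncorrelated trees the variance shrinks as $1/B$, while strongly correlated trees leave most of the variance in place no matter how many are averaged.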

In a random forest, only a random subset of the features is considered at each split. Typically, for a classification problem with $p$ features, $\sqrt{p}$ (rounded down) features are used in each split. For regression problems the inventors recommend $p/3$ (rounded down), with a minimum node size of 5 as the default.
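These defaults are simple to compute. The helper below is a hypothetical sketch (the name `default_num_features` is not from any library), and the floor of one feature is an added assumption so that very small $p$ still leaves something to split on.

```python
import math

def default_num_features(p, task='classification'):
    """Number of features to consider at each split for p total features."""
    if task == 'classification':
        k = math.isqrt(p)   # floor(sqrt(p))
    elif task == 'regression':
        k = p // 3          # floor(p / 3)
    else:
        raise ValueError('task must be "classification" or "regression"')
    return max(1, k)        # assumption: always consider at least one feature

print(default_num_features(16))                # classification with 16 features
print(default_num_features(16, 'regression'))  # regression with 16 features
```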

https://en.wikipedia.org/wiki/Bootstrap_aggregating
https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

---
**Random Forest**


Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
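The aggregation step can be sketched in a few lines. The function names are hypothetical, and the inputs stand for the outputs of already-trained trees for a single observation.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_predictions):
    """Return the mode of the per-tree class predictions.
    Ties go to the class that was encountered first."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)

print(aggregate_classification(['Play', "Don't Play", 'Play']))
print(aggregate_regression([1.0, 2.0, 3.0]))
```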

Breiman's formulation builds a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. It also combines two ingredients that are now standard practice:

* Using out-of-bag error as an estimate of the generalization error.
* Measuring variable importance through permutation.

In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, because they have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.


https://en.wikipedia.org/wiki/Random_forest

---
**Random feature selection | Variable importance**

At each node, each tree randomly chooses a subset of the features to consider for the split.
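A minimal sketch of this per-node sampling, assuming features are identified by their column index; the helper name `sample_split_features` is hypothetical.

```python
import random

def sample_split_features(n_features, n_candidates, rng):
    """Pick, without replacement, the feature indices considered at one node."""
    return sorted(rng.sample(range(n_features), n_candidates))

rng = random.Random(0)
# Each node draws its own subset, so different nodes see different features.
for node in range(3):
    print(node, sample_split_features(16, 4, rng))
```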

Random forests can be used to rank the importance of variables in a regression or classification problem in a natural way. The following technique is implemented in the R package randomForest.

The first step in measuring the variable importance in a data set ${\mathcal {D}}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$ is to fit a random forest to the data. During the fitting process the out-of-bag error for each data point is recorded and averaged over the forest (errors on an independent test set can be substituted if bagging is not used during training).

To measure the importance of the $j$-th feature after training, the values of the $j$-th feature are permuted among the training data and the out-of-bag error is again computed on this perturbed data set. The importance score for the $j$-th feature is computed by averaging the difference in out-of-bag error before and after the permutation over all trees. The score is normalized by the standard deviation of these differences.

Features which produce large values for this score are ranked as more important than features which produce small values.
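The permutation idea can be sketched as follows. This is a simplified variant, not the randomForest implementation: it measures error on a held-out test set instead of per-tree out-of-bag samples, averages over repeated permutations, and skips the normalization step. The synthetic data and all parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the two informative features are
# columns 0 and 1; the remaining three columns are pure noise.
X, y = make_classification(n_samples=600, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
baseline_error = 1.0 - rf.score(X_test, y_test)

rng = np.random.default_rng(0)
n_repeats = 20
importances = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    increases = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        # Shuffling column j breaks its relationship with the target.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        increases.append((1.0 - rf.score(X_perm, y_test)) - baseline_error)
    importances[j] = np.mean(increases)  # mean increase in error for feature j

ranking = np.argsort(importances)[::-1]  # most important feature first
print('ranking:', ranking)
print('importances:', np.round(importances, 3))
```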

This method of determining variable importance has some drawbacks. For data including categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Methods such as partial permutations and growing unbiased trees can be used to address the problem. If the data contain groups of correlated features of similar relevance for the output, then smaller groups are favored over larger groups.

***Unsupervised learning with random forests***

As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, and is robust to outlying observations. The RF dissimilarity easily deals with a large number of semi-continuous variables due to its intrinsic variable selection; for example, the "Addcl 1" RF dissimilarity weighs the contribution of each variable according to how dependent it is on other variables. The RF dissimilarity has been used in a variety of applications, e.g. to find clusters of patients based on tissue marker data.
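A rough sketch of this construction is shown below, under stated assumptions: synthetic points are drawn by sampling each column independently from its observed values (in the spirit of "Addcl 1"), proximity is the fraction of trees in which two observed points share a leaf, and the dissimilarity is taken as $\sqrt{1 - \text{proximity}}$. The function name `rf_dissimilarity` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_trees=200, seed=0):
    """Pairwise RF dissimilarity between the rows of unlabeled data X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Synthetic reference data: each column is resampled on its own, which
    # keeps the marginals but destroys dependence between variables.
    X_synth = np.column_stack(
        [rng.choice(X[:, j], size=n, replace=True) for j in range(p)])
    X_all = np.vstack([X, X_synth])
    y_all = np.r_[np.ones(n, dtype=int), np.zeros(n, dtype=int)]  # 1 = observed

    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X_all, y_all)

    leaves = rf.apply(X)  # leaf index of each observed point in each tree
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return np.sqrt(1.0 - proximity)

# Toy unlabeled data: two well-separated groups of 30 points each.
data_rng = np.random.default_rng(1)
X_obs = np.vstack([data_rng.normal(0, 1, size=(30, 3)),
                   data_rng.normal(4, 1, size=(30, 3))])
D = rf_dissimilarity(X_obs)
print(D.shape)
```

The resulting matrix `D` can be passed to any clustering method that accepts precomputed distances.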

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm


---
**Variants**

Instead of decision trees, linear models have been proposed and evaluated as base estimators in random forests, in particular multinomial logistic regression and naive Bayes classifiers.

In [16]:
from DecisionTree_rf import DecisionTree
from RandomForest_elo import RandomForest
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import numpy as np
import pandas as pd
from collections import Counter

In [47]:
df = pd.read_csv('data/playgolf.csv')
df[:2]

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Result
0,sunny,85,85,False,Don't Play
1,sunny,80,90,True,Don't Play


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
Outlook        14 non-null object
Temperature    14 non-null int64
Humidity       14 non-null int64
Windy          14 non-null bool
Result         14 non-null object
dtypes: bool(1), int64(2), object(2)
memory usage: 534.0+ bytes


In [49]:
df.Outlook = df.Outlook.astype('category').cat.codes

In [50]:
df.Result.unique()

array(["Don't Play", 'Play'], dtype=object)

In [51]:
# Encode the target as a boolean: True means 'Play'
df.Result = df.Result.map({"Don't Play": False, 'Play': True})

In [52]:
df[:2]

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Result
0,2,85,85,False,False
1,2,80,90,True,False


In [53]:
y = df.pop('Result').values
X = df.values
X_train, X_test, y_train, y_test = train_test_split(X, y)


---
#### Sklearn RandomForest

In [54]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [55]:
print('Accuracy: {}'.format(rf.score(X_test, y_test)))

Accuracy: 1.0


---
#### RandomForest class

---
1st Dataset

In [65]:
df__ = pd.read_csv('data/playgolf.csv')
y__ = df__.pop('Result').values
X__ = df__.values
X_train__, X_test__, y_train__, y_test__ = train_test_split(X__, y__)

In [66]:
dtree = DecisionTree(num_features=2)
dtree.fit(X_train__, y_train__)

In [67]:
predicted_ydt = dtree.predict(X_test__)
predicted_ydt

array(['Play', 'Play', 'Play', 'Play'], 
      dtype='|S4')

In [68]:
randomf = RandomForest(num_trees=12, num_features=2)
randomf.fit(X_train__, y_train__)

In [69]:
predicted_yrf = randomf.predict(X_test__)
predicted_yrf

array(['Play', 'Play', 'Play', 'Play'], 
      dtype='|S4')

In [70]:
randomf.score(X_test__, y_test__)

0.75

In [71]:
sum(predicted_yrf == y_test__) / float(len(y_test__))

0.75

---
2nd Dataset

In [78]:
df_ = pd.read_csv('data/congressional_voting.csv', names=['Party'] + list(range(1, 17)))
df_[:2]

Unnamed: 0,Party,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?


In [79]:
y_ = df_.pop('Party').values
X_ = df_.values
X_train_, X_test_, y_train_, y_test_ = train_test_split(X_, y_)

In [95]:
randomf2 = RandomForest(num_trees=20, num_features=10)
randomf2.fit(X_train_, y_train_)

In [96]:
predicted_yrf2 = randomf2.predict(X_test_)
predicted_yrf2[:2]

array(['democrat', 'democrat'], 
      dtype='|S8')

In [97]:
randomf2.score(X_test_, y_test_)

0.59633027522935778