# Ensemble extra trees

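* Random Forests overfit less than a single decision tree, but their trees can still be correlated with each other
* Extra Trees (extremely randomized trees) are an alternative that adds more randomness to reduce variance further
* The 'extra' refers to extra randomness
  * Build many decision trees
  * By default, scikit-learn trains each tree on the whole training set (`bootstrap=False`); Random Forest instead draws a bootstrap sample, with replacement, for each tree
  * Therefore the trees differ because of their random splits, not because of their samples
  * A random subset of features can be considered at each split (controlled by `max_features`), not once per tree
  * The split threshold is NOT optimised for each feature
  * The algorithm draws one random threshold per candidate feature, then keeps the best of these random splits according to the split criterion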
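The split rule described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions, not scikit-learn's implementation: the helpers `variance_reduction` and `extra_trees_split` and the demo arrays are hypothetical names introduced here, and variance reduction stands in for the regression split criterion.

```python
import numpy as np


def variance_reduction(y, mask):
    """Drop in variance when y is split into y[mask] and y[~mask]."""
    n, n_left = len(y), int(mask.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    weighted_child_var = (n_left / n) * np.var(y[mask]) + (n_right / n) * np.var(y[~mask])
    return float(np.var(y) - weighted_child_var)


def extra_trees_split(X, y, n_candidate_features, rng):
    """Choose one split: a random threshold per candidate feature, best one wins."""
    features = rng.choice(X.shape[1], size=n_candidate_features, replace=False)
    best = None
    for j in features:
        # Draw the threshold uniformly between the feature's min and max
        # instead of searching for the optimal cut point.
        threshold = rng.uniform(X[:, j].min(), X[:, j].max())
        score = variance_reduction(y, X[:, j] <= threshold)
        if best is None or score > best[2]:
            best = (int(j), float(threshold), score)
    return best


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only feature 2 carries signal in this toy target
y_demo = 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

feature, threshold, score = extra_trees_split(X_demo, y_demo, n_candidate_features=5, rng=rng)
print(f"feature={feature}, threshold={threshold:.3f}, variance reduction={score:.3f}")
```

A Random Forest style split would instead scan every possible threshold of each candidate feature, which costs more and tends to produce more similar trees.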

In [1]:
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

### Create dataset

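* A synthetic regression dataset of 10k samples and 20 features
* No `random_state` is passed to `make_regression`, so the scores below vary slightly from run to run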

In [2]:
X, y = make_regression(n_samples=10000, n_features=20)

### Decision Tree

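* Mean R² score is 0.62 (`cross_val_score` reports R² for regressors, not accuracy)

In [10]:
# A single, fully grown regression tree as the baseline
reg = DecisionTreeRegressor(
    max_depth=None,
    min_samples_split=2,
    random_state=0
)
# With no `scoring` argument, cross_val_score uses the regressor's R² score
scores = cross_val_score(reg, X, y, cv=5)
print(f"Cross-validation mean R² score = {scores.mean():.2f}")

Cross-validation mean R² score = 0.62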
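When no `scoring` argument is given, `cross_val_score` falls back to the estimator's own `score` method, which for scikit-learn regressors is the coefficient of determination R². A minimal sketch that makes this explicit is shown here; the smaller dataset and the names `X_small`, `y_small`, `default_scores` and `r2_scores` are assumptions chosen only to keep the check fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Small, seeded dataset so the comparison is quick and repeatable
X_small, y_small = make_regression(n_samples=500, n_features=20, random_state=0)
reg = DecisionTreeRegressor(random_state=0)

# Default scoring (the regressor's .score method) versus explicit R²
default_scores = cross_val_score(reg, X_small, y_small, cv=5)
r2_scores = cross_val_score(reg, X_small, y_small, cv=5, scoring="r2")

print(np.allclose(default_scores, r2_scores))
```

Passing `scoring="r2"` explicitly makes the metric visible in the code and avoids mislabelling it as accuracy.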


### Random Forest

* Mean R² score is 0.84

In [11]:
# Ten trees, each fitted on a bootstrap sample and using optimised splits
reg = RandomForestRegressor(
    n_estimators=10,
    max_depth=None,
    min_samples_split=2,
    random_state=0
)
scores = cross_val_score(reg, X, y, cv=5)
print(f"Cross-validation mean R² score = {scores.mean():.2f}")

Cross-validation mean R² score = 0.84


### Extra Trees

* Mean R² score is 0.87

In [12]:
# Ten trees, each fitted on the full training set and using random split thresholds
reg = ExtraTreesRegressor(
    n_estimators=10,
    max_depth=None,
    min_samples_split=2,
    random_state=0
)
scores = cross_val_score(reg, X, y, cv=5)
print(f"Cross-validation mean R² score = {scores.mean():.2f}")

Cross-validation mean R² score = 0.87
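One way to see where the two ensembles differ in scikit-learn is to inspect the fitted models. The sketch below uses a small seeded dataset (`X_small`, `y_small` are names assumed for this example) and prints the `bootstrap` setting of each ensemble together with the type and `splitter` of its first tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Small, seeded dataset; only the model configuration matters here
X_small, y_small = make_regression(n_samples=200, n_features=20, random_state=0)

rf = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_small, y_small)
et = ExtraTreesRegressor(n_estimators=5, random_state=0).fit(X_small, y_small)

for name, model in [("Random Forest", rf), ("Extra Trees", et)]:
    tree = model.estimators_[0]  # first fitted tree of the ensemble
    print(
        f"{name}: bootstrap={model.bootstrap}, "
        f"tree type={type(tree).__name__}, splitter={tree.splitter}"
    )
```

With default settings, Random Forest resamples the training set and searches for the best threshold, while Extra Trees keeps the full training set and relies on random thresholds for diversity.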
