## Forests of randomized trees
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

```
class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)[source]
```

### Basic usage

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

#### Generating some `dummy` data

In [2]:
X, y = make_classification(n_samples=100000, n_features=8, random_state=0, shuffle=False)

In [3]:
X[:2], y[:2]

(array([[ 0.68233622, -1.13561564, -0.23913112, -0.41773772, -1.05225342,
         -1.3299612 ,  1.38865101, -0.24899491],
        [-0.16601206, -2.76211096, -1.00726164, -0.00424735, -0.5529632 ,
          0.4356544 ,  0.50993161,  1.8285584 ]]),
 array([0, 0]))

> Split our data

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state= 10, test_size=.3)

> Using a `DecisionTreeClassifier`

In [5]:
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf',  DecisionTreeClassifier()),
])

In [6]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('scale', StandardScaler()), ('clf', DecisionTreeClassifier())])

In [7]:
pipe.score(X_train, y_train), pipe.score(X_test, y_test)

(1.0, 0.8648333333333333)

> Our model is `100%` accurate on the train dataset and `92%` on the test.

> Using the `GridSearchCV` to impove our model.

In [8]:
param_grid={"splitter": ("best", "random"),'criterion': ["gini", "entropy"]
            
           }

clf = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring="accuracy")

In [9]:
clf.fit(X_train, y_train)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'splitter': ('best', 'random')},
             scoring='accuracy')

In [10]:
clf.score(X_train, y_train)

1.0

In [11]:
model = clf.best_estimator_
model

DecisionTreeClassifier(criterion='entropy')

In [30]:
model.score(X_test, y_test)

0.92

> Using a `RandomForestClassifier`

In [12]:
clf = RandomForestClassifier(max_depth=2, random_state=0)

In [13]:
clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=2, random_state=0)

In [14]:
clf.score(X_train, y_train)

0.8936428571428572

In [15]:
clf.score(X_test, y_test)

0.8936

> The `RandomForestClassifier` works more efficiently on large samples.

### Extremely Randomized Trees
In extremely randomized trees (see ExtraTreesClassifier and ExtraTreesRegressor classes), randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually allows to reduce the variance of the model a bit more, at the expense of a slightly greater increase in bias:

In [20]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_blobs(n_samples=10000, n_features=10, centers=100,
    random_state=0)

lf = DecisionTreeClassifier(max_depth=None, min_samples_split=2,
random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("Mean..... ", scores.mean())


clf = RandomForestClassifier(n_estimators=10, max_depth=None,
min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print("Mean..... ", scores.mean())

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
min_samples_split=2, random_state=0) 
scores = cross_val_score(clf, X, y, cv=5)
scores.mean() > 0.999


Mean.....  1.0
Mean.....  0.9997


True

> [Docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)