## **Random Forests**

#### **Random Forest**

- Random Forest is an ensemble learning technique combining multiple decision trees.
- It uses the bagging method to train trees on different subsets of the data, typically with replacement.
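
To make the bagging step concrete, the sketch below draws a single bootstrap sample with NumPy. The training-set size `n` and the seed are arbitrary assumptions for illustration; Scikit-Learn performs this sampling internally for every tree.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for reproducibility
n = 10_000                       # hypothetical training-set size

# Draw n row indices with replacement: this is one bootstrap sample
bootstrap_idx = rng.integers(0, n, size=n)

# Some rows are drawn several times, others not at all
unique_fraction = np.unique(bootstrap_idx).size / n
print(f"Distinct rows in the bootstrap sample: {unique_fraction:.1%}")
```

On average about 63% of the rows (1 - 1/e) end up in each sample, so every tree sees a slightly different version of the training set.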

#### **Random Forest Classifier**

- Scikit-Learn's `RandomForestClassifier` is a Random Forest implementation for classification tasks that is optimized for decision trees.
- More convenient than manually configuring `BaggingClassifier` with `DecisionTreeClassifier`.

#### **Code**

The following code trains a Random Forest classifier with 500 trees (each limited to at most 16 leaf nodes), using all available CPU cores:

In [None]:
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [3]:
# Generate a two-class toy dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [5]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

In [6]:
y_pred_rf = rnd_clf.predict(X_test)
# Evaluate on the held-out test set
print(accuracy_score(y_test, y_pred_rf))

With a few exceptions, a `RandomForestClassifier` has all the hyperparameters of a `DecisionTreeClassifier` (to control how trees are grown), plus all the hyperparameters of a `BaggingClassifier` (to control the ensemble itself).

The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features (by default, the square root of the total number of features for `RandomForestClassifier`). This results in greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding a better model overall.
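
The effect of feature subsampling on tree diversity can be seen by looking at which feature each tree splits on at its root. The following is a minimal sketch on the Iris dataset; the helper `root_features`, the tree count, and the seed are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

def root_features(max_features):
    """Return the set of feature indices tested at the root of each tree."""
    forest = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0
    )
    forest.fit(X, y)
    # tree_.feature[0] is the index of the feature used at the root node
    return {int(est.tree_.feature[0]) for est in forest.estimators_}

roots_all = root_features(None)       # every feature considered at each split
roots_subset = root_features("sqrt")  # random subset of features at each split

print("Root features, all features considered:", sorted(roots_all))
print("Root features, random subset considered:", sorted(roots_subset))
```

When every feature is a candidate, the trees tend to agree on the same few strong features at the root; restricting each split to a random subset forces some trees to start from weaker features, which is where the extra diversity comes from.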

The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

In [8]:
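# max_features="sqrt" makes each tree consider a random subset of features at
# every split, as a Random Forest does; splitter='random' would instead pick
# random thresholds, which is the Extra-Trees behaviour
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)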
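
To check how close the two ensembles are in practice, a sketch like the following fits both on the moons data and compares test accuracy. The base tree here uses `max_features="sqrt"` to mimic per-split feature sampling; that choice, the names `forest` and `bagged_trees`, and the `random_state` values are assumptions made for this illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=500, max_leaf_nodes=16, random_state=42
)

# Bagging of trees that sample a feature subset at every split
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, random_state=42,
)

forest_acc = accuracy_score(y_test, forest.fit(X_train, y_train).predict(X_test))
bagging_acc = accuracy_score(
    y_test, bagged_trees.fit(X_train, y_train).predict(X_test)
)

print(f"RandomForestClassifier accuracy: {forest_acc:.3f}")
print(f"BaggingClassifier accuracy:      {bagging_acc:.3f}")
```

The two scores should be close but not necessarily identical, since the ensembles draw their random samples differently.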

#### **Extra-Trees**

Extra-Trees (Extremely Randomized Trees) offer an alternative to Random Forests by introducing even more randomness during tree construction. 

- **Characteristics of Extra-Trees**
    - **Random Feature Selection:** Like Random Forests, at each node, Extra-Trees use a random subset of features to decide the split.
    - **Random Thresholds:** Unlike Random Forests, which search for the best possible threshold for each candidate feature, Extra-Trees draw a random threshold for each candidate feature and keep the best of these random splits.

- **Bias-Variance Tradeoff:**
    - **Higher Bias:** By using random thresholds, Extra-Trees increase bias compared to Random Forests.
    - **Lower Variance:** The added randomness helps reduce overfitting, resulting in lower variance.
    - **Efficiency:** Since the algorithm does not need to search for the best thresholds, training Extra-Trees is significantly faster than training Random Forests.

- **Advantages of Extra-Trees**
    - **Faster Training:** No exhaustive search for optimal thresholds makes the training process much quicker.
    - **Reduced Overfitting:** The added randomness can make the model more robust to noise in the data.

- **Drawbacks**
    - **Potentially Lower Accuracy:** The random thresholds might lead to suboptimal splits, increasing the error (bias) in some cases.
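
The difference between an exhaustive threshold search and a random threshold can be sketched for a single feature with NumPy. This is a simplified illustration on synthetic data, not Scikit-Learn's actual splitter; the helper `weighted_gini` and the data are assumptions.

```python
import numpy as np

def weighted_gini(x, y, threshold):
    """Weighted Gini impurity of the two children produced by a split."""
    def gini(labels):
        if labels.size == 0:
            return 0.0
        p = np.bincount(labels) / labels.size
        return 1.0 - np.sum(p ** 2)

    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size

rng = np.random.default_rng(0)
# One synthetic feature: two overlapping classes of 100 samples each
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.repeat([0, 1], 100)

# Exhaustive search: evaluate every midpoint between consecutive sorted values
values = np.unique(x)
candidates = (values[:-1] + values[1:]) / 2
best_threshold = min(candidates, key=lambda t: weighted_gini(x, y, t))

# Extra-Trees style: a single threshold drawn uniformly from the feature's range
random_threshold = rng.uniform(x.min(), x.max())

best_impurity = weighted_gini(x, y, best_threshold)
random_impurity = weighted_gini(x, y, random_threshold)
print(f"Best threshold   {best_threshold:.3f} -> impurity {best_impurity:.3f}")
print(f"Random threshold {random_threshold:.3f} -> impurity {random_impurity:.3f}")
```

The random threshold is never better than the exhaustive one on this single split, but it costs one impurity evaluation instead of one per candidate. Extra-Trees accept that loss in exchange for speed and more diverse trees.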

In [9]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)

# Initialize the Extra-Trees Classifier
extra_trees_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)

# Perform cross-validation
scores = cross_val_score(extra_trees_clf, X, y, cv=5)

# Output the average accuracy
print(f"Cross-validated accuracy: {scores.mean():.4f}")


Cross-validated accuracy: 0.9533


#### **Feature Importance**

Feature importance is a highly useful property of Random Forests, which provides insights into the contributions of individual features to the model's predictions.

**About Feature Importance in Random Forests**
1) **Impurity Reduction:**
    - Feature importance is determined by the total reduction in impurity (e.g., Gini impurity or entropy for classification tasks) brought about by splits that involve the feature across all trees in the forest.
    - Each feature's contribution is computed as a weighted average, where the weight of a node is the number of training samples it contains.

2) **Normalized Scores:**
    - The computed importances are normalized so that the sum of all feature importances equals 1.

3) **Automatic Computation:**
    - Scikit-Learn calculates feature importances automatically during training. They can be accessed using the `feature_importances_` attribute of the trained model.
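
The aggregation across trees can be reproduced by hand. The sketch below assumes the forest's score is the average of the per-tree importances, renormalized to sum to 1, which matches my understanding of Scikit-Learn's behaviour; the variable names and seed are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each fitted tree exposes its own normalized impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then renormalize so the scores sum to 1
manual = per_tree.mean(axis=0)
manual /= manual.sum()

print("Manual average:      ", np.round(manual, 4))
print("feature_importances_:", np.round(forest.feature_importances_, 4))
```

If the assumption holds, both lines print the same values.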

In [11]:
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
# Print each feature name next to its importance score
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.09719323076272049
sepal width (cm) 0.025230255489317472
petal length (cm) 0.4374410262104303
petal width (cm) 0.4401354875375316


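**Petal width** and **petal length** carry almost all of the importance (about 44% each), while the sepal measurements contribute little. Because no `random_state` is set, the exact scores and the ordering of the two petal features can change between runs.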