
Make docs regarding Random Forest and Ensembles clearer #945

Closed
KronosTheLate opened this issue Jun 9, 2022 · 5 comments · Fixed by #948

@KronosTheLate

From https://alan-turing-institute.github.io/MLJ.jl/dev/composing_models/:
"Homogeneous Ensembles - for blending the predictions of multiple supervised models all of the same type, but which receive different views of the training data to reduce overall variance. The technique is known as observation bagging. Bagging decision trees, like a DecisionTreeClassifier, gives what is known as a random forest, although MLJ also provides several canned random forest models."

The quote suggests (to me at least) that an Ensemble of trees is equivalent to a random forest. However, from https://en.wikipedia.org/wiki/Random_forest:
"An extension of the algorithm was developed by Leo Breiman[9] and Adele Cutler,[10] who registered[11] "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).[12] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman[13] in order to construct a collection of decision trees with controlled variance."

Of particular interest is "and random selection of features", which, as far as I can tell, is not an option for Ensembles. This difference between a random forest and an ensemble of trees should be made clearer in the docs, or, alternatively, the option to classify using only a random subset of features should be added to ensembles.
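
For concreteness, here is a minimal sketch of the kind of ensemble I mean (keyword names taken from the MLJ docs at the time of writing; they may have changed since):

```julia
using MLJ

# Load the atomic tree model (DecisionTree.jl via MLJ):
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# EnsembleModel's bagging-related keywords act on observations only;
# there appears to be no keyword for bagging *features*:
ensemble = EnsembleModel(
    model=Tree(),
    n=100,                # number of trees
    bagging_fraction=0.8, # fraction of observations each tree sees
)
```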

@ablaom

ablaom commented Jun 10, 2022

I appreciate the comment, thanks @KronosTheLate

Now you have me a little confused. 😕 What exactly do you mean by "random selection of features"? Do you mean at the level of nodes? In that case, if the individual trees provide this option (for example, the n_subfeatures field in DecisionTreeClassifier from DecisionTree.jl controls this), then an EnsembleModel of these will have "random feature selection", no?

Or do you mean feature bagging (one subsample of features for each tree, further subsampled at nodes as above)? The latter could be an option for EnsembleModel, but currently isn't. On the other hand, I'm not sure common random forest implementations include feature bagging in this latter sense (although tree boosting algorithms do).
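
For example (a sketch, assuming the current DecisionTree.jl hyper-parameter names):

```julia
using MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# n_subfeatures > 0 makes each tree consider only a random subset of
# features at every node (0 means: use all features):
tree = Tree(n_subfeatures=2)

# Observation-bagging such trees then yields node-level random
# feature selection within the ensemble:
ensemble = EnsembleModel(model=tree, n=100)
```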

@KronosTheLate

I do mean feature bagging. I did very much like your original suggestion:

Bagging decision trees, like a DecisionTreeClassifier, gives what Breiman[9] called a Random Forest. Nowadays, random forest implementations also bag features (see doc("RandomForest", pkg="DecisionTree") for an example). MLJ's EnsembleModel wrapper does not currently allow feature bagging.

The goal is simply to avoid having people think that an ensemble of trees is equivalent to a random forest.

@ablaom

ablaom commented Jun 12, 2022

Mmm, yes, but I deleted my original suggestion because the standard implementations (e.g., ScikitLearn) only subsample features at the level of nodes, and not additionally at the level of trees. So they are not "feature baggers" in the sense of model-generic feature bagging. Or am I still missing something?

@ablaom

ablaom commented Jun 12, 2022

That is, I would say that bagging DecisionTree.DecisionTreeClassifier trees using EnsembleModel, with the atomic n_subfeatures hyper-parameter set to a non-zero value, is equivalent to DecisionTree.RandomForestClassifier (there may be small differences in the nature of the bagging, e.g., sampling with or without replacement; I forget these details just now).

No?
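
In code, the rough equivalence I have in mind would be something like this sketch (hyper-parameter names as in DecisionTree.jl and MLJ at the time; the defaults may differ):

```julia
using MLJ
Tree   = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
Forest = @load RandomForestClassifier pkg=DecisionTree verbosity=0

X, y = @load_iris

# (a) Hand-rolled: observation bagging plus per-node feature subsampling.
bagged = EnsembleModel(model=Tree(n_subfeatures=2), n=100,
                       bagging_fraction=0.7)

# (b) Canned random forest with matching hyper-parameters.
forest = Forest(n_trees=100, n_subfeatures=2, sampling_fraction=0.7)

# Any discrepancy should come down to details of the bagging,
# e.g., sampling with or without replacement:
fit!(machine(bagged, X, y))
fit!(machine(forest, X, y))
```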

@KronosTheLate

I have to admit, I am also not sure. This is my first encounter with decision trees and random forests, in this 5 ECTS elective I am taking. My understanding was that each tree only gets a subset of features that it is allowed to use for classification. I am not sure whether that corresponds to fiddling with n_subfeatures. But as long as the PR mentions the distinction with feature bagging, I am happy ^_^
