
Make docs regarding Random Forest and Ensembles clearer #945

Closed
KronosTheLate opened this issue Jun 9, 2022 · 5 comments · Fixed by #948

@KronosTheLate

From https://alan-turing-institute.github.io/MLJ.jl/dev/composing_models/:
"Homogeneous Ensembles - for blending the predictions of multiple supervised models all of the same type, but which receive different views of the training data to reduce overall variance. The technique is known as observation bagging. Bagging decision trees, like a DecisionTreeClassifier, gives what is known as a random forest, although MLJ also provides several canned random forest models."

The quote suggests (to me at least) that an Ensemble of trees is equivalent to a random forest. However, from https://en.wikipedia.org/wiki/Random_forest:
"An extension of the algorithm was developed by Leo Breiman[9] and Adele Cutler,[10] who registered[11] "Random Forests" as a trademark in 2006 (as of 2019, owned by Minitab, Inc.).[12] The extension combines Breiman's "bagging" idea and random selection of features, introduced first by Ho[1] and later independently by Amit and Geman[13] in order to construct a collection of decision trees with controlled variance."

Of particular interest is "and random selection of features", which, as far as I can tell, is not an option for Ensembles. This difference between a random forest and an ensemble of trees should be made clearer in the docs, or, alternatively, the option to classify using only a random subset of features should be added to ensembles.
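
For concreteness, here is a minimal sketch of the kind of ensemble I mean (keyword names taken from the MLJ docs at the time of writing; they may have changed since):

```julia
using MLJ

# Load the atomic tree model (DecisionTree.jl via MLJ):
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# EnsembleModel's bagging-related keywords act on observations only;
# there appears to be no keyword for bagging *features*:
ensemble = EnsembleModel(
    model=Tree(),
    n=100,                # number of trees
    bagging_fraction=0.8, # fraction of observations each tree sees
)
```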

@ablaom

ablaom commented Jun 10, 2022

I appreciate the comment, thanks @KronosTheLate

Now you have me a little confused. 😕 What exactly do you mean by "random selection of features"? Do you mean at the level of nodes? In that case, if the individual trees provide this option (for example, the n_subfeatures field in DecisionTreeClassifier from DecisionTree.jl controls this), then an EnsembleModel of these will have "random feature selection", no?

Or do you mean feature bagging (one subsample of features for each tree, further subsampled at nodes as above)? The latter could be an option for EnsembleModel, but currently isn't. On the other hand, I'm not sure common random forest implementations include feature bagging in this latter sense (although tree boosting algorithms do).
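
For example (a sketch, assuming the current DecisionTree.jl hyper-parameter names):

```julia
using MLJ
Tree = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0

# n_subfeatures > 0 makes each tree consider only a random subset of
# features at every node (0 means: use all features):
tree = Tree(n_subfeatures=2)

# Observation-bagging such trees then yields node-level random
# feature selection within the ensemble:
ensemble = EnsembleModel(model=tree, n=100)
```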

@KronosTheLate

I do mean feature bagging. I did very much like your original suggestion:

Bagging decision trees, like a DecisionTreeClassifier, gives what Breiman[9] called a Random Forest. Nowadays, random forest implementations also bag features (see doc("RandomForest", pkg="DecisionTree") for an example). MLJ's EnsembleModel wrapper does not currently allow feature bagging.

The goal is simply to avoid having people think that an ensemble of trees is equivalent to a random forest.

@ablaom

ablaom commented Jun 12, 2022

Mmm, yes, but I deleted my original suggestion because the standard implementations (e.g., ScikitLearn) only subsample features at the level of nodes, and not additionally at the level of trees. So they are not "feature baggers" in the sense of model-generic feature bagging. Or am I still missing something?

@ablaom

ablaom commented Jun 12, 2022

That is, I would say that bagging DecisionTree.DecisionTreeClassifier trees using EnsembleModel, with the atomic n_subfeatures hyper-parameter set to a non-zero value, is equivalent to DecisionTree.RandomForestClassifier (there may be small differences in the nature of the bagging, e.g., sampling with or without replacement; I forget these details just now).

No?
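
In code, the rough equivalence I have in mind would be something like this sketch (hyper-parameter names as in DecisionTree.jl and MLJ at the time; the defaults may differ):

```julia
using MLJ
Tree   = @load DecisionTreeClassifier pkg=DecisionTree verbosity=0
Forest = @load RandomForestClassifier pkg=DecisionTree verbosity=0

X, y = @load_iris

# (a) Hand-rolled: observation bagging plus per-node feature subsampling.
bagged = EnsembleModel(model=Tree(n_subfeatures=2), n=100,
                       bagging_fraction=0.7)

# (b) Canned random forest with matching hyper-parameters.
forest = Forest(n_trees=100, n_subfeatures=2, sampling_fraction=0.7)

# Any discrepancy should come down to details of the bagging,
# e.g., sampling with or without replacement:
fit!(machine(bagged, X, y))
fit!(machine(forest, X, y))
```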

@KronosTheLate

I have to admit, I am also not sure. This is my first encounter with decision trees and random forests, in this 5 ECTS elective I am taking. My understanding was that each tree only gets a subset of features that it is allowed to use for classification. I am not sure whether that corresponds to fiddling with n_subfeatures. But as long as the PR mentions the distinction with feature bagging, I am happy ^_^
