
Default for max_variables in fit_class_random_forest and fit_regr_random_forest #365

soxofaan opened this issue Apr 21, 2022 · 7 comments

@soxofaan

I just noticed this inconsistency:

The process specs of fit_class_random_forest and fit_regr_random_forest do not have a default value for the max_variables parameter.
e.g.

```json
"name": "max_variables",
"description": "Specifies how many split variables will be used at a node.\n\nThe following options are available:\n\n- *integer*: The given number of variables are considered for each split.\n- `all`: All variables are considered for each split.\n- `log2`: The logarithm with base 2 of the number of variables are considered for each split.\n- `onethird`: A third of the number of variables are considered for each split.\n- `sqrt`: The square root of the number of variables are considered for each split. This is often the default for classification.",
"schema": [
    {
        "type": "integer",
        "minimum": 1
    },
    {
        "type": "string",
        "enum": [
            "all",
            "log2",
            "onethird",
            "sqrt"
        ]
    }
]
},
```
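To make the options concrete, here is a small illustrative sketch (the helper name and rounding choices are my assumptions, not part of the openeo-processes spec) of how each max_variables value could resolve to a number of split variables for a dataset with a given number of predictors:

```python
import math

# Hypothetical helper: resolve a max_variables option from the process spec
# to a concrete number of split variables, given the total predictor count.
# Rounding and clamping behavior here is an assumption for illustration.
def resolve_max_variables(option, n_predictors):
    if isinstance(option, int):
        if option < 1:
            raise ValueError("integer option must be >= 1")
        # Clamp to the number of available predictors.
        return min(option, n_predictors)
    strategies = {
        "all": lambda n: n,
        "log2": lambda n: max(1, math.floor(math.log2(n))),
        "onethird": lambda n: max(1, n // 3),
        "sqrt": lambda n: max(1, math.floor(math.sqrt(n))),
    }
    return strategies[option](n_predictors)

# For example, with 12 predictors:
for opt in ["all", "log2", "onethird", "sqrt", 5]:
    print(opt, resolve_max_variables(opt, 12))
```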

In the Python client we apparently have a default of null, with the behavior defined as:

> Default value is null, which corresponds to the number of predictors divided by 3.

And the VITO back-end implementation of fit_class_random_forest doesn't even use max_variables yet; it just uses the default ("auto") behavior of Spark MLlib's RandomForest.
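For context, Spark MLlib exposes this knob as the featureSubsetStrategy parameter, whose string values happen to line up closely with the spec's enum. A minimal sketch (an assumption for illustration, not the actual VITO implementation) of a translation layer:

```python
# Sketch of how the openeo-processes max_variables values could map onto
# Spark MLlib's featureSubsetStrategy parameter (RandomForestClassifier).
# This mapping is an assumption, not the VITO back-end's actual code.
def to_feature_subset_strategy(max_variables):
    if max_variables is None:
        # Defer to MLlib's own default behavior.
        return "auto"
    if isinstance(max_variables, int):
        # MLlib accepts an explicit feature count as a string, e.g. "7".
        return str(max_variables)
    # The spec's enum values match MLlib's strategy names directly.
    mapping = {"all": "all", "log2": "log2", "onethird": "onethird", "sqrt": "sqrt"}
    return mapping[max_variables]

print(to_feature_subset_strategy(None))        # auto
print(to_feature_subset_strategy(7))           # 7
print(to_feature_subset_strategy("onethird"))  # onethird
```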

Obviously, there are a couple of things to be resolved.
For the process specs of fit_class_random_forest and fit_regr_random_forest I would propose to allow a default value (null) for the max_variables parameter, with the behavior "the back-end is free to choose the best strategy".
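For illustration, one way this proposal could look in the parameter spec (a sketch, not a merged change; the description is abbreviated):

```json
{
    "name": "max_variables",
    "description": "Specifies how many split variables will be used at a node. ...",
    "optional": true,
    "default": null,
    "schema": [
        {
            "type": "integer",
            "minimum": 1
        },
        {
            "type": "string",
            "enum": ["all", "log2", "onethird", "sqrt"]
        },
        {
            "type": "null"
        }
    ]
}
```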

@soxofaan

soxofaan commented Apr 21, 2022

related tickets:

@m-mohr

m-mohr commented Apr 21, 2022

Just to clarify, the proposal was meant to say: Make it required (and as such have no default value, so the user needs to choose). The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.

Required and without default value is how the processes are defined right now. So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?

@m-mohr m-mohr added the ML label Apr 21, 2022
@soxofaan

> The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.

You have a reproducibility problem anyway because of these wildly different implementations. Requiring the user to specify max_variables makes it even less reproducible, because they have to change the value each time they change back-end.

I think an ML/AI user is aware of the multitude of degrees of freedom in ML tech and is not going to expect reproducible models or results when switching to a different back-end with different technology. But by requiring max_variables to be set, even their code/script is not reproducible (because whatever they choose might not work on some back-end). By allowing a default of null, they can at least make the running of their code or algorithm reproducible or interoperable.

Likewise, it is (or will be) annoying for documentation/demo/tutorial purposes if we can't give a generic fit_class_random_forest example snippet that works without an interoperability disclaimer about a detail like max_variables.

> So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?

The inconsistency I'm referring to is between the spec in openeo-processes and the actual implementations as they currently are.

@soxofaan

> it seems weird to have a default that does some kind of a black box thing.

To turn that around: each random forest implementation already has a default for their max_variables equivalent, so wouldn't it be weird that we don't?

I understand that you're worried that the behavior for max_variables=null is potentially different among back-ends, but as a user, when I "use" a default (e.g. by omitting an argument), I implicitly say "I trust the system to pick a good behavior for me and I don't care about those details".
As a user I appreciate this feature not only in an actual RF model implementation (e.g. Spark MLlib or sklearn) but also at a higher, more abstract level like openEO processes.

@soxofaan

soxofaan commented Apr 21, 2022

My argument in short: in this case I think user-friendliness and interoperability (of code) are more important than reproducibility of results (which is practically impossible here anyway, I'm afraid).

@m-mohr

m-mohr commented Apr 21, 2022

Okay, but should we add a default value or remove it completely? #358

@soxofaan

Well, I proposed to drop it in #358 as a quick solution, so that implementers could proceed given the SRR deadlines, with the idea of reintroducing it at a later point once we have a better idea of how to tackle the interoperability issues (string versus int, enum values, defaults, ...).
While the time pressure is different now, I still think we can first drop it and reintroduce it later, unless you think that is silly or overkill.

To give an implementer's data point: in the VITO back-end we practically ignore the max_variables parameter at the moment and just use Spark MLlib's default.

@m-mohr m-mohr added this to the 2.0.0 milestone Feb 1, 2023
@m-mohr m-mohr modified the milestones: 2.0.0, 2.1.0 Mar 10, 2023