
Default for max_variables in fit_class_random_forest and fit_regr_random_forest #365

soxofaan opened this issue Apr 21, 2022 · 7 comments

@soxofaan

I just noticed this inconsistency:

The process specs of fit_class_random_forest and fit_regr_random_forest do not have a default value for the max_variables parameter.
e.g.

```json
"name": "max_variables",
"description": "Specifies how many split variables will be used at a node.\n\nThe following options are available:\n\n- *integer*: The given number of variables are considered for each split.\n- `all`: All variables are considered for each split.\n- `log2`: The logarithm with base 2 of the number of variables are considered for each split.\n- `onethird`: A third of the number of variables are considered for each split.\n- `sqrt`: The square root of the number of variables are considered for each split. This is often the default for classification.",
"schema": [
    {
        "type": "integer",
        "minimum": 1
    },
    {
        "type": "string",
        "enum": [
            "all",
            "log2",
            "onethird",
            "sqrt"
        ]
    }
]
},
```
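To make the options concrete, here is a small illustrative sketch (the helper name and rounding choices are my assumptions, not part of the openeo-processes spec) of how each max_variables value could resolve to a number of split variables for a dataset with a given number of predictors:

```python
import math

# Hypothetical helper: resolve a max_variables option from the process spec
# to a concrete number of split variables, given the total predictor count.
# Rounding and clamping behavior here is an assumption for illustration.
def resolve_max_variables(option, n_predictors):
    if isinstance(option, int):
        if option < 1:
            raise ValueError("integer option must be >= 1")
        # Clamp to the number of available predictors.
        return min(option, n_predictors)
    strategies = {
        "all": lambda n: n,
        "log2": lambda n: max(1, math.floor(math.log2(n))),
        "onethird": lambda n: max(1, n // 3),
        "sqrt": lambda n: max(1, math.floor(math.sqrt(n))),
    }
    return strategies[option](n_predictors)

# For example, with 12 predictors:
for opt in ["all", "log2", "onethird", "sqrt", 5]:
    print(opt, resolve_max_variables(opt, 12))
```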

In the Python client we apparently have a default of null, with the behavior defined as:

> Default value is null, which corresponds to the number of predictors divided by 3.

And the VITO back-end implementation of fit_class_random_forest doesn't even use max_variables yet; it just uses the default ("auto") behavior of Spark MLlib's RandomForest.
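For context, Spark MLlib exposes this knob as the featureSubsetStrategy parameter, whose string values happen to line up closely with the spec's enum. A minimal sketch (an assumption for illustration, not the actual VITO implementation) of a translation layer:

```python
# Sketch of how the openeo-processes max_variables values could map onto
# Spark MLlib's featureSubsetStrategy parameter (RandomForestClassifier).
# This mapping is an assumption, not the VITO back-end's actual code.
def to_feature_subset_strategy(max_variables):
    if max_variables is None:
        # Defer to MLlib's own default behavior.
        return "auto"
    if isinstance(max_variables, int):
        # MLlib accepts an explicit feature count as a string, e.g. "7".
        return str(max_variables)
    # The spec's enum values match MLlib's strategy names directly.
    mapping = {"all": "all", "log2": "log2", "onethird": "onethird", "sqrt": "sqrt"}
    return mapping[max_variables]

print(to_feature_subset_strategy(None))        # auto
print(to_feature_subset_strategy(7))           # 7
print(to_feature_subset_strategy("onethird"))  # onethird
```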

Obviously, there are a couple of things to be resolved.
For the process specs of fit_class_random_forest and fit_regr_random_forest I would propose to allow a default value (null) for the max_variables parameter, with the behavior "the back-end is free to choose the best strategy".
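For illustration, one way this proposal could look in the parameter spec (a sketch, not a merged change; the description is abbreviated):

```json
{
    "name": "max_variables",
    "description": "Specifies how many split variables will be used at a node. ...",
    "optional": true,
    "default": null,
    "schema": [
        {
            "type": "integer",
            "minimum": 1
        },
        {
            "type": "string",
            "enum": ["all", "log2", "onethird", "sqrt"]
        },
        {
            "type": "null"
        }
    ]
}
```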

@soxofaan

soxofaan commented Apr 21, 2022

related tickets:

@m-mohr

m-mohr commented Apr 21, 2022

Just to clarify, the proposal was meant to say: Make it required (and as such have no default value, so the user needs to choose). The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.

Required and without default value is how the processes are defined right now. So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?

@m-mohr m-mohr added the ML label Apr 21, 2022
@soxofaan

> The implementations are wildly different and as such it seems weird to have a default that does some kind of a black box thing. That is not really reproducible then.

You have a reproducibility problem anyway because of these wildly different implementations. Requiring the user to specify max_variables makes it even less reproducible, because they have to change the value each time they change back-end.

I think an ML/AI user is aware of the multitude of degrees of freedom in ML tech and is not going to expect reproducible models or results when switching to a different back-end with different technology. But by requiring max_variables to be set, even their code/script is not reproducible (because whatever they choose might not work on some back-end). By allowing a default of null, they can at least make the running of their code or algorithm reproducible or interoperable.

Likewise, it is (or will be) annoying for documentation/demo/tutorial purposes if we can't give a generic fit_class_random_forest example snippet that works without an interoperability disclaimer about a detail like max_variables.

> So I don't see an inconsistency here? Or was this meant to go into openeo-python-client and the back-end to align?

The inconsistency I'm referring to is between the spec in openeo-processes and the actual implementations as they currently are.

@soxofaan

> it seems weird to have a default that does some kind of a black box thing.

To turn that around: each random forest implementation already has a default for their max_variables equivalent, so wouldn't it be weird that we don't?

I understand that you're worried that the behavior for max_variables=null is potentially different among back-ends, but as a user, when I "use" a default (e.g. by omitting an argument), I implicitly say "I trust the system to pick a good behavior for me and I don't care about those details".
As a user I appreciate this feature not only in an actual RF model implementation (e.g. Spark MLlib or sklearn) but also at a higher, more abstract level like openEO processes.

@soxofaan

soxofaan commented Apr 21, 2022

My argument in short: in this case I think user-friendliness and interoperability (of code) are more important than reproducibility of results (which is practically impossible here anyway, I'm afraid).

@m-mohr

m-mohr commented Apr 21, 2022

Okay, but should we add a default value or remove it completely? #358

@soxofaan

Well, I proposed to drop it in #358 as a quick solution, so that implementers could proceed given the SRR deadlines, with the idea of reintroducing it at a later point once we have a better idea of how to tackle the interoperability issues (string versus int, enum values, defaults, ...).
While the time pressure is different now, I still think we can first drop it and reintroduce it later, unless you think that is silly or overkill.

To give an implementer's data point: in the VITO back-end we practically ignore the max_variables parameter at the moment and just use Spark MLlib's default.

@m-mohr m-mohr added this to the 2.0.0 milestone Feb 1, 2023
@m-mohr m-mohr modified the milestones: 2.0.0, 2.1.0 Mar 10, 2023