
Generalize predict processes for ML models #396
Closed · wants to merge 1 commit

Conversation

@m-mohr (Member) commented Nov 28, 2022

As discussed in #368, a first draft that generalizes the former Random-Forest-specific ML prediction processes into one process for simple class predictions and another for class probabilities. Please review carefully, I don't have an ML background ;-)

@m-mohr m-mohr added this to the 1.3.0 milestone Nov 28, 2022
@m-mohr m-mohr linked an issue Nov 28, 2022 that may be closed by this pull request
@soxofaan (Member) left a comment


A couple of notes and suggestions. However, I'm not a big ML expert either, so it would be good to collect more input from other reviewers who do ML on a daily basis.

```diff
-  "id": "predict_random_forest",
-  "summary": "Predict values based on a Random Forest model",
-  "description": "Applies a Random Forest machine learning model to an array and predict a value for it.",
+  "id": "predict_ml_model",
```
@soxofaan (Member)

predict_ml_model looks a bit weird to me: it reads as if you will be predicting the model (a confusing statement) instead of letting the model predict classes.

Something like predict_class, ml_model_predict, or even ml_predict would feel better.

Note that for the array-, text-, and date-related processes we also use prefix-based naming (array_append, array_apply, date_shift) instead of postfix-based naming.

@m-mohr (Member, Author) Nov 30, 2022

My aim was to align with predict_curve so that in docs they would be listed side by side. Unfortunately, we are not very consistent with prefixes/suffixes (array_apply / date_shift / load_collection / save_result / reduce_dimension)...

Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...

@soxofaan (Member)

The issue with the name predict_class is that the process would work not only for ML classification but also for ML regression, so ideally the term class should be avoided. Unless there is a good reason to have separate inference/predict processes for ML classification and ML regression, but as far as I know that's not the case.

> Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...

To me, "predict" implies "machine learning", so having both "ml" and "predict" in the name is slightly redundant.
However, it might be better for discoverability and self-documenting reasons to keep that bit of redundancy.

(Three resolved comment threads on proposals/predict_ml_model.json)
```diff
@@ -0,0 +1,45 @@
+{
+  "id": "predict_ml_model_probabilities",
```
@soxofaan (Member)

Like above, this process name looks a bit weird to me. I'd prefer something like predict_probabilities, ml_model_predict_probabilities, or ml_predict_probabilities.

@m-mohr (Member, Author) Mar 16, 2023

Thinking about the recent discussion (the prefix is based on the primary input), it should probably be ml_predict_probabilities (or ml_model_predict_probabilities, which is rather long).

(Three resolved comment threads on proposals/predict_ml_model_probabilities.json)
```json
  }
],
"returns": {
  "description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.",
```
@soxofaan (Member)

Suggested change:

```diff
-"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.",
+"description": "The predicted class probabilities. Returns `null` if any of the given values in the array is a no-data value.",
```

I'm not sure about the "Returns `null` if any ..." part: whether that holds depends on the ML model's ability to handle null/no-data values, I think.

@LukeWeidenwalker (Contributor) Dec 1, 2022

This is actually tricky - some thoughts:

- E.g. in deep learning, if I've anticipated missing data, I could have used torch.nan_to_num to replace nan with a carefully chosen value in both training and inference. If I then export that model with e.g. ONNX, that transformation is baked in and my model knows how to deal with missing data. I'm not sure how other frameworks (e.g. sklearn or xgboost) handle this - but it's super important that this value is the same during inference as it was during training!
- If the model handles nans, and the predict_ml_model process just ignores nan values and lets them through, then I get the correct behaviour.
- However, if it fails on nan values, then the only way the user can fix this is to replace nan values with the correct value in an extra openEO process beforehand. To do that, I need to know what that value was during training - and this might not be easily available!
- If the model doesn't handle nans (either nan handling wasn't added to the inference code, or the model simply wasn't trained on samples with nans, etc.), then what happens really depends on the framework. It might just crash, or it might run through, with the nan values subtly impacting the predictions.
- Some options I can think of:
  - Make this a parameter of the process (fail_on_nan or similar): fail by default, and allow overriding if the user knows that their model can handle nans out of the box or is willing to throw the dice.
  - Add information about what to replace nan values with to the ml-model STAC extension and use that if available.

I think as a user my preference would be for this process to assume that the model has already been constructed to handle nans correctly in both training and inference, and therefore not to try to interfere.

Hope this is useful!
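The `fail_on_nan` option sketched above could look roughly like this (a minimal, hypothetical illustration - `predict_with_nan_policy`, `model_predict`, and `nan_fill` are made-up names, not part of any openEO specification; `nan_fill` stands in for a training-time fill value that might come from ml-model STAC metadata):

```python
import numpy as np

def predict_with_nan_policy(model_predict, values, fail_on_nan=True, nan_fill=None):
    # Hypothetical sketch of the proposed `fail_on_nan` parameter.
    # `model_predict` stands in for the loaded ML model's inference function;
    # `nan_fill` for a fill value recorded at training time (e.g. in STAC metadata).
    arr = np.asarray(values, dtype=float)
    if np.isnan(arr).any():
        if nan_fill is not None:
            # Replace NaN with the same value that was used during training.
            arr = np.nan_to_num(arr, nan=nan_fill)
        elif fail_on_nan:
            # Fail by default, as suggested above.
            raise ValueError("input contains no-data values and fail_on_nan=True")
        # else: pass NaN through and trust the model to handle it.
    return model_predict(arr)
```

With `fail_on_nan=False` the NaN values are passed through untouched, matching the "assume the model handles nans" preference expressed above.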

@soxofaan (Member)

If I understand correctly, you agree that it would be better to drop

> Returns `null` if any of the given values in the array is a no-data value.

from the general description of predict_..._probabilities?

@LukeWeidenwalker (Contributor)

> If I understand correctly, you agree that it would be better to drop
>
> > Returns `null` if any of the given values in the array is a no-data value.
>
> from the general description of predict_..._probabilities?

Yeah, exactly!

@soxofaan (Member) commented:
Another side note (not necessarily to tackle in this PR):
Ultimately, I think, predict_curve becomes obsolete in favor of what is currently named predict_ml_model (considering that fit_curve basically returns just another ML model).

@m-mohr m-mohr modified the milestones: 1.3.0, 2.0.0 Feb 1, 2023
@m-mohr m-mohr modified the milestones: 2.0.0, 2.1.0 Mar 10, 2023
@m-mohr (Member, Author) commented Mar 16, 2023

Potentially interesting for "bring your own model": https://onnx.ai/

@m-mohr (Member, Author) commented May 15, 2023

We continue in #441 - We still need to bring some of the comments over.

@m-mohr m-mohr closed this May 15, 2023
m-mohr added a commit that referenced this pull request May 16, 2023
m-mohr added a commit that referenced this pull request May 16, 2023
Successfully merging this pull request may close these issues: predict_class and predict_probabilities

3 participants