
ENH: Improve percentile based indices' descriptions #1050

Merged: bzah merged 43 commits into master from fix/#1047-fix_csdi_and_wsdi_description on May 26, 2022

Conversation

bzah
Contributor

@bzah bzah commented Apr 8, 2022

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • HISTORY.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added
  • bumpversion patch has been called on this branch
  • The relevant author information has been added to .zenodo.json

What kind of change does this PR introduce?

This PR makes the csdi and wsdi indicators use the parameters actually passed to percentile_doy when formatting their descriptions.

This concerns the base percentile value, the length of the rolling window of days, and the period over which percentiles are computed.

A limitation of the current approach is that if percentiles are not computed using percentile_doy, the description will use default, non-configurable values.

Does this PR introduce a breaking change?

Yes, the signatures of all indices and indicators that take a DataArray parameter holding percentiles have been modified.
Previously, the parameter names assumed a specific percentile value, as in t90.
They are now renamed to one of tas_per, tasmax_per, tasmin_per, or pr_per, depending on the expected variable type.
Their type is also changed to PercentileDataArray, which is now returned by percentile_doy.

Other information:

With per = percentile_doy(da_per, window=2, per=[42, 27]), the description (reformatted for GitHub) now looks like:

'description': 
"Annual number of days with at least 6 consecutive days where
 the daily minimum temperature is below the [42, 27]th percentile(s). 
a 2 day(s) window, centred on each calendar day in the ['2015-01-01', '2018-12-31'] period,
 is used to compute the [42, 27]th percentile(s)."

bzah added 2 commits April 8, 2022 15:29
Previously, their definitions were possibly incorrect in the case
where the user would compute percentiles with custom parameters
using `percentile_doy`.

This concerns the base percentile value, the length of the rolling
window of days, and the period over which percentiles are computed.

A limitation of the current approach is that the percentiles **must**
be computed using `percentile_doy` because some metadata is
expected in `DataArray.attrs`.
@bzah bzah requested a review from aulemahal April 8, 2022 14:02
Collaborator

@aulemahal aulemahal left a comment

EDIT (I realized this review was quite coldly worded): Thank you for this work. I have done some work on the Indicator class, trying to generalize it so it's easier for users to understand how to use it. But I feel it has become quite complicated to extend, and the long-term vision is still unclear... So thank you for diving in.

Instead of default_params and preformatted_attrs, I think all the parsing could be done in Indicator.format. This way, it would be easier to have more than one of those percentile thresholds. My best suggestion for now would be to return "N.A." for the cases where the correct attributes can't be parsed (replacing the default_params), and maybe issuing a warning?

xclim/core/calendar.py (outdated, resolved)
xclim/core/indicator.py (outdated, resolved)
xclim/indicators/atmos/_temperature.py (outdated, resolved)
xclim/indices/_multivariate.py (outdated, resolved)
@Zeitsperre Zeitsperre added the "standards / conventions" and "enhancement" labels Apr 8, 2022
Collaborator

@Zeitsperre Zeitsperre left a comment

I just had a few things to mention. I trust @huard and @aulemahal on this. Be sure to mention these call signature changes as breaking changes in the History!

xclim/indicators/atmos/_temperature.py (outdated, resolved)
xclim/indices/_multivariate.py (outdated, resolved)
@bzah
Contributor Author

bzah commented Apr 8, 2022

Just a note on bootstrapping (yeah I love this topic).
It's quite unnecessary to bootstrap percentiles when using non-extreme percentiles such as the 25th or the 75th.

This PR encourages users to use various percentile thresholds on bootstrappable indices.
Given the cost of this algorithm, it would make sense to warn users when they try to bootstrap indices on non-extreme percentiles.

I think this should be addressed in another PR though.

For ref:

[...] This bias is relatively greater for higher percentiles. For example, there would be only one exceedance over the 99th percentile, giving an exceedance rate of 1/150 = 0.67%, which is much smaller than the nominal rate of 1%.
Zhang et al.
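
The arithmetic in the quote can be checked in a few lines. This sketch only uses the figures given in the quoted passage (one exceedance among 150 values, against a nominal rate of 1%); the variable names are illustrative.

```python
# Figures taken from the quoted passage.
sample_size = 150    # values available to estimate the percentile
exceedances = 1      # in-sample exceedances over the 99th percentile
nominal_rate = 0.01  # rate expected for a 99th percentile threshold

observed_rate = exceedances / sample_size
# Fraction of the nominal rate that is lost in-sample.
shortfall = 1 - observed_rate / nominal_rate

print(f"observed: {observed_rate:.2%}, nominal: {nominal_rate:.0%}")
print(f"relative shortfall: {shortfall:.0%}")
```

The in-sample rate falls about a third short of the nominal one, which is the kind of bias bootstrapping is meant to correct for extreme percentiles.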

- Updated metadata of many indices to benefit from
the configurable per params
- Added a long_name to wsdi
- Added # noqa for bootstrap argument
@bzah
Contributor Author

bzah commented Apr 11, 2022

I attempted to create a new InputKind, PERCENTILE_VARIABLE = 2, but it's still a work in progress.

Indicators with multiple percentile DataArray as
inputs were lacking metadata update.
This also ensures the variables in metadata
have a name corresponding to their parameter.
@bzah bzah changed the title from "ENH: Improve csdi/wdsi descriptions" to "ENH: Improve percentile based indices' descriptions" Apr 11, 2022
@bzah
Contributor Author

bzah commented Apr 11, 2022

The last "unit test" failure is a bit mysterious to me.
It's on test_temperature.py, in test_warm_spell_duration_index.

It seems I somehow changed how the output axes are ordered, because now
np.testing.assert_array_equal(out[0, :, 0], np.array([np.nan, 3, 0, 0, np.nan])) raises an error,
but
np.testing.assert_array_equal(out[:, 0, 0], np.array([np.nan, 3, 0, 0, np.nan])) does not...

@aulemahal
Collaborator

aulemahal commented Apr 11, 2022

The last "unit test" failure is a bit mysterious to me.

To me too... I quickly checked and I can't see where a change could have modified the dimension order. This being said, we had similar issues before and the current opinion is that the dimension order is not guaranteed. This unit test was written before those discussions and it is kinda wrong to assume the order.
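
One way to make such a test independent of dimension order is to select by dimension name rather than by position. The sketch below uses a made-up array standing in for the indicator output; the dimension names and shape are assumptions for illustration only.

```python
import numpy as np
import xarray as xr

expected = np.array([np.nan, 3, 0, 0, np.nan])

# Hypothetical stand-in for an indicator output with dims (time, lat, lon).
data = np.zeros((5, 2, 3))
data[:, 0, 0] = expected
out = xr.DataArray(data, dims=("time", "lat", "lon"))

# The same data with a different dimension order.
shuffled = out.transpose("lat", "time", "lon")

# Positional indexing such as out[:, 0, 0] depends on the order above;
# selecting by name gives the same series for both layouts.
for da in (out, shuffled):
    np.testing.assert_array_equal(da.isel(lat=0, lon=0).values, expected)
```

An alternative with the same effect is to call `transpose` with an explicit order before indexing positionally.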

This test was relying on ordered dimensions, which
is discouraged when using xarray.
Now default values are used instead of raising an error.
This also includes the handling of non doy percentiles.
Plus, some french translations were missing for
DAYS_OVER_PRECIP_DOY_THRESH and FRACTION_OVER_PRECIP_DOY_THRESH

"abstract": "Nombre de jours d'une période où la précipitation est au-dessus d'un percentile quotidien et d'un seuil fixe.",
"description": "Nombre {freq:m} de jours où la précipitation au-dessus d'un percentile quotidien. Seuls les jours avec au moins {thresh} sont comptés.",
"long_name": "Nombre de jours pluvieux où la précipitation est au-dessus d'un percentile quotidien"
"title": "Nombre de jours pluvieux où la précipitation est au-dessus du {pr_per_thresh}e percentile",
Contributor Author

Do you say "la précipitation" (singular) in Québec?

Collaborator

Haha, that's funny, we just had this debate. I don't think the choice came from a Québec-France difference. Our reasoning was that "la précipitation" (singular) is synonymous with "total precipitation", that is, the sum of all forms of precipitation. We felt that saying "les précipitations" (plural) implied a potential distinction between the different phases.

That said, if you think one version is better than the other, or if you know of references we could rely on, we're all ears!

Contributor Author

It just struck me as odd when I read it, but I don't think my opinion is very relevant on this topic.

Collaborator

Haha, I understand. But I do insist: I think we would like to know what other research groups think. Do you have translations (official or not) for the icclim indices?

Contributor Author

There is a glossary on Météo-France's DRIAS portal, but it is a bit thin (nothing on the singular vs. plural of "précipitation").

However, I just came across a Météo-France glossary.

On "précipitation", they say in particular:

In meteorology, a "précipitation" is an organized set of liquid or solid water particles in free fall within the atmosphere.

This term is often used in the plural, which reflects the diversity of precipitation types, the most common being rain, drizzle, snow, hail, and sleet; precipitation types also include snow grains, snow pellets, diamond dust, and falls of ice pellets and ice prisms.

[...]

Contributor Author

As for translations of the indices, @pagecp, do you know if we have anything?

Contributor Author

As far as I know, no; we have never worked on translating the indices.

@bzah
Contributor Author

bzah commented Apr 25, 2022

@Zeitsperre the coverage should be better now. I initially commented out test_cli because of an error I get locally on @cli.result_callback(), but I didn't mean to commit that.

xclim/core/utils.py (outdated, resolved)
Using __future__ should be addressed in a specific PR,
after discussing it in an issue.
@bzah
Contributor Author

bzah commented May 16, 2022

@aulemahal is there anything else you would like to discuss about these changes?

Collaborator

@aulemahal aulemahal left a comment

Sorry not to have followed the development of this branch!
Now that I re-read the changes, I'm not sure I understand the need for the PercentileDataArray class.

The annotation is useful so that the base Indicator class knows how to format the param and how to inject the right attributes. But if percentile_doy already sets all the needed attributes, why do we need a separate subclass at all? The stuff in get_metadata could be directly in Indicator.format or in a standalone function in formatting.py, no?

As I understand it, right now, the computation works if the user uses a normal DataArray, but all three metadata entries for the percentile variable are "unknown". Could we cast the input as a PercentileDataArray somewhere in the indicator's pipeline to ensure all available information is added?

Finally, I think it would be useful to add the following line to all percentile parameters' docstrings:

Xclim expects an output from :py:func:`~xclim.core.calendar.percentile_doy`.

(or something along those lines).

xclim/core/formatting.py (outdated, resolved)
xclim/core/indicator.py (outdated, resolved)
Simplified format function.

Co-authored-by: Pascal Bourgault <bourgault.pascal@ouranos.ca>
@bzah
Contributor Author

bzah commented May 16, 2022

@aulemahal

Now that I re-read the changes, I'm not sure I understand the need for the PercentileDataArray class.

I'm quoting you for why I added the PercentileDataArray class:

I am not convinced parsing the history is the safest way to check this. I think I'd rather prefer a new InputKind entry, that's what they were meant for. The problem here is that a "percentile threshold" is not differentiable from a normal input in the world of typing, and thus I think we'll need to use the same workaround as for DateStr.

I'm not sure how else it would be possible to recognize percentile inputs.
However, PercentileDataArray could probably be hidden and not exposed in the API if that's what you meant by
"cast the input as a PercentileDataArray somewhere in the indicator's pipeline".
I'll look into that!


On the other hand, I feel like all the InputKind logic is some sort of typing layered over the Python typing system. I don't think it's really necessary, and IMHO it could be replaced by classes and TypedDict(s). But that's my Java background speaking: if I don't see classes, POJOs and Singletons every now and then, it gets itchy.

bzah added 2 commits May 16, 2022 18:34
- PercentileDataArray is no longer needed to retrieve the attributes used to fill
output metadata (window, climatology_bounds...) of DataArray (such as `tasmin_per`).
- Removed the corresponding InputKind and replaced input validation with attribute and
coordinate recognition.
- Moved percentile_metadata formatter into formatting module.
@bzah bzah force-pushed the fix/#1047-fix_csdi_and_wsdi_description branch from 997dff6 to 98c58d4 Compare May 17, 2022 10:26
@bzah
Contributor Author

bzah commented May 17, 2022

Alright, I went for an in-between implementation.
I removed the PERCENTILE_VARIABLE InputKind and replaced it with a simple logic check to see if a DataArray is compatible with PercentileDataArray.

The PercentileDataArray class still exists, but it is mainly used to easily create a valid percentiles input with the from_da method.
I think it's useful in cases where percentiles are not computed with percentile_doy, as in the precipitation example:

per = pr.quantile(0.8, "time", keep_attrs=True)
per = PercentileDataArray.from_da(per, climatology_bounds=build_climatology_bounds(pr))

In that case, we need to let the user fill in the climatology_bounds parameter, because in the Indicator pipeline we have no way to determine the period over which the percentiles were computed.
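
For readers wondering what such a compatibility check might look like, here is a rough sketch. It is not the code merged in this PR: `looks_like_percentiles` is a hypothetical helper, and the criteria (a `percentiles` coordinate plus a `climatology_bounds` attribute) are assumptions based on the "attributes and coordinate recognition" described in the commit message.

```python
import numpy as np
import xarray as xr


def looks_like_percentiles(da) -> bool:
    """Hypothetical check: a DataArray carrying the metadata a description needs."""
    return (
        isinstance(da, xr.DataArray)
        and "percentiles" in da.coords
        and da.attrs.get("climatology_bounds") is not None
    )


# A plain array: usable for the computation, but not recognized as percentiles.
plain = xr.DataArray(np.arange(3.0), dims="lat")

# The same values tagged by hand with the assumed coordinate and attribute.
tagged = plain.expand_dims(percentiles=[80]).assign_attrs(
    climatology_bounds=["2015-01-01", "2018-12-31"]
)

print(looks_like_percentiles(plain), looks_like_percentiles(tagged))
```

A check of this kind keeps the indicator signature free of a dedicated type while still letting the formatter decide whether the percentile metadata can be trusted.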

@bzah bzah requested a review from aulemahal May 17, 2022 12:00
Collaborator

@aulemahal aulemahal left a comment

This last version works for me!

@bzah bzah merged commit 6723a07 into master May 26, 2022
@bzah bzah deleted the fix/#1047-fix_csdi_and_wsdi_description branch May 26, 2022 16:10
Labels
enhancement, standards / conventions
Development

Successfully merging this pull request may close these issues.

In {cold, warm}_spell_duration_index indicators description, the percentile threshold is not dynamically set.
5 participants