Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper use of loc parameter in gamma,fisk dists (SPI/SPEI) #1720

Merged
merged 38 commits into from
May 2, 2024
Merged

Conversation

coxipi
Copy link
Contributor

@coxipi coxipi commented Apr 18, 2024

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
    • This PR fixes #xyz
  • Tests for the changes have been added (for bug fixes / features)
    • (If applicable) Documentation has been added / updated (for bug fixes / features)
  • CHANGES.rst has been updated (with summary of main changes)
    • Link to issue (:issue:number) and pull request (:pull:number) has been added

What kind of change does this PR introduce?

  • Spei used to rely on an offset to ensure that distribution of values are positive when using distributions such as gamma, fisk. Now, instead, we properly use the loc parameter in those distributions that generalize the distributions and allow negative values. For instance, the gamma disitribution is a 2-parameter function, but in scipy it's generalized to 3 parameters, with a loc parameter. This extends the support of gamma from $x \in [0,\infty]$ to $x \in [\text{loc}, \infty]$.

The code was already compatible with this idea, just had to change _fit_start so it can compute estimates of the parameters in the cases with negatives values.

Does this PR introduce a breaking change?

  • Removing the offset from SPEI. Also, the _fit_start can now compute an estimate of loc which can affect what value we obtain when optimizing 3-parameters distributions. Even when the loc is fixed to 0 with floc=0 in fitkwargs, the estimate of fisk was modified. The previous formulation was obtained by applying more or less the idea in the method of moments, but now I obtained an approximation that's more rigorous.

  • The case of zero inflated distribution has been changed: Now, it is the calibration data that determines the probability of zero, as it should.

  • Method APP can now only be used if fitkwargs['floc'] is given as input by the user.

Other information:

@github-actions github-actions bot added the indicators Climate indices and indicators label Apr 18, 2024
@coxipi
Copy link
Contributor Author

coxipi commented Apr 18, 2024

I started a new branch spi_loc2 because there were many errors hard to track in spi_loc. Opening a new PR, this is the old one: #1653

xclim/indices/_agro.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added the docs Improvements to documenation label Apr 18, 2024
@coxipi
Copy link
Contributor Author

coxipi commented Apr 22, 2024

I was just trying to tie ends together, and was exploring the difference between fitting with zero inflated distributions vs. normal distributions.

I'm considering the case where the dataset is inflated in some small value that I inject, but I don't do any special treatment for this value. I realize that things work well unless the inflated value is 0:

pr0 = open_dataset("sdba/CanESM2_1950-2100.nc")
# Take the first 30 july 1rst 
pr0  = pr0.isel(location=1).isel(time=slice(179,365*30,365)).pr.values
pr0 = np.where(pr0>0, pr0, 1e-6)
min_prs = [-1e-10,-1e-20, 0, 1e-20, 1e-18, 1e-14, 1e-10]
print(["min_pr", "Pr(X>=min_pr) [fraction]"])
for min_pr in min_prs:
    pr = pr0.copy()
    pr[0:11] = min_pr 
    a, loc,scale = sc.stats.gamma.fit(pr)
    print([min_pr, sc.stats.gamma.cdf(min_pr,a,loc=loc,scale=scale)])

yields:

['min_pr', 'Pr(X>=min_pr) [fraction]']
[-1e-10, 0.07865019433968551]
[-1e-20, 0.07258213369500696]
[0, 5.552055328806407e-20]
[1e-20, 0.07787936999249587]
[1e-18, 0.07488806373878135]
[1e-14, 0.08371030972954428]
[1e-10, 0.08298754087251842]

I inflated the datasets, so we could expect 'Pr(X>=min_pr)' to be about 33%. We have approximate fits, so it's ok to have 7-8%, If I increase the bound in sc.stats.gamma.cdf(min_pr,a,loc=loc,scale=scale) say by adding 1e-15 which is still well below other values that were not injected, then I get indeed ~30%, except for the case of 0 precipitations.

and we can clearly see something is off with the 0-inflated case if we plot the pdfs:
image

Using a form of jitter_under_thresh, say that I inflate with precipitation values of 0, but then jitter between 10^(-21) and 10^(-20), I get something more in line with other results.

If I inflate less values, 10 or less, it seems the problem goes away (I also added a small value in the cdf computation, but that doesn't explain the change in behaviour as far I could tell):

import numpy as np
import scipy as sc
import matplotlib.pyplot as plt
pr0 = open_dataset("sdba/CanESM2_1950-2100.nc")
# Take the first 30 july 1rst 
pr0  = pr0.isel(location=1).isel(time=slice(179,365*30,365)).pr.values
pr0 = np.where(pr0>0, pr0, 1e-6)
min_prs = [
    0,
    1e-20,
    1e-17,
    1e-16,
    ]
print(["min_pr", "Pr(X>=min_pr)"])
for min_pr in min_prs:
    pr = pr0.copy()
    pr[0:10] = min_pr
    a, loc,scale = sc.stats.gamma.fit(pr)
    print([min_pr, sc.stats.gamma.cdf(min_pr+1e-21,a,loc=loc,scale=scale)])
    x = np.linspace(0, 1e-15, 1000)
    plt.plot(x,sc.stats.gamma.pdf(x, a, loc=loc, scale=scale),label=min_pr)
    plt.yscale("log")
    plt.xlabel("x")
    plt.ylabel("pdf")
plt.legend(title="min_pr")

image

Going about 11/30 values, then more cases look like the problematic pr==0.

Anyways, it's just worth keeping this in mind. I think the lesson is to be careful when using numbers where the precision is too low, and 0 values are afllicted by this. If we work with the appropriate zero_inflated options or perform some jitter_under_thresh with a very small value, in both cases, this should not be an issue.

CHANGES.rst Outdated Show resolved Hide resolved
@coxipi coxipi marked this pull request as draft April 23, 2024 13:23
@coxipi
Copy link
Contributor Author

coxipi commented Apr 23, 2024

Apparently the PR was never in draft mode? Just wanted to say it's ready for your eyes. ⚠️

@Zeitsperre Zeitsperre marked this pull request as ready for review April 23, 2024 13:32
coxipi and others added 2 commits April 29, 2024 12:21
Co-authored-by: Pascal Bourgault <bourgault.pascal@ouranos.ca>
xclim/indices/stats.py Outdated Show resolved Hide resolved
@Zeitsperre Zeitsperre mentioned this pull request May 2, 2024
5 tasks
xclim/indices/_agro.py Outdated Show resolved Hide resolved
@Zeitsperre Zeitsperre added the priority Immediate priority label May 2, 2024
xclim/indices/_agro.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@aulemahal aulemahal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! thanks a lot !

if method == "APP":
fitkwargs["floc"] = 0
# same offset as in climate indices
offset = convert_units_to("1 mm/d", wb, context="hydro")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
offset = convert_units_to("1 mm/d", wb, context="hydro")
offset = convert_units_to("1000 mm/month", wb, context="hydro")

Isn't this the hardcoded offset of climate_indices ?
https://github.com/monocongo/climate_indices/blob/d108eee982abae06a415e888319b8078af868558/src/climate_indices/indices.py#L280C62-L280C69

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, climate_indices and xclim do match, but with the current datasets used in our test, we stil have negative values with the climate_indices offset. I wll wait for a next PR where we show many comparisons to many other libraries to sort this out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I agree, I didn't have the correct offset here

xclim/indices/stats.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added the approved Approved for additional tests label May 2, 2024
@coxipi coxipi merged commit 4f2e633 into main May 2, 2024
19 checks passed
@coxipi coxipi deleted the spi_loc2 branch May 2, 2024 20:42
@RondeauG RondeauG mentioned this pull request May 14, 2024
5 tasks
RondeauG added a commit to hydrologie/xhydro that referenced this pull request May 14, 2024
<!-- Please ensure the PR fulfills the following requirements! -->
<!-- If this is your first PR, make sure to add your details to the
AUTHORS.rst! -->
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
  - This PR fixes #xyz
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features).
- [x] (If applicable) Tests have been added.
- [x] CHANGES.rst has been updated (with summary of main changes).
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added.

### What kind of change does this PR introduce?

* Tests for the GIS module were too strict.
* Changes were made in `xclim v0.49.0` with how the `gamma` and `fisk`
distributions were initiated in `xclim`, which changed the fitting
results and parametric quantiles. It just so happened that we
arbitrarily used the `gamma` for our tests. To be backwards-compatible,
those tests were changed to the `gumbel_r`.

### Does this PR introduce a breaking change?

- No.

### Other information:
Changes to the `gamma` distribution:
Ouranosinc/xclim#1477
Ouranosinc/xclim#1720
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Approved for additional tests docs Improvements to documenation indicators Climate indices and indicators priority Immediate priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants