Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New python kernels silently give different results in Cross-contour_transport.ipynb #327

Open
schmidt-christina opened this issue Apr 3, 2024 · 29 comments
Labels
🐞 bug Something isn't working 🕹️ hackathon 4.0 ❓ question Further information is requested

Comments

@schmidt-christina
Copy link
Collaborator

schmidt-christina commented Apr 3, 2024

When using a kernel newer than analyis3-23.04, the cross contour transport calculated in this notebook is wrong (see black line compared to blue line where kernel analyis3-23.04 was used).

If anyone is keen to look into what has changed in the new python versions and packages that changes the results, that would be great and please assign yourself. Otherwise I will add a sentence about which kernel to use at the top of the notebook for now and look into it during the COSIMA hackathon.

image

edit: changed kernel from analysis3-22.07 to analysis3-23.04

@adele-morrison
Copy link
Collaborator

Is it possible to tell which step of the code changed? i.e. at which point do the answers diverge?

@adele-morrison
Copy link
Collaborator

And does someone have the list of which packages changed between analysis3-22.07 and analysis3-22.10?

@anton-seaice
Copy link
Collaborator

anton-seaice commented Apr 3, 2024

Are you getting any new warnings? My guess is there is some change in masking or handing of NaNs.

@adele-morrison : This is the literal answer to your Q. But @dsroberts can advise better I think.

EDIT: Diff for 23.04 to 23.07:
conda_diff_04_to_07.txt

@schmidt-christina
Copy link
Collaborator Author

schmidt-christina commented Apr 3, 2024

And does someone have the list of which packages changed between analysis3-22.07 and analysis3-22.10?

Sorry, I just looked into it again. The change happened between analysis3-23.04 and analysis3-23.07

@anton-seaice There is no warning using the newer kernels except for one on chunking affecting performance.

@dsroberts
Copy link

Hi @schmidt-christina, @anton-seaice . The only major issue I'm aware of between those two revisions is the change in default chunking that affects the ERA5 data set. I've not seen that change any numerical output, just how long it takes to get there. Would be interesting to look into, which notebook is it?

@adele-morrison
Copy link
Collaborator

@schmidt-christina
Copy link
Collaborator Author

Probably easiest to test in the notebook Hannah created when she came across this issue (it's a condensed version of the notebook Adele pointed to with only the necessary code for the selection of T/S on the contour and binning into density classes), but it won't allow me to upload it here. Should I send it via email @dsroberts ?

@adele-morrison
Copy link
Collaborator

@schmidt-christina, nice instructions here from @aidanheerdegen on how to share code:
https://forum.access-hive.org.au/t/how-to-share-your-jupyter-notebook/692

Maybe others can help also if we can make it public.

@schmidt-christina
Copy link
Collaborator Author

I uploaded the notebook here

@navidcy navidcy added 🐞 bug Something isn't working ❓ question Further information is requested labels Apr 3, 2024
@navidcy
Copy link
Collaborator

navidcy commented Apr 3, 2024

Good catch @schmidt-christina and thanks for raising this issue! This is quite alarming tbh that something would silently start giving different results! Could you confirm that you don't have any warnings imposed to be silent (sometimes we do that in the beginning because warnings can overwhelm us -- I know I do at least!)?

@navidcy
Copy link
Collaborator

navidcy commented Apr 3, 2024

I think we should post this in the forum -- others might be getting strange results?
I'm quite alarmed actually...

I was also getting different results in #322 but I (thought I) was convinced it was because the methods for computing meridional heat transport depend on the reference temperature used. I will revisit the PR with this in mind and try using an older conda env....

@navidcy navidcy changed the title New python kernels give wrong results in Cross-contour_transport.ipynb New python kernels silently give different results in Cross-contour_transport.ipynb Apr 3, 2024
@access-hive-bot
Copy link

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/newest-conda-env-kernels-silently-change-model-output-analysis-results/2010/1

@navidcy
Copy link
Collaborator

navidcy commented Apr 3, 2024

@schmidt-christina (or @salvajoe, @willaguiar, @hrsdawson), it would be extremely useful if you could strip down the notebook to what we call Minimal Working Example (MWE). That is, the minimum code for which the problem arises using kernel A and not for kernel B. The notebook at the moment has a lot of things -- are all of them needed to showcase the problem? Perhaps they are, I don't know. If they are then leave them!

[Usually this process of stripping down to a minimal working example helps a lot pin-pointing the issue and, sometimes, even finding the solution!]

I know I'm asking you to do work but could you please? This seems an important issue. Results that change silently after upgrading packages are very very sneaky and insidious and could result for a lot of struggles by some others who could be probably thinking "oh.. this must be my fault for not getting it right...". Or this could result to publishing wrong results!

On another note. What do lines:

import logging
logging.captureWarnings(True)
logging.getLogger('py.warnings').setLevel(logging.ERROR)

do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?

@dsroberts
Copy link

Hi all. I've managed to reproduce the incorrect behaviour in the analysis3/23.04 kernel. The issue seems to be that the sortby() function used in line 17 of cell 11 results in different output between the two kernels:
contour_index_2304_sorted
vol_trans_across_contour.sortby(contour_ordering)[0].plot() analysis3/23.04
contour_index_2307_sorted
vol_trans_across_contour.sortby(contour_ordering)[0].plot() analysis3/23.07
contour_index_2304_unsorted
vol_trans_across_contour[0].plot() analysis3/23.04

So it seems like sortby() is doing nothing in the newer kernels. Unfortunately I don't know why that is yet, nor how to fix it. I just wanted to give you all a quick update before I head out for the rest of the day.

@schmidt-christina
Copy link
Collaborator Author

are all of them needed to showcase the problem?

All of the calculations in the notebook are needed (transport across an isobath binned into density classes is quite a long and complicated calculation). calculations are only done for one kernel and then we just load data saved from my 2023 study to compare against.

On another note. What do lines:

import logging
logging.captureWarnings(True)
logging.getLogger('py.warnings').setLevel(logging.ERROR)

do? Do they suppress warnings? If we remove them do you get any meaningful warning perhaps?

I tested that and removed the lines, but still no additional warnings.

@navidcy
Copy link
Collaborator

navidcy commented Apr 4, 2024

thanks @schmidt-christina for checking!

thanks @dsroberts for figuring out -- holding onto my chair for the what else might this investigation bring about and the end of the episode with dataarray.sortby()

@dsroberts
Copy link

OK, I think I've figured it out. The issue of why .sortby() doesn't do anything boils down to these lines here:

# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean
lon_along_contour = contour_ordering.x_ocean
contour_index_array = np.arange(1, len(contour_ordering)+1)
# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array

tl;dr - the solution is to add .copy() onto the end of lat_along_contour=... and lon_along_contour= i.e.

lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()

The reason for this is a bit of a journey.

The difference in behaviour between the two conda environments comes down to the construction of the DataArrays on the right hand side of the equals sign on first 2 lines above. In 23.04, the DataArray created by running contour_ordering.y_ocean contains a copy of all of the objects needed from the original contour_ordering DataArray, including the pandas.MultiIndex object responsible for linking the contour_index coordinate to the data. In 23.07, this is not the case, the pandas.MultiIndex object associated with contour_ordering.y_ocean is a reference to that object in the contour_ordering DataArray. You can see this by using the is operator:

### 23.04
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
False
### 23.07
>>> lat_along_contour.get_index('contour_index') is contour_ordering.get_index('contour_index')
True

So when the following code runs:

lat_along_contour.coords['contour_index'] = contour_index_array

The original pandas.MultiIndex is deleted, as it believed to no longer be necessary, which means that in 23.07 the contour_ordering also loses its index, and has it replaced with a default pandas.RangeIndex. This explains why .sortby(contour_index) does nothing. The .sortby() function first attempts to align the contour_index DataArray to the vol_trans_across_contour DataArray, and it aligns the coordinates successfully in both cases, however, because none of the elements in the newly-created pandas.RangeIndex appears in the in the vol_trans_across_contour DataArray's pandas.MultiIndex in 23.07, it is unable to change the data along with the contour_index coordinate. Funnily enough, the actual np.lexsort() call within .sortby() does nothing in your case, the xr.align() call does all the work. The screenshot below demonstrates this,
align

Note how the coordinates of contour_order and b have been changed after the alignment in both notebooks (analysis3-23.04 on the left, analysis3-23.07 on the right), but array on the right has not been changed.

Anyway, the solution is to add .copy() onto the end of lat_along_contour=... and lon_along_contour= i.e.

lat_along_contour = contour_ordering.y_ocean.copy()
lon_along_contour = contour_ordering.x_ocean.copy()

This a copy the objects within the contour_ordering.y_ocean DataArray, and therefore a copy of the original pandas.MultiIndex is created and assigned to lat_along_contour. This means the original index object in contour_ordering is not removed when the coords in lat_along_contour and lon_along_contour are modified, and xr.align() and therefore .sortby() works as expected. I think this behaviour is erroneous, as an object has been deleted whilst there are still active references to it, so I think its worth me submitting an issue on the xarray github, assuming it hasn't already been raised.

@adele-morrison
Copy link
Collaborator

Nice!

@navidcy
Copy link
Collaborator

navidcy commented Apr 12, 2024

Omg. Excellent detective work

@AndyHoggANU
Copy link
Contributor

Nice work @dsroberts !!

@dsroberts
Copy link

dsroberts commented Apr 12, 2024

So I've dug a little further and tried to make a reproducer, and it turns out that xarray does warn you that this might cause problems.

>>> c.coords['contour_index'] = np.arange(1,len(yc)+1)
/jobfs/113635825.gadi-pbs/ipykernel_1298317/4250819915.py:1: FutureWarning: updating coordinate 'contour_index' with a PandasMultiIndex would leave the multi-index level coordinates ['y_ocean', 'x_ocean'] in an inconsistent state. This will raise an error in the future. Use `.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])` before assigning new coordinate values.
  c.coords['contour_index'] = np.arange(1,len(yc)+1)

And it does, the PandasMultiIndex is indeed in an inconsistent state. However, when I run import cosima_cookbook as cc, that warning disappears. This line has the effect of suppressing everything coming out of warnings.warn() (not just from cosima cookbook) unless a user knows to initialise the py.warnings logger. Which, to be fair, the original author of this notebook attempted to do, but did not add a handler, so the warnings were going into the void. Placing the following

import logging
console_handler = logging.StreamHandler(sys.stderr)
logger = logging.getLogger("py.warnings")
logger.addHandler(console_handler)

In the first cell of @schmidt-christina's notebook causes that warning to appear in a few places. Given this, I think the best place to raise an issue would be over on the COSMIA cookbook repo, as xarray did try to warn us, even on analysis3-23.04, except warnings were turned off. The fix suggested in the warning (.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])) does work for the notebook i.e.:

# get lat and lon along contour, useful for plotting later:
lat_along_contour = contour_ordering.y_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])
lon_along_contour = contour_ordering.x_ocean.drop_vars(['contour_index', 'y_ocean', 'x_ocean'])
contour_index_array = np.arange(1, len(contour_ordering)+1)
# don't need the multi-index anymore, replace with contour count and save
lat_along_contour.coords['contour_index'] = contour_index_array
lon_along_contour.coords['contour_index'] = contour_index_array

and is a more appropriate solution than using .copy(). Dropping vars removes any copy or reference to the pandas.MultiIndex object from contour_ordering so it can't get messed up when updating the coordinates.

@navidcy
Copy link
Collaborator

navidcy commented Apr 15, 2024

@schmidt-christina would you like to open a PR to fix this example?

@navidcy
Copy link
Collaborator

navidcy commented Apr 16, 2024

@adele-morrison I'm hoping we fix this before the hackathon; this is quite urgent. At the moment we have online an example that silently gives wrong results and we are potentially creating nightmares for people.

@dsroberts did an amazing detective job upstairs figuring the issue out

could someone open a PR and fix the example?

cc @willaguiar, @schmidt-christina, @salvajoe

@navidcy
Copy link
Collaborator

navidcy commented Apr 16, 2024

Even a PR that simply deletes the example is better than having something there which is wrong and makes people loose sleep or potentially deteriorates mental health...

If nobody has the capacity to fix it I can open a PR to delete the example and whenever we have the capacity we can put it back corrected.

@adele-morrison
Copy link
Collaborator

Even a PR that simply deletes the example is better

Hmmm, this seems a little drastic. What about a PR that says analysis3-23.04 or earlier is needed? And then we can fix properly at the hackathon?

@navidcy
Copy link
Collaborator

navidcy commented Apr 16, 2024

A caveat message is also a good solution! And if we do that, then we can also link to this issue.

(Deleting doesn't mean much... the notebooks always stay in the Github history! We just don't want this to be advertised as an example that we present to people as "working out of the box" so that they take on and modify to continue their work.)

@anton-seaice
Copy link
Collaborator

I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

The method to extract the points along the contour from matplotlib is failing. There should a second 'tomato' coloured line in this plot from the Select the contour step.

Screenshot 2024-04-17 at 5 00 18 PM

@adele-morrison
Copy link
Collaborator

I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

@anton-seaice shall we raise a new issue?

@anton-seaice
Copy link
Collaborator

I just ran this using analysis3-24.01 and there is a different error / it is failing earlier.

@anton-seaice shall we raise a new issue?

I don't mind, but it would "hopefully" get fixed in the same PR so I wouldn't ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug Something isn't working 🕹️ hackathon 4.0 ❓ question Further information is requested
Development

No branches or pull requests

7 participants