
Netcdf thread safety #5061

Closed
wants to merge 1 commit into from

Conversation

@pp-mo (Member) commented Nov 11, 2022

Ultimately, to address #5016.

For now, trial code for adding thread-safety and file-locking,
in both the saver and the loader.

This trial code is rather rough and not fully thought through.

At present, all of the tests are locking up + not completing.
See notes below.

learning so far

the netcdf library is not considered thread-safe

Actually, strictly it never was, but it got worse with 1.6.1, which now releases the GIL during file operations,
according to at least one authoritative comment elsewhere.

So in addition to any per-file locking that might be considered desirable (e.g. to control access to a file when doing lazy arithmetic between different netcdf variables), we do need a global lock to control all our access to the netcdf library, if any of it can be multi-threaded.
... and with Dask, presumably, access can always be multi-threaded, since data chunks are processed by separate tasks --> different worker threads.

process-based scheduling

like this
Rather to my surprise, lots of our existing code breaks if you simply switch to process-based Dask workers (as are used by default for Bag computations).
I had rather thought that this would "just work" ™️, but it's now clear that a simple "da.store" operation is not serialisable unless you give it a serialisable lock -- just as we already saw in the lazy saving investigation.
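For reference, `dask.utils.SerializableLock` solves this by identifying the lock with a token rather than with the OS-level lock object, which cannot cross a pickle boundary. A stdlib-only analogue of that idea (an illustrative re-implementation, not dask's actual code):

```python
import pickle
import threading

class SerializableLock:
    """Picklable lock: identity travels as a token; the real threading.Lock
    lives in a per-process registry and is looked up (or created) on arrival."""
    _locks = {}  # token -> threading.Lock, shared within one process

    def __init__(self, token="netcdf-lock"):
        self.token = token
        self._locks.setdefault(token, threading.Lock())

    def __enter__(self):
        self._locks[self.token].acquire()
        return self

    def __exit__(self, *exc_info):
        self._locks[self.token].release()

    def __reduce__(self):
        # Only the token is pickled, never the OS lock object.
        return (SerializableLock, (self.token,))

lock = SerializableLock()
clone = pickle.loads(pickle.dumps(lock))  # round-trips, unlike threading.Lock
```

Passing such a lock to `da.store(..., lock=...)` is what makes the store graph serialisable to process-based workers.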

single-threaded

Like this
Solves the above problems, and probably all others, but it's not a practical solution.
It has already proved useful in tracking down where code uses locks wrongly, i.e. failing to release before re-acquiring --> simply hangs when single-threaded.
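The "failing to release before re-acquiring" hang can be demonstrated with a plain non-reentrant `threading.Lock`; using a timeout on the second acquire turns the would-be deadlock into a detectable failure:

```python
import threading

lock = threading.Lock()  # non-reentrant, like the global netCDF lock

lock.acquire()
# Re-acquiring without first releasing would block forever when
# single-threaded; a timeout makes the bug visible instead.
reacquired = lock.acquire(timeout=0.1)
lock.release()
```

Here `reacquired` comes back `False` -- with the multi-threaded scheduler, another thread might eventually have released the lock, masking the bug; single-threaded, it hangs immediately, which is exactly why that scheduler is useful for debugging.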

open/closing of netcdf Dataset objects

This is somewhat simpler than I had feared.
For the reading process, a netCDF4.Dataset is opened by creating a CFReader, from which cubes are extracted, and is held open: in the new code, that goes along with the netcdf library's global access lock.
During the process, the load code and 'actions' refer to the dataset and its Variables, but once all the cubes have been produced, the CFReader is discarded. The dataset is then closed, and the only remaining references are in the NetcdfDataProxy objects used to build lazy arrays -- these contain a filepath + variable-name, and so don't refer to the 'original' Dataset object.

The only tricky part was that the existing code relies on CFReader.__del__ to close the dataset.
That is in fact unreliable, and may not happen when you expect or want it: effectively it depends on how the garbage collector behaves, which is not a constant. I also found that this is not just a technical detail, but really does happen.

really simple example.. click to expand
    >>> class MyC:
    ...   def __init__(self, x):
    ...     self.x = x
    ...   def __str__(self):
    ...     return f"<MyC id={id(self)} (x={self.x})>"
    ...   def __del__(self):
    ...     print('delete: ', self)
    ...
    >>> myc1 = MyC(1)
    >>> myc2 = MyC(2)
    >>> myc2 = None
    delete:  <MyC id=140005502283392 (x=2)>
    >>> myc1
    <__main__.MyC object at 0x7f55923aef70>
    >>> myc1 = 13
    #
    # ??? why no delete message here ???
    #
    >>> del myc1
    >>> exit()
    delete:  <MyC id=140005502283632 (x=1)>

So, relying on __del__ is actually unnecessary, and I reckon we would be better off using the CFReader as a context manager, so it can automatically (always) close after operation. That is what I am now doing here.
BTW this is exactly how netcdf.saver.Saver is already used.
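Sketched, that context-manager pattern looks like this (with the dataset stubbed out; the real CFReader is of course far more involved):

```python
class _FakeDataset:
    """Stand-in for netCDF4.Dataset."""
    def __init__(self, path):
        self.closed = False

    def close(self):
        self.closed = True

class CFReader:
    """Sketch: close the dataset deterministically via a context manager,
    instead of relying on the (unreliable) __del__ mechanism."""
    def __init__(self, path):
        self.path = path
        self.dataset = None

    def __enter__(self):
        self.dataset = _FakeDataset(self.path)
        return self

    def __exit__(self, *exc_info):
        self.dataset.close()
        return False  # never swallow exceptions
```

Unlike `__del__`, the `with` block guarantees the close happens at a known point, even if cube extraction raises.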

nature of errors

So far, various operations which succeed nevertheless produce a stream of HDF error messages.
It's not clear how that can be (if it succeeds), and being multi-threaded, it's naturally hard to debug.
Maybe there's a fault in my code design, or for some reason something else needs guarding.

lazy streaming netcdf-to-netcdf in save operations

There were some very obvious things going wrong with save-load-save roundtrips, like this one.

This highlights that when streaming from lazy netcdf data to save results, the saving process must release the global netcdf-access lock, since otherwise the lazy input compute will lock up.
Which means both the lazy-write and the lazy-read operations must independently ensure that the lock is used when reading/writing.
N.B. with synthetic data, as in these tests, you only see this if you save-load-save: the problem is unique to cases where the saved source data is lazy and in a netcdf file. It's a particularly specific problem.

it is necessary that the saving process releases the global netcdf-access lock

This makes the changes in the Saver code much more tricky + extensive than those in CFReader.
The load_cubes/CFReader action is quite simple, since it involves no "compute" at all, only straight-line code to snapshot the file + create data-proxies, at the end of which the file is simply closed.

Scope

Not considering process-parallel execution yet (dask config "scheduler='processes'", or distributed), which still does not work.
However, I have provided per-file locking on read, and an eventual solution may need to adopt many of the ideas from #5130

@trexfeathers (Contributor)

Tests have been running for 30mins and counting...

@pp-mo (Member Author) commented Nov 11, 2022

Tests have been running for 30mins and counting...

It is still deadlocking in some operations, to be investigated + fixed.
That's known about, but I removed the offending one in integration/test_netcdf.py.
Other than that, at least those tests now succeed -- but with the unexplained warning messages, as previously described.

I told you it was WIP!

@pp-mo pp-mo changed the title Mostly working load+save. Netcdf thread safety Nov 14, 2022
@pp-mo (Member Author) commented Nov 14, 2022

Notes on existing errors

I have been focussing on the set of tests in iris.tests.integration.test_netcdf.py, as that has proved a useful testbed
(it is where errors emerged in the original CI problems).

  • I've skipped out all the tests there which now "just work".

  • The remaining (unskipped) tests are producing "interesting" behaviour -- in those cases the test succeeds, but a lot of HDF errors appear, like this...

HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 1:
  #000: H5A.c line 528 in H5Aopen_by_name(): can't open attribute
    major: Attribute
    minor: Can't open object
  #001: H5VLcallback.c line 1091 in H5VL_attr_open(): attribute open failed
    major: Virtual Object Layer
    minor: Can't open object
  #002: H5VLcallback.c line 1058 in H5VL__attr_open(): attribute open failed
    major: Virtual Object Layer
    minor: Can't open object
 . . .

N.B. although these are described as errors, something nevertheless "appears to work"? Could it be failing to read data, but not checking actual values?

  • This test was commented out because it (still) actually hangs -- a lot of roundtrip tests used to do that before the locks were added, but this shows that something there has still been missed.

Update Mon 2022-11-14

Discussed with @bjlittle.
Particularly this test, which triggers a stream of HDF errors even though it is "only" loading cubes
(i.e. not fetching or streaming lazy data).
Confirmed that data is fetched from NetcdfDataProxy-s, even when "just" loading like this.
@bjlittle points out that the post-load merge operation is a likely culprit.
@pp-mo also noted that the CFVariable holds "back-door" accesses to netCDF4 Variable objects, potentially even after the file is closed. That should be addressed (though it might not be "the" problem here). We then tried constructing a list of cubes and only yielding them after the CFReader context exits, but that did not produce any more errors, as we might have expected from this theory.

@trexfeathers (Contributor)

We're currently hoping this can be a bugfix for v3.3 and v3.4.

@trexfeathers trexfeathers assigned trexfeathers and unassigned pp-mo Nov 21, 2022
@trexfeathers trexfeathers requested review from schlunma and removed request for schlunma November 22, 2022 10:27
trexfeathers pushed a commit to trexfeathers/iris that referenced this pull request Nov 22, 2022
@trexfeathers trexfeathers mentioned this pull request Dec 6, 2022
3 tasks
@pp-mo (Member Author) commented Dec 13, 2022

Close in favour of #5095

@pp-mo pp-mo closed this Dec 13, 2022
pp-mo added a commit that referenced this pull request Feb 20, 2023
* Unpin netcdf4.

* Temporarily enable GHA on this branch.

* Temporarily enable GHA on this branch.

* Temporarily enable GHA on this branch.

* Experiment to disable wheel CI on forks.

* Disable segfaulting routines.

* More temporary changes to get CI passing.

* More temporary changes to get CI passing.

* Finessed segfault skipping.

* Bring in changed from #5061.

* Re-instate test_load_laea_grid.

* Adaptations to get the tests passing.

* Use typing.Mapping instead.

* Get doctests passing.

* CF only resolve non-url filenames.

* Confirm thread safety fixes.

* Remove dummy assignment.

* Restored plot_nemo What's New entry.

* _add_aux_factories temporarily release global lock.

* Remove per-file locking.

* Remove remaining test workarounds.

* Remove remaining comments.

* Correct use of CFReader context manager.

* Refactor for easier future maintenance.

* Rename netcdf _thread_safe, add header.

* Full use of ThreadSafeAggregators.

* Full use of ThreadSafeAggregators.

* Remove remaining imports of NetCDF4.

* Test to ensure netCDF4 is via _thread_safe module.

* More refined netcdf._thread_safe classes.

* _thread_safe docstrings.

* Restore original NetCDF code where possible.

* Revert changes to 2.3.rst.

* Update lockfiles.

* Additions to _thread_safe.py

* Remove temporary CI shims.

* New locking stategy for NetCDFDataProxy.

* NetCDFDataProxy simpler use of netCDF4 lock.

* Update lock files.

* Go back to using a Threading Lock.

* Remove superfluous pass commands in test_cf.py.

* Rename _thread_safe to _thread_safe_nc.

* Rename thread safe classes to be 'Wrappers'.

* Better contained getattr and setattr pattern.

* Explicitly name netCDF4 module in _thread_safe_nc docstring.

* Better docstring for _ThreadSafeWrapper.

* Better comment about THREAD_SAFE_FLAG.

* list() wrapping within _GLOBAL_NETCDF4_LOCK, to account for generators.

* More accurate thread_safe docstrings in netcdf.saver.

* Split netcdf integration tests into multiple modules.

* Tests for non-thread-safe NetCDF behaviour.

* Docstring accuracy.

* Correct use of dask config set (context manager).

* Update dependencies.

* Review - don't need the first-class import of iris.tests.

* Better name for the loading test.

* Better selection of data to load.

* What's New entry.

* Improve tests.

* Update lock files.

* Increase chunking on test_save.

---------

Co-authored-by: Patrick Peglar <patrick.peglar@metoffice.gov.uk>
trexfeathers added a commit to trexfeathers/iris that referenced this pull request Feb 21, 2023
@pp-mo pp-mo deleted the nc161_fix branch April 28, 2023 13:52