Bnb/dh refactor #220

Merged
merged 433 commits into from
Nov 5, 2024

bnb32
Collaborator

@bnb32 bnb32 commented Jun 23, 2024

Ok, here we go...

sup3r/preprocessing was previously just data handlers and batch handlers, essentially. Now we have Loaders, Extracters, Derivers, and Cachers, which are composed in sup3r.preprocessing.data_handlers.factory to build objects similar to the old DataHandlers. These do basically everything the old handlers used to do, except for training / batching related routines like sampling and normalization. Loaders just load netcdf / h5 data into an xr.Dataset-like container. Extracters extract spatiotemporal regions of data. Derivers derive new features from raw feature data. Cachers, well, they cache data to either h5 or netcdf depending on the extension of the output file provided.
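To make the composition idea concrete, here is a minimal sketch of a load / extract / derive / cache chain. Everything in it is hypothetical: the `Toy*` classes, their method names, and `toy_data_handler` are stand-ins written for this illustration and are not the sup3r API. Data is held in plain dicts of nested lists rather than xr.Dataset-like objects, and the cacher only reports which format it would pick instead of writing a file.

```python
# Toy stand-ins only: none of these names come from sup3r.
from dataclasses import dataclass


@dataclass
class ToyLoader:
    """Return raw data; here just a copy of an in-memory feature dict."""
    source: dict

    def load(self):
        return {name: [row[:] for row in grid]
                for name, grid in self.source.items()}


@dataclass
class ToyExtracter:
    """Cut the same rectangular region out of every feature grid."""
    rows: slice
    cols: slice

    def extract(self, data):
        return {name: [row[self.cols] for row in grid[self.rows]]
                for name, grid in data.items()}


class ToyDeriver:
    """Derive a new feature (windspeed) from raw features (u, v)."""

    def derive(self, data):
        out = dict(data)
        out["windspeed"] = [
            [(u ** 2 + v ** 2) ** 0.5 for u, v in zip(u_row, v_row)]
            for u_row, v_row in zip(data["u"], data["v"])
        ]
        return out


class ToyCacher:
    """Choose an output format from the file extension (no file is written)."""

    def cache(self, data, path):
        if path.endswith(".h5"):
            fmt = "h5"
        elif path.endswith(".nc"):
            fmt = "netcdf"
        else:
            raise ValueError(f"Unsupported cache extension: {path}")
        return fmt, sorted(data)


def toy_data_handler(source, rows, cols, cache_path=None):
    """Compose the four steps, in the spirit of a data handler factory."""
    data = ToyLoader(source).load()
    data = ToyExtracter(rows, cols).extract(data)
    data = ToyDeriver().derive(data)
    cached = ToyCacher().cache(data, cache_path) if cache_path else None
    return data, cached


source = {"u": [[3, 0, 0], [0, 3, 6]], "v": [[4, 0, 0], [0, 4, 8]]}
data, cached = toy_data_handler(source, slice(0, 2), slice(0, 2), "out.h5")
```

Because each step is its own small object, any one of them can be swapped out or used alone, which is the main point of splitting the old handlers apart.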

In sup3r/preprocessing we additionally have Samplers and BatchQueues. These are composed in sup3r.preprocessing.batch_handlers.factory to build objects similar to the old BatchHandlers. These do basically everything that the old batch handlers used to do, with some exceptions. The most notable exception is probably that they don't split data into training and validation sets. Instead, BatchHandler objects take "collections" of data-handler-like objects (these can be DataHandlers, Extracters, Derivers, etc.) for both training and validation, and a separate batch queue is used for each. Samplers simply contain an xr.Dataset-like object and sample that data as an iterator. BatchQueue objects interface with samplers to keep a queue full of batches / samples while models are training.
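The sampler / queue split can be sketched with nothing but the standard library. The names `ToySampler` and `ToyBatchQueue`, and all of their arguments, are assumptions made up for this example and do not mirror the real sup3r signatures. The sampler is an endless iterator over random windows, and the queue uses a background thread to stay stocked with batches while the consumer (the training loop) works.

```python
# Toy stand-ins only: none of these names come from sup3r.
import queue
import random
import threading


class ToySampler:
    """Endless iterator that draws random fixed-length windows from a list."""

    def __init__(self, data, sample_len, seed=0):
        self.data = data
        self.sample_len = sample_len
        self.rng = random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        start = self.rng.randrange(len(self.data) - self.sample_len + 1)
        return self.data[start:start + self.sample_len]


class ToyBatchQueue:
    """Background thread that keeps a bounded queue stocked with batches."""

    def __init__(self, sampler, batch_size, n_batches, max_queued=4):
        self.sampler = sampler
        self.batch_size = batch_size
        self.n_batches = n_batches
        self._queue = queue.Queue(maxsize=max_queued)
        self._thread = threading.Thread(target=self._fill, daemon=True)
        self._thread.start()

    def _fill(self):
        for _ in range(self.n_batches):
            batch = [next(self.sampler) for _ in range(self.batch_size)]
            self._queue.put(batch)  # blocks while the queue is full
        self._queue.put(None)  # sentinel: no more batches

    def __iter__(self):
        # Single pass only: the queue is drained as it is iterated.
        while True:
            batch = self._queue.get()
            if batch is None:
                return
            yield batch


# Training and validation data are supplied separately, one queue each.
train_queue = ToyBatchQueue(ToySampler(list(range(100)), 5, seed=1),
                            batch_size=3, n_batches=4)
val_queue = ToyBatchQueue(ToySampler(list(range(100, 120)), 5, seed=2),
                          batch_size=3, n_batches=2)
train_batches = list(train_queue)
val_batches = list(val_queue)
```

Note that nothing here splits one dataset into training and validation parts; the caller hands in two separate collections, matching the behavior described above.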

All these smaller objects (loaders, extracters, derivers, samplers) are built on top of xr.Dataset-like objects (sup3r.preprocessing.accessor.Sup3rX and sup3r.preprocessing.base.Sup3rDataset), which serve as the familiar .data attribute for data and batch handlers. Sup3rDataset is wrapped around Sup3rX to provide an interface for the "dual" dataset objects contained by dual handlers, and it acts exactly like Sup3rX when datasets are not dual. Sup3rX is an xr.Dataset "accessor" class, which is the recommended way to extend xr.Datasets (as opposed to subclassing). These Sup3rX / Sup3rDataset objects act similarly to xr.Datasets but with extended functionality. The tests in tests/data_wrappers/ show how to interact with these objects.
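The "wrapper that behaves like its member when not dual" idea can be illustrated without xarray. This is only a sketch under stated assumptions: `ToySingle` and `ToyDataset` are hypothetical names, the delegation rules are a guess at the general pattern, and the real Sup3rX / Sup3rDataset classes are considerably richer. The point is that the wrapper holds one or two members by composition and forwards lookups to them, rather than subclassing the underlying container.

```python
# Toy stand-ins only: none of these names come from sup3r or xarray.
class ToySingle:
    """Stand-in for a single dataset: feature name -> list of values."""

    def __init__(self, **features):
        self.features = dict(features)

    def __getitem__(self, name):
        return self.features[name]

    @property
    def shape(self):
        first = next(iter(self.features.values()))
        return (len(first),)


class ToyDataset:
    """Hold one or two members; with one member, act just like that member."""

    def __init__(self, **members):
        if not 1 <= len(members) <= 2:
            raise ValueError("Expected one or two member datasets")
        self._members = members

    @property
    def is_dual(self):
        return len(self._members) == 2

    def __getitem__(self, key):
        values = tuple(m[key] for m in self._members.values())
        return values if self.is_dual else values[0]

    def __getattr__(self, name):
        # Only reached when normal lookup fails; private names are never
        # forwarded, which also avoids recursion before _members is set.
        if name.startswith("_"):
            raise AttributeError(name)
        values = tuple(getattr(m, name) for m in self._members.values())
        return values if self.is_dual else values[0]


single = ToyDataset(high_res=ToySingle(u=[1, 2, 3, 4]))
dual = ToyDataset(low_res=ToySingle(u=[1.5, 3.5]),
                  high_res=ToySingle(u=[1, 2, 3, 4]))
```

With one member, `single["u"]` and `single.shape` return exactly what the member returns; with two, each lookup returns a tuple with one entry per member.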

Since the fundamental dataset objects are now xr.Dataset-like, they can use dask arrays to store data. This means we don't need to load data into memory until we need the result of a computation. ForwardPassStrategy and ForwardPass have been updated accordingly: we can lazily load the full input dataset and then index the data handler .data attribute to select generator input chunks, all before loading anything into memory. BatchHandler objects have a mode argument, which can be set to either lazy (load batches into memory only when they are sent out for training) or eager (load .data into memory upon handler initialization).
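The difference between the two modes can be shown with a small stand-in for a lazily indexed array. This sketch assumes nothing about dask or sup3r internals: `ToyLazyArray`, `ToyBatchHandler`, `read_range`, and `get_batch` are hypothetical names, and the "disk" is just a Python list with a log of which ranges were read.

```python
# Toy stand-ins only: none of these names come from sup3r or dask.
STORE = list(range(1000))  # pretend this lives on disk
read_log = []              # records every (start, stop) range actually read


def read_range(start, stop):
    read_log.append((start, stop))
    return STORE[start:stop]


class ToyLazyArray:
    """Remember which range is wanted; read it only on .compute()."""

    def __init__(self, reader, start, stop):
        self._reader = reader
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        if not isinstance(index, slice) or index.step not in (None, 1):
            raise TypeError("This sketch supports contiguous slices only")
        lo, hi, _ = index.indices(len(self))
        hi = max(hi, lo)
        # Indexing returns another lazy object; nothing is read yet.
        return ToyLazyArray(self._reader, self._start + lo, self._start + hi)

    def compute(self):
        return self._reader(self._start, self._stop)


class ToyBatchHandler:
    """mode='eager' reads everything at init; mode='lazy' reads per batch."""

    def __init__(self, data, batch_len, mode="lazy"):
        if mode not in ("lazy", "eager"):
            raise ValueError(f"Unknown mode: {mode}")
        self.mode = mode
        self.batch_len = batch_len
        self.data = data.compute() if mode == "eager" else data

    def get_batch(self, start):
        window = self.data[start:start + self.batch_len]
        return window.compute() if self.mode == "lazy" else window


lazy = ToyBatchHandler(ToyLazyArray(read_range, 0, len(STORE)), 4, "lazy")
lazy_batch = lazy.get_batch(10)
lazy_reads = list(read_log)    # only the batch window was read

read_log.clear()
eager = ToyBatchHandler(ToyLazyArray(read_range, 0, len(STORE)), 4, "eager")
eager_batch = eager.get_batch(10)
eager_reads = list(read_log)   # the full store was read once, at init
```

Both handlers return the same batch. The lazy one touches only the four requested values, while the eager one pays the full read up front and serves later batches from memory.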

Tests have been added for all new preprocessing modules and lots of documentation / notes have been added throughout. Tests should hopefully provide good examples of use patterns for these new objects.

@bnb32 bnb32 force-pushed the bnb/dh_refactor branch 10 times, most recently from ebb154c to bfe2f9f Compare June 27, 2024 17:34
@bnb32 bnb32 marked this pull request as ready for review June 27, 2024 18:17
@bnb32 bnb32 force-pushed the bnb/dh_refactor branch 4 times, most recently from 53d1c66 to bbc4af1 Compare July 1, 2024 15:58
@bnb32 bnb32 force-pushed the bnb/dh_refactor branch 4 times, most recently from 59b9817 to a546b27 Compare July 19, 2024 20:07
grantbuster and others added 27 commits October 24, 2024 16:37
added simple test on cc batching for daily boundaries
…ing, etc would not be applied to data loaded from cache.
…. added tests for chunks=None with height interp derivation
…el_check keys. the latter is default False, as this takes a long time since it has to load arrays into memory to compute min / max levels. ) Modified the linear interpolation method to use the 2 closest levels rather than the two closest levels which also happen to be above and below the requested level. This speeds up the interpolation by orders of magnitude.
…nks = auto and then load only the rasterized data into memory.
@bnb32 bnb32 merged commit 6ea8113 into main Nov 5, 2024
12 checks passed
@bnb32 bnb32 deleted the bnb/dh_refactor branch November 5, 2024 20:40
4 participants