Bnb/dh refactor #220

bnb32 · 2024-06-23T02:25:49Z

Ok, here we go...

sup3r/preprocessing was previously just data handlers and batch handlers, essentially. Now we have Loaders, Extracters, Derivers, and Cachers, which are composed in sup3r.preprocessing.data_handlers.factory to build objects similar to the old DataHandlers. These do basically everything the old handlers used to do, except for training / batching related routines like sampling and normalization. Loaders load netcdf / h5 data into an xr.Dataset-like container. Extracters extract spatiotemporal regions of data. Derivers derive new features from raw feature data. Cachers cache data to either h5 or netcdf, depending on the extension of the output file provided.
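To make the composition idea concrete, here is a minimal toy sketch of chaining a loader, an extracter, and a deriver inside a factory function. Every name below (`ToyLoader`, `ToyExtracter`, `ToyDeriver`, `toy_data_handler`) is hypothetical and uses plain Python lists; it is not the sup3r API or its signatures, and the caching step is left out.

```python
import math


# Hypothetical toy stand-ins; none of these are sup3r classes or signatures.
class ToyLoader:
    """Load raw features into a dict of feature name -> list of values."""

    def __init__(self, raw):
        self.data = {name: list(vals) for name, vals in raw.items()}


class ToyExtracter:
    """Keep only the values inside a (start, stop) index window."""

    def __init__(self, loader, window):
        start, stop = window
        self.data = {k: v[start:stop] for k, v in loader.data.items()}


class ToyDeriver:
    """Derive requested features (here only windspeed from u and v)."""

    def __init__(self, extracter, features):
        data = dict(extracter.data)
        if "windspeed" in features:
            data["windspeed"] = [
                math.hypot(u, v) for u, v in zip(data["u"], data["v"])
            ]
        # Keep only the features that were asked for.
        self.data = {name: data[name] for name in features}


def toy_data_handler(raw, window, features):
    """Compose the three steps, loosely mimicking a factory-built handler."""
    return ToyDeriver(ToyExtracter(ToyLoader(raw), window), features)


handler = toy_data_handler(
    {"u": [3.0, 0.0, 6.0, 1.0], "v": [4.0, 2.0, 8.0, 1.0]},
    window=(0, 3),
    features=["windspeed"],
)
print(handler.data)
```

Each step only consumes the `.data` of the previous one, which is what lets the pieces be used on their own or swapped out.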

In sup3r/preprocessing we additionally have Samplers and BatchQueues. These are composed in sup3r.preprocessing.batch_handlers.factory to build objects similar to the old BatchHandlers. These do basically everything that the old batch handlers used to do, with some exceptions. The most notable exception is probably that they don't split data into training and validation sets. Instead, BatchHandler objects take "collections" of data-handler-like objects (these can be DataHandlers, Extracters, Derivers, etc) for both training and validation, and a separate batch queue is used for each. Samplers simply contain an xr.Dataset-like object and sample that data as an iterator. BatchQueue objects interface with samplers to keep a queue full of batches / samples while models are training.
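The sampler / queue relationship can be sketched roughly as below, using only the standard library. `ToySampler` and `ToyBatchQueue` are hypothetical names and this is not how sup3r implements its queues; the sketch only shows a sampler acting as an iterator and a background thread keeping a bounded queue full, with separate queues for training and validation data.

```python
import queue
import random
import threading


# Hypothetical toy classes; not the sup3r Sampler / BatchQueue implementations.
class ToySampler:
    """Iterate over random fixed-length windows of a 1D series."""

    def __init__(self, data, sample_len, seed=0):
        self.data = data
        self.sample_len = sample_len
        self.rng = random.Random(seed)

    def __iter__(self):
        return self

    def __next__(self):
        start = self.rng.randrange(len(self.data) - self.sample_len + 1)
        return self.data[start:start + self.sample_len]


class ToyBatchQueue:
    """Keep a bounded queue full of batches using a background thread."""

    def __init__(self, sampler, batch_size, n_batches, max_queued=4):
        self.sampler = sampler
        self.batch_size = batch_size
        self.n_batches = n_batches
        self.queue = queue.Queue(maxsize=max_queued)
        self.thread = threading.Thread(target=self._fill, daemon=True)

    def _fill(self):
        for _ in range(self.n_batches):
            batch = [next(self.sampler) for _ in range(self.batch_size)]
            self.queue.put(batch)  # blocks while the queue is full

    def __iter__(self):
        # Single-use: the background thread can only be started once.
        self.thread.start()
        for _ in range(self.n_batches):
            yield self.queue.get()
        self.thread.join()


# Separate samplers and queues for training and validation data.
train = ToyBatchQueue(
    ToySampler(list(range(100)), sample_len=5, seed=1),
    batch_size=8,
    n_batches=3,
)
val = ToyBatchQueue(
    ToySampler(list(range(100, 120)), sample_len=5, seed=2),
    batch_size=8,
    n_batches=1,
)
train_batches = list(train)
val_batches = list(val)
print(len(train_batches), len(val_batches))
```

Since training and validation each get their own sampler and queue here, no train / validation split is needed inside the handler itself.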

All these smaller objects (loaders, extracters, derivers, samplers) are built on top of xr.Dataset-like objects (sup3r.preprocessing.accessor.Sup3rX and sup3r.preprocessing.base.Sup3rDataset), which serve as the familiar .data attribute for data and batch handlers. Sup3rDataset wraps Sup3rX to provide an interface for the "dual" dataset objects contained by dual handlers, and acts exactly like Sup3rX when datasets are not dual. Sup3rX is an xr.Dataset "accessor" class, which is the recommended way to extend xr.Datasets (as opposed to subclassing). These Sup3rX / Sup3rDataset objects act similarly to xr.Datasets but with extended functionality. The tests in tests/data_wrappers/ show how to interact with these objects.

Since the fundamental dataset objects are now xr.Dataset-like, they can use dask arrays to store data, so we don't need to load data into memory until we need the result of a computation. ForwardPassStrategy and ForwardPass have been updated accordingly: we can lazy load the full input dataset and then index the data handler .data attribute to select generator input chunks, all before loading into memory. BatchHandler objects have a mode argument which can be set to either lazy (load batches into memory only when they are sent out for training) or eager (load .data into memory upon handler initialization).
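The lazy / eager distinction can be illustrated with a toy handler that counts how many values it reads from its source. This is a plain-Python illustration of the two modes, not dask and not the sup3r BatchHandler; `ToySource` and `ToyHandler` are hypothetical names.

```python
# Hypothetical toy classes illustrating lazy vs eager loading; not dask or sup3r.
class ToySource:
    """Pretend file-backed array that counts how many values were read."""

    def __init__(self, values):
        self._values = list(values)
        self.values_read = 0

    def read(self, start, stop):
        chunk = self._values[start:stop]
        self.values_read += len(chunk)
        return chunk


class ToyHandler:
    """Serve chunks either from memory (eager) or straight from the source (lazy)."""

    def __init__(self, source, n_values, mode="lazy"):
        if mode not in ("lazy", "eager"):
            raise ValueError(f"mode must be 'lazy' or 'eager', got {mode!r}")
        self.source = source
        self.mode = mode
        # Eager mode loads everything up front; lazy mode loads nothing yet.
        self._in_memory = source.read(0, n_values) if mode == "eager" else None

    def get_chunk(self, start, stop):
        if self._in_memory is not None:
            return self._in_memory[start:stop]
        return self.source.read(start, stop)


lazy_source = ToySource(range(1000))
lazy = ToyHandler(lazy_source, 1000, mode="lazy")
reads_after_lazy_init = lazy_source.values_read
lazy_chunk = lazy.get_chunk(10, 20)
reads_after_lazy_chunk = lazy_source.values_read

eager_source = ToySource(range(1000))
eager = ToyHandler(eager_source, 1000, mode="eager")
reads_after_eager_init = eager_source.values_read
eager_chunk = eager.get_chunk(10, 20)

print(reads_after_lazy_init, reads_after_lazy_chunk, reads_after_eager_init)
```

Both modes return the same chunk; the difference is only when the read happens and how much is held in memory.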

Tests have been added for all new preprocessing modules, and lots of documentation / notes have been added throughout. The tests should provide good examples of use patterns for these new objects.


…el_check keys. The latter defaults to False, since it takes a long time: it has to load arrays into memory to compute min / max levels. ) Modified the linear interpolation method to use the two closest levels, rather than the two closest levels that also happen to sit above and below the requested level. This speeds up the interpolation by orders of magnitude.
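As a rough illustration of the "two closest levels" idea for a single vertical column, here is a pure-Python sketch. The function name `lin_interp_closest` and its signature are hypothetical, and the real method works on arrays rather than lists.

```python
def lin_interp_closest(levels, values, target):
    """Linearly interpolate to `target` using the two levels closest to it.

    Hypothetical single-column sketch; not the sup3r implementation.
    """
    # Rank level indices by distance to the requested level.
    order = sorted(range(len(levels)), key=lambda i: abs(levels[i] - target))
    i0, i1 = order[0], order[1]
    x0, x1 = levels[i0], levels[i1]
    y0, y1 = values[i0], values[i1]
    # Straight line through the two closest points, evaluated at the target.
    return y0 + (y1 - y0) * (target - x0) / (x1 - x0)


levels = [10.0, 40.0, 100.0, 200.0]
values = [5.0, 6.5, 8.0, 9.0]

bracketed = lin_interp_closest(levels, values, 80.0)  # closest: 100 and 40
same_side = lin_interp_closest(levels, values, 45.0)  # closest: 40 and 10
print(bracketed, same_side)
```

When the two closest levels bracket the target (80 here), this matches ordinary bracketing interpolation. When both lie on the same side (45 here, closest to 40 and 10), the line is extended past the nearer level, so the result can differ slightly from the bracketing version.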


bnb32 force-pushed the bnb/dh_refactor branch 10 times, most recently from ebb154c to bfe2f9f June 27, 2024 17:34

bnb32 marked this pull request as ready for review June 27, 2024 18:17

bnb32 requested review from castelao and grantbuster June 27, 2024 18:17

bnb32 force-pushed the bnb/dh_refactor branch 4 times, most recently from 53d1c66 to bbc4af1 July 1, 2024 15:58

bnb32 force-pushed the bnb/dh_refactor branch 4 times, most recently from 59b9817 to a546b27 July 19, 2024 20:07

bnb32 added 9 commits July 23, 2024 15:36

some upper -> lower case changes to just reduce some local warnings. added pytest.warns() catches for some intentional checks.

bc6dc6b

fix: deriver caching test

d8b8f9f

removing some np.array_equal vs dask warnings

32ef4de

tf run eagerly to get accurate pytest-cov report on solar model

7e53679

removed duplicate arg

6532057

fix: missing doc string

605b6d6

\b escape for nicer click help message formatting

9657848

linting

37d98e9

removed funky sample counter test

af3f6f3

grantbuster and others added 27 commits October 24, 2024 16:37

Merge pull request #238 from NREL/gb/check_daily_batching

d719957

added simple test on cc batching for daily boundaries

Checks added to caching and derivations so that coarsening, time rolling, etc would not be applied to data loaded from cache.

157c102

updates to era_downloader from masked_fwp branch

524dd11

Merge pull request #239 from NREL/bnb/caching_fixes

ef17fde

Bnb/caching fixes

fix: get_coords return 2d coord arrays when lat and lon have a single time step

b03b838

doc string for data handler: added chunks=None loads data into memory. added tests for chunks=None with height interp derivation

91a37ed

removed map blocks from lin interp method. was producing the wrong shape in some cases.

0e95307

small tweak in _lin_interp

b4a754e

removed pressure -> level alias in methods

89fe26a

missed cherry pick for fixed height parsing.

2c82e8c

height parsing fix

508fadc

cdsapi url update. cachers additions with dataset specific attributes

a194f28

if the Rasterizer gets chunks = None it will init the Loader with chunks = auto and then load only the rasterized data into memory.

3f8802f

got blockwise operations working with _lin_interp. squeaking out a little more speed.

b8d75ec

fixed broken attr caching with nested dictionaries

9ac7518

added delta min kwarg to presrat bias correction fwp function

afe7d0b

added logging statement to k factor in case range is set

a37b1e9

fixed performance issue with data indexing for bias calculation - faster when loaded

5dcc98d

added option for delta_range in qdm and presrat bias transform functions

f6c7992

added info log for pre load data

39aeb03

head node shouldnt need to pre-load with chunks=None

6672d4b

bump rex requirement for delta_range kwargs for QDM functions

ba3b226

log shape of data being pre loaded into mem by sup3rx

8085a43

bug fix for esoteric edge case with missing leap day in source data

e5df5d0

added better exception if QDM/presrat returns NaN values

6e46d6c

Merge pull request #240 from NREL/gb/bc_kwargs

b8ebeeb

Gb/bc kwargs

grantbuster approved these changes Nov 5, 2024

bnb32 merged commit 6ea8113 into main Nov 5, 2024
12 checks passed

bnb32 deleted the bnb/dh_refactor branch November 5, 2024 20:40
