
Continued data flow work and object store dataset types #145

Merged: 20 commits from launch_jobs/forcings_3 into NOAA-OWP:master, Mar 16, 2022

Conversation

robertbartel (Contributor):

Primary changes include:

  • Further refinements to the abstract Dataset and DatasetManager types
  • Adjustments to the DataRequirement metadata type
  • Addition of minio as package dependency for dmod.modeldata lib package
  • Addition of ObjectStoreDataset and ObjectStoreDatasetManager types (a sketch of the underlying minio calls follows below)
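For context, the new manager builds on the minio client library; a minimal sketch of the kinds of object store calls involved, assuming a dataset maps to a bucket (the endpoint, credentials, and names here are placeholders, not the actual implementation):

    from minio import Minio

    # Placeholder endpoint and credentials, not real deployment config.
    client = Minio("object-store:9000", access_key="minioadmin",
                   secret_key="minioadmin", secure=False)

    # A dataset is backed by a bucket; its files are stored as objects.
    bucket = "example-dataset"
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)

    # Upload a local file as an object in the dataset's bucket.
    client.fput_object(bucket, "forcings/cat-27.csv", "/tmp/cat-27.csv")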

Review threads (outdated, resolved):
  python/lib/modeldata/dmod/modeldata/data/dataset.py
  python/lib/modeldata/dmod/modeldata/data/meta_data.py
# Diff context: recurse into subdirectories, pushing their files without
# re-running checks, followed by the add_data signature under review.
for directory in [d for d in dir_path.iterdir() if d.is_dir()]:
    self._push_files(bucket_name, directory, recursive, bucket_root, do_checks=False)

def add_data(self, dataset_name: str, **kwargs) -> bool:
hellkite500 (Member):
It might be worth adding support for iterating over a list of file names/paths and adding each to the dataset. I could see this being a common pattern: using glob to pattern-match certain data on the client side and wanting to push just those files (via a list of names) into the dataset. Of course, the client could iterate and call add_data for each individual file, but it probably wouldn't be hard to support an iterable-typed kwarg either.
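A sketch of the two usage patterns being compared, assuming manager is an already-initialized ObjectStoreDatasetManager; the 'file' kwarg is referenced later in this PR's commits, while the 'files' kwarg is hypothetical:

    from pathlib import Path

    # Client-side glob to select just the files of interest.
    matched = sorted(Path("local_data").glob("*.csv"))

    # Pattern this PR supports: iterate and add each file individually.
    for f in matched:
        manager.add_data(dataset_name="example-dataset", file=f)

    # Suggested alternative (hypothetical kwarg, not implemented here):
    # manager.add_data(dataset_name="example-dataset", files=matched)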

robertbartel (Contributor, Author):

I considered that, but I decided against it.

The object store dataset and its manager support a simulated directory structure by encoding the structure into the object names. To make that work, there has to be a defined "bucket root" directory when adding files, with only a file's ancestor directory structure below that bucket root level being encoded.

If we add all the files in a directory, we can safely assume they should all be added relative to the same bucket root. But we couldn't make that kind of assumption for an arbitrary list of files, so we'd also need a list of bucket root values (unless we don't care about losing the original directory structure for each file, but I think we do). We could require a list of files and a list of bucket roots, but I thought that would make the interface too messy, and that it would be better to leave it to something else to provide that convenience if it were wanted.
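A sketch of the encoding being described (paths and names here are illustrative only):

    from pathlib import Path

    # A single, known bucket root when adding a whole directory.
    bucket_root = Path("/data/example-dataset")
    file_path = Path("/data/example-dataset/forcings/2021/cat-27.csv")

    # Only the ancestor structure below the bucket root is kept in the
    # object name, preserving a simulated directory layout.
    object_name = str(file_path.relative_to(bucket_root))
    # object_name == 'forcings/2021/cat-27.csv'

    # An arbitrary list of files has no single obvious bucket root, so
    # preserving each file's structure would need a parallel list of roots.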

I'm open to talking through this more, though, if you want (either as part of this PR or in later ones).

hellkite500 (Member):

I was assuming the client didn't need the directory structure if they were adding a list of arbitrary files. If they did, they could pass the Path to those files, which would get encoded into the object names and put in the same root bucket.

Commits:

  • Renaming to test_data_requirement.py after making the DataRequirement type concrete and removing the CatchmentDataRequirement type.
  • Updating tests after the recent redesign that removed the CatchmentDataRequirement subtype.
  • Adding a new DatasetUser type and functions within DatasetManager for managing known users of a Dataset.
  • Adjusting the implementation of Dataset to use DataDomain, rather than separate properties for an (optional) time range and data format.
  • Adding two functions, delete_data and get_data, to DatasetManager that are basically empty shells, each with a commented-out @abstractmethod decorator and a TODO comment to add it later and implement in subclasses.
  • Fixing the "Returns" portion of the data_domain property docstring, which was more appropriate for data_format, and fixing the short description of the data_format property, which was more appropriate for data_domain.
  • Updating the is_time_series_index and time_series_index properties, primarily to support caching of the time_series_index property value.
  • Using bucket_root instead of add_relative_to in the ObjectStoreDataset add_files function, to make its name consistent with analogous params in other related functions (i.e., of the ObjectStoreDatasetManager).
  • Handling an edge case with an exception if the kwargs 'file' and 'directory' are both present (see the sketch after this list).
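A sketch of that edge-case check (the exception type and message are illustrative; only the mutual exclusion itself is described in the commit):

    def add_data(self, dataset_name: str, **kwargs) -> bool:
        # 'file' and 'directory' are mutually exclusive ways to supply data;
        # fail fast if both are given (exact exception type/message may differ).
        if 'file' in kwargs and 'directory' in kwargs:
            raise ValueError("add_data got both 'file' and 'directory' kwargs")
        ...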
robertbartel (Contributor, Author):

@hellkite500 I've pushed the commits I mentioned in the review conversations above.

FWIW, the failing test is in the metrics package, which is not in scope for these changes. It is showing up as a check failure now because of the merging-in of #134, whereas before it (incorrectly) was not showing as a failure.

Given where this and some subsequent draft PRs stand, I think it's reasonable to go ahead and approve this one despite that particular check failure, but if you object, we can wait until an issue for it has been opened and resolved.

hellkite500 (Member):

> FWIW, the failing test is in the metrics package, which is not in scope for these changes. It is showing up as a check failure now because of the merging-in of #134, whereas before it (incorrectly) was not showing as a failure.

See #149 for these failing tests.

hellkite500 (Member) left a review comment:

This should be a good starting point. We can always add functionality in future iterations.

robertbartel merged commit 933e158 into NOAA-OWP:master on Mar 16, 2022.
robertbartel deleted the launch_jobs/forcings_3 branch on March 16, 2022 at 18:29.
Labels: enhancement (New feature or request), maas (MaaS Workstream)
Project status: Done

Successfully merging this pull request may close these issues:

  • Confirm/standardize and document internal storage for hydrofabric data

2 participants