
Continued data flow work and object store dataset types #145

Merged: 20 commits from launch_jobs/forcings_3 into NOAA-OWP:master, Mar 16, 2022

Conversation

robertbartel (Contributor):

Primary changes include:

  • Further refinements to the abstract Dataset and DatasetManager types
  • Adjustments to the DataRequirement metadata type
  • Addition of minio as package dependency for dmod.modeldata lib package
  • Addition of ObjectStoreDataset and ObjectStoreDatasetManager types (a sketch of the underlying minio calls follows below)
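For context, the new manager builds on the minio client library; a minimal sketch of the kinds of object store calls involved, assuming a dataset maps to a bucket (the endpoint, credentials, and names here are placeholders, not the actual implementation):

    from minio import Minio

    # Placeholder endpoint and credentials, not real deployment config.
    client = Minio("object-store:9000", access_key="minioadmin",
                   secret_key="minioadmin", secure=False)

    # A dataset is backed by a bucket; its files are stored as objects.
    bucket = "example-dataset"
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)

    # Upload a local file as an object in the dataset's bucket.
    client.fput_object(bucket, "forcings/cat-27.csv", "/tmp/cat-27.csv")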

Review threads (outdated, resolved):
  python/lib/modeldata/dmod/modeldata/data/dataset.py
  python/lib/modeldata/dmod/modeldata/data/meta_data.py
# Diff context: recurse into subdirectories, pushing their files without
# re-running checks, followed by the add_data signature under review.
for directory in [d for d in dir_path.iterdir() if d.is_dir()]:
    self._push_files(bucket_name, directory, recursive, bucket_root, do_checks=False)

def add_data(self, dataset_name: str, **kwargs) -> bool:
hellkite500 (Member):
It might be worth adding support for iterating over a list of file names/paths and adding each to the dataset. I could see this being a common pattern: using glob to pattern-match certain data on the client side and wanting to push just those files (via a list of names) into the dataset. Of course, the client could iterate and call add_data for each individual file, but it probably wouldn't be hard to support an iterable-typed kwarg either.
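A sketch of the two usage patterns being compared, assuming manager is an already-initialized ObjectStoreDatasetManager; the 'file' kwarg is referenced later in this PR's commits, while the 'files' kwarg is hypothetical:

    from pathlib import Path

    # Client-side glob to select just the files of interest.
    matched = sorted(Path("local_data").glob("*.csv"))

    # Pattern this PR supports: iterate and add each file individually.
    for f in matched:
        manager.add_data(dataset_name="example-dataset", file=f)

    # Suggested alternative (hypothetical kwarg, not implemented here):
    # manager.add_data(dataset_name="example-dataset", files=matched)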

robertbartel (Contributor, Author):

I considered that, but I decided against it.

The object store dataset and its manager support a simulated directory structure by encoding the structure into the object names. To make that work, there has to be a defined "bucket root" directory when adding files, with only a file's ancestor directory structure below that bucket root level being encoded.

If we add all the files in a directory, we can safely assume they should all be added relative to the same bucket root. But we couldn't make that kind of assumption for an arbitrary list of files, so we'd also need a list of bucket root values (unless we don't care about losing the original directory structure for each file, but I think we do). We could require a list of files and a list of bucket roots, but I thought that would make the interface too messy, and that it would be better to leave it to something else to provide that convenience if it were wanted.
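A sketch of the encoding being described (paths and names here are illustrative only):

    from pathlib import Path

    # A single, known bucket root when adding a whole directory.
    bucket_root = Path("/data/example-dataset")
    file_path = Path("/data/example-dataset/forcings/2021/cat-27.csv")

    # Only the ancestor structure below the bucket root is kept in the
    # object name, preserving a simulated directory layout.
    object_name = str(file_path.relative_to(bucket_root))
    # object_name == 'forcings/2021/cat-27.csv'

    # An arbitrary list of files has no single obvious bucket root, so
    # preserving each file's structure would need a parallel list of roots.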

I'm open to talking through this more, though, if you want (either as part of this PR or in later ones).

hellkite500 (Member):

I was assuming the client didn't need the directory structure if they were adding a list of arbitrary files. If they did, they could pass the Path to those files, which would get encoded into the object names and put in the same root bucket.

Commits:

  • Renaming to test_data_requirement.py after making the DataRequirement type concrete and removing the CatchmentDataRequirement type.
  • Updating tests after the recent redesign that removed the CatchmentDataRequirement subtype.
  • Adding a new DatasetUser type and functions within DatasetManager for managing known users of a Dataset.
  • Adjusting the implementation of Dataset to use DataDomain, rather than separate properties for an (optional) time range and data format.
  • Adding two functions, delete_data and get_data, to DatasetManager that are basically empty shells, each with a commented-out @abstractmethod decorator and a TODO comment to add it later and implement in subclasses.
  • Fixing the "Returns" portion of the data_domain property docstring, which was more appropriate for data_format, and fixing the short description of the data_format property, which was more appropriate for data_domain.
  • Updating the is_time_series_index and time_series_index properties, primarily to support caching of the time_series_index property value.
  • Using bucket_root instead of add_relative_to in the ObjectStoreDataset add_files function, to make its name consistent with analogous params in other related functions (i.e., of the ObjectStoreDatasetManager).
  • Handling an edge case with an exception if the kwargs 'file' and 'directory' are both present (see the sketch after this list).
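A sketch of that edge-case check (the exception type and message are illustrative; only the mutual exclusion itself is described in the commit):

    def add_data(self, dataset_name: str, **kwargs) -> bool:
        # 'file' and 'directory' are mutually exclusive ways to supply data;
        # fail fast if both are given (exact exception type/message may differ).
        if 'file' in kwargs and 'directory' in kwargs:
            raise ValueError("add_data got both 'file' and 'directory' kwargs")
        ...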
robertbartel (Contributor, Author):

@hellkite500 I've pushed the commits I mentioned in the review conversations above.

FWIW, the failing test is in the metrics package, which is not in scope for these changes. It is showing up as a check failure now because of the merging-in of #134, whereas before it (incorrectly) was not showing as a failure.

Given where this and some subsequent draft PRs stand, I think it's reasonable to go ahead and approve this one despite that particular check failure, but if you object, we can wait until an issue for it has been opened and resolved.

hellkite500 (Member):

> FWIW, the failing test is in the metrics package, which is not in scope for these changes. It is showing up as a check failure now because of the merging-in of #134, whereas before it (incorrectly) was not showing as a failure.

See #149 for these failing tests.

hellkite500 (Member) left a review comment:

This should be a good starting point. We can always add functionality in future iterations.

robertbartel merged commit 933e158 into NOAA-OWP:master on Mar 16, 2022.
robertbartel deleted the launch_jobs/forcings_3 branch on March 16, 2022 at 18:29.
Labels: enhancement (New feature or request), maas (MaaS Workstream)
Project status: Done

Successfully merging this pull request may close these issues:

  • Confirm/standardize and document internal storage for hydrofabric data

2 participants