Continued data flow work and object store dataset types #145
Conversation
python/lib/modeldata/dmod/modeldata/data/object_store_dataset.py (outdated review threads, resolved)
```python
for directory in [d for d in dir_path.iterdir() if d.is_dir()]:
    self._push_files(bucket_name, directory, recursive, bucket_root, do_checks=False)

def add_data(self, dataset_name: str, **kwargs) -> bool:
```
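For context, the fragment above recurses into subdirectories while reusing the same bucket root. A minimal standalone sketch of that pattern (the function name, signature, and the return-a-list behavior are assumptions for illustration, not the actual DMOD implementation, which would upload each file instead):

```python
from pathlib import Path
from typing import List, Optional

def push_files(bucket_name: str, dir_path: Path, recursive: bool = True,
               bucket_root: Optional[Path] = None) -> List[str]:
    """Collect object names for all files under dir_path, recursing if requested.

    Returns the object names that would be pushed; a real implementation
    would upload each file to the named bucket instead.
    """
    root = bucket_root if bucket_root is not None else dir_path
    pushed = []
    for f in sorted(p for p in dir_path.iterdir() if p.is_file()):
        # Encode the file's path relative to the bucket root into the object name
        pushed.append(str(f.relative_to(root)))
    if recursive:
        for d in sorted(p for p in dir_path.iterdir() if p.is_dir()):
            # Reuse the same bucket root so nested paths stay relative to it
            pushed.extend(push_files(bucket_name, d, recursive, root))
    return pushed
```

Because the recursive calls pass the original root rather than the subdirectory, nested files keep their full relative structure in the object names.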
It might be worth adding support for iterating over a list of file names/paths and adding them to the dataset. I could see this being a common pattern: using glob to pattern match certain data on the client side and wanting to push just those (via a list of names) into the dataset. Of course, the client could iterate and call add_data for each individual file, but it probably wouldn't be hard to support an iterable-typed kwarg either.
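The suggestion above might look roughly like this (a sketch only; the `files` kwarg and the `_add_single` helper are hypothetical names, not the real DMOD API):

```python
from pathlib import Path

added = []  # records what a real manager would push

def _add_single(dataset_name: str, file: Path) -> bool:
    # Hypothetical per-file push; here we just record what would be added
    added.append((dataset_name, str(file)))
    return True

def add_data(dataset_name: str, **kwargs) -> bool:
    """Accept either a single 'file' kwarg or an iterable 'files' kwarg."""
    if 'files' in kwargs:
        # e.g. files=Path('.').glob('*.csv') matched on the client side
        return all([_add_single(dataset_name, Path(f)) for f in kwargs['files']])
    if 'file' in kwargs:
        return _add_single(dataset_name, Path(kwargs['file']))
    return False
```

The list comprehension inside `all()` ensures every file is attempted even if an earlier one fails, rather than short-circuiting.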
I considered that, but I decided against it.
The object store dataset and its manager support a simulated directory structure by encoding the structure into the object names. To make that work, there has to be a defined "bucket root" directory when adding files, with only a file's ancestor directory structure below this bucket root level being encoded.
If we add all the files in a directory, we can safely assume they all should be added relative to the same bucket root. We couldn't make that kind of assumption for an arbitrary list of files, though, so we'd also need a list of bucket root values (unless we don't care about losing the original directory structure for each file, but I think we do). We could require a list of files and a list of bucket roots, but I thought that would make the interface too messy, and that it would be better to leave it to something else to provide that convenience if it were wanted.
I'm open to talking through this more if you want, though (either as part of this PR or for later ones).
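The "bucket root" encoding described above can be illustrated with a small sketch (an illustration of the idea, not the actual DMOD code):

```python
from pathlib import Path

def object_name_for(file: Path, bucket_root: Path) -> str:
    """Encode the file's directory structure below the bucket root into the
    object name, simulating directories inside a flat object store bucket."""
    return str(file.relative_to(bucket_root))
```

All files added from one directory share the same bucket root, so their relative structure is preserved; an arbitrary list of files would need a bucket root per file (or lose that structure), which is the interface concern raised in the comment above.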
I was assuming the client didn't need the directory structure if they were adding a list of arbitrary files. If they did, they could pass the Path to those files, which would get encoded in the object names and put under the same bucket root.
Renaming to test_data_requirement.py after making DataRequirement type concrete and removing CatchmentDataRequirement type.
Updating tests after recent redesign that involved removing the CatchmentDataRequirement subtype.
Adding new DatasetUser type and functions within DatasetManager for managing known users of a Dataset.
Adjusting implementation of Dataset to use DataDomain, rather than separate properties for a(n optional) time range and data format.
Adding two functions - delete_data and get_data - to DatasetManager that are basically empty shells, with a commented-out `@abstractmethod` decorator for each and a TODO comment to add this later and implement in subclasses.
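The commented-out `@abstractmethod` approach from the last item looks roughly like this (an illustration of the pattern, not the actual DMOD class):

```python
from abc import ABC  # abstractmethod would also come from here when restored

class DatasetManager(ABC):
    """Shell functions whose abstractness is deferred to a later change."""

    # TODO: restore the decorator below and implement in subclasses
    # @abstractmethod
    def delete_data(self, dataset_name: str, **kwargs) -> bool:
        pass

    # @abstractmethod
    def get_data(self, dataset_name: str, **kwargs):
        pass
```

While the decorators stay commented out, the class remains instantiable and subclasses are not yet forced to override these methods; uncommenting them later flips both behaviors on.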
Force-pushed from 4752707 to 08bbfe8
Fixing "Return" portion of docstring data_domain property, which was more appropriate for data_format; and fixing short description for data_format property, which was more appropriate for data_domain.
Updating is_time_series_index and time_series_index properties, primarily to support caching of time_series_index property value.
Using bucket_root instead of add_relative_to in the ObjectStoreDataset add_files function to make its name consistent with the analogous params in other related functions (i.e., of the ObjectStoreDatasetManager).
Handling edge case appropriately with exception if kwargs of 'file' and 'directory' are both present.
@hellkite500 I've pushed the commits I mentioned in the review conversations above. FWIW, there is a failing test check. Given the nature of where this and some subsequent draft PRs are at, I think it's reasonable to go ahead and approve this one despite that particular check failure, but if you object to that, then we can wait until after an issue for it is opened and resolved.
This should be a good starting point. We can always add functionality in future iterations.
Primary changes include: