Support local paths for InputDataset.source
#30
    if _get_source_type(source) == "path":
        self.exists_locally = True
        self.local_path = source
Again I wonder if this could all be abstracted behind a class, e.g. `self.source = DataSource(source)`, then later use `self.source.exists_locally` and `self.source.path`.
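For illustration only, a minimal sketch of what such a class might look like. `DataSource`, `is_remote`, and the prefix check are assumptions made for this example, not code from the PR; note that here `exists_locally` actually checks the filesystem, whereas the diff above simply sets it to `True` for any path.

```python
import os


class DataSource:
    """Hypothetical wrapper around a dataset location (URL or local path)."""

    def __init__(self, location: str):
        self.location = location
        # Assumption: anything starting with an http(s) prefix is remote,
        # everything else is treated as a local path.
        self.is_remote = location.startswith(("http://", "https://"))

    @property
    def path(self):
        """Absolute local path, or None if the source is remote."""
        return None if self.is_remote else os.path.abspath(self.location)

    @property
    def exists_locally(self) -> bool:
        """True only if the source is a local path that exists on disk."""
        return (not self.is_remote) and os.path.exists(self.location)
```

With this shape, `InputDataset` would hold one `DataSource` instead of separate `source`, `local_path`, and `exists_locally` attributes.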
Yes, I began implementing this for this PR but it got hairier than expected. I think it requires a lot of design considerations and should be returned to once the core feature set is complete. AdditionalCode and BaseModel also have a source_repo attribute (that I think is about to be renamed source to accommodate local files) that could be replaced with an instance of whatever this class will be. I think this could be part of the tidy-up that brings in pathlib. I'll raise an issue for now.
TomNicholas left a comment
Nice. I have one main comment about the idea of abstracting out the "source" of the data. It's related to #9 and drivendataorg/cloudpathlib#455.
    If InputDataset.source is...
    - ...a local path: create a symbolic link to the file in `local_dir/input_datasets`.
    - ...a URL: fetch the file to `local_dir/input_datasets` using Pooch
      (updating the `local_path` attribute of the calling InputDataset)
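The two branches described in this docstring can be sketched roughly as follows. This is not the PR's implementation: `stage_input_dataset` and its `fetch` argument are hypothetical names, and the download step (done with Pooch in the PR) is replaced by a caller-supplied function so the sketch stays self-contained.

```python
import os
from typing import Callable


def stage_input_dataset(
    source: str, local_dir: str, fetch: Callable[[str, str], None]
) -> str:
    """Place the dataset under local_dir/input_datasets and return its path."""
    tgt_dir = os.path.join(local_dir, "input_datasets")
    os.makedirs(tgt_dir, exist_ok=True)
    tgt_path = os.path.join(tgt_dir, os.path.basename(source))

    if source.startswith(("http://", "https://")):
        # Remote source: hand the download to the supplied fetcher
        # (a stand-in for the Pooch call used in the PR).
        fetch(source, tgt_path)
    elif os.path.abspath(source) != os.path.abspath(tgt_path):
        # Local file elsewhere on the system: link it into place,
        # unless something already occupies the target path.
        if not os.path.lexists(tgt_path):
            os.symlink(os.path.abspath(source), tgt_path)
    return tgt_path
```

The returned path is what the caller would store as `local_path`.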
Yeah this is another argument for abstracting out the concept of a "source" into a standalone class / type.
This is a similar idea to using cloudpathlib, but maybe we need to attach more info, in which case we would need our own Source class with a .path attribute.
    to_fetch.fetch(os.path.basename(self.source), downloader=downloader)
    self.exists_locally = True
    self.local_path = tgt_dir + "/" + os.path.basename(self.source)
    # If the file is somewhere else on the system, make a symbolic link where we want it
This could all go behind a Source interface.
- add _get_source_type method in utils (returns 'url' or 'path')
- add call to _get_source_type in InputDataset.__init__() which updates local_path and exists_locally if 'path'
- add logic to InputDataset.get() which creates a local symbolic link to the dataset path if it is elsewhere on the system
- update InputDataset.check_exists_locally() to reflect above changes
… yaml created with Case.persist()
- Modifies the blueprint created by the first case to use local paths to input datasets where available
- Creates and runs a second case `roms_marbl_local_case` from this blueprint
18fd968 to b032dcc
Closes #4.

Summary of changes:

Core:
- Add `_get_source_type` method in `utils` (returns 'url' or 'path')
- Call `_get_source_type` in `InputDataset.__init__()` and set `local_path` and `exists_locally` attributes if `source` attribute is a local path
- Add logic to `InputDataset.get()` which creates a local symbolic link to the dataset path if it is elsewhere on the system
- Update `InputDataset.check_exists_locally()` to reflect above changes

CI:
- `tests/test_roms_marbl_example.py`: the first section creates and runs the case `roms_marbl_remote_case` where all input datasets are URLs, creating a blueprint `test_blueprint.yaml` as before. The second section first moves all the fetched datasets to an unrelated directory and then modifies the `test_blueprint.yaml` file to replace URLs with the path to this dir, before creating and running another case `roms_marbl_local_case` from this modified yaml file.

Other/bugfixes:
- […] `Case.persist()` and `Case.from_blueprint()` when modifying the test routine above:
  - `valid_end_date` was overwriting `valid_start_date` entry in blueprint output
  - `start_date` and `end_date` associated with input datasets were not being written to blueprint output
  - `Case.check_is_setup()` was not correctly checking for the local presence of additional code
- […] `tree` on system before calling it (closes #18, "bash tree: command not found")
- […] `Case.end_date` if they want quick results (closes #21, "simplest example should run really fast")
- […] `ci/environment.yml` to include `netCDF4` and `xarray` (closes #23, "Missing dependencies for analysis") and remove `nco` and `ncview` (issue not yet raised but was mentioned in Tom's notes)
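
For the `tree` item (#18), one common way to guard a call to an optional shell utility is `shutil.which`. The helper below is a hypothetical sketch of that pattern, not the code added in this PR.

```python
import shutil
import subprocess


def run_if_available(command: str, *args: str):
    """Run `command` with `args` if it is on PATH; return None otherwise."""
    if shutil.which(command) is None:
        # Utility is not installed: skip it instead of failing.
        return None
    return subprocess.run(
        [command, *args], capture_output=True, text=True, check=False
    )
```

A caller could use `run_if_available("tree", some_dir)` and fall back to a plain directory listing when the result is `None`.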