Skip to content

Conversation

@dafyddstephenson
Copy link
Contributor

@dafyddstephenson dafyddstephenson commented Aug 17, 2024

closes #4 .
Summary of changes:

Core:

  • add _get_source_type method in utils (returns 'url' or 'path')
  • add call to _get_source_type in InputDataset.__init__() and set local_path and exists_locally attributes if source attribute is a local path
  • add logic to InputDataset.get() which creates a local symbolic link to the dataset path if it is elsewhere on the system
  • update InputDataset.check_exists_locally() to reflect above changes

CI:

  • add a second section to tests/test_roms_marbl_example.py. The first section creates and runs the case roms_marbl_remote_case where all input datasets are URLs, creating a blueprint test_blueprint.yaml as before. The second section first moves all the fetched datasets to an unrelated directory and then modifies the test_blueprint.yaml file to replace URLs with the path to this dir, before creating and running another case roms_marbl_local_case from this modified yaml file.

Other/bugfixes:

Comment on lines +65 to +67
if _get_source_type(source) == "path":
self.exists_locally = True
self.local_path = source
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I wonder if this could all be abstracted behind a class. e.g.

self.source = DataSource(source)

then later use self.source.exists_locally and self.source.path.

Copy link
Contributor Author

@dafyddstephenson dafyddstephenson Aug 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I began implementing this for this PR but it got hairier than expected. I think it requires a lot of design considerations and should be returned to once the core feature set is complete. AdditionalCode and BaseModel also have a source_repo attribute (that I think is about to be renamed source to accommodate local files) that could be replaced with an instance of whatever this class will be. I think this could be part of the tidy-up that brings in pathlib. I'll raise an issue for now.

Copy link
Contributor

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I have one main comment about the idea of abstracting out the "source" of the data. It's related to #9 and drivendataorg/cloudpathlib#455.

Comment on lines +105 to +108
If InputDataset.source is...
- ...a local path: create a symbolic link to the file in `local_dir/input_datasets`.
- ...a URL: fetch the file to `local_dir/input_datasets` using Pooch
(updating the `local_path` attribute of the calling InputDataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is another argument for abstracting out the concept of a "source" into a standalone class / type.

This is a similar idea to using cloudpathlib, but maybe we need to attach more info, in which case we would need our own Source class with a .path attribute.

to_fetch.fetch(os.path.basename(self.source), downloader=downloader)
self.exists_locally = True
self.local_path = tgt_dir + "/" + os.path.basename(self.source)
# If the file is somewhere else on the system, make a symbolic link where we want it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could all go behind a Source interface.

“Dafydd added 5 commits August 20, 2024 14:16
- add _get_source_type method in utils (returns 'url' or 'path')
- add call to _get_source_type in InputDataset.__init__() which updates local_path and exists_locally if 'path'
- add logic to InputDataset.get() which creates a local symbolic link to the dataset path if it is elsewhere on the system
- update InputDataset.check_exists_locally() to reflect above changes
- Modifies the blueprint created by the first case to use local paths to input datasets where available
- Creates and runs a second case `roms_marbl_local_case` from this blueprint
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants