File object management #129

Open
6 tasks
eirrgang opened this issue May 4, 2021 · 7 comments
Labels: EEE requirement (Issues that need to be resolved to migrate the Ensemble of Expanded Ensembles package); infrastructure (development infrastructure or lower level implementation details)

Comments

@eirrgang
Contributor

eirrgang commented May 4, 2021

Establish an abstract interface for filesystem objects that can be implemented for RP. File references should allow data flow to be defined with minimal coupling to actual data location at the time of expression. File Futures allow reference to file objects that do not yet exist.

A File reference must be easy to localize to the contexts of different workflow managers, for example from the client environment to the execution environment and back again.

Data localization and path management must be handled automatically by the WorkflowManager instances.

Unnecessary data transfers must be avoidable through optimization code in WorkflowManager.
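The requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the scalems design: the names `AbstractFileReference`, `SimpleFileReference`, `identity`, `localize`, and `path_in` are hypothetical, and a standard-library `concurrent.futures.Future` stands in for a "File Future".

```python
import abc
import pathlib
from concurrent.futures import Future
from typing import Dict, Optional


class AbstractFileReference(abc.ABC):
    """Hypothetical interface: a file identified independently of where it lives."""

    @abc.abstractmethod
    def identity(self) -> str:
        """Return a location-independent identifier for the file."""

    @abc.abstractmethod
    def path_in(self, context: str) -> Optional[pathlib.PurePath]:
        """Return the path in the named context, or None if not localized there."""


class SimpleFileReference(AbstractFileReference):
    """Toy implementation that records one path per named context."""

    def __init__(self, identity: str) -> None:
        self._identity = identity
        self._locations: Dict[str, pathlib.PurePath] = {}

    def identity(self) -> str:
        return self._identity

    def localize(self, context: str, path: pathlib.PurePath) -> None:
        # In a real system, a workflow manager would perform the transfer here.
        self._locations[context] = path

    def path_in(self, context: str) -> Optional[pathlib.PurePath]:
        return self._locations.get(context)


# A "File Future": data flow can refer to the file before it exists.
pending: "Future[AbstractFileReference]" = Future()

# Later, whoever produces the file fulfills the future.
ref = SimpleFileReference("example-file-id")
ref.localize("client", pathlib.PurePosixPath("/home/user/input.dat"))
pending.set_result(ref)
```

Keeping the identifier separate from the per-context paths is one way to let a manager skip a transfer when `path_in` already returns a path for the target context.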

Relates to #75

  • SCALEMS file management API
    • Generate a UUID for each abstract command node.
    • Provide a file-staging operation primitive.
    • Provide a subprocess operation primitive.
    • Refactor scalems.executable as an abstract operation Director that generates file staging and subprocess primitives.
    • Stage input data required by the work in the record. (@andre-merzky will provide boilerplate.) Ref:
      'input_staging': ['%s/scalems_test_cfg.json' % pwd,
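
One possible shape for the primitives in the task list, as a minimal sketch. All names here (`FileStagingOp`, `SubprocessOp`, `direct_executable`) are hypothetical and do not come from scalems or RP; the point is only that a Director can expand one command into staging and subprocess nodes, each carrying its own UUID.

```python
import dataclasses
import uuid
from typing import Dict, List, Sequence, Tuple, Union


@dataclasses.dataclass(frozen=True)
class FileStagingOp:
    """Hypothetical primitive: make `source` available as `target`."""
    source: str
    target: str
    uid: uuid.UUID = dataclasses.field(default_factory=uuid.uuid4)


@dataclasses.dataclass(frozen=True)
class SubprocessOp:
    """Hypothetical primitive: run a command after its dependencies complete."""
    argv: Tuple[str, ...]
    depends: Tuple[uuid.UUID, ...] = ()
    uid: uuid.UUID = dataclasses.field(default_factory=uuid.uuid4)


def direct_executable(
    argv: Sequence[str], input_files: Dict[str, str]
) -> List[Union[FileStagingOp, SubprocessOp]]:
    """Hypothetical Director: expand one command into primitive operations.

    `input_files` maps the name the command expects to the current source location.
    """
    staging = [FileStagingOp(source=src, target=name) for name, src in input_files.items()]
    run = SubprocessOp(argv=tuple(argv), depends=tuple(op.uid for op in staging))
    return [*staging, run]


ops = direct_executable(["cat", "cfg.json"], {"cfg.json": "/tmp/cfg.json"})
```

The subprocess node refers to its staging nodes only by UUID, so the graph can be serialized or handed to a different manager without sharing Python objects.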
@eirrgang eirrgang added the infrastructure development infrastructure or lower level implementation details label May 4, 2021
@eirrgang eirrgang added this to the RP integrated prototype milestone May 4, 2021
@eirrgang eirrgang self-assigned this May 4, 2021
@eirrgang
Contributor Author

It looks like radical.saga.filesystem.File already provides most of the interface we would want for abstractly handling local versus remote files.

@andre-merzky and @mturilli, does it seem reasonable to build on SAGA here? How would I extract a radical.saga.Session from the radical.pilot.Session?

Presumably, somewhere under the hood, RP translates its URL scheme to SAGA URLs? Is there an accessible function that client software could use to get a URL with a scheme radical.saga would understand? Or does RP register its extra schemes with the underlying saga resolver?

@andre-merzky
Contributor

andre-merzky commented May 14, 2021

> does it seem reasonable to build on SAGA here?

Yes, I assume so

> How would I extract a radical.saga.Session from the radical.pilot.Session?

The rp.Session inherits from the rs.Session, so you should be able to use that session as-is.

> Presumably, somewhere under the hood, RP translates its URL scheme to SAGA URLs? Is there an accessible function that client software could use to get a URL with a scheme radical.saga would understand? Or does RP register its extra schemes with the underlying saga resolver?

It is not exposed. The URL translation code in RP is complex and behaves differently depending on where the URL is used, which component requests the translation, etc., so I doubt it is immediately useful to ScaleMS.

@eirrgang
Contributor Author

> so I doubt it is immediately useful to ScaleMS

Specifically, I am trying to figure out the easiest way to extract a saga File object from a path which may be based on one of the RP-specific URIs provided by RP objects. It isn't always obvious to a programmer whether an attribute is going to need extra processing. The RP documentation mentions radical.utils.Url, but it seems like all of the "sandbox" attributes are just strings.

However, it appears that urllib.parse.urlparse() can handle regular posix file paths just fine, so environment_path = pathlib.Path(urllib.parse.urlparse(rpcomponent.sandbox).path) should be a reasonable normalizer.
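
A minimal sketch of that normalizer, using only the standard library. The function name `sandbox_path` and the example URL are hypothetical; `PurePosixPath` is used here instead of `pathlib.Path` on the assumption that the path may describe a remote POSIX filesystem rather than the local one.

```python
import pathlib
import urllib.parse


def sandbox_path(sandbox: str) -> pathlib.PurePosixPath:
    """Return the path component of a sandbox string.

    Accepts either a bare POSIX path or a URL with a scheme and host.
    Percent-encoded characters are left as-is in this sketch.
    """
    return pathlib.PurePosixPath(urllib.parse.urlparse(sandbox).path)


# A bare path and a URL (hypothetical host) normalize to the same kind of object.
from_path = sandbox_path("/home/user/radical.pilot.sandbox")
from_url = sandbox_path("sftp://host.example.org/home/user/radical.pilot.sandbox")
```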

But what would be the best way to insert the appropriate SAGA access scheme? (I think this amounts to the filesystem_endpoint from the resource description for the Pilot.)

@andre-merzky
Contributor

You can always get the task and pilot sandboxes via task.sandbox and pilot.sandbox, and those URLs will include the access scheme (which is indeed based on the respective config entry).
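
If the sandbox strings do carry the access scheme, the standard library can separate it from the path. This is a sketch under that assumption; the helper name `split_sandbox` and the example URL are hypothetical, and no RP or SAGA call is involved.

```python
import urllib.parse
from typing import Tuple


def split_sandbox(url: str) -> Tuple[str, str, str]:
    """Split a sandbox URL string into (scheme, host, path)."""
    parts = urllib.parse.urlparse(url)
    return parts.scheme, parts.netloc, parts.path


# Hypothetical sandbox URL; a real value would come from task.sandbox or pilot.sandbox.
scheme, host, path = split_sandbox("sftp://host.example.org/scratch/pilot.0000")
```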

What operations do you intend to implement (beyond those provided by the staging ops)?

@eirrgang
Contributor Author

eirrgang commented May 15, 2021

> What operations do you intend to implement (beyond those provided by the staging ops)?

I don't think there is a need for anything beyond the staging ops. But if I can easily get a saga.filesystem.File, I think I can avoid writing a wrapper for RP-based file references that includes a bound Pilot, Task, and/or Session. I can just check whether a source or target path is a saga object to dispatch the copy. Maybe I could even convert all Path and PathLike references to saga objects instead of using a scalems.File object.
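
The type-based dispatch described above might look roughly like this. It is a sketch, not the radical.saga API: `RemoteFileStub` is a hypothetical stand-in for a saga-like file object, and its `copy` method signature is an assumption made for illustration.

```python
import os
import pathlib
import shutil
import tempfile


class RemoteFileStub:
    """Hypothetical stand-in for a saga-like file object (NOT radical.saga)."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.copied_to = []

    def copy(self, target) -> None:
        # A real object would perform a (possibly remote) transfer here.
        self.copied_to.append(str(target))


def copy_file(source, target) -> str:
    """Copy `source` to `target`, dispatching on the type of `source`."""
    if isinstance(source, os.PathLike):
        # Local path objects are copied with the standard library.
        shutil.copyfile(os.fspath(source), os.fspath(target))
        return "local"
    if hasattr(source, "copy"):
        # Anything else is assumed to know how to copy itself.
        source.copy(target)
        return "remote"
    raise TypeError(f"Unsupported file reference: {type(source).__name__}")


with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "a.txt"
    src.write_text("data")
    dst = pathlib.Path(tmp) / "b.txt"
    local_result = copy_file(src, dst)
    local_copy = dst.read_text()

stub = RemoteFileStub("sftp://host.example.org/a.txt")
remote_result = copy_file(stub, "/tmp/b.txt")
```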

task.sandbox and pilot.sandbox are just str objects, right? They aren't saga.Url or saga.Directory objects? (If __repr__ == __str__, I may have totally missed this! I should check now. I hope I didn't make a misguided assumption.)

@eirrgang
Contributor Author

> task.sandbox and pilot.sandbox are just str objects, right? They aren't saga.Url or saga.Directory objects? (If __repr__ == __str__, I may have totally missed this! I should check now. I hope I didn't make a misguided assumption.)

It looks like Pilot stores the various sandbox URLs as ru.Url objects internally, but pilot_sandbox specifically is converted to str when accessed through the property. Internally, it looks like everything needed to get a saga Directory object is there.

Task.sandbox and Task.pilot_sandbox may produce ru.Url references, but I don't see where the private members are assigned anything other than None.

@eirrgang eirrgang mentioned this issue Oct 20, 2022
42 tasks
@eirrgang eirrgang added the EEE requirement Issues that need to be resolved to migrate the Ensemble of Expanded Ensembles package label Oct 21, 2022
eirrgang added a commit that referenced this issue Mar 9, 2023
Lay out a template in which to consolidate notes and produce a cook book
for the necessary operations on file references so that we can see what
`scalems` needs to do.

Ref #75, #129
eirrgang added a commit that referenced this issue Mar 27, 2023
Lay out a template in which to consolidate notes and produce a cook book
for the necessary operations on file references so that we can see what
`scalems` needs to do.

Update some docstrings, notes, and "to do"s.

Ref #75, #129

More docstring updates.
@eirrgang
Contributor Author

eirrgang commented May 4, 2023

Within the scope of this issue, we should make sure to support a user-provided "label" that can be easily cross-referenced with the local workflow metadata to locate file identifiers in a flexible and user-friendly way.
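
A minimal sketch of such a label lookup, assuming labels are free-form strings and that one label may refer to several files. The class name `FileLabelIndex` and the identifier strings are hypothetical; how the index would be persisted with the workflow metadata is left open.

```python
from typing import Dict, List


class FileLabelIndex:
    """Hypothetical index from user-provided labels to file identifiers."""

    def __init__(self) -> None:
        self._by_label: Dict[str, List[str]] = {}

    def add(self, label: str, file_id: str) -> None:
        """Associate a file identifier with a label, ignoring repeats."""
        ids = self._by_label.setdefault(label, [])
        if file_id not in ids:
            ids.append(file_id)

    def find(self, label: str) -> List[str]:
        """Return the identifiers recorded for a label (empty if unknown)."""
        return list(self._by_label.get(label, []))


index = FileLabelIndex()
index.add("initial-structure", "file-id-001")
index.add("initial-structure", "file-id-002")
index.add("initial-structure", "file-id-001")  # repeat is ignored
```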
