
Development Plan #6

Closed
JoshuaSBrown opened this issue Apr 28, 2022 · 5 comments
@JoshuaSBrown
Collaborator

The development tasks are broken down by topic; here we will brainstorm the different components of Zambeze that need to be worked on. The hardest part is defining where one topic ends and another begins so we don't tread on each other's toes while developing.

Detailed descriptions will be placed in separate issues once we flesh them out.

Topics

Zambeze command-line interface

The Zambeze command-line interface is a Python script that the user can use to interact with the orchestrator. The orchestrator should be running in the background on the same machine as the script. When a user makes a request using the command-line interface, it connects to the orchestrator and passes the message along. The command-line interface should be able to report back if there are errors connecting, and fail gracefully if the orchestrator does not respond within a reasonable amount of time.
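A minimal sketch of that connect-and-fail-gracefully behavior, assuming (purely for illustration) that the orchestrator listens on a local TCP port and exchanges JSON messages; the address, port, and message shape are hypothetical, not a settled protocol:

```python
import json
import socket

ORCHESTRATOR_ADDR = ("127.0.0.1", 59123)  # hypothetical local endpoint
TIMEOUT_SECONDS = 5.0

def send_request(message: dict) -> dict:
    """Send one request to the local orchestrator and return its reply,
    or a structured error if the orchestrator is unreachable or too slow."""
    try:
        with socket.create_connection(ORCHESTRATOR_ADDR, timeout=TIMEOUT_SECONDS) as sock:
            sock.sendall(json.dumps(message).encode())
            sock.shutdown(socket.SHUT_WR)   # signal end of request
            reply = sock.makefile().read()
        return {"status": "ok", "reply": json.loads(reply)}
    except OSError as exc:  # covers refused connections and timeouts
        return {"status": "error", "detail": f"orchestrator unreachable: {exc}"}
```

The point is only that the CLI never raises an unhandled exception at the user; any transport failure comes back as a reportable error.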

Orchestrator file parser for reading Campaign file (Scientific Workflow File)

Scientific workflows that are understood by Zambeze should follow a declarative style and be written in .json (I think that is what we agreed on). We will need to be able to read these files and ensure that the file content and keywords are understood and validated.
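A sketch of what reading and validating such a file could look like; the required keywords (`name`, `activities`, etc.) are placeholders, since the actual campaign schema is still to be agreed on:

```python
import json

# Hypothetical required keywords; the real campaign schema is still TBD.
REQUIRED_TOP_LEVEL = {"name", "activities"}
REQUIRED_ACTIVITY = {"type", "parameters"}

def parse_campaign(text: str) -> dict:
    """Parse a campaign (.json) file and validate that the expected
    keywords are present, raising ValueError otherwise."""
    campaign = json.loads(text)
    missing = REQUIRED_TOP_LEVEL - campaign.keys()
    if missing:
        raise ValueError(f"campaign missing keys: {sorted(missing)}")
    for i, activity in enumerate(campaign["activities"]):
        missing = REQUIRED_ACTIVITY - activity.keys()
        if missing:
            raise ValueError(f"activity {i} missing keys: {sorted(missing)}")
    return campaign
```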

Orchestrator logic for Task Submission to Queue (NATS)

Once a file is read in, the orchestrator will need to know how to split the workflow into tasks with actions. It will also need a mechanism for submitting these tasks to the NATS queue. Once a submission is made, the orchestrator will need to be able to listen for a response... (The details of this I am uncertain about and need to be fleshed out.)
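The splitting step could be sketched like this; the task-message fields are illustrative, not a settled schema, and the commented-out NATS calls assume the `nats-py` client against a hypothetical `zambeze.tasks` subject:

```python
import uuid

def campaign_to_tasks(campaign: dict) -> list:
    """Split a parsed campaign into individually submittable task messages.
    Field names here are placeholders for whatever schema we settle on."""
    tasks = []
    for activity in campaign["activities"]:
        tasks.append({
            "task_id": str(uuid.uuid4()),   # unique id to correlate replies
            "campaign": campaign["name"],
            "action": activity["type"],
            "parameters": activity.get("parameters", {}),
        })
    return tasks

# Submission with the nats-py client would then look roughly like:
#
#   nc = await nats.connect("nats://127.0.0.1:4222")
#   for task in campaign_to_tasks(campaign):
#       await nc.publish("zambeze.tasks", json.dumps(task).encode())
#
# followed by a subscription on a reply subject to listen for responses.
```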

Orchestrator Abstraction for interacting with supported third-party APIs i.e. (Globus/DataFed)

When an orchestrator is assigned a task, it might be required to support data movement with Globus/DataFed/some other application, or to perform some other kind of computation. An abstraction layer is needed to support communication with third-party applications outside of Zambeze.
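One common shape for such an abstraction layer is an abstract base class that every service plugin implements; the method names below are illustrative, and `EchoPlugin` is a toy stand-in rather than a real Globus/DataFed binding:

```python
from abc import ABC, abstractmethod

class ServicePlugin(ABC):
    """Illustrative abstraction for third-party services (Globus, DataFed, ...)."""

    @abstractmethod
    def configure(self, config: dict) -> None:
        """Set up credentials/endpoints for this service."""

    @abstractmethod
    def supports(self, action: str) -> bool:
        """Report whether this plugin can handle the given action."""

    @abstractmethod
    def run(self, action: str, parameters: dict) -> dict:
        """Execute the action and return a result message."""

class EchoPlugin(ServicePlugin):
    """Trivial stand-in plugin used only to exercise the interface."""
    def configure(self, config: dict) -> None:
        self.config = config
    def supports(self, action: str) -> bool:
        return action == "echo"
    def run(self, action: str, parameters: dict) -> dict:
        return {"echoed": parameters}
```

The orchestrator can then dispatch a task to the first registered plugin whose `supports()` returns True, without knowing anything service-specific.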

Orchestrator bid generation

When an orchestrator sees a task in NATS that is up for grabs, it will need to pull the message and evaluate if it has the resources to satisfy the request. It then needs to submit a bid to the orchestrator that originally submitted the request - it has to do this by placing a new task "Control task" back in the queue with a plan/bid for how it can satisfy the request.
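A sketch of that evaluate-and-bid step; the resource keys, the flat `cost` heuristic, and the control-task fields are all assumptions standing in for whatever bidding scheme we settle on:

```python
def generate_bid(task: dict, local_resources: dict, agent_id: str):
    """Return a control-task bid if this agent can satisfy the request,
    otherwise None (the message is left for another orchestrator)."""
    needed = task.get("requirements", {})
    for resource, amount in needed.items():
        if local_resources.get(resource, 0) < amount:
            return None  # cannot satisfy the request, do not bid
    return {
        "type": "control",          # the "Control task" placed back in the queue
        "control": "bid",
        "task_id": task["task_id"],
        "bidder": agent_id,
        # A naive cost estimate the submitting orchestrator can compare.
        "cost": sum(needed.values()),
    }
```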

Orchestrator bid evaluation

An orchestrator that has a "Data" or "Compute" task will receive bids back from the queue (NATS). It could receive anywhere from zero to many bids back. It needs to have logic to evaluate which bid is the best and confirm that the winning bid can proceed with the assignment.
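The selection logic could be as simple as the sketch below, assuming bids carry the hypothetical `cost` field from the bid-generation step; lowest cost wins, and the zero-bid case yields None so the orchestrator can retry or report failure:

```python
def select_winning_bid(bids: list):
    """Pick the best bid (lowest cost here); None if no bids arrived."""
    if not bids:
        return None
    return min(bids, key=lambda bid: bid["cost"])
```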


@rafaelfsilva
Member

Thanks @JoshuaSBrown for the initial description. Here are some thoughts:

  • Campaign file: this will be implemented in a later phase. Initially, we will use an API (i.e., a campaign.py file that will expose a Campaign class) in which the user will be able to declare their workflow (data and compute actions).
  • CLI: will also be implemented later. The API described above will be the initial interface for using Zambeze.

For the orchestrator, I think it will be useful to create a graph so we can detail the functionalities described above. I will try to start a draft.

@tskluzac
Collaborator

Where should authentication occur for third-party APIs? I would guess it's one of:

  1. Whenever an agent is launched. The service checks for relevant tokens on-disk for all necessary plugins; if they are not found, then initiate a login flow. I'm guessing if/when Zambeze has its own auth model, then its own flow would happen here as well.
  2. Whenever configure is called from the CLI. That would kick off a flow like this one (https://github.com/ORNL/zambeze/blob/main/zambeze/orchestration/plugin_modules/globus.py#L294); it just requires an extra step at the CLI.
  3. Prior to running anything (then the tokens are passed-in by client as strings at run-time; aka not our problem).
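For options 1 and 2, the token check itself is the same and could look like this sketch; the `~/.zambeze/tokens` location and per-plugin JSON files are assumptions, not an agreed layout, and a None result is the signal to kick off a login flow:

```python
import json
from pathlib import Path

TOKEN_DIR = Path.home() / ".zambeze" / "tokens"  # hypothetical location

def load_token(plugin: str):
    """Return a cached token dict for a plugin, or None so the caller
    can initiate a login flow (as in option 2)."""
    token_file = TOKEN_DIR / f"{plugin}.json"
    if not token_file.exists():
        return None
    return json.loads(token_file.read_text())
```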

(ps -- my hours are backwards today due to baby... I promise I'm not usually working at this hour).

@rafaelfsilva
Member

@tskluzac, I would go with option 2 for now, and once we have our own auth model, we will move all of it into the new system.

Hope you all are able to get some good hours of sleep :)

@tskluzac
Collaborator

Another dev/design-related question --- now that the agent is a daemon that 'disconnects' from its parent process, I'm wondering what should happen when stopping an agent:

  1. Campaign-created agents: should these be terminated as soon as the campaign terminates? At first I thought "definitely yes", but now I realize there's a case where that agent could have theoretically picked up work submitted by a different campaign (and stopping it would interrupt that work). This leads to my second point...

  2. Stop behavior: in the current model, I assume that users stop agents when they are confident work is done -- therefore a hard cleanup can be had where the entire child subprocess is immediately killed. We could instead use a 'soft' stop model (i.e. where some "KILL" message is added to the queue that will prevent the agent from grabbing more tasks), but then we run the risk of users stopping an agent and it hangs indefinitely (and makes matters difficult when users try to restart their agent -- it would either have to create a second agent or force them to wait).

The best solution to both of these problems that I can think of is to continue to have hard shutdowns, but do either:
A) the campaign doesn't start its own agent and just latches onto a user's locally-running agent, if one is running. Then the agent DOES NOT automatically terminate at the end of the campaign (it is the user's responsibility); or
B) we let the campaign continue to spin up/down an agent, but put some sort of limitation on campaign-adjacent agents so that they cannot receive outside work. Then we can automatically spin the agent down without fear of deleting work.
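For reference, the 'soft' stop model from point 2 boils down to a sentinel control message in the queue; this toy sketch uses an in-process `queue.Queue` and a `KILL` sentinel standing in for a message on NATS, so the hang risk described above corresponds to `get()` blocking while earlier tasks are still being drained:

```python
import queue

KILL = object()  # sentinel standing in for a "KILL" control message

def agent_loop(task_queue: queue.Queue, handled: list) -> None:
    """'Soft' stop: the agent drains tasks until it sees the KILL
    message, so in-flight work is never interrupted mid-task."""
    while True:
        task = task_queue.get()
        if task is KILL:
            break
        handled.append(task)  # stand-in for actually executing the task
```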

Happy to discuss further (or have someone point out that I'm misunderstanding something!)

@rafaelfsilva
Member

Can we close this issue as we are using Projects now?
