
Add execution order controls #424

Open · wants to merge 20 commits into base: develop
Conversation

jwhite242 (Collaborator):

Add machinery for controlling execution order/priority of study steps

  • Adds weights to the graph to enable selecting between depth-first and breadth-first (the current production mode) execution order.
  • Adds a new execution block exposing order controls, plus future hooks for using step metadata to control step weights.
  • Changes the internal machinery to use a PriorityQueue to store the ready steps.


| **Key** | **Required** | **Type** | **Description** |
| :- | :-: | :-: | :- |
| `step_order` | No | str | Type of scheduler managing execution. One of: {`depth-first`, `breadth-first`}. Default: `depth-first`. |
Member:

Should the default here be breadth-first like you say in the sentence above?

Collaborator Author:

Yeah, that's definitely a typo. Should be fixed now.

bgunnar5 (Member) left a comment:

I like this addition a lot. Two of my comments here are just suggestions for cleaning up code/planning for the future but if you want to ignore them I'm not going to be offended at all.

```python
if self.weight < other.weight:
    return True

return False
```
Member:

You could get fancy here (and for the __gt__ method below) and make this a one-liner: return self.weight < other.weight

Member:

To add onto this, I think you're only required to implement __lt__ and __eq__ since you can build the other comparison methods from them. But I'm only 85% sure from memory 😅
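That memory is right: `functools.total_ordering` derives the remaining comparison methods from `__eq__` plus one ordering method. A minimal sketch (the `Step` class here is an illustrative stand-in, not the PR's actual class):

```python
from functools import total_ordering

@total_ordering
class Step:
    """Illustrative stand-in for a weighted study step."""
    def __init__(self, weight):
        self.weight = weight

    def __eq__(self, other):
        return self.weight == other.weight

    def __lt__(self, other):
        return self.weight < other.weight

# total_ordering fills in __le__, __gt__, and __ge__ automatically.
assert Step(1) < Step(2)
assert Step(2) >= Step(1)
assert Step(3) > Step(1)
```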

Collaborator Author:

I'll throw another wrench into this and point out that this implementation is probably not what we want, despite it working for the current feature; i.e., step weights can't be the only basis of comparison. The __eq__ gets used for step equality, which means we can't use this __lt__ to construct it if __lt__ only cares about step weights. I had kind of punted on what else these should compare during the initial implementation, but it should definitely be addressed now.
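One hedged way around the __eq__/__lt__ conflict is to keep ordering off the step class entirely and let the queue entries carry it; `dataclasses` with `compare=False` makes this a few lines (the wrapper name is illustrative). The step's own `__eq__` then stays free to mean step identity:

```python
from dataclasses import dataclass, field
from queue import PriorityQueue
from typing import Any

@dataclass(order=True)
class PrioritizedStep:
    # Only weight participates in ordering; the wrapped step is
    # excluded, so step equality remains a separate notion.
    weight: int
    step: Any = field(compare=False)

q = PriorityQueue()
q.put(PrioritizedStep(2, "wide-step"))
q.put(PrioritizedStep(1, "deep-step"))
assert q.get().step == "deep-step"  # lowest weight comes out first
```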

bgunnar5 (Member) commented Apr 10, 2024:

maybe we leave this as is and then set the weights when we determine priorities?

EDIT: this comment is in reply to the previous discussion started here: #424 (comment)

```python
for weight, (parent, step_name, step) in enumerate(self.walk_study()):
    pprint(step)
    pprint("Updating weight")
    if isinstance(step, StudyStep):
```
Member:

Is there ever a case here where the step is not a StudyStep? Would it just be None?

```python
self.step_prioritization_factory.register_priority_expr(
    'step_order',
    step_weight_priority_df
)
```
Member:

I'm not entirely certain of your future plans for this section of code, but if you're planning to add more entries of the same form to this if/else chain, then it might be better in the long run to do something like:

```python
priority_expr_mapper = {
    "breadth-first": {
        "name": "step-order",
        "func": step_weight_priority_bf
    },
    "depth-first": {
        "name": "step-order",
        "func": step_weight_priority_df
    }
}

self.step_prioritization_factory.register_priority_expr(
    priority_expr_mapper[self.step_order]["name"],
    priority_expr_mapper[self.step_order]["func"]
)
```

Then any changes to the register_priority_expr() call would only have to be made in one place rather than several.

Member:

Another option here is to have some Protocol classes that implement the __call__ method -- then you can just point to the class you want and have them all be callable.
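A sketch of that Protocol idea, assuming illustrative class and key names (none of these are the PR's actual identifiers):

```python
from typing import Protocol

class PriorityExpr(Protocol):
    """Anything callable as expr(step) -> sortable priority."""
    def __call__(self, step) -> int: ...

class DepthFirstPriority:
    def __call__(self, step) -> int:
        return step["weight"]       # lower weight runs first

class BreadthFirstPriority:
    def __call__(self, step) -> int:
        return -step["weight"]      # invert to flip traversal order

def evaluate(expr: PriorityExpr, step) -> int:
    # Any object with __call__ satisfies the protocol.
    return expr(step)

step = {"weight": 3}
assert evaluate(DepthFirstPriority(), step) == 3
assert evaluate(BreadthFirstPriority(), step) == -3
```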

Collaborator Author:

Yeah, both good ideas, and I think the end game is some mix of the two to get all the work out of the constructor. I haven't quite fleshed out the design yet, but the idea is to eventually let users build expressions from things like step resource keys (procs, walltime, combinations of them), parameters, maybe even step names, and layer them onto this priority. Priorities would then be the sequential evaluation of all of those expressions (step order being the only one that's always there), exploiting the sorting mechanism of tuples to give fine-grained control over the orders.

Of course, if there are other ideas on things we might want to use to control this ordering, let me know!

@@ -0,0 +1,93 @@
description:
Collaborator:

It would be helpful to have an example in the samples directory as well.

@@ -906,12 +920,32 @@ def execute_ready_steps(self):
```python
# Now, we need to take the min of the length of the queue and the
# computed number of slots. We could have free slots, but have less
# in the queue.
_available = min(_available, len(self.ready_steps))
# _available = min(_available, len(self.ready_steps)) # Kind of inconsistent if it's affected by queue size sometimes
```
Member:

I'm not 100% sure what you mean by inconsistent here. This prevents the need for the conditional below (at least that was its intent when it was originally coded). It also means not checking for the -1 value explicitly.

Collaborator Author:

Problem with this is that PriorityQueues are like standard Queues and do not implement a length the way the deque that used to live here did. The size is also only approximate (unsure if that's really only the case in multithreaded use or not).

Collaborator Author:

Also, this whole block can now be removed since we don't need this check -> the queue.empty() test on the while loop accomplishes the same behavior and exits when no more steps are available.
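To illustrate the sizing point: `queue.PriorityQueue` provides no `__len__`, and `qsize()` is documented as unreliable under concurrency, so `empty()` checks (or separate accounting) are the dependable route. A small single-threaded sketch:

```python
from queue import PriorityQueue

q = PriorityQueue()
for weight in (3, 1, 2):
    q.put(weight)

# No __len__ is provided, so len(q) raises TypeError.
try:
    len(q)
except TypeError:
    pass

# qsize() exists but is only advisory with multiple threads;
# single-threaded it reflects the real count.
assert q.qsize() == 3

# Draining via empty() mirrors the while-loop guard in the PR.
drained = []
while not q.empty():
    drained.append(q.get())
assert drained == [1, 2, 3]
```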

```python
new_steps_running = 0
LOGGER.debug("Available slots: %d, Ready steps is empty %s",
             _available, self.ready_steps.empty() == True)
while not self.ready_steps.empty() and (new_steps_running < _available or _available < 0):
```
Member:

This changes the original intent of the throttle that was added. The --throttle option meant that Maestro would not exceed that number of jobs in total. This form says we're allowed to launch that number of new steps, instead of it being a global limit, to my understanding. Am I incorrect there?

Also, this conditional is a little convoluted -- since _available could be -1, it adds the extra condition.

jwhite242 (Collaborator Author) commented Jul 24, 2023:

Yeah, this part needs another data structure to really work right again: the queue's size method is only approximate, so it isn't reliable for this the way the previous implementation was. We'll need a running-steps tracker alongside this to make the throttling less convoluted.

Edit: new_steps_running just needs to not be reset on every pass, I think, and then we'll have the intended behavior.

Collaborator Author:

So, this actually does work as intended; no need for my edit. The control comes from _available being set using throttle and the number of running steps a little above this loop, so if in_progress already equals throttle, this conditional will not let any new jobs through.

I just tested it on one of the hello world specs and verified it properly limits to 1 job at a time (the throttle setting) with the Slurm adapter.

Will add a comment to explain the variable.
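The throttle interaction described above reduces to plain arithmetic: the global cap holds because the slot count is recomputed from the in-flight steps on each pass. A hedged sketch (variable names mirror the discussion, but the function itself is illustrative, not the PR's code):

```python
def slots_available(throttle, in_progress):
    """How many new steps may launch this pass.

    throttle <= 0 means unlimited, mirroring the -1 sentinel
    the loop condition checks for.
    """
    if throttle <= 0:
        return -1  # sentinel: no limit
    return max(throttle - in_progress, 0)

# throttle=1: once one job is in progress, nothing new launches,
# so the limit behaves as a global cap rather than a per-pass one.
assert slots_available(1, 0) == 1
assert slots_available(1, 1) == 0
assert slots_available(4, 3) == 1
assert slots_available(0, 10) == -1  # unlimited
```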


```python
def __init__(self, name, description,
             studyenv=None, parameters=None, steps=None, out_path="./"):

def __init__(self, name, description, studyenv=None,
             parameters=None, steps=None, out_path="./"):
```
Member:

Since you're passing the study environment here, the output path might be redundant. Could be an opportunity to clean up some parameters here.

@@ -127,6 +127,35 @@ def remove_edge(self, src, dest):
```python
logging.debug("Removing edge (%s, %s).", src, dest)
self.adjacency_table[src].remove(dest)

def __iter__(self):
    """
    Iterator over the graph
```
Member:

This docstring could be clearer. It also looks like this is a DFS traversal; it might be worth mentioning that in the docstring or making it configurable somehow.
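For reference, an explicit-stack DFS iterator over an adjacency table looks roughly like this (a sketch, not the PR's implementation; the graph shape just mirrors a tiny parent/child study):

```python
def dfs_iter(adjacency, source):
    """Depth-first iterator over an adjacency table (illustrative).

    Uses an explicit stack; neighbors are pushed in reverse so they
    pop off in their listed order.
    """
    seen = set()
    stack = [source]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        yield node
        stack.extend(reversed(adjacency.get(node, [])))

graph = {
    "_source": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
}
# Each branch is exhausted before the next begins.
assert list(dfs_iter(graph, "_source")) == ["_source", "a", "a1", "a2", "b", "b1"]
```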


FrankD412 (Member):

@jwhite242 -- Is this a dead end at this point? We're assessing Maestro and it's looking like DFS would be good on our end too. Wondering just in case we need to revisit.

jwhite242 (Collaborator Author):

> @jwhite242 -- Is this a dead end at this point? We're assessing Maestro and it's looking like DFS would be good on our end too. Wondering just in case we need to revisit.

No, I just got derailed by other things for a bit; I'm ramping back up on this now. Having the more general expression-based priorities is going to be pretty helpful -> a major use case here is getting big/long-running variants of steps started sooner, allowing smaller ones to be churned through within the throttle limit alongside them for improved throughput.

Thinking more on the protocol question... I'm on the fence about whether we shouldn't just use abstract base classes and tie info to these things, i.e. per-step overrides of expressions. But I'll play with both and see how they feel.

@jwhite242 jwhite242 marked this pull request as ready for review March 6, 2024 17:47
jwhite242 (Collaborator Author):

@FrankD412, @bgunnar5, @jsemler
I think this is finally ready for another pass/real review. An interesting question left (beyond any implementation issues/comments) is what to do about the spec. I refactored it to be a list so it's more clearly ordered for users, but maybe it'd make sense to contain this list in a subkey (priority_expressions or something) instead of at the root of the execution block? I don't have any other uses in mind for this block yet, but the key would be more future-proof in case we do think of something. (I know the docs are slightly out of sync, pending this subkey-or-not question.)

jwhite242 (Collaborator Author):

Actually, I just had another thought that might fit nicer: expanding it and making the value more of a 'oneOf' type, so it's either a value or an expression. That makes it clearer that there are two types and avoids having to do greedy parsing on things to figure it out on our end.

```yaml
execution:
  priority:
    - name:
      description: # optional, but encouraged... built-ins can dump the code's internal description in the reserialized spec
      value: # use this for built-ins with string keys to select (e.g. the current 'step-order')
    - name:
      description:
      expression: # use this for the eventual string-based expression compilation
```
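Enforcing that 'oneOf' on the parsing side could look roughly like the following (a sketch; the key names follow the proposal above and are not final):

```python
def validate_priority_entry(entry):
    """Each entry needs a name plus exactly one of value/expression."""
    if "name" not in entry:
        raise ValueError("priority entry missing 'name'")
    has_value = "value" in entry
    has_expr = "expression" in entry
    if has_value == has_expr:  # both present, or both absent
        raise ValueError(
            f"priority entry {entry['name']!r}: provide exactly one "
            "of 'value' or 'expression'"
        )

# A built-in selected by string key passes; mixing the two does not.
validate_priority_entry({"name": "step-order", "value": "depth-first"})
try:
    validate_priority_entry({"name": "bad", "value": "x", "expression": "y"})
except ValueError:
    pass
```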

jwhite242 (Collaborator Author) commented Mar 28, 2024:


Continued tweaking/iteration, with a mind toward this being amenable to a mix of built-in and user-supplied things (think plugins for reusable expressions, checkable via the dependencies machinery):

```yaml
execution:
  priority:
    - prioritizer_id: step_order   # built-in dag traversal order method
      args:
        - step_order: 'depth-first'
        - ...
    - prioritizer_id: expression   # the built-in expression prioritizer
      expression: step.procs*step.walltime ...
```

Think of the prioritizer_id (or a similar name) as akin to the key used to identify script adapters, so we can use it to tag plugin-installed things in a way that makes error messaging helpful, since it's a standardized place to register these functions. For sharing, Maestro can then tell a recipient that they're missing a plugin somebody was using. I still like the idea of descriptions too, though I'm not sure about making them mandatory, given that the built-in ones can have theirs set internally and just serialized into the spec in the workspace.
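That registry-by-id idea might be sketched like this (all names illustrative; the point is the standardized lookup and the helpful error for a missing plugin):

```python
PRIORITIZER_REGISTRY = {}

def register_prioritizer(prioritizer_id):
    """Decorator registering a prioritizer under a stable id,
    akin to how script adapters are keyed."""
    def wrap(func):
        PRIORITIZER_REGISTRY[prioritizer_id] = func
        return func
    return wrap

def get_prioritizer(prioritizer_id):
    try:
        return PRIORITIZER_REGISTRY[prioritizer_id]
    except KeyError:
        known = ", ".join(sorted(PRIORITIZER_REGISTRY)) or "<none>"
        # The id names the missing piece, so a shared spec can report
        # exactly which plugin the recipient lacks.
        raise ValueError(
            f"Unknown prioritizer_id {prioritizer_id!r}; known ids: {known}."
        ) from None

@register_prioritizer("step_order")
def step_order(step):
    return step.get("weight", 0)

assert get_prioritizer("step_order")({"weight": 2}) == 2
try:
    get_prioritizer("missing-plugin")
except ValueError:
    pass
```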

Comment on lines +25 to +41
```yaml
- name: extract-data
  description: |
    Extract data and upload to external source, then clean up workspace
  run:
    cmd: |
      # Placeholder for some database upload
      python $(UPLOADER) -i $(echo-params.workspace)

      upload_success=$?  # Capture retcode of uploader for use later
      # Clean up workspace to save space
      if [[ $upload_success == 0 ]]; then
        echo 'Data upload successful'
        echo 'Cleaning up data files'
        rm $(echo-params.workspace)/datafile.txt
      else
        echo 'Data upload failed'
        exit $upload_success
```
Member:

Should this step depend on echo-params?



These execution orders can be demonstrated with the sample hello-bye world specification from the tutorials:
Member:

A link to the Tutorials page here would be helpful.

```yaml
run:
  cmd: |
    echo "$(FAREWELL), $(NAME)!" > $(BYE_FORMAT)
  depends: [say-hello]
```
Member:

If this depends statement is changed to depends: [say_hello_*] can depth-first execution still take place? If so, should this be changed?


=== "Breadth-first order, throttle=1"

Breadth first, order of excution marked by colors with lighter colors executing first:
Member:

typo here, should be: "order of execution". This same typo is in the DFO tab below as well

"""
Defines api for Priority Expressions for study steps.
"""
def __call__(self, study_step):
Member:

For now, the type hint could be `study_step: "StudyStep"` -- that way there's at least something until the circular import is resolved. Up to you though.


@@ -317,12 +317,17 @@ def run_study(args):
```python
batch = spec.batch
if "type" not in batch:
    batch["type"] = "local"

# Check the execution block early, then store it
exec_block = spec.execution
```
Member:

If no execution block is given, is there a default that needs to be set? Something like:

```python
if "graph-order" not in exec_block:
    exec_block["graph-order"] = "breadth-first"
```

assert step.weight == expected_step_weights[step_name]


# NOTE: can we refactor to use a shared study fixture instead of redoing it frequently?
Member:

Maybe you could have a single hello_bye.yaml script and then write fixtures that update certain blocks of it? In the fixture you can yield control to whatever test/fixture needs it, then after the yield do cleanup to reset the yaml to what it was before.

I'm doing something similar with Merlin's CONFIG object on a branch where I'm working on incorporating pytest/coverage into Merlin's test suite: https://github.com/bgunnar5/merlin/blob/feature/coverage/tests/conftest.py#L230. This is definitely subject to change as it's still a WIP, but it showcases what I'm thinking.
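The restore-after-yield pattern could be sketched like this with a scratch copy of the spec (file names illustrative; the same body works as a pytest fixture under `@pytest.fixture`):

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def editable_spec(spec_path):
    """Hand out a scratch copy of a spec; the original stays untouched."""
    scratch_dir = tempfile.mkdtemp()
    scratch = Path(scratch_dir) / Path(spec_path).name
    shutil.copy(spec_path, scratch)
    try:
        yield scratch               # the test mutates the copy freely
    finally:
        shutil.rmtree(scratch_dir)  # cleanup runs after the yield

# Usage: mutate the copy; the base spec is pristine afterward.
original = Path(tempfile.mkdtemp()) / "hello_bye.yaml"
original.write_text("description: base\n")
with editable_spec(original) as copy:
    copy.write_text("description: modified\n")
assert original.read_text() == "description: base\n"
```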

```python
self.step_prioritizer = StepPrioritizer()

for expr in self.exec_list:
    # note: likely be nicer to make these actual objects during spec parsing..
```
Member:

+1 to this note

```python
pprint(f"Parameterized step name: {parameterized_step_name}, step: {parameterized_step}")
if not parameterized_step:  # catch source, which is always none
    continue
# NOTE: really can't get the base step name anymore after staging?
```
Member:

I ran into this same problem when developing Merlin's status commands, so if we could implement an easy way to access this post-staging, that would be amazing.

4 participants