Workflow-level checkpointing #43

Closed
kylechard opened this Issue Dec 7, 2017 · 13 comments


kylechard commented Dec 7, 2017

We need the ability to record workflow state (e.g., task status, inputs/outputs, task commands) and export it to an external store (e.g., a database or file), and to allow restarting from that checkpoint.

@yadudoc yadudoc self-assigned this Dec 7, 2017

@yadudoc yadudoc added the enhancement label Dec 7, 2017

@yadudoc yadudoc added this to the Parsl-0.4.0 milestone Dec 7, 2017

yadudoc commented Dec 11, 2017

Breaking this issue into concrete tasks:

  • Enable checkpointing functionality in the DFK.
  • Add an option to auto-checkpoint at defined time intervals and at events such as cleanup.
  • Support restarts from checkpoints. Restarting could be complex for non-deterministic applications.

For workflows with non-deterministic pieces, such as conditionals or loops on non-deterministic outputs, restarting can be quite hard, and the best we can do is use the checkpoint as a memoization lookup table.
This approach would work correctly for deterministic workflows. However, it might be expensive to track input file timestamps, hashes of the function body, etc., to know whether a task and its inputs have not changed.

Alternatively, for purely deterministic workflows we could use task_ids (which are assigned sequentially) to track which apps have completed, and allow restarting from the checkpoint.
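
A minimal sketch of the memoization-lookup idea described above, assuming the table is keyed on a hash of the function source and its arguments; make_memo_key, maybe_skip, and checkpoint_table are illustrative names, not Parsl internals:

```python
import hashlib
import inspect
import pickle

def make_memo_key(func, args, kwargs):
    """Key a task by a hash of its function source and arguments."""
    payload = (inspect.getsource(func), args, sorted(kwargs.items()))
    return hashlib.md5(pickle.dumps(payload)).hexdigest()

def maybe_skip(func, args, kwargs, checkpoint_table):
    """Reuse a checkpointed result if this exact task has already run."""
    key = make_memo_key(func, args, kwargs)
    if key in checkpoint_table:
        return checkpoint_table[key]    # skip re-execution
    result = func(*args, **kwargs)      # otherwise run the task
    checkpoint_table[key] = result      # and record it for the next run
    return result
```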

salmanhabib commented Dec 12, 2017

Also, don't forget keeping the run info in a database (this is part of both the provenance-tracking and checkpointing requirements).

yadudoc commented Dec 12, 2017

Adding a couple more tasks to track requirements from @salmanhabib:

  • Checkpoint periodically to a database
  • Stream state to a database

danielskatz commented Dec 12, 2017

@salmanhabib - can you please define what you mean by "provenance tracking" here? What specific information do you want to be stored?

salmanhabib commented Dec 12, 2017

@danielskatz Things like run parameters, status of outputs, versions of software/compute modules, lists of failed tasks (if relevant), etc. Basically we want to have the notion of reproducibility baked into the workflow so we can rerun things and get the same answer plus compare against runs where we know certain changes/modifications have been made.

kylechard commented Dec 12, 2017

@salmanhabib We also have an issue related to capturing broad provenance information: #42.

salmanhabib commented Dec 12, 2017

@kylechard Yes, I know; my original point was just that there is some overlap between the two.

yadudoc commented Jan 23, 2018

Here's what I propose:

The DFK will have two methods:

  1. checkpoint() - writes all non-checkpointed info to a checkpoint object
  2. resume_from() - loads all checkpointed information from a checkpoint object

We will support three modes for automatic checkpointing:

  1. (eager) checkpoint at the completion of every task
  2. (lazy) checkpoint at configurable intervals (say ~1 hr)
  3. (at_exit) write out a checkpoint when the DFK exits, capturing early termination from app failures, uncaught exceptions, keyboard interrupts, etc.
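
A rough sketch of how a user might select among the three modes above; the key names (checkpoint_mode, checkpoint_period) are hypothetical, not a settled interface:

```python
# Hypothetical configuration sketch for the three proposed modes;
# the key names here are illustrative only.
checkpoint_config = {
    "checkpoint_mode": "lazy",    # one of "eager", "lazy", "at_exit"
    "checkpoint_period": 3600,    # seconds between checkpoints when "lazy"
}
```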

In the simplest case, code does not change, functions don't hold any state (no closures), and they are deterministic. So if an app executes a function named app_foo(), every subsequent invocation with the same args, kwargs, and env would produce the same results. Under this assumption, if we store completed results and return them at call time as completed futures, we would skip every invocation of a task that has already completed as we re-create the dataflow graph on re-execution.
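
A small sketch of that assumption in code: if a result for the same call is already checkpointed, hand back an already-completed future instead of launching the task. The names (checkpointed_results, key_fn, submit) are placeholders rather than the DFK's real internals:

```python
from concurrent.futures import Future

def launch_or_reuse(func, args, kwargs, checkpointed_results, key_fn, submit):
    """Return a completed Future from the checkpoint when possible, else submit."""
    key = key_fn(func, args, kwargs)
    if key in checkpointed_results:
        fut = Future()
        fut.set_result(checkpointed_results[key])  # already done: no re-execution
        return fut
    return submit(func, *args, **kwargs)           # normal launch path
```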

Complications:

  • Tracking code changes and forcing re-execution of parts of the graph
  • Functions could be redefined in different contexts
  • Data objects could be very deep or not comparable
  • Changes to file objects aren't trackable since they are currently treated as strings; this would change once File objects are used consistently, but it would require a means of comparing files
  • Non-determinism from the script itself via conditionals

annawoodard commented Jan 23, 2018

This is really exciting!

Just brainstorming: could we tweak resume_from to be more general? One possibility would be to have a load_cache or similar so checkpoints could be composed. For example, if you have two workflows, blue and red, that look like this:

blue: 
    A -> B -> C
    D -> E
    C, E -> F
red:
    G -> H -> I

Both of which you've already run. Later, you want to run:

yellow:
    D -> E
    G -> H
    E, H -> K

With the composing feature, you could use load_cache(blue, red) and reuse both the D -> E and G -> H steps.

EDIT: Slight clarification, renaming
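
A sketch of how the composing idea might work, assuming each checkpoint file holds a pickled {key: result} table (consistent with the pickle-based checkpointing discussed later in this thread); load_cache here is a hypothetical helper, not an existing Parsl call:

```python
import pickle

def load_cache(*checkpoint_files):
    """Merge memo tables from several checkpoint files; later files win on conflicts."""
    merged = {}
    for path in checkpoint_files:
        with open(path, "rb") as f:
            merged.update(pickle.load(f))  # assumed: one {key: result} dict per file
    return merged

# e.g. before running "yellow", reuse results from both earlier runs:
# memo_table = load_cache("blue_checkpoint.pkl", "red_checkpoint.pkl")
```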

yadudoc commented Jan 23, 2018

This should be doable since resume_from(...) should be capable of loading multiple checkpoint files. The yellow workflow has to be a mashup of parts of both the blue and red workflow to use results from either since the results are made available only when the parent apps are called.

Some other things to consider:

  • Checkpointing is closely related to provenance tracking and monitoring, and it might help to design the interface keeping in mind future deliverables like pushing state info to a DB
  • Do we include a previous checkpoint in a run that loads from a checkpoint?

annawoodard commented Jan 23, 2018

This should be doable since resume_from(...) should be capable of loading multiple checkpoint files.

Right, "loading multiple checkpoint files" is the functionality I was advocating for, I'm glad to hear it's already part of the plan. This is minor, but in my opinion load_cache(red, blue) or load_checkpoints(A, B) would more clearly describe what's being done-- resume_from for me implies something being done in serial, it's not clear what resume_from(red, blue) means (are we starting from red or blue?).

The yellow workflow has to be a mashup of parts of both the blue and red workflow to use results from either since the results are made available only when the parent apps are called.

Yes, exactly. In my work, it's a common scenario. Checkpointing "red workflow" so that "red workflow" can be restarted without waste is exciting. Checkpointing "red workflow" so work done can be re-used in subsequent workflows is still more exciting! In HEP this is commonly achieved "by hand" by splitting work into stages depending on how often you're likely to make changes to each stage. That way the less-frequently-changing stages don't have to be re-run. Parsl supporting multiple checkpoint files would abstract that "by hand" factorization away.

Checkpointing is closely related to provenance tracking and monitoring, and it might help to design the interface keeping in mind future deliverables like pushing state info to a DB

I was assuming that the checkpoint object would be stored in a DB. For optimal speed, I think that's the way to go. What is the current proposal?

Do we include a previous checkpoint in a run that loads from a checkpoint?

This is an important question and a bit tricky. The more unneeded "stuff" that's being carried around in the checkpoints, the slower/less convenient they will be. I think we would definitely want to at least have the option to "prune" a checkpoint such that anything from the previous checkpoint that is not used in the current workflow is dropped.

yadudoc added a commit that referenced this issue Jan 26, 2018

Enabling memoization and checkpointing addressing #43
Support for memoization, as well as checkpointing via dumping
the memoization table incrementally to pickle files. Support
for other storage methods to come later. Tested.

yadudoc commented Jan 28, 2018

We can now do these things:

  • Memoize apps
  • Memoization accounts for code changes (works in interactive environments as well)
  • Write checkpoints incrementally (manual; the user calls dfk.checkpoint())
  • Load checkpoints at start time
  • Dev docs added; user docs pending
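
A rough usage sketch of the manual flow listed above. Only dfk.checkpoint() is confirmed in this thread; the executor setup and the checkpointFiles keyword for loading checkpoints at start time are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from parsl.dataflow.dflow import DataFlowKernel

# Load earlier results at start time; checkpointFiles is an assumed kwarg name.
dfk = DataFlowKernel(executors=[ThreadPoolExecutor(max_workers=2)],
                     checkpointFiles=["runinfo/000/checkpoint.pkl"])

# ... define apps against dfk and run tasks ...

dfk.checkpoint()  # manually write out results completed since the last checkpoint
```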

yadudoc commented Jan 28, 2018

Merged to master as of 99e9461

@yadudoc yadudoc closed this Jan 28, 2018

yadudoc added a commit that referenced this issue Jan 29, 2018

Updating config defaults for memoization, enabling cache specification at per app level. #43

Now appCaching is enabled globally, but is disabled per app.
The App decorator now takes cache=Bool to enable caching at the app level.
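
A sketch of the per-app opt-in this commit describes; the decorator form (App('python', dfk, cache=True)) reflects the Parsl of this era and is approximate, with cache=Bool being the only detail confirmed here:

```python
from concurrent.futures import ThreadPoolExecutor
from parsl import App
from parsl.dataflow.dflow import DataFlowKernel

dfk = DataFlowKernel(executors=[ThreadPoolExecutor(max_workers=2)])  # assumed setup

@App('python', dfk, cache=True)   # opt this app into caching/memoization
def simulate(x):
    return x * x

@App('python', dfk)               # caching stays off for this app (the per-app default)
def summarize(values):
    return sum(values)
```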

benclifford pushed a commit that referenced this issue Aug 9, 2018

annawoodard added a commit that referenced this issue Sep 24, 2018
