bugfix: Ensure deferred checkpoint includes successor tasks in execution queue #46

Merged
myui merged 3 commits into main from bugfix/checkpoint_resume on Apr 6, 2026



myui commented Apr 6, 2026

This pull request fixes a bug in the workflow engine's checkpointing logic to ensure that when a deferred checkpoint is requested, all successor tasks are properly included in the checkpoint's pending task queue. This guarantees that resuming from a checkpoint will correctly execute all remaining tasks. The change moves the checkpoint handling to occur after successor processing and adds thorough tests to verify correct behavior.

Bug fix: Deferred checkpoint handling

  • Moved the handling of deferred checkpoint requests in graflow/core/engine.py to occur after successor tasks are added to the queue, ensuring that checkpoints include all pending successors.

Testing improvements

  • Added tests/core/test_checkpoint_resume_bug.py with two tests:
    • Verifies that resuming from a deferred checkpoint executes all remaining successor tasks and returns the expected results.
    • Confirms that the checkpoint's pending queue contains all appropriate successor tasks after resuming, preventing loss of workflow progress.


myui commented Apr 6, 2026

Summary

  • Fix a bug where resuming from a deferred checkpoint skips remaining tasks because successor nodes were not yet enqueued when the checkpoint was created
  • Move _handle_deferred_checkpoint() call after successor processing in the WorkflowEngine.execute() loop

Problem

When a task requests a deferred checkpoint via task_ctx.checkpoint(), the engine's execution loop was:

  1. mark_task_completed(task_id)
  2. increment_step()
  3. _handle_deferred_checkpoint() — checkpoint saved here
  4. Successor processing — successor tasks added to queue here

Since the checkpoint was created before successors were enqueued, the restored queue was empty.
engine.execute(restored_ctx) would call get_next_task(), receive None, and return immediately — never executing the remaining tasks.
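The ordering problem can be reproduced with a small toy model. This is only a sketch: `GRAPH`, `run_old_order`, and the plain list used as a "snapshot" are hypothetical stand-ins, not graflow's actual classes or checkpoint format.

```python
from collections import deque

# Hypothetical three-step linear workflow; not graflow's data model.
GRAPH = {"step_1": ["step_2"], "step_2": ["step_3"], "step_3": []}


def run_old_order(start, checkpoint_at):
    """Run the workflow with the pre-fix ordering; return (executed, snapshot)."""
    queue = deque([start])
    executed = []
    snapshot = None
    while queue:
        task_id = queue.popleft()
        executed.append(task_id)       # stands in for mark_task_completed + increment_step
        if task_id == checkpoint_at:   # deferred checkpoint handled before successors
            snapshot = list(queue)     # queue is captured while still empty
        queue.extend(GRAPH[task_id])   # successor processing happens too late
    return executed, snapshot


executed, snapshot = run_old_order("step_1", checkpoint_at="step_2")
print(snapshot)  # [] : step_3 is missing, so a resume has nothing to run
```

In this model the original run still finishes all three steps, which is why the defect only shows up when execution is resumed from the saved snapshot.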

Fix

Reorder the loop so that _handle_deferred_checkpoint() runs after successor processing:

  1. mark_task_completed(task_id)
  2. increment_step()
  3. Successor processing — successor tasks added to queue
  4. _handle_deferred_checkpoint() — checkpoint now includes pending successors
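The effect of the reordering can be sketched with a toy loop. The names here (`GRAPH`, `run`) and the list-based snapshot are illustrative assumptions, not the engine's real API.

```python
from collections import deque

# Hypothetical three-step linear workflow; not graflow's data model.
GRAPH = {"step_1": ["step_2"], "step_2": ["step_3"], "step_3": []}


def run(pending, checkpoint_at=None):
    """Run queued tasks with the post-fix ordering; return (executed, snapshot)."""
    queue = deque(pending)
    executed = []
    snapshot = None
    while queue:
        task_id = queue.popleft()
        executed.append(task_id)       # stands in for mark_task_completed + increment_step
        queue.extend(GRAPH[task_id])   # successor processing runs first
        if task_id == checkpoint_at:   # deferred checkpoint handled afterwards
            snapshot = list(queue)     # snapshot now holds the pending successors
    return executed, snapshot


_, snapshot = run(["step_1"], checkpoint_at="step_2")
resumed, _ = run(snapshot)             # resume from the saved queue
print(snapshot)  # ['step_3']
print(resumed)   # ['step_3']
```

Because the snapshot is taken after the queue is extended, resuming from it picks up exactly the work that had not yet run.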

Test Plan

  • Added tests/core/test_checkpoint_resume_bug.py with two regression tests:

    • test_deferred_checkpoint_resumes_with_successors — end-to-end: checkpoint in step_2, resume, verify step_3 executes
    • test_deferred_checkpoint_queue_contains_successor — root cause: verify checkpoint queue contains the successor task
  • All 41 existing checkpoint tests pass (no regressions)
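The shape of the two regression tests might look roughly like the following. `ToyEngine` is a hypothetical stand-in written for this sketch; the real tests exercise graflow's WorkflowEngine, and their exact setup is not shown in this PR description.

```python
from collections import deque

# Hypothetical three-step linear workflow; not graflow's data model.
GRAPH = {"step_1": ["step_2"], "step_2": ["step_3"], "step_3": []}


class ToyEngine:
    """Hypothetical stand-in for the engine, using the post-fix ordering."""

    def __init__(self, graph):
        self.graph = graph

    def execute(self, pending, checkpoint_at=None):
        queue = deque(pending)
        results = {}
        checkpoint = None
        while queue:
            task_id = queue.popleft()
            results[task_id] = f"{task_id}_done"
            queue.extend(self.graph[task_id])           # enqueue successors first
            if task_id == checkpoint_at:
                checkpoint = {"pending": list(queue)}   # then save the checkpoint
                break                                   # simulate stopping after it
        return results, checkpoint


def test_resume_runs_successors():
    # End-to-end: checkpoint in step_2, resume, expect step_3 to execute.
    engine = ToyEngine(GRAPH)
    _, checkpoint = engine.execute(["step_1"], checkpoint_at="step_2")
    results, _ = engine.execute(checkpoint["pending"])
    assert results == {"step_3": "step_3_done"}


def test_checkpoint_queue_contains_successor():
    # Root cause: the saved queue itself must contain the successor.
    engine = ToyEngine(GRAPH)
    _, checkpoint = engine.execute(["step_1"], checkpoint_at="step_2")
    assert checkpoint["pending"] == ["step_3"]
```

Keeping the queue-contents check separate from the end-to-end check means a failure points directly at the checkpoint ordering rather than at task execution.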


Copilot AI left a comment


Pull request overview

Fixes a workflow-engine checkpointing bug where deferred checkpoints were created before successor tasks were enqueued, causing resumed executions to miss remaining work.

Changes:

  • Moved deferred checkpoint handling in WorkflowEngine.execute() to run after successor tasks are added to the execution queue.
  • Added a new regression test file validating both resume execution and restored pending-queue contents for deferred checkpoints.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| graflow/core/engine.py | Reorders deferred checkpoint creation to include newly enqueued successor tasks in the checkpoint queue snapshot. |
| tests/core/test_checkpoint_resume_bug.py | Adds regression tests to ensure resumed contexts execute remaining successors and that the restored pending queue includes successors. |


@myui myui merged commit c7c0de4 into main Apr 6, 2026
3 checks passed
@myui myui deleted the bugfix/checkpoint_resume branch April 6, 2026 16:19