refactor: implement cooperative state machine for range/list operations #18204

Merged
merged 1 commit into from
Jun 21, 2025

Conversation

drmingdrmer
Member

@drmingdrmer drmingdrmer commented Jun 20, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

refactor: implement cooperative state machine for range/list operations

Fix blocking issue during initialization data transmission.

Problem:

When establishing a watch stream, the meta-service sends large amounts of
initialization data to the client. During this transmission, other events
are blocked until completion, including add-watcher commands.

This creates a deadlock: if initialization data is large, it blocks all
subsequent Dispatcher operations. When a second watch request arrives,
it must wait for the first one to complete sending all initialization data.
Since adding a new watcher requires holding the state machine lock,
multiple concurrent watch requests will block the state machine entirely,
causing timeouts for other requests.

Solution:

Make the process cooperative by not waiting for watch stream transmission
to complete. Instead, queue the add-watcher command and return immediately.
This allows subsequent watch requests to proceed without waiting for previous
initialization data transmissions to finish.
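
The cooperative idea above can be illustrated with a minimal, single-threaded sketch. This is not the code from this PR: `Command`, `InitFlush`, `Dispatcher`, and the `chunk` parameter are hypothetical names chosen for illustration, and the real meta-service is asynchronous. The sketch only shows the scheduling principle: an add-watcher command is queued and handled without waiting for an earlier watcher's initialization data to be fully sent, because each flush yields after a bounded chunk.

```rust
use std::collections::VecDeque;

/// Hypothetical commands a dispatcher might receive (illustrative only).
pub enum Command {
    /// Register a watcher and stream `init` to it as initialization data.
    AddWatcher { id: u64, init: Vec<String> },
    /// Any other dispatcher event.
    Event(String),
}

/// An in-progress initialization flush for one watcher.
struct InitFlush {
    id: u64,
    pending: VecDeque<String>,
}

/// A toy dispatcher that interleaves command handling with chunked flushes.
pub struct Dispatcher {
    commands: VecDeque<Command>,
    flushes: VecDeque<InitFlush>,
    /// Maximum number of items sent per flush turn (always at least 1).
    chunk: usize,
    /// Ordered record of everything the dispatcher did.
    pub log: Vec<String>,
}

impl Dispatcher {
    pub fn new(chunk: usize) -> Self {
        Dispatcher {
            commands: VecDeque::new(),
            flushes: VecDeque::new(),
            chunk: chunk.max(1),
            log: Vec::new(),
        }
    }

    /// Queue a command and return immediately; nothing is sent here.
    pub fn submit(&mut self, cmd: Command) {
        self.commands.push_back(cmd);
    }

    /// One cooperative step: handle at most one queued command, then let
    /// one flush send at most `chunk` items. Returns false when idle.
    pub fn step(&mut self) -> bool {
        let mut progressed = false;
        if let Some(cmd) = self.commands.pop_front() {
            progressed = true;
            match cmd {
                Command::AddWatcher { id, init } => {
                    self.log.push(format!("add-watcher {}", id));
                    self.flushes.push_back(InitFlush { id, pending: init.into() });
                }
                Command::Event(name) => self.log.push(format!("event {}", name)),
            }
        }
        if let Some(mut flush) = self.flushes.pop_front() {
            progressed = true;
            for _ in 0..self.chunk {
                match flush.pending.pop_front() {
                    Some(item) => self
                        .log
                        .push(format!("send {} to watcher {}", item, flush.id)),
                    None => break,
                }
            }
            // Unfinished flushes go to the back of the line, so they
            // cannot starve commands or other watchers.
            if !flush.pending.is_empty() {
                self.flushes.push_back(flush);
            }
        }
        progressed
    }

    /// Drive the dispatcher until no commands or flushes remain.
    pub fn run(&mut self) {
        while self.step() {}
    }
}
```

With a chunk size of 1, submitting a watcher with three initialization items, then a second watcher, then an unrelated event, results in the second add-watcher and the event being handled before the first watcher's last item is sent. A blocking design would instead send all three items before looking at the next command.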

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Refactoring

Related Issues


@github-actions github-actions bot added the pr-refactor label (this PR changes the code base without new features or bugfix) Jun 20, 2025
@drmingdrmer
Member Author

This should fix the 4s timeout that occurs when the meta-service initializes a watch stream with a large batch of initialization data. @everpcpc @BohuTANG

@drmingdrmer drmingdrmer requested review from BohuTANG and everpcpc June 20, 2025 14:16
@drmingdrmer drmingdrmer marked this pull request as ready for review June 20, 2025 14:16
Member

@BohuTANG BohuTANG left a comment

LGTM 👍

@drmingdrmer drmingdrmer merged commit 0224108 into databendlabs:main Jun 21, 2025
148 of 150 checks passed
@drmingdrmer drmingdrmer deleted the 315-bb branch June 21, 2025 02:45
Labels
pr-refactor: this PR changes the code base without new features or bugfix
3 participants