Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: query node may stuck at stopping progress #33104

Merged
merged 2 commits into from
May 20, 2024

Conversation

weiliu1031
Copy link
Contributor

issue: #33103
when try to do stopping balance for stopping query node, balancer will try to get node list from replica.GetNodes, then check whether node is stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and the stopping node maintains in roNodes, so balancer couldn't find replica which contains stopping node, and stopping balance for replica won't be triggered, then query node will stuck forever due to segment/channel doesn't move out.

when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is stopping,
if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label May 16, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/bug Issues or changes related a bug labels May 16, 2024
Copy link

codecov bot commented May 16, 2024

Codecov Report

Attention: Patch coverage is 85.21739% with 17 lines in your changes are missing coverage. Please review.

Project coverage is 82.15%. Comparing base (1671c78) to head (52bb75f).
Report is 16 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #33104      +/-   ##
==========================================
- Coverage   82.18%   82.15%   -0.04%     
==========================================
  Files        1006     1006              
  Lines      128428   128336      -92     
==========================================
- Hits       105554   105429     -125     
- Misses      18902    18924      +22     
- Partials     3972     3983      +11     
Files Coverage Δ
internal/querycoordv2/checkers/balance_checker.go 95.12% <100.00%> (-0.18%) ⬇️
internal/querycoordv2/checkers/channel_checker.go 85.40% <100.00%> (+0.19%) ⬆️
internal/querycoordv2/checkers/index_checker.go 78.49% <100.00%> (-0.89%) ⬇️
internal/querycoordv2/checkers/leader_checker.go 95.23% <100.00%> (-0.09%) ⬇️
internal/querycoordv2/checkers/segment_checker.go 88.30% <100.00%> (-0.54%) ⬇️
internal/querycoordv2/job/job_load.go 90.67% <100.00%> (-0.24%) ⬇️
internal/querycoordv2/meta/replica.go 100.00% <100.00%> (ø)
internal/querycoordv2/meta/replica_manager.go 82.09% <100.00%> (ø)
internal/querycoordv2/meta/resource_manager.go 80.85% <100.00%> (-0.08%) ⬇️
...nternal/querycoordv2/observers/replica_observer.go 95.31% <100.00%> (+0.07%) ⬆️
... and 10 more

... and 38 files with indirect coverage changes

@mergify mergify bot added the ci-passed label May 16, 2024
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
@sre-ci-robot sre-ci-robot added size/XL Denotes a PR that changes 500-999 lines. and removed size/L Denotes a PR that changes 100-499 lines. labels May 17, 2024
@mergify mergify bot added ci-passed and removed ci-passed labels May 17, 2024
@@ -685,6 +684,7 @@ func (s *Server) watchNodes(revision int64) {
)
s.nodeMgr.Stopping(nodeID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sessionUpdateEvent is always stopping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, only query node turn to stopping mode, it will update it's session state field

@chyezh
Copy link
Contributor

chyezh commented May 17, 2024

/lgtm

@congqixia
Copy link
Contributor

/approve

@sre-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, weiliu1031

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot sre-ci-robot merged commit a7f6193 into milvus-io:master May 20, 2024
15 checks passed
weiliu1031 added a commit to weiliu1031/milvus that referenced this pull request May 20, 2024
issue: milvus-io#33103 
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
sre-ci-robot pushed a commit that referenced this pull request May 20, 2024
issue: #33103 
pr: #33104
when try to do stopping balance for stopping query node, balancer will
try to get node list from replica.GetNodes, then check whether node is
stopping, if so, stopping balance will be triggered for this replica.

after the replica refactor, replica.GetNodes only return rwNodes, and
the stopping node maintains in roNodes, so balancer couldn't find
replica which contains stopping node, and stopping balance for replica
won't be triggered, then query node will stuck forever due to
segment/channel doesn't move out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved ci-passed dco-passed DCO check passed. kind/bug Issues or changes related a bug lgtm size/XL Denotes a PR that changes 500-999 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants