Skip to content

Conversation

hpatro
Copy link
Collaborator

@hpatro hpatro commented Jun 4, 2025

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and reconcile all the nodes in the shard to the primary node's shard-id.

@hpatro hpatro requested a review from PingXie June 4, 2025 20:25
Copy link

codecov bot commented Jun 4, 2025

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 73.57%. Comparing base (3789b29) to head (8f658fe).
Report is 14 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 0.00% 8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2174      +/-   ##
============================================
+ Coverage     71.43%   73.57%   +2.13%     
============================================
  Files           122      122              
  Lines         66210    73777    +7567     
============================================
+ Hits          47300    54281    +6981     
- Misses        18910    19496     +586     
Files with missing lines Coverage Δ
src/cluster_legacy.c 89.50% <0.00%> (+2.77%) ⬆️

... and 32 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

hpatro added 2 commits June 10, 2025 20:26
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@hpatro hpatro force-pushed the handle-divergent-shard-id-nodes-conf branch from 5c52a9d to 8f658fe Compare June 10, 2025 20:27
@hpatro
Copy link
Collaborator Author

hpatro commented Jun 11, 2025

@martinrvisser I tried few scenarios manually, they seem to run fine. Could you try running this patch and see if it helps ?

@hpatro
Copy link
Collaborator Author

hpatro commented Jun 11, 2025

Test scenarios:

43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd myself,master - 0 1749488968682 13 connected 5461-16383
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
vars currentEpoch 13 lastVoteEpoch 9
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd myself,master - 0 1749488968682 13 connected 5461-16383
vars currentEpoch 13 lastVoteEpoch 9
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd master - 0 1749488968682 13 connected 5461-16383
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd myself,slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
vars currentEpoch 13 lastVoteEpoch 9
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd myself,slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd master - 0 1749488968682 13 connected 5461-16383
vars currentEpoch 13 lastVoteEpoch 9

All of these load fine.

@martinrvisser
Copy link

@martinrvisser I tried few scenarios manually, they seem to run fine. Could you try running this patch and see if it helps ?

@hpatro fine in my tests also, nice work
2804808:M 13 Jun 2025 12:13:25.599 * Node 386bac3e5337540f2bce4181ff8980f33a989667 has a different shard id (4009ad6c27a3fe91b709a6502bd111b5d58eada0) than its primary's shard id bb85c3d913bb6f9e3d02e68ae89b5c5e61111fb4 (f713a7d0f3079e30fcc47348260a09de0c9b8b72). Updating replica's shard id to match primary's shard id.
2804808:M 13 Jun 2025 12:13:25.599 * Node configuration loaded, I'm 386bac3e5337540f2bce4181ff8980f33a989667

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look correct, please also update the top comment to desc it. So it is somehow like a corrupted (or a outupdated) nodes.conf cause the trouble?

@hpatro
Copy link
Collaborator Author

hpatro commented Jun 18, 2025

This look correct, please also update the top comment to desc it. So it is somehow like a corrupted (or a outupdated) nodes.conf cause the trouble?

Yes. Updated.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hpatro hpatro added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 20, 2025
@hpatro hpatro merged commit 4a3e713 into valkey-io:unstable Jun 20, 2025
81 checks passed
stockholmux added a commit to stockholmux/valkey that referenced this pull request Aug 13, 2025
Signed-off-by: Kyle J. Davis <kyledvs@amazon.com>
@hpatro hpatro added the release-notes This issue should get a line item in the release notes label Aug 13, 2025
@zuiderkwast zuiderkwast moved this to 8.0.5 (to be released) in Valkey 8.0 Aug 21, 2025
@zuiderkwast zuiderkwast moved this to To be backported in Valkey 8.1 Aug 21, 2025
@zuiderkwast zuiderkwast moved this to To be backported in Valkey 7.2 Aug 21, 2025
zuiderkwast pushed a commit to vitarb/valkey that referenced this pull request Aug 21, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit that referenced this pull request Aug 22, 2025
…d id (#2174)

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
Status: To be backported
Status: 8.0.5
Status: To be backported
Development

Successfully merging this pull request may close these issues.

[BUG] "Unrecoverable error: corrupted cluster config file" due to incorrect shard-id
5 participants