Spurious NOLEADER in healthy clusters #21

Closed
aphyr opened this issue Mar 11, 2020 · 3 comments

aphyr commented Mar 11, 2020

I'm not entirely sure what's up with this case. It looks like the cluster started up and formed normally: we joined all nodes to n1, and n1 was the leader. Then, immediately thereafter, n1 started insisting that there was no leader and rejecting requests, while the rest of the cluster carried on normally.

20200311T174200.000-0400.zip

latency-raw (45)

To start, n1 is the leader:

 ({:id 958289600, :role :leader, :node "n1"}
 {:id 183312529, :node "n2", :role :follower}
 {:id 561617232, :node "n3", :role :follower}
 {:id 314213252, :node "n4", :role :follower}
 {:id 1472507146, :node "n5", :role :follower})

n1 executes a few transactions:

2020-03-11 17:55:40,336{GMT}	INFO	[jepsen worker 15] jepsen.util: 15	:ok	:txn	[[:append 2 2] [:r 2 [2]] [:r 1 []]]

But a couple seconds later, it flips to NOTLEADER, then NOLEADER:

2020-03-11 17:42:13,330{GMT}	INFO	[jepsen worker 15] jepsen.util: 15	:fail	:txn	[[:r 2 nil]]	:notleader
2020-03-11 17:42:13,330{GMT}	INFO	[jepsen worker 30] jepsen.util: 30	:fail	:txn	[[:r 1 nil]]	:notleader
2020-03-11 17:42:13,336{GMT}	INFO	[jepsen worker 40] jepsen.util: 40	:fail	:txn	[[:r 0 nil] [:r 1 nil] [:r 0 nil] [:append 2 23]]	:noleader
2020-03-11 17:42:13,336{GMT}	INFO	[jepsen worker 25] jepsen.util: 25	:fail	:txn	[[:append 1 3] [:append 0 5]]	:noleader
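
To see exactly when a node flips from one error to the other, log lines like the ones above can be tallied per second. This is only a minimal sketch: it assumes the whitespace-separated jepsen.util layout shown in this issue, and the names `LINE_RE`, `ERROR_RE`, and `tally_errors` are made up for illustration and are not part of Jepsen.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpts in this issue:
#   <timestamp>{GMT} INFO [jepsen worker N] jepsen.util: <process> :<type> :txn <txn> [:<error>]
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\{GMT\}\s+INFO\s+"
    r"\[jepsen worker \d+\] jepsen\.util: (?P<process>\d+)\s+"
    r":(?P<type>ok|fail|info)\s+:txn\s+(?P<rest>.*)$"
)
# A trailing keyword such as :notleader or :noleader, if the op carried one.
ERROR_RE = re.compile(r"\s:(?P<error>[a-z-]+)\s*$")


def tally_errors(lines):
    """Count (second, error) pairs; ops without a trailing error use None."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a jepsen.util op line in the assumed format
        err = ERROR_RE.search(m.group("rest"))
        counts[(m.group("ts"), err.group("error") if err else None)] += 1
    return counts
```

Feeding it the lines of a test run's log file gives a per-second count of each error keyword, which makes the NOTLEADER to NOLEADER transition easy to spot.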

After a few seconds of this, we start getting socket timeouts on n1:

2020-03-11 17:42:22,466{GMT}	INFO	[jepsen worker 43] jepsen.util: 43	:info	:txn	[[:append 2 4]]	:socket-timeout

aphyr changed the title from "Node believes there's no leader despite other nodes being all right" to "Spurious NOLEADER in healthy clusters" on Mar 11, 2020

aphyr commented Mar 26, 2020

We're still seeing this in 73ad833--nodes run fine for a while, then, without any faults, some of them decide there's no leader any more. Here's another test run:

latency-raw (46)

yossigo commented Apr 6, 2020

Should be solved by #34.

aphyr commented May 5, 2020

This looks fixed! We're still seeing occasional NOLEADER outages every minute or two with b9ee410, but they're short-lived now. :)

latency-raw (49)

aphyr closed this as completed on May 5, 2020