We had a period where 3 of 12 instances in a cluster were all reporting true for "hasLeadership". The problem persisted for at least 10 minutes (from when I noticed it until I restarted the cluster). Terminating the single instance that should have been leader just moved leadership to a new instance, bringing the count back to 3. Restarting all instances resolved the problem.
Jordan's suggestion is "LeaderLatch code isn't good about clearing the internal leader state when there are connection problems".
Please look into it. Thanks,
I was never able to write a test that reproduces this. However, I can think of several edge cases that might cause it. In the end, I rewrote LeaderLatch to better handle connection/server instability. At the same time, I made most of the calls async, which should help concurrency and performance.
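For illustration, here is a minimal, hypothetical sketch of the suspected failure mode and the fix direction (this is not Curator's actual implementation; the class and method names are made up). The idea is that a latch which caches a leadership flag must clear that flag whenever the connection is suspended or lost, otherwise a stale "hasLeadership == true" can survive a session change:

```java
// Hypothetical sketch only -- not the real LeaderLatch code.
enum ConnectionState { CONNECTED, SUSPENDED, LOST }

class SimpleLeaderLatch {
    // Cached leadership flag; the suspected bug is that this kept
    // its old value across connection problems.
    private volatile boolean hasLeadership = false;

    // Called when this instance wins the election.
    void onElected() {
        hasLeadership = true;
    }

    // The fix direction: pessimistically drop leadership on any
    // connection problem. On reconnect, the latch would need to
    // re-check its znode before claiming leadership again (omitted).
    void onConnectionStateChanged(ConnectionState state) {
        if (state == ConnectionState.SUSPENDED || state == ConnectionState.LOST) {
            hasLeadership = false;
        }
    }

    boolean hasLeadership() {
        return hasLeadership;
    }
}

class StaleLeadershipDemo {
    public static void main(String[] args) {
        SimpleLeaderLatch latch = new SimpleLeaderLatch();
        latch.onElected();
        System.out.println(latch.hasLeadership()); // true

        // Without the handler above, the flag would stay true here
        // even though the session may have been lost server-side.
        latch.onConnectionStateChanged(ConnectionState.SUSPENDED);
        System.out.println(latch.hasLeadership()); // false
    }
}
```

Without the connection-state handler, two instances can each hold a stale true flag after a network blip, which matches the symptom of multiple instances reporting leadership at once.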
This will be in the next release.