Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If DisableInitialHostLookup, Network Instability Can Cause Prolonged "no hosts available in pool" #1721

Open
amcrn opened this issue Sep 7, 2023 · 3 comments · May be fixed by #1722
Open

Comments

@amcrn
Copy link

amcrn commented Sep 7, 2023

What version of Cassandra are you using?

Reproducible on 3.x & 4.x

What version of Gocql are you using?

v1.6.0

What version of Go are you using?

1.20

What did you do?

If DisableInitialHostLookup is set to true, the ID for each
gocql.HostInfo is set to a random UUID.

host.SetHostID(MustRandomUUID().String())

When refreshRing(...) is invoked, system.peers is queried,
returning the new list of Hosts. However, these gocql.HostInfo have
the ID set as the real host_id from system.peers.

Therefore, when r.session.ring.addHostIfMissing(h) is invoked, the
ID does not match any of the random UUIDs, resulting in hostIPToUUID
pointing to the new ID, but hostList appends.

gocql/ring.go

Lines 95 to 101 in db6d556

existing, ok := r.hosts[hostID]
if !ok {
r.hosts[hostID] = host
r.hostIPToUUID[host.nodeToNodeAddress().String()] = hostID
existing = host
r.hostList = append(r.hostList, host)
}

addHostIfMissing(h) returns !ok, resulting in
r.session.startPoolFill(h) being invoked for each Host, which in turn
calls s.pool.addHost(host) (which checks by Host ID), and
s.policy.AddHost(host). The Policy's AddHost however uses a
cowHostList, which defines equality by the ConnectAddress, not the
ID!

gocql/host_source.go

Lines 138 to 145 in b9737dd

func (h *HostInfo) Equal(host *HostInfo) bool {
if h == host {
// prevent rlock reentry
return true
}
return h.ConnectAddress().Equal(host.ConnectAddress())
}

This mismatch of checks (ID vs. IP/Addr) results in the Policy
de-duplicating the Host, but the Pool does not. This becomes a problem
if a network flap occurs again. One variant of this problem is
queryExecutor.do(...) has hostIter() only able to return nil,
because q.policy's hosts are empty, whereas q.pool's hostConnPools
contains the random UUID's entry, with filling repeatedly being set
to true on reconnect() attempts. This effectively starves the
ability to get an Up host, resulting in non-stop
gocql: no hosts available in pool until the session is re-created
or the application is restarted.

The workaround being used at the moment is to set DisableInitialHostLookup
to false and using SingleHostReadyPolicy (to reduce the connection
time that having DisableInitialHostLookup=true was originally achieving).

What did you expect to see?

That when DisableInitialHostLookup=true, and a ring refresh occurs, the code correctly de-deduplicates.

What did you see instead?

Remaining Pools associated with non-existent UUIDs, which causes side-effects on subsequent network flaps.

@martin-sucha
Copy link
Contributor

Hi!

Thank you for reporting the issue!

It seems that we should update the HostInfo.Equal to use UUID comparison as well as update other places to use the UUID (some topology tests and maybe others). Could you please check if that solves the issue?

Also I'm curious if we should deprecate the DisableInitialHostLookup option. Is there a use case for it besides the faster startup (which should be solved by SingleHostReadyPolicy)?

amcrn added a commit to amcrn/gocql that referenced this issue Sep 8, 2023
If DisableInitialHostLookup is enabled, the Host IDs are random UUIDs. Therefore, on the first ring refresh, remove the hosts and re-add them back and refill the pools.

Closes gocql#1721
@amcrn
Copy link
Author

amcrn commented Sep 8, 2023

The PR I submitted above seems to solve the problem. Let me know what you think.

In general, I think DisableInitialHostLookup should be deprecated. It doesn't stop the consultation of system.peers on a ring refresh, so the only utility as far as I can tell is to reduce the amount of time to create a new Session. But if SingleHostReadyPolicy is used, the connection time is reduced as much as possible.

The only possible corner-case I can think of is that if the initial provided hosts are resolvable, but all hosts are in a bad node state, that DisableInitialHostLookup today will proceed with creating the Session, despite zero hosts being marked as Up. Whereas with SingleHostReadyPolicy, at least one has to be marked as Up. This "fail-open" option might be useful to some folks. Thoughts?

@josuebrunel
Copy link

josuebrunel commented Jan 16, 2024

Hello @amcrn , any example of the implementation of a SingleHostReadyPolicy please ?

edit:
I'll try with

cluster.PoolConfig = gocql.PoolConfig{
		HostSelectionPolicy: gocql.SingleHostReadyPolicy(gocql.RoundRobinHostPolicy()),
	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants