Description
To facilitate running nsqd in environments where the host isn't long running, and to facilitate operations around managing a cluster, a new "draining" mode will be introduced to nsqd.
An nsqd instance in "draining" mode will:
- Not accept any new messages
- Allow consumers to receive all remaining messages
- Exit after all topics and channels are empty
- Indicate draining status via the /info endpoint
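The four behaviors above can be sketched as a small state machine. This is only an illustration under stated assumptions: `Node`, `ErrDraining`, and the `"draining"` status key are hypothetical names, not nsqd's actual types or its real /info payload.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDraining is a hypothetical error returned when a publish is refused.
var ErrDraining = errors.New("nsqd is draining")

// Node is a toy stand-in for an nsqd instance; it is not nsqd's real type.
type Node struct {
	mu       sync.Mutex
	draining bool
	queue    []string // pending messages across all topics and channels
}

// StartDrain switches the node into draining mode.
func (n *Node) StartDrain() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.draining = true
}

// Publish rejects new messages once draining has started.
func (n *Node) Publish(msg string) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.draining {
		return ErrDraining
	}
	n.queue = append(n.queue, msg)
	return nil
}

// Consume hands out the next remaining message, if any.
func (n *Node) Consume() (string, bool) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if len(n.queue) == 0 {
		return "", false
	}
	msg := n.queue[0]
	n.queue = n.queue[1:]
	return msg, true
}

// ShouldExit reports whether the drain is complete: draining and empty.
func (n *Node) ShouldExit() bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.draining && len(n.queue) == 0
}

// Info mimics a status report; the "draining" key is an assumed name.
func (n *Node) Info() map[string]bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	return map[string]bool{"draining": n.draining}
}

func main() {
	n := &Node{}
	_ = n.Publish("a")
	n.StartDrain()
	fmt.Println(n.Publish("b")) // refused: the node is draining
	msg, _ := n.Consume()
	fmt.Println(msg, n.ShouldExit(), n.Info())
}
```

A real implementation would track depth per topic and channel, including in-flight and deferred messages, rather than a single slice.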
Clients that use an HA approach of pooling multiple nsqds for publishing messages (e.g. nsqio/go-nsq#311) are expected to transparently tolerate a host in draining mode.
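As a rough sketch of what "transparently tolerate" could mean on the client side, a pool can try the next nsqd when one refuses a publish. `Publisher`, `Pool`, and `fakeNode` are hypothetical names for illustration; this is not the implementation in the referenced go-nsq change. The `Publish(topic, body)` shape matches go-nsq's `Producer.Publish`.

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher is a hypothetical interface with the same shape as
// go-nsq's Producer.Publish method.
type Publisher interface {
	Publish(topic string, body []byte) error
}

// Pool tries each publisher in order until one accepts the message.
type Pool struct {
	publishers []Publisher
}

// Publish returns nil on the first success, or the last error seen.
func (p *Pool) Publish(topic string, body []byte) error {
	lastErr := errors.New("no publishers configured")
	for _, pub := range p.publishers {
		if err := pub.Publish(topic, body); err != nil {
			lastErr = err
			continue // this host may be draining; try the next one
		}
		return nil
	}
	return fmt.Errorf("all publishers failed: %w", lastErr)
}

// fakeNode simulates an nsqd that may be in draining mode.
type fakeNode struct {
	draining bool
	received int
}

func (f *fakeNode) Publish(topic string, body []byte) error {
	if f.draining {
		return errors.New("E_PUB_FAILED")
	}
	f.received++
	return nil
}

func main() {
	a := &fakeNode{draining: true}
	b := &fakeNode{}
	pool := &Pool{publishers: []Publisher{a, b}}
	err := pool.Publish("events", []byte("hello"))
	fmt.Println(err, a.received, b.received)
}
```

A production pool would likely also rotate the starting host and back off from hosts that keep failing; those choices are left out here.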
Implementation Plan
A new --sigterm=drain CLI flag will enable this new behavior. Existing functionality will be preserved with the argument --sigterm=clean-shutdown.
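A minimal sketch of parsing and validating the proposed flag with Go's standard `flag` package follows. The helper name `parseSigtermMode` is hypothetical, and treating clean-shutdown as the default is an assumption based on the goal of preserving existing behavior.

```go
package main

import (
	"flag"
	"fmt"
)

// parseSigtermMode is a hypothetical helper that validates the proposed
// --sigterm value. Defaulting to "clean-shutdown" is an assumption.
func parseSigtermMode(args []string) (string, error) {
	fs := flag.NewFlagSet("nsqd", flag.ContinueOnError)
	mode := fs.String("sigterm", "clean-shutdown",
		"behavior on SIGTERM: clean-shutdown or drain")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *mode {
	case "clean-shutdown", "drain":
		return *mode, nil
	default:
		return "", fmt.Errorf("invalid --sigterm value %q", *mode)
	}
}

func main() {
	mode, err := parseSigtermMode([]string{"--sigterm=drain"})
	fmt.Println(mode, err)
}
```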
A new PUT /config/drain endpoint can also initiate a drain, and a PUT /config/shutdown would initiate a clean shutdown.
When in draining mode, new messages will be rejected with an error: E_PUB_FAILED will be the response for new messages over the TCP protocol, and HTTP 503 over the HTTP protocol.
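The HTTP side of this could look roughly like the sketch below, which wires a PUT /config/drain handler to a publish handler that returns 503 once draining starts. The `server` type and `status` helper are hypothetical; /pub is nsqd's existing HTTP publish endpoint, but the handler bodies here are illustrative only.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// server is a toy HTTP front end, not nsqd's real implementation.
type server struct {
	draining atomic.Bool
}

func (s *server) handler() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/config/drain", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		s.draining.Store(true) // one-way switch: nothing resets it
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/pub", func(w http.ResponseWriter, r *http.Request) {
		if s.draining.Load() {
			// New messages are refused with HTTP 503 while draining.
			http.Error(w, "draining", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "OK")
	})
	return mux
}

// status sends one in-memory request and returns the response code.
func status(h http.Handler, method, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	h := (&server{}).handler()
	fmt.Println(status(h, http.MethodPost, "/pub"))
	fmt.Println(status(h, http.MethodPut, "/config/drain"))
	fmt.Println(status(h, http.MethodPost, "/pub"))
}
```

The TCP path would return E_PUB_FAILED from the same draining check; it is omitted to keep the sketch short.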
An attempt to create new topics or channels (via subscribe) will be rejected if nsqd is in drain mode.
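One possible reading of this rule is sketched below: subscribing to a topic and channel that already exist still works, so consumers can finish draining them, while any subscribe that would create something new is refused. The `registry` type and `errDraining` are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// errDraining is a hypothetical error for refused creations.
var errDraining = errors.New("cannot create topic or channel while draining")

// registry is a toy topic -> channels map, for illustration only.
type registry struct {
	draining bool
	topics   map[string]map[string]bool
}

// subscribe succeeds if the topic/channel pair exists or can be created.
func (r *registry) subscribe(topic, channel string) error {
	chans, ok := r.topics[topic]
	if ok && chans[channel] {
		return nil // already exists, so consumers may still attach
	}
	if r.draining {
		return errDraining // creation is refused in drain mode
	}
	if !ok {
		chans = make(map[string]bool)
		r.topics[topic] = chans
	}
	chans[channel] = true
	return nil
}

func main() {
	r := &registry{topics: map[string]map[string]bool{}}
	fmt.Println(r.subscribe("orders", "billing"))
	r.draining = true
	fmt.Println(r.subscribe("orders", "billing"))
	fmt.Println(r.subscribe("orders", "audit"))
}
```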
Once initiated, a drain operation can only be completed; it can't be canceled. TBD: PUT /config/shutdown may be able to override the drain, close all connections, and exit nsqd.
Open Questions
- Should each topic (and/or channel) be closed as it is drained, or should they only be closed after all are drained? If this functionality is per-topic, should the HTTP API expose that same behavior, and is there a need to expand the lookupd protocol to initiate a tombstone before exiting to avoid race conditions with clients?
- Should draining close a topic/channel or should it still be configured on a nsqd instance after restart? (i.e. is this similar to POST /topic/delete)