Zeek Supervisor Client
- Executable: `zeekc`
- Language: C++
- Revision Control: will live in the `zeek` git repo
The assumption is that a supervised Zeek cluster is already started/running: a user or
service manager has run `zeek -j …`, where the ellipsis includes some script/option
that loads the Zeek script defining the cluster for Zeek to supervise.
# Display standard "usage" info: flags, list of commands with brief explanation,
# influential environment variables, etc.
$ zeekc help
...
# Query the current cluster status
$ zeekc status [all | <node_name>]
...
# Displays a table of nodes according to these structures:
# https://docs.zeek.org/en/current/scripts/base/frameworks/supervisor/api.zeek.html#type-Supervisor::Status
# Do we need to include any other metrics in the returned status?
# Do we need more categories to filter by (e.g. node type) ?
# If there are downed nodes at this point, what do we expect users to do?
# Check the standard services logs for stderr/stdout info? Check reporter.log ?
# A `zeekc diag` command could help gather information, like ask Zeek supervisor
# to find core dumps and extract stack trace. Would it do more than that, like
# show last N lines of downed nodes' stderr, or last N lines of reporter.log?
# Inspect various state stored in script variable named by <ID>
$ zeekc print <ID> [all | <node_name>]
...
# User may modify Zeek scripts at this point and ask if they're valid/loadable:
$ zeekc check
...
# If it passes, ask to reload the cluster with updated scripts:
$ zeekc restart [all | <node_name>]
...
# If we wanted to stop the cluster for some time:
$ zeekc stop [all | <node_name>]
...
# To resume the cluster:
$ zeekc start [all | <node_name>]
...
# To terminate the cluster, including the supervisor:
$ zeekc terminate
...
# Normally wouldn't terminate the supervisor if a service-manager is handling
# the Zeek supervisor process itself and will just restart it, but `terminate`
# would be helpful for anyone running a supervised Zeek cluster "manually".
# The typical way to terminate a cluster, including supervisor, perhaps to
# upgrade the local Zeek installation, would look like:
$ zeekc stop && systemctl stop zeek
...
# One could go directly through `systemctl stop`, too, but that's not going to
# have any "orderly" shutdown semantics for the cluster which, in the
# future, may span multiple hosts that `zeekc` needs to orchestrate more
# intelligently than simply asking each host to "shutdown everything".
- `zeekc version` or `zeekc --version`
  - Show a version number and exit (see Open Questions below, but we might just plan to emit the `zeek` version number to which `zeekc` is paired)
- `zeekc -v/--verbose`
  - Enable verbose debugging output to stderr
- Do we anticipate a `zeekc` connecting to a `zeek` of a different version?
  - There are a couple of ways for this to break
    - The underlying Broker/CAF versioning between peers differs
      - Should be easy to detect handshake failure and report nicely
    - The underlying Broker/CAF message format is compatible, but the Supervisor events / data structures changed between `zeek` versions
      - I'd suggest having only a standardized "hello" or "handshake" exchange
        - `zeekc`: `publish("zeek/supervisor", hello, "zeek/zeekc")`
        - `zeek`: `publish("zeek/zeekc", hello, zeek_version())`
        - `zeekc`: wait for a response with a relatively short timeout interval. If the major/minor version matches what we were built for (3.2, 4.0, 4.1, etc), then proceed, else emit a fatal error.
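The last step of the proposed handshake, comparing the major/minor version reported by `zeek` against the one `zeekc` was built for, could look roughly like the sketch below. The function names `parse_major_minor` and `versions_compatible` are hypothetical and not part of any existing Zeek or Broker API. The sketch also assumes the version string starts with `<major>.<minor>` and ignores anything after that (patch level, `-dev` suffix).

```cpp
#include <cstddef>
#include <string>

// Reads a run of decimal digits from `s` starting at `pos` and advances `pos`
// past them. Returns -1 if there are no digits or the value is implausibly large.
int read_number(const std::string& s, std::size_t& pos) {
    std::size_t start = pos;
    int value = 0;
    while (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') {
        value = value * 10 + (s[pos] - '0');
        if (value > 100000)
            return -1;
        ++pos;
    }
    return pos == start ? -1 : value;
}

// Hypothetical helper: extracts the leading "<major>.<minor>" of a version
// string such as "4.0.1" or "4.1.0-dev.12". Returns false on malformed input.
bool parse_major_minor(const std::string& version, int& major_num, int& minor_num) {
    std::size_t pos = 0;
    int a = read_number(version, pos);
    if (a < 0 || pos >= version.size() || version[pos] != '.')
        return false;
    ++pos;
    int b = read_number(version, pos);
    if (b < 0)
        return false;
    major_num = a;
    minor_num = b;
    return true;
}

// True if the version reported in the "hello" response has the same
// major/minor as the version this client was built for.
bool versions_compatible(const std::string& built_for, const std::string& reported) {
    int bm = 0, bn = 0, rm = 0, rn = 0;
    return parse_major_minor(built_for, bm, bn)
        && parse_major_minor(reported, rm, rn)
        && bm == rm && bn == rn;
}
```

In this sketch an unparseable response is treated the same as a mismatch, so `zeekc` would emit its fatal error in both cases.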
- New options
  - `SupervisorControl::enable=T`: toggles whether to `listen()` for external requests by default
  - `SupervisorControl::listen_port=42042/tcp`: the port on which to `listen()` for external requests
- `publish()` responses to requests using a topic related to the request ID
  - This lets multiple "client" implementations "play nice" with each other without getting responses mixed up. An example of an alternate client could be a Python script that directly requests status updates from the Zeek supervisor.
- The PID status of nodes is currently the "PID of last fork()", even if that forked process already exited, so we need to change/document that to report some sentinel value indicating "currently down"
- Probably nice to have an API to request continuous status updates
  - e.g. any change in the process tree gets published to a topic of choosing
  - This helps `zeekc stop` do an orderly shutdown: ask to shut down workers, then proxies, then manager, then logger, and at each step wait for status updates to confirm all nodes of that type are gone
- Add `Supervisor::stop()` and `Supervisor::start()` to `kill()` and `fork()` nodes respectively, but without mutating the node table. This differs from `create()` and `destroy()` operations, which do change the node table. Child processes associated with a "stopped" node do not automatically get revived until "started".
  - `create()` can add a default parameter of `start: bool &default=T`.
- Rename `SupervisorControl::stop_request` to `SupervisorControl::terminate_request` and implement `stop_request` and `start_request` calling to `stop()` and `start()`
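To illustrate the proposed difference between `stop()`/`start()` and `create()`/`destroy()`, here is a small in-memory model. It is only a sketch: the `NodeTable` class and its `should_revive()` method are hypothetical, and no processes are forked or killed. It shows that stopping a node keeps its table entry but suppresses automatic revival, while destroying a node removes the entry.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Illustrative model only. The real Supervisor would fork()/kill() child
// processes, whereas this class just tracks the intended state per node.
class NodeTable {
public:
    // create(): adds an entry to the node table. `start` mirrors the
    // proposed `start: bool &default=T` parameter.
    bool create(const std::string& name, bool start = true) {
        return wanted_running_.emplace(name, start).second;
    }

    // destroy(): removes the entry from the node table.
    bool destroy(const std::string& name) {
        return wanted_running_.erase(name) > 0;
    }

    // stop()/start(): change the intended state without mutating table membership.
    bool stop(const std::string& name) { return set_wanted(name, false); }
    bool start(const std::string& name) { return set_wanted(name, true); }

    // Whether the supervisor should revive this node if its child has exited.
    bool should_revive(const std::string& name) const {
        auto it = wanted_running_.find(name);
        return it != wanted_running_.end() && it->second;
    }

    std::size_t size() const { return wanted_running_.size(); }

private:
    bool set_wanted(const std::string& name, bool running) {
        auto it = wanted_running_.find(name);
        if (it == wanted_running_.end())
            return false;  // unknown node: nothing to stop or start
        it->second = running;
        return true;
    }

    std::map<std::string, bool> wanted_running_;
};
```

With this model, `stop("worker-1")` leaves `size()` unchanged but makes `should_revive("worker-1")` return false until `start("worker-1")` is called.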
- `check`
  - Try to start a shadow version of the process tree in "parse-only" mode and report whether anything exits non-zero
  - If the supervisor is not running, or has no children, report that as an error
- `print`
  - Not currently supported by the `SupervisorControl` API, but can add
- `stop`
  - Orderly stop of workers, then proxies, then manager, then logger
- `start`
  - Orderly start of logger, then manager, then proxies, then workers
- `restart`
  - Likely just `stop` followed by `start`
- `terminate`
  - Likely just `stop` followed by `terminate_request()`
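The ordering used by `stop` and `start` above can be sketched as a sort over node roles. The `Role` enum, `NodeInfo` struct, and function names here are hypothetical stand-ins for whatever `zeekc` ends up receiving from the supervisor status. The sketch only computes the order and does not wait for status updates between steps.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Roles listed in start order: logger, then manager, then proxies, then workers.
enum class Role { Logger, Manager, Proxy, Worker };

struct NodeInfo {
    std::string name;
    Role role;
};

int start_rank(Role r) { return static_cast<int>(r); }

// Returns nodes in the order they should be started.
std::vector<NodeInfo> start_order(std::vector<NodeInfo> nodes) {
    std::stable_sort(nodes.begin(), nodes.end(),
                     [](const NodeInfo& a, const NodeInfo& b) {
                         return start_rank(a.role) < start_rank(b.role);
                     });
    return nodes;
}

// Returns nodes in the order they should be stopped, which is the reverse
// role order: workers, then proxies, then manager, then logger.
std::vector<NodeInfo> stop_order(std::vector<NodeInfo> nodes) {
    std::stable_sort(nodes.begin(), nodes.end(),
                     [](const NodeInfo& a, const NodeInfo& b) {
                         return start_rank(a.role) > start_rank(b.role);
                     });
    return nodes;
}
```

A stable sort keeps nodes of the same role in their original relative order, so the output is deterministic for a given status table.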
- Ability to toggle TLS: command-line flag or env. variable.
- Ability to change connection port: command-line flag or env. variable.
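One possible precedence for the connection port is command-line flag first, then environment variable, then a default. The sketch below assumes that order. The environment variable name `ZEEKC_PORT` and the function names are hypothetical. The fallback of 42042 is taken from the proposed `SupervisorControl::listen_port` default above. Invalid values are silently skipped here, whereas a real client would likely report them as errors.

```cpp
#include <cstdlib>

// Parses a decimal TCP port. Returns a value in [1, 65535], or -1 if `text`
// is null, empty, non-numeric, or out of range.
int parse_port(const char* text) {
    if (text == nullptr || *text == '\0')
        return -1;
    int port = 0;
    for (const char* p = text; *p != '\0'; ++p) {
        if (*p < '0' || *p > '9')
            return -1;
        port = port * 10 + (*p - '0');
        if (port > 65535)
            return -1;
    }
    return port >= 1 ? port : -1;
}

// Precedence assumed in this sketch: flag value, then environment value,
// then the fallback. Null pointers mean "not provided".
int resolve_port(const char* flag_value, const char* env_value, int fallback = 42042) {
    int port = parse_port(flag_value);
    if (port > 0)
        return port;
    port = parse_port(env_value);
    if (port > 0)
        return port;
    return fallback;
}

// Convenience wrapper reading the hypothetical ZEEKC_PORT environment variable.
int resolve_port_from_environment(const char* flag_value) {
    return resolve_port(flag_value, std::getenv("ZEEKC_PORT"));
}
```

The TLS toggle could follow the same flag-then-environment pattern with a boolean instead of a port number.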