
Zeek Supervisor Client

Jon Siwek edited this page Jun 30, 2020 · 1 revision

Zeek Supervisor Command-Line Client

Project Infrastructure

  • Executable: zeekc
  • Language: C++
  • Revision Control: will live in zeek git repo

Usage Scenarios

These scenarios assume a supervised Zeek cluster is already running: a user or service manager has run zeek -j …, where the ellipsis includes some script or option that loads the Zeek script defining the cluster for Zeek to supervise.

# Display standard "usage" info: flags, list of commands with brief explanation,
# influential environment variables, etc.

$ zeekc help
  ...

# Query the current cluster status

$ zeekc status [all | <node_name>]
  ...

# Displays a table of nodes according to these structures:
# https://docs.zeek.org/en/current/scripts/base/frameworks/supervisor/api.zeek.html#type-Supervisor::Status
# Do we need to include any other metrics in the returned status?
# Do we need more categories to filter by (e.g. node type) ?

# If there are downed nodes at this point, what do we expect users to do?
# Check the standard services logs for stderr/stdout info?  Check reporter.log ?
# A `zeekc diag` command could help gather information, like ask Zeek supervisor
# to find core dumps and extract stack trace.  Would it do more than that, like
# show last N lines of downed nodes' stderr, or last N lines of reporter.log?

# Inspect various state stored in script variable named by <ID>
$ zeekc print <ID> [all | <node_name>]
  ...

# User may modify Zeek scripts at this point and ask if they're valid/loadable:
$ zeekc check
  ...

# If it passes, ask to reload the cluster with updated scripts:
$ zeekc restart [all | <node_name>]
  ...

# If we wanted to stop the cluster for some time:
$ zeekc stop [all | <node_name>]
  ...

# To resume the cluster:
$ zeekc start [all | <node_name>]
  ...

# To terminate the cluster, including the supervisor:
$ zeekc terminate
  ...

# Normally wouldn't terminate the supervisor if a service-manager is handling
# the Zeek supervisor process itself and will just restart it, but `terminate`
# would be helpful for anyone running a supervised Zeek cluster "manually".
# The typical way to terminate a cluster, including supervisor, perhaps to
# upgrade the local Zeek installation, would look like:

$ zeekc stop && systemctl stop zeek
  ...

# One could go directly through `systemctl stop`, too, but that's not going to
# have any "orderly" shutdown semantics for the cluster which, in the
# future, may span multiple hosts that `zeekc` needs to orchestrate more
# intelligently than simply asking each host to "shut down everything".

Additional Meta-Usage

  • zeekc version or zeekc --version

    • Show a version number and exit (see Open Questions below, but we might just plan to emit the zeek version number to which zeekc is paired)
  • zeekc -v/--verbose

    • Enable verbose debugging output to stderr

Open Questions

  • Do we anticipate a zeekc connecting to a zeek of different versions?
    • There are a couple of ways for this to break
      • The underlying Broker/CAF versioning between peers differs
        • Should be easy to detect handshake failure and report nicely
      • The underlying Broker/CAF message format is compatible, but the Supervisor events / data structures changed between zeek versions
        • I'd suggest having only standardized "hello" or "handshake" exchange
          • zeekc: publish("zeek/supervisor", hello, "zeek/zeekc")
          • zeek: publish("zeek/zeekc", hello, zeek_version())
          • zeekc: wait for a response with a relatively short timeout interval. If the major/minor version matches what we were built for (3.2, 4.0, 4.1, etc.), then proceed; otherwise emit a fatal error.
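
The compatibility check in the last step could look roughly like the following minimal Python sketch. It assumes the string returned by zeek_version() begins with `major.minor` (for example `3.2.0-dev.123`). The function names and the `BUILT_FOR` constant are hypothetical, and the Broker exchange and its timeout are left out: a timed-out wait is represented here simply as `None`.

```python
import re

# Hypothetical: the (major, minor) release this client was built against.
BUILT_FOR = (3, 2)


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '3.2.0-dev.123'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_compatible(peer_version, built_for=BUILT_FOR):
    """Return True when the peer's major/minor version matches ours."""
    return parse_major_minor(peer_version) == built_for


def check_handshake(response, built_for=BUILT_FOR):
    """Validate a handshake reply; `response` is None if the wait timed out."""
    if response is None:
        raise SystemExit("fatal: no handshake response from the Zeek supervisor")
    if not is_compatible(response, built_for):
        raise SystemExit(
            f"fatal: zeek {response} does not match zeekc "
            f"{built_for[0]}.{built_for[1]}"
        )
```

Comparing only the major and minor components means patch-level and development-build suffixes do not cause a mismatch, which is the behavior the question above proposes.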

Implementation Notes

Zeek-side changes to better support zeekc

  • New options

    • SupervisorControl::enable=T: toggles whether to listen() for external requests by default
    • SupervisorControl::listen_port=42042/tcp: the port on which to listen() for external requests
  • publish() responses to requests using a topic related to request ID

    • This allows multiple "client" implementations to "play nice" with each other without getting their responses mixed up. An example of an alternate client could be a Python script that directly requests status updates from the Zeek supervisor.
  • The PID status of a node is currently the "PID of last fork()", even if that forked process has already exited, so this needs to change (and be documented) to report some sentinel value indicating "currently down"

  • Probably nice to have an API to request continuous status updates

    • e.g. any change in the process tree gets published to a topic of choosing
    • This helps zeekc stop do an orderly shutdown: ask to shut down workers, then proxies, then manager, then logger, and at each step wait for status updates to confirm that all nodes of that type are gone
  • Add Supervisor::stop() and Supervisor::start() to kill() and fork() nodes respectively, but without mutating the node table. This differs from create() and destroy() operations which do change the node table. Child processes associated with a "stopped" node do not automatically get revived until "started".

    • create() can add a default parameter of start: bool &default=T.
  • Rename SupervisorControl::stop_request to SupervisorControl::terminate_request, and implement stop_request and start_request, which call stop() and start()
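
To illustrate the intended difference between stop()/start() and create()/destroy(), here is a toy Python model of the node table. It is not Zeek code: the class name, the fake PID counter, and the use of `None` as the "currently down" sentinel are all assumptions made for illustration only.

```python
import itertools


class ToySupervisor:
    """Toy model: the node table maps node name -> PID, or None when down."""

    def __init__(self):
        self.nodes = {}
        self._fake_pids = itertools.count(1000)  # stand-in for fork()

    def create(self, name, start=True):
        # Mutates the node table; forks right away unless start is False.
        self.nodes[name] = next(self._fake_pids) if start else None

    def destroy(self, name):
        # Mutates the node table: the node is forgotten entirely.
        del self.nodes[name]

    def stop(self, name):
        # Kills the process but keeps the node table entry.
        self.nodes[name] = None

    def start(self, name):
        # Forks a process for an existing node that is currently down.
        if self.nodes[name] is None:
            self.nodes[name] = next(self._fake_pids)

    def status(self):
        # Downed nodes report the None sentinel rather than a stale PID.
        return dict(self.nodes)
```

In this model a stopped node stays in the table with the sentinel value until start() is called, whereas a destroyed node disappears from status() altogether.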

zeekc

  • check

    • Try to start a shadow version of the process tree in "parse-only" mode and report whether anything exits non-zero
    • If the supervisor is not running, or has no children, report that as an error
  • print

    • Not currently supported by SupervisorControl API, but can add
  • stop

    • Orderly stop of workers, then proxies, then manager, then logger
  • start

    • Orderly start of logger, then manager, then proxies, then workers
  • restart

    • Likely just stop followed by start
  • terminate

    • Likely just stop followed by terminate_request()
  • Ability to toggle TLS: command-line flag or env. variable.

  • Ability to change connection port: command-line flag or env. variable
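
The stop and start ordering described above can be sketched as follows. This is a sketch under assumptions: the role names, the `plan` helper, and the name-to-role mapping are hypothetical, and a real client would wait for status updates confirming that each group is down (or up) before moving on to the next step.

```python
# Start order from the notes above; stop order is the reverse.
START_ORDER = ["logger", "manager", "proxy", "worker"]
STOP_ORDER = list(reversed(START_ORDER))


def plan(nodes, order):
    """Group node names by role, following the given role order.

    `nodes` maps node name -> role (hypothetical shape). Returns a list of
    steps; each step is the sorted list of node names to act on together.
    Roles with no nodes are skipped.
    """
    steps = []
    for role in order:
        group = sorted(name for name, r in nodes.items() if r == role)
        if group:
            steps.append(group)
    return steps
```

With this shape, restart could be expressed as the stop plan followed by the start plan, matching the note that restart is likely just stop followed by start.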
