-
Notifications
You must be signed in to change notification settings - Fork 0
STARDOS Logging
Logging is an incredibly important part of STARDOS. It is how we will monitor the state of each process, and the logs will be captured to be reviewed after the mission if necessary. As such, it is important to have a cohesive logging strategy so that any issues can be easily found and so that irrelevant information is kept out of production systems.
Each of the five logging levels ROS2 has defined will be used in STARDOS for different purposes, and the system will be configured to run at different log levels based on the current use case. In the best case, using the wrong log level will pollute the logs with irrelevant information and make diagnosing issues harder. In the worst case, using the wrong log level will hide critical information about runtime problems from post-mission diagnostics. As such, it is vitally important to the integrity of STARDOS that the logging levels be used properly.
DEBUG logging is the lowest-severity log level. There is no standard for DEBUG logging, since (as implied by the name) it is reserved for debugging use. The system will only ever be set to DEBUG log level during the debugging phase of development, so log pollution is not an issue at this level. Generally speaking, any changing runtime values should be logged to DEBUG if they are logged. There is not a limit to how many DEBUG log messages a node can write, but they should not contain computationally-expensive operations in production code -- any that do should be removed from the source code prior to deployment.
ROS2 already produces a large number of logs at the DEBUG level. It may be useful to
filter these out, particularly because each time a message is published to a ROS2
topic, rclpy or rclcpp will produce multiple lines of DEBUG output. These can be
filtered out by running stderr through grep -v rclpy or grep -v rclcpp
respectively.
INFO logging is used for informational messages. The system will be set to INFO log level during testing and initial deployment. For each operation a node performs, it should log a single message at INFO level upon completion of the operation. If the operation may take an extended period of time, an additional INFO level messsage may be logged at the beginning of the operation as well.
In addition, any time a node starts a subprocess or other asynchronous operation, it must produce an INFO level message stating that it has done so, in a way that makes it clear what asynchronous operation it is running. This may be the same message as is logged at the beginning of a long-running operation such as an image capture.
WARN logging is used for warning messages. The system will be set to WARN log level during operational deployment. Any abnormal operation that does not endanger the node, mission, or future operations but may impact the results of the current operation should produce a WARN level message.
In addition, any time a subprocess exits with a non-zero exit code, a WARN level
message should be logged. Any ERROR or FATAL level messages detailing the reason
will be logged by the subprocess. The exception is if the exit code is -9, as this
indicates it was killed with SIGKILL, and thus had no opportunity to log anything
prior to exiting. SIGKILL is never to be used by any STARDOS node, so a process
being killed in this way is an error state.
ERROR logging is used for error messages. Any abnormal or failed operation that may endanger the node, mission, or future operations should produce an ERROR level message. Additionally, a node which encounters an unrecoverable error should produce an ERROR level message, unless said error endangers the entire mission (such as a capture group, the datalink node, or the control node).
In addition, any time a subprocess exits with code -9 (indicating termination by
SIGKILL), an ERROR level message should be logged.
FATAL logging is reserved for errors which prove fatal to the mission as a whole, rather than the conventional meaning of "fatal for the process". This includes a capture group crashing (as data can no longer be collected), the datalink node crashing (as autopilot telemetry is no longer accessible), or any control node operation that leaves the payload in a state unsuitable to continue a mission.
Additionally, a FATAL level message may be logged if the computer which produced the log must be rebooted in order to regain functionality.
STARDOS nodes should never explicitly write to stdout or stderr. Some driver
libraries may write to these, and that is acceptable, but STARDOS node developers
should not use them. The reasoning behind this is that, because every STARDOS node
will be a child of the control node, they all share a single stdout and
stderr. ROS2's logging tools include the node name and namespace in the log
outputs, meaning that it is easy to search through the control node's output to see
every node's output uniquely.