Skip to content

STARDOS Logging

Alex Brimhall edited this page Jul 11, 2022 · 1 revision

title: STARDOS Logging

Overview

Logging is an incredibly important part of STARDOS. It is how we will monitor the state of each process, and the logs will be captured to be reviewed after the mission if necessary. As such, it is important to have a cohesive logging strategy so that any issues can be easily found and so that irrelevant information is kept out of production systems.

The Logging Levels

Each of the five logging levels ROS2 has defined will be used in STARDOS for different purposes, and the system will be configured to run at different log levels based on the current use case. In the best case, using the wrong log level will pollute the logs with irrelevant information and make diagnosing issues harder. In the worst case, using the wrong log level will hide critical information about runtime problems from post-mission diagnostics. As such, it is vitally important to the integrity of STARDOS that the logging levels be used properly.

DEBUG Logging

DEBUG logging is the lowest-severity log level. There is no standard for DEBUG logging, since (as implied by the name) it is reserved for debugging use. The system will only ever be set to DEBUG log level during the debugging phase of development, so log pollution is not an issue at this level. Generally speaking, any changing runtime values should be logged to DEBUG if they are logged. There is not a limit to how many DEBUG log messages a node can write, but they should not contain computationally-expensive operations in production code -- any that do should be removed from the source code prior to deployment.

A Note on Debugging

ROS2 already produces a large number of logs at the DEBUG level. It may be useful to filter these out, particularly because each time a message is published to a ROS2 topic, rclpy or rclcpp will produce multiple lines of DEBUG output. These can be filtered out by running stderr through grep -v rclpy or grep -v rclcpp respectively.

INFO Logging

INFO logging is used for informational messages. The system will be set to INFO log level during testing and initial deployment. For each operation a node performs, it should log a single message at INFO level upon completion of the operation. If the operation may take an extended period of time, an additional INFO level messsage may be logged at the beginning of the operation as well.

In addition, any time a node starts a subprocess or other asynchronous operation, it must produce an INFO level message stating that it has done so, in a way that makes it clear what asynchronous operation it is running. This may be the same message as is logged at the beginning of a long-running operation such as an image capture.

WARN Logging

WARN logging is used for warning messages. The system will be set to WARN log level during operational deployment. Any abnormal operation that does not endanger the node, mission, or future operations but may impact the results of the current operation should produce a WARN level message.

In addition, any time a subprocess exits with a non-zero exit code, a WARN level message should be logged. Any ERROR or FATAL level messages detailing the reason will be logged by the subprocess. The exception is if the exit code is -9, as this indicates it was killed with SIGKILL, and thus had no opportunity to log anything prior to exiting. SIGKILL is never to be used by any STARDOS node, so a process being killed in this way is an error state.

ERROR Logging

ERROR logging is used for error messages. Any abnormal or failed operation that may endanger the node, mission, or future operations should produce an ERROR level message. Additionally, a node which encounters an unrecoverable error should produce an ERROR level message, unless said error endangers the entire mission (such as a capture group, the datalink node, or the control node).

In addition, any time a subprocess exits with code -9 (indicating termination by SIGKILL), an ERROR level message should be logged.

FATAL Logging

FATAL logging is reserved for errors which prove fatal to the mission as a whole, rather than the conventional meaning of "fatal for the process". This includes a capture group crashing (as data can no longer be collected), the datalink node crashing (as autopilot telemetry is no longer accessible), or any control node operation that leaves the payload in a state unsuitable to continue a mission.

Additionally, a FATAL level message may be logged if the computer which produced the log must be rebooted in order to regain functionality.

A note on stdout and stderr

STARDOS nodes should never explicitly write to stdout or stderr. Some driver libraries may write to these, and that is acceptable, but STARDOS node developers should not use them. The reasoning behind this is that, because every STARDOS node will be a child of the control node, they all share a single stdout and stderr. ROS2's logging tools include the node name and namespace in the log outputs, meaning that it is easy to search through the control node's output to see every node's output uniquely.

Clone this wiki locally