Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Structured Logging #3

Closed
CMCDragonkai opened this issue Dec 6, 2021 · 17 comments · Fixed by #14
Closed

Integrate Structured Logging #3

CMCDragonkai opened this issue Dec 6, 2021 · 17 comments · Fixed by #14
Assignees
Labels
development Standard development r&d:polykey:supporting activity Supporting core activity

Comments

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 6, 2021

Specification

Structured logging is to use JSON on the STDERR logs. This exposes more of the data, and allows us to create a standard for the set of fields to be used.

There's no real official standard for structured logging. But journalctl and syslog defines some common fields. The fields can be similar to our formatting that we already use, but this time in a JSON format. Additional data fields can be encoded too.

We could put a json formatting in somewhere that provides the JSON serialisation.

This can be especially useful since we are also looking into adding error chaining, which we have done in a few places. But by creating our own @matrixai/error we can implement error chaining. MatrixAI/Polykey#304

Additional context

Tasks

  1. Research different structured logging libraries
  2. Compare with opentracing or tracing based data formats
  3. Implement simple structured logging capabilities into js-logger
@CMCDragonkai CMCDragonkai added the development Standard development label Dec 6, 2021
@CMCDragonkai
Copy link
Member Author

Need to consider a couple of things:

  • Common fields that should align with other structured logging mechanisms
  • The --format field used for output formatting can double as for logger formatting too
  • How does error chaining come into play here, and how should these be shown?
  • How does stack traces come into play here? Stack traces don't have sensitive information, they are just traces of what functions called what functions.

@CMCDragonkai
Copy link
Member Author

So basically the problem is this:

  1. Does --format apply to just STDOUT or also to STDERR? I believe it should apply to both. Especially as STDERR will likely be relied on upon quite a bit by pk agent start. So doing something like pk agent start --format=json --verbose will then ensure that INFO messages get logged out to STDERR by with JSON structure rather than human readable messages. This will be important for any kind of long-term debugging of the PK agent.
  2. If --format applies to STDERR as well, this then affects the PK CLI calls which would mean any exception plus INFO messages would also become JSON. This is OK as we can expect that once --format=json is used, we would actually want this to be worked by a computer.

Therefore any usage by a computer should always use --format=json. It would affect exception output on STDERR, logging messages to STDERR, and output to STDOUT.

@CMCDragonkai
Copy link
Member Author

One additional issue is the "level of verbosity". This is basically controlled by --verbose. But when we output an exception to STDERR, we don't currently show the stack trace atm. The root exception handler by the PK agent and in PK CLI calls don't show the stack trace. The stack trace is useful for tracing where the error comes from.

Right now verbosity only controls INFO, WARNING... etc. We could do something where one chooses to use logger.debug to show the stacktrace as well if the logger level is appropriate. But this requires a conditional check on the logger's effective level.

Something like:

if (this.logger.getEffectiveLevel() > ...) {
  this.logger.debug(e, ...); // include stack trace
} else {
  this.logger.info(e, ...); // don't include stack trace
}

A similar situation would apply to errors being thrown over the GRPC boundary as discussed here MatrixAI/Polykey#249.

I reckon something can be automated if we integrate the design of js-logger with js-errors library here MatrixAI/Polykey#304, such that when an error is being logged out, we gain additional context depending on what the log level is. Of course what errors is passed over GRPC also depends, some errors we only want to write to the agent's STDERR, other errors we want to propagate to the GRPC client, and in other cases we would want to "map" agent-local errors to GRPC API errors.

This would mean something like catch a local agent error, write that out to STDERR on the agent on the RPC handler, then create a new error to be propagated to the GRPC client. We have to separate client errors from server errors like in HTTP where there is 4xx errors vs 5xx errors. 5xx errors would be caught by the exit handler, and put to STDERR. 4xx errors are not, they would be emitted by the RPC handler. We would only log them out to STDERR if the RPC handler handled the error, and also that this is a problem for the agent itself.

@CMCDragonkai
Copy link
Member Author

Node v16 should be reporting the error chain nicely now.

@CMCDragonkai
Copy link
Member Author

@emmacasolin FYI this issue addresses the idea of how the exceptions (including the error chaining) will eventually be displayed on the CLI or STDERR logging.

@CMCDragonkai
Copy link
Member Author

Note that the cause property is actually an ES2022 feature, but nodev16 does support it, but because we haven't fully updated our standards to it, we will forget that it actually exists on the standard errors.

@CMCDragonkai
Copy link
Member Author

Any kind of structured logging will be intended for machine consumption and filtering. That's because any logging of server errors may contain a long stack string, and that information is likely to be quite noisy on basic log readers.

@CMCDragonkai
Copy link
Member Author

@emmacasolin this would still be needed to ensure that --format json affects STDERR logging.

@CMCDragonkai
Copy link
Member Author

CMCDragonkai commented Jun 17, 2022

Some thoughts...

We've been hitting various performance and other network buggy issues which are hard to debug because of concurrency and that asynchronous work are happening out of order.

I think we need to improve our structured logging to provide a better observability into the runtime of our program.

To do this, I'm thinking that we need to incorporate some new ideas:

Point is flamegraph was only invented by Brendan Gregg back in 2013/2014. But the same idea can be applied to view an entire trace of the entire distributed system. But also works for a local system too. But the tooling isn't really solid though and seems heavyweight for local debugging. So something inbetween might be useful.

@CMCDragonkai
Copy link
Member Author

Some clarification on terminology.

There's a "standard" for structured logging as exemplified by https://www.structlog.org/.

In that structure, it is not using JSON, but using a human readable structure.

This issue is more about JSON as a structured logging format.

I believe the human readable structure as demonstrated by structlog is a good idea, and we can move our logging format to fit that format eventually. But JSON logging is simply more powerful for any machine analysis of logs.

However atm, I'm not sure what the deal is with information like trace #4, and how structlog would handle information like that. It's not human readable to have information like that on one single line (which is the default for most loggers), it's simply too much information. And if we want to preserve human readability, newlines/tabs/columns on the terminal must be used. The question is, is it really possible to enable easy machine parseability while still using whitespace as a formatting operator for human readability on the terminal? Maybe the goals are in conflict.

Some more info here: hynek/structlog#407

I think we should go as far as we can to maintain structured logging for both human readability AND machine parseability.

In #4 I suggested that if we ever do need to use literal newlines or whitespace operator for formatting, it must always be done with some indentation. That way, one can say that any newline that has ahead of it, is still a continuation of the same log message.

But it's a bit more complicated than that. Because you need a "lookbehind" to say that whatever is before the must not also be a \n.

So this should be fine as that terminates the message:

ABC \n

But this should not be considered an end to the message

ABC \n
  ABC\n

But this would get confused with:

ABC \n
  ABC \n

Where it is 2 messages.

So if there are spaces after \n, then that must be joined up with the first message.

Then machines should be able to parse individual messages.

@CMCDragonkai
Copy link
Member Author

However do note that such usage of newlines are unlikely to be parseable easily, since it depends on logging analysers that know that this can occur. I reckon that it should be possible to code a custom parser for lnav that understands such a formatter though.

However it looks like it still isn't entirely a solved problem: tstack/lnav#410 (multi-line matching).

This further shows that any usage of multiple lines for a log can only really be usable on the terminal and custom log parsers.

But at that point the JSON structure should be used instead.

@CMCDragonkai
Copy link
Member Author

@tegefaulkes @emmacasolin follow this for more structured logging research.

@CMCDragonkai
Copy link
Member Author

Also I noticed that structlog talks about "context", that almost sounds like opentelemetry's idea of spans.

My opinion is that "context" can be done by hooking into async_hooks, however that won't work for distributed spans, which goes across process contexts, so at some point it has to be done through decorators.

It does seem that decorators would be the most expressive way of doing this, as this is basically a sort of AOP (aspect oriented programming) cross-cutting concern. We would want to avoid having to add lots of line-noise, but still be able to drop down to imperative logging for flexible control in edge cases where a decorator just doesn't work.

@CMCDragonkai
Copy link
Member Author

Because PK is a decentralised system, we do have calls that propagate from 1 agent to another agent. Debugging this is crucial, but not necessarily possible in production unless we control all the parties to the interaction. Right now there's a bug in the NAT testing PR where signalling involves agent 1 asking the seed node to ask agent 2 to reverse hole punch. If there's a delay anywhere here, we end up with timeout errors.

@CMCDragonkai
Copy link
Member Author

Structured graph visualisation with contexts/spans/traces. I think we take a page from the git graph, and use things like: https://github.com/nicoespeon/gitgraph.js.

Along with the ink library: https://github.com/vadimdemedes/ink we may even extend the react components to eventually the GUI too show a trace on what is occurring in the system.

@CMCDragonkai CMCDragonkai self-assigned this Jul 18, 2022
@CMCDragonkai
Copy link
Member Author

To solve this issue, I'm just going to incorporate json as formatting style. In terms of the actual data being logged, it'd be similar to what is already available under our formatting. I might incorporate similar keywords to what is already available to align. But the contexts and spans would have to be done in a separate issue.

A JSON format would be one step towards producing traces JSON data that can then be analysed and visualised. There is a relationship between this and "state" simulacra, but somewhat independent. State simulacra is generally specific to the state being modelled, while tracing is very generic and applies to any contextful based systems.

@CMCDragonkai
Copy link
Member Author

@emmacasolin please change up PK to start using the formatting.jsonFormatter when --format=json is set on the CLI as per #3 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development r&d:polykey:supporting activity Supporting core activity
Development

Successfully merging a pull request may close this issue.

1 participant