Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(logging): Provide JSON-formatted structured logging as option #1308

Conversation

vtaskow
Copy link
Contributor

@vtaskow vtaskow commented Jul 21, 2023

Background

Structured logging is a process for generating log records in a format that is easily-read by a machine. The aims to provide analytics and insights from logs.
In the rest of our setup, components' logs are sent to fluentd and then to Elastic Search for processing.

Why

Currently, MLServer does not provide an option for producing structured logging. Enabling structured logging on initialisation will be beneficial in case users would like to create log insights.

How

There are several python libraries that support structured logging(structlog, python-json-logger), however, our case is very simple. Adding an additional dependency would be an overcomplication and the actual code written is not very much different from the proposed solution.

Formatting

In cases when model name is missing, a key-value pairs for both model name and version should not be present in the log.
In cases when model version is missing, a key-value pair for it should not be present in the log.
When mlserver is started, one of the two log formats is chosen based on the newly added flag in the settings.

@vtaskow vtaskow requested a review from adriangonz July 21, 2023 10:36
@vtaskow vtaskow self-assigned this Jul 21, 2023
@vtaskow vtaskow changed the title 602 Introduce str logging feat(logging): Provide JSON-formatted structured logging as an option Jul 21, 2023
@@ -42,7 +42,9 @@ class ModelLoggerFormatter(logging.Formatter):
"""

_UNSTRUCTURED_FORMAT = "%(asctime)s [%(name)s]%(model)s %(levelname)s - %(message)s"
_STRUCTURED_FORMAT = "{'time':'%(asctime)s', 'name': '%(name)s', 'level': '%(levelname)s', 'message': '%(message)s' %(model)s}"
_STRUCTURED_FORMAT = "{\"time\":\"%(asctime)s\", \"name\": \"%(name)s\", " \
Copy link
Contributor

@agrski agrski Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I realise this PR is still in draft, but offering a couple of suggestions that might make things easier.

Firstly, as JSON requires double quotes, why not use single quotes to wrap the entire thing as a Python string? I.e. '{"time": "%(asctime)", ...}'.

Secondly, as this is still prone to human errors, why not represent the entire thing as a dictionary and turn this into the format string using marshalling (json.dumps)? As a tiny example:

import json

d = {'time': '%(asctime)'}
json.dumps(d)
# '{"time": "%(asctime)"}'

Doing this also makes it easy to add key-value pairs, such as model details below, and avoids having to worry about where in a string something goes as this is deferred to the point of marshalling, rather than determining the values to include.

Copy link
Contributor

@agrski agrski Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For greater control over validation on provided fields, a data class could be used, and those have been around since Python 3.7 I think. I suspect this may also be desirable because the field order should be fixed across invocations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about wrapping with single points, thanks! Re dumps: what we want is to not have model_name or model_version when those are not set by the ctx vars, and in the dictionary we can't have only key without value. E.g:

_STRUCTURED_FORMAT = {
        "time": "%(asctime)s",
        "name": "%(name)s",
        "level": "%(levelname)s",
        "message": "%(message)s",
        "model": ?
}
VS
'{"time": "%(asctime)s", "name": "%(name)s", "level": "%(levelname)s", '
        '"message": "%(message)s" %(model)s}' // model here would contain the key-v vair with commas for proper json formatting

One thing I can do is keep the dict version, and then dump it and attach the model before the closing brace in the super().init(...). But this becomes a bit messy. I think the single quotes-option is cleanest. I am open to other suggestions, of course

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we are trying to avoid is modifying the "private" _format on the fly in the format() method, as it's not thread-safe from what I read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For greater control over validation on provided fields, a data class could be used, and those have been around since Python 3.7 I think. I suspect this may also be desirable because the field order should be fixed across invocations.

Yeah, I was thinking about wrapping those in a class but for now it seems that we won't benefit as much from it. Possibly if we introduce further model context fields or require validation. (I think the model name and version validation should be contained in the ModelSetting class where they are initialised).

Not sure I understand the point about field order across invocations. Could you explain 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the point about field order across invocations. Could you explain

Python implemented reproducible dictionary iteration a while back, I'm pretty sure, but what happens if a dict grows? Expanding it means changing the hashing function, which means the iteration order changes. So you might see time, name, message, level one time then message, model, time, level, name another - hard to read across lines!

Using a dataclass might prevent that by having a set dict size to use and/or allowing custom definition of the time order, e.g. time first, etc.


One thing we are trying to avoid is modifying the "private" _format on the fly in the format() method, as it's not thread-safe from what I read

That would make sense, although coroutines for async calls are running on a single thread, so it'd need a bit more thought as to whether there are any suspending points that might cause such issues.

💭 How about instead just fully defining the format, rather than doing string joins on the fly? Slightly duplicative, but also really explicit and obvious as to what's expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right about growing the dictionary and not preserving the order. However, I would say that if the structured logging is enabled, it most likely means the logs will be fed to a pipeline for processing and then to a log management system for further inspection. And also if we keep the current structured log format which is a string, the stdout will contain strings which resemble valid json and the order of the fields will not be changed across invocations. A side note from that, in Python specifically, if we have a pipeline that loads those logs via json.loads(), then the order of fields will not be guaranteed by default, but an additional param for OrderedDict can be passed to handle this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, could you give an example of fully defining the format without string joining 🤔

@vtaskow vtaskow added the enhancement New feature or request label Jul 21, 2023
@vtaskow vtaskow changed the title feat(logging): Provide JSON-formatted structured logging as an option feat(logging): Provide JSON-formatted structured logging as option Jul 21, 2023
@vtaskow vtaskow marked this pull request as ready for review July 21, 2023 16:09
@vtaskow vtaskow marked this pull request as draft July 24, 2023 10:39

logger.addHandler(stream_handler)
handler.setFormatter(ModelLoggerFormatter(settings))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the CLI is used to run mlserver start ., first the CLI will configure a logger and then start the MLServer(settings) which will then again configure a logger but with settings. This is all within the same thread. In the previous version, the handler was initialised with the wrong formatter first and then when the MLServer class was configuring the logger, it wasn't setting a formatter because there was already a handler for the logger.

@vtaskow vtaskow marked this pull request as ready for review July 26, 2023 09:11
Copy link
Contributor

@adriangonz adriangonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good from my side @vtaskow ! 🚀

I've added a minor nitpick but, as a nitpick, you can totally ignore it.

mlserver/context.py Outdated Show resolved Hide resolved
@adriangonz adriangonz merged commit 57208d9 into SeldonIO:master Jul 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants