-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(logging): Provide JSON-formatted structured logging as option #1308
feat(logging): Provide JSON-formatted structured logging as option #1308
Conversation
mlserver/logging.py
Outdated
@@ -42,7 +42,9 @@ class ModelLoggerFormatter(logging.Formatter): | |||
""" | |||
|
|||
_UNSTRUCTURED_FORMAT = "%(asctime)s [%(name)s]%(model)s %(levelname)s - %(message)s" | |||
_STRUCTURED_FORMAT = "{'time':'%(asctime)s', 'name': '%(name)s', 'level': '%(levelname)s', 'message': '%(message)s' %(model)s}" | |||
_STRUCTURED_FORMAT = "{\"time\":\"%(asctime)s\", \"name\": \"%(name)s\", " \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💭 I realise this PR is still in draft, but offering a couple of suggestions that might make things easier.
Firstly, as JSON requires double quotes, why not use single quotes to wrap the entire thing as a Python string? I.e. '{"time": "%(asctime)", ...}'
.
Secondly, as this is still prone to human errors, why not represent the entire thing as a dictionary and turn this into the format string using marshalling (json.dumps
)? As a tiny example:
import json
d = {'time': '%(asctime)'}
json.dumps(d)
# '{"time": "%(asctime)"}'
Doing this also makes it easy to add key-value pairs, such as model details below, and avoids having to worry about where in a string something goes as this is deferred to the point of marshalling, rather than determining the values to include.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For greater control over validation on provided fields, a data class could be used, and those have been around since Python 3.7 I think. I suspect this may also be desirable because the field order should be fixed across invocations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about wrapping with single points, thanks! Re dumps: what we want is to not have model_name or model_version when those are not set by the ctx vars, and in the dictionary we can't have only key without value. E.g:
_STRUCTURED_FORMAT = {
"time": "%(asctime)s",
"name": "%(name)s",
"level": "%(levelname)s",
"message": "%(message)s",
"model": ?
}
VS
'{"time": "%(asctime)s", "name": "%(name)s", "level": "%(levelname)s", '
'"message": "%(message)s" %(model)s}' // model here would contain the key-v vair with commas for proper json formatting
One thing I can do is keep the dict version, and then dump it and attach the model before the closing brace in the super().init(...). But this becomes a bit messy. I think the single quotes-option is cleanest. I am open to other suggestions, of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we are trying to avoid is modifying the "private" _format on the fly in the format()
method, as it's not thread-safe from what I read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For greater control over validation on provided fields, a data class could be used, and those have been around since Python 3.7 I think. I suspect this may also be desirable because the field order should be fixed across invocations.
Yeah, I was thinking about wrapping those in a class but for now it seems that we won't benefit as much from it. Possibly if we introduce further model context fields or require validation. (I think the model name and version validation should be contained in the ModelSetting class where they are initialised).
Not sure I understand the point about field order across invocations. Could you explain 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the point about field order across invocations. Could you explain
Python implemented reproducible dictionary iteration a while back, I'm pretty sure, but what happens if a dict grows? Expanding it means changing the hashing function, which means the iteration order changes. So you might see time, name, message, level
one time then message, model, time, level, name
another - hard to read across lines!
Using a dataclass might prevent that by having a set dict size to use and/or allowing custom definition of the time order, e.g. time first, etc.
One thing we are trying to avoid is modifying the "private" _format on the fly in the format() method, as it's not thread-safe from what I read
That would make sense, although coroutines for async calls are running on a single thread, so it'd need a bit more thought as to whether there are any suspending points that might cause such issues.
💭 How about instead just fully defining the format, rather than doing string joins on the fly? Slightly duplicative, but also really explicit and obvious as to what's expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right about growing the dictionary and not preserving the order. However, I would say that if the structured logging is enabled, it most likely means the logs will be fed to a pipeline for processing and then to a log management system for further inspection. And also if we keep the current structured log format which is a string, the stdout will contain strings which resemble valid json and the order of the fields will not be changed across invocations. A side note from that, in Python specifically, if we have a pipeline that loads those logs via json.loads(), then the order of fields will not be guaranteed by default, but an additional param for OrderedDict can be passed to handle this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, could you give an example of fully defining the format without string joining 🤔
|
||
logger.addHandler(stream_handler) | ||
handler.setFormatter(ModelLoggerFormatter(settings)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the CLI is used to run mlserver start .
, first the CLI will configure a logger and then start the MLServer(settings) which will then again configure a logger but with settings. This is all within the same thread. In the previous version, the handler was initialised with the wrong formatter first and then when the MLServer class was configuring the logger, it wasn't setting a formatter because there was already a handler for the logger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good from my side @vtaskow ! 🚀
I've added a minor nitpick but, as a nitpick, you can totally ignore it.
Background
Structured logging is a process for generating log records in a format that is easily-read by a machine. The aims to provide analytics and insights from logs.
In the rest of our setup, components' logs are sent to
fluentd
and then toElastic Search
for processing.Why
Currently, MLServer does not provide an option for producing structured logging. Enabling structured logging on initialisation will be beneficial in case users would like to create log insights.
How
There are several python libraries that support structured logging(structlog, python-json-logger), however, our case is very simple. Adding an additional dependency would be an overcomplication and the actual code written is not very much different from the proposed solution.
Formatting
In cases when model name is missing, a key-value pairs for both model name and version should not be present in the log.
In cases when model version is missing, a key-value pair for it should not be present in the log.
When
mlserver
is started, one of the two log formats is chosen based on the newly added flag in the settings.