Keep "lifecycle log" in models #2863

piskvorky · 2020-06-20T18:39:04Z

Why

When Gensim users report issues with models, we have trouble deciding what they actually did, what version they used, etc. They're often confused themselves.

This leads to protracted communication => increased support effort => slower issue resolution and more annoyance all around. Not good.

What

Keep an internal "lifecycle log" inside each model. This could be as simple as an internal list-of-strings attribute, e.g. word2vec.lifecycle = ["init() created on 2020-06-20 15:49:03 UTC with Gensim 3.8.3 on Python 3.6", "train() called on 2020-06-20 15:49:04 UTC", "load() called on 2020-06-29 8:7:58 UTC", …]. We would serialize this log on save(), and ask users to provide its value when investigating an issue.

We probably want to keep this simple and human-readable, both in the API and log content. Not to over-engineer to start with – not attempting a full model recreation from the log or anything, I can see how this could turn into a rabbit hole.

Originally conceived by @gojomo in #2698 (comment).

The text was updated successfully, but these errors were encountered:

gojomo · 2020-06-21T04:06:28Z

As I see it, the main events to store with respect to the *2Vec models are:

initial model creation (time, gensim/python/OS version, key parameters)
any vocabulary-expansion (initial or later: time, size of vocab, & vocab-related params before/after)
every call to train (amount of data processed, epochs, starting/ending alpha; key parameters in case mutated since last event)
any save (time, key parameters in case mutated since last event)
any load (time & gensim/python/OS versions, in case into very new version/environment)

The most typical model uses would just have about 3-5 such events. Some advanced uses might involve being loaded/expanded/trained/saved multiple times, but still most likely just dozens of events – tiny compared to the overall model.

So I'd see it as a plain Python list of events, where each event is just a dict. It might be capped as only the last N events – but a cap is probably not needed. It'd automatically be part of the model serialization. A utility method might dump the events as a series of JSON lines, or something prettier - and a quick glance at that output would both answer the most common questions about a problem situations, and quickly highlight to what extent previous/current gensim versions were involved.

gojomo · 2020-09-29T19:46:48Z

This might be as simple as a single method on SaveLoad (or similar mix-in) roughly like:

def log_event(**kwargs):
    event_dict = kwargs
    event_dict['datetime'] = datetime.now().isoformat()
    event_dict['gensim'] = gensim.__version__
    event_dict['python'] = sys.version
    event_dict['platform'] = platform.platform()
    assert 'event' in event_dict : "event name must be specified"
    if not hasattr(self, 'event_log') or self.event_log is None:
        self.event_log = []
    self.event_log.append(event_dict)

It might also check that all top-level keys are from some declared/documented list in source code. It'd be easy & encouraged to expand this list, with a short note on the key's interpretation, but error or warn when an unknown top-level key is used (unless we add a convention for private/x-/etc ad hoc keys).

Even long-lived models would only accumulate a few KB of data, and reviewing this log (either in user reports or our own older model) would quickly reveal many sources of oddness or error.

* fix docs * re #2863: record lifecycle events * log events even if lifecycle attribute is turned off * fix overlong lines + use f-strings * bump up internal version + improve docs * record more info for load + save * lifecycle events for KeyedVectors * add lifecycle events to remaining models * ask for lifecycle log in ISSUE_TEMPLATE * improve logging * Update word2vec.py remove unused import * Update CHANGELOG.md Co-authored-by: Michael Penkov <m@penkov.dev>

gojomo · 2021-03-08T20:11:12Z

I realize my review is late - but I think the addition/use of the msg parameter is somewhat counter to the original spirit of putting all interesting values into the dict, named the same as their variable names. (No downstream tools should ever have to adapt themselves to the vagaries of any one developer's improvised message text to parse-out interesting values.)

If there must be a msg parameter – and I'd lean no, because the event-name/source-object-type should be plenty of contextual description for the audience – it should have to populate itself with values that are also necessarily in the event dict. (An f-string that is always passed the event dict might serve this role & thus enforce this practice.) And, as matter of code style, does 'message' rate abbreviation as 'msg'?

Separately, a short utility accessor that pretty-prints the (most-recent N lines of) log as JSON could be helpful to complete the value & demo-ability of the feature.

piskvorky · 2021-03-08T21:39:36Z

My "original spirit" of the logs was to store strings only. One event = one string.

I suppose this highlights a different view to the intended / envisioned purpose of the log:

I see it as a debugging instrument for humans => prefer denormalized, human-readable strings. Basically a persistent logging-like log line.
You see it as an instrument for further (machine) processing => prefer more fine-grained structures.

My implementation does support arbitrary dicts (your proposal), for flexibility. But I guess I filled the dicts primarily with the stuff that I'd like to see in user reports (strings).

If there must be a msg parameter

It doesn't. It's just a choice I made, in some places (not all). Or I don't understand what you mean by this paragraph.

gojomo · 2021-03-13T00:14:13Z

I wasn't thinking particularly for machine processing, just clarity to expert users. Free-form narrative log lines can vary a lot in their terminology/grammar/choice-of-what-to-emphasize/units/etc. Historically, plenty of Gensim logging text has been of uneven style/quality - but that of course is not unique to Gensim, and I'm sure I've contributed some vague/under-proof-read output, as well.

Reusing prominent exact in-code variable names, treating the whole property dict as a self-describing 'log event', should be more robust against uneven style/language. My hope would be for any major model lifecycle event, from a small controlled-vocabulary of possible events, the event name is all the context necessary. (Anything extra is just more maintenance effort, or risk that the unstructured text drifts from code functionality.)

piskvorky added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Jun 20, 2020

piskvorky mentioned this issue Jun 20, 2020

KeyedVectors & *2Vec API streamlining, consistency #2698

Merged

piskvorky mentioned this issue Oct 4, 2020

Adopting a (narrow) backward-compatibility standard; implications for 4.0.0 #2967

Open

gojomo mentioned this issue Oct 5, 2020

Support streaming models split into multiple files from S3 / GCS #1851

Closed

piskvorky self-assigned this Oct 16, 2020

piskvorky added this to the 4.0.0 milestone Oct 16, 2020

piskvorky added a commit that referenced this issue Mar 4, 2021

re #2863: record lifecycle events

05052f2

piskvorky mentioned this issue Mar 4, 2021

Record lifecycle events in Gensim models #3060

Merged

mpenkov closed this as completed in #3060 Mar 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keep "lifecycle log" in models #2863

Keep "lifecycle log" in models #2863

piskvorky commented Jun 20, 2020 •

edited

gojomo commented Jun 21, 2020 •

edited

gojomo commented Sep 29, 2020

gojomo commented Mar 8, 2021 •

edited

piskvorky commented Mar 8, 2021 •

edited

gojomo commented Mar 13, 2021

Keep "lifecycle log" in models #2863

Keep "lifecycle log" in models #2863

Comments

piskvorky commented Jun 20, 2020 • edited

Why

What

gojomo commented Jun 21, 2020 • edited

gojomo commented Sep 29, 2020

gojomo commented Mar 8, 2021 • edited

piskvorky commented Mar 8, 2021 • edited

gojomo commented Mar 13, 2021

piskvorky commented Jun 20, 2020 •

edited

gojomo commented Jun 21, 2020 •

edited

gojomo commented Mar 8, 2021 •

edited

piskvorky commented Mar 8, 2021 •

edited