Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep "lifecycle log" in models #2863

Closed
piskvorky opened this issue Jun 20, 2020 · 5 comments · Fixed by #3060
Closed

Keep "lifecycle log" in models #2863

piskvorky opened this issue Jun 20, 2020 · 5 comments · Fixed by #3060
Assignees
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature
Milestone

Comments

@piskvorky
Copy link
Owner

piskvorky commented Jun 20, 2020

Why

When Gensim users report issues with models, we have trouble deciding what they actually did, what version they used, etc. They're often confused themselves.

This leads to protracted communication => increased support effort => slower issue resolution and more annoyance all around. Not good.

What

Keep an internal "lifecycle log" inside each model. This could be as simple as an internal list-of-strings attribute, e.g. word2vec.lifecycle = ["init() created on 2020-06-20 15:49:03 UTC with Gensim 3.8.3 on Python 3.6", "train() called on 2020-06-20 15:49:04 UTC", "load() called on 2020-06-29 8:7:58 UTC", …]. We would serialize this log on save(), and ask users to provide its value when investigating an issue.

We probably want to keep this simple and human-readable, both in the API and log content. Not to over-engineer to start with – not attempting a full model recreation from the log or anything, I can see how this could turn into a rabbit hole.

Originally conceived by @gojomo in #2698 (comment).

@piskvorky piskvorky added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Jun 20, 2020
@gojomo
Copy link
Collaborator

gojomo commented Jun 21, 2020

As I see it, the main events to store with respect to the *2Vec models are:

  • initial model creation (time, gensim/python/OS version, key parameters)
  • any vocabulary-expansion (initial or later: time, size of vocab, & vocab-related params before/after)
  • every call to train (amount of data processed, epochs, starting/ending alpha; key parameters in case mutated since last event)
  • any save (time, key parameters in case mutated since last event)
  • any load (time & gensim/python/OS versions, in case into very new version/environment)

The most typical model uses would just have about 3-5 such events. Some advanced uses might involve being loaded/expanded/trained/saved multiple times, but still most likely just dozens of events – tiny compared to the overall model.

So I'd see it as a plain Python list of events, where each event is just a dict. It might be capped as only the last N events – but a cap is probably not needed. It'd automatically be part of the model serialization. A utility method might dump the events as a series of JSON lines, or something prettier - and a quick glance at that output would both answer the most common questions about a problem situations, and quickly highlight to what extent previous/current gensim versions were involved.

@gojomo
Copy link
Collaborator

gojomo commented Sep 29, 2020

This might be as simple as a single method on SaveLoad (or similar mix-in) roughly like:

def log_event(**kwargs):
    event_dict = kwargs
    event_dict['datetime'] = datetime.now().isoformat()
    event_dict['gensim'] = gensim.__version__
    event_dict['python'] = sys.version
    event_dict['platform'] = platform.platform()
    assert 'event' in event_dict : "event name must be specified"
    if not hasattr(self, 'event_log') or self.event_log is None:
        self.event_log = []
    self.event_log.append(event_dict)

It might also check that all top-level keys are from some declared/documented list in source code. It'd be easy & encouraged to expand this list, with a short note on the key's interpretation, but error or warn when an unknown top-level key is used (unless we add a convention for private/x-/etc ad hoc keys).

Even long-lived models would only accumulate a few KB of data, and reviewing this log (either in user reports or our own older model) would quickly reveal many sources of oddness or error.

@piskvorky piskvorky self-assigned this Oct 16, 2020
@piskvorky piskvorky added this to the 4.0.0 milestone Oct 16, 2020
piskvorky added a commit that referenced this issue Mar 4, 2021
mpenkov added a commit that referenced this issue Mar 7, 2021
* fix docs

* re #2863: record lifecycle events

* log events even if lifecycle attribute is turned off

* fix overlong lines + use f-strings

* bump up internal version + improve docs

* record more info for load + save

* lifecycle events for KeyedVectors

* add lifecycle events to remaining models

* ask for lifecycle log in ISSUE_TEMPLATE

* improve logging

* Update word2vec.py

remove unused import

* Update CHANGELOG.md

Co-authored-by: Michael Penkov <m@penkov.dev>
@gojomo
Copy link
Collaborator

gojomo commented Mar 8, 2021

I realize my review is late - but I think the addition/use of the msg parameter is somewhat counter to the original spirit of putting all interesting values into the dict, named the same as their variable names. (No downstream tools should ever have to adapt themselves to the vagaries of any one developer's improvised message text to parse-out interesting values.)

If there must be a msg parameter – and I'd lean no, because the event-name/source-object-type should be plenty of contextual description for the audience – it should have to populate itself with values that are also necessarily in the event dict. (An f-string that is always passed the event dict might serve this role & thus enforce this practice.) And, as matter of code style, does 'message' rate abbreviation as 'msg'?

Separately, a short utility accessor that pretty-prints the (most-recent N lines of) log as JSON could be helpful to complete the value & demo-ability of the feature.

@piskvorky
Copy link
Owner Author

piskvorky commented Mar 8, 2021

My "original spirit" of the logs was to store strings only. One event = one string.

I suppose this highlights a different view to the intended / envisioned purpose of the log:

  • I see it as a debugging instrument for humans => prefer denormalized, human-readable strings. Basically a persistent logging-like log line.
  • You see it as an instrument for further (machine) processing => prefer more fine-grained structures.

My implementation does support arbitrary dicts (your proposal), for flexibility. But I guess I filled the dicts primarily with the stuff that I'd like to see in user reports (strings).

If there must be a msg parameter

It doesn't. It's just a choice I made, in some places (not all). Or I don't understand what you mean by this paragraph.

@gojomo
Copy link
Collaborator

gojomo commented Mar 13, 2021

I wasn't thinking particularly for machine processing, just clarity to expert users. Free-form narrative log lines can vary a lot in their terminology/grammar/choice-of-what-to-emphasize/units/etc. Historically, plenty of Gensim logging text has been of uneven style/quality - but that of course is not unique to Gensim, and I'm sure I've contributed some vague/under-proof-read output, as well.

Reusing prominent exact in-code variable names, treating the whole property dict as a self-describing 'log event', should be more robust against uneven style/language. My hope would be for any major model lifecycle event, from a small controlled-vocabulary of possible events, the event name is all the context necessary. (Anything extra is just more maintenance effort, or risk that the unstructured text drifts from code functionality.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants