DB mode issues fixes #184

jpwchang · 2022-10-17T18:58:38Z

Description

Addresses the following issues that were discovered during testing of DB mode with certain example notebooks:

When loading from JSON to DB, missing fields in the JSON data will be automatically set to None (matching the behavior of mem mode) rather than causing an error
Fixes an issue where get_utterance_ids was not being correctly called in filter_utterances_by
Updates DBStorageManager to avoid the use of the no-longer-supported collection.map_reduce, replacing it with a more canonical solution for listing all IDs in the collection
Implements branching logic for Corpus.load_info depending on the storage type. In mem mode the behavior is as before, whereas in DB mode it will instead use iterated batch insertion to avoid needing to load the entire info file (which can be highly memory intensive) and to improve execution speed
Replaces the . separator in PromptTypes metadata names with the __ separator, to account for the fact that . has special meaning in MongoDB
In DB mode, adds checks to ensure that the DB collection prefix (either manually given or inferred from filename) is compatible with MongoDB collection naming rules
- Per MongoDB documentation, "collection names should start with an underscore or letter character", which our random ID method was technically not in compliance with since UUIDs can start with digits. Therefore, the random ID method has been updated to always prefix the UUID with an underscore to guarantee that the resulting ID is in compliance with MongoDB's guidance for collection names

Motivation and Context

The above issues were preventing several example notebooks from running to completion, either due to outright crashes or high memory usage. This patch should help resolve that.

How has this been tested?

Ensured that the erroring parts of the notebooks in question now run without issue, and also monitored memory usage to ensure that there is the expected reduction

convokit/model/corpus.py

calebchiam · 2022-10-18T01:04:20Z

Regarding the . issue, what happens if we load from a Corpus that already has . in its metadata names? Is that a safe operation?

Seems like we might need a function to convert names / keys into a mongo-safe form.

jpwchang · 2022-10-18T13:46:52Z

Regarding the . issue, what happens if we load from a Corpus that already has . in its metadata names? Is that a safe operation?

Seems like we might need a function to convert names / keys into a mongo-safe form.

Not a safe operation right now, no. Basically what happens is that in mongodb . is interpreted as nesting (like javascript package namespaces), so a metadata key with name <part1>.<part2>, when inserted into the DB, will end up becoming a nested DB entry with structure {part1: {part2: ... }}.

jpwchang · 2022-10-18T14:04:01Z

Per discussion with @cristiandnm on slack, merging these changes now for the sake of immediate issue resolution, and will pursue a general solution to the . problem in a separate PR.

jpwchang added 2 commits October 17, 2022 14:38

Implement fixes for issues found in testing

f92b8c9

fix call to get_corpus_id

fcddd91

jpwchang requested a review from calebchiam October 17, 2022 18:58

calebchiam reviewed Oct 18, 2022

View reviewed changes

convokit/model/corpus.py Outdated Show resolved Hide resolved

get_corpus_id takes storage_type directly

8d8eca5

jpwchang merged commit fcb405c into master Oct 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DB mode issues fixes #184

DB mode issues fixes #184

jpwchang commented Oct 17, 2022

calebchiam commented Oct 18, 2022

jpwchang commented Oct 18, 2022

jpwchang commented Oct 18, 2022

DB mode issues fixes #184

DB mode issues fixes #184

Conversation

jpwchang commented Oct 17, 2022

Description

Motivation and Context

How has this been tested?

calebchiam commented Oct 18, 2022

jpwchang commented Oct 18, 2022

jpwchang commented Oct 18, 2022