Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB mode issues fixes #184

Merged
merged 3 commits into from Oct 18, 2022
Merged

DB mode issues fixes #184

merged 3 commits into from Oct 18, 2022

Conversation

jpwchang
Copy link
Collaborator

Description

Addresses the following issues that were discovered during testing of DB mode with certain example notebooks:

  • When loading from JSON to DB, missing fields in the JSON data will be automatically set to None (matching the behavior of mem mode) rather than causing an error
  • Fixes an issue where get_utterance_ids was not being correctly called in filter_utterances_by
  • Updates DBStorageManager to avoid the use of the no-longer-supported collection.map_reduce, replacing it with a more canonical solution for listing all IDs in the collection
  • Implements branching logic for Corpus.load_info depending on the storage type. In mem mode the behavior is as before, whereas in DB mode it will instead use iterated batch insertion to avoid needing to load the entire info file (which can be highly memory intensive) and to improve execution speed
  • Replaces the . separator in PromptTypes metadata names with the __ separator, to account for the fact that . has special meaning in MongoDB
  • In DB mode, adds checks to ensure that the DB collection prefix (either manually given or inferred from filename) is compatible with MongoDB collection naming rules
    • Per MongoDB documentation, "collection names should start with an underscore or letter character", which our random ID method was technically not in compliance with since UUIDs can start with digits. Therefore, the random ID method has been updated to always prefix the UUID with an underscore to guarantee that the resulting ID is in compliance with MongoDB's guidance for collection names

Motivation and Context

The above issues were preventing several example notebooks from running to completion, either due to outright crashes or high memory usage. This patch should help resolve that.

How has this been tested?

Ensured that the erroring parts of the notebooks in question now run without issue, and also monitored memory usage to ensure that there is the expected reduction

@calebchiam
Copy link
Collaborator

Regarding the . issue, what happens if we load from a Corpus that already has . in its metadata names? Is that a safe operation?

Seems like we might need a function to convert names / keys into a mongo-safe form.

@jpwchang
Copy link
Collaborator Author

Regarding the . issue, what happens if we load from a Corpus that already has . in its metadata names? Is that a safe operation?

Seems like we might need a function to convert names / keys into a mongo-safe form.

Not a safe operation right now, no. Basically what happens is that in mongodb . is interpreted as nesting (like javascript package namespaces), so a metadata key with name <part1>.<part2>, when inserted into the DB, will end up becoming a nested DB entry with structure {part1: {part2: ... }}.

@jpwchang
Copy link
Collaborator Author

Per discussion with @cristiandnm on slack, merging these changes now for the sake of immediate issue resolution, and will pursue a general solution to the . problem in a separate PR.

@jpwchang jpwchang merged commit fcb405c into master Oct 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants