
Create separate doc for new chunk #335

Merged · 4 commits merged into master on Oct 21, 2020
Conversation

@JoranAngevaare (Member) commented Oct 21, 2020:

What is the problem / what does the code in this PR do
Some documents grew past the 16 MB document-size limit imposed by MongoDB because many chunks were written into the same document. This PR fixes that by splitting the chunk data and the metadata into separate documents.

Can you give a minimal working example (or illustrate with a figure)?
https://github.com/XENONnT/analysiscode/blob/master/StraxTests/add_mongo_saver_finetuning.ipynb

Doc format
See the gist below for an example. Here we uploaded three chunks, each with a data field of length 5 (to keep it simple). This means we have four documents in total:
https://gist.github.com/jorana/c4fccefa1a152b403bace944a8993699
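As a rough sketch of the split layout described above (field names are illustrative, not taken verbatim from the gist): one metadata document plus one document per chunk.

```python
# Hypothetical sketch of the split layout: one metadata document
# plus one document per chunk (field names are illustrative).
metadata_doc = {
    "number": 8000,              # run number (assumed field name)
    "data_type": "peaks",        # placeholder data type
    "lineage_hash": "abc123",    # placeholder lineage hash
    "metadata": {"chunks": 3},   # summary info for the run
}
chunk_docs = [
    {
        "number": 8000,
        "data_type": "peaks",
        "lineage_hash": "abc123",
        "chunk_i": i,              # integer chunk index
        "data": [0, 1, 2, 3, 4],   # length-5 data field, as in the gist
    }
    for i in range(3)
]
# Three chunk documents plus one metadata document: four in total.
assert len(chunk_docs) + 1 == 4
```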

@darrylmasson (Contributor) left a comment:

A minor change with significant quality-of-life implications.

# Start a new document, update it with the proper information
doc = self.basic_md.copy()
doc['write_time'] = datetime.now(py_utc)
doc[chunk_name] = {"data": aggregate_data}
darrylmasson (Contributor):

This line here is going to make life difficult for the analyst. You'll end up with chunks looking like this:

{chunk_0: {data: ...}},
{chunk_1: {data: ...}},
...

You can't easily loop over this (the field you want is different for each document), projections become more difficult (strictly inclusive, not exclusive), indexing on chunk number is impossible, and probably some other things. It would make life much simpler to have a plain data field plus a separate chunk_i field, rather than chunk_%i.data.
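To illustrate this point with plain Python (stand-in documents, not the real collection): with a per-chunk field name, the key to read differs per document, while a fixed field supports one loop and one projection for all documents.

```python
# Stand-in documents illustrating the two layouts under discussion.
nested = [{"chunk_0": {"data": [1]}}, {"chunk_1": {"data": [2]}}]
flat = [{"chunk_i": 0, "data": [1]}, {"chunk_i": 1, "data": [2]}]

# Nested layout: the field to read is different for each document.
nested_data = [doc[f"chunk_{i}"]["data"] for i, doc in enumerate(nested)]

# Flat layout: one loop (and, in Mongo, one projection {"data": 1})
# works for every document.
flat_data = [doc["data"] for doc in flat]

assert nested_data == flat_data == [[1], [2]]
```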

JoranAngevaare (Member, author):

You are right. I figured it did not make much of a difference since people will use st.get_array, but that is a lousy excuse for making confusing documents.

if self.run_start is not None:
    update = {'run_start_time': self.run_start}
    for chunk_id in self.ids_chunk.keys():
        self.col.update_one({'_id': chunk_id}, {'$set': update})
darrylmasson (Contributor):

Probably faster to do one call to update_many({'_id': {'$in': list(self.ids_chunk.keys())}}, ...) than many calls to update_one. Also probably possible to just specify {'number': run_number}.
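For illustration, a sketch of the suggested batching; the pymongo call is shown as a comment and only the filter construction is exercised here (ids and values are placeholders).

```python
# Stand-ins for the ObjectIds held in self.ids_chunk.
chunk_ids = ["id0", "id1", "id2"]
update = {"$set": {"run_start_time": 0}}   # illustrative update document

# One filter matching all chunk documents at once.
batch_filter = {"_id": {"$in": chunk_ids}}

# One round trip instead of len(chunk_ids) calls to update_one:
#     self.col.update_many(batch_filter, update)

assert batch_filter["_id"]["$in"] == chunk_ids
```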

JoranAngevaare (Member, author):

Thanks, good catch

chunk_id = self.ids_chunk.get(chunk_name, None)
if chunk_id is not None:
    self.col.update_one({'_id': chunk_id},
                        {'$addToSet': {f'chunk_name.data': aggregate_data}})
darrylmasson (Contributor):

Should this be f'{chunk_name}.data'? As written, the braces are never interpolated. Anyway, see the above comment about chunk_name as a field.
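A minimal demonstration of the f-string pitfall mentioned here: the f prefix must sit outside the quotes, otherwise the string is taken literally.

```python
# With the f prefix outside the quotes, the braces are interpolated;
# with it inside (or missing), the literal text is used as the field name.
chunk_name = "chunk_2"
assert f'{chunk_name}.data' == 'chunk_2.data'        # interpolated
assert 'f{chunk_name}.data' == 'f{chunk_name}.data'  # plain string, no interpolation
```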

@@ -39,14 +39,15 @@ def _read_chunk(self, backend_key, chunk_info, dtype, compressor):

 # Query for the chunk and project the chunk info
 doc = self.db[self.col_name].find_one(
-    {**query, chunk_name: {"$exists": True}},
-    {f"{chunk_name}": 1})
+    {**query, "chunk_i": chunk_name},
darrylmasson (Contributor):

Is there a compelling reason why we carry chunk_name rather than its number? Integers index much more cleanly than chunk_%i, and I'm not aware of cases when chunks get non-integer identifiers.
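A small illustration of this indexing point: string chunk names sort lexicographically, which breaks ordering past chunk_9, while integer indices sort (and range-query) cleanly.

```python
# Lexicographic ordering of chunk_%i names breaks once indices pass 9,
# while plain integer indices keep their natural order.
names = [f"chunk_{i}" for i in (0, 2, 10)]
assert sorted(names) == ["chunk_0", "chunk_10", "chunk_2"]  # surprising order
assert sorted([0, 2, 10]) == [0, 2, 10]                     # expected order
```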

JoranAngevaare (Member, author):

Changed it; it's also more consistent.

for chunk_id in self.ids_chunk.keys():
    self.col.update_one({'_id': chunk_id}, {'$set': update})
query = {k: v for k, v in self.basic_md.items()
         if k in ('run_id', 'data_type', 'lineage_hash')}
darrylmasson (Contributor):

Ok, these look like the fields to index. If {**query} above looks like this, we can make a single compound index.

darrylmasson (Contributor):

Also, we don't specify run_id; it looks like we use number instead.

JoranAngevaare (Member, author):

The query fields are already indexed, but not as a compound index. I remember that compound indexing and TTLs don't go well together, but these indexes are not part of the TTL indices.

darrylmasson (Contributor):

Mileage varies, obviously, but my usual strategy is to index for the specific queries I know are being made. If we always query all three fields together, a compound index probably provides a bit of a performance bonus. If we sometimes query these fields individually, separate indexes might be better, or might not.
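A sketch of the compound-index key specification under discussion (field names assumed from the query above, using number per the earlier comment; 1 means ascending, as in pymongo's ASCENDING):

```python
# Hypothetical compound index covering the three fields queried together.
# The actual pymongo call would be roughly:
#     self.col.create_index(compound_keys)
compound_keys = [('number', 1), ('data_type', 1), ('lineage_hash', 1)]

# A compound index also serves queries on a prefix of its keys,
# e.g. on 'number' alone or on ('number', 'data_type').
assert [k for k, _ in compound_keys] == ['number', 'data_type', 'lineage_hash']
```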

@JoranAngevaare JoranAngevaare merged commit 114c284 into master Oct 21, 2020
@JoranAngevaare JoranAngevaare deleted the update_mongo_storage branch December 3, 2020 13:42