New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create seperate doc for new chunk #335
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor change with significant quality-of-life implications.
strax/storage/mongo.py
Outdated
# Start a new document, update it with the proper information | ||
doc = self.basic_md.copy() | ||
doc['write_time'] = datetime.now(py_utc) | ||
doc[chunk_name] = {"data": aggregate_data} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line here is going to make life difficult for the analyst. You'll end up with chunks looking like this:
{chunk_0: {data: ...}},
{chunk_1: {data: ...}},
...
You can't easily loop over this (the field you want is different for each document), projections become more difficult (strictly inclusive, not exclusive), indexing chunk number is impossible, and probably some other things. It would make life much simpler if we just make a data
field, rather than chunk_%i.data
, and have a separate chunk_i
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I considered it did not make much of a difference as people will use st.get_array but that is a lousy excuse for making confusing documents
strax/storage/mongo.py
Outdated
if self.run_start is not None: | ||
update = {'run_start_time': self.run_start} | ||
for chunk_id in self.ids_chunk.keys(): | ||
self.col.update_one({'_id': chunk_id}, {'$set': update}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably faster to do one call to update_many({'_id': {'$in': list(self.ids_chunk.keys())}}, ...)
than many calls to update_one
. Also probably possible to just specify {'number': run_number}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good catch
strax/storage/mongo.py
Outdated
chunk_id = self.ids_chunk.get(chunk_name, None) | ||
if chunk_id is not None: | ||
self.col.update_one({'_id': chunk_id}, | ||
{'$addToSet': {f'chunk_name.data': aggregate_data}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'f{chunk_name}.data'
? Anyway, see above comment about chunk_name
as a field.
strax/storage/mongo.py
Outdated
@@ -39,14 +39,15 @@ def _read_chunk(self, backend_key, chunk_info, dtype, compressor): | |||
|
|||
# Query for the chunk and project the chunk info | |||
doc = self.db[self.col_name].find_one( | |||
{**query, chunk_name: {"$exists": True}}, | |||
{f"{chunk_name}": 1}) | |||
{**query, "chunk_i": chunk_name}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a compelling reason why we carry chunk_name
rather than its number? Integers index much more cleanly than chunk_%i
, and I'm not aware of cases when chunks get non-integer identifiers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it, it's also more consistent
strax/storage/mongo.py
Outdated
for chunk_id in self.ids_chunk.keys(): | ||
self.col.update_one({'_id': chunk_id}, {'$set': update}) | ||
query = {k: v for k, v in self.basic_md.items() | ||
if k in ('run_id', 'data_type', 'lineage_hash')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, these looks like the fields to index. If {**query}
above looks like this we can make a single compound index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we don't specify run_id
, it looks like we use number
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query has already been indexed but not in a compound manner. I remember that compound indexing and TTLs don't go well together but these indexes are not part of the TTL indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mileage varies, obviously, but my usual strategy is to index for the specific queries I know are being made. If we always query against these three fields at the same time, a compound index probably provides a bit of a performance bonus. If we sometimes make queries against these fields individually, then a separate index might be better, or it might not.
What is the problem / what does the code in this PR do
We experienced that some documents got bigger than the 16 MB limit imposed by mongo due to many chunks being written to the same document. This is solved in this PR by splitting out the chunks-data and the meta-data into separate documents.
Can you give a minimal working example (or illustrate with a figure)?
https://github.com/XENONnT/analysiscode/blob/master/StraxTests/add_mongo_saver_finetuning.ipynb
Doc format
See the gist below for an example. Here we uploaded three chunks with a length == 5 datafield (to keep it simple). This means we have four documents in total:
https://gist.github.com/jorana/c4fccefa1a152b403bace944a8993699