
Create separate doc for new chunk #335

Merged · 4 commits merged into master on Oct 21, 2020
Conversation

@JoranAngevaare (Member) commented Oct 21, 2020:

What is the problem / what does the code in this PR do
Some documents grew past the 16 MB document-size limit imposed by MongoDB because many chunks were written into the same document. This PR fixes that by splitting the chunk data and the metadata into separate documents.

Can you give a minimal working example (or illustrate with a figure)?
https://github.com/XENONnT/analysiscode/blob/master/StraxTests/add_mongo_saver_finetuning.ipynb

Doc format
See the gist below for an example. Here we uploaded three chunks, each with a data field of length 5 (to keep it simple). This means we have four documents in total:
https://gist.github.com/jorana/c4fccefa1a152b403bace944a8993699
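As a rough sketch of the split layout described above (field names are illustrative, not taken verbatim from the gist): one metadata document plus one document per chunk.

```python
# Hypothetical sketch of the split layout: one metadata document
# plus one document per chunk (field names are illustrative).
metadata_doc = {
    "number": 8000,              # run number (assumed field name)
    "data_type": "peaks",        # placeholder data type
    "lineage_hash": "abc123",    # placeholder lineage hash
    "metadata": {"chunks": 3},   # summary info for the run
}
chunk_docs = [
    {
        "number": 8000,
        "data_type": "peaks",
        "lineage_hash": "abc123",
        "chunk_i": i,              # integer chunk index
        "data": [0, 1, 2, 3, 4],   # length-5 data field, as in the gist
    }
    for i in range(3)
]
# Three chunk documents plus one metadata document: four in total.
assert len(chunk_docs) + 1 == 4
```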

@darrylmasson (Contributor) left a comment:

A minor change with significant quality-of-life implications.

# Start a new document, update it with the proper information
doc = self.basic_md.copy()
doc['write_time'] = datetime.now(py_utc)
doc[chunk_name] = {"data": aggregate_data}
darrylmasson (Contributor):

This line here is going to make life difficult for the analyst. You'll end up with chunks looking like this:

{chunk_0: {data: ...}},
{chunk_1: {data: ...}},
...

You can't easily loop over this (the field you want is different for each document), projections become more difficult (strictly inclusive, not exclusive), indexing on chunk number is impossible, and probably some other things. It would make life much simpler to have a plain data field plus a separate chunk_i field, rather than chunk_%i.data.
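To illustrate this point with plain Python (stand-in documents, not the real collection): with a per-chunk field name, the key to read differs per document, while a fixed field supports one loop and one projection for all documents.

```python
# Stand-in documents illustrating the two layouts under discussion.
nested = [{"chunk_0": {"data": [1]}}, {"chunk_1": {"data": [2]}}]
flat = [{"chunk_i": 0, "data": [1]}, {"chunk_i": 1, "data": [2]}]

# Nested layout: the field to read is different for each document.
nested_data = [doc[f"chunk_{i}"]["data"] for i, doc in enumerate(nested)]

# Flat layout: one loop (and, in Mongo, one projection {"data": 1})
# works for every document.
flat_data = [doc["data"] for doc in flat]

assert nested_data == flat_data == [[1], [2]]
```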

JoranAngevaare (Member, author):

You are right. I figured it did not make much of a difference since people will use st.get_array, but that is a lousy excuse for making confusing documents.

if self.run_start is not None:
    update = {'run_start_time': self.run_start}
    for chunk_id in self.ids_chunk.keys():
        self.col.update_one({'_id': chunk_id}, {'$set': update})
darrylmasson (Contributor):

Probably faster to do one call to update_many({'_id': {'$in': list(self.ids_chunk.keys())}}, ...) than many calls to update_one. Also probably possible to just specify {'number': run_number}.
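For illustration, a sketch of the suggested batching; the pymongo call is shown as a comment and only the filter construction is exercised here (ids and values are placeholders).

```python
# Stand-ins for the ObjectIds held in self.ids_chunk.
chunk_ids = ["id0", "id1", "id2"]
update = {"$set": {"run_start_time": 0}}   # illustrative update document

# One filter matching all chunk documents at once.
batch_filter = {"_id": {"$in": chunk_ids}}

# One round trip instead of len(chunk_ids) calls to update_one:
#     self.col.update_many(batch_filter, update)

assert batch_filter["_id"]["$in"] == chunk_ids
```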

JoranAngevaare (Member, author):

Thanks, good catch

chunk_id = self.ids_chunk.get(chunk_name, None)
if chunk_id is not None:
    self.col.update_one({'_id': chunk_id},
                        {'$addToSet': {f'chunk_name.data': aggregate_data}})
darrylmasson (Contributor):

Should this be f'{chunk_name}.data'? As written, the braces are never interpolated. Anyway, see the above comment about chunk_name as a field.
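A minimal demonstration of the f-string pitfall mentioned here: the f prefix must sit outside the quotes, otherwise the string is taken literally.

```python
# With the f prefix outside the quotes, the braces are interpolated;
# with it inside (or missing), the literal text is used as the field name.
chunk_name = "chunk_2"
assert f'{chunk_name}.data' == 'chunk_2.data'        # interpolated
assert 'f{chunk_name}.data' == 'f{chunk_name}.data'  # plain string, no interpolation
```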

@@ -39,14 +39,15 @@ def _read_chunk(self, backend_key, chunk_info, dtype, compressor):

 # Query for the chunk and project the chunk info
 doc = self.db[self.col_name].find_one(
-    {**query, chunk_name: {"$exists": True}},
-    {f"{chunk_name}": 1})
+    {**query, "chunk_i": chunk_name},
darrylmasson (Contributor):

Is there a compelling reason why we carry chunk_name rather than its number? Integers index much more cleanly than chunk_%i, and I'm not aware of cases when chunks get non-integer identifiers.
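A small illustration of this indexing point: string chunk names sort lexicographically, which breaks ordering past chunk_9, while integer indices sort (and range-query) cleanly.

```python
# Lexicographic ordering of chunk_%i names breaks once indices pass 9,
# while plain integer indices keep their natural order.
names = [f"chunk_{i}" for i in (0, 2, 10)]
assert sorted(names) == ["chunk_0", "chunk_10", "chunk_2"]  # surprising order
assert sorted([0, 2, 10]) == [0, 2, 10]                     # expected order
```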

JoranAngevaare (Member, author):

Changed it; it's also more consistent.

for chunk_id in self.ids_chunk.keys():
    self.col.update_one({'_id': chunk_id}, {'$set': update})
query = {k: v for k, v in self.basic_md.items()
         if k in ('run_id', 'data_type', 'lineage_hash')}
darrylmasson (Contributor):

Ok, these look like the fields to index. If {**query} above looks like this, we can make a single compound index.

darrylmasson (Contributor):

Also, we don't specify run_id; it looks like we use number instead.

JoranAngevaare (Member, author):

The query fields are already indexed, but not as a compound index. I remember that compound indexing and TTLs don't go well together, but these indexes are not part of the TTL indices.

darrylmasson (Contributor):

Mileage varies, obviously, but my usual strategy is to index for the specific queries I know are being made. If we always query all three fields together, a compound index probably provides a bit of a performance bonus. If we sometimes query these fields individually, separate indexes might be better, or might not.
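A sketch of the compound-index key specification under discussion (field names assumed from the query above, using number per the earlier comment; 1 means ascending, as in pymongo's ASCENDING):

```python
# Hypothetical compound index covering the three fields queried together.
# The actual pymongo call would be roughly:
#     self.col.create_index(compound_keys)
compound_keys = [('number', 1), ('data_type', 1), ('lineage_hash', 1)]

# A compound index also serves queries on a prefix of its keys,
# e.g. on 'number' alone or on ('number', 'data_type').
assert [k for k, _ in compound_keys] == ['number', 'data_type', 'lineage_hash']
```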

@JoranAngevaare JoranAngevaare merged commit 114c284 into master Oct 21, 2020
@JoranAngevaare JoranAngevaare deleted the update_mongo_storage branch December 3, 2020 13:42