Optimize storage (serialization and de-serilization) of very large dictionaries inside MongoDB #4846

Kami · 2020-01-20T07:10:17Z

This pull request builds on top of various research I did in other PRs and issues in the past (#4837, #4838, etc).

The goal is to speed up serialization and de-serialization of very large dictionaries (e.g.large results produced by various actions) inside MongoDB.

Right now we store those dictionaries as native MongoDB dictionaries, but this means we need to escape the keys, since they can't contain $ and . which are special characters.

This escaping is inefficient and wasteful with large (and nested) dictionaries.

Proposed Implementation / Improvement

This pull request tries to optimize that by serializing those dictionaries as JSON instead.

Keep in mind that this is fine, since we always treat those values as opaque strings anyway.

In fact, a future improvement would be to have a result database object per runner, this way we could utilize a simple "string" type for the actual result, but that's a much more involved change.

Some actual numbers with synthetic dataset are available in other issues mentioned above. Actual numbers with production real-life data are to follow once we deploy this to CI/CD server.

The change is fully backward compatible, but it's behind a feature flag (opt-in) until we are really sure after running it in production for a while with diverse datasets it doesn't make some corner / edge case worse or similar.

Related optimization I talked about to @m4dcoder is getting rid of using EscapedDict DB field type in scenarios where keys can't contain special characters (e.g. potentially various workflow related models).

TODO

Update all the affected models to use this field type
Tests
Deploy it on CI/CD for a month, observe performance, memory usage, etc
Add optional migration script for affected models - future
Document change in upgrade notes and point users to the optional migration script - future
Add some docs for developers on the new DB format
Update execution list / get command to display execution run duration which includes database write time not just action run time
Document the change

serialize and unserialize large dictionary data (such as result field, etc.).

(opt-in) by default.

escaping the values adds tons of overhead.

m4dcoder · 2020-02-22T02:33:46Z

st2common/st2common/fields.py

+    def to_mongo(self, value, use_db_field=True, fields=None):
+        if not cfg.CONF.database.use_json_dict_field:
+            value = mongoescape.escape_chars(value)
+            return super(stormbase.EscapedDynamicField, self).to_mongo(


This bugs out because JSONDictEscapedFieldCompatibilityField is incompatible with EscapedDynamicField.

m4dcoder · 2020-02-22T02:35:14Z

st2common/st2common/models/api/execution.py

@@ -152,6 +152,10 @@ class ActionExecutionAPI(BaseAPI):
    @classmethod
    def from_model(cls, model, mask_secrets=False):
        doc = cls._from_model(model, mask_secrets=mask_secrets)
+
+        import json
+        doc['result'] = json.loads(doc['result'])


I have to add this to get the result in JSONDictEscapedFieldCompatibilityField with ujson/cjson backend to work properly. This is because _from_model will call to_mongo which dumps the result in JSON to string. The CLI output will be displayed as a string.

m4dcoder

This experimental patch is not ready yet. I tested it and it is buggy and also don't like that it needs to json.dumps/loads more than it needed. I commented the problem places in the code in this PR.

stale · 2020-07-01T09:18:35Z

Thanks for contributing to this issue. As it has been 90 days since the last activity, we are automatically marking is as stale. If this issue is not relevant or applicable anymore (problem has been fixed in a new version or similar), please close the issue or let us know so we can close it. On the contrary, if the issue is still relevant, there is nothing you need to do, but if you have any additional details or context which would help us when working on this issue, please include it as a comment to this issue.

guzzijones · 2021-02-18T20:43:44Z

I have taken a look at this. +1 from me. Saving the result as a string instead of a mongo object/dict is ideal. There is no current use case to index nested dicts from workflow transitions while running.

using two different approaches for serializing execution / live action result.

Kami · 2021-02-18T20:56:26Z

After some discussions on Slack, I decided to try to attack this problem again and wrap up this change - slow insertion of large executions problem (which is fairly common in StackStorm world) still hasn't gone anyway and it's very much alive as it was 3 years ago.

I already performed a lot of benchmarkinh, research and prototyping in the past, but since quite a lot of time has passed since then (and data is scattered across various different issues and PRs), I decided to start from scratch with micro benchmarks which compare using different approach for storing LiveActionDB.result field value which is the main offender in our case.

Good thing about micro benchmarks is that it's also very easy to add a new one or use a different fixture file, etc. to see how it affects the numbers (and they are also fully reproducible by everyone).

First to refresh everyone's memory (including mine).

The two issues which contribute to slow insertion / saving of executions with large results:

mongoengine inefficency when converting large dictionaries to native pymongo SON types.
Escaping in our EscapedDictField which is performed on every single key in a dictionary (we escape $, . - this is needed if you want to store dictionaries which keys contain those values which are special in MongoDB since they act as operators and separators).

And one very important invariant on the result field itself - we use dictionary type for execution result field, but for the purpose of database itself, that field is an opaque binary blob - we don't perform any querying on it's values.

That's very important because it allows us to treat it as a string / binary blog which makes things easier (in fact, one of the option I already debated in the past is storing result in a separate collection as a string value and then add a FK to LiveAction collection).

Proposed Approach

The proposed change to mitigate / fix this issue is still the same one as I proposed and researched more than a year ago - storing result as a dictionary serialized to a JSON string / blog. Only difference is that now I use orjson instead of ujson since micro benchmark shows it's 5-15% faster than ujson on this specific dataset.

A lot of other approaches were also brought up in the past - things such as removing and replacing mongo all together, using a different ORM library, etc.

Yes, many of those would likely result in a much better performance if they were executed and implemented correctly (if!), but most of them are not even close to realistic - they are much more risky (much larger change, harder to make it backward compatible, etc.) and simply not realistic since they would require many man month hours worth of work (and that's an ideal scenario, less ideal and more realistic scenario where various edge cases crop up, compatibility issues, etc. is closer to 6 months).

This means that this approach is one of the few possible realistic approaches we have left.

This approach exploits the invariant that result is for all purposes, opaque binary blob to the database itself and if we use a binary field, we don't need to run escaping code on it + mongoengine doesn't need to perform slow conversion.

Initial Results

Initial results from micro benchmarks are very promising.

Save

Read

For a very large execution (8 MB), we see almost ~10x improvement in insertion speed (5s vs 500ms) and ~5x improvement in read speed (800ms vs 150ms).

The good news is that this change also scales down - aka it's not slower for smaller results - in fact it's also a bit faster for those scenarios as well.

Those numbers are from running micro benchmarks on my computer. They also run on continuously CI and everyone who wants to run it can easily run them themselves and also easily add a different dataset (e.g. some real life execution result data).

Things to experiment with

I still want to experiment with compressing actual json blob using zstandard.

The goal here is not to save number of bytes on disk (those were never really a big issue), but to save number of bytes which need to be sent over the network when talking to the database - this could perhaps help and reduce overall operation time even more for very large executions, but I doubt it will be worth the effort in the end. Will see what data will show us.

TODO

Also add micro benchmark just for the dict key escaping code
Add micro benchmark which include zstandard compression
Finish up the code
Add more backward and forward compatibility tests
Add some fixtures which (anonymized) utilize real life execution results
Get rid of json backend option - this was only used to benchmark different backends and just use orjson

single key with a large value.

all JSON fixture files so they contain at least one key with character which needs to be escaped.

Kami · 2021-02-18T23:13:27Z

I added micro benchmarks just for the escaping and unescape function and also unit tests for the new fields.

Running existing tests did uncover some issues - in some places, we run a set query for updating a live action and we pass dict for the value (which also kinda brakes the our ORM layer / DB abstraction and shows why such changes are hard - no abstraction is perfect and there are always leaks).

Sadly this won't out of the box, I will need to think and experiment some more on how to best approach this.

Technically, I think should be able to make it work, as long as I add some additional code to the places which use field__set notation to make sure the value we pass is already serialized as JSON instead of the raw dict - sadly to_mongo is not called of set queries.

…st2 into optimize_escaped_dict_fields

Kami · 2021-02-18T23:44:43Z

For now, I'm leaning towards adding a header to the serialized field value even though we will likely just support single serialization library and format (orjson) to make it a bit more future-proof in case we ever need to change the serialization format.

Something along the lines of b"<compression algorithm (if any):><serialization format>:<serialized data>.

So the value would look like something like this: n:or:{"foo": "bar"} where o stands for orjson and n for no compression (and z could be zstandard).

This adds some overhead, but it's really just 4 bytes so it's negligible (especially if we ever support compression).

m4dcoder · 2021-02-19T00:03:13Z

Can we be opinionated here and support a specific method of serialization? There's too much options to support (i.e. orjson, ujson, cjson) and I rather json be json w/o any prefix that only st2 understands.

amanda11

LGTM - just added a minor comment.
Would like someone else to review as well though, if possible.

amanda11 · 2021-03-19T13:04:54Z

st2client/st2client/formatters/execution.py

@@ -21,6 +21,11 @@

 import yaml

+try:
+    from yaml import CSafeDumper as YamlSafeDumper


Could we possibly add a comment to explain why we may not have CSafeDumper, and need this catch/exception?

Funny enough, I just added similar change recently (51f811c).

Technically as long as people are using official packages it should be there. But if they do from source install, etc it may not be.

Agreed it's a good ideal and will add a comment as well :)

means YAML loading and serialization will be significantly slower.

amanda11

Sorry - realised I hadn't seen the full PR changes on last review.

Just a couple of other queries...

amanda11 · 2021-03-19T13:39:15Z

contrib/core/tests/test_action_sendmail.py

-                "This message was generated by StackStorm action "
-                "send_mail running on %s" % (HOSTNAME)
-            )
+        expected_body = (


It looks like we've kept the code from the if - which if I read it was only if we were running on python2, so should we not have kept the else logic?

The version of the used library has also been changed / upgraded so now the behavior is different / more correct.

To clarify - we don't want to use escape sequences since that is Python specific and would not display correctly in the email clients, etc.

I assume we just did that change when working on Python 3 support a long time ago since it was the easiest, but technically not the most correct one :)

amanda11 · 2021-03-19T13:39:26Z

contrib/core/tests/test_action_sendmail.py

-                "This message was generated by StackStorm action "
-                "send_mail running on %s" % (HOSTNAME)
-            )
+        expected_body = (


It looks like we've kept the code from the if - which if I read it was only if we were running on python2, so should we not have kept the else logic?

Same as above.

amanda11 · 2021-03-19T13:53:25Z

st2common/st2common/openapi.yaml

+          description: True to compress the response body
+          type: boolean
+          default: false
+        - name: pretty_format


Out of interest, is the UI going to use the pretty format or the non-pretty format, and does this affect how the response appears in the UI at all?

Yeah, UI will use pretty format, but keep in mind that this API endpoint is only used when displaying results for large executions - we don't display it directly in the UI anymore (since that would freeze the browser. etc.), but we just display link to the API endpoint which returns raw result.

Here is a WIP st2web change - StackStorm/st2web#868 (https://github.com/StackStorm/st2web/pull/868/files#diff-38caf478018c81662a6660b616ab9dd030868fb0aec1c01a6c3916465d44f6f0R48).

In short - it doesn't affect output of st2web for executions which we display results directly in the ui (and those go through the code highlighter widget which does re-formatting anyway).

amanda11

LGTM

functions.

blag

LGTM, but there's a few small improvements that could be done.

Makefile

st2client/st2client/commands/action.py

test-requirements.txt

iso8601 date strings where possible.

tests-requirements.

Co-authored-by: blag <blag@users.noreply.github.com>

Kami · 2021-03-20T11:15:00Z

@blag Thanks for the review.

I will go ahead and merge this since it's already as massive PR, but I plan to open another small PR on top of that in the near future.

Having said that, 'm a bit OCD and that extra query on get in st2web change (StackStorm/st2web#868) was not making me happy, so I identified another approach I could use to avoid additional query in st2web to get execution result when result size < threshold.

Basically, I will modify API and add ?max_result_size=X (or similar, tbd) query parameter to /v1/executions/<id> API endpoint.

Instead of performing a simple "get one by id" query, we will perform a more complex query (something along the lines of id == Yo and (result_size < X | not result_size exists) and only return result with the execution object in case the query returns any data.

Good news is that this query is fast since first filter already filter on the PK so it should add negligible overhead (funny enough, based on the testing I did, sometimes it even returns faster than just a plan get by ID query, no idea why), but it will allow us to simplify st2web change and also avoid one extra API operation on st2web side.

jegesh · 2021-06-13T12:00:06Z

I'm looking to patch this PR into my current st2 installation. Any documentation I should refer to? Tips? I'm running v3.4 on Ubuntu 18 and st2 slows to a crawl when we run a few nested workflows, even though I set WORKERS=30 in the config file (and I can see 30 worker processes in htop), so fixing this will really be a godsend for us.

Kami · 2021-06-13T15:15:45Z

@jegesh this has been merged and will be included in the next release which should be out soon. In the mean time you can use dev unstable package for testing if you feel adventurous.

Add new JSONDictField which allows us to more efficently store,

a93f9c2

serialize and unserialize large dictionary data (such as result field, etc.).

Kami added performance mongodb labels Jan 20, 2020

pull-request-size bot added the size/L PR that changes 100-499 lines. Requires some effort to review. label Jan 20, 2020

Kami and others added 4 commits February 21, 2020 18:06

Add new JSONDictField which allows us to more efficently store,

a89d658

serialize and unserialize large dictionary data (such as result field, etc.).

Add a feature flag for using new json dict field, set it to false

fe5e33d

(opt-in) by default.

Use new JSON dict field for dictionaries which can be very large where

f0919c9

escaping the values adds tons of overhead.

Fix JSONDictEscapedFieldCompatibilityField to get it to work

59e87f9

m4dcoder force-pushed the optimize_escaped_dict_fields branch from 3f9ca22 to 59e87f9 Compare February 22, 2020 02:31

m4dcoder reviewed Feb 22, 2020

View reviewed changes

m4dcoder requested changes Feb 22, 2020

View reviewed changes

stale bot added the stale label Jul 1, 2020

Kami mentioned this pull request Feb 14, 2021

[RFC] Switching from json + ujson to orjson JSON serialization and de-serialization library StackStorm/community#65

Closed

3 tasks

Merge branch 'master' into optimize_escaped_dict_fields

2024a1e

stale bot removed the stale label Feb 18, 2021

Add a micro-benchmark which comparsed execution save + read times for

2f969f8

using two different approaches for serializing execution / live action result.

Kami added 3 commits February 18, 2021 23:11

Add another micro benchmark fixture which represents a dictionary with a

5971302

single key with a large value.

Add micro-benchmark for escape_chars() and unescape_chars() and update

f19f0fd

all JSON fixture files so they contain at least one key with character which needs to be escaped.

Add unit tests for the new fields.

81672f2

Kami added 4 commits February 19, 2021 00:26

Merge branch 'optimize_escaped_dict_fields' of github.com:StackStorm/…

2a15e8b

…st2 into optimize_escaped_dict_fields

Handle subclass hack in the micro benchmark itself.

44bdbad

Add some more tests.

052fde7

Fix invalid / broken test - result is a dict field and not a string.

dbc5f3d

amanda11 reviewed Mar 19, 2021

View reviewed changes

Kami added 5 commits March 19, 2021 14:07

Add workaround for weird failure on CI which should not be fatatal.

b13c195

Log a warning message if pyyaml C bindings are not available since it

51f811c

means YAML loading and serialization will be significantly slower.

Upgrade orjson to latest stable version.

4672495

Update out of date st2client setup.py metadata.

a25efa6

Add a comment on libyaml availability.

bdd8e3c

Kami force-pushed the optimize_escaped_dict_fields branch from 8362768 to bdd8e3c Compare March 19, 2021 13:25

amanda11 reviewed Mar 19, 2021

View reviewed changes

amanda11 approved these changes Mar 19, 2021

View reviewed changes

Kami added 2 commits March 19, 2021 23:29

Update more code to use orjson and C versions of yaml load/dump

a098315

functions.

Use fast dict copy.

e818158

blag reviewed Mar 19, 2021

View reviewed changes

Makefile Show resolved Hide resolved

st2client/st2client/commands/action.py Show resolved Hide resolved

test-requirements.txt Show resolved Hide resolved

Kami and others added 7 commits March 20, 2021 00:08

For performance reasons, use udatetime library for parsing rfc3339 /

48d612d

iso8601 date strings where possible.

ujson is not only used for tests / benchmarks so move it to

71ffb1a

tests-requirements.

Fix typo.

46ba2c9

Add TODO comment.

a27245f

Fix affected test.

1a91394

Apply suggestions from code review

cbd0259

Co-authored-by: blag <blag@users.noreply.github.com>

Fix syntax, add comments.

3b47856

Kami merged commit da76504 into master Mar 20, 2021

Kami deleted the optimize_escaped_dict_fields branch March 20, 2021 11:38

Kami mentioned this pull request Mar 20, 2021

Implement coditional execution "result" field retrieval for the "/v1/executions/<id>" (get one execution) API endpoint #5197

Merged

Kami mentioned this pull request May 7, 2021

Add migration script for migrating old execution field data to the new format #5255

Merged

2 tasks

arm4b mentioned this pull request Aug 23, 2021

Add AJ (@guzzijones) to the TSC Maintainers #5340

Merged

blag mentioned this pull request Aug 28, 2021

Usage of udatetime causing CI failures for Exchange packs #5346

Closed

DavidMeu mentioned this pull request Oct 20, 2021

Receiving mongodb error (appear in st2workflow logs) sending a 2MB content How it reached ~18MB? #5394

Open

Kami mentioned this pull request Aug 9, 2023

zstandard compression for parameters and results #5995

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize storage (serialization and de-serilization) of very large dictionaries inside MongoDB #4846

Optimize storage (serialization and de-serilization) of very large dictionaries inside MongoDB #4846

Kami commented Jan 20, 2020 •

edited

Loading

m4dcoder Feb 22, 2020

m4dcoder Feb 22, 2020

m4dcoder left a comment

stale bot commented Jul 1, 2020

guzzijones commented Feb 18, 2021

Kami commented Feb 18, 2021 •

edited

Loading

Kami commented Feb 18, 2021

Kami commented Feb 18, 2021

m4dcoder commented Feb 19, 2021

amanda11 left a comment

amanda11 Mar 19, 2021

Kami Mar 19, 2021 •

edited

Loading

Kami Mar 19, 2021

amanda11 left a comment

amanda11 Mar 19, 2021

Kami Mar 19, 2021 •

edited

Loading

amanda11 Mar 19, 2021

Kami Mar 19, 2021

amanda11 Mar 19, 2021

Kami Mar 19, 2021 •

edited

Loading

amanda11 left a comment

blag left a comment

Kami commented Mar 20, 2021

jegesh commented Jun 13, 2021

Kami commented Jun 13, 2021

Optimize storage (serialization and de-serilization) of very large dictionaries inside MongoDB #4846

Optimize storage (serialization and de-serilization) of very large dictionaries inside MongoDB #4846

Conversation

Kami commented Jan 20, 2020 • edited Loading

Proposed Implementation / Improvement

TODO

Choose a reason for hiding this comment

Choose a reason for hiding this comment

m4dcoder left a comment

Choose a reason for hiding this comment

stale bot commented Jul 1, 2020

guzzijones commented Feb 18, 2021

Kami commented Feb 18, 2021 • edited Loading

Proposed Approach

Initial Results

Save

Read

Things to experiment with

TODO

Kami commented Feb 18, 2021

Kami commented Feb 18, 2021

m4dcoder commented Feb 19, 2021

amanda11 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kami Mar 19, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amanda11 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kami Mar 19, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Kami Mar 19, 2021 • edited Loading

Choose a reason for hiding this comment

amanda11 left a comment

Choose a reason for hiding this comment

blag left a comment

Choose a reason for hiding this comment

Kami commented Mar 20, 2021

jegesh commented Jun 13, 2021

Kami commented Jun 13, 2021

Kami commented Jan 20, 2020 •

edited

Loading

Kami commented Feb 18, 2021 •

edited

Loading

Kami Mar 19, 2021 •

edited

Loading

Kami Mar 19, 2021 •

edited

Loading

Kami Mar 19, 2021 •

edited

Loading