Stream json data to a file to save 30% of the memory. #2510

rhpvorderman · 2024-04-27T08:29:28Z

This comment contains a description of changes (with reason)

I noticed that using --no-data-dir saved more than 300 MiB when working with the multiqc/test-data. This is odd. Turns out the internal data is serialized into a string before it is written. Using a streaming method saves the memory.

before:

/usr/bin/time bash -c "multiqc --force ../test-data/ 2>/dev/null"
26.09user 3.37system 0:28.28elapsed 104%CPU (0avgtext+0avgdata 1030072maxresident)k
320inputs+558480outputs (5major+327488minor)pagefaults 0swaps

After:

/usr/bin/time bash -c "multiqc --force ../test-data/ 2>/dev/null"
25.82user 3.61system 0:28.29elapsed 104%CPU (0avgtext+0avgdata 674052maxresident)k
464inputs+558440outputs (5major+356765minor)pagefaults 0swaps

I would love to reduce the memory further, but a lot of the memory profile shows stuff that has already been touched by other PRs. As stated before the LZ compression is a major culprit, using 200 MiB. If it could be replaced with gzip that usage would simply vanish. Another culprit is reading and storing lines from files, but that is already touched upon by #2505 and I suspect that that will have much better memory use as lines are not saved permanently. I added the profile report.
memray-flamegraph-multiqc.9194.html.zip

This change is pretty small. While I do not like that it can dump to file (returning None) and dump to string (returning str) it is good that the replace_nan function is still used.

…ith dump_json

vladsavelyev · 2024-04-27T16:33:10Z

That's awesome! I changed a few more places where JSON can be written directly to file.

rhpvorderman · 2024-04-29T06:21:34Z

Awesome. Thanks!

Stream json data to a file to prevent using excessive memory

754709a

vladsavelyev added this to the MultiQC v1.22: Pydantic milestone Apr 27, 2024

More places where JSON dump can be optimized. Replace sanitise_json w…

a0c629b

…ith dump_json

vladsavelyev self-requested a review April 27, 2024 16:33

vladsavelyev approved these changes Apr 27, 2024

View reviewed changes

vladsavelyev added the core: back end label Apr 27, 2024

vladsavelyev merged commit 7cd623c into MultiQC:main Apr 27, 2024
6 checks passed

rhpvorderman deleted the memorysavings branch April 29, 2024 06:21

This was referenced Apr 29, 2024

Consider using orjson #1920

Closed

Rampant memory consumption #1961

Open

Reduce MultiQC memory consumption #2517

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stream json data to a file to save 30% of the memory. #2510

Stream json data to a file to save 30% of the memory. #2510

rhpvorderman commented Apr 27, 2024 •

edited

vladsavelyev commented Apr 27, 2024

rhpvorderman commented Apr 29, 2024

Stream json data to a file to save 30% of the memory. #2510

Stream json data to a file to save 30% of the memory. #2510

Conversation

rhpvorderman commented Apr 27, 2024 • edited

vladsavelyev commented Apr 27, 2024

rhpvorderman commented Apr 29, 2024

rhpvorderman commented Apr 27, 2024 •

edited