Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream json data to a file to save 30% of the memory. #2510

Merged
merged 2 commits into from
Apr 27, 2024

Conversation

rhpvorderman
Copy link
Contributor

@rhpvorderman rhpvorderman commented Apr 27, 2024

  • This comment contains a description of changes (with reason)

I noticed that using --no-data-dir saved more than 300 MiB when working with the multiqc/test-data. This is odd. Turns out the internal data is serialized into a string before it is written. Using a streaming method saves the memory.

before:

/usr/bin/time bash -c "multiqc --force ../test-data/ 2>/dev/null"
26.09user 3.37system 0:28.28elapsed 104%CPU (0avgtext+0avgdata 1030072maxresident)k
320inputs+558480outputs (5major+327488minor)pagefaults 0swaps

After:

/usr/bin/time bash -c "multiqc --force ../test-data/ 2>/dev/null"
25.82user 3.61system 0:28.29elapsed 104%CPU (0avgtext+0avgdata 674052maxresident)k
464inputs+558440outputs (5major+356765minor)pagefaults 0swaps

I would love to reduce the memory further, but a lot of the memory profile shows stuff that has already been touched by other PRs. As stated before the LZ compression is a major culprit, using 200 MiB. If it could be replaced with gzip that usage would simply vanish. Another culprit is reading and storing lines from files, but that is already touched upon by #2505 and I suspect that that will have much better memory use as lines are not saved permanently. I added the profile report.
memray-flamegraph-multiqc.9194.html.zip

This change is pretty small. While I do not like that it can dump to file (returning None) and dump to string (returning str) it is good that the replace_nan function is still used.

@vladsavelyev vladsavelyev added this to the MultiQC v1.22: Pydantic milestone Apr 27, 2024
@vladsavelyev
Copy link
Member

That's awesome! I changed a few more places where JSON can be written directly to file.

@vladsavelyev vladsavelyev self-requested a review April 27, 2024 16:33
@vladsavelyev vladsavelyev merged commit 7cd623c into MultiQC:main Apr 27, 2024
6 checks passed
@rhpvorderman
Copy link
Contributor Author

Awesome. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants