
Fix memory leaks and update PDF/A algorithm for non-document products #845

Merged
merged 5 commits into main from issue_824 on Mar 2, 2024

Conversation

al-niessner
Contributor

@al-niessner al-niessner commented Feb 29, 2024

🗒️ Summary

Removed some caching that was causing excessive memory consumption. Also, PDF validation is now performed only for Product_Document.
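
A minimal sketch of the gating idea described above; the `Label` and `PdfAChecker` types below are placeholders standing in for whatever validate actually uses, not its real API:

```java
import java.io.File;

// Hypothetical sketch: run the PDF/A-1a check only for Product_Document labels.
// Label and PdfAChecker are placeholder types, not real validate classes.
interface Label { String getProductClass(); }
interface PdfAChecker { void check(File pdf); }

class PdfAGate {
    // Skip the PDF/A-1a conformance check unless the label is a Product_Document.
    static void maybeCheck(Label label, PdfAChecker checker, File pdf) {
        if ("Product_Document".equals(label.getProductClass())) {
            checker.check(pdf);
        }
    }
}
```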

⚙️ Test Data and/or Report

All tests pass. Previous behavior consumed tens of MB per label; current behavior, according to the profiler, is stable at 50 MB independent of the number of labels.

♻️ Related Issues

Closes #824
Closes #826

Al Niessner added 2 commits February 26, 2024 13:20
There were many. First, it turns out that hashing on a URL or URI is a bad idea. When profiling, the tables that hashed on these objects were tremendously big (thousands of entries). Changing them to hash (key) on .toString() reduced the tables to 12 items, which seemed more reasonable.
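
For context, one well-known pitfall here is that java.net.URL's equals()/hashCode() may resolve host names over the network, making URL-keyed tables slow and unpredictable (the commit message doesn't say exactly which effect was at play). A sketch of the string-keyed approach, with illustrative names:

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: key the cache on the URL's string form rather than the
// URL object itself. URL.equals()/hashCode() can trigger DNS lookups, whereas
// keying on .toString() is cheap and deterministic.
class SchemaCache<V> {
    private final Map<String, V> byLocation = new HashMap<>();

    void put(URL location, V value) {
        byLocation.put(location.toString(), value);  // key on .toString(), not the URL
    }

    V get(URL location) {
        return byLocation.get(location.toString());
    }
}
```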

The second problem was that the Schematron validators were holding their past information in memory while being recycled, which resulted in tens of MB per label being stored in each validator. Doing a reset() on the validator when done cleared that problem.
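
If the validators are javax.xml.transform.Transformer instances (as is common for compiled Schematron), the standard reset() method returns the transformer to its initial configuration. A sketch of the pattern, with illustrative surrounding names:

```java
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;

// Illustrative pattern: reset the validator after each label so per-run state
// is not retained and accumulated across labels.
class SchematronRunner {
    static void validate(Transformer validator, Source label, Result report)
            throws TransformerException {
        try {
            validator.transform(label, report);
        } finally {
            validator.reset();  // drop per-run state; restore initial configuration
        }
    }
}
```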

Third, the Schematron validators were holding onto document fragments that translated into 1 MB per label (give or take, depending on the label size). To get rid of these, the Schematron validators need to be released for garbage collection. This costs CPU cycles per label because the validator has to be created from scratch each time, but long jobs will no longer fail with out-of-memory errors.

As part of (3), changed from caching the transform to caching the actual document. It is loaded once from file or network and then used repeatedly to create new Schematron validators. In this way we get the benefit of not reloading it a million times but pay to create validators each time. Not too much slower: currently about 300 labels every 15 minutes. It may have been faster yesterday, maybe 400 every 15 minutes, but I am not positive; I never wrote it down.
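
A sketch of the cache-the-document pattern described above, assuming the cached document is the XSLT form of the Schematron rules (Schematron sources are typically compiled to XSLT first); class and method names are illustrative, not the actual validate code:

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;

// Illustrative sketch: load each schematron (in its XSLT form) once, keyed by
// its location string, then build a short-lived validator per label. Dropping
// the Transformer after each label lets its document fragments be garbage
// collected, at the cost of recompiling the validator every time.
class SchematronDocumentCache {
    private final Map<String, Document> documents = new HashMap<>();
    private final TransformerFactory factory = TransformerFactory.newInstance();

    Transformer newValidator(String location) throws Exception {
        Document doc = documents.get(location);
        if (doc == null) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            doc = dbf.newDocumentBuilder().parse(location);  // load once from file/network
            documents.put(location, doc);
        }
        return factory.newTransformer(new DOMSource(doc));  // fresh validator per label
    }
}
```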
@al-niessner al-niessner self-assigned this Feb 29, 2024
@al-niessner al-niessner requested a review from a team as a code owner February 29, 2024 19:12
@al-niessner
Contributor Author

Oops. Wrong branch for this work... I will add PDF when I have time.

@al-niessner
Contributor Author

@jordanpadams @nutjob4life

Ready for prime time.

@jordanpadams jordanpadams changed the title performance issues - specifically memory Fix memory leaks and update PDF/A algorithm for non-document products Mar 2, 2024
@jordanpadams jordanpadams merged commit ae45e45 into main Mar 2, 2024
3 checks passed
@jordanpadams jordanpadams deleted the issue_824 branch March 2, 2024 20:24
Development

Successfully merging this pull request may close these issues.

validate is slow or runs out of memory when validating a bundle
Check for PDF/A-1a only if Product_Document