
Fix memory leaks and update PDF/A algorithm for non-document products #845

Merged
merged 5 commits into main from issue_824 on Mar 2, 2024

Conversation

al-niessner
Contributor

@al-niessner al-niessner commented Feb 29, 2024

🗒️ Summary

Removed some caching that was causing excessive memory consumption. Also, PDF validation is now performed only for Product_Document.
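
A minimal sketch of the gating idea described above; the `Label` and `PdfAChecker` types below are placeholders standing in for whatever validate actually uses, not its real API:

```java
import java.io.File;

// Hypothetical sketch: run the PDF/A-1a check only for Product_Document labels.
// Label and PdfAChecker are placeholder types, not real validate classes.
interface Label { String getProductClass(); }
interface PdfAChecker { void check(File pdf); }

class PdfAGate {
    // Skip the PDF/A-1a conformance check unless the label is a Product_Document.
    static void maybeCheck(Label label, PdfAChecker checker, File pdf) {
        if ("Product_Document".equals(label.getProductClass())) {
            checker.check(pdf);
        }
    }
}
```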

⚙️ Test Data and/or Report

All tests pass. Previous behavior consumed tens of MB per label; current behavior, according to the profiler, is stable at 50 MB independent of the number of labels.

♻️ Related Issues

Closes #824
Closes #826

Al Niessner added 2 commits February 26, 2024 13:20
There were many. First, it turns out that hashing on a URL or URI is a bad idea. When profiling, the tables that hashed on these objects were tremendously big (thousands of entries). Changing them to hash (key) on .toString() reduced the tables to 12 items, which seemed more reasonable.
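
For context, one well-known pitfall here is that java.net.URL's equals()/hashCode() may resolve host names over the network, making URL-keyed tables slow and unpredictable (the commit message doesn't say exactly which effect was at play). A sketch of the string-keyed approach, with illustrative names:

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: key the cache on the URL's string form rather than the
// URL object itself. URL.equals()/hashCode() can trigger DNS lookups, whereas
// keying on .toString() is cheap and deterministic.
class SchemaCache<V> {
    private final Map<String, V> byLocation = new HashMap<>();

    void put(URL location, V value) {
        byLocation.put(location.toString(), value);  // key on .toString(), not the URL
    }

    V get(URL location) {
        return byLocation.get(location.toString());
    }
}
```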

The second problem was that the Schematron validators were holding their past information in memory while being recycled, which resulted in tens of MB per label being stored in each validator. Doing a reset() on the validator when done cleared that problem.
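
If the validators are javax.xml.transform.Transformer instances (as is common for compiled Schematron), the standard reset() method returns the transformer to its initial configuration. A sketch of the pattern, with illustrative surrounding names:

```java
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;

// Illustrative pattern: reset the validator after each label so per-run state
// is not retained and accumulated across labels.
class SchematronRunner {
    static void validate(Transformer validator, Source label, Result report)
            throws TransformerException {
        try {
            validator.transform(label, report);
        } finally {
            validator.reset();  // drop per-run state; restore initial configuration
        }
    }
}
```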

Third, the Schematron validators were holding onto document fragments that translated into 1 MB per label (give or take, depending on the label size). To get rid of these, the Schematron validators need to be released for garbage collection. This costs CPU cycles per label because the validator has to be created from scratch each time, but long jobs will no longer fail with out-of-memory errors.

As part of (3), changed from caching the transform to caching the actual document. It is loaded once from file or network and then used repeatedly to create new Schematron validators. In this way we get the benefit of not reloading it a million times but pay to create validators each time. Not too much slower: currently about 300 labels every 15 minutes. It may have been faster yesterday, maybe 400 every 15 minutes, but I am not positive; I never wrote it down.
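
A sketch of the cache-the-document pattern described above, assuming the cached document is the XSLT form of the Schematron rules (Schematron sources are typically compiled to XSLT first); class and method names are illustrative, not the actual validate code:

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;

// Illustrative sketch: load each schematron (in its XSLT form) once, keyed by
// its location string, then build a short-lived validator per label. Dropping
// the Transformer after each label lets its document fragments be garbage
// collected, at the cost of recompiling the validator every time.
class SchematronDocumentCache {
    private final Map<String, Document> documents = new HashMap<>();
    private final TransformerFactory factory = TransformerFactory.newInstance();

    Transformer newValidator(String location) throws Exception {
        Document doc = documents.get(location);
        if (doc == null) {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            doc = dbf.newDocumentBuilder().parse(location);  // load once from file/network
            documents.put(location, doc);
        }
        return factory.newTransformer(new DOMSource(doc));  // fresh validator per label
    }
}
```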
@al-niessner al-niessner self-assigned this Feb 29, 2024
@al-niessner al-niessner requested a review from a team as a code owner February 29, 2024 19:12
@al-niessner
Contributor Author

Oops. Wrong branch for this work... I will add PDF when I have time.

@al-niessner
Contributor Author

@jordanpadams @nutjob4life

Ready for prime time.

@jordanpadams jordanpadams changed the title performance issues - specifically memory Fix memory leaks and update PDF/A algorithm for non-document products Mar 2, 2024
@jordanpadams jordanpadams merged commit ae45e45 into main Mar 2, 2024
3 checks passed
@jordanpadams jordanpadams deleted the issue_824 branch March 2, 2024 20:24
Development

Successfully merging this pull request may close these issues.

validate is slow or runs out of memory when validating a bundle
Check for PDF/A-1a only if Product_Document