Adjustment for memory constraints #112
Conversation
Why these changes are being introduced:

As anticipated, for large harvests, saving the entire Record object on the Harvester instance was prohibitively expensive memory-wise. If a record is successfully harvested, only the identifier is needed.

How this addresses that need:
* The loop that pulls all records through the step methods saves only the successful identifier

Side effects of this change:
* Harvester does not retain successfully harvested Records on the instance

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-156
Why these changes are being introduced:

Similar to not saving the entire Record object, with full source and normalized metadata, when a Record fails it is also prohibitively expensive memory-wise to save the entire Record.

How this addresses that need:
* If a record fails, the identifier, the harvest step, and the Exception object are saved
* This allows for debugging/logging of failed records, and why, without the memory overhead of saving everything

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-156
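A minimal sketch of the retention change described in the two commit messages above; the class, attribute, and method names here are illustrative, not the harvester's actual API:

```python
# Illustrative sketch only: names like Harvester.harvest and Record.identifier
# approximate, but are not, the actual harvester API.
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    source_xml: bytes = b""  # potentially large
    normalized_json: dict = field(default_factory=dict)  # potentially large


@dataclass
class Harvester:
    # only lightweight summaries are retained, never full Record objects
    successful_records: list[str] = field(default_factory=list)
    failed_records: list[tuple[str, str, Exception]] = field(default_factory=list)

    def harvest(self, records: list[Record]) -> None:
        for record in records:
            try:
                self._run_steps(record)
            except Exception as exc:  # noqa: BLE001
                # identifier, step name, and Exception are enough for debugging;
                # a real implementation would record which step actually failed
                self.failed_records.append((record.identifier, "normalize", exc))
            else:
                # identifier only; the full Record can now be garbage collected
                self.successful_records.append(record.identifier)

    def _run_steps(self, record: Record) -> None:
        ...  # read source, normalize, write outputs
```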
Why these changes are being introduced:

Mostly for debugging purposes, skipping the sending of EventBridge events for MIT harvests allows for debugging the harvester flow and performance. With the current ability to exclude writing external files, sending an EventBridge event was still a non-configurable external action the harvester would take.

How this addresses that need:
* Adds CLI and harvester flag to skip sending EventBridge events

Side effects of this change:
* None

Relevant ticket(s):
* None
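A hedged sketch of such a flag, assuming a click-based CLI; the option name and helper function here are hypothetical, and the harvester's real CLI may spell them differently:

```python
# Hypothetical flag name; the harvester's real CLI option may differ.
import click


@click.command()
@click.option(
    "--skip-eventbridge-events",
    is_flag=True,
    help="Do not send EventBridge events (useful when debugging harvester flow).",
)
def harvest(skip_eventbridge_events: bool) -> None:
    # ... perform harvest steps ...
    if not skip_eventbridge_events:
        send_eventbridge_events()  # hypothetical helper


def send_eventbridge_events() -> None:
    ...
```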
Looks great, thanks for the very detailed explanation and test!
[Non-blocking]: Just to make sure I understand -- the second screenshot shows the ECS container metrics after making the change to save only the identifier of successful records to the harvester (but not the change for failed records)?
That's correct!
Why these changes are being introduced:
* Saving the entire Record object to the 'event_records' dictionary in-memory was causing the container to run out of memory. This update applies the solution from #112.

How this addresses that need:
* Only save the fields that are required for the EventBridge details param.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-323
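A minimal sketch of that idea, assuming boto3's EventBridge client; the specific fields kept in the 'event_records' dictionary and the event source/detail-type names are assumptions, not the harvester's actual schema:

```python
# Sketch only: the exact fields required for the EventBridge details payload
# are assumptions here, not the harvester's actual schema.
import json

import boto3

event_records: dict[str, dict] = {}


def track_record(identifier: str, action: str, s3_key: str) -> None:
    # store only the small fields needed for the event detail,
    # never the full Record object
    event_records[identifier] = {"action": action, "s3_key": s3_key}


def send_events() -> None:
    client = boto3.client("events")
    # real code would batch Entries, which EventBridge caps at 10 per call
    client.put_events(
        Entries=[
            {
                "Source": "geo-harvester",  # illustrative source name
                "DetailType": "record-harvested",  # illustrative detail type
                "Detail": json.dumps(detail),
            }
            for detail in event_records.values()
        ]
    )
```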
Purpose and background context
A recent full harvest of the GeoHarvester via the TIMDEX StepFunction failed because the ECS container ran out of memory. This was anticipated but not confirmed until now.
This was a result of the `Harvester` maintaining the successful and failed `Record` objects on the harvester instance throughout the harvest. With a `Record` object containing both the source XML and normalized JSON (at least for MIT harvests), the volume of data becomes substantial over time, at least for a 512 MB container.

The following is a screenshot of ECS container metrics from the failed run. Note the linearly increasing memory that eventually runs out:
This PR updates the harvester in two important ways:

* for successful records, only the identifier is saved on the harvester (commit)
* for failed records, only the identifier, the harvest step, and the `Exception` object are saved on the harvester (commit)

The following is the same ECS container metrics, but now the memory nearly flatlines once in the main iteration loop:
This screenshot was taken before the failed-records adjustment, so memory consumption may grow even more slowly now.
Taken altogether, this should be sufficient for very large harvests (e.g. OGM with 120k+ records).
How can a reviewer manually see the effects of these changes?
Set env vars:
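A hypothetical example, assuming standard AWS credentials are needed for the S3 reads and writes; the actual variables the harvester requires may differ:

```shell
# Hypothetical: only the AWS credential variables are standard; any
# harvester-specific variables were not specified here.
export AWS_ACCESS_KEY_ID=<key>
export AWS_SECRET_ACCESS_KEY=<secret>
export AWS_SESSION_TOKEN=<token>
```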
Run full harvest against Dev1 files, but ALL outputs of the harvester will be internal to the local docker container:
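A hypothetical shape for this step; the image name and harvest command/options are placeholders, not the actual CLI:

```shell
# Hypothetical: substitute the real image name and harvest command/options.
docker run --rm --env-file .env <harvester-image> <harvest-command-and-options>
```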
While this runs -- and it should take some time as there are 2k+ zip files here -- monitor docker stats with the following command in another terminal window:
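Assuming the standard Docker CLI, this would be:

```shell
docker stats
```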
Here is a screenshot showing memory for this container (and a couple of others that were running) on my machine:
Note that the memory consumption of the container remains fairly stable. It will still grow slowly, but orders of magnitude more slowly than before, as successful and failed records are no longer fully maintained in memory as items on the `Harvester` instance.

`Ctrl + C` will stop the harvest and container any time you are satisfied that memory is not growing rapidly, which will naturally remove the files saved locally inside the container.

Discussion
This harvester is almost entirely IO bound, with many network reads from S3 (pulling zip files) and many network writes to S3 (writing source + normalized metadata, and appending to a growing JSONLines file).

One dramatic performance improvement could be writing all metadata files locally inside the container, and then either writing them to S3 all at once at the end of the harvest, or performing batch writes along the way (e.g. every 100 records); see the sketch at the end of this section.

But this introduces some complexity that may not be worth the performance gain. In most cases, the "daily" harvests will be very, very small, and the readability and linear logic of the current implementation are easily efficient enough for them. Similarly, OGM harvests perform far fewer network writes.

Lastly, the OGM work may further stress test the current implementation. If batch writing seems appealing then, as we'll be writing potentially 100k+ records to JSONLines, that would be the time to make this change.
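A minimal sketch of that batch-write idea, assuming boto3 and illustrative bucket/key names and batch size; this is not the harvester's actual implementation:

```python
# Sketch of batched S3 writes: buffer JSONLines records locally and write one
# object per batch, instead of one network write per record.
import boto3

BATCH_SIZE = 100  # illustrative, per the "every 100 records" idea above

s3 = boto3.client("s3")
buffer: list[str] = []
part = 0


def write_record(jsonline: str) -> None:
    buffer.append(jsonline)
    if len(buffer) >= BATCH_SIZE:
        flush()


def flush() -> None:
    global part
    if not buffer:
        return
    body = ("\n".join(buffer) + "\n").encode()
    # one PUT per batch; parts can be concatenated (or multipart-uploaded) later
    s3.put_object(
        Bucket="my-bucket",  # illustrative bucket
        Key=f"harvest/records-{part:05}.jsonl",  # illustrative key pattern
        Body=body,
    )
    part += 1
    buffer.clear()
```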
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/GDT-156
Developer
Code Reviewer(s)