
Adjustment for memory constraints #112

Merged 3 commits from GDT-156-memory-performance into main on Jan 30, 2024
Conversation

@ghukill (Collaborator) commented Jan 29, 2024

Purpose and background context

A recent full harvest of the GeoHarvester via the TIMDEX StepFunction failed because the ECS container ran out of memory. This was anticipated but not confirmed until now.

This was a result of the Harvester maintaining every successful and failed Record object on the harvester instance throughout the harvest. Because a Record object contains both the source XML and normalized JSON (at least for MIT harvests), the volume of data becomes substantial over time, at least for a 512 MB container.

The following is a screenshot of ECS container metrics from the failed run. Note the linearly increasing memory that eventually runs out:

[Screenshot: ECS container memory metrics from the failed run (2024-01-26)]

This PR updates the harvester in two important ways (a rough sketch follows the list):

  • for records successfully processed, only the identifier is saved on the harvester (commit)
  • for failed records, only the record's identifier, the name of the failed harvest step, and the Exception object are saved on the harvester (commit)
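
The net effect is a pattern like this minimal sketch (class, method, and attribute names here are illustrative, not the actual implementation):

from dataclasses import dataclass, field

@dataclass
class FailedRecord:
    """Lightweight summary of a failed record: no XML or JSON payloads."""
    identifier: str
    harvest_step: str
    exception: Exception

@dataclass
class Harvester:
    """Illustrative harvester state: only small summaries are retained."""
    successful_record_ids: list[str] = field(default_factory=list)
    failed_records: list[FailedRecord] = field(default_factory=list)

    def record_success(self, identifier: str) -> None:
        # Before this PR, the full Record (source XML + normalized JSON)
        # was appended here, so memory grew linearly with harvest size.
        self.successful_record_ids.append(identifier)

    def record_failure(self, identifier: str, step: str, exc: Exception) -> None:
        self.failed_records.append(FailedRecord(identifier, step, exc))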

The following shows the same ECS container metrics, but now the memory nearly flatlines once the main iteration loop begins:

[Screenshot: ECS container memory metrics after the change (2024-01-26)]

This screenshot was taken before the failed-records adjustment, so memory consumption may now grow even more slowly.

Taken altogether, this should be sufficient for very large harvests (e.g. OGM with 120k+ records).

How can a reviewer manually see the effects of these changes?

Set env vars:

WORKSPACE=dev
SENTRY_DSN=None
S3_RESTRICTED_CDN_ROOT=/tmp
S3_PUBLIC_CDN_ROOT=/tmp
GEOHARVESTER_SQS_TOPIC_NAME=geo-harvester-input-dev
AWS_ACCESS_KEY_ID=<KEY HERE UNQUOTED>
AWS_SECRET_ACCESS_KEY=<SECRET HERE UNQUOTED>
AWS_SESSION_TOKEN=<SESSION TOKEN UNQUOTED>

Run a full harvest against Dev1 files; ALL outputs of the harvester will remain internal to the local docker container:

docker run --env-file .env \
geo-harvester-dev:latest --verbose harvest \
--harvest-type="full" \
--output-file="/tmp/output.jsonl" \
mit \
--input-files="s3://cdn-origin-dev-222053980223/cdn/geo/restricted/" \
--skip-sqs-check \
--skip-eventbridge-events

While this runs -- and it should take some time as there are 2k+ zip files here -- monitor docker stats with the following command in another terminal window:

docker stats

Here is a screenshot showing memory for this container (and a couple of others that were running) on my machine:

[Screenshot: docker stats showing container memory usage (2024-01-29)]

Note that the memory consumption of the container remains fairly stable. It will still grow slowly, but orders of magnitude more slowly than before, as successful and failed records are no longer fully maintained in memory on the Harvester instance.

Once satisfied that memory is not growing rapidly, Ctrl + C will stop the harvest and the docker container, which naturally removes the files saved locally inside the container.

Discussion

This harvester is almost entirely IO bound: it makes many network reads from S3 for zip files, many network writes to S3 for source + normalized metadata, and appends to a growing JSONLines file.

One dramatic performance improvement could be writing all metadata files locally inside the container, and either writing them all at once to S3 at the end of harvest, or performing batch writes along the way (e.g. every 100 records).
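
For illustration, a minimal sketch of what such batching could look like, using boto3 with hypothetical class and method names (not the harvester's actual API):

from concurrent.futures import ThreadPoolExecutor

import boto3

class BatchedMetadataWriter:
    """Hypothetical sketch: buffer payloads and flush to S3 in batches."""

    def __init__(self, bucket: str, batch_size: int = 100, workers: int = 8):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.batch_size = batch_size
        self.workers = workers
        self._buffer: list[tuple[str, bytes]] = []  # (key, payload) pairs

    def add(self, key: str, payload: bytes) -> None:
        self._buffer.append((key, payload))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # S3 has no multi-object PUT, so the gain comes from issuing the
        # buffered PUTs concurrently rather than inline with record processing.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for key, payload in self._buffer:
                pool.submit(self.s3.put_object, Bucket=self.bucket, Key=key, Body=payload)
        self._buffer.clear()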

But this introduces complexity that may not be worth the performance gain. In most cases, the "daily" harvests will be very, very small, and the readable, linear logic of the current implementation is easily efficient enough. Similarly, OGM harvests involve far fewer network writes.

Lastly, the OGM work may further stress test the current implementation. If batch writing seems appealing then -- as we'll potentially be writing 100k+ records to JSONLines -- that might be the time to make this change.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/GDT-156

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Commit messages

Why these changes are being introduced:
As anticipated, for large harvests, saving the entire Record object on the Harvester instance was prohibitively
expensive memory-wise.  If the record is successfully harvested, only the identifier is needed.

How this addresses that need:
* The loop that pulls all records through the step methods saves only the successful record's identifier

Side effects of this change:
* Harvester does not retain successfully harvested Records on the instance

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-156

Why these changes are being introduced:
As with successful records, when a Record fails it is also prohibitively expensive memory-wise to save the entire Record object, with its full source and normalized metadata.

How this addresses that need:
* If a record fails, the identifier, the harvest step name, and the Exception object are saved
* This allows failed records -- and why they failed -- to be debugged and logged without the memory overhead of saving everything

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-156

Why these changes are being introduced:
Mostly for debugging purposes: skipping the sending of EventBridge events for MIT harvests allows the harvester's flow and performance to be exercised in isolation. Even with the existing ability to exclude writing external files, sending an EventBridge event was still a non-configurable external action the harvester would take.

How this addresses that need:
* Adds a CLI flag and harvester option to skip sending EventBridge events (see the sketch after this commit message)

Side effects of this change:
* None

Relevant ticket(s):
* None
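
As a rough sketch of the flag wiring -- assuming a click-based CLI, with helper names hypothetical; only the --skip-eventbridge-events flag name itself is taken from this PR:

import click

def send_eventbridge_events() -> None:
    """Hypothetical stand-in for the harvester's EventBridge publishing."""
    click.echo("EventBridge events sent")

@click.command()
@click.option(
    "--skip-eventbridge-events",
    is_flag=True,
    default=False,
    help="Skip sending EventBridge events (useful for local debugging).",
)
def harvest(skip_eventbridge_events: bool) -> None:
    # ... harvest steps would run here ...
    if not skip_eventbridge_events:
        send_eventbridge_events()

if __name__ == "__main__":
    harvest()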
@ghukill marked this pull request as ready for review January 29, 2024 16:11
@ehanson8 (Contributor) left a comment

Looks great, thanks for the very detailed explanation and test!

@jonavellecuerdo (Contributor) left a comment

[Non-blocking]: Just to make sure I understand -- the second screenshot shows the ECS container metrics after making the change to save only the identifier of successful records to the harvester (but not the change for failed records)?

@ghukill (Collaborator, Author) commented Jan 30, 2024

> [Non-blocking]: Just to make sure I understand -- the second screenshot shows the ECS container metrics after making the change to save only the identifier of successful records to the harvester (but not the change for failed records)?

That's correct!

@ghukill merged commit ea84b30 into main Jan 30, 2024
5 checks passed
@ghukill deleted the GDT-156-memory-performance branch February 7, 2024 14:06
jonavellecuerdo added a commit that referenced this pull request May 17, 2024
Why these changes are being introduced:
* Saving the entire Record object to the 'event_records' dictionary in-memory
was causing the container to run out-of-memory. This update applies the
solution from #112.

How this addresses that need:
* Only save the fields that are required for the EventBridge details param.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-323
jonavellecuerdo added a commit that referenced this pull request May 21, 2024
(Same commit message as the May 17 commit above.)