
Adjustment for memory constraints #112

Merged 3 commits from GDT-156-memory-performance into main on Jan 30, 2024
Conversation

@ghukill (Collaborator) commented Jan 29, 2024

Purpose and background context

A recent full harvest of the GeoHarvester via the TIMDEX StepFunction failed because the ECS container ran out of memory. This was anticipated but not confirmed until now.

This was a result of the Harvester maintaining every successful and failed Record object on the harvester instance throughout the harvest. Because a Record object contains both the source XML and normalized JSON (at least for MIT harvests), the volume of data becomes substantial over time, at least for a 512 MB container.

The following is a screenshot of ECS container metrics from the failed run. Note the linearly increasing memory that eventually runs out:

[Screenshot: ECS container memory metrics from the failed run (2024-01-26)]

This PR updates the harvester in two important ways (a rough sketch follows the list):

  • for records successfully processed, only the identifier is saved on the harvester (commit)
  • for failed records, only the record's identifier, the name of the failed harvest step, and the Exception object are saved on the harvester (commit)
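
The net effect is a pattern like this minimal sketch (class, method, and attribute names here are illustrative, not the actual implementation):

from dataclasses import dataclass, field

@dataclass
class FailedRecord:
    """Lightweight summary of a failed record: no XML or JSON payloads."""
    identifier: str
    harvest_step: str
    exception: Exception

@dataclass
class Harvester:
    """Illustrative harvester state: only small summaries are retained."""
    successful_record_ids: list[str] = field(default_factory=list)
    failed_records: list[FailedRecord] = field(default_factory=list)

    def record_success(self, identifier: str) -> None:
        # Before this PR, the full Record (source XML + normalized JSON)
        # was appended here, so memory grew linearly with harvest size.
        self.successful_record_ids.append(identifier)

    def record_failure(self, identifier: str, step: str, exc: Exception) -> None:
        self.failed_records.append(FailedRecord(identifier, step, exc))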

The following shows the same ECS container metrics, but now the memory nearly flatlines once the main iteration loop begins:

[Screenshot: ECS container memory metrics after the change (2024-01-26)]

This screenshot was taken before the failed-records adjustment, so memory consumption may now grow even more slowly.

Taken altogether, this should be sufficient for very large harvests (e.g. OGM with 120k+ records).

How can a reviewer manually see the effects of these changes?

Set env vars:

WORKSPACE=dev
SENTRY_DSN=None
S3_RESTRICTED_CDN_ROOT=/tmp
S3_PUBLIC_CDN_ROOT=/tmp
GEOHARVESTER_SQS_TOPIC_NAME=geo-harvester-input-dev
AWS_ACCESS_KEY_ID=<KEY HERE UNQUOTED>
AWS_SECRET_ACCESS_KEY=<SECRET HERE UNQUOTED>
AWS_SESSION_TOKEN=<SESSION TOKEN UNQUOTED>

Run a full harvest against Dev1 files; ALL outputs of the harvester will remain internal to the local docker container:

docker run --env-file .env \
geo-harvester-dev:latest --verbose harvest \
--harvest-type="full" \
--output-file="/tmp/output.jsonl" \
mit \
--input-files="s3://cdn-origin-dev-222053980223/cdn/geo/restricted/" \
--skip-sqs-check \
--skip-eventbridge-events

While this runs -- and it should take some time as there are 2k+ zip files here -- monitor docker stats with the following command in another terminal window:

docker stats

Here is a screenshot showing memory for this container (and a couple of others that were running) on my machine:

[Screenshot: docker stats showing container memory usage (2024-01-29)]

Note that the memory consumption of the container remains fairly stable. It will still grow slowly, but orders of magnitude more slowly than before, as successful and failed records are no longer fully maintained in memory on the Harvester instance.

Once satisfied that memory is not growing rapidly, Ctrl + C will stop the harvest and the docker container, which naturally removes the files saved locally inside the container.

Discussion

This harvester is almost entirely IO bound: it makes many network reads from S3 for zip files, many network writes to S3 for source + normalized metadata, and appends to a growing JSONLines file.

One dramatic performance improvement could be writing all metadata files locally inside the container, and either writing them all at once to S3 at the end of harvest, or performing batch writes along the way (e.g. every 100 records).
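
For illustration, a minimal sketch of what such batching could look like, using boto3 with hypothetical class and method names (not the harvester's actual API):

from concurrent.futures import ThreadPoolExecutor

import boto3

class BatchedMetadataWriter:
    """Hypothetical sketch: buffer payloads and flush to S3 in batches."""

    def __init__(self, bucket: str, batch_size: int = 100, workers: int = 8):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.batch_size = batch_size
        self.workers = workers
        self._buffer: list[tuple[str, bytes]] = []  # (key, payload) pairs

    def add(self, key: str, payload: bytes) -> None:
        self._buffer.append((key, payload))
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # S3 has no multi-object PUT, so the gain comes from issuing the
        # buffered PUTs concurrently rather than inline with record processing.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for key, payload in self._buffer:
                pool.submit(self.s3.put_object, Bucket=self.bucket, Key=key, Body=payload)
        self._buffer.clear()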

But this introduces complexity that may not be worth the performance gain. In most cases, the "daily" harvests will be very, very small, and the readable, linear logic of the current implementation is easily efficient enough. Similarly, OGM harvests involve far fewer network writes.

Lastly, the OGM work may further stress test the current implementation. If batch writing seems appealing then -- as we'll potentially be writing 100k+ records to JSONLines -- that might be the time to make this change.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/GDT-156

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Commit messages

Why these changes are being introduced:
As anticipated, for large harvests, saving the entire Record object on the Harvester instance was prohibitively
expensive memory-wise.  If the record is successfully harvested, only the identifier is needed.

How this addresses that need:
* The loop that pulls all records through the step methods saves only the successful record's identifier

Side effects of this change:
* Harvester does not retain successfully harvested Records on the instance

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-156

Why these changes are being introduced:
As with successful records, when a Record fails it is also prohibitively expensive memory-wise to save the entire Record object, with its full source and normalized metadata.

How this addresses that need:
* If a record fails, the identifier, the harvest step name, and the Exception object are saved
* This allows failed records -- and why they failed -- to be debugged and logged without the memory overhead of saving everything

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-156

Why these changes are being introduced:
Mostly for debugging purposes: skipping the sending of EventBridge events for MIT harvests allows the harvester's flow and performance to be exercised in isolation. Even with the existing ability to exclude writing external files, sending an EventBridge event was still a non-configurable external action the harvester would take.

How this addresses that need:
* Adds a CLI flag and harvester option to skip sending EventBridge events (see the sketch after this commit message)

Side effects of this change:
* None

Relevant ticket(s):
* None
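
As a rough sketch of the flag wiring -- assuming a click-based CLI, with helper names hypothetical; only the --skip-eventbridge-events flag name itself is taken from this PR:

import click

def send_eventbridge_events() -> None:
    """Hypothetical stand-in for the harvester's EventBridge publishing."""
    click.echo("EventBridge events sent")

@click.command()
@click.option(
    "--skip-eventbridge-events",
    is_flag=True,
    default=False,
    help="Skip sending EventBridge events (useful for local debugging).",
)
def harvest(skip_eventbridge_events: bool) -> None:
    # ... harvest steps would run here ...
    if not skip_eventbridge_events:
        send_eventbridge_events()

if __name__ == "__main__":
    harvest()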
@ghukill marked this pull request as ready for review January 29, 2024 16:11
@ehanson8 (Contributor) left a comment

Looks great, thanks for the very detailed explanation and test!

@jonavellecuerdo (Contributor) left a comment

[Non-blocking]: Just to make sure I understand -- the second screenshot shows the ECS container metrics after making the change to save only the identifier of successful records to the harvester (but not the change for failed records)?

@ghukill (Collaborator, Author) commented Jan 30, 2024

> [Non-blocking]: Just to make sure I understand -- the second screenshot shows the ECS container metrics after making the change to save only the identifier of successful records to the harvester (but not the change for failed records)?

That's correct!

@ghukill merged commit ea84b30 into main Jan 30, 2024
5 checks passed
@ghukill deleted the GDT-156-memory-performance branch February 7, 2024 14:06
jonavellecuerdo added a commit that referenced this pull request May 17, 2024
Why these changes are being introduced:
* Saving the entire Record object to the 'event_records' dictionary in-memory
was causing the container to run out-of-memory. This update applies the
solution from #112.

How this addresses that need:
* Only save the fields that are required for the EventBridge details param.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-323
jonavellecuerdo added a commit that referenced this pull request May 21, 2024
(Same commit message as the May 17 commit above.)