GDT-275 SQS missing file #286

Merged: 3 commits merged into main from GDT-275-sqs-missing-file on Apr 23, 2024

Conversation

ehanson8 (Contributor):

Purpose and background context

Adds exception handling for SQS messages that refer to non-existent files in the S3 bucket, so that such messages are skipped and subsequent SQS messages are processed.

How can a reviewer manually see the effects of these changes?

foo.zip was uploaded to geo-upload-dev-222053980223, which triggered an SQS message in the geo-harvester-input-dev.fifo queue and a copy of the file to cdn-origin-dev-222053980223/cdn/geo/restricted/. The file was then deleted from cdn-origin-dev-222053980223/cdn/geo/restricted/ to trigger the error.
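
A minimal boto3 sketch of those steps, assuming Dev1 credentials are configured (hypothetical: the object key under the restricted prefix is an assumption, and in practice the copy into the restricted prefix is performed by the triggered pipeline):

# Hypothetical reproduction sketch; bucket names are taken from the
# description above, the object key is an assumption.
import boto3

s3 = boto3.client("s3")

# Uploading to the upload bucket triggers the SQS message (the copy into
# the restricted CDN prefix is handled by the pipeline).
s3.upload_file("foo.zip", "geo-upload-dev-222053980223", "foo.zip")

# Deleting the copied object recreates the "missing file" condition.
s3.delete_object(
    Bucket="cdn-origin-dev-222053980223",
    Key="cdn/geo/restricted/foo.zip",
)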

Set Dev1 credentials and the following variables in .env:

WORKSPACE=dev
SENTRY_DSN=None
S3_RESTRICTED_CDN_ROOT=s3://cdn-origin-dev-222053980223/cdn/geo/restricted/
S3_PUBLIC_CDN_ROOT=s3://cdn-origin-dev-222053980223/cdn/geo/public/
GEOHARVESTER_SQS_TOPIC_NAME=geo-harvester-input-dev.fifo

Run an incremental harvest to see the error and confirm that the message remains in the geo-harvester-input-dev.fifo queue:

pipenv run harvester --verbose harvest -t incremental -o output/mit_incremental.jsonl mit

Includes new or updated dependencies?

YES

Changes expectations for external applications?

NO

What are the relevant tickets?

  • GDT-275: https://mitlibraries.atlassian.net/browse/GDT-275

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Adding exception handling for SQS messages that refer to non-existent files in the S3 bucket so that they are skipped and subsequent SQS messages are processed.

How this addresses that need:
* Add try/except block to handle the OSError that results from a missing file in S3 as well as a corresponding unit test
* Update and add fixtures to support new unit test

Side effects of this change:
* SQS messages without a corresponding file in the S3 bucket will remain in the queue

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-275
@ghukill (Collaborator) left a review requesting changes:

Overall, the location of the try/except logic and the test fixtures all look really good.

My reason for requesting changes -- whether or not adopted -- is to consider yielding a Record + exception from the incremental_harvest_get_source_records() method instead of logging and skipping the record entirely. As discussed in the comment below, I think we might benefit from utilizing the error handling built into the pipeline where possible.

The counterpoint is that other harvesters generally don't have this level of sophistication in error handling for source records. Counter-counterpoint: perhaps they should?

Open to anything here, just wanted to propose a good-faith example of how yielding Records + exception will get handled in the pre-existing error handling in the pipeline.

tests/conftest.py (review thread resolved)
Comment on lines 136 to 152
def test_mit_harvester_incremental_continues_after_missing_zip_file(
    caplog,
    mock_sqs_queue,
    mocked_sqs_topic_name,
    mocked_restricted_bucket_one_legacy_fgdc_zip,
):
    harvester = MITHarvester(
        harvest_type="incremental",
        input_files=mocked_restricted_bucket_one_legacy_fgdc_zip,
        sqs_topic_name=mocked_sqs_topic_name,
    )
    records = harvester.incremental_harvest_get_source_records()
    assert len(list(records)) == 1
    assert (
        "OSError: unable to access bucket: 'mocked_cdn_restricted' "
        "key: 'cdn/geo/restricted/DEF456.zip'" in caplog.text
    )
@ghukill (Collaborator) commented:

For this test, I'll admit I had to do some digging to understand why it worked. What's not immediately obvious to me is that the mocked queue contains two messages: one referencing a file that is in the mocked S3 bucket, and one that is not.

Instead of a docstring, what about renaming the fixture mock_sqs_queue to something like mock_sqs_queue_two_messages? I feel like the "two" in that fixture and the "one" from mocked_restricted_bucket_one_legacy_fgdc_zip would be enough to explain the final assertion that the results have length 1 --> one record failed, but one still succeeded.

Not a blocking request, but just sharing my experience unpacking this test (which is good, BTW).
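
For illustration, a hypothetical version of such a fixture (the queue name, message bodies, and moto usage here are assumptions, not the project's actual conftest.py):

# Hypothetical fixture illustrating the suggested rename; the queue name,
# message bodies, and moto usage are assumptions, not the actual conftest.py.
import boto3
import pytest
from moto import mock_aws

@pytest.fixture
def mock_sqs_queue_two_messages():
    with mock_aws():
        sqs = boto3.resource("sqs", region_name="us-east-1")
        queue = sqs.create_queue(
            QueueName="mocked-geo-harvester-input.fifo",
            Attributes={
                "FifoQueue": "true",
                "ContentBasedDeduplication": "true",
            },
        )
        # One message for a zip file present in the mocked bucket, and
        # one for a zip file that is not.
        for key in ("SDE_DATA_AE_A8GNS_2003.zip", "DEF456.zip"):
            queue.send_message(
                MessageBody=f'{{"key": "cdn/geo/restricted/{key}"}}',
                MessageGroupId="geo",
            )
        yield queue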

@ghukill (Collaborator) commented on Apr 22, 2024:

Update: see the other comment about yielding a Record + exception from the incremental_harvest_get_source_records() method. If that approach is used, this test would need a couple of updates, and maybe then the fixture name is more inconsequential.

Comment on lines 71 to 83
try:
    source_record = self.create_source_record_from_zip_file(
        identifier=identifier,
        zip_file=zip_file_event_message.zip_file,
        event=zip_file_event_message.event,
        sqs_message=zip_file_event_message,
    )
except OSError:
    logger.exception("File not found")
    continue
yield Record(
    identifier=identifier,
    source_record=source_record,
)
@ghukill (Collaborator) commented:

My knee-jerk reaction was that this is a simple and elegant solution to this issue. And I think that's true, but I think it also short-circuits the failed-records handling aspects of the pipeline.

Notice here in the base Harvester.harvest method, where Records that encountered an error are filtered out (with the same pattern applied to each step):

records = self.filter_failed_records(self.get_source_records())
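
For anyone unfamiliar with that pattern, here's a rough sketch of what that filtering might look like (hypothetical -- attribute names like failed_records are assumptions, not the actual base class code):

# Rough, hypothetical sketch of the filtering pattern; the actual
# implementation (and attribute names like failed_records) live in the
# base Harvester class.
def filter_failed_records(self, records):
    for record in records:
        if record.exception:
            # Set failed records aside for error reporting instead of
            # passing them further down the pipeline.
            self.failed_records.append(record)
        else:
            yield record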

As such, if we fail to find and parse a source record (e.g. the file is not found in S3), we could yield a Record instance with Record.exception_stage="incremental_harvest_get_source_records" and Record.exception = <EXCEPTION OBJECT>, and we'd get the same error handling and reporting as other parts of the pipeline where something fails.

To do so, we could modify the try/except block here to still yield a Record, just one without a SourceRecord attached, and the exception encountered attached:

try:
    source_record = self.create_source_record_from_zip_file(
        identifier=identifier,
        zip_file=zip_file_event_message.zip_file,
        event=zip_file_event_message.event,
        sqs_message=zip_file_event_message,
    )
    yield Record(
        identifier=identifier,
        source_record=source_record,
    )
except OSError as exc:   # <------- new logic starts here
    yield Record(
        identifier=identifier,
        source_record=None,  # <-------- Note this is None
        exception_stage="incremental_harvest_get_source_records",
        exception=exc,
    )

This has a couple of effects:

  • the total number of records yielded from this method would actually be two (one success, one fail)
  • the test test_mit_harvester_incremental_continues_after_missing_zip_file could then interrogate the records and assert that one has an exception (and therefore won't carry on) and one does not (will continue as normal).

Example of updates we could apply to that test:

records = list(records)
assert len(records) == 2

fail_record, success_record = records
assert fail_record.identifier == "DEF456"
assert fail_record.exception_stage == "incremental_harvest_get_source_records"
assert isinstance(fail_record.exception, OSError)
assert (
    str(fail_record.exception)
    == "unable to access bucket: 'mocked_cdn_restricted' key: "
    "'cdn/geo/restricted/DEF456.zip' version: None error: An error occurred ("
    "NoSuchKey) when calling the GetObject operation: The specified key does not "
    "exist."
)
assert success_record.identifier == "SDE_DATA_AE_A8GNS_2003"
assert success_record.exception is None

If we lean into the failed-records handling by yielding a Record + exception, then we'll get the same Sentry reporting for the Record's identifier and why it failed (to the degree that is set up and configured).
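
As a rough illustration of that reporting (hypothetical -- this assumes sentry_sdk is initialized via SENTRY_DSN and that failed records are collected somewhere like a failed_records list; it is not the actual pipeline code):

# Hypothetical sketch of surfacing failed Records in Sentry; assumes
# sentry_sdk was initialized elsewhere via SENTRY_DSN.
import sentry_sdk

for record in failed_records:  # hypothetical collection of failed Records
    with sentry_sdk.push_scope() as scope:
        scope.set_tag("record_identifier", record.identifier)
        scope.set_tag("exception_stage", record.exception_stage)
        sentry_sdk.capture_exception(record.exception)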

@ehanson8 (Contributor, Author) replied:

A much better approach, updating! Thanks!

@ghukill (Collaborator) commented:

As mentioned in the PR-level comment, inside a specific harvester it's not obvious that this failure handling is happening at the orchestration level. And, while other low-level methods may utilize it, I don't think any of the get_source_records() methods (full and incremental) do. So this is kind of a first for them.

* Refactor incremental_harvest_get_source_records method to populate Record objects with exceptions rather than skipping entirely
* Update corresponding unit test
@ehanson8 requested a review from @ghukill on April 22, 2024 at 20:36.
@ghukill (Collaborator) left a review:

Looks good to me! Thanks for considering the failed records handling approach.

@ehanson8 merged commit 152069f into main on Apr 23, 2024.
5 checks passed.
@ehanson8 deleted the GDT-275-sqs-missing-file branch on April 23, 2024 at 14:00.