Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle missing source record for logging #289

Merged
merged 1 commit into from
Apr 24, 2024

Conversation

ghukill
Copy link
Collaborator

@ghukill ghukill commented Apr 23, 2024

Purpose and background context

Fix small bug where missing Record.source_record results in logging failure during harvest pipeline get_source_records() step.

Now that get_source_records() can return a Record without an attached SourceRecord, we need to skip logging details from the source record if not present in this step.

Error before fix:

...
  File "/<PATH>/geo-harvester/harvester/harvest/__init__.py", line 175, in filter_failed_records
    for record in records:
  File "/<PATH>/geo-harvester/harvester/harvest/__init__.py", line 109, in get_source_records
    f"{record.source_record.event}'"
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'event'

Success after fix, showing problematic record marked as "failed" but process continues:

2024-04-23 11:47:38,334 DEBUG harvester.harvest.get_source_records() line 114: Record foo: retrieved source record, event 'None'
2024-04-23 11:47:38,334 DEBUG harvester.harvest.filter_failed_records() line 187: Record error: {'record_identifier': 'foo', 'harvest_step': 'incremental_harvest_get_source_records', 'exception': OSError("unable to access bucket: 'cdn-origin-dev-222053980223' key: 'cdn/geo/restricted/foo.zip' version: None error: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.")}
2024-04-23 11:47:43,583 INFO harvester.harvest.harvest() line 59: No source records found for harvest parameters, exiting.
2024-04-23 11:47:43,583 INFO harvester.cli.harvest_mit() line 184: {'processed_records_count': 1, 'successful_records': 0, 'failed_records_count': 1, 'failed_step_and_reason_count': defaultdict(<class 'int'>, {"incremental_harvest_get_source_records: unable to access bucket: 'cdn-origin-dev-222053980223' key: 'cdn/geo/restricted/foo.zip' version: None error: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.": 1})}
2024-04-23 11:47:43,583 INFO harvester.cli.harvest_mit() line 186: Total elapsed: 0:00:06.622261

How can a reviewer manually see the effects of these changes?

Change is very minor; logic should be understandable from the change.

But if interested in running, currently one problematic message in SQS queue. Can run incremental MIT harvest:

Env vars:

WORKSPACE=dev
SENTRY_DSN=None
S3_RESTRICTED_CDN_ROOT=s3://cdn-origin-dev-222053980223/cdn/geo/restricted/
S3_PUBLIC_CDN_ROOT=s3://cdn-origin-dev-222053980223/cdn/geo/public/
GEOHARVESTER_SQS_TOPIC_NAME=geo-harvester-input-dev

command (that preserves the SQS message):

pipenv run harvester --verbose \
harvest -t incremental \
-o output/mit_incremental.jsonl \
mit --preserve-sqs-messages

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

  • None

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@ghukill ghukill merged commit 85b778d into main Apr 24, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants