Send EventBridge Events for MIT Harvests #61

Merged Jan 4, 2024 (4 commits)

@ghukill ghukill commented Dec 22, 2023

Purpose and background context

This PR allows the GeoHarvester to send EventBridge (EB) events when processing MIT harvests.

From the send_eventbridge_event() method:

These events are handled by the StepFunction "geo-upload-<ENV>-shapefile-handler".
That StepFunction will take one of three paths based on the event payload:

    1. Copy zip file data from Restricted to Public CDN bucket
        - detail.restricted=false
    2. Delete zip file data from Public CDN bucket
        - detail.restricted=true
    3. Delete zip file data AND metadata from Public CDN bucket
        - detail.deleted=true

The goal is to decouple KNOWING whether a record is deleted or restricted (this
harvester) and actually MANAGING files in S3.  By sending EventBridge events about
the record's deleted and restricted status, that work is handled elsewhere.
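
A minimal sketch of how such an event could be built and sent, not the harvester's actual implementation. Only `detail.restricted` and `detail.deleted` come from the description above; the `Source` and `DetailType` values, the `identifier` field, and the helper names are hypothetical. `client` is assumed to be a boto3 EventBridge client (`boto3.client("events")`). The order in which the StepFunction checks `deleted` versus `restricted` is also an assumption.

```python
from __future__ import annotations

import json
from typing import Any


def build_event_entry(identifier: str, *, restricted: bool, deleted: bool) -> dict[str, str]:
    """Build one PutEvents entry describing a record's status.

    "Source", "DetailType", and the "identifier" key are hypothetical names.
    """
    detail = {"identifier": identifier, "restricted": restricted, "deleted": deleted}
    return {
        "Source": "geo-harvester",  # hypothetical value
        "DetailType": "restricted-deleted-status",  # hypothetical value
        "Detail": json.dumps(detail),  # PutEvents expects Detail as a JSON string
    }


def expected_path(detail: dict[str, Any]) -> str:
    """Mirror the three paths listed above; assumes "deleted" is checked first."""
    if detail.get("deleted"):
        return "delete_data_and_metadata"
    if detail.get("restricted"):
        return "delete_data_from_public"
    return "copy_to_public"


def send_event(client: Any, entry: dict[str, str]) -> str:
    """Send one entry via put_events and return the resulting EventId."""
    response = client.put_events(Entries=[entry])
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"EventBridge rejected the event: {response['Entries']}")
    return response["Entries"][0]["EventId"]
```

Passing the client in as an argument keeps the payload logic testable without AWS credentials.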

NOTE: It is possible that OGM harvests may also need to send EventBridge events, e.g. if we determine that an OGM resource is deleted, then sending an EB event is the mechanism by which the metadata is deleted from the Public CDN bucket. However, OGM harvests have been deprioritized until MIT harvests are fully formed, so I'd propose to keep this logic as an MIT "harvester specific step" until the OGM work is started. Perhaps they will duplicate some logic, perhaps it will be refactored as a shared step.

How can a reviewer manually see the effects of these changes?

1- Set env vars:

WORKSPACE=dev
SENTRY_DSN=None
S3_RESTRICTED_CDN_ROOT=s3://cdn-origin-dev-222053980223/cdn/geo/restricted/
S3_PUBLIC_CDN_ROOT=s3://cdn-origin-dev-222053980223/cdn/geo/public/
GEOHARVESTER_SQS_TOPIC_NAME=geo-harvester-input-dev
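
Before running the harvest, a quick check like the following can confirm the variables are set. This is a sketch only: the helper name is hypothetical, and treating all five variables as required is an assumption based on the list above.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Names taken from the list above; treating all of them as required is an assumption.
REQUIRED_ENV_VARS = (
    "WORKSPACE",
    "SENTRY_DSN",
    "S3_RESTRICTED_CDN_ROOT",
    "S3_PUBLIC_CDN_ROOT",
    "GEOHARVESTER_SQS_TOPIC_NAME",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments checks the current process environment; an empty list means everything is set.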

2- Ensure AWS credentials are set for TimdexManagers in Dev1

3- Note the contents of Public CDN S3 bucket. Easiest way is likely to note timestamps of files, so you can observe they were updated.
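
One way to capture those timestamps is to snapshot the bucket prefix before and after the harvest and compare. The sketch below is illustrative; the helper names are hypothetical, and `client` is assumed to be a boto3 S3 client (`boto3.client("s3")`) with credentials that can list the bucket.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split "s3://bucket/some/prefix/" into ("bucket", "some/prefix/")."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def snapshot(client: Any, s3_uri: str) -> dict[str, datetime]:
    """Map each object key under the URI to its LastModified timestamp."""
    bucket, prefix = split_s3_uri(s3_uri)
    paginator = client.get_paginator("list_objects_v2")
    result: dict[str, datetime] = {}
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from the response when no objects match.
        for obj in page.get("Contents", []):
            result[obj["Key"]] = obj["LastModified"]
    return result


def changed_keys(before: dict[str, datetime], after: dict[str, datetime]) -> list[str]:
    """Return keys that are new or have a different timestamp in `after`."""
    return sorted(key for key, ts in after.items() if before.get(key) != ts)
```

Taking `snapshot(client, S3_PUBLIC_CDN_ROOT)` before and after the harvest and passing both to `changed_keys` lists the files that were created or updated. Keys removed from the bucket are not reported by this sketch.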

4- Run local harvest of 4 sample zip files in fixtures:

pipenv run harvester --verbose harvest \
--harvest-type="full" \
--output-file="s3://timdex-extract-dev-222053980223/geo/my-unique-filename-here.jsonl" \
mit \
--input-files="tests/fixtures/zip_files" \
--skip-sqs-check

After the harvest, note the following:

  • Debug output logs should show events are created:

2023-12-22 10:26:40,846 DEBUG harvester.aws.eventbridge.send_event() line 38: EventBridge event created: d17ca36c-737e-e8d0-0f4f-1ce18d8c1208

  • Note that the zip file SDE_DATA_AE_A8GNS_2003.zip was created or modified in the Public CDN bucket; this is a result of the StepFunction handling the EventBridge event and copying the zip file from the Restricted CDN to the Public CDN, because the normalization of metadata indicated it is NOT a restricted resource

  • Observe the executions for the associated StepFunction; it should have 4 new invocations from the 4 records processed, each sending an EventBridge event

    • for bonus points, observe those invocations and note that some of them result in the step DeleteDataFromPublic, as those resources were Restricted, therefore we need to remove from the Public bucket (gracefully handling if the file was not Public previously, so more like an attempt)
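
The StepFunction observations above can also be made from a script rather than the console. This is a rough sketch with hypothetical helper names; `client` is assumed to be a boto3 Step Functions client (`boto3.client("stepfunctions")`), and the state machine ARN must be supplied by the reviewer since it is not given here.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any


def executions_since(client: Any, state_machine_arn: str, since: datetime) -> list[dict]:
    """Return executions started at or after `since`.

    `since` must be timezone-aware, because boto3 returns aware datetimes.
    """
    paginator = client.get_paginator("list_executions")
    found = []
    for page in paginator.paginate(stateMachineArn=state_machine_arn):
        for execution in page["executions"]:
            if execution["startDate"] >= since:
                found.append(execution)
    return found


def entered_states(client: Any, execution_arn: str) -> list[str]:
    """Return the names of states entered during one execution, in order."""
    paginator = client.get_paginator("get_execution_history")
    names = []
    for page in paginator.paginate(executionArn=execution_arn):
        for event in page["events"]:
            # Only "state entered" history events carry this key.
            details = event.get("stateEnteredEventDetails")
            if details:
                names.append(details["name"])
    return names
```

After the harvest, `executions_since` should return 4 executions if the harvest start time is used as `since`, and checking `"DeleteDataFromPublic" in entered_states(client, arn)` for each one shows which records were treated as restricted.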

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES - StepFunction may be invoked from MIT harvests

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
As covered in more detail in docstrings, MIT harvests are expected to send EventBridge events for information
known about MIT resources harvested e.g. deleted or restricted status.  These events are handled by a
StepFunction concerned with copying and deleting files.

These changes implement an MITHarvester specific harvest step to send out EventBridge events for items
processed.

How this addresses that need:
* Adds new EventBridgeClient class for sending messages
* Completes MITHarvester.send_eventbridge_event harvest step

Side effects of this change:
* EventBridge events will now get published for MIT Harvests, which may invoke any listening AWS assets

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-87
@ghukill ghukill changed the title Gdt 87 send eb events Send EventBridge Events for MIT Harvests Dec 22, 2023
@ghukill ghukill requested review from jonavellecuerdo and ehanson8 and removed request for jonavellecuerdo January 2, 2024 13:59

@jonavellecuerdo jonavellecuerdo left a comment

Hi @ghukill! I think this looks good! Just one question came to mind:

  1. Reading about the 3 paths the StepFunction "geo-upload-<ENV>-shapefile-handler" can take, is the following statement true: nothing gets deleted from the "Restricted" S3 bucket? Is this true even if the GIS team were to delete a restricted record?

ghukill commented Jan 4, 2024

> Hi @ghukill! I think this looks good! Just one question came to mind:
>
>   1. Reading about the 3 paths the StepFunction "geo-upload-<ENV>-shapefile-handler" can take, is the following statement true: nothing gets deleted from the "Restricted" S3 bucket? Is this true even if the GIS team were to delete a restricted record?

Ah, nice catch!! Yes, commit added that adds "AND Restricted":

3. Delete zip file data AND metadata from Public AND Restricted CDN bucket
    - detail.deleted=true

@ehanson8 ehanson8 left a comment

Apologies for the review delay!

@@ -97,20 +99,76 @@ def harvester_specific_steps(self, records: Iterator[Record]) -> Iterator[Record
        records = self.filter_failed_records(self.delete_sqs_messages(records))
        yield from records

    def send_eventbridge_event(self, records: Iterator[Record]) -> Iterator[Record]:
        """Method to send EventBridge event indicating access restrictions.

Great detail here

@ghukill ghukill merged commit 7e590d0 into main Jan 4, 2024
5 checks passed
@ghukill ghukill deleted the GDT-87-send-eb-events branch February 7, 2024 14:07