USE 86 - new source mitlibwebsite extract command #327
Purpose and background context
This PR adds support for generating an extract/harvest command for the new source `mitlibwebsite`. Additionally, it adds `mitlibwebsite` to the `timdex` and `use` aliases.
As noted in the first and larger git commit, the approach was to add this functionality by following established patterns in command generation and helpers. However, the addition of a new harvester is beginning to stress some of the `if/else` patterns in the lambda code. Previously, we had OAI or GIS, and `if/else` statements were arguably okay. Now, with the addition of `mitlibwebsite`, there is quite a bit of branching.
At this point, I feel as though the lambdas are ready for a refactoring. Inputs and outputs could remain identical, but the code organization and branching inside the lambda handlers and helpers could be improved. Proposing to wait on this work, but noting it here in this PR.
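To make the branching concern concrete, here is a minimal sketch of how command generation could be keyed on harvester type instead of chained `if/else` statements. This is not the code in this PR: the function names, the `COMMAND_BUILDERS` table, and the constants are hypothetical, and the flags and S3 paths are copied from the example responses in the testing section of this description. The `--sitemap-from-date` value is assumed to be the run date minus one day, which is what the daily example shows.

```python
from datetime import date, timedelta

# Hypothetical constants, copied from the example responses in this PR.
BUCKET = "timdex-bucket"
SITEMAPS = [
    "https://libraries.mit.edu/sitemap.xml",
    "https://libraries.mit.edu/news/sitemap.xml",
]


def build_mitlibwebsite_command(run_date: str, run_type: str) -> list[str]:
    """Assemble harvest arguments for mitlibwebsite (illustrative only)."""
    source = "mitlibwebsite"
    prefix = f"s3://{BUCKET}/{source}"
    urls_file = f"{prefix}/last-sitemaps-urls.txt"
    command = [
        "--verbose",  # the example payloads set "verbose": true
        "harvest",
        f"--config-yaml-file={prefix}/config/{source}.yaml",
        f"--metadata-output-file={prefix}/{source}-{run_date}-{run_type}"
        "-extracted-records-to-index.jsonl",
    ]
    command += [f"--sitemap={url}" for url in SITEMAPS]
    if run_type == "daily":
        # Assumption: daily runs look back one day from the run date.
        from_date = date.fromisoformat(run_date) - timedelta(days=1)
        command.append(f"--sitemap-from-date={from_date.isoformat()}")
    command.append(f"--sitemap-urls-output-file={urls_file}")
    if run_type == "daily":
        command.append(f"--previous-sitemap-urls-file={urls_file}")
    return command


# Hypothetical dispatch table: one builder per harvester type, so adding a
# harvester means adding an entry rather than another if/else branch.
COMMAND_BUILDERS = {"browsertrix": build_mitlibwebsite_command}


def generate_extract_command(
    harvester_type: str, run_date: str, run_type: str
) -> list[str]:
    """Look up the builder for a harvester type and run it."""
    try:
        builder = COMMAND_BUILDERS[harvester_type]
    except KeyError:
        raise ValueError(f"Unsupported harvester type: {harvester_type}") from None
    return builder(run_date, run_type)
```

Under these assumptions, `generate_extract_command("browsertrix", "2025-10-14", "daily")` would produce the same argument list as the daily example response, and OAI or GIS builders could be registered in the same table without touching the lookup.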
How can a reviewer manually see the effects of these changes?
Before continuing, make sure to complete steps 1 and 2 from the README for local testing via AWS SAM.
Full Harvest
Run the following to invoke the lambda for a payload simulating a `mitlibwebsite` full extract:
    sam local invoke -e tests/fixtures/event_payloads/mitlibwebsite-full-extract.json
The last line contains the lambda response, which formatted looks like:
    {
      "run-date": "2025-10-14",
      "run-type": "full",
      "source": "mitlibwebsite",
      "verbose": true,
      "harvester-type": "browsertrix",
      "next-step": "transform",
      "extract": {
        "extract-command": [
          "--verbose",
          "harvest",
          "--config-yaml-file=s3://timdex-bucket/mitlibwebsite/config/mitlibwebsite.yaml",
          "--metadata-output-file=s3://timdex-bucket/mitlibwebsite/mitlibwebsite-2025-10-14-full-extracted-records-to-index.jsonl",
          "--sitemap=https://libraries.mit.edu/sitemap.xml",
          "--sitemap=https://libraries.mit.edu/news/sitemap.xml",
          "--sitemap-urls-output-file=s3://timdex-bucket/mitlibwebsite/last-sitemaps-urls.txt"
        ]
      }
    }
Observe:
Daily Harvest
Run the following to invoke the lambda for a payload simulating a `mitlibwebsite` daily extract:
    sam local invoke -e tests/fixtures/event_payloads/mitlibwebsite-daily-extract.json
The last line contains the lambda response, which formatted looks like:
    {
      "run-date": "2025-10-14",
      "run-type": "daily",
      "source": "mitlibwebsite",
      "verbose": true,
      "harvester-type": "browsertrix",
      "next-step": "transform",
      "extract": {
        "extract-command": [
          "--verbose",
          "harvest",
          "--config-yaml-file=s3://timdex-bucket/mitlibwebsite/config/mitlibwebsite.yaml",
          "--metadata-output-file=s3://timdex-bucket/mitlibwebsite/mitlibwebsite-2025-10-14-daily-extracted-records-to-index.jsonl",
          "--sitemap=https://libraries.mit.edu/sitemap.xml",
          "--sitemap=https://libraries.mit.edu/news/sitemap.xml",
          "--sitemap-from-date=2025-10-13",
          "--sitemap-urls-output-file=s3://timdex-bucket/mitlibwebsite/last-sitemaps-urls.txt",
          "--previous-sitemap-urls-file=s3://timdex-bucket/mitlibwebsite/last-sitemaps-urls.txt"
        ]
      }
    }
Observe:
- `--sitemap-from-date` is also added because of `run-type="daily"`
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES
What are the relevant tickets?