@ghukill ghukill commented Oct 15, 2025

Purpose and background context

This PR adds support for generating an extract/harvest command for the new source mitlibwebsite. Additionally, it adds mitlibwebsite to the timdex and use aliases.

As noted in the first and larger git commit, this functionality was added by following the established patterns in command generation and helpers. However, adding a new harvester is beginning to strain some of the if/else patterns in the lambda code.

Previously, the only harvester types were OAI and GIS, and if/else statements were arguably okay. Now, with the addition of mitlibwebsite, there is quite a bit of branching.

I feel the lambdas are now ready for a refactoring. Inputs and outputs could remain identical, while the code organization and branching inside the lambda handlers and helpers could be improved. I propose deferring that work, but am noting it here in this PR.
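
As one possible direction for that refactoring, the branching could be replaced by a lookup table keyed on harvester type. The sketch below is only an illustration under stated assumptions: the builder functions, their signatures, their placeholder return values, and the "oai" and "gis" keys are hypothetical and do not reflect the existing lambda code. Only the "browsertrix" harvester type comes from the responses shown in this PR.

```python
from typing import Callable


# Hypothetical builders: names, signatures, and return values are
# placeholders for illustration, not the real helpers in this repository.
def build_oai_command(payload: dict) -> list[str]:
    return ["oai-placeholder", payload["source"]]


def build_gis_command(payload: dict) -> list[str]:
    return ["gis-placeholder", payload["source"]]


def build_browsertrix_command(payload: dict) -> list[str]:
    return ["browsertrix-placeholder", payload["source"]]


# One entry per harvester type; supporting a new harvester means adding
# a row here instead of another if/else branch.
COMMAND_BUILDERS: dict[str, Callable[[dict], list[str]]] = {
    "oai": build_oai_command,  # assumed key
    "gis": build_gis_command,  # assumed key
    "browsertrix": build_browsertrix_command,
}


def generate_extract_command(harvester_type: str, payload: dict) -> list[str]:
    """Look up the builder for a harvester type and call it."""
    try:
        builder = COMMAND_BUILDERS[harvester_type]
    except KeyError:
        raise ValueError(f"Unsupported harvester type: {harvester_type}") from None
    return builder(payload)
```

With this shape, inputs and outputs stay identical while each harvester's logic lives in its own function.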

How can a reviewer manually see the effects of these changes?

Before continuing, make sure to complete steps 1 and 2 from the README for local testing via AWS SAM.

Full Harvest

Run the following to invoke the lambda for a payload simulating a mitlibwebsite full extract:

sam local invoke -e tests/fixtures/event_payloads/mitlibwebsite-full-extract.json

The last line contains the lambda response, which formatted looks like:

{
  "run-date": "2025-10-14",
  "run-type": "full",
  "source": "mitlibwebsite",
  "verbose": true,
  "harvester-type": "browsertrix",
  "next-step": "transform",
  "extract": {
    "extract-command": [
      "--verbose",
      "harvest",
      "--config-yaml-file=s3://timdex-bucket/mitlibwebsite/config/mitlibwebsite.yaml",
      "--metadata-output-file=s3://timdex-bucket/mitlibwebsite/mitlibwebsite-2025-10-14-full-extracted-records-to-index.jsonl",
      "--sitemap=https://libraries.mit.edu/sitemap.xml",
      "--sitemap=https://libraries.mit.edu/news/sitemap.xml",
      "--sitemap-urls-output-file=s3://timdex-bucket/mitlibwebsite/last-sitemaps-urls.txt"
    ]
  }
}

Observe:

  • No "previous" list of sitemap URLs is provided as input or output, because this "full" harvest will establish that file.
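
Since the lambda response arrives as a single unformatted last line, a small helper can pretty-print it for comparison with the JSON above. This is a minimal sketch, assuming the captured output is available as a string and that its last non-empty line is valid JSON; the function name is hypothetical.

```python
import json


def format_last_line(output: str) -> str:
    """Return the last non-empty line of `output`, re-indented as JSON."""
    non_empty_lines = [line for line in output.splitlines() if line.strip()]
    response = json.loads(non_empty_lines[-1])
    return json.dumps(response, indent=2)
```

The same helper works for the daily payload.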

Daily Harvest

Run the following to invoke the lambda for a payload simulating a mitlibwebsite daily extract:

sam local invoke -e tests/fixtures/event_payloads/mitlibwebsite-daily-extract.json

The last line contains the lambda response, which formatted looks like:

{
  "run-date": "2025-10-14",
  "run-type": "daily",
  "source": "mitlibwebsite",
  "verbose": true,
  "harvester-type": "browsertrix",
  "next-step": "transform",
  "extract": {
    "extract-command": [
      "--verbose",
      "harvest",
      "--config-yaml-file=s3://timdex-bucket/mitlibwebsite/config/mitlibwebsite.yaml",
      "--metadata-output-file=s3://timdex-bucket/mitlibwebsite/mitlibwebsite-2025-10-14-daily-extracted-records-to-index.jsonl",
      "--sitemap=https://libraries.mit.edu/sitemap.xml",
      "--sitemap=https://libraries.mit.edu/news/sitemap.xml",
      "--sitemap-from-date=2025-10-13",
      "--sitemap-urls-output-file=s3://timdex-bucket/mitlibwebsite/last-sitemaps-urls.txt",
      "--previous-sitemap-urls-file=s3://timdex-bucket/mitlibwebsite/last-sitemaps-urls.txt"
    ]
  }
}

Observe:

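  • Previous sitemap URLs now appear as both input and output: --previous-sitemap-urls-file and --sitemap-urls-output-file point to the same S3 file, and the daily harvest compares against the previous run's sitemap URLs to find deletions.
  • --sitemap-from-date is also added because run-type="daily"; in this example it is the day before run-date.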
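
The difference between the two run types can be sketched as a single command builder. This is not the code in lambdas/helpers.py: the function name, the constants, and the rule that --sitemap-from-date is one day before run-date are assumptions inferred from the two example responses above.

```python
from datetime import date, timedelta

# Values copied from the example responses; the constant names and this
# layout are assumptions, not the lambda's real configuration.
BUCKET = "timdex-bucket"
SOURCE = "mitlibwebsite"
SITEMAPS = [
    "https://libraries.mit.edu/sitemap.xml",
    "https://libraries.mit.edu/news/sitemap.xml",
]


def build_mitlibwebsite_command(
    run_date: str, run_type: str, verbose: bool = False
) -> list[str]:
    """Assemble harvest CLI arguments for a 'full' or 'daily' run."""
    if run_type not in ("full", "daily"):
        raise ValueError(f"Unsupported run type: {run_type}")

    prefix = f"s3://{BUCKET}/{SOURCE}"
    urls_file = f"{prefix}/last-sitemaps-urls.txt"

    command = ["--verbose"] if verbose else []
    command += [
        "harvest",
        f"--config-yaml-file={prefix}/config/{SOURCE}.yaml",
        f"--metadata-output-file={prefix}/{SOURCE}-{run_date}-{run_type}"
        "-extracted-records-to-index.jsonl",
    ]
    command += [f"--sitemap={url}" for url in SITEMAPS]

    if run_type == "daily":
        # Assumption inferred from the example: one day before run-date.
        from_date = date.fromisoformat(run_date) - timedelta(days=1)
        command.append(f"--sitemap-from-date={from_date.isoformat()}")

    # Every run writes the current sitemap URLs.
    command.append(f"--sitemap-urls-output-file={urls_file}")

    if run_type == "daily":
        # Daily runs also read the file written by the previous run.
        command.append(f"--previous-sitemap-urls-file={urls_file}")

    return command
```

Calling build_mitlibwebsite_command("2025-10-14", "daily", verbose=True) reproduces the argument order of the daily response above, and passing "full" drops the two daily-only flags.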

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES

What are the relevant tickets?

Why these changes are being introduced:

With the addition of the new source 'mitlibwebsite', we need the
pipeline lambdas to parse the StepFunction input payload and
generate an extract/harvest CLI command.

How this addresses that need:
* 'mitlibwebsite' added as a source in config
* followed established structure for extract/harvest commands

Note: while adding support for 'mitlibwebsite' it's starting to
feel like some refactoring might be in order.  Out of scope for
this commit which is tested and confirmed to work.

Side effects of this change:
* mitlibwebsite now a valid source for extract/harvest commands

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-86

coveralls commented Oct 15, 2025

Pull Request Test Coverage Report for Build 18536981192

Details

  • 26 of 27 (96.3%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 95.177%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
lambdas/helpers.py | 10 | 11 | 90.91%

Totals (Coverage Status):

  • Change from base Build 18509110623: 0.06%
  • Covered Lines: 296
  • Relevant Lines: 311

💛 - Coveralls

@ghukill ghukill marked this pull request as ready for review October 15, 2025 15:27
@ghukill ghukill requested a review from a team October 15, 2025 15:29

@ehanson8 ehanson8 left a comment

Code looks good and SAM testing worked as expected!

@ghukill ghukill merged commit ca0a6d9 into main Oct 15, 2025
3 checks passed