@ghukill ghukill commented Oct 3, 2023

What does this PR do?

This PR adds the browsertrix web crawling functionality to the harvester app.

This begins to build out the CLI command harvest. At this point, all it can do is perform a web crawl, and optionally write a WACZ file to a local or remote location, but no metadata records are parsed.

As outlined in the README, a web crawl will only run in a containerized context.

How can a reviewer manually see the effects of these changes?

The easiest way to confirm a web crawl can be performed is to use a convenience Makefile command that is configured to run a small web crawl:

# rebuild the image with new code
make dist-local

# perform local web crawl that will write assets to ./output/crawls
make test-harvest-local

The crawl configuration used (tests/fixtures/lib-website-homepage.yaml) points to a single sitemap XML file, which (as of this writing) returns 139 sites. The config sets a max page limit of 20, and a runtime override passed as JSON arguments lowers it to 15, thereby testing both configuration approaches; the final result should be 15 crawled websites.
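
To illustrate how the two limits interact, here is a minimal sketch, assuming the runtime JSON arguments simply take precedence over values read from the YAML config. The `limit` key and the `merge_crawl_args` helper are hypothetical names for illustration and may not match what the harvester or browsertrix-crawler actually use.

```python
import json
from typing import Optional


def merge_crawl_args(config_values: dict, runtime_json: Optional[str]) -> dict:
    """Return config values with any runtime JSON overrides applied on top."""
    overrides = json.loads(runtime_json) if runtime_json else {}
    if not isinstance(overrides, dict):
        raise ValueError("runtime overrides must be a JSON object")
    # Later keys win, so runtime overrides replace config-file values.
    return {**config_values, **overrides}


# Values as they might be read from the YAML config (hypothetical key name).
config_values = {"limit": 20}
merged = merge_crawl_args(config_values, '{"limit": 15}')
print(merged["limit"])  # prints 15: the runtime override wins
```

With no runtime JSON supplied, the config-file value of 20 would be used unchanged.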

The crawl should take roughly 1-2 minutes, and will output a WACZ file to output/crawls/collections/homepage/homepage.wacz.

To further dig into the results of this crawl, you can unzip this file and observe the contents:

cd output/crawls/collections/homepage
mkdir wacz_content
cd wacz_content
unzip ../homepage.wacz

And the file structure should look like this:

.
├── archive
│   ├── rec-20231003150448167240-c24c9596f4b2.warc.gz
│   ├── rec-20231003150448496200-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450050541-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450196576-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450635219-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450850474-c24c9596f4b2.warc.gz
│   ├── rec-20231003150451119787-c24c9596f4b2.warc.gz
│   ├── rec-20231003150451283860-c24c9596f4b2.warc.gz
│   ├── rec-20231003150451471183-c24c9596f4b2.warc.gz
│   └── rec-20231003150451641445-c24c9596f4b2.warc.gz
├── datapackage-digest.json
├── datapackage.json
├── indexes
│   ├── index.cdx.gz
│   └── index.idx
├── logs
│   └── crawl-20231003150430042.log
└── pages
    ├── extraPages.jsonl
    └── pages.jsonl
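
The same inspection can be done without unzipping to disk. The following is a rough sketch using only the Python standard library; it relies on a WACZ file being a ZIP archive and on pages/pages.jsonl holding one JSON object per line. It assumes page records carry a "url" key and that any header record does not; `list_wacz_pages` is a hypothetical helper name, not part of the harvester.

```python
import json
import zipfile


def list_wacz_pages(wacz_path):
    """Return (member names, page records) from a WACZ file (a ZIP archive)."""
    with zipfile.ZipFile(wacz_path) as wacz:
        names = wacz.namelist()
        pages = []
        with wacz.open("pages/pages.jsonl") as pages_file:
            for raw_line in pages_file:
                line = raw_line.strip()
                if not line:
                    continue
                record = json.loads(line)
                # Assumption: only records describing a crawled page have "url";
                # a leading header record, if present, is skipped.
                if "url" in record:
                    pages.append(record)
    return names, pages
```

Calling `list_wacz_pages("output/crawls/collections/homepage/homepage.wacz")` after the test crawl should return the archive members shown above along with one record per crawled page.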

Includes new or updated dependencies?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-247

Developer

  • All new ENV is documented in README (or there is none)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

The first step in a harvest for this application is performing a web crawl.  This commit adds a Crawler class and a CLI command
for configuring and running browsertrix web crawls inside a Docker container.

At this point, no metadata is parsed, but a WACZ file is produced that will eventually support parsing of metadata records for
websites crawled.

How this addresses that need:

  * adds Crawler class that wraps the functionality of browsertrix-crawler
  * adds CLI commands and configurations for running crawl
  * wraps a web crawl under a larger "harvest" umbrella for eventual metadata parsing as well

Side effects of this change:

None

Relevant ticket(s):
https://mitlibraries.atlassian.net/browse/TIMX-247

Contributor

@ehanson8 ehanson8 left a comment

Looking good, some questions and suggestions. This is complicated but I think you've done a good job of summarizing the details of the config in the README.md

help="YYYY-MM-DD string to filter websites modified after this date in sitemaps",
)
@click.option(
"--wacz-output-file",

Contributor

What about --wacz-output-path for clarity?

Contributor Author

This was a conscious choice to match other apps like OAI harvester and Transmog, which both use this -output-file or input-file convention. Willing to change, but I think the consistency across apps may be desirable.

Contributor

Interesting, I agree on consistency, so I think it's worth having some best practices about how we name filepaths vs file objects, both of which could be input_file, to minimize confusion. This might be minor, but I think it's worth updating other repos if that's what we decide. @jonavellecuerdo Your thoughts?

Contributor Author

Agreed. To me, filename (if ever used) should likely not include a path of any kind.

It's subtle, but I think it's safe to say file is never used by itself (or at least probably shouldn't be). This means you see things like input-file or output_file. We also see filepath or input_filepath, etc. While it'd be nice to be consistent across those, I'm okay with <something>-file and <something>-filepath being mostly interchangeable.

With the caveat that usage should be consistent inside a project. It does feel off to have input-file right next to output-filepath as arguments in the same project.

Contributor

I like <something>-file and <something>-filepath as a convention and agree that consistency inside a project is paramount. After @jonavellecuerdo weighs in, where should we document this?

Contributor

Yeah, that page is appropriate and in need of an update 🙂

Contributor

Hi @ghukill and @ehanson8, I agree with the convention noted above.

  • <something>-file and <something>-filepath being interchangeable;
  • consistent variable naming per project;
  • <something>-file and -filepath can refer to full file paths (e.g., directory/test.xml) or a file-like object.

Contributor

My apologies, I misread Graham's comment and wasn't clear in my own:

  • Consistent variable naming in a project
  • My preference would be -file refers to file objects and -filepath is used for filepaths to minimize confusion. However, if it's at least consistent in a repo and others don't feel strongly, we can stick with interchangeable since that's basically the status quo.

Contributor Author

Given Eric's comment, we might also want to differentiate between CLI arguments and variable names.

For example, I'm unsure how or when we'd pass a file object as a CLI command argument.

At the CLI level, I'd imagine we might want naming conventions of files vs directories, e.g. --foo-file vs --foo-directory.

But agreed that within a Python program there is more nuance (like filepaths vs open file objects, etc.).
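
As a rough illustration of that distinction, the sketch below uses the standard library's argparse (swapped in for click only to keep the example dependency-free) with hypothetical --foo-file and --foo-directory options. The CLI only ever receives path strings; an open file object exists only inside the program.

```python
import argparse
import tempfile
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example")
    # Hypothetical option names: both take path strings, never file objects.
    parser.add_argument("--foo-file", type=Path, required=True)
    parser.add_argument("--foo-directory", type=Path, required=True)
    return parser


def write_report(foo_filepath: Path, text: str) -> int:
    """Open the path and write through the resulting file object."""
    with foo_filepath.open("w", encoding="utf-8") as foo_file:  # file object
        return foo_file.write(text)


with tempfile.TemporaryDirectory() as tmp:
    args = build_parser().parse_args(
        ["--foo-file", f"{tmp}/report.txt", "--foo-directory", tmp]
    )
    write_report(args.foo_file, "ok")
    print(args.foo_file.read_text(encoding="utf-8"))  # prints "ok"
```

Here the -filepath name is reserved for the path and the bare -file name for the open object inside the function, which is one way the two conventions could coexist within a project.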

Contributor

How about making this an item for discussion at the DataEng team meeting next week?

@ghukill ghukill requested a review from ehanson8 October 4, 2023 13:49
@ghukill ghukill merged commit e62e7c3 into code-review-main Oct 5, 2023