Add web crawl capabilities to harvester app #11

ghukill · 2023-10-03T15:10:15Z

What does this PR do?

This PR adds the browsertrix web crawling functionality to the harvester app.

This begins to build out the CLI command harvest. At this point, all it can do is perform a web crawl, and optionally write a WACZ file to a local or remote location, but no metadata records are parsed.

As outlined in the README, a web crawl will only run in a containerized context.
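One way to enforce a "containers only" rule is a guard decorator. The sketch below is hypothetical and is not the decorator from harvester/utils.py, which is not shown in this PR description. The names `running_in_docker` and `require_container` are invented for illustration, and the check for `/.dockerenv` is only a common heuristic for detecting Docker.

```python
import functools
import os
from typing import Any, Callable


def running_in_docker() -> bool:
    """Heuristic: Docker creates /.dockerenv at the root of a container."""
    return os.path.exists("/.dockerenv")


def require_container(
    detector: Callable[[], bool] = running_in_docker,
) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator factory: block calls made outside a container."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Check at call time, so importing the module never fails.
            if not detector():
                raise RuntimeError(
                    f"{func.__name__} must be run inside a container"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_container()
def crawl() -> str:
    """Stand-in for the method that would launch the crawl."""
    return "crawl started"
```

Making the detector injectable keeps the guard testable on a developer machine, where the real check would otherwise always fail.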

How can a reviewer manually see the effects of these changes?

The easiest way to confirm a web crawl can be performed is to use a convenience Makefile command that is configured to run a small web crawl:

# rebuild the image with new code
make dist-local

# perform local web crawl that will write assets to ./output/crawls
make test-harvest-local

The crawl configuration used (tests/fixtures/lib-website-homepage.yaml) points to a single sitemap XML file, which (as of this writing) returns 139 URLs. The configuration sets a max page limit of 20, and an additional runtime override passed as JSON arguments lowers that limit to 15, thereby testing both configuration approaches; the final result should be 15 crawled pages.
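The interplay between the two limits can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the harvester's code: the key name `limit`, both helper names, and the rule that a runtime JSON override simply replaces the value from the YAML file are all illustrative.

```python
import json
from typing import Optional


def effective_config(file_config: dict, overrides_json: Optional[str] = None) -> dict:
    """Merge runtime JSON overrides on top of the file-based configuration."""
    merged = dict(file_config)  # copy so the file config is left untouched
    if overrides_json:
        merged.update(json.loads(overrides_json))
    return merged


def pages_to_crawl(discovered_urls: int, config: dict) -> int:
    """A page limit caps the crawl; without one, every discovered URL is crawled."""
    limit = config.get("limit")
    return discovered_urls if limit is None else min(discovered_urls, limit)


# Values from the description: 139 URLs in the sitemap, 20 in the config file,
# 15 in the runtime override. "limit" is a hypothetical key name.
file_config = {"limit": 20}
final_config = effective_config(file_config, '{"limit": 15}')
print(pages_to_crawl(139, final_config))  # 15
```

With no override the same call yields 20, which is why running with both settings exercises both configuration paths.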

The crawl should take roughly 1-2 minutes, and will output a WACZ file to output/crawls/collections/homepage/homepage.wacz.

To further dig into the results of this crawl, you can unzip this file and observe the contents:

cd output/crawls/collections/homepage
mkdir wacz_content
cd wacz_content
unzip ../homepage.wacz

And the file structure should look like this:

.
├── archive
│   ├── rec-20231003150448167240-c24c9596f4b2.warc.gz
│   ├── rec-20231003150448496200-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450050541-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450196576-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450635219-c24c9596f4b2.warc.gz
│   ├── rec-20231003150450850474-c24c9596f4b2.warc.gz
│   ├── rec-20231003150451119787-c24c9596f4b2.warc.gz
│   ├── rec-20231003150451283860-c24c9596f4b2.warc.gz
│   ├── rec-20231003150451471183-c24c9596f4b2.warc.gz
│   └── rec-20231003150451641445-c24c9596f4b2.warc.gz
├── datapackage-digest.json
├── datapackage.json
├── indexes
│   ├── index.cdx.gz
│   └── index.idx
├── logs
│   └── crawl-20231003150430042.log
└── pages
    ├── extraPages.jsonl
    └── pages.jsonl
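Since a WACZ file is a ZIP archive, the same inspection can be done without unzipping. The sketch below lists the archive members and counts page records in pages/pages.jsonl. It assumes that each page record is a JSON object with a `url` key and that any header line lacks one; the function names are illustrative and not part of the harvester.

```python
import json
import zipfile


def list_wacz_members(wacz_path) -> list:
    """Return the sorted member names inside the WACZ (ZIP) archive."""
    with zipfile.ZipFile(wacz_path) as wacz:
        return sorted(wacz.namelist())


def count_crawled_pages(wacz_path, member: str = "pages/pages.jsonl") -> int:
    """Count JSON Lines records in `member` that carry a "url" key."""
    count = 0
    with zipfile.ZipFile(wacz_path) as wacz:
        with wacz.open(member) as handle:
            for raw_line in handle:
                line = raw_line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                # Assumption: a header line, if present, has no "url" key.
                if "url" in record:
                    count += 1
    return count
```

Calling `count_crawled_pages("output/crawls/collections/homepage/homepage.wacz")` after the test crawl gives a quick check against the expected page count; pages recorded in pages/extraPages.jsonl would need a second call with that member name.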

Includes new or updated dependencies?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-247

Developer

All new ENV is documented in README (or there is none)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

The commit message is clear and follows our guidelines (not just this pull request message)
There are appropriate tests covering any new functionality
The documentation has been updated or is unnecessary
The changes have been verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:
The first step in a harvest for this application is performing a web crawl. This commit adds a Crawler class and a CLI command for configuring and running browsertrix web crawls inside a Docker container. At this point, no metadata is parsed, but a WACZ file is produced that will eventually support parsing of metadata records for websites crawled.

How this addresses that need:
* adds Crawler class that wraps the functionality of browsertrix-crawler
* adds CLI commands and configurations for running crawl
* wraps a web crawl under a larger "harvest" umbrella for eventual metadata parsing as well

Side effects of this change:
None

Relevant ticket(s):
https://mitlibraries.atlassian.net/browse/TIMX-247

ehanson8

Looking good, some questions and suggestions. This is complicated but I think you've done a good job of summarizing the details of the config in the README.md

ehanson8 · 2023-10-03T21:13:08Z

harvester/cli.py

+    help="YYYY-MM-DD string to filter websites modified after this date in sitemaps",
+)
+@click.option(
+    "--wacz-output-file",


What about --wacz-output-path for clarity?

This was a conscious choice to match other apps like OAI harvester and Transmog, which both use this -output-file or -input-file convention. Willing to change, but I think the consistency across apps may be desirable.

Interesting, I agree on consistency so I think it's worth having some best practices about how we name filepaths vs file objects, both of which could be input_file, to minimize confusion. This might be minor but I think it's worth updating other repos if that's what we decide. @jonavellecuerdo Your thoughts?

Agreed. To me, filename (if ever used) should likely not include a path of any kind.

It's subtle, but I think it's safe to say file is never used by itself (or at least probably shouldn't be). This means you see things like input-file or output_file. We also see filepath or input_filepath, etc. While it'd be nice to be consistent across those, I'm okay with <something>-file and <something>-filepath being mostly interchangeable.

With the caveat that usage should be consistent inside of a project. It does feel off to have input-file right next to output-filepath as arguments in the same project.

I like <something>-file and <something>-filepath as a convention and agree that consistency inside a project is paramount. After @jonavellecuerdo weighs in, where should we document this?

Yeah, that page is appropriate and in need of an update 🙂

Hi @ghukill and @ehanson8, I agree with the convention noted above.

<something>-file and <something>-filepath being interchangeable;

consistent variable naming per project;

<something>-file and -filepath can refer to full file paths (e.g., directory/test.xml) or a file-like object.

My apologies, I misread Graham's comment and wasn't clear in my own:

Consistent variable naming in a project

My preference would be -file refers to file objects and -filepath is used for filepaths to minimize confusion. However, if it's at least consistent in a repo and others don't feel strongly, we can stick with interchangeable since that's basically the status quo.

Given Eric's comment we might also want to differentiate between CLI arguments and variable names.

For example, I'm unsure how or when we'd pass a file object as a CLI command argument.

At the CLI level, I'd imagine we might want naming conventions of files vs directories, e.g. --foo-file vs --foo-directory.

But agreed that within a python program there is more nuance (like filepaths vs open file objects, etc.).

How about this is an item for discussion at the DataEng team meeting next week?
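To make the CLI-level idea concrete, here is a small sketch of the file vs directory naming convention discussed in this thread. It uses the standard library's argparse instead of click purely to stay dependency-free. Only `--wacz-output-file` comes from this PR; the `--scratch-directory` option, the program name, and the defaults are hypothetical.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose option suffixes say what kind of path is expected."""
    parser = argparse.ArgumentParser(prog="harvest-sketch")
    # "-file": a full path to a single file, never an open file object.
    parser.add_argument(
        "--wacz-output-file",
        type=Path,
        default=None,
        help="Path (local or remote) where the WACZ file is written",
    )
    # "-directory": a path to a folder (hypothetical option for illustration).
    parser.add_argument(
        "--scratch-directory",
        type=Path,
        default=Path("output/crawls"),
        help="Folder that receives intermediate crawl assets",
    )
    return parser


args = build_parser().parse_args(
    ["--wacz-output-file", "output/crawls/homepage.wacz"]
)
print(args.wacz_output_file.name)  # homepage.wacz
print(args.scratch_directory)
```

Because a CLI can only receive strings, the file-object vs filepath distinction only arises inside the Python code, which is where a separate variable-naming convention would apply.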

ghukill added 2 commits October 3, 2023 10:52

update README with crawl documentation

1490092

ghukill requested review from ehanson8 and jonavellecuerdo October 3, 2023 15:10

ehanson8 reviewed Oct 4, 2023

ghukill added 2 commits October 4, 2023 09:19

make CLI arg --crawl-name optional

dd83ba4

capitlize log messages

0773d39

ghukill force-pushed the pr2-web-crawl branch from 8f5c9db to 0773d39 Compare October 4, 2023 13:28

ghukill added 3 commits October 4, 2023 09:28

prefer direct check over double negative

530c350

additional not None syntax updates

443486f

update test names and capitalize logging

665f42a

ghukill requested a review from ehanson8 October 4, 2023 13:49

ehanson8 approved these changes Oct 4, 2023

jonavellecuerdo reviewed Oct 4, 2023

harvester/utils.py

docstring for container decorator

3bb69e7

jonavellecuerdo approved these changes Oct 5, 2023

ghukill merged commit e62e7c3 into code-review-main Oct 5, 2023

4 participants
