Add web crawl capabilities to harvester app #11
Conversation
Why these changes are being introduced:

The first step in a harvest for this application is performing a web crawl. This commit adds a Crawler class and a CLI command for configuring and running browsertrix web crawls inside a Docker container. At this point, no metadata is parsed, but a WACZ file is produced that will eventually support parsing of metadata records for the websites crawled.

How this addresses that need:
* adds a Crawler class that wraps the functionality of browsertrix-crawler
* adds CLI commands and configurations for running a crawl
* wraps the web crawl under a larger "harvest" umbrella for eventual metadata parsing as well

Side effects of this change: None

Relevant ticket(s): https://mitlibraries.atlassian.net/browse/TIMX-247
ehanson8 left a comment
Looking good, some questions and suggestions. This is complicated but I think you've done a good job of summarizing the details of the config in the README.md
    help="YYYY-MM-DD string to filter websites modified after this date in sitemaps",
)
@click.option(
    "--wacz-output-file",
What about --wacz-output-path for clarity?
This was a conscious choice to match other apps like the OAI harvester and Transmog, which both use this -output-file or -input-file convention. Willing to change, but I think the consistency across apps may be desirable.
Interesting, I agree on consistency, so I think it's worth having some best practices about how we name filepaths vs file objects, both of which could be input_file, to minimize confusion. This might be minor, but I think it's worth updating other repos if that's what we decide. @jonavellecuerdo Your thoughts?
Agreed. To me, filename (if ever used) should likely not include a path of any kind.
It's subtle, but I think it's safe to say file is never used by itself (or at least probably shouldn't be). This means you see things like input-file or output_file. We also see filepath or input_filepath, etc. While it'd be nice to be consistent across those, I'm okay with <something>-file and <something>-filepath being mostly interchangeable.
With the caveat that usage should be consistent inside of a project. It does feel off to have input-file right next to output-filepath as arguments in the same project.
I like <something>-file and <something>-filepath as a convention and agree that consistency inside a project is paramount. After @jonavellecuerdo weighs in, where should we document this?
Yeah, that page is appropriate and in need of an update 🙂
My apologies, I misread Graham's comment and wasn't clear in my own:
- Consistent variable naming in a project
- My preference would be that -file refers to file objects and -filepath is used for filepaths, to minimize confusion (see the sketch below). However, if it's at least consistent in a repo and others don't feel strongly, we can stick with interchangeable since that's basically the status quo.
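A minimal sketch of the distinction being proposed, assuming hypothetical function and variable names (nothing here is from this repository):

```python
from typing import TextIO


def write_records(output_file: TextIO) -> None:
    """Per the proposed convention, '-file' / '_file' names refer to open file objects."""
    output_file.write("record data\n")


def write_records_at(output_filepath: str) -> None:
    """Per the proposed convention, '-filepath' / '_filepath' names refer to path strings."""
    with open(output_filepath, "w") as output_file:
        write_records(output_file)
```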
Given Eric's comment, we might also want to differentiate between CLI arguments and variable names.
For example, I'm unsure how or when we'd pass a file object as a CLI command argument.
At the CLI level, I'd imagine we might want naming conventions for files vs directories, e.g. --foo-file vs --foo-directory.
But agreed that within a Python program there is more nuance (like filepaths vs open file objects, etc.).
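For illustration only, a hedged sketch of how that CLI-level convention might look with click; the --wacz-output-file option appears in this PR, but --crawl-output-directory and the command body are assumptions made up for contrast:

```python
import click


@click.command()
@click.option(
    "--wacz-output-file",
    type=click.Path(dir_okay=False),
    help="Filepath where the WACZ file will be written.",
)
@click.option(
    "--crawl-output-directory",  # hypothetical option, shown only to contrast file vs directory naming
    type=click.Path(file_okay=False),
    help="Directory where crawl artifacts will be written.",
)
def harvest(wacz_output_file: str, crawl_output_directory: str) -> None:
    """CLI arguments arrive as path strings; open file objects only exist inside the code."""
    click.echo(f"WACZ: {wacz_output_file}, crawl directory: {crawl_output_directory}")


if __name__ == "__main__":
    harvest()
```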
How about this as an item for discussion at the DataEng team meeting next week?
What does this PR do?
This PR adds the browsertrix web crawling functionality to the harvester app.
This begins to build out the CLI command harvest. At this point, all it can do is perform a web crawl, and optionally write a WACZ file to a local or remote location, but no metadata records are parsed. As outlined in the README, a web crawl will only run in a containerized context.
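As a rough sketch of the approach described above, one way a wrapper could shell out to browsertrix-crawler; this is not the actual Crawler implementation from this PR, and the Docker image tag, flags, and mount paths are assumptions based on browsertrix-crawler's documented usage:

```python
import subprocess


def run_crawl(config_path: str, collection_name: str) -> None:
    """Shell out to browsertrix-crawler in a Docker container to produce a WACZ file."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{config_path}:/app/crawl-config.yaml",
            "-v", "./output/crawls:/crawls",
            "webrecorder/browsertrix-crawler",
            "crawl",
            "--config", "/app/crawl-config.yaml",
            "--collection", collection_name,
            "--generateWACZ",
        ],
        check=True,
    )
```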
How can a reviewer manually see the effects of these changes?
The easiest way to confirm a web crawl can be performed is to use a convenience Makefile command that is configured to run a small web crawl:
The crawl configuration used (tests/fixtures/lib-website-homepage.yaml) is configured to use a single sitemap XML file, which (as of this writing) is returning 139 sites. But there is a max page limit set to 20 pages, and then an additional runtime override using JSON arguments to limit it to 15, thereby testing both configuration approaches; the final result should be 15 crawled websites.

The crawl should take roughly 1-2 minutes and will output a WACZ file to output/crawls/collections/homepage/homepage.wacz.

To further dig into the results of this crawl, you can unzip this file and observe the contents:
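Since a WACZ file is a standard ZIP archive, one quick way to inspect it is with Python's zipfile module (a minimal sketch; the path matches the output location noted above, and the extraction directory is an arbitrary choice):

```python
import zipfile

wacz_path = "output/crawls/collections/homepage/homepage.wacz"

with zipfile.ZipFile(wacz_path) as wacz:
    # List every entry in the archive (WARC data, indexes, pages, etc.).
    for name in wacz.namelist():
        print(name)
    # Extract the contents alongside the archive for closer inspection.
    wacz.extractall("output/crawls/collections/homepage/homepage")
```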
And the file structure should look like this:
Includes new or updated dependencies?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/TIMX-247