Pr3 parser #12
Conversation
Why these changes are being introduced: The second part of a harvest is generating metadata records from a web crawl. This adds functionality to select an output metadata file that will trigger the harvest command to parse records from the completed web crawl.
How this addresses that need:
* adds CrawlParser class responsible for parsing metadata records
* adds standalone CLI command for parsing an existing WACZ file
Side effects of this change: None
Relevant ticket(s): https://mitlibraries.atlassian.net/browse/TIMX-247
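As a rough illustration of the new pieces, here is a minimal sketch of a CrawlParser-style class reading page records out of a WACZ file (which is a zip archive). Everything except the CrawlParser name and the pages/pages.jsonl path is an assumption for illustration, not the actual implementation.

# Minimal, hypothetical sketch only; method names and structure are assumptions.
import io
import json
import zipfile

import smart_open


class CrawlParser:
    """Parse metadata records from a completed web crawl's WACZ file."""

    def __init__(self, wacz_filepath: str) -> None:
        self.wacz_filepath = wacz_filepath

    def page_records(self) -> list[dict]:
        """Return one dictionary per JSON line in pages/pages.jsonl."""
        # A WACZ file is a zip archive; smart_open handles local or remote paths.
        with smart_open.open(self.wacz_filepath, "rb") as file_obj:
            wacz_archive = zipfile.ZipFile(io.BytesIO(file_obj.read()))
        with wacz_archive.open("pages/pages.jsonl") as jsonl_file:
            return [
                json.loads(line)
                for line in jsonl_file.read().decode().splitlines()
                if line.strip()
            ]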
ehanson8 left a comment:
Questions and some suggested changes
Why these changes are being introduced: While our current, very limited use cases only required the pages/extraPages.jsonl file, it is conceivable that we might want to also use this application for crawls that do not only get URLs from sitemaps.
How this addresses that need: This creates the canonical list of URLs from both the pages/pages.jsonl and pages/extraPages.jsonl files. This commit also introduces some additional handling of files missing from WACZ files, and better error handling if a crawl does not produce a WACZ file in the first place.
Side effects of this change: A harvest will exit before attempting to parse metadata records if the crawl was unsuccessful and no WACZ file exists.
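A minimal sketch of what building that canonical URL list could look like, assuming each JSONL line carries a url field; this is illustrative only and not the code from this PR.

# Hypothetical sketch: combine pages/pages.jsonl and pages/extraPages.jsonl,
# tolerating a file that is missing from the WACZ archive.
import json
import zipfile

PAGES_FILES = ("pages/pages.jsonl", "pages/extraPages.jsonl")


def canonical_urls(wacz_archive: zipfile.ZipFile) -> list[str]:
    urls: list[str] = []
    for pages_file in PAGES_FILES:
        if pages_file not in wacz_archive.namelist():
            continue  # e.g. no extraPages.jsonl for a sitemap-only crawl
        with wacz_archive.open(pages_file) as jsonl_file:
            for line in jsonl_file.read().decode().splitlines():
                if not line.strip():
                    continue
                record = json.loads(line)
                if "url" in record:
                    urls.append(record["url"])
    return urls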
Why these changes are being introduced: Previously, much of this functionality was grouped under a single CrawlParser class. This mixed the task of generating metadata records from a completed crawl with the somewhat complex helpers and utilities for interacting with a WACZ file.
How this addresses that need: This refactor breaks the functionality into two classes: CrawlMetadataParser and WACZClient. This isolates the functionality, allowing for easier code readability and testing. Additionally, this includes another pass at variable renaming and docstring improvement.
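The rough shape of that split, sketched under the assumption that WACZClient wraps the archive and CrawlMetadataParser consumes it; the constructor signatures and method names below are illustrative, not the actual API.

# Illustrative outline only; signatures and methods are assumptions.
from collections.abc import Iterator


class WACZClient:
    """Helpers and utilities for interacting with a WACZ (zip) archive."""

    def __init__(self, wacz_filepath: str) -> None:
        self.wacz_filepath = wacz_filepath
        self._wacz_archive = None  # loaded lazily on first access


class CrawlMetadataParser:
    """Generate metadata records from a completed crawl via a WACZClient."""

    def __init__(self, wacz_client: WACZClient) -> None:
        self.wacz_client = wacz_client

    def generate_metadata_records(self) -> Iterator[dict]:
        """Yield one metadata record per crawled page (illustrative stub)."""
        yield from ()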
ehanson8 left a comment:
A major improvement, much easier to understand! Some questions and suggestions
ehanson8 left a comment:
Well done, great refactor!
with smart_open.open(self.wacz_filepath, "rb") as file_obj:
    _wacz_archive = zipfile.ZipFile(io.BytesIO(file_obj.read()))
    self._wacz_archive = _wacz_archive
    return self._wacz_archive
Got it, thanks!
# CLI commands
shell:
	pipenv run harvest-dockerized shell
docker-shell:
How/when is this command used compared to test-harvest-local?
The Makefile command test-harvest-local actually kicks off a harvest, while this docker-shell command merely opens a bash shell in a docker container of the application image. It's mostly for debugging.
It doesn't have to be done as part of this PR, but I think it would be helpful if the README had instructions on how to use this command for debugging!
@ghukill Thank you so much for taking the time to carefully document important classes and methods! I would definitely need more time to understand it at a much deeper level (which I aim to do by attempting to create a diagram for this application) and will probably have more feedback then. However, from a high-level perspective, I think the codebase is well-organized and documented, and it's apparent that a lot of care and thought has been put into it. :) I just have three clarifying questions above!
What does this PR do?
This PR continues to extend this application to parse metadata records from a completed crawl.
Helpful background context
As noted throughout the README, the motivating factor for wrapping browsertrix-crawler in a container was so that we could perform actions after the crawl; one of those actions is parsing metadata records representing websites from that crawl.
As noted in documentation about metadata parsing, the creation of metadata records is an opinionated process. This harvester was built to support crawling of the library websites and producing records for TIMDEX.
While this CrawlParser class could be extended in a number of ways to generate meaningful metadata records about non-library websites, and it would technically work for them, the additional metadata it extracts from the actual HTML content is geared towards our Library WordPress websites.
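To make that concrete, here is a hypothetical sketch of pulling metadata out of a crawled page's HTML. The fields shown (title, meta description, Open Graph title) are assumptions for illustration; the actual fields the parser extracts are not enumerated in this PR description.

# Hypothetical example only; the real parser's fields are not listed in this PR.
from bs4 import BeautifulSoup


def extract_html_metadata(html: str) -> dict:
    """Extract a few common page-level fields from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    metadata = {}
    if soup.title and soup.title.string:
        metadata["title"] = soup.title.string.strip()
    description = soup.find("meta", attrs={"name": "description"})
    if description and description.get("content"):
        metadata["description"] = description["content"]
    og_title = soup.find("meta", attrs={"property": "og:title"})
    if og_title and og_title.get("content"):
        metadata["og_title"] = og_title["content"]
    return metadata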
How can a reviewer manually see the effects of these changes?
Update the docker image and run a test harvest via a Make command (the review thread above notes that test-harvest-local kicks off a harvest).
Once the crawl is complete, a single XML file should be available locally on the host machine (as the harvest was performed via a docker container) at output/crawls/collections/homepage/homepage.xml. Reviewing this XML file will reveal what kind of data is parsed from the websites crawled.
Includes new or updated dependencies?
NO
What are the relevant tickets?
https://mitlibraries.atlassian.net/browse/TIMX-247