@ghukill ghukill commented Oct 5, 2023

What does this PR do?

This PR continues to extend this application to parse metadata records from a completed crawl.

Helpful background context

As noted throughout the README, the motivating factor for wrapping browsertrix-crawler in a container was so that we could perform actions after the crawl completes; the primary one is parsing metadata records representing websites from that crawl.

As noted in documentation about metadata parsing, the creation of metadata records is an opinionated process. This harvester was built to support crawling of the library websites and producing records for TIMDEX.

While this CrawlParser class could be extended in a number of ways to generate meaningful metadata records about non-library websites, and technically it would work for them, the additional metadata it extracts from the actual HTML content is geared towards our Library WordPress websites.
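
For a rough sense of what that page-level extraction involves, here is a minimal sketch that pulls a couple of fields from raw HTML with BeautifulSoup. The function name, fields, and tags are illustrative assumptions, not the actual CrawlParser implementation:

from bs4 import BeautifulSoup


def parse_page_metadata(html: str) -> dict:
    """Hypothetical example: pull a few fields commonly present on our WordPress pages."""
    soup = BeautifulSoup(html, "html.parser")
    metadata = {}
    if soup.title and soup.title.string:
        metadata["title"] = soup.title.string.strip()
    og_description = soup.find("meta", attrs={"property": "og:description"})
    if og_description and og_description.get("content"):
        metadata["description"] = og_description["content"]
    return metadata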

How can a reviewer manually see the effects of these changes?

Update the docker image and run a test harvest via a Make command:

make dist-local
make test-harvest-local

Once the crawl is complete, a single XML file should be available locally on the host machine (as the harvest was performed via a Docker container) at output/crawls/collections/homepage/homepage.xml.

Reviewing this XML file will reveal what kind of data is parsed from the websites crawled.

Includes new or updated dependencies?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-247

Developer

  • All new ENV is documented in README (or there is none)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

The second part of a harvest is generating metadata records from a web crawl. This adds functionality to select an output metadata file which, when provided, triggers the harvest command to parse records from the completed web crawl.

How this addresses that need:
* adds a CrawlParser class responsible for parsing metadata records
* adds a standalone CLI command for parsing an existing WACZ file (see the sketch below)
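
As a rough illustration of that standalone parsing command, a click-based sketch might look like the following. The command name, option names, and the use of click are assumptions and may not match the actual CLI:

import click


@click.command()
@click.option("--wacz-input-file", required=True, help="Path to an existing WACZ file to parse.")
@click.option("--metadata-output-file", required=True, help="Where to write the generated metadata records.")
def parse_url_content(wacz_input_file: str, metadata_output_file: str) -> None:
    """Generate metadata records from an already completed crawl's WACZ file."""
    # hypothetical: hand off to the metadata parser and write records to the output file
    click.echo(f"parsing {wacz_input_file} -> {metadata_output_file}")


if __name__ == "__main__":
    parse_url_content()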

Side effects of this change:

None

Relevant ticket(s):

https://mitlibraries.atlassian.net/browse/TIMX-247
@ehanson8 (Contributor) left a comment

Questions and some suggested changes

Why these changes are being introduced:

While our current, very limited use cases only required the pages/extraPages.jsonl file, it is conceivable that we might also want to use this application for crawls that do not get URLs only from sitemaps.

How this addresses that need:

This creates the canonical list of URLs from both the pages/pages.jsonl and pages/extraPages.jsonl files.

This commit also introduces some additional handling of files missing from within WACZ archives, and better error handling if a crawl does not produce a WACZ file in the first place.

Side effects of this change:

A harvest will exit before attempting to parse metadata records if the crawl was unsuccessful and no WACZ file exists.
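
A minimal sketch of the behavior described in the commit above, assuming each JSONL line in the page listings is a JSON object with a url key (the helper name and structure are assumptions):

import json
import zipfile
from pathlib import Path

PAGE_FILES = ["pages/pages.jsonl", "pages/extraPages.jsonl"]


def get_crawled_urls(wacz_filepath: str) -> set[str]:
    """Hypothetical helper: build the canonical set of crawled URLs from both page listings."""
    if not Path(wacz_filepath).exists():
        # the crawl did not produce a WACZ file; exit before attempting to parse metadata
        raise FileNotFoundError(f"WACZ file not found: {wacz_filepath}")
    urls: set[str] = set()
    with zipfile.ZipFile(wacz_filepath) as wacz_archive:
        for page_file in PAGE_FILES:
            if page_file not in wacz_archive.namelist():
                continue  # tolerate a page listing missing from inside the WACZ
            with wacz_archive.open(page_file) as file_obj:
                for line in file_obj:
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    if "url" in record:  # skip the JSONL header line, which has no url
                        urls.add(record["url"])
    return urls
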
Why these changes are being introduced:

Previously, much of this functionality was grouped under a single CrawlParser class. This mixed the task of generating metadata records from a completed crawl with the somewhat complex helpers and utilities for interacting with a WACZ file.

How this addresses that need:

This refactor breaks the functionality into two classes: CrawlMetadataParser and WACZClient.  This
helps isolate the functionality, allowing for easier code readability and testing.

Additionally, this includes another pass at variable renaming and docstring improvements.
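
A skeleton of the split described above might look roughly like this; the method names and signatures are assumptions for illustration, not the actual interfaces:

import zipfile


class WACZClient:
    """Sketch: helpers and utilities for interacting with a WACZ file."""

    def __init__(self, wacz_filepath: str):
        self.wacz_filepath = wacz_filepath

    def read_file(self, filename: str) -> bytes:
        # read a single file (e.g., a pages JSONL listing) out of the WACZ archive
        with zipfile.ZipFile(self.wacz_filepath) as wacz_archive:
            return wacz_archive.read(filename)


class CrawlMetadataParser:
    """Sketch: generates metadata records from a completed crawl, via a WACZClient."""

    def __init__(self, wacz_client: WACZClient):
        self.wacz_client = wacz_client

    def generate_metadata(self) -> str:
        # build XML metadata records from crawled page data; real logic omitted here
        raise NotImplementedError

Keeping WACZ file access behind its own class means CrawlMetadataParser can be exercised with a stubbed WACZClient, which is the readability and testing benefit the commit describes.
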
@ehanson8 (Contributor) left a comment

A major improvement, much easier to understand! Some questions and suggestions

@ghukill ghukill requested a review from ehanson8 October 11, 2023 19:06
@ehanson8 (Contributor) left a comment

Well done, great refactor!

with smart_open.open(self.wacz_filepath, "rb") as file_obj:
    # read the WACZ (zip) file into memory and open it as a ZipFile
    _wacz_archive = zipfile.ZipFile(io.BytesIO(file_obj.read()))
# cache the opened archive so later property accesses reuse it
self._wacz_archive = _wacz_archive
return self._wacz_archive
Got it, thanks!
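
For context on the snippet quoted above, a self-contained version of that lazy-load-and-cache pattern could look like the following; the class and attribute names come from the snippet, while the surrounding property structure is an assumption:

import io
import zipfile

import smart_open  # handles local paths and remote URIs (e.g., S3) alike


class WACZClient:
    def __init__(self, wacz_filepath: str):
        self.wacz_filepath = wacz_filepath
        self._wacz_archive: zipfile.ZipFile | None = None

    @property
    def wacz_archive(self) -> zipfile.ZipFile:
        # open the WACZ (zip) file once on first access and reuse the cached archive afterwards
        if self._wacz_archive is None:
            with smart_open.open(self.wacz_filepath, "rb") as file_obj:
                self._wacz_archive = zipfile.ZipFile(io.BytesIO(file_obj.read()))
        return self._wacz_archive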

# CLI commands
shell:
	pipenv run harvest-dockerized shell
docker-shell:
@jonavellecuerdo (Contributor) commented Oct 12, 2023

How/when is this command used compared to test-harvest-local?

@ghukill (Contributor, Author) replied:

The Makefile command test-harvest-local actually kicks off a harvest, while this docker-shell command merely opens a bash shell in a Docker container of the application image. It's mostly for debugging.

It doesn't have to be done as part of this PR, but I think it would be helpful if the README had instructions on how to use this command for debugging!

@jonavellecuerdo (Contributor) commented

@ghukill Thank you so much for taking the time to carefully document important classes and methods! I would definitely need more time to understand it at a much deeper level (which I aim to do by attempting to create a diagram for this application) and will probably have more feedback then. However, from a high-level perspective, I think the codebase is well-organized and documented, and it's apparent that a lot of care and thought has been put into it. :) I just have three clarifying questions above!

@ghukill ghukill merged commit cf7365e into code-review-main Oct 13, 2023