@ghukill ghukill commented Oct 5, 2023

What does this PR do?

This PR continues to extend this application to parse metadata records from a completed crawl.

Helpful background context

As noted throughout the README, the motivating factor for wrapping browsertrix-crawler in a container was so that we could perform actions after the crawl completes; the primary one is parsing metadata records representing websites from that crawl.

As noted in documentation about metadata parsing, the creation of metadata records is an opinionated process. This harvester was built to support crawling of the library websites and producing records for TIMDEX.

While this CrawlParser class could be extended in a number of ways to generate meaningful metadata records about non-library websites, and technically it would work for them, the additional metadata it extracts from the actual HTML content is geared towards our Library WordPress websites.
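
For a rough sense of what that page-level extraction involves, here is a minimal sketch that pulls a couple of fields from raw HTML with BeautifulSoup. The function name, fields, and tags are illustrative assumptions, not the actual CrawlParser implementation:

from bs4 import BeautifulSoup


def parse_page_metadata(html: str) -> dict:
    """Hypothetical example: pull a few fields commonly present on our WordPress pages."""
    soup = BeautifulSoup(html, "html.parser")
    metadata = {}
    if soup.title and soup.title.string:
        metadata["title"] = soup.title.string.strip()
    og_description = soup.find("meta", attrs={"property": "og:description"})
    if og_description and og_description.get("content"):
        metadata["description"] = og_description["content"]
    return metadata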

How can a reviewer manually see the effects of these changes?

Update the docker image and run a test harvest via a Make command:

make dist-local
make test-harvest-local

Once the crawl is complete, a single XML file should be available locally on the host machine (as the harvest was performed via a Docker container) at output/crawls/collections/homepage/homepage.xml.

Reviewing this XML file will reveal what kind of data is parsed from the websites crawled.

Includes new or updated dependencies?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-247

Developer

  • All new ENV is documented in README (or there is none)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

The second part of a harvest is generating metadata records from a web crawl. This adds functionality to select an output metadata file which, when provided, triggers the harvest command to parse records from the completed web crawl.

How this addresses that need:
* adds a CrawlParser class responsible for parsing metadata records
* adds a standalone CLI command for parsing an existing WACZ file (see the sketch below)
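
As a rough illustration of that standalone parsing command, a click-based sketch might look like the following. The command name, option names, and the use of click are assumptions and may not match the actual CLI:

import click


@click.command()
@click.option("--wacz-input-file", required=True, help="Path to an existing WACZ file to parse.")
@click.option("--metadata-output-file", required=True, help="Where to write the generated metadata records.")
def parse_url_content(wacz_input_file: str, metadata_output_file: str) -> None:
    """Generate metadata records from an already completed crawl's WACZ file."""
    # hypothetical: hand off to the metadata parser and write records to the output file
    click.echo(f"parsing {wacz_input_file} -> {metadata_output_file}")


if __name__ == "__main__":
    parse_url_content()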

Side effects of this change:

None

Relevant ticket(s):

https://mitlibraries.atlassian.net/browse/TIMX-247
@ehanson8 (Contributor) left a comment

Questions and some suggested changes

Why these changes are being introduced:

While our current, very limited use cases only required the pages/extraPages.jsonl file, it is conceivable that we might also want to use this application for crawls that do not get URLs only from sitemaps.

How this addresses that need:

This creates the canonical list of URLs from both the pages/pages.jsonl and pages/extraPages.jsonl files.

This commit also introduces some additional handling of files missing from within WACZ archives, and better error handling if a crawl does not produce a WACZ file in the first place.

Side effects of this change:

A harvest will exit before attempting to parse metadata records if the crawl was unsuccessful and no WACZ file exists.
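
A minimal sketch of the behavior described in the commit above, assuming each JSONL line in the page listings is a JSON object with a url key (the helper name and structure are assumptions):

import json
import zipfile
from pathlib import Path

PAGE_FILES = ["pages/pages.jsonl", "pages/extraPages.jsonl"]


def get_crawled_urls(wacz_filepath: str) -> set[str]:
    """Hypothetical helper: build the canonical set of crawled URLs from both page listings."""
    if not Path(wacz_filepath).exists():
        # the crawl did not produce a WACZ file; exit before attempting to parse metadata
        raise FileNotFoundError(f"WACZ file not found: {wacz_filepath}")
    urls: set[str] = set()
    with zipfile.ZipFile(wacz_filepath) as wacz_archive:
        for page_file in PAGE_FILES:
            if page_file not in wacz_archive.namelist():
                continue  # tolerate a page listing missing from inside the WACZ
            with wacz_archive.open(page_file) as file_obj:
                for line in file_obj:
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    if "url" in record:  # skip the JSONL header line, which has no url
                        urls.add(record["url"])
    return urls
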
Why these changes are being introduced:

Previously, much of this functionality was grouped under a single CrawlParser class. This mixed the task of generating metadata records from a completed crawl with the somewhat complex helpers and utilities for interacting with a WACZ file.

How this addresses that need:

This refactor breaks the functionality into two classes: CrawlMetadataParser and WACZClient.  This
helps isolate the functionality, allowing for easier code readability and testing.

Additionally, this includes another pass at variable renaming and docstring improvements.
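
A skeleton of the split described above might look roughly like this; the method names and signatures are assumptions for illustration, not the actual interfaces:

import zipfile


class WACZClient:
    """Sketch: helpers and utilities for interacting with a WACZ file."""

    def __init__(self, wacz_filepath: str):
        self.wacz_filepath = wacz_filepath

    def read_file(self, filename: str) -> bytes:
        # read a single file (e.g., a pages JSONL listing) out of the WACZ archive
        with zipfile.ZipFile(self.wacz_filepath) as wacz_archive:
            return wacz_archive.read(filename)


class CrawlMetadataParser:
    """Sketch: generates metadata records from a completed crawl, via a WACZClient."""

    def __init__(self, wacz_client: WACZClient):
        self.wacz_client = wacz_client

    def generate_metadata(self) -> str:
        # build XML metadata records from crawled page data; real logic omitted here
        raise NotImplementedError

Keeping WACZ file access behind its own class means CrawlMetadataParser can be exercised with a stubbed WACZClient, which is the readability and testing benefit the commit describes.
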
@ehanson8 (Contributor) left a comment

A major improvement, much easier to understand! Some questions and suggestions

@ghukill ghukill requested a review from ehanson8 October 11, 2023 19:06
@ehanson8 (Contributor) left a comment

Well done, great refactor!

with smart_open.open(self.wacz_filepath, "rb") as file_obj:
    # read the WACZ (zip) file into memory and open it as a ZipFile
    _wacz_archive = zipfile.ZipFile(io.BytesIO(file_obj.read()))
# cache the opened archive so later property accesses reuse it
self._wacz_archive = _wacz_archive
return self._wacz_archive
Got it, thanks!
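
For context on the snippet quoted above, a self-contained version of that lazy-load-and-cache pattern could look like the following; the class and attribute names come from the snippet, while the surrounding property structure is an assumption:

import io
import zipfile

import smart_open  # handles local paths and remote URIs (e.g., S3) alike


class WACZClient:
    def __init__(self, wacz_filepath: str):
        self.wacz_filepath = wacz_filepath
        self._wacz_archive: zipfile.ZipFile | None = None

    @property
    def wacz_archive(self) -> zipfile.ZipFile:
        # open the WACZ (zip) file once on first access and reuse the cached archive afterwards
        if self._wacz_archive is None:
            with smart_open.open(self.wacz_filepath, "rb") as file_obj:
                self._wacz_archive = zipfile.ZipFile(io.BytesIO(file_obj.read()))
        return self._wacz_archive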

# CLI commands
shell:
	pipenv run harvest-dockerized shell
docker-shell:
@jonavellecuerdo (Contributor) commented Oct 12, 2023

How/when is this command used compared to test-harvest-local?

@ghukill (Contributor, Author) replied:

The Makefile command test-harvest-local actually kicks off a harvest, while this docker-shell command merely opens a bash shell in a Docker container of the application image. It's mostly for debugging.

It doesn't have to be done as part of this PR, but I think it would be helpful if the README had instructions on how to use this command for debugging!

@jonavellecuerdo (Contributor) commented

@ghukill Thank you so much for taking the time to carefully document important classes and methods! I would definitely need more time to understand it at a much deeper level (which I aim to do by attempting to create a diagram for this application) and will probably have more feedback then. However, from a high-level perspective, I think the codebase is well-organized and documented, and it's apparent that a lot of care and thought has been put into it. :) I just have three clarifying questions above!

@ghukill ghukill merged commit cf7365e into code-review-main Oct 13, 2023