
Use WARCs in the crawler #62

Merged
vlofgren merged 25 commits into master from warc on Dec 16, 2023

Conversation

vlofgren (Contributor) commented Dec 11, 2023

A major drawback with the current crawler design is that if the crawler terminates abruptly, each ongoing crawl needs to be restarted from scratch. This is time-consuming and annoying for webmasters.

This pull request adds a flight recorder-style component that records each fetched document to a WARC file, which can be played back to recover much of the state of an aborted crawl. It does this using the jwarc library.
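
For readers unfamiliar with jwarc, here is a rough sketch of the building blocks involved in writing a response record. This is illustrative only, not code from the PR; the actual WarcRecorder also records the request side and additional metadata.

```java
import org.netpreserve.jwarc.MediaType;
import org.netpreserve.jwarc.WarcResponse;
import org.netpreserve.jwarc.WarcWriter;

import java.io.IOException;
import java.net.URI;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Instant;

class FlightRecorderSketch {
    // Append one fetched document to the WARC "flight recorder" file.
    // rawHttpResponse is the full HTTP message (status line + headers + body),
    // which is the payload that MediaType.HTTP_RESPONSE expects.
    static void record(Path warcFile, URI url, byte[] rawHttpResponse) throws IOException {
        try (WarcWriter writer = new WarcWriter(
                FileChannel.open(warcFile, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
            WarcResponse response = new WarcResponse.Builder(url)
                    .date(Instant.now())
                    .body(MediaType.HTTP_RESPONSE, rawHttpResponse)
                    .build();
            writer.write(response);
        }
    }
}
```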

A caveat is that it's not possible to fully record every aspect of the crawl, since the crawler's design and mode of operation don't quite match what the designers of the WARC format anticipated; instead, a record of the crawl is constructed after the fact. It may be possible to reconcile the two in the future, but that is outside the scope of this change.

Since WARC files are significantly larger than the preceding ZStd-compressed JSON format, WARC is not used for long-term storage. Instead, the WARCs are converted to a dense parquet format. This leaves us with three different formats, legacy, WARC and parquet, all currently in use and all storing the same type of information. In a future change the legacy format will be scrapped, but support needs to be retained until we've migrated off it.
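
For context, this is roughly the kind of information a single parquet row ends up holding. The field names below are made up for illustration, not the actual schema added in this PR:

```java
// Hypothetical row shape; the real schema lives in the crawl data parquet code.
record CrawlDataRow(
        String domain,
        String url,
        int httpStatus,
        long timestampEpochSeconds, // plain 64-bit long, seconds since the Unix epoch
        String contentType,
        byte[] body,                // kept as raw bytes rather than assuming a string
        boolean hasCookies
) {}
```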

vlofgren and others added 25 commits December 6, 2023 18:43
This update integrates the jwarc library and implements support for WARC file sideloading, as a first trial integration with the library.
This functionality needs to be accessed by the WarcSideloader, which is in the converter.  The resultant microlibrary is tiny, but I think in this case it's justifiable.
This is a first step toward using WARC as an intermediate, flight recorder-style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted.  This component is currently not hooked into anything.

The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.

The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
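
As a rough illustration of the interceptor approach (simplified; the real IpInterceptingNetworkInterceptor and WarcRecordingFetcherClient handle more cases and plumbing):

```java
import okhttp3.Connection;
import okhttp3.Interceptor;
import okhttp3.Response;

import java.io.IOException;

class RecordingInterceptorSketch implements Interceptor {
    @Override
    public Response intercept(Chain chain) throws IOException {
        // As a network interceptor, chain.connection() exposes the socket,
        // which is where the remote IP for the WARC record can be read.
        Connection connection = chain.connection();
        String remoteIp = connection != null
                ? connection.socket().getInetAddress().getHostAddress()
                : "";

        Response response = chain.proceed(chain.request());

        // peekBody() copies the body without consuming it, so the actual
        // consumer of the response is unaffected.
        byte[] bodyCopy = response.peekBody(Long.MAX_VALUE).bytes();

        // ... hand the request/response headers, bodyCopy and remoteIp to the WARC writer ...

        return response;
    }
}
```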
Partially hook the WarcRecorder into the crawler process.  So far it's not read back, but it should record the crawled documents.

The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
…HttpFetcher with a virtual thread pool dispatcher instead of the default.
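
In other words, something along these lines (a sketch; pool sizing and other client configuration omitted):

```java
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

import java.util.concurrent.Executors;

class VirtualThreadClientSketch {
    // OkHttp's Dispatcher accepts any ExecutorService, so a virtual-thread-per-task
    // executor (Java 21+) can replace the default cached thread pool.
    static OkHttpClient create() {
        Dispatcher dispatcher = new Dispatcher(Executors.newVirtualThreadPerTaskExecutor());
        return new OkHttpClient.Builder()
                .dispatcher(dispatcher)
                .build();
    }
}
```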
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit; since the WARC files are not yet fully incorporated into the workflow, they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
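
The gist of the resume logic, in sketch form only (the real recovery replays considerably more, e.g. robots.txt, redirects and errors):

```java
import org.netpreserve.jwarc.WarcReader;
import org.netpreserve.jwarc.WarcRecord;
import org.netpreserve.jwarc.WarcResponse;

import java.io.IOException;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

class ResumeSketch {
    // Replay the WARC and collect the URLs that were already fetched,
    // so a restarted crawl can skip them instead of re-fetching.
    static Set<String> alreadyFetched(Path warcFile) throws IOException {
        Set<String> urls = new HashSet<>();
        try (WarcReader reader = new WarcReader(warcFile)) {
            for (WarcRecord record : reader) {
                if (record instanceof WarcResponse response) {
                    urls.add(response.target());
                }
            }
        }
        return urls;
    }
}
```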
…r process.

This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format for storing the crawl data long term, something like parquet, and to use WARCs only for intermediate storage so the crawler can be restarted without needing a recrawl.
This is not hooked into anything yet.  The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays.  This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
This commit cleans up the warc->parquet conversion.  Records with an HTTP status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder and adds information about redirects and errors due to probe failures.

It also refactors the fetch result, body extraction and content type abstractions.
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
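
Not lifted from the PR, but the general shape of honoring a 429 with a Retry-After header (seconds form only) looks something like this:

```java
class RetryAfterSketch {
    // Returns how long to wait before retrying; falls back to the default crawl
    // delay if the header is absent or uses the HTTP-date form.
    static long retryDelayMillis(int statusCode, String retryAfterHeader, long defaultDelayMillis) {
        if (statusCode != 429 || retryAfterHeader == null)
            return defaultDelayMillis;
        try {
            return Math.max(defaultDelayMillis, Long.parseLong(retryAfterHeader.trim()) * 1000L);
        } catch (NumberFormatException e) {
            return defaultDelayMillis;
        }
    }
}
```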
…ther a website uses cookies

This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference crawl, use whatever value we last saw.  This isn't 100% deterministic and may result in false negatives, but it lets websites that used to set cookies but have since stopped have that change reflected in the search engine more quickly.
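
The detection itself can be as simple as watching for Set-Cookie on any response from the domain; this is an illustrative sketch, not the PR's exact implementation:

```java
import okhttp3.Response;

class CookieFlagSketch {
    private volatile boolean hasCookies = false;

    // Called for every response from the domain; once a Set-Cookie header is
    // seen, the flag sticks for the rest of that domain's crawl.
    void observe(Response response) {
        if (!response.headers("Set-Cookie").isEmpty()) {
            hasCookies = true;
        }
    }

    boolean hasCookies() {
        return hasCookies;
    }
}
```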
Add an optional new field to CrawledDocument containing information about whether the domain has cookies.  This was previously on the CrawledDomain object, but since the WARC format requires us to write a warcinfo record at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.

Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.

The parquet format stores the timestamp as a 64-bit long, seconds since the Unix epoch, without a logical type.  This is to avoid having to do format conversions when writing and reading the data.

This parquet field populates the timestamp field in CrawledDocument.
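
So the conversion at the parquet boundary is plain epoch-second arithmetic, roughly:

```java
import java.time.Instant;

class TimestampColumnSketch {
    // Write side: WARC record date -> parquet column (64-bit long).
    static long toColumn(Instant warcDate) {
        return warcDate.getEpochSecond();
    }

    // Read side: parquet column back to an Instant for downstream use.
    static Instant fromColumn(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }
}
```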
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.

Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.

The commit also adds some safety try-catches around the charset handling, as it may sometimes explode when fed incorrect data, and the charset is, after all, guessed...
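
The defensive pattern in question looks roughly like this (the specific fallback charset here is an assumption, not taken from the PR):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSketch {
    // Guessed charset names from headers or <meta> tags can be nonsense;
    // swallow the lookup errors and fall back instead of letting the crawl blow up.
    static Charset charsetOrDefault(String guessedName) {
        try {
            return Charset.forName(guessedName);
        } catch (IllegalArgumentException e) {
            // covers illegal names, unsupported charsets, and null input
            return StandardCharsets.UTF_8; // assumed fallback
        }
    }
}
```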
Further, create records for resources that were blocked due to robots.txt, along with tests to verify this happens.
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tag header where that header indicates the document doesn't want to be crawled by Marginalia.
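
In sketch form, the check amounts to something like the following; the directive matching and the user agent token here are simplified assumptions rather than the PR's actual rules:

```java
import java.util.List;

class XRobotsTagSketch {
    // Returns true if any X-Robots-Tag value forbids indexing, either globally
    // or specifically for our user agent token (token is hypothetical here).
    static boolean isBlocked(List<String> xRobotsTagValues, String userAgentToken) {
        for (String value : xRobotsTagValues) {
            String v = value.toLowerCase();
            boolean appliesToUs = !v.contains(":") || v.startsWith(userAgentToken.toLowerCase() + ":");
            if (appliesToUs && (v.contains("noindex") || v.contains("none"))) {
                return true;
            }
        }
        return false;
    }
}
```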
vlofgren marked this pull request as ready for review December 16, 2023 15:01
vlofgren (Contributor, Author) commented Dec 16, 2023

This needs more testing, but it's really difficult to thoroughly test this sort of change outside of production (or a near-production like environment), so it's merged down as-is for now with the understanding that the next crawl run may be a bit choppy and involve some bug fixing and performance optimization...

vlofgren merged commit 8bbb533 into master Dec 16, 2023
vlofgren changed the title from "(WIP) Use WARCs in the crawler" to "Use WARCs in the crawler" Dec 16, 2023
vlofgren deleted the warc branch March 21, 2024 13:42