Refactor the DomainProcessor to take advantage of the new crawl data format #69
Merged
Conversation
With the new crawler modifications, the crawl data arrives in a slightly different order, which lets us optimize the converter. This is a breaking change that is incompatible with the old style of crawl data, so it will linger as a branch for a while. The first step is to move work out of the domain processor into the document processor.
vlofgren changed the title from "(WIP) Refactor the DomainProcessor to take advantage of the new crawl data format" to "Refactor the DomainProcessor to take advantage of the new crawl data format (WIP)" on Dec 27, 2023
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.
This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream, which *may* provide an indication that the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
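As a rough sketch of what such a hint might look like: the actual SerializableCrawlDataStream interface is not shown in this PR, so the shape below (a default method returning 0 when no estimate is available) is an assumption, not the project's real code.

```java
import java.util.Iterator;
import java.util.List;

// Illustrative stand-in for the real interface; names mirror the PR.
interface SerializableCrawlDataStream extends Iterator<String> {
    /** May return an estimate of the number of documents in the stream,
     *  or 0 when no estimate is available.  Only a hint: callers use it
     *  to decide whether slow sideload-style processing is worthwhile. */
    default int sizeHint() { return 0; }
}

// A stream backed by an in-memory list knows its size exactly.
class ListBackedStream implements SerializableCrawlDataStream {
    private final Iterator<String> it;
    private final int size;

    ListBackedStream(List<String> docs) {
        this.it = docs.iterator();
        this.size = docs.size();
    }

    public boolean hasNext() { return it.hasNext(); }
    public String next() { return it.next(); }
    @Override public int sizeHint() { return size; }
}
```

The default-method approach lets existing stream implementations compile unchanged while letting size-aware ones opt in.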
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.
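The fan-out pattern described in these two commits can be sketched as follows: work is submitted to a shared pool while the consumer thread drains a queue with a short poll timeout. Everything here is illustrative (squaring integers stands in for document processing); only the 50 ms poll interval comes from the PR.

```java
import java.util.*;
import java.util.concurrent.*;

class FanOut {
    static List<Integer> processAll(List<Integer> inputs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        BlockingQueue<Integer> results = new LinkedBlockingQueue<>();

        // Fan work out across the pool instead of processing
        // everything in the consumer thread.
        for (int input : inputs) {
            pool.submit(() -> results.add(input * input));
        }
        pool.shutdown();

        List<Integer> out = new ArrayList<>();
        // Poll with a short timeout (50 ms per the PR) so the consumer
        // never blocks long waiting on stragglers.
        while (out.size() < inputs.size()) {
            Integer r = results.poll(50, TimeUnit.MILLISECONDS);
            if (r != null) out.add(r);
        }
        return out;
    }
}
```

Results arrive in completion order, not submission order, which is why a set comparison (not a list comparison) is the right way to check the output.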
…e single thread instead of using the pool
Showing a total of 200 connected domains is not very informative.
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
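A minimal sketch of per-component query string encoding using the standard java.net.URLEncoder; the real WarcProtocolReconstructor logic may differ, and note that URLEncoder produces form-style encoding (space becomes "+"), which is close to but not identical to strict RFC 3986 percent-encoding.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class QueryEncoder {
    /** Re-encode each key=value pair of a raw query string. */
    static String encodeQuery(String rawQuery) {
        StringBuilder sb = new StringBuilder();
        for (String pair : rawQuery.split("&")) {
            if (sb.length() > 0) sb.append('&');
            int eq = pair.indexOf('=');
            if (eq < 0) {
                // Bare key with no value
                sb.append(URLEncoder.encode(pair, StandardCharsets.UTF_8));
            } else {
                sb.append(URLEncoder.encode(pair.substring(0, eq), StandardCharsets.UTF_8));
                sb.append('=');
                sb.append(URLEncoder.encode(pair.substring(eq + 1), StandardCharsets.UTF_8));
            }
        }
        return sb.toString();
    }
}
```

Encoding key and value separately preserves the `&` and `=` delimiters while escaping special characters inside them.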
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.
Modify ProcessingIterator to be constructed via a factory, enabling re-use of its backing executor service. This reduces thread churn in the converter's sideloader-style processing of regular crawl data.
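The factory idea reduces to this: construct one pool up front and hand it to every consumer, rather than letting each consumer spin up (and tear down) its own pool. The class and method names below are assumptions for illustration, not the project's actual API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical factory: all work created through one instance
// shares a single backing pool, avoiding thread churn.
class ProcessingIteratorFactory implements AutoCloseable {
    private final ExecutorService pool;

    ProcessingIteratorFactory(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Stand-in for creating a ProcessingIterator's worker task. */
    <T> Future<T> submitWork(Callable<T> task) {
        return pool.submit(task);
    }

    @Override public void close() { pool.shutdown(); }
}
```

Making the factory AutoCloseable ties the pool's lifetime to a try-with-resources scope instead of to each individual iterator.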
A necessary step was accidentally deleted when cleaning up these tests previously.
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
There was a bug where, if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always ranked the same. The penalty factor "overallPart" was moved outside of the function and re-weighted to achieve better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.
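The shape of the bug can be shown with invented numbers (this is not the actual ResultValuator formula): folding the penalty in before the clamp lets sufficiently penalized scores all hit the zero floor, whereas applying the penalty outside the clamp preserves their ordering.

```java
class Normalizer {
    // Clamp negative values to zero, as normalize() did.
    static double normalize(double v) { return Math.max(0, v); }

    // Before: penalty folded in before normalizing, so any
    // sufficiently penalized score floors at 0 and all rank the same.
    static double before(double score, double penalty) {
        return normalize(score - penalty);
    }

    // After: normalize first, subtract the penalty outside the clamp,
    // so "bad" results still differ from one another.
    static double after(double score, double penalty) {
        return normalize(score) - penalty;
    }
}
```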
vlofgren force-pushed the converter-optimizations branch from 8764a77 to 60361f8 on January 4, 2024 12:14
This is a test to evaluate how this impacts load times.
This dependency causes the executor service docker image to change when the index service docker image changes.
Settings for enabling reproducible builds for all subprojects were added to improve build consistency. This includes preserving file timestamps and ordering files reproducibly. This is primarily of help for docker, since it uses hashes to determine if a file or image layer has changed.
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying the table is very slow relative to how small the data is (~10 GB). The slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32-bit integer pairs corresponding to links between two domains. The file is loaded into memory in each node and can be queried via the Query Service.

A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it were a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
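The in-memory representation amounts to flat arrays of 32-bit integer pairs that can be scanned without any database round trip. The sketch below is an assumption about the general idea (the file layout, class, and method names are invented, not the project's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory domain link store: parallel arrays of
// (source, destination) domain IDs, as would be read from a file
// of 32-bit integer pairs.
class DomainLinks {
    private final int[] sources;
    private final int[] dests;

    DomainLinks(int[] sources, int[] dests) {
        this.sources = sources;
        this.dests = dests;
    }

    /** All domains that the given domain links to. */
    List<Integer> destinations(int sourceDomainId) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < sources.length; i++) {
            if (sources[i] == sourceDomainId) out.add(dests[i]);
        }
        return out;
    }
}
```

At ~10 GB of link data, even a linear scan over primitive int arrays is cheap compared to a transactional table query, and the whole structure can be rebuilt by simply rewriting the file.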
The new converter logic assumes that the crawl data is ordered so that the domain record comes first, followed by a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, presenting them in the new order.

This is slower than just reading the file beginning to end, so to retain performance when this reordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line, when all the old data is purged, this should be removed, as it amounts to technical debt.
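The reordering reader's behavior boils down to: find the domain record first, then replay the document records in their original order. The record representation below (prefixed strings) is purely illustrative; the real reader works on crawl data records.

```java
import java.util.ArrayList;
import java.util.List;

class ReorderingReader {
    /** Given records in arbitrary order, return them domain-record-first,
     *  with document records following in their original order. */
    static List<String> domainFirst(List<String> records) {
        List<String> out = new ArrayList<>();
        // First pass: scan for the domain record.
        for (String r : records) {
            if (r.startsWith("domain:")) out.add(r);
        }
        // Second pass: the document records, original order preserved.
        for (String r : records) {
            if (!r.startsWith("domain:")) out.add(r);
        }
        return out;
    }
}
```

The two passes are why this costs roughly a full extra read of the data, which is exactly what the CompatibilityLevel flag lets callers skip when the input is already ordered.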
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
Rendering is very slow. Let's see if this has a measurable effect on latency.
Will by default show results from the last 2 years. May need to tune this later.
Actually persist the value of the toggle between searches too...
Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.
…plain text. This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. That method is removed with this change.
vlofgren changed the title from "Refactor the DomainProcessor to take advantage of the new crawl data format (WIP)" to "Refactor the DomainProcessor to take advantage of the new crawl data format" on Jan 10, 2024
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there is a small number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment that skips the topological analysis. This is similar to what is done when sideloading Wikipedia and StackExchange. This processing is executed in the converter batch writer thread, which is slower, but the documents will not be persisted in memory, thus lowering the overall memory requirement for the converter process.
The change also moves the link database out of MariaDB and into in-memory storage in the executor services. This is done because writing to and reading from the link database is very slow, leading to long locks and timeouts. It was one of the few remaining parts of the system that couldn't be rolled back and restored from backup; this is doable now. In doing this, the old "linkdb" is renamed "documents db" to avoid naming confusion. An automatic migration is added to rename the associated files. This breaks old backups.
Other changes were committed to this branch while it served as a temporary main branch, but they should also have been cherry-picked to master.