@vlofgren vlofgren commented Jul 11, 2025

Continuing PR #201, this change set adds a new internal API for fetching DOM samples for a given domain, which is called from the converter. The converter analyzes the DOM sample and adds new document features, which the index acts on to better penalize documents with advertisements.
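
As a rough illustration of the classification step, the sketch below labels the outgoing requests seen in a DOM sample by matching their hosts against lists of known domains. It is not the classifier from this change set: `RequestClassifier`, `Kind`, the host lists, and the suffix-matching rule are all hypothetical names and assumptions chosen for the example.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: labels sampled request URLs by their host. */
final class RequestClassifier {
    enum Kind { ADVERTISEMENT, TRACKING, OTHER }

    private final Set<String> adHosts;       // assumed input, e.g. a curated list
    private final Set<String> trackerHosts;  // assumed input, e.g. parsed tracker data

    RequestClassifier(Set<String> adHosts, Set<String> trackerHosts) {
        this.adHosts = adHosts;
        this.trackerHosts = trackerHosts;
    }

    /** Classifies a single request URL; unparseable URLs count as OTHER. */
    Kind classify(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return Kind.OTHER;
        }
        if (host == null) return Kind.OTHER;

        host = host.toLowerCase(Locale.ROOT);
        if (matches(host, adHosts)) return Kind.ADVERTISEMENT;
        if (matches(host, trackerHosts)) return Kind.TRACKING;
        return Kind.OTHER;
    }

    /** True if any sampled request was classified as an advertisement. */
    boolean hasAds(List<String> urls) {
        for (String url : urls) {
            if (classify(url) == Kind.ADVERTISEMENT) return true;
        }
        return false;
    }

    /** Matches the host or any parent domain: a.b.example, then b.example, then example. */
    private static boolean matches(String host, Set<String> domains) {
        String candidate = host;
        while (true) {
            if (domains.contains(candidate)) return true;
            int dot = candidate.indexOf('.');
            if (dot < 0) return false;
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

Matching on parent domains means a single list entry also covers its subdomains. That keeps the lists short, but it is one design choice among several.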

--

The change also largely retires the quality property on documents, which was used to filter out documents with too much JavaScript. It was a very blunt instrument that threw out a lot of perfectly valid search results. Using DOM sampling to assess what sort of scripts a page runs is a better way of doing things.

Additionally, the converter is less strict about filtering out documents based on length. Instead, short documents are assigned a feature bit so that they can be ranked lower.
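
To make the feature-bit idea concrete, here is a minimal sketch of how document features could be packed into a bit mask and turned into a ranking penalty instead of a hard filter. The flag names, bit positions, and penalty weights are all made up for illustration and are not taken from the actual index code.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical feature flags; names and bit positions are illustrative only. */
enum DocFeature {
    SHORT_DOCUMENT,
    ADVERTISEMENT,
    POPOVER;

    /** Bit mask for this flag, derived from its ordinal. */
    long mask() {
        return 1L << ordinal();
    }
}

final class FeatureBits {
    private FeatureBits() {}

    /** Packs a set of features into a single long. */
    static long encode(Set<DocFeature> features) {
        long bits = 0L;
        for (DocFeature feature : features) {
            bits |= feature.mask();
        }
        return bits;
    }

    /** Tests whether a feature bit is set. */
    static boolean has(long bits, DocFeature feature) {
        return (bits & feature.mask()) != 0;
    }

    /** Made-up weights: a higher penalty means the document ranks lower. */
    static double penalty(long bits) {
        double penalty = 0.0;
        if (has(bits, DocFeature.SHORT_DOCUMENT)) penalty += 1.0;
        if (has(bits, DocFeature.ADVERTISEMENT)) penalty += 2.0;
        if (has(bits, DocFeature.POPOVER)) penalty += 0.5;
        return penalty;
    }
}
```

With this shape, a short document stays in the index and still appears in results. It only sorts below comparable documents that lack the flag.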

--

To make the request classification model easier to adjust, a GUI was built for viewing the sampled traffic. Since the data is interesting, it was made public and is accessible under the site inspector tool.

  • Implement API + Client
  • Implement classifier
  • Integrate classifier data into converter
  • Retire old quality measure when DOM sample data is available
  • Integrate new features into index
  • Verify UI filters act properly
  • Build inspection UI
  • Tune search result quality in test environment
  • (fix misidentification of blogspot and wordpress sites as having ads)
  • Fix UI scaling for mobile
  • Verify light mode layout doesn't look like crap
  • Ensure system doesn't break for non-prod deploys that don't have the domsample api provided

vlofgren added 24 commits July 11, 2025 15:41
responseObserver.onError(...) should be passed Status.WHATEVER.foo().asRuntimeException() and not random throwables as was done before.
... but assign short documents a special flag and penalize them in index lookups
The change also moves the dom classifier to a separate package so that it can be accessed from both the search service and converter.

The change also adds a parser for DDG's tracker radar data.
@vlofgren vlofgren force-pushed the ads-fingerprinting branch from 7446ae7 to 5d88592 Compare July 18, 2025 15:54
@vlofgren vlofgren force-pushed the ads-fingerprinting branch 2 times, most recently from 1fa83d8 to 1ca8495 Compare July 19, 2025 12:24
@vlofgren vlofgren force-pushed the ads-fingerprinting branch from 0286e52 to ceaefa2 Compare July 19, 2025 16:38
@vlofgren vlofgren force-pushed the ads-fingerprinting branch from ceaefa2 to 90a1ff2 Compare July 19, 2025 16:41
@vlofgren vlofgren changed the title (WIP) Implement advertisement and popover identification based on DOM sample data Implement advertisement and popover identification based on DOM sample data Jul 21, 2025
@vlofgren vlofgren merged commit 3b2ac41 into master Jul 21, 2025
@vlofgren vlofgren deleted the ads-fingerprinting branch July 21, 2025 10:25