Implement advertisement and popover identification based on DOM sample data #210
Merged
… the converter workflow
…ng detection code
responseObserver.onError(...) should be passed Status.WHATEVER.foo().asRuntimeException() and not random throwables as was done before.
... but assign short documents a special flag and penalize them in index lookups
The change also moves the DOM classifier to a separate package so that it can be accessed from both the search service and the converter, and adds a parser for DDG's tracker radar data.
7446ae7 to 5d88592
1fa83d8 to 1ca8495
0286e52 to ceaefa2
ceaefa2 to 90a1ff2
Continuing PR #201, this change set adds a new internal API for fetching DOM samples for a given domain, which is called from the converter. The converter analyzes the DOM sample and adds new document features, which the index then uses to better penalize documents with advertisements.
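As a rough illustration of the converter step described above, the sketch below maps the outgoing requests seen in a DOM sample to a set of document features. It is not the code in this PR: the class name `DomSampleFeatureSketch`, the `Feature` enum, and the hard-coded host table are all hypothetical, and the real classifier presumably draws on richer data than a host lookup.

```java
import java.net.URI;
import java.util.EnumSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: derive document features from the requests in a DOM sample.
final class DomSampleFeatureSketch {
    // Hypothetical feature names, not the project's actual feature set.
    enum Feature { ADVERTISEMENT, TRACKING }

    // Hypothetical host table; a real classifier would load this from data.
    static final Map<String, Feature> KNOWN_HOSTS = Map.of(
            "ads.example.com", Feature.ADVERTISEMENT,
            "tracker.example.net", Feature.TRACKING);

    // Returns the features implied by the request URLs observed in one sample.
    static Set<Feature> classify(List<String> requestUrls) {
        EnumSet<Feature> features = EnumSet.noneOf(Feature.class);
        for (String url : requestUrls) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                continue; // skip entries that are not valid URIs
            }
            if (host == null) {
                continue; // relative or opaque URI, no host to classify
            }
            Feature feature = KNOWN_HOSTS.get(host.toLowerCase(Locale.ROOT));
            if (feature != null) {
                features.add(feature);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of(
                "https://ads.example.com/banner.js",
                "https://cdn.example.org/site.css")));
    }
}
```

The resulting feature set would be stored with the document so that the index can act on it at query time.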
--
The change also largely retires the quality property on documents, which was used to filter out documents with too much JavaScript. That was a very blunt instrument that threw out a lot of perfectly valid search results. Using DOM sampling to assess what sort of scripts are present is a better way of doing things.
Additionally, the converter is less strict about filtering out documents based on length. Instead, short documents are assigned a feature bit so that they can be ranked lower.
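The flag-instead-of-filter idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the flag constant, the 100-word threshold, the 0.5 penalty factor, and the `Doc` record are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: flag short documents at conversion time, penalize at ranking time.
final class ShortDocumentRankingSketch {
    static final int FLAG_SHORT_DOCUMENT = 1 << 0;   // hypothetical feature bit
    static final int SHORT_DOCUMENT_WORDS = 100;     // hypothetical length threshold
    static final double SHORT_DOCUMENT_PENALTY = 0.5; // hypothetical score multiplier

    record Doc(String url, int wordCount, double baseScore, int flags) {}

    // Converter side: keep the document, but set the flag if it is short.
    static Doc convert(String url, int wordCount, double baseScore) {
        int flags = wordCount < SHORT_DOCUMENT_WORDS ? FLAG_SHORT_DOCUMENT : 0;
        return new Doc(url, wordCount, baseScore, flags);
    }

    // Index side: flagged documents stay in the results with a reduced score.
    static double rankingScore(Doc doc) {
        boolean isShort = (doc.flags() & FLAG_SHORT_DOCUMENT) != 0;
        return isShort ? doc.baseScore() * SHORT_DOCUMENT_PENALTY : doc.baseScore();
    }

    // Orders documents by descending ranking score without mutating the input.
    static List<Doc> rank(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        Comparator<Doc> byScore = Comparator.comparingDouble(ShortDocumentRankingSketch::rankingScore);
        sorted.sort(byScore.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Doc> ranked = rank(List.of(
                convert("https://example.com/stub", 20, 1.5),
                convert("https://example.com/article", 500, 1.0)));
        ranked.forEach(doc -> System.out.println(doc.url() + " " + rankingScore(doc)));
    }
}
```

The point of the design is that a short but relevant document can still surface when nothing better matches, which a hard length filter would prevent.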
--
To make it easier to adjust the request classification model, a GUI was built for viewing the sampled traffic. Since the data is interesting, this was made public and is accessible under the site inspector tool.