feat(engine): demote off-topic and wrong-language results in ranking #56
Merged
Port the desktop relevance signal (shipped in SearchMob-Desktop 26.06.04) to Android. RRF alone trusts each engine's order, so with mostly single-engine results the fused scores are near-tied and an off-topic result that one engine ranked highly slips into the top. This adds the missing query-match signal.

`engine/aggregate/Relevance.kt` is a 1:1 port of `engines/relevance.py`: a stopword-filtered, lightly (ASCII-gated) stemmed lexical coverage score over title and snippet with a head-term penalty and phrase bonus, plus a script-relative language affinity. The aggregator folds them into its final ordering as a demotion-only blend (factor capped at 1.0), so weak or wrong-language matches sink toward a floor while strong matches keep full engine consensus weight and the existing deterministic tie-breakers are preserved. `ResultSorter`, `Personalizer`, and `DomainRanker` consume the order positionally, so they inherit the improvement unchanged.

It is language-agnostic ahead of the localization pass: tokenization scans Unicode code points via `Character.isLetterOrDigit` rather than regex `\w` (which is ASCII-only by default on the JVM and would silently drop all non-Latin text), English stemming is gated to ASCII so non-Latin words are never corrupted, and the stopword list degrades harmlessly for other languages.

Verified on the searchmob emulator: an album query returns only on-topic results, and a Cyrillic query keeps same-script results on top while demoting a stray Latin result. `RelevanceTest` mirrors the desktop suite (14 cases); `AggregatorTest` is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ports the desktop relevance signal (shipped in SearchMob-Desktop 26.06.04) to Android, closing the parity gap. Releases as standalone Android GA 26.06.02.
Problem
Ranking was RRF (engine consensus) over de-duplicated results, followed by sorting and the user's domain rules. Nothing asked "does this result actually match the query?", so with mostly single-engine results the fused scores were near-tied and off-topic or wrong-language results slipped into the top (users reported results "very far from relevant" and "in different languages than the request").
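To illustrate why the fused scores end up near-tied, here is a minimal sketch of reciprocal rank fusion. It is not the project's aggregator: the function name `rrf`, the engine labels, the example URLs, and the constant `K = 60` (a commonly used RRF default; the value SearchMob uses is not stated here) are all assumptions for illustration.

```kotlin
// Hypothetical sketch; names and K are assumptions, not the app's actual code.
const val K = 60.0 // commonly used RRF constant; the app's real value is not stated

/** Reciprocal rank fusion: each engine adds 1 / (K + rank) for every URL it returned. */
fun rrf(rankings: Map<String, List<String>>): Map<String, Double> {
    val scores = mutableMapOf<String, Double>()
    for (urls in rankings.values) {
        urls.forEachIndexed { index, url ->
            val rank = index + 1 // ranks are 1-based
            scores[url] = (scores[url] ?: 0.0) + 1.0 / (K + rank)
        }
    }
    return scores
}

fun main() {
    // Two engines with disjoint results: every URL receives a single contribution.
    val fused = rrf(
        mapOf(
            "engineA" to listOf("a.example/on-topic", "a.example/also-on-topic"),
            "engineB" to listOf("b.example/off-topic", "b.example/on-topic"),
        )
    )
    fused.entries
        .sortedByDescending { it.value }
        .forEach { (url, score) -> println("$score  $url") }
}
```

When the engines return disjoint results, each URL gets a single `1 / (K + rank)` term, so under these assumptions the rank-1 results of both engines tie exactly at 1/61 and the rank-2 results trail them by less than 0.0003. RRF alone has no basis for preferring the on-topic result over the off-topic one.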
Change
- `engine/aggregate/Relevance.kt`: a 1:1 port of desktop `engines/relevance.py`. The blend is `rrf * minOf(1.0, BASE + GAIN*lexical) * affinity` (`BASE=0.5`, `GAIN=1.0`): a weak/wrong-language match sinks toward a floor; a strong match never outranks engine consensus, so keyword stuffing is not promoted.
- `engine/aggregate/Aggregator.kt`: folds the blend into the final ordering, keeping the existing deterministic tie-breakers.
- `ResultSorter` / `Personalizer` / `DomainRanker` consume the order positionally, so they inherit the improvement unchanged; pin/raise/lower/block still win.

Multilingual (ahead of the localization pass)
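A minimal sketch of the code-point tokenization and ASCII gate that this section describes. The helper names `tokenize` and `isAscii` are hypothetical, and lowercasing is an assumption; the real `Relevance.kt` may differ in detail.

```kotlin
/** Splits text into lowercase tokens by scanning Unicode code points (hypothetical helper). */
fun tokenize(text: String): List<String> {
    val tokens = mutableListOf<String>()
    val current = StringBuilder()
    var i = 0
    while (i < text.length) {
        val cp = text.codePointAt(i)
        if (Character.isLetterOrDigit(cp)) {
            // Letters and digits of any script extend the current token.
            current.appendCodePoint(Character.toLowerCase(cp))
        } else if (current.isNotEmpty()) {
            // Any other code point ends the token.
            tokens.add(current.toString())
            current.setLength(0)
        }
        i += Character.charCount(cp) // step over surrogate pairs as one unit
    }
    if (current.isNotEmpty()) tokens.add(current.toString())
    return tokens
}

/** True when every char is ASCII; a gate like this can keep English stemming off non-Latin words. */
fun isAscii(token: String): Boolean = token.all { it.code < 128 }

fun main() {
    val tokens = tokenize("Новости Москва 2024")
    println(tokens) // prints [новости, москва, 2024]
    println(tokens.filter(::isAscii)) // prints [2024]
}
```

Advancing by `Character.charCount` rather than by one `Char` keeps supplementary-plane characters intact, which a plain `for (c in text)` loop would split into surrogate halves.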
Tokenization scans Unicode code points via `Character.isLetterOrDigit` rather than regex `\w` (which is ASCII-only by default on the JVM and would silently drop all non-Latin text). English stemming is gated to ASCII so non-Latin words are never corrupted; the stopword list degrades harmlessly for other languages.

Verification
- `ktlintCheck`, `lintDebug`, `testDebugUnitTest`, `assembleDebug` all green.
- `RelevanceTest` (14 cases, mirrors the desktop suite); `AggregatorTest` unchanged (7/7).
- `searchmob` emulator: `the cure disintegration album` returns only on-topic results; a Cyrillic query (`новости москва`) keeps same-script results on top and demotes a stray Latin result to position 11.

Includes the `add-relevance-ranking` OpenSpec change (validates `--strict`).

🤖 Generated with Claude Code
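For reference, the blend formula given under Change can be sketched in Kotlin as follows. The formula and the constants `BASE` and `GAIN` come from the description above; the function name `blend`, its signature, and the assumption that `lexical` and `affinity` both lie in the range 0 to 1 are illustrative, not taken from `Relevance.kt`.

```kotlin
const val BASE = 0.5 // floor factor for a result with no lexical match
const val GAIN = 1.0

/**
 * Demotion-only blend: rrf * minOf(1.0, BASE + GAIN * lexical) * affinity.
 * Assumes lexical and affinity are in 0..1, so the result never exceeds rrf.
 */
fun blend(rrf: Double, lexical: Double, affinity: Double): Double =
    rrf * minOf(1.0, BASE + GAIN * lexical) * affinity

fun main() {
    println(blend(1.0, lexical = 0.0, affinity = 1.0)) // prints 0.5 (no match: sinks to the floor)
    println(blend(1.0, lexical = 0.5, affinity = 1.0)) // prints 1.0 (cap reached)
    println(blend(1.0, lexical = 1.0, affinity = 1.0)) // prints 1.0 (still capped, never promoted)
    println(blend(1.0, lexical = 1.0, affinity = 0.5)) // prints 0.5 (hypothetical wrong-script affinity)
}
```

With the stated constants the lexical factor saturates once `lexical` reaches 0.5, so a result stuffed with query terms scores no higher than one with a solid match; only the engine-consensus term can raise a result.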