feat(engine): demote off-topic and wrong-language results in ranking#56

Merged: ErikChevalier merged 1 commit into main from feat/relevance-ranking-android on Jun 3, 2026

@ErikChevalier
Contributor

Ports the desktop relevance signal (shipped in SearchMob-Desktop 26.06.04) to Android, closing the parity gap. Releases as standalone Android GA 26.06.02.

Problem

Ranking was RRF (engine consensus) over de-duplicated results, followed by the sort step and the user's domain rules. Nothing asked "does this result actually match the query?", so when most results came from a single engine the fused scores were near-tied and off-topic or wrong-language results slipped into the top positions (users reported results "very far from relevant" and "in different languages than the request").
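
To make the near-tie concrete, here is a minimal sketch of reciprocal rank fusion. It assumes the conventional constant k = 60 and hypothetical names (`RRF_K`, `rrfScores`, the example URLs); the PR does not state which constant or data shapes the app actually uses.

```kotlin
// Hypothetical sketch: reciprocal rank fusion with the conventional k = 60.
// The constant and all names here are assumptions, not taken from the PR.
const val RRF_K = 60.0

/** Fuses per-engine rankings (URLs, best first) into a url -> score map. */
fun rrfScores(rankings: List<List<String>>): Map<String, Double> {
    val scores = LinkedHashMap<String, Double>()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, url ->
            // Ranks are 1-based, so an engine's top result contributes 1 / (k + 1).
            scores[url] = (scores[url] ?: 0.0) + 1.0 / (RRF_K + index + 1)
        }
    }
    return scores
}

fun main() {
    // Two engines with no overlapping results: every URL is seen by one engine only.
    val scores = rrfScores(
        listOf(
            listOf("a.example/on-topic", "a.example/also-on-topic"),
            listOf("b.example/off-topic", "b.example/on-topic"),
        )
    )
    scores.forEach { (url, score) -> println("$url $score") }
}
```

Under these assumptions, the off-topic page that one engine ranked first gets exactly the same score as the other engine's on-topic top result, and adjacent ranks differ by less than 0.001, so engine order alone cannot separate them.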

Change

  • New engine/aggregate/Relevance.kt — a 1:1 port of desktop engines/relevance.py:
    • Lexical query-match: stopword-filtered, lightly (ASCII-gated) stemmed content-term coverage over title + snippet, title-weighted, with a head-term (subject) penalty and a small exact-phrase bonus.
    • Script-relative language affinity: demote a result whose dominant alphabet differs from the query's (works in any language; same-script never penalized).
    • Demotion-only blend rrf * minOf(1.0, BASE + GAIN*lexical) * affinity (BASE=0.5, GAIN=1.0): a weak/wrong-language match sinks toward a floor; a strong match never outranks engine consensus, so keyword stuffing is not promoted.
  • engine/aggregate/Aggregator.kt — folds the blend into the final ordering, keeping the existing deterministic tie-breakers. ResultSorter/Personalizer/DomainRanker consume the order positionally, so they inherit the improvement unchanged; pin/raise/lower/block still win.
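
The blend and the script check can be sketched as follows. `BASE` and `GAIN` are the values given above; everything else (`MISMATCH_AFFINITY`, `dominantScript`, `scriptAffinity`, `blend`) is a hypothetical stand-in, since the PR does not show the real signatures or the actual mismatch penalty.

```kotlin
import java.lang.Character.UnicodeScript

// Constants as stated in the PR description.
const val BASE = 0.5
const val GAIN = 1.0

// Assumed penalty for a script mismatch; the PR does not state the real value.
const val MISMATCH_AFFINITY = 0.5

/** Most frequent Unicode script among the letters of [text], or null if it has none. */
fun dominantScript(text: String): UnicodeScript? {
    val counts = HashMap<UnicodeScript, Int>()
    var i = 0
    while (i < text.length) {
        val cp = text.codePointAt(i)
        if (Character.isLetter(cp)) {
            val script = UnicodeScript.of(cp)
            counts[script] = (counts[script] ?: 0) + 1
        }
        i += Character.charCount(cp)
    }
    return counts.maxByOrNull { it.value }?.key
}

/** 1.0 when the scripts match or either is unknown; otherwise the assumed penalty. */
fun scriptAffinity(query: String, resultText: String): Double {
    val q = dominantScript(query) ?: return 1.0
    val r = dominantScript(resultText) ?: return 1.0
    return if (q == r) 1.0 else MISMATCH_AFFINITY
}

/** Demotion-only blend: the lexical factor is capped at 1.0, so it can never boost. */
fun blend(rrf: Double, lexical: Double, affinity: Double): Double =
    rrf * minOf(1.0, BASE + GAIN * lexical) * affinity
```

With BASE=0.5 and GAIN=1.0 the lexical factor reaches its cap of 1.0 once coverage hits 0.5, so only results matching less than half of the weighted query terms are demoted, and a result with no match at all keeps half of its RRF score.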

Multilingual (ahead of the localization pass)

Tokenization scans Unicode code points via Character.isLetterOrDigit rather than regex \w (which is ASCII-only by default in Java/Kotlin and would silently drop all non-Latin text). English stemming is gated to ASCII so non-Latin words are never corrupted; the stopword list degrades harmlessly for other languages.
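
A minimal sketch of code-point tokenization with ASCII-gated stemming is shown here. The function names and the toy plural-stripping rule are assumptions for illustration; the real stemmer in Relevance.kt is not shown in this PR description.

```kotlin
/** Splits [text] into lowercase tokens made of Unicode letters and digits. */
fun tokenize(text: String): List<String> {
    val tokens = ArrayList<String>()
    val current = StringBuilder()
    var i = 0
    while (i < text.length) {
        // Read a full code point so supplementary characters are not split.
        val cp = text.codePointAt(i)
        if (Character.isLetterOrDigit(cp)) {
            current.appendCodePoint(Character.toLowerCase(cp))
        } else if (current.isNotEmpty()) {
            tokens.add(current.toString())
            current.setLength(0)
        }
        i += Character.charCount(cp)
    }
    if (current.isNotEmpty()) tokens.add(current.toString())
    return tokens
}

/** True when every character is ASCII, i.e. English stemming may be applied. */
fun isAscii(token: String): Boolean = token.all { it.code < 128 }

/** Toy plural-stripping stemmer (an assumption); non-ASCII tokens pass through untouched. */
fun lightStem(token: String): String =
    if (isAscii(token) && token.length > 3 && token.endsWith("s") && !token.endsWith("ss")) {
        token.dropLast(1)
    } else {
        token
    }
```

For example, `tokenize("Новости, Москва!")` yields the two Cyrillic tokens intact, and `lightStem` leaves them unchanged because the ASCII gate fails, while `lightStem("albums")` becomes `album`.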

Verification

  • ktlintCheck, lintDebug, testDebugUnitTest, assembleDebug all green.
  • New RelevanceTest (14 cases, mirrors the desktop suite); AggregatorTest unchanged (7/7).
  • On the searchmob emulator: the query "the cure disintegration album" returns only on-topic results; a Cyrillic query (новости москва) keeps same-script results on top and demotes a stray Latin result to position 11.

Includes the add-relevance-ranking OpenSpec change (validates --strict).

🤖 Generated with Claude Code

Port the desktop relevance signal (shipped in SearchMob-Desktop 26.06.04) to
Android. RRF alone trusts each engine's order, so with mostly single-engine
results the fused scores are near-tied and an off-topic result one engine ranked
highly slips into the top. This adds the missing query-match signal.

`engine/aggregate/Relevance.kt` is a 1:1 port of `engines/relevance.py`: a
stopword-filtered, lightly (ASCII-gated) stemmed lexical coverage score over
title and snippet with a head-term penalty and phrase bonus, plus a
script-relative language affinity. The aggregator folds them into its final
ordering as a demotion-only blend (factor capped at 1.0), so weak or
wrong-language matches sink toward a floor while strong matches keep full engine
consensus weight and the existing deterministic tie-breakers are preserved.
ResultSorter, Personalizer, and DomainRanker consume the order positionally, so
they inherit the improvement unchanged.

It is language-agnostic ahead of the localization pass: tokenization scans
Unicode code points via Character.isLetterOrDigit rather than regex `\w` (which
is ASCII-only in Java/Kotlin and would silently drop all non-Latin text),
English stemming is gated to ASCII so non-Latin words are never corrupted, and
the stopword list degrades harmlessly for other languages.

Verified on the searchmob emulator: an album query returns only on-topic
results, and a Cyrillic query keeps same-script results on top while demoting a
stray Latin result. RelevanceTest mirrors the desktop suite (14 cases);
AggregatorTest is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ErikChevalier ErikChevalier merged commit b50f047 into main Jun 3, 2026
2 checks passed
@ErikChevalier ErikChevalier deleted the feat/relevance-ranking-android branch June 3, 2026 22:17