
FTS5 search misses standalone erpimage function due to BM25 ranking bias #141


Description


Problem

When a user asks about the erpimage function, the EEGLAB assistant returns documentation for pop_erpimage and std_erpimage (GUI/STUDY wrappers) but misses the standalone erpimage() function, which is the core signal processing function with the actual parameters researchers need.

This was reported by @arnodelorme and @smakeig.

Root Cause

The search_eeglab_docstrings tool uses SQLite FTS5 with default BM25 ranking. When searching for "erpimage", the standalone erpimage function ranks 10th and falls outside the default limit=5:

| Rank | Symbol            | File                                | BM25 Score |
|------|-------------------|-------------------------------------|------------|
| 1    | pop_erpimage      | functions/popfunc/pop_erpimage.m    | -7.54      |
| 2    | std_erpimage      | functions/studyfunc/std_erpimage.m  | -7.12      |
| 3    | std_readersp      | functions/studyfunc/std_readersp.m  | -6.77      |
| 4    | processerpim      | functions/studyfunc/std_readdata.m  | -6.29      |
| 5    | checkdataerpimage | functions/studyfunc/std_readdata.m  | -6.26      |
| ...  | ...               | ...                                 | ...        |
| 10   | erpimage          | functions/sigprocfunc/erpimage.m    | -4.69      |

The core erpimage function has a 10,000-character docstring. BM25 normalizes term frequency by document length, so the sparse mentions of "erpimage" in that long help text score lower than the dense mentions in the much shorter wrapper functions.
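For reference, SQLite's bm25() implements the standard Okapi BM25 formula (with k1 = 1.2 and b = 0.75):

$$\text{score}(d,t) = \mathrm{IDF}(t)\cdot\frac{f(t,d)\,(k_1+1)}{f(t,d)+k_1\left(1-b+b\cdot\frac{|d|}{\text{avgdl}}\right)}$$

As the length ratio |d|/avgdl grows, the denominator grows and each occurrence of the term contributes less, which is exactly what pushes the long erpimage docstring down to rank 10.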

Possible Solutions

Two complementary approaches (would appreciate input from @arnodelorme and @smakeig on which direction is preferred):

Option A: Boost exact symbol_name matches

Add a pre-filter or ranking boost so that when a query exactly matches a symbol_name, that result always appears first regardless of BM25 score. This is the simplest fix and handles the common case of users searching by function name.
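A minimal sketch of what this could look like in search_docstrings() (column names are taken from the reproduction query below; the real schema in src/knowledge/db.py may differ):

import sqlite3

def search_docstrings(conn: sqlite3.Connection, query: str, limit: int = 5) -> list[tuple]:
    """Return (symbol_name, file_path) hits, exact symbol-name matches first."""
    # 1) An exact symbol_name match always ranks first, regardless of BM25.
    exact = conn.execute(
        "SELECT symbol_name, file_path FROM docstrings WHERE symbol_name = ?",
        (query,),
    ).fetchall()
    seen = {row[0] for row in exact}
    # 2) Fill the remaining slots from the FTS5 index, skipping duplicates.
    fts = conn.execute(
        """
        SELECT d.symbol_name, d.file_path
        FROM docstrings_fts f
        JOIN docstrings d ON f.rowid = d.id
        WHERE docstrings_fts MATCH ?
        ORDER BY f.rank
        LIMIT ?
        """,
        (f'"{query}"', limit),  # quote as an FTS5 phrase to avoid operator parsing
    ).fetchall()
    return (exact + [row for row in fts if row[0] not in seen])[:limit]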

Option B: Summarize large docstrings before indexing

The 10K-character docstring for erpimage includes every parameter, output variable, and example. This level of detail is valuable for the user but hurts search ranking. We could:

  • Store a summarized version (first ~2000 chars, covering the function signature and main parameters) in the FTS5 index
  • Keep the full docstring for display after retrieval
  • This would improve BM25 scores for functions with large help headers (see the sketch below)
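A sketch of the index-time split, assuming a manually synced FTS5 table with symbol_name and doc columns; the actual table definitions in src/knowledge/db.py and the sync flow in docstring_sync.py may differ:

import sqlite3

def summarize_docstring(doc: str, max_chars: int = 2000) -> str:
    """Head of a docstring for indexing, cut at the last complete line."""
    if len(doc) <= max_chars:
        return doc
    # Keeps the signature and main parameter block; drops the long tail
    # of optional parameters, outputs, and examples.
    return doc[:max_chars].rsplit("\n", 1)[0]

def index_docstring(conn: sqlite3.Connection, symbol: str, path: str, doc: str) -> None:
    # The full docstring lives in the base table, for display after retrieval.
    cur = conn.execute(
        "INSERT INTO docstrings (symbol_name, file_path, doc) VALUES (?, ?, ?)",
        (symbol, path, doc),
    )
    # The FTS5 index only ever sees the summary, so BM25 length
    # normalization no longer punishes functions with large help headers.
    conn.execute(
        "INSERT INTO docstrings_fts (rowid, symbol_name, doc) VALUES (?, ?, ?)",
        (cur.lastrowid, symbol, summarize_docstring(doc)),
    )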

@arnodelorme @smakeig -- do you think researchers typically need the full detailed parameter list in the response, or would a summary with a link to the full docs be sufficient? This would help decide between the approaches.

Additional Issue: Dev database not synced

The dev container (osa-dev) has a 0-byte eeglab.db, meaning the docstring search tool returns "Knowledge base not initialized" on dev. Prod has 929 docstrings indexed correctly. The dev database needs to be synced.
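As a quick check after re-syncing (assuming the same docstrings table as in the reproduction query below), the dev count should match prod's 929:

# On dev container -- currently fails because eeglab.db is 0 bytes;
# after re-running src/knowledge/docstring_sync.py this should print 929:
docker exec osa-dev python3 -c '
import sqlite3
conn = sqlite3.connect("/app/data/knowledge/eeglab.db")
print(conn.execute("SELECT COUNT(*) FROM docstrings").fetchone()[0])
'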

Files Involved

  • src/knowledge/search.py - search_docstrings() FTS5 query and ranking
  • src/knowledge/db.py - FTS5 table definitions and column weights
  • src/assistants/eeglab/tools.py - search_eeglab_docstrings tool (limit parameter)
  • src/knowledge/docstring_sync.py - sync script (for dev re-sync)

Reproduction

# On prod container:
docker exec osa python3 -c '
import sqlite3

conn = sqlite3.connect("/app/data/knowledge/eeglab.db")
# Bind the FTS5 phrase query as a parameter; embedding the quoted phrase
# directly in the SQL string breaks across the shell/Python/SQL quoting layers.
rows = conn.execute(
    """
    SELECT d.symbol_name, d.file_path, f.rank
    FROM docstrings_fts f
    JOIN docstrings d ON f.rowid = d.id
    WHERE docstrings_fts MATCH ?
    ORDER BY f.rank
    LIMIT 10
    """,
    ("\"erpimage\"",),
).fetchall()
for symbol, path, rank in rows:
    print(f"rank={rank:.4f} {symbol} - {path}")
'

Metadata

Labels

    P1 (Priority 1: Critical, fix as soon as possible), bug (Something isn't working), chat-experience
