Description
Problem
When a user asks about the erpimage function, the EEGLAB assistant returns documentation for pop_erpimage and std_erpimage (GUI/STUDY wrappers) but misses the standalone erpimage() function, which is the core signal processing function with the actual parameters researchers need.
This was reported by @arnodelorme and @smakeig.
Root Cause
The search_eeglab_docstrings tool uses SQLite FTS5 with default BM25 ranking. When searching for "erpimage", the standalone erpimage function ranks 10th and falls outside the default limit=5 (FTS5 reports BM25 scores as negative numbers, so a more negative score is a better match):
| Rank | Symbol | File | BM25 Score |
|---|---|---|---|
| 1 | pop_erpimage | functions/popfunc/pop_erpimage.m | -7.54 |
| 2 | std_erpimage | functions/studyfunc/std_erpimage.m | -7.12 |
| 3 | std_readersp | functions/studyfunc/std_readersp.m | -6.77 |
| 4 | processerpim | functions/studyfunc/std_readdata.m | -6.29 |
| 5 | checkdataerpimage | functions/studyfunc/std_readdata.m | -6.26 |
| ... | ... | ... | ... |
| 10 | erpimage | functions/sigprocfunc/erpimage.m | -4.69 |
The core erpimage function has a 10,000-character docstring. BM25 normalizes term frequency by document length, so this long docstring scores lower for "erpimage" than the shorter wrapper functions that mention the term more densely.
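The length effect described above can be reproduced in isolation. The sketch below is a toy setup, not the project's schema: the `docs` table, its two columns, and all row contents are made up for illustration. Both "erpimage" rows mention the term the same number of times, so only docstring length differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-column table; the real schema lives in src/knowledge/db.py.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(symbol_name, docstring)")

# Stand-in for a long help header: many parameter words, one mention of the term.
long_help = "erpimage plots sorted single trials " + " ".join(
    f"param{i}" for i in range(1500)
)
conn.executemany(
    "INSERT INTO docs (symbol_name, docstring) VALUES (?, ?)",
    [
        ("pop_erpimage", "GUI wrapper that calls erpimage"),
        ("erpimage", long_help),
        # Unrelated entries so the term is not present in every row.
        ("topoplot", "plot a scalp map"),
        ("runica", "run independent component analysis"),
        ("pop_loadset", "load a dataset"),
        ("eeg_checkset", "check dataset consistency"),
    ],
)

# bm25() returns lower (more negative) values for better matches.
rows = conn.execute(
    "SELECT symbol_name, bm25(docs) AS score FROM docs "
    "WHERE docs MATCH ? ORDER BY bm25(docs)",
    ('"erpimage"',),
).fetchall()
for name, score in rows:
    print(f"{score:8.4f}  {name}")
```

In this toy data the short wrapper entry sorts ahead of the long standalone entry, which mirrors the ranking in the table above.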
Possible Solutions
Two complementary approaches (would appreciate input from @arnodelorme and @smakeig on which direction is preferred):
Option A: Boost exact symbol_name matches
Add a pre-filter or ranking boost so that when a query exactly matches a symbol_name, that result always appears first regardless of BM25 score. This is the simplest fix and handles the common case of users searching by function name.
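A minimal sketch of Option A, assuming a `docstrings` table with `id`, `symbol_name`, and `file_path` columns and a `docstrings_fts` table keyed by the same row id (as the reproduction query suggests). The function name `search_with_exact_boost` and the demo schema are hypothetical; the actual `search_docstrings()` may be structured differently. The exact lookup runs separately from the FTS query, so the match is found even when BM25 would place it outside the limit.

```python
import sqlite3


def search_with_exact_boost(conn, query, limit=5):
    """Return up to `limit` rows, with exact symbol_name matches first."""
    name = query.strip()
    if not name:
        return []
    # Step 1: exact (case-insensitive) lookup on the symbol name.
    exact = conn.execute(
        "SELECT id, symbol_name, file_path FROM docstrings "
        "WHERE symbol_name = ? COLLATE NOCASE",
        (name,),
    ).fetchall()
    # Step 2: the usual BM25-ranked full-text search, quoted as one phrase.
    phrase = '"' + name.replace('"', '""') + '"'
    ranked = conn.execute(
        "SELECT d.id, d.symbol_name, d.file_path "
        "FROM docstrings_fts f JOIN docstrings d ON f.rowid = d.id "
        "WHERE docstrings_fts MATCH ? ORDER BY rank LIMIT ?",
        (phrase, limit),
    ).fetchall()
    # Step 3: merge, dropping duplicates, and apply the limit last.
    seen = {row[0] for row in exact}
    merged = exact + [row for row in ranked if row[0] not in seen]
    return merged[:limit]


# Tiny in-memory stand-in for eeglab.db (hypothetical schema and contents).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docstrings (id INTEGER PRIMARY KEY, symbol_name TEXT, "
    "file_path TEXT, docstring TEXT)"
)
conn.execute("CREATE VIRTUAL TABLE docstrings_fts USING fts5(symbol_name, docstring)")
entries = [
    ("pop_erpimage", "functions/popfunc/pop_erpimage.m", "GUI wrapper that calls erpimage"),
    ("std_erpimage", "functions/studyfunc/std_erpimage.m", "STUDY wrapper around erpimage"),
    ("erpimage", "functions/sigprocfunc/erpimage.m",
     "erpimage plots sorted single trials " + "option " * 1500),
]
for name, path, doc in entries:
    cur = conn.execute(
        "INSERT INTO docstrings (symbol_name, file_path, docstring) VALUES (?, ?, ?)",
        (name, path, doc),
    )
    conn.execute(
        "INSERT INTO docstrings_fts (rowid, symbol_name, docstring) VALUES (?, ?, ?)",
        (cur.lastrowid, name, doc),
    )

results = search_with_exact_boost(conn, "erpimage", limit=2)
print([row[1] for row in results])
```

Applying the limit after the merge is the point of the design: the exact match takes the first slot and the remaining slots are filled in BM25 order.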
Option B: Summarize large docstrings before indexing
The 10K-character docstring for erpimage includes every parameter, output variable, and example. This level of detail is valuable for the user but hurts search ranking. We could:
- Store a summarized version (first ~2000 chars, covering the function signature and main parameters) in the FTS5 index
- Keep the full docstring for display after retrieval
- This would improve BM25 scores for functions with large help headers
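The truncation step in Option B could look roughly like the sketch below. The helper name `summarize_docstring` and the choice to cut at the last line break before the limit are assumptions for illustration; the 2000-character figure comes from the proposal above. Only the summary would go into the FTS5 index, while the full docstring stays in the regular table for display.

```python
def summarize_docstring(docstring, max_chars=2000):
    """Return the leading part of a docstring, cut at a line boundary."""
    if len(docstring) <= max_chars:
        return docstring
    # Prefer the last newline before the limit so no line is split in half.
    cut = docstring.rfind("\n", 0, max_chars)
    if cut <= 0:
        # No usable line break: fall back to a hard cut.
        cut = max_chars
    return docstring[:cut].rstrip()


# Example with a synthetic help header of many parameter lines.
full_help = "\n".join(f"% line {i}: parameter description" for i in range(500))
index_text = summarize_docstring(full_help)
print(len(full_help), len(index_text))
```

A plain prefix keeps the function signature and the first parameters, which is where MATLAB help headers usually put the most important information, but it makes no attempt to judge which parameters matter.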
@arnodelorme @smakeig -- do you think researchers typically need the full detailed parameter list in the response, or would a summary with a link to the full docs be sufficient? This would help decide between the approaches.
Additional Issue: Dev database not synced
The dev container (osa-dev) has a 0-byte eeglab.db, meaning the docstring search tool returns "Knowledge base not initialized" on dev. Prod has 929 docstrings indexed correctly. The dev database needs to be synced.
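A small check along these lines could make the dev/prod difference visible before the search tool is called. This is a sketch only: the helper name `knowledge_db_status` is hypothetical, and the `docstrings` table name is taken from the reproduction query rather than from the actual schema definition.

```python
import os
import sqlite3


def knowledge_db_status(path):
    """Describe the state of a knowledge database file in one short string."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # Matches the 0-byte eeglab.db seen on the dev container.
        return "empty file"
    conn = sqlite3.connect(path)
    try:
        count = conn.execute("SELECT COUNT(*) FROM docstrings").fetchone()[0]
    except sqlite3.DatabaseError as exc:
        # Covers a missing table as well as a corrupt or non-SQLite file.
        return f"unreadable: {exc}"
    finally:
        conn.close()
    return f"{count} docstrings indexed"
```

For example, `knowledge_db_status("/app/data/knowledge/eeglab.db")` could be run in both containers to confirm whether a re-sync is needed.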
Files Involved
- src/knowledge/search.py - search_docstrings() FTS5 query and ranking
- src/knowledge/db.py - FTS5 table definitions and column weights
- src/assistants/eeglab/tools.py - search_eeglab_docstrings tool (limit parameter)
- src/knowledge/docstring_sync.py - sync script (for dev re-sync)
Reproduction
# On prod container:
docker exec osa python3 -c '
import sqlite3

conn = sqlite3.connect("/app/data/knowledge/eeglab.db")
# Bind the FTS5 phrase as a parameter to avoid nested-quote escaping issues.
rows = conn.execute("""
    SELECT d.symbol_name, d.file_path, rank
    FROM docstrings_fts f
    JOIN docstrings d ON f.rowid = d.id
    WHERE docstrings_fts MATCH ?
    ORDER BY rank LIMIT 10
""", ("\"erpimage\"",)).fetchall()
for r in rows:
    print(f"rank={r[2]:.4f} {r[0]} - {r[1]}")
'