feat: retrieval detail add matched_count #422
Code Review
This pull request enhances MTEB search adapter traces by attaching relevant documents (qrels) so that mapped matches can be distinguished from relevant matches. It also updates the Semantic Scholar backend documentation and adds corresponding unit tests. One critical review comment points out that _attach_relevant_docs will raise AttributeError on standard MTEB tasks, because it iterates over task.dataset and calls .get() on Hugging Face Dataset objects. The comment suggests retrieving qrels directly from task.relevant_docs instead.
    for hf_subset, splits in getattr(task, "dataset", {}).items():
        for hf_split, data_split in splits.items():
            relevant_docs = data_split.get("relevant_docs", {})
            model.set_relevant_docs(
                task_name,
                hf_split,
                hf_subset,
                relevant_docs,
            )
The current implementation of _attach_relevant_docs has two critical bugs that will cause it to crash on almost all MTEB retrieval tasks:
- AttributeError on standard tasks: For standard retrieval tasks (without subsets), task.dataset is a Hugging Face DatasetDict (mapping split names to Dataset objects). Calling getattr(task, "dataset", {}).items() will yield (hf_split, data_split). The inner loop then attempts to call splits.items() (which is data_split.items()). Since data_split is a Hugging Face Dataset object, this will raise AttributeError: 'Dataset' object has no attribute 'items'.
- AttributeError on Dataset.get: Even if the task has subsets and splits is a DatasetDict, data_split will be a Hugging Face Dataset object. Calling data_split.get("relevant_docs", {}) will raise AttributeError: 'Dataset' object has no attribute 'get' because Dataset does not have a .get() method.
Instead, you should retrieve the loaded qrels directly from task.relevant_docs (which is a standard dictionary of split -> qrels populated after task.load_data()), and get the subset name from task.hf_subset (defaulting to 'default').
Suggested change:

    -for hf_subset, splits in getattr(task, "dataset", {}).items():
    -    for hf_split, data_split in splits.items():
    -        relevant_docs = data_split.get("relevant_docs", {})
    -        model.set_relevant_docs(
    -            task_name,
    -            hf_split,
    -            hf_subset,
    -            relevant_docs,
    -        )
    +hf_subset = getattr(task, "hf_subset", "default")
    +relevant_docs_dict = getattr(task, "relevant_docs", {})
    +if relevant_docs_dict:
    +    for hf_split, relevant_docs in relevant_docs_dict.items():
    +        model.set_relevant_docs(
    +            task_name,
    +            hf_split,
    +            hf_subset,
    +            relevant_docs,
    +        )
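
The difference between the two approaches can be checked without installing MTEB or datasets. The sketch below is illustrative only: FakeSplit, FakeTask, RecordingModel, and both attach_* function names are hypothetical stand-ins, not code from this PR or from MTEB. It assumes the layout described in the comment above: task.dataset maps split names to dataset-like objects that have neither .items() nor .get(), and task.relevant_docs maps split names to qrels.

```python
from typing import Any


class FakeSplit:
    """Hypothetical stand-in for a Hugging Face Dataset: no .items(), no .get()."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows


class FakeTask:
    """Hypothetical stand-in for a loaded retrieval task without subsets."""

    def __init__(self) -> None:
        # Assumed layout: split name -> dataset-like object.
        self.dataset = {"test": FakeSplit([{"id": "q1"}])}
        # Assumed layout: split name -> {query_id: {doc_id: relevance}}.
        self.relevant_docs = {"test": {"q1": {"d1": 1, "d7": 1}}}


class RecordingModel:
    """Hypothetical model that records every set_relevant_docs call."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def set_relevant_docs(self, task_name, hf_split, hf_subset, relevant_docs):
        self.calls.append((task_name, hf_split, hf_subset, relevant_docs))


def attach_from_dataset(model, task, task_name):
    """Original approach: treats task.dataset as subset -> split -> mapping."""
    for hf_subset, splits in getattr(task, "dataset", {}).items():
        for hf_split, data_split in splits.items():
            relevant_docs = data_split.get("relevant_docs", {})
            model.set_relevant_docs(task_name, hf_split, hf_subset, relevant_docs)


def attach_from_relevant_docs(model, task, task_name):
    """Suggested approach: read qrels from task.relevant_docs."""
    hf_subset = getattr(task, "hf_subset", "default")
    relevant_docs_dict = getattr(task, "relevant_docs", {})
    if relevant_docs_dict:
        for hf_split, relevant_docs in relevant_docs_dict.items():
            model.set_relevant_docs(task_name, hf_split, hf_subset, relevant_docs)


task = FakeTask()

# The original approach fails as soon as it calls .items() on a split object.
old_model = RecordingModel()
old_error = None
try:
    attach_from_dataset(old_model, task, "DemoRetrieval")
except AttributeError as exc:
    old_error = exc

# The suggested approach never touches task.dataset.
new_model = RecordingModel()
attach_from_relevant_docs(new_model, task, "DemoRetrieval")

print(type(old_error).__name__, new_model.calls)
```

A regression test for _attach_relevant_docs could follow the same pattern, using a task fixture whose dataset values deliberately lack .items() and .get(), so that any access through task.dataset fails loudly.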
No description provided.