DEV-1516: return all 50 sample values for per-column search hits#161
Conversation
When `search()` returns a column EntityHit, `render_column_text` now surfaces the full top-50 `Column.sampled_values` (was 20-truncated `Column.sampled` text), plus a `Distinct count: N` follow-up line when overflow is detected. `inspect_model`'s all-columns-at-once markdown table stays capped at 20 for readability. New shared helper `slayer/engine/profiling.py::ensure_column_sample_fresh` owns the cache-miss + persist pattern; both `inspect_model` (categorical loop) and `SearchService` (post-fusion column-hit hook) delegate to it. The search hook auto-refreshes stale legacy (pre-DEV-1480) and count_distinct-failed columns on the spot — grouped by `(data_source, model_name)` so per-model writes serialise while cross-model refreshes parallelise via `asyncio.gather`. `SearchService` gains an optional `engine` kwarg; MCP `create_mcp_server` and REST `create_app` wire it through. Renderer gates on `sampled_values is not None` so an authoritative `[]` skips the line (no fallback to stale `sampled`). The distinct-count follow-up fires only when `distinct_count > len(sampled_values)` to avoid duplicating the legacy `"... (N distinct)"` suffix in the `sampled` fallback path. Known limitation: ranking still uses corpus text captured at index build, so sample-value-token recall past position 20 stays bounded by indexed text until the next `slayer ingest` content-hash re-embeds the column. Only `EntityHit.text` is refreshed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds ensure_column_sample_fresh to refresh stale categorical column samples, prefers structured sampled_values in render output, and wires an optional SlayerQueryEngine into SearchService to refresh column EntityHit text post-RRF; integrates helper into inspect_model and adds tests and docs. ChangesCategorical Column Sample Refresh (DEV-1480/DEV-1516)
Sequence DiagramsequenceDiagram
participant API as REST/MCP API
participant Search as SearchService.search()
participant RRF as RRF Fusion
participant Refresh as _refresh_stale_column_hits
participant Group as Group by (data_source,model_name)
participant Gather as asyncio.gather
participant Ensure as ensure_column_sample_fresh
participant Render as render_column_text
participant Storage as StorageBackend
participant Engine as SlayerQueryEngine
API->>Search: search(query,...)
Search->>RRF: compute entity_hits
RRF-->>Search: fused hits
Search->>Refresh: if engine present
Refresh->>Group: bucket column hits (preserve indices)
Group-->>Refresh: grouped hits
Refresh->>Gather: start per-group tasks
par per-group
Gather->>Ensure: (model, column, engine, storage)
Ensure->>Engine: profile_column(...)
Engine-->>Ensure: sampled, sampled_values, distinct_count
Ensure->>Storage: update_column_sampled(...)
Storage-->>Ensure: persisted or logged failure
Ensure-->>Gather: refreshed column
end
Gather->>Render: render_column_text(refreshed_column)
Render-->>Gather: rendered text
Gather->>Refresh: update EntityHit at original index
Refresh-->>Search: updated entity_hits
Search-->>API: SearchResponse
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
slayer/search/service.py (1)
658-756: ⚡ Quick winExtract the grouping and per-model refresh steps out of
_refresh_stale_column_hits.This helper is already over the repo's cognitive-complexity limit, and it now mixes id parsing, grouping, model loading, refresh execution, and hit rewriting in one place. Pulling the grouping/parsing logic and the per-group worker into small helpers should get Sonar back under the threshold without changing behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@slayer/search/service.py` around lines 658 - 756, _refreshed_stale_column_hits is too complex; split the id-parsing/grouping and the per-model refresh loop into helpers to reduce cognitive complexity: extract the loop that builds groups into a new helper (e.g. _group_column_hits(entity_hits) -> Dict[Tuple[str,str], List[Tuple[int,EntityHit,str]]]) and move the inner async per-model worker into a separate coroutine (e.g. _refresh_group_worker(ds_name, model_name, members, engine, storage, refreshed_by_idx)); ensure the new worker contains the get_model call, per-column calls to ensure_column_sample_fresh and the render_column_text + refreshed_by_idx assignment, and keep the asyncio.gather invocation in _refresh_stale_column_hits to run all workers concurrently and then reconstruct the returned list using refreshed_by_idx exactly as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@slayer/engine/profiling.py`:
- Around line 716-739: The current ensure_column_sample_fresh flow
unconditionally returns column.model_copy with the new sample fields, clobbering
a previously richer column.sampled when the new sample lacks sampled_values or
distinct_count; change the return logic so that after
storage.update_column_sampled (and its except block) you only replace
sampled_values and distinct_count on the returned column when
sample.sampled_values and sample.distinct_count are present—otherwise keep the
existing column.sampled / column.sampled_values / column.distinct_count (i.e.,
return the original column or only update safe flags) so partial/overflow
refreshes do not overwrite richer cached data; refer to
ensure_column_sample_fresh, storage.update_column_sampled, and the return
column.model_copy(update={...}) when implementing this guard.
In `@slayer/search/render.py`:
- Around line 189-191: The current rendering joins column.sampled_values with
commas which makes values containing commas ambiguous; change the code that
builds the line (where column.sampled_values is checked and lines.append is
called) to serialize the list unambiguously (e.g., use
json.dumps(column.sampled_values)) so the EntityHit.text preserves exact values,
and add the required import for json at the top of slayer/search/render.py.
---
Nitpick comments:
In `@slayer/search/service.py`:
- Around line 658-756: _refreshed_stale_column_hits is too complex; split the
id-parsing/grouping and the per-model refresh loop into helpers to reduce
cognitive complexity: extract the loop that builds groups into a new helper
(e.g. _group_column_hits(entity_hits) -> Dict[Tuple[str,str],
List[Tuple[int,EntityHit,str]]]) and move the inner async per-model worker into
a separate coroutine (e.g. _refresh_group_worker(ds_name, model_name, members,
engine, storage, refreshed_by_idx)); ensure the new worker contains the
get_model call, per-column calls to ensure_column_sample_fresh and the
render_column_text + refreshed_by_idx assignment, and keep the asyncio.gather
invocation in _refresh_stale_column_hits to run all workers concurrently and
then reconstruct the returned list using refreshed_by_idx exactly as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 676419cb-afd5-45e7-bd6e-202b48a28574
📒 Files selected for processing (12)
CLAUDE.mddocs/concepts/search.mdslayer/api/server.pyslayer/engine/profiling.pyslayer/mcp/server.pyslayer/search/render.pyslayer/search/service.pytests/integration/test_mcp_inspect.pytests/test_engine_profiling.pytests/test_search_render.pytests/test_search_service.pytests/test_search_surfaces.py
- profiling.ensure_column_sample_fresh: skip persist + return input when overflow-retry returns the generic "> 50 distinct" marker AND the column already has a richer cached sampled text (CR thread 1). - render.render_column_text: JSON-encode sampled_values so values containing commas survive intact (CR thread 2). Updated render + inspect tests, CLAUDE.md + docs/concepts/search.md. - search.service: extract _group_column_hits + _refresh_group_worker helpers; _refresh_stale_column_hits drops below the cognitive- complexity threshold (Sonar S3776 + CR nitpick). - test_search_service: replace tautological assert svc is not None with assert svc._engine is engine — pins what the test name claims (Sonar S5727). - test_engine_profiling: NOSONAR(S7503) on async mock helpers that monkeypatch async production functions; the async keyword is required for the mock to be awaitable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…all-50-sample-values-in-appropriate-contexts # Conflicts: # slayer/search/service.py
|



Summary
search()column EntityHits now surface the full top-50Column.sampled_values(was 20-truncated) plus aDistinct count: Nline on overflow;inspect_modelmarkdown stays at 20 for readability.ensure_column_sample_freshhelper inslayer/engine/profiling.pyowns the cache-miss + persist pattern — used by bothinspect_model's categorical loop andSearchService's new post-fusion column-hit hook (auto-refreshes pre-DEV-1480 legacy and count_distinct-failed columns on the spot).SearchServicegains an optionalenginekwarg; MCPcreate_mcp_serverand RESTcreate_appwire it. Refresh is grouped by(data_source, model_name)so per-model writes serialise; cross-model refreshes parallelise viaasyncio.gather.Renderer contract
sampled_values is not Noneis the gate. Authoritative[]skips the sample-values line (no fallback to stalesampled).Distinct count: Nfollow-up only whendistinct_count > len(sampled_values)— avoids duplicating the legacy"... (N distinct)"suffix on the fallback path.sampledtext.Known limitation
Ranking (BM25 / tantivy / embeddings) still uses corpus text built at index time. The post-fusion hook only refreshes the returned
EntityHit.text. Sample-value-token recall past position 20 stays bounded by indexed text until the nextslayer ingestcontent-hashes the column and re-embeds it. Documented in CLAUDE.md and docs/concepts/search.md.Test plan
poetry run pytest -m "not integration" -q— 3504 passed, 0 failedpoetry run pytest tests/integration/test_mcp_inspect.py tests/integration/test_integration.py -m integration— 134 passedpoetry run ruff check slayer/ tests/— cleanensure_column_sample_freshcache hit / miss / failure semantics + warning logging; search-side refresh including per-model serialization and cross-model concurrency; MCP + REST engine wiring; inspect_model markdown 20-cap regression; legacysampledtext survivesprofile_columnfailure.Linear: DEV-1516
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Documentation
Tests