Dense#170

Merged: m1rl0k merged 43 commits into test from dense on Jan 15, 2026

m1rl0k (Collaborator) commented Jan 10, 2026

No description provided.

m1rl0k and others added 17 commits January 9, 2026 23:57
Introduces a new graph_edges module for pre-computed symbol relationship indexing and fast graph queries, with integration into the ingest pipeline for emitting and backfilling call/import edges. Adds an intent_classifier module for semantic query intent detection using embedding similarity. Updates hybrid_search to leverage graph-guided candidate injection and intent-based query routing. Minor updates to pipeline and ingest_code to support graph edge backfill and emission.
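The intent detection described above relies on embedding similarity. A minimal sketch of that idea, assuming toy 3-dimensional vectors and hypothetical names (`EXEMPLARS`, `classify_intent`, the 0.5 threshold); the actual intent_classifier module may be structured differently:

```python
import math

# Hypothetical exemplar embeddings, keyed by intent label. A real classifier
# would obtain these vectors from an embedding model; these are toy values.
EXEMPLARS = {
    "graph": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "search": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_intent(query_vec, exemplars=EXEMPLARS, threshold=0.5):
    """Return (label, score) for the closest exemplar, or ("unknown", score)."""
    best_label, best_score = "unknown", 0.0
    for label, vectors in exemplars.items():
        score = max(cosine(query_vec, v) for v in vectors)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "unknown", best_score
    return best_label, best_score
```

A query whose best similarity stays under the threshold falls through to "unknown", which is where a caller could apply a non-semantic fallback.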
scripts/ingest/metadata.py: Derive _TS_CALL_LANGUAGES dynamically from config keys to prevent drift.
scripts/mcp_impl/context_search.py: Offload blocking run_hybrid_search to a thread to avoid stalling the asyncio event loop.
…hybrid_search, run_pure_dense_search, _retrieve_fn) in asyncio.to_thread across context_search.py, context_answer.py, and search.py to prevent event loop blocking.
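The pattern these two commits describe can be sketched as follows. `run_hybrid_search_blocking` and `context_search` are hypothetical stand-ins, not the project's real functions; only `asyncio.to_thread` (Python 3.9+) is taken from the commit text.

```python
import asyncio
import time


def run_hybrid_search_blocking(query: str) -> list[str]:
    """Stand-in for a blocking search call (hypothetical body)."""
    time.sleep(0.05)  # simulates blocking I/O such as a vector DB round trip
    return [f"result for {query}"]


async def context_search(query: str) -> list[str]:
    # asyncio.to_thread runs the blocking call in a worker thread, so the
    # event loop stays free to service other coroutines in the meantime.
    return await asyncio.to_thread(run_hybrid_search_blocking, query)


async def main() -> list[list[str]]:
    # The two searches overlap instead of running back to back.
    return await asyncio.gather(context_search("a"), context_search("b"))


results = asyncio.run(main())
```

Calling the blocking function directly inside `context_search` would hold the event loop for the full duration of each search.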
Updated LICENSE file from MIT to Business Source License 1.1 (BUSL-1.1), reflecting new licensing terms and restrictions. Updated README and package.json to indicate BUSL-1.1 as the project license.
Added copyright and Business Source License 1.1 headers to all Python scripts in the repository. This ensures proper attribution and clarifies licensing terms for all contributors and users.
Added documentation for the 'symbol_graph' and 'info_request' context tools in CLAUDE.example.md, including usage tips and notes on hydrated results. Updated SKILL.md to mention hydrated source snippets for symbol_graph results, improving clarity on available context features.
Expanded the tool selection table in MCP skill documentation with additional question types and corresponding tools. Refined the description in the context-engine skill to clarify its capabilities and usage.
Introduces LEX_SPARSE_IDF and LEX_SPLADE_MODE configuration options for improved lexical search. The IDF modifier enables BM25-style term weighting for sparse vectors in Qdrant, enhancing precision for technical queries. Documentation and configuration files are updated, and Qdrant collection creation now supports the IDF modifier when enabled.
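As a rough illustration of the IDF option, the sketch below builds the sparse-vector portion of a Qdrant create-collection request body, where `"modifier": "idf"` is the setting Qdrant documents for sparse vectors. The vector name `lex_sparse`, the helper name, and the accepted truthy strings are assumptions, not taken from the PR.

```python
import os


def sparse_vector_config(env=os.environ) -> dict:
    """Build the sparse-vector part of a Qdrant create-collection body."""
    # Which strings count as "enabled" is an assumption for this sketch.
    enabled = env.get("LEX_SPARSE_IDF", "0").strip().lower() in {"1", "true", "yes", "on"}
    params = {}
    if enabled:
        # Ask Qdrant to apply IDF weighting to this sparse vector at query time.
        params["modifier"] = "idf"
    # "lex_sparse" is a hypothetical vector name.
    return {"sparse_vectors": {"lex_sparse": params}}
```

With the variable unset, the payload omits the modifier and the collection keeps plain sparse scoring.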
…me support

Improves CoIR benchmark scripts by adding robust language detection for tasks and entries, supporting multiple CodeSearchNet languages, and propagating task language hints throughout the indexing and retrieval pipeline. Adds resume functionality to the core indexer, allowing interrupted indexing to skip already-indexed documents, and optimizes memory usage by streaming embedding and upserts in mega-batches. Updates runner CLI with new options for search mode and LLM features, and improves graph edge retrieval by caching missing collections to avoid repeated errors. Minor refactorings and comments for clarity.
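The resume and mega-batch behaviour can be pictured as a generator that filters out already-indexed documents and yields fixed-size batches. This is a sketch under assumptions: `pending_batches`, the `"id"` key, and the batch size are hypothetical, and the real core indexer may track progress differently.

```python
from itertools import islice
from typing import Iterable, Iterator


def pending_batches(docs: Iterable[dict], indexed_ids: set, batch_size: int) -> Iterator[list]:
    """Yield batches of documents whose id is not already indexed."""
    # A generator keeps memory bounded: only one batch is materialized at a time.
    pending = (d for d in docs if d["id"] not in indexed_ids)
    while True:
        batch = list(islice(pending, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch would then be embedded and upserted before the next one is read, so an interrupted run only needs the set of stored ids to pick up where it stopped.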
Introduces an optional 'repo' parameter to symbol graph-related functions, allowing queries to be filtered by repository name. Updates documentation and internal calls to support repository-specific searches, improving query precision for multi-repo environments.
Enhances pattern search query mode detection with confidence scoring, AST validation, and embedding-based NL similarity. Adds repo filtering to pattern search for scalable queries. Refactors code to use cached collection config and improves snippet extraction from indexed payloads. Includes a comprehensive test suite for query mode classification covering 100+ cases.
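One way to picture the AST-validation step is the Python-only sketch below, which uses the standard `ast` module. The function name and the confidence values are illustrative assumptions; the PR's detector also uses embedding similarity and handles more than one language.

```python
import ast


def detect_query_mode(query: str) -> tuple[str, float]:
    """Classify a query as "code" or "description" with a rough confidence."""
    try:
        tree = ast.parse(query)
    except SyntaxError:
        # Natural-language text rarely parses as Python.
        return "description", 0.9
    body = tree.body
    if not body:
        return "description", 0.5
    # A bare name or literal parses fine but carries little code signal.
    if (
        len(body) == 1
        and isinstance(body[0], ast.Expr)
        and isinstance(body[0].value, (ast.Name, ast.Constant))
    ):
        return "description", 0.5
    return "code", 0.8
```

Incomplete snippets such as `def foo` fail to parse and land in "description", which is one reason a parse check alone is not enough and a second signal is useful.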
m1rl0k requested a review from voarsh2 on January 11, 2026 02:09
m1rl0k and others added 12 commits January 10, 2026 21:31
This update adds a cache step for Hugging Face embedding models and a pre-download step for the BAAI/bge-base-en-v1.5 model in the CI workflow. These changes improve CI efficiency by reducing redundant downloads and ensuring the model is available before running tests.
Replaces direct assignment to os.environ with monkeypatch.setenv for setting EMBEDDING_MODEL in integration and fallback tests. This improves test isolation and ensures environment variables are properly managed during test execution.
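A small sketch of the test pattern this commit describes. `resolve_embedding_model` is a hypothetical helper invented for the example; `monkeypatch` is pytest's built-in fixture, whose `setenv` undoes the change when the test ends.

```python
import os


def resolve_embedding_model(default: str = "BAAI/bge-base-en-v1.5") -> str:
    """Hypothetical helper: read the model name from the environment."""
    return os.environ.get("EMBEDDING_MODEL", default)


def test_embedding_model_override(monkeypatch):
    # Unlike os.environ[...] = ..., monkeypatch.setenv restores the previous
    # value after the test, so later tests see an unmodified environment.
    monkeypatch.setenv("EMBEDDING_MODEL", "custom/model")
    assert resolve_embedding_model() == "custom/model"
```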
Added new exemplars and regex patterns to improve detection of queries about function/method invocation in intent_classifier.py. Updated tests to allow for ambiguous cases where code and description modes are both valid, reflecting more permissive detection logic.
- Update _infer_language docstring to match actual implementation
- Improve error handling in core_indexer.py: separate transient from critical errors
- Fix corpus fingerprint to use unfiltered docs (prevents incorrect hash on resume)
- Use actual collection size in hybrid_search.py router (not hardcoded 10000)
- Fix symbol_path logic: only extract edges for true symbol identifiers
- Fix race condition in is_graph_intent: return fallback_used flag
- Prevent partial _EXEMPLAR_EMBEDDINGS init on failure (use temp dict)
- Add defensive int parsing in symbol_graph.py line number extraction
- Remove duplicate collection config cache in search.py
- Improve delete_graph_edges_by_path: include callee_path, return actual count
- Add dedicated graph_queries counter to OptimizationStats
- Fix avg_ef_used computation (use dedicated counter, not total_queries)
- Use get_boolean_env utility for consistency in pseudo.py
- Add mode assertions to pattern detection tests
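The "use temp dict" fix in the list above can be sketched like this. The cache and function names are hypothetical (chosen to echo `_EXEMPLAR_EMBEDDINGS`), and `embed` is any callable that turns text into a vector.

```python
_EXEMPLAR_CACHE: dict = {}  # hypothetical module-level cache


def get_exemplar_embeddings() -> dict:
    return _EXEMPLAR_CACHE


def init_exemplar_embeddings(exemplars: dict, embed) -> bool:
    """Fill the cache in one step; leave it untouched if any embedding fails."""
    global _EXEMPLAR_CACHE
    staged = {}
    try:
        for intent, texts in exemplars.items():
            staged[intent] = [embed(t) for t in texts]
    except Exception:
        return False  # nothing from the partial run is published
    _EXEMPLAR_CACHE = staged  # a single assignment publishes the full result
    return True
```

Writing straight into the shared dict would leave some intents populated and others missing after a mid-run failure, and later lookups could not tell that state apart from a completed initialization.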
voarsh and others added 4 commits January 11, 2026 04:03
Introduces subgraph context injection in context_answer, enhancing retrieval results with 1-hop graph neighbors based on AST-extracted symbols. Adds a configurable GRAPH_CONNECTION_BOOST and mode-based routing for graph/dense/lexical weights in hybrid_search. Extends symbol_graph and _symbol_graph_impl to support multi-hop (depth) traversal, with API and result structure updates. These changes improve context relevance by leveraging code connectivity in search and answer generation.
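Multi-hop (depth) traversal over call edges is essentially a bounded breadth-first search. The sketch below assumes an in-memory adjacency dict and hypothetical names and defaults; the PR's implementation queries graph collections rather than a dict.

```python
from collections import deque


def neighbors_within_depth(edges: dict, start: str, depth: int = 1, max_neighbors: int = 5) -> list:
    """Breadth-first walk over caller -> callee edges, up to `depth` hops.

    Returns (symbol, hops) pairs in discovery order.
    """
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue and len(found) < max_neighbors:
        symbol, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand beyond the requested depth
        for callee in edges.get(symbol, []):
            if callee in seen:
                continue  # guards against cycles such as a -> b -> a
            seen.add(callee)
            found.append((callee, hops + 1))
            if len(found) == max_neighbors:
                break
            queue.append((callee, hops + 1))
    return found
```

With `depth=1` this returns the 1-hop neighbors used for context injection; raising `depth` widens the subgraph, with `max_neighbors` capping how much extra context is pulled in.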
Introduces a --skip-index flag to the CoIR benchmark runner and retriever to allow skipping corpus indexing and using existing collection data. Adds support for the TRUST_STORED_FINGERPRINT environment variable in core_indexer.py to trust stored fingerprints and skip recomputation, improving flexibility for repeated runs. Also updates hybrid config to include GRAPH_CONNECTION_BOOST in __all__.
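To make the fingerprint shortcut concrete, here is a sketch that hashes document ids and skips the recomputation when `TRUST_STORED_FINGERPRINT` is set. The hashing scheme, the function names, and the accepted truthy strings are assumptions; only the environment variable name comes from the commit.

```python
import hashlib
import os


def corpus_fingerprint(doc_ids) -> str:
    """Order-independent SHA-256 over document ids."""
    digest = hashlib.sha256()
    for doc_id in sorted(doc_ids):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()


def needs_reindex(doc_ids, stored_fingerprint, env=os.environ) -> bool:
    """Decide whether the corpus must be indexed again."""
    if stored_fingerprint is None:
        return True
    if env.get("TRUST_STORED_FINGERPRINT", "").strip().lower() in {"1", "true", "yes"}:
        return False  # trust the stored value and skip recomputation
    return corpus_fingerprint(doc_ids) != stored_fingerprint
```

The trade-off is that trusting the stored value saves a pass over the corpus on repeated runs but will not notice if the corpus changed in between.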
Strips verbose path fields from citations in MCP output, keeping only essential fields for agents. Improves rel_path extraction logic in both Python and JS to prefer server-provided values and ensure repo-relative paths are accurate.
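A sketch of what slimming a citation might look like. The field names (`rel_path`, `path`, `start_line`, `end_line`, `symbol`) and the helper are hypothetical; the commit does not list which fields are kept.

```python
# Hypothetical set of fields an agent needs from a citation.
ESSENTIAL_FIELDS = ("rel_path", "start_line", "end_line", "symbol")


def slim_citation(citation: dict, repo_root: str = "") -> dict:
    """Keep essential fields; prefer a server-provided rel_path when present."""
    slim = {k: citation[k] for k in ESSENTIAL_FIELDS if k in citation}
    if "rel_path" not in slim and "path" in citation:
        # Fall back to deriving a repo-relative path from the absolute one.
        path = citation["path"]
        prefix = repo_root.rstrip("/") + "/"
        if repo_root and path.startswith(prefix):
            path = path[len(prefix):]
        slim["rel_path"] = path
    return slim
```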
m1rl0k (Collaborator, Author) commented Jan 11, 2026

auggie review

augmentcode bot commented Jan 11, 2026

🤖 Augment PR Summary

Summary: This PR expands Context-Engine’s retrieval stack with graph-aware navigation, improved hybrid query routing, and more robust benchmarking/indexing utilities.

Changes:

  • Adds pre-computed graph-edge collections in Qdrant (*_graph) and integrates them across ingestion, pruning, and watch indexing.
  • Enhances symbol_graph with graph-collection queries, result hydration (snippets + line ranges), repo filtering, and optional multi-hop traversal.
  • Upgrades hybrid search with a unified query optimizer, dense-only routing, graph-guided candidate injection, and a graph-connection scoring component.
  • Improves MCP tools: context_answer can inject subgraph neighbors; citations are slimmed to reduce payload size.
  • Strengthens pattern search by adding AST+embedding-based “code vs description” detection, repo filtering, and cached collection capability checks.
  • Updates CoIR/benchmark tooling (multi-language defaults, language inference, streaming embed+upsert, resume/skip indexing) and adds sparse IDF weighting support.
  • CI now caches and pre-downloads the embedding model; docs and the MCP bridge package version are updated.

Technical Notes: Introduces/threads multiple env knobs (e.g., LEX_SPARSE_IDF, HYBRID_GRAPH_CONNECTION_BOOST, CONTEXT_ANSWER_SUBGRAPH, TRUST_STORED_FINGERPRINT) to tune behavior without rework.

augmentcode bot left a comment

Review completed. 2 suggestions posted.

Update graph_backfill_tick to ensure old edges are deleted once per path and clarify that only caller_path is used for edge deletion. Adjust prune.py to remove callee_path from deletion filter, reflecting that graph edges only store caller_path, and update related comments for accuracy.
Introduces system-aware defaults and environment variables for memory management and batch sizing on unified memory systems (Apple Silicon). Adds GPU/CoreML acceleration support for embedding and reranking, with ONNX provider selection and a new --gpu flag in the runner. Suppresses noisy httpx logs and improves environment setup for reproducible benchmarking.
Improves AST analysis to track caller context for function/method calls across Python, JS, Go, Rust, Java, C/C++, and Ruby. Indexing pipeline now extracts and stores chunk- and symbol-level calls/imports, enabling precise symbol-to-symbol call graph edges. Updates chunking, graph edge extraction, and ingestion to leverage richer AST metadata, and adds MRR metric to CoIR benchmarks. Documentation updated to clarify call graph schema and query semantics.
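Tracking caller context means recording which enclosing function each call appears in. A Python-only sketch using the standard `ast` module follows; `extract_call_edges` is a hypothetical name, and the PR's multi-language analysis is not shown in this conversation.

```python
import ast


def extract_call_edges(source: str) -> list:
    """Return (caller, callee) pairs for calls made inside function bodies."""
    edges = []

    class Visitor(ast.NodeVisitor):
        def __init__(self):
            self.stack = []  # names of the enclosing functions, innermost last

        def visit_FunctionDef(self, node):
            self.stack.append(node.name)
            self.generic_visit(node)
            self.stack.pop()

        visit_AsyncFunctionDef = visit_FunctionDef

        def visit_Call(self, node):
            func = node.func
            if isinstance(func, ast.Name):
                callee = func.id        # plain call: foo()
            elif isinstance(func, ast.Attribute):
                callee = func.attr      # method call: obj.foo()
            else:
                callee = None
            # Module-level calls have no enclosing function and are skipped.
            if callee and self.stack:
                edges.append((self.stack[-1], callee))
            self.generic_visit(node)    # descend into nested calls

    Visitor().visit(ast.parse(source))
    return edges
```

Each pair corresponds to one symbol-to-symbol edge; a chunk-level index would only know that the file mentions both names, not which function calls which.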
Revised the CoIR benchmark section to reflect updated evaluation metrics, including new NDCG@10 scores and query counts for Python, Go, and JavaScript. Clarified that results are for dense retrieval using Jina-Code embeddings.
Refines test_ingest_chunking.py to ignore additional keys in chunk comparison, improves test_intent_confidence.py to robustly locate specific log events, and updates test_symbol_graph_tool.py to check for MatchText instead of MatchValue for path prefix matching.
m1rl0k (Collaborator, Author) commented Jan 12, 2026

augment review

augmentcode bot commented Jan 12, 2026

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

Lowers the max_neighbors parameter from 5 to 2 when calling the subgraph context injection. This may improve performance or relevance by limiting the number of neighbors considered.
Enhanced symbol hydration logic to support both path and symbol-based lookups, improving accuracy for callees without explicit paths. Updated graph edge queries to prefer caller_symbol over caller_path for more precise matching. Refactored symbol selection in chunking to handle both object and dict symbol representations. These changes improve robustness and correctness in symbol graph operations and hydration.
Updated the comment for the WATCH_USE_POLLING environment variable to indicate it should be set on Mac OS X. No functional changes were made.
Introduces support for asymmetric embedding via environment variable, enabling use of query_embed and passage_embed for models like Jina v3. Refactors doc_id extraction logic for consistency across retriever, hybrid search, and MCP search. Adds registration for Snowflake Arctic v2.0 model in embedder. Implements query variant fusion in dense search for improved recall.
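The commit does not say how query variants are fused. As one plausible approach, the sketch below uses reciprocal rank fusion (RRF), a technique swapped in for illustration; the function name and the constant `k = 60` are assumptions.

```python
def fuse_variants(result_lists: list, k: int = 60) -> list:
    """Reciprocal rank fusion over ranked doc-id lists, one per query variant."""
    scores = {}
    for ranked in result_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            # Documents ranked highly by several variants accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

RRF needs only ranks, not comparable similarity scores, which makes it convenient when the variants are embedded separately.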
m1rl0k merged commit 01ef58d into test on Jan 15, 2026
1 check passed
m1rl0k deleted the dense branch on January 23, 2026 13:20
3 participants