Dense by m1rl0k · Pull Request #170 · Context-Engine-AI/Context-Engine

m1rl0k · 2026-01-10T05:03:46Z

No description provided.

Introduces a new graph_edges module for pre-computed symbol relationship indexing and fast graph queries, with integration into the ingest pipeline for emitting and backfilling call/import edges. Adds an intent_classifier module for semantic query intent detection using embedding similarity. Updates hybrid_search to leverage graph-guided candidate injection and intent-based query routing. Minor updates to pipeline and ingest_code to support graph edge backfill and emission.

scripts/ingest/metadata.py: Derive _TS_CALL_LANGUAGES dynamically from config keys to prevent drift. scripts/mcp_impl/context_search.py: Offload blocking run_hybrid_search to thread to avoid stalling the asyncio event loop.

scripts/mcp_impl: Systematically wrap blocking Qdrant I/O calls (run_hybrid_search, run_pure_dense_search, _retrieve_fn) in asyncio.to_thread across context_search.py, context_answer.py, and search.py to prevent event loop blocking.
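The offloading pattern these commits describe can be sketched as follows. This is a minimal illustration, not the PR's code: `blocking_search` is a hypothetical stand-in for a synchronous function such as run_hybrid_search (whose real signature is not shown here), and only `asyncio.to_thread` is an actual standard-library call.

```python
import asyncio
import time


def blocking_search(query: str) -> list[str]:
    """Hypothetical stand-in for a synchronous call such as run_hybrid_search."""
    time.sleep(0.05)  # simulates blocking Qdrant I/O
    return [f"result for {query}"]


async def search_handler(query: str) -> list[str]:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_search, query)


async def main() -> list[list[str]]:
    # The two handlers overlap instead of running back to back.
    return await asyncio.gather(search_handler("a"), search_handler("b"))


results = asyncio.run(main())
```

Calling `blocking_search` directly inside `search_handler` would stall every other coroutine for the duration of the call; the thread hop avoids that at the cost of one worker thread per in-flight request.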

Updated LICENSE file from MIT to Business Source License 1.1 (BUSL-1.1), reflecting new licensing terms and restrictions. Updated README and package.json to indicate BUSL-1.1 as the project license.

Added copyright and Business Source License 1.1 headers to all Python scripts in the repository. This ensures proper attribution and clarifies licensing terms for all contributors and users.

Added documentation for the 'symbol_graph' and 'info_request' context tools in CLAUDE.example.md, including usage tips and notes on hydrated results. Updated SKILL.md to mention hydrated source snippets for symbol_graph results, improving clarity on available context features.

Expanded the tool selection table in MCP skill documentation with additional question types and corresponding tools. Refined the description in the context-engine skill to clarify its capabilities and usage.

Introduces LEX_SPARSE_IDF and LEX_SPLADE_MODE configuration options for improved lexical search. The IDF modifier enables BM25-style term weighting for sparse vectors in Qdrant, enhancing precision for technical queries. Documentation and configuration files are updated, and Qdrant collection creation now supports the IDF modifier when enabled.
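To make "BM25-style term weighting" concrete, here is a small pure-Python sketch of an IDF computation. It uses the common BM25 form ln(1 + (N - n + 0.5) / (n + 0.5)); that Qdrant applies exactly this expression when the modifier is enabled is an assumption here, and `idf_weights` is a hypothetical helper, not part of this PR.

```python
import math
from collections import Counter


def idf_weights(docs: list[list[str]]) -> dict[str, float]:
    """BM25-style IDF per term: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        for term, n in doc_freq.items()
    }


# Toy corpus (assumed data): "def" is everywhere, "qdrant" is rare.
docs = [
    ["def", "upsert", "qdrant"],
    ["def", "search"],
    ["def", "embed"],
]
weights = idf_weights(docs)
```

Terms that appear in every document ("def") receive a weight near zero, while rare identifiers ("qdrant") dominate the score, which is the effect that helps precision on technical queries.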

…me support

Improves CoIR benchmark scripts by adding robust language detection for tasks and entries, supporting multiple CodeSearchNet languages, and propagating task language hints throughout the indexing and retrieval pipeline. Adds resume functionality to the core indexer, allowing interrupted indexing to skip already-indexed documents, and optimizes memory usage by streaming embedding and upserts in mega-batches. Updates runner CLI with new options for search mode and LLM features, and improves graph edge retrieval by caching missing collections to avoid repeated errors. Also includes minor refactorings and comments for clarity.
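The resume and mega-batch behavior can be sketched with two generators. The names `pending_docs` and `mega_batches`, the document shape, and the batch size are all hypothetical; the actual core indexer is not shown in this PR description.

```python
from itertools import islice
from typing import Iterable, Iterator


def pending_docs(docs: Iterable[dict], indexed_ids: set[str]) -> Iterator[dict]:
    """Yield only documents whose id is not already in the collection."""
    for doc in docs:
        if doc["id"] not in indexed_ids:
            yield doc


def mega_batches(docs: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Stream fixed-size batches without materializing the whole corpus."""
    it = iter(docs)
    while batch := list(islice(it, size)):
        yield batch


corpus = [{"id": f"doc{i}"} for i in range(7)]
already_indexed = {"doc0", "doc1", "doc2"}  # assumed state after an interrupted run

batches = list(mega_batches(pending_docs(corpus, already_indexed), size=3))
batch_ids = [[d["id"] for d in b] for b in batches]
```

In a real run each batch would be embedded and upserted before the next one is pulled, so peak memory is bounded by the batch size rather than the corpus size.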

Introduces an optional 'repo' parameter to symbol graph-related functions, allowing queries to be filtered by repository name. Updates documentation and internal calls to support repository-specific searches, improving query precision for multi-repo environments.

Enhances pattern search query mode detection with confidence scoring, AST validation, and embedding-based NL similarity. Adds repo filtering to pattern search for scalable queries. Refactors code to use cached collection config and improves snippet extraction from indexed payloads. Includes a comprehensive test suite for query mode classification covering 100+ cases.
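As a rough illustration of AST validation with a confidence score, the sketch below uses Python's `ast` module to decide whether a query looks like code. The function name, the node list, and the confidence values are illustrative assumptions; the PR's detector also uses embedding similarity, which is omitted here.

```python
import ast

# Node types that suggest the query is real code rather than prose (assumed list).
_CODE_NODES = (ast.Call, ast.FunctionDef, ast.ClassDef, ast.Assign,
               ast.Import, ast.ImportFrom, ast.For, ast.While, ast.With)


def detect_query_mode(query: str) -> tuple[str, float]:
    """Return (mode, confidence); mode is 'code' or 'description'."""
    try:
        tree = ast.parse(query)
    except SyntaxError:
        # Natural-language queries almost never parse as Python.
        return "description", 0.9
    if any(isinstance(node, _CODE_NODES) for node in ast.walk(tree)):
        return "code", 0.9
    # Parses, but only as bare names or literals: ambiguous, lean to description.
    return "description", 0.5


code_mode = detect_query_mode("for x in items: print(x)")
prose_mode = detect_query_mode("find functions that retry on failure")
bare_word = detect_query_mode("retry")
```

The low-confidence branch matters because a single word such as `retry` is valid Python, so a successful parse alone is not evidence of a code query.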

This update adds a cache step for Hugging Face embedding models and a pre-download step for the BAAI/bge-base-en-v1.5 model in the CI workflow. These changes improve CI efficiency by reducing redundant downloads and ensuring the model is available before running tests.

Replaces direct assignment to os.environ with monkeypatch.setenv for setting EMBEDDING_MODEL in integration and fallback tests. This improves test isolation and ensures environment variables are properly managed during test execution.

Added new exemplars and regex patterns to improve detection of queries about function/method invocation in intent_classifier.py. Updated tests to allow for ambiguous cases where code and description modes are both valid, reflecting more permissive detection logic.

- Update _infer_language docstring to match actual implementation
- Improve error handling in core_indexer.py: separate transient from critical errors
- Fix corpus fingerprint to use unfiltered docs (prevents incorrect hash on resume)
- Use actual collection size in hybrid_search.py router (not hardcoded 10000)
- Fix symbol_path logic: only extract edges for true symbol identifiers
- Fix race condition in is_graph_intent: return fallback_used flag
- Prevent partial _EXEMPLAR_EMBEDDINGS init on failure (use temp dict)
- Add defensive int parsing in symbol_graph.py line number extraction
- Remove duplicate collection config cache in search.py
- Improve delete_graph_edges_by_path: include callee_path, return actual count
- Add dedicated graph_queries counter to OptimizationStats
- Fix avg_ef_used computation (use dedicated counter, not total_queries)
- Use get_boolean_env utility for consistency in pseudo.py
- Add mode assertions to pattern detection tests

Introduces subgraph context injection in context_answer, enhancing retrieval results with 1-hop graph neighbors based on AST-extracted symbols. Adds a configurable GRAPH_CONNECTION_BOOST and mode-based routing for graph/dense/lexical weights in hybrid_search. Extends symbol_graph and _symbol_graph_impl to support multi-hop (depth) traversal, with API and result structure updates. These changes improve context relevance by leveraging code connectivity in search and answer generation.
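Multi-hop (depth) traversal over call edges can be sketched as a bounded breadth-first search. The edge table and the `neighbors_within` helper are hypothetical; the PR stores edges in a Qdrant collection rather than an in-memory dict.

```python
from collections import deque

# Hypothetical call edges: caller -> callees.
EDGES = {
    "main": ["load_config", "run"],
    "run": ["search", "rerank"],
    "search": ["embed"],
}


def neighbors_within(start: str, depth: int) -> dict[str, int]:
    """Breadth-first traversal returning each reachable symbol and its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        symbol = queue.popleft()
        if hops[symbol] == depth:
            continue  # do not expand past the requested depth
        for callee in EDGES.get(symbol, []):
            if callee not in hops:
                hops[callee] = hops[symbol] + 1
                queue.append(callee)
    del hops[start]  # the start symbol is not its own neighbor
    return hops


one_hop = neighbors_within("main", depth=1)
two_hop = neighbors_within("main", depth=2)
```

With `depth=1` this reduces to the 1-hop neighbor lookup used for subgraph context injection; larger depths widen the context at the cost of more edge queries.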

Introduces a --skip-index flag to the CoIR benchmark runner and retriever to allow skipping corpus indexing and using existing collection data. Adds support for the TRUST_STORED_FINGERPRINT environment variable in core_indexer.py to trust stored fingerprints and skip recomputation, improving flexibility for repeated runs. Also updates hybrid config to include GRAPH_CONNECTION_BOOST in __all__.

Strips verbose path fields from citations in MCP output, keeping only essential fields for agents. Improves rel_path extraction logic in both Python and JS to prefer server-provided values and ensure repo-relative paths are accurate.
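A minimal sketch of citation slimming is shown below. The field names in `ESSENTIAL_FIELDS` and in the sample citation are assumptions chosen for illustration; the PR description does not list which fields are kept.

```python
# Assumed set of fields an agent needs; the real list is not given in the PR.
ESSENTIAL_FIELDS = ("rel_path", "start_line", "end_line")


def slim_citation(citation: dict) -> dict:
    """Keep only the fields listed in ESSENTIAL_FIELDS that are present."""
    return {k: citation[k] for k in ESSENTIAL_FIELDS if k in citation}


# Hypothetical verbose citation with redundant absolute paths.
raw = {
    "rel_path": "scripts/prune.py",
    "path": "/work/repo/scripts/prune.py",
    "host_path": "/Users/dev/repo/scripts/prune.py",
    "start_line": 10,
    "end_line": 42,
}
slim = slim_citation(raw)
```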

m1rl0k · 2026-01-11T17:21:17Z

auggie review

augmentcode · 2026-01-11T17:25:37Z

🤖 Augment PR Summary

Summary: This PR expands Context-Engine’s retrieval stack with graph-aware navigation, improved hybrid query routing, and more robust benchmarking/indexing utilities.

Changes:

- Adds pre-computed graph-edge collections in Qdrant (*_graph) and integrates them across ingestion, pruning, and watch indexing.
- Enhances symbol_graph with graph-collection queries, result hydration (snippets + line ranges), repo filtering, and optional multi-hop traversal.
- Upgrades hybrid search with a unified query optimizer, dense-only routing, graph-guided candidate injection, and a graph-connection scoring component.
- Improves MCP tools: context_answer can inject subgraph neighbors; citations are slimmed to reduce payload size.
- Strengthens pattern search by adding AST+embedding-based “code vs description” detection, repo filtering, and cached collection capability checks.
- Updates CoIR/benchmark tooling (multi-language defaults, language inference, streaming embed+upsert, resume/skip indexing) and adds sparse IDF weighting support.
- CI now caches and pre-downloads the embedding model; docs and the MCP bridge package version are updated.

Technical Notes: Introduces/threads multiple env knobs (e.g., LEX_SPARSE_IDF, HYBRID_GRAPH_CONNECTION_BOOST, CONTEXT_ANSWER_SUBGRAPH, TRUST_STORED_FINGERPRINT) to tune behavior without rework.

augmentcode

Review completed. 2 suggestions posted.

Comment augment review to trigger a new review at any time.

scripts/prune.py

scripts/ingest/pipeline.py

Update graph_backfill_tick to ensure old edges are deleted once per path and clarify that only caller_path is used for edge deletion. Adjust prune.py to remove callee_path from deletion filter, reflecting that graph edges only store caller_path, and update related comments for accuracy.

Introduces system-aware defaults and environment variables for memory management and batch sizing on unified memory systems (Apple Silicon). Adds GPU/CoreML acceleration support for embedding and reranking, with ONNX provider selection and a new --gpu flag in the runner. Suppresses noisy httpx logs and improves environment setup for reproducible benchmarking.

Improves AST analysis to track caller context for function/method calls across Python, JS, Go, Rust, Java, C/C++, and Ruby. Indexing pipeline now extracts and stores chunk- and symbol-level calls/imports, enabling precise symbol-to-symbol call graph edges. Updates chunking, graph edge extraction, and ingestion to leverage richer AST metadata, and adds MRR metric to CoIR benchmarks. Documentation updated to clarify call graph schema and query semantics.
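For reference, MRR (mean reciprocal rank) averages 1/rank of the first relevant result over all queries. The sketch below is a generic implementation with made-up rankings; it is not the benchmark script's code.

```python
def mean_reciprocal_rank(rankings: list[list[str]],
                         relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none)."""
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Toy data: first query hits at rank 1, second at rank 2, third misses.
mrr = mean_reciprocal_rank(
    rankings=[["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8"]],
    relevant=[{"d1"}, {"d5"}, {"d9"}],
)
```

Here the per-query reciprocal ranks are 1, 1/2, and 0, so the mean is 0.5.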

Revised the CoIR benchmark section to reflect updated evaluation metrics, including new NDCG@10 scores and query counts for Python, Go, and JavaScript. Clarified that results are for dense retrieval using Jina-Code embeddings.

Refines test_ingest_chunking.py to ignore additional keys in chunk comparison, improves test_intent_confidence.py to robustly locate specific log events, and updates test_symbol_graph_tool.py to check for MatchText instead of MatchValue for path prefix matching.

m1rl0k · 2026-01-12T16:09:22Z

augment review

augmentcode · 2026-01-12T16:09:32Z

This pull request is too large for Augment to review. The PR exceeds the maximum size limit of 100000 tokens (approximately 400000 characters) for automated code review. Please consider breaking this PR into smaller, more focused changes.

Lowers the max_neighbors parameter from 5 to 2 when calling the subgraph context injection. This may improve performance or relevance by limiting the number of neighbors considered.

Enhanced symbol hydration logic to support both path and symbol-based lookups, improving accuracy for callees without explicit paths. Updated graph edge queries to prefer caller_symbol over caller_path for more precise matching. Refactored symbol selection in chunking to handle both object and dict symbol representations. These changes improve robustness and correctness in symbol graph operations and hydration.

Updated the comment for the WATCH_USE_POLLING environment variable to indicate it should be set on Mac OS X. No functional changes were made.

Introduces support for asymmetric embedding via environment variable, enabling use of query_embed and passage_embed for models like Jina v3. Refactors doc_id extraction logic for consistency across retriever, hybrid search, and MCP search. Adds registration for Snowflake Arctic v2.0 model in embedder. Implements query variant fusion in dense search for improved recall.
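The commit does not say how results from query variants are fused. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the PR's method; `rrf_fuse` and the constant `k = 60` are illustrative choices.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


fused = rrf_fuse([
    ["a", "b", "c"],   # results for the original query (assumed data)
    ["b", "d", "a"],   # results for a rephrased variant (assumed data)
])
```

Documents returned by several variants ("a" and "b") rise above documents seen only once, which is how fusing variants can improve recall without trusting any single phrasing.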

m1rl0k and others added 17 commits January 9, 2026 23:57

Merge branch 'dense' of https://github.com/m1rl0k/Context-Engine into dense

0292dd8

dynamic TS languages and async safety

23c9fdf

Merge branch 'dense' of https://github.com/m1rl0k/Context-Engine into dense

4a03940

scripts/mcp_impl: Systematically wrap blocking Qdrant I/O calls (run_hybrid_search, run_pure_dense_search, _retrieve_fn) in asyncio.to_thread across context_search.py, context_answer.py, and search.py to prevent event loop blocking.

a633253

Merge branch 'dense' of https://github.com/m1rl0k/Context-Engine into dense

ec7af58

Change license to Business Source License 1.1

4ba4338

Add copyright and license headers to scripts

ec797c3

Update tool selection and context-engine skill docs

fe4b33e

Merge branch 'test' into dense

ad9e8ee

Merge branch 'test' into dense

f9c0822

Merge branch 'dense' of https://github.com/m1rl0k/Context-Engine into dense

0c622e8

m1rl0k requested a review from voarsh2 January 11, 2026 02:09

m1rl0k and others added 12 commits January 10, 2026 21:31

Update pattern_search.py

68cbc24

Use monkeypatch.setenv for EMBEDDING_MODEL in tests

2cf68b5

Update pattern_search.py

57964d0

Update conftest.py

1467d7c

Update pattern_search.py

624ae39

Merge branch 'dense' of https://github.com/m1rl0k/Context-Engine into dense

389d1e8

test: add mode assertions to pattern detection tests

85c9d10

test: fix assertions for changed classify_intent API and actual mode behavior

ba98239

fix: restore dense-mode default and tighten pattern tests

ecbe0d6

voarsh and others added 4 commits January 11, 2026 04:03

fix: prune graph edges correctly and harden pattern detection test

25c3ab0

Slim citation output and improve rel_path handling

935bfde

m1rl0k assigned m1rl0k and voarsh2 Jan 11, 2026

augmentcode bot reviewed Jan 11, 2026

scripts/prune.py (outdated, resolved)

scripts/ingest/pipeline.py (resolved)

m1rl0k added 5 commits January 11, 2026 15:19

Update CoIR benchmark results in README

e889824

m1rl0k added 5 commits January 12, 2026 11:19

Update docker-compose.yml

0538aef

Reduce max_neighbors from 5 to 2 in context answer

10968d8

Clarify comment for WATCH_USE_POLLING in compose file

83f751b

m1rl0k merged commit 01ef58d into test Jan 15, 2026
1 check passed

m1rl0k deleted the dense branch January 23, 2026 13:20

Dense #170: m1rl0k merged 43 commits into test from dense
