Conversation
Introduces elbow detection utilities using the Kneedle algorithm for adaptive threshold computation in hybrid search (elbow_detection.py). Adds a high-performance chunk deduplication module with exact and substring-based deduplication logic (chunk_deduplication.py), ported from ChunkHound to Context-Engine.
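As a rough illustration of the Kneedle idea behind these utilities (a minimal sketch with an illustrative function name, not the actual elbow_detection.py code): normalize the descending score curve to the unit square and pick the point that dips furthest below the straight line joining its endpoints.

import numpy as np

def kneedle_cutoff(scores):
    """Index of the elbow in a descending score curve (simplified Kneedle-style sketch)."""
    y = np.asarray(scores, dtype=float)
    if y.size < 3 or y.max() == y.min():
        return y.size  # nothing to cut: keep every result
    x = np.linspace(0.0, 1.0, y.size)
    y_norm = (y - y.min()) / (y.max() - y.min())
    chord = y_norm[0] + (y_norm[-1] - y_norm[0]) * x  # line through the endpoints
    gap = chord - y_norm                              # how far the curve dips below it
    return int(np.argmax(gap))

# Hypothetical use: keep = hits[:kneedle_cutoff([h["score"] for h in hits]) + 1]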
Introduces a new deduplication method using O(n log n) substring detection in both CASTPlusChunker and SearchOptimizedChunker, falling back to legacy methods if unavailable. Adds deduplicate_semantic_chunks to chunk_deduplication.py for more accurate and efficient deduplication of SemanticChunk objects.
Adds concept-based extraction to ast_analyzer using declarative tree-sitter queries for 32+ languages, supporting universal concepts (definition, block, comment, import, structure). Updates language mapping classes for correct language keys, improves Go and TypeScript import queries, and enhances Redis connection handling with pooling and retries. Introduces elbow detection filtering for hybrid search, adds multi-hop chunk similarity search in Qdrant, and improves upload bundle manifest validation. Includes comprehensive tests for language mappings and analyzer integration.
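For a sense of what a declarative concept mapping looks like, here is an illustrative excerpt of how the universal concepts could map to tree-sitter query patterns for Python; the real queries live in the per-language mapping classes and will differ.

# Illustrative only: concept names from the commit, query patterns simplified.
PYTHON_CONCEPT_QUERIES = {
    "definition": "(function_definition) @definition (class_definition) @definition",
    "block": "(block) @block",
    "comment": "(comment) @comment",
    "import": "(import_statement) @import (import_from_statement) @import",
    "structure": "(class_definition) @structure",
}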
Added a postinstall script in package.json to ensure bin/ctxce.js is executable after installation. This helps prevent permission issues when running the start script.
auggie review
Added internal and utility symbols to the __all__ list in scripts/hybrid/qdrant.py for more explicit exports. Updated scripts/hybrid_qdrant.py shim to import __all__ for improved backward compatibility.
🤖 Augment PR Summary: This PR expands the chunking + retrieval pipeline with richer language-aware extraction, smarter result filtering, and improved operational defaults.
Refactored score extraction in elbow_detection.py to handle missing keys more robustly. Updated CONCEPT_SPECIFICITY in chunk_deduplication.py to use lowercase keys for CAST+ concept types, ensuring consistency with get_chunk_specificity().
Introduce comprehensive test suites for CAST+ chunker, chunk deduplication core, elbow detection, and smart termination logic. These tests cover configuration defaults, algorithm correctness, edge cases, and expected behaviors for each module.
Added logic to decode TOON-formatted result strings in the search command. If decoding fails, an error message is shown with a hint to install the required package or disable TOON. This improves compatibility with different result formats returned by the server.
augment review
Symbol graph edges (Qdrant flat graph) are now always enabled and no longer configurable via the INDEX_GRAPH_EDGES env var; related config, comments, and tests updated to reflect unconditional activation. Elbow detection utilities have been refactored to use curvature-based, changepoint, and Kneedle methods for adaptive thresholding, with improved statistical termination logic in iterative search. Specificity scoring for chunk deduplication now uses a weighted formula for more granular ranking.
Modified the Page-Hinkley test in termination.py to detect downward mean shifts, updated its threshold and logic, and clarified docstrings. Adjusted TerminationConfig defaults and improved test coverage in test_termination.py to reflect the new Page-Hinkley behavior. Also updated chunk specificity logic and tests in test_chunk_deduplication_core.py to return 0 for unknown types.
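For reference, a minimal downward-shift Page-Hinkley detector looks roughly like this (a generic sketch; the class name, parameters, and defaults are illustrative, not the tuned TerminationConfig values):

class PageHinkleyDown:
    """Detects a downward shift in the running mean of a score stream (sketch)."""

    def __init__(self, delta=0.005, lam=0.5):
        self.delta = delta    # tolerated magnitude of change
        self.lam = lam        # alarm threshold
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0        # cumulative deviation statistic
        self.min_cum = 0.0    # smallest value the statistic has reached

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n       # running mean including x
        self.cum += self.mean - x - self.delta      # grows when x falls below the mean
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.lam # True => downward shift detected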
Updated _get_graph_backend to support both Neo4j and Qdrant backends through a unified interface, defaulting to Qdrant when Neo4j is not enabled. Simplified fallback logic in _symbol_graph_impl to handle empty results or backend failures, ensuring callees and callers queries use appropriate legacy array field lookups when necessary.
Included the xxhash library (version 3.0.0 or higher) in the main dependencies to support fast non-cryptographic hashing.
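A sketch of the kind of exact-duplicate grouping xxhash enables (helper names and the dict-shaped chunk are illustrative, not the chunk_deduplication.py API):

import xxhash

def content_key(text: str) -> str:
    """Fast, stable fingerprint of normalized chunk text."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return xxhash.xxh64_hexdigest(normalized.encode("utf-8"))

def dedupe_exact(chunks):
    """Keep the first chunk seen for each distinct fingerprint."""
    seen = {}
    for chunk in chunks:
        seen.setdefault(content_key(chunk["content"]), chunk)
    return list(seen.values())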
Adjusts the min_results parameter to be zero when the limit is zero, preventing unintended behavior when no results are requested.
Refactors hybrid_search.py to lazily import filter_by_elbow only when elbow filtering is enabled, preventing unnecessary numpy dependency when the feature is disabled. Also updates the postinstall script in package.json to use a Node.js-based chmod for better cross-platform compatibility.
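The lazy-import pattern described here is roughly as follows (function and module path are illustrative; filter_by_elbow is the name given in the commit):

def _maybe_filter(hits, use_elbow: bool):
    if not use_elbow:
        return hits  # numpy is never imported on this path
    from elbow_detection import filter_by_elbow  # deferred, heavy dependency
    return filter_by_elbow(hits)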
auggie review
Set PSEUDO_DEFER_TO_WORKER=1 by default in Kubernetes and Docker Compose to defer LLM-based pseudo-tag generation to a background worker, improving initial indexing speed. Updated documentation to explain the new default, how the deferred worker operates, and its benefits for production deployments.
scripts/ast_analyzer.py
Outdated
# Use language mappings (32 languages, declarative queries)
if _LANGUAGE_MAPPINGS_AVAILABLE and self.use_tree_sitter:
    result = self._analyze_with_mapping(content, file_path, language)
    if result and (result.get("symbols") or result.get("imports")):
The mapping-based analysis is ignored unless it yields symbols or imports; for files/languages where only calls/concepts are present (and there’s no legacy analyzer), this can drop call-edge extraction and return an empty analysis. Consider treating any non-empty mapping output (e.g., calls) as “successful” here.
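A sketch of the condition the reviewer suggests (field names follow the snippet above; the actual fix landed later in the PR):

# Treat any non-empty mapping output, including call-only results, as success:
if result and any(result.get(k) for k in ("symbols", "imports", "calls")):
    ...  # use the mapping-based analysis instead of falling back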
Introduces support for writing MCP config for Cursor (mcp.json) in the VSCode extension. Adds new settings, commands, and UI elements for Cursor integration, updates the dashboard and settings webviews to use a logo image, and bumps package versions. Also changes a warning print to logger.debug in the Qdrant ingest script.
Enhances the TerminationChecker to handle None and non-numeric scores gracefully, ensuring robust sorting and filtering. Updates chunk deduplication to better handle structure and definition overlaps. Also updates ASTAnalyzer to consider 'calls' in result validation and adds documentation for Cursor integration in the VSCode extension README.
augment review
Introduces the INDEX_WORKERS environment variable to docker-compose and reset.py for configurable parallel indexing. Optimizes index_repo in pipeline.py to skip unnecessary cache checks and deduplication when recreating a fresh collection, reducing redundant Qdrant calls and improving performance.
scripts/hybrid/termination.py
Outdated
ranks[val].append(avg_rank)
i = j

r1 = sum(ranks[v][0] if len(ranks[v]) == 1 else ranks[v].pop(0) for v in x)

symbol_ranges = [(s.start_line, s.end_line, s.path or s.name) for s in symbols]

def find_enclosing_symbol(line: int) -> str:
    for start, end, path in symbol_ranges:
Corrected indentation issues and removed extraneous whitespace in the TerminationChecker class. This improves code readability and ensures proper execution of the termination checks.
Refined the symbol lookup in ASTAnalyzer to select the most specific enclosing symbol by minimizing the span. Simplified the Mann-Whitney U implementation by replacing the ranks dictionary with a rank list, streamlining rank assignment and calculation.
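A rank-list Mann-Whitney U along the lines the commit describes might look like this (a sketch, not the actual termination.py implementation):

def mann_whitney_u(x, y):
    """U statistic for sample x vs. sample y; ties get the average rank (sketch)."""
    combined = sorted(
        [(v, 0) for v in x] + [(v, 1) for v in y],
        key=lambda t: t[0],
    )
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0      # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1 = len(x)
    return r1 - n1 * (n1 + 1) / 2.0       # U for sample x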
scripts/ast_analyzer.py
Outdated
    names = [n.strip().split(" as ")[0] for n in names_str.split(",")]
    return module, names, True
elif text.startswith("import "):
    match = re.match(r"import\s+([\w.]+)", text)
_parse_import_text() only captures the first module for Python statements like "import os, sys", so imports (and any downstream dependency/graph extraction) will silently miss additional modules on the same line. Consider handling comma-separated imports (and parenthesized multi-line forms) so the extracted import list is complete.
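One way to cover the comma-separated case (illustrative helper; a follow-up commit below refines _parse_import_text along these lines):

import re

def parse_plain_import(text: str) -> list[str]:
    """Extract every module from a plain 'import a, b as c, d.e' line (sketch)."""
    match = re.match(r"import\s+(.+)", text)
    if not match:
        return []
    modules = []
    for part in match.group(1).split(","):
        name = part.strip().split(" as ")[0].strip()
        if name:
            modules.append(name)
    return modules

# parse_plain_import("import os, sys, collections.abc as cabc")
# -> ["os", "sys", "collections.abc"]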
must = []

if path_filter:
    must.append(models.FieldCondition(
path_filter is described as a “path prefix”, but the filter uses metadata.path with MatchText, which may not behave like a prefix match (and may not work if metadata.path is indexed as a keyword). It might be safer to align this with the rest of the codebase’s path filtering (e.g., metadata.path_prefix / MatchValue) so the constraint is applied reliably.
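What the reviewer's suggested alignment could look like (a sketch assuming a metadata.path_prefix payload field exists and is indexed, as the rest of the codebase implies):

from qdrant_client import models

def path_prefix_filter(path_filter: str) -> models.Filter:
    """Constrain results to a path prefix using an exact-match payload field (sketch)."""
    return models.Filter(
        must=[
            models.FieldCondition(
                key="metadata.path_prefix",
                match=models.MatchValue(value=path_filter.rstrip("/")),
            )
        ]
    )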
else:
    best = max(
        chunk_list,
        key=lambda c: (
When resolving exact-content duplicates, the tie-breaker uses -(end_line - start_line), which prefers smaller spans; this seems to conflict with the module’s stated goal of favoring larger chunks for more context. Worth double-checking the intended behavior here, since it affects which duplicate instance is retained.
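If the intent really is to favor larger chunks, the tie-breaker would flip to something like this (attribute names follow the reviewer's description; illustrative only):

best = max(
    chunk_list,
    key=lambda c: (c.end_line - c.start_line),  # larger span => more context retained
)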
Refines Python import statement parsing in ASTAnalyzer to handle multiple modules and aliases. Updates Qdrant path filtering to use 'metadata.path_prefix' and strips trailing slashes for more accurate matching.
augment review
Introduces new modules for embedding provider protocols, structured exception hierarchy, file discovery and syntax tree caching, type aliases, and domain models for code indexing. Also updates the base language mapping to support constant extraction and import resolution extension points.
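As an illustration of the protocol style such a module might use (the name, attributes, and signature below are assumptions, not the actual definitions):

from typing import Protocol, Sequence

class EmbeddingProvider(Protocol):
    """Structural interface an embedding backend satisfies; no inheritance required."""

    model: str
    dimensions: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""
        ...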
Refactored context hit extraction in ctx.py to handle TOON-encoded strings and provide better error handling and debug output. Replaced a print statement with logger.info in pipeline.py for consistency with logging practices.
Corrected indentation issues in scripts/ctx.py to ensure proper function definitions and logic flow. Updated a debug message in scripts/hybrid_search.py to clarify the threshold method. In scripts/ast_analyzer.py, normalized the language before retrieving the mapping to prevent errors.
Update _extract_calls_from_tree to only attribute function calls to symbols of kind function, method, class, async_function, or module. This prevents assignments and constants from being incorrectly identified as callers.
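The attribution rule reads roughly like this filter (a sketch that also folds in the smallest-span rule from an earlier commit; attribute names are illustrative):

CALLER_KINDS = {"function", "method", "class", "async_function", "module"}

def enclosing_caller(symbols, line):
    """Innermost enclosing symbol allowed to own a call edge, or None."""
    candidates = [
        s for s in symbols
        if s.kind in CALLER_KINDS and s.start_line <= line <= s.end_line
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.end_line - s.start_line)  # smallest span wins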
No description provided.