Conversation
Introduces elbow detection utilities using the Kneedle algorithm for adaptive threshold computation in hybrid search (elbow_detection.py). Adds a high-performance chunk deduplication module with exact and substring-based deduplication logic (chunk_deduplication.py), ported from ChunkHound to Context-Engine.
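As a rough illustration of the Kneedle idea behind these utilities (a minimal sketch with an illustrative function name, not the actual elbow_detection.py code): normalize the descending score curve to the unit square and pick the point that dips furthest below the straight line joining its endpoints.

import numpy as np

def kneedle_cutoff(scores):
    """Index of the elbow in a descending score curve (simplified Kneedle-style sketch)."""
    y = np.asarray(scores, dtype=float)
    if y.size < 3 or y.max() == y.min():
        return y.size  # nothing to cut: keep every result
    x = np.linspace(0.0, 1.0, y.size)
    y_norm = (y - y.min()) / (y.max() - y.min())
    chord = y_norm[0] + (y_norm[-1] - y_norm[0]) * x  # line through the endpoints
    gap = chord - y_norm                              # how far the curve dips below it
    return int(np.argmax(gap))

# Hypothetical use: keep = hits[:kneedle_cutoff([h["score"] for h in hits]) + 1]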
Introduces a new deduplication method using O(n log n) substring detection in both CASTPlusChunker and SearchOptimizedChunker, falling back to legacy methods if unavailable. Adds deduplicate_semantic_chunks to chunk_deduplication.py for more accurate and efficient deduplication of SemanticChunk objects.
Adds concept-based extraction to ast_analyzer using declarative tree-sitter queries for 32+ languages, supporting universal concepts (definition, block, comment, import, structure). Updates language mapping classes for correct language keys, improves Go and TypeScript import queries, and enhances Redis connection handling with pooling and retries. Introduces elbow detection filtering for hybrid search, adds multi-hop chunk similarity search in Qdrant, and improves upload bundle manifest validation. Includes comprehensive tests for language mappings and analyzer integration.
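For a sense of what a declarative concept mapping looks like, here is an illustrative excerpt of how the universal concepts could map to tree-sitter query patterns for Python; the real queries live in the per-language mapping classes and will differ.

# Illustrative only: concept names from the commit, query patterns simplified.
PYTHON_CONCEPT_QUERIES = {
    "definition": "(function_definition) @definition (class_definition) @definition",
    "block": "(block) @block",
    "comment": "(comment) @comment",
    "import": "(import_statement) @import (import_from_statement) @import",
    "structure": "(class_definition) @structure",
}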
Added a postinstall script in package.json to ensure bin/ctxce.js is executable after installation. This helps prevent permission issues when running the start script.
auggie review
Added internal and utility symbols to the __all__ list in scripts/hybrid/qdrant.py for more explicit exports. Updated scripts/hybrid_qdrant.py shim to import __all__ for improved backward compatibility.
🤖 Augment PR Summary: This PR expands the chunking + retrieval pipeline with richer language-aware extraction, smarter result filtering, and improved operational defaults.
Refactored score extraction in elbow_detection.py to handle missing keys more robustly. Updated CONCEPT_SPECIFICITY in chunk_deduplication.py to use lowercase keys for CAST+ concept types, ensuring consistency with get_chunk_specificity().
Introduce comprehensive test suites for CAST+ chunker, chunk deduplication core, elbow detection, and smart termination logic. These tests cover configuration defaults, algorithm correctness, edge cases, and expected behaviors for each module.
Added logic to decode TOON-formatted result strings in the search command. If decoding fails, an error message is shown with a hint to install the required package or disable TOON. This improves compatibility with different result formats returned by the server.
augment review
Symbol graph edges (Qdrant flat graph) are now always enabled and no longer configurable via the INDEX_GRAPH_EDGES env var; related config, comments, and tests updated to reflect unconditional activation. Elbow detection utilities have been refactored to use curvature-based, changepoint, and Kneedle methods for adaptive thresholding, with improved statistical termination logic in iterative search. Specificity scoring for chunk deduplication now uses a weighted formula for more granular ranking.
Modified the Page-Hinkley test in termination.py to detect downward mean shifts, updated its threshold and logic, and clarified docstrings. Adjusted TerminationConfig defaults and improved test coverage in test_termination.py to reflect the new Page-Hinkley behavior. Also updated chunk specificity logic and tests in test_chunk_deduplication_core.py to return 0 for unknown types.
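For reference, a minimal downward-shift Page-Hinkley detector looks roughly like this (a generic sketch; the class name, parameters, and defaults are illustrative, not the tuned TerminationConfig values):

class PageHinkleyDown:
    """Detects a downward shift in the running mean of a score stream (sketch)."""

    def __init__(self, delta=0.005, lam=0.5):
        self.delta = delta    # tolerated magnitude of change
        self.lam = lam        # alarm threshold
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0        # cumulative deviation statistic
        self.min_cum = 0.0    # smallest value the statistic has reached

    def update(self, x: float) -> bool:
        self.n += 1
        self.mean += (x - self.mean) / self.n       # running mean including x
        self.cum += self.mean - x - self.delta      # grows when x falls below the mean
        self.min_cum = min(self.min_cum, self.cum)
        return (self.cum - self.min_cum) > self.lam # True => downward shift detected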
Updated _get_graph_backend to support both Neo4j and Qdrant backends through a unified interface, defaulting to Qdrant when Neo4j is not enabled. Simplified fallback logic in _symbol_graph_impl to handle empty results or backend failures, ensuring callees and callers queries use appropriate legacy array field lookups when necessary.
Included the xxhash library (version 3.0.0 or higher) in the main dependencies to support fast non-cryptographic hashing.
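A sketch of the kind of exact-duplicate grouping xxhash enables (helper names and the dict-shaped chunk are illustrative, not the chunk_deduplication.py API):

import xxhash

def content_key(text: str) -> str:
    """Fast, stable fingerprint of normalized chunk text."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return xxhash.xxh64_hexdigest(normalized.encode("utf-8"))

def dedupe_exact(chunks):
    """Keep the first chunk seen for each distinct fingerprint."""
    seen = {}
    for chunk in chunks:
        seen.setdefault(content_key(chunk["content"]), chunk)
    return list(seen.values())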
Adjusts the min_results parameter to be zero when the limit is zero, preventing unintended behavior when no results are requested.
Refactors hybrid_search.py to lazily import filter_by_elbow only when elbow filtering is enabled, preventing unnecessary numpy dependency when the feature is disabled. Also updates the postinstall script in package.json to use a Node.js-based chmod for better cross-platform compatibility.
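The lazy-import pattern described here is roughly as follows (function and module path are illustrative; filter_by_elbow is the name given in the commit):

def _maybe_filter(hits, use_elbow: bool):
    if not use_elbow:
        return hits  # numpy is never imported on this path
    from elbow_detection import filter_by_elbow  # deferred, heavy dependency
    return filter_by_elbow(hits)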
auggie review
Set PSEUDO_DEFER_TO_WORKER=1 by default in Kubernetes and Docker Compose to defer LLM-based pseudo-tag generation to a background worker, improving initial indexing speed. Updated documentation to explain the new default, how the deferred worker operates, and its benefits for production deployments.
scripts/ast_analyzer.py
Outdated
# Use language mappings (32 languages, declarative queries)
if _LANGUAGE_MAPPINGS_AVAILABLE and self.use_tree_sitter:
    result = self._analyze_with_mapping(content, file_path, language)
    if result and (result.get("symbols") or result.get("imports")):
The mapping-based analysis is ignored unless it yields symbols or imports; for files/languages where only calls/concepts are present (and there’s no legacy analyzer), this can drop call-edge extraction and return an empty analysis. Consider treating any non-empty mapping output (e.g., calls) as “successful” here.
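A sketch of the condition the reviewer suggests (field names follow the snippet above; the actual fix landed later in the PR):

# Treat any non-empty mapping output, including call-only results, as success:
if result and any(result.get(k) for k in ("symbols", "imports", "calls")):
    ...  # use the mapping-based analysis instead of falling back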
Introduces support for writing MCP config for Cursor (mcp.json) in the VSCode extension. Adds new settings, commands, and UI elements for Cursor integration, updates the dashboard and settings webviews to use a logo image, and bumps package versions. Also changes a warning print to logger.debug in the Qdrant ingest script.
Enhances the TerminationChecker to handle None and non-numeric scores gracefully, ensuring robust sorting and filtering. Updates chunk deduplication to better handle structure and definition overlaps. Also updates ASTAnalyzer to consider 'calls' in result validation and adds documentation for Cursor integration in the VSCode extension README.
augment review
Introduces the INDEX_WORKERS environment variable to docker-compose and reset.py for configurable parallel indexing. Optimizes index_repo in pipeline.py to skip unnecessary cache checks and deduplication when recreating a fresh collection, reducing redundant Qdrant calls and improving performance.
scripts/hybrid/termination.py
Outdated
ranks[val].append(avg_rank)
i = j

r1 = sum(ranks[v][0] if len(ranks[v]) == 1 else ranks[v].pop(0) for v in x)

symbol_ranges = [(s.start_line, s.end_line, s.path or s.name) for s in symbols]

def find_enclosing_symbol(line: int) -> str:
    for start, end, path in symbol_ranges:
Corrected indentation issues and removed extraneous whitespace in the TerminationChecker class. This improves code readability and ensures proper execution of the termination checks.
Refined the symbol lookup in ASTAnalyzer to select the most specific enclosing symbol by minimizing the span. Simplified the Mann-Whitney U implementation by replacing the ranks dictionary with a rank list, streamlining rank assignment and calculation.
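A rank-list Mann-Whitney U along the lines the commit describes might look like this (a sketch, not the actual termination.py implementation):

def mann_whitney_u(x, y):
    """U statistic for sample x vs. sample y; ties get the average rank (sketch)."""
    combined = sorted(
        [(v, 0) for v in x] + [(v, 1) for v in y],
        key=lambda t: t[0],
    )
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0      # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1 = len(x)
    return r1 - n1 * (n1 + 1) / 2.0       # U for sample x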
scripts/ast_analyzer.py
Outdated
    names = [n.strip().split(" as ")[0] for n in names_str.split(",")]
    return module, names, True
elif text.startswith("import "):
    match = re.match(r"import\s+([\w.]+)", text)
_parse_import_text() only captures the first module for Python statements like "import os, sys", so imports (and any downstream dependency/graph extraction) will silently miss additional modules on the same line. Consider handling comma-separated imports (and parenthesized multi-line forms) so the extracted import list is complete.
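One way to cover the comma-separated case (illustrative helper; a follow-up commit below refines _parse_import_text along these lines):

import re

def parse_plain_import(text: str) -> list[str]:
    """Extract every module from a plain 'import a, b as c, d.e' line (sketch)."""
    match = re.match(r"import\s+(.+)", text)
    if not match:
        return []
    modules = []
    for part in match.group(1).split(","):
        name = part.strip().split(" as ")[0].strip()
        if name:
            modules.append(name)
    return modules

# parse_plain_import("import os, sys, collections.abc as cabc")
# -> ["os", "sys", "collections.abc"]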
must = []

if path_filter:
    must.append(models.FieldCondition(
path_filter is described as a “path prefix”, but the filter uses metadata.path with MatchText, which may not behave like a prefix match (and may not work if metadata.path is indexed as a keyword). It might be safer to align this with the rest of the codebase’s path filtering (e.g., metadata.path_prefix / MatchValue) so the constraint is applied reliably.
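What the reviewer's suggested alignment could look like (a sketch assuming a metadata.path_prefix payload field exists and is indexed, as the rest of the codebase implies):

from qdrant_client import models

def path_prefix_filter(path_filter: str) -> models.Filter:
    """Constrain results to a path prefix using an exact-match payload field (sketch)."""
    return models.Filter(
        must=[
            models.FieldCondition(
                key="metadata.path_prefix",
                match=models.MatchValue(value=path_filter.rstrip("/")),
            )
        ]
    )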
else:
    best = max(
        chunk_list,
        key=lambda c: (
When resolving exact-content duplicates, the tie-breaker uses -(end_line - start_line), which prefers smaller spans; this seems to conflict with the module’s stated goal of favoring larger chunks for more context. Worth double-checking the intended behavior here, since it affects which duplicate instance is retained.
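If the intent really is to favor larger chunks, the tie-breaker would flip to something like this (attribute names follow the reviewer's description; illustrative only):

best = max(
    chunk_list,
    key=lambda c: (c.end_line - c.start_line),  # larger span => more context retained
)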
Refines Python import statement parsing in ASTAnalyzer to handle multiple modules and aliases. Updates Qdrant path filtering to use 'metadata.path_prefix' and strips trailing slashes for more accurate matching.
augment review
Introduces new modules for embedding provider protocols, structured exception hierarchy, file discovery and syntax tree caching, type aliases, and domain models for code indexing. Also updates the base language mapping to support constant extraction and import resolution extension points.
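As an illustration of the protocol style such a module might use (the name, attributes, and signature below are assumptions, not the actual definitions):

from typing import Protocol, Sequence

class EmbeddingProvider(Protocol):
    """Structural interface an embedding backend satisfies; no inheritance required."""

    model: str
    dimensions: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""
        ...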
Refactored context hit extraction in ctx.py to handle TOON-encoded strings and provide better error handling and debug output. Replaced a print statement with logger.info in pipeline.py for consistency with logging practices.
Corrected indentation issues in scripts/ctx.py to ensure proper function definitions and logic flow. Updated a debug message in scripts/hybrid_search.py to clarify the threshold method. In scripts/ast_analyzer.py, normalized the language before retrieving the mapping to prevent errors.
Update _extract_calls_from_tree to only attribute function calls to symbols of kind function, method, class, async_function, or module. This prevents assignments and constants from being incorrectly identified as callers.
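The attribution rule reads roughly like this filter (a sketch that also folds in the smallest-span rule from an earlier commit; attribute names are illustrative):

CALLER_KINDS = {"function", "method", "class", "async_function", "module"}

def enclosing_caller(symbols, line):
    """Innermost enclosing symbol allowed to own a call edge, or None."""
    candidates = [
        s for s in symbols
        if s.kind in CALLER_KINDS and s.start_line <= line <= s.end_line
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.end_line - s.start_line)  # smallest span wins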
No description provided.