Add CAST+ and SOSC chunking strategies #194

Merged

m1rl0k merged 1 commit into test from chunk

Jan 24, 2026

.env.example

@@ -183,6 +183,19 @@ USE_TREE_SITTER=1
     INDEX_USE_ENHANCED_AST=1
     INDEX_SEMANTIC_CHUNKS=1
+    # Search-Optimized Semantic Chunking (SOSC) - concept-aware chunking for search
+    # Combines cAST concept-awareness with SDC density scoring, optimized for MCP search.
+    # Key features:
+    #   - 5 concept types: DEFINITION, BLOCK, COMMENT, IMPORT, STRUCTURE
+    #   - Concept-aware merging (docstring+function OK, block+definition NO)
+    #   - Deduplication (no duplicate chunks indexed)
+    #   - Emergency splitting for minified code
+    #   - Parent tracking for metadata (no context padding in chunk text)
+    # Set to 1 to enable SOSC instead of SDC (INDEX_SEMANTIC_CHUNKS)
+    # INDEX_SOSC_CHUNKS=0
+    # SOSC_MAX_CHARS=1200
+    # SOSC_MIN_CHARS=50
     # Pattern Search - structural code similarity across languages
     # Enables pattern_search MCP tool and indexes 64-dim pattern vectors
     # Uses WL graph kernel, CFG fingerprints, SimHash, spectral features
@@ Expand Down @@
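The concept-aware merging rule in the `.env.example` comment above can be sketched as a small compatibility check. This is a minimal illustration, not the PR's implementation: the `Concept` enum, the `MERGEABLE` table, and `can_merge` are hypothetical names, and only the two pairs named in the comment (docstring + function allowed, block + definition refused) come from the source. Treating two concepts of the same type as mergeable is an assumption made for the sketch.

```python
from enum import Enum


class Concept(Enum):
    # The five concept types listed in the .env.example comment.
    DEFINITION = "definition"
    BLOCK = "block"
    COMMENT = "comment"
    IMPORT = "import"
    STRUCTURE = "structure"


# Hypothetical compatibility table. Only one allowed pair is stated in the
# source: a docstring (COMMENT) may merge with a function (DEFINITION).
# BLOCK + DEFINITION is stated as not mergeable, so it is left out.
MERGEABLE = {(Concept.COMMENT, Concept.DEFINITION)}


def can_merge(first: Concept, second: Concept) -> bool:
    """Return True if two adjacent concepts may share one chunk."""
    if first is second:
        # Assumption for this sketch: identical concept types are compatible.
        return True
    # The table is treated as symmetric, so order does not matter.
    return (first, second) in MERGEABLE or (second, first) in MERGEABLE


if __name__ == "__main__":
    print(can_merge(Concept.COMMENT, Concept.DEFINITION))
    print(can_merge(Concept.BLOCK, Concept.DEFINITION))
```

A lookup table keeps the rule set explicit, so any further compatible pairs the real chunker supports could be added without touching the function.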

docs/ARCHITECTURE.md

@@ Expand Up @@
     - **Call Graph Construction**: Maps caller → callee relationships with enclosing function context
     - **Dependency Tracking**: Extracts imports and module dependencies
     - **Semantic Chunking**: Splits code at function/class boundaries (not arbitrary line counts)
+    - **SOSC**: Search-Optimized Semantic Chunking using 34 language mappings for concept-aware chunks
+    - **CAST+**: Hybrid chunking with concept-aware merging and density scoring
     **Supported Languages:**
     | Language | Package |
@@ Expand Down @@

docs/CONFIGURATION.md

@@ -10,6 +10,7 @@ Complete environment variable reference for Context Engine.
     - [Core Settings](#core-settings)
     - [Embedding Models](#embedding-models)
     - [Indexing & Micro-Chunks](#indexing--micro-chunks)
+      - [Chunking Strategies](#chunking-strategies)
     - [Query Optimization](#query-optimization)
     - [Watcher Settings](#watcher-settings)
     - [Reranker](#reranker)
@@ -105,13 +106,45 @@ make reset-dev-dual # Recreates collection and reindexes
     | USE_TREE_SITTER | Enable tree-sitter parsing (py/js/ts) | 1 (on) |
     | INDEX_USE_ENHANCED_AST | Enable advanced AST-based semantic chunking | 1 (on) |
     | INDEX_SEMANTIC_CHUNKS | Enable semantic chunking (preserve function/class boundaries) | 1 (on) |
+    | INDEX_SOSC_CHUNKS | Enable SOSC chunking (concept-aware, search-optimized) | 0 (off) |
+    | INDEX_CAST_CHUNKS | Enable CAST+ chunking (hybrid merging with density scoring) | 0 (off) |
+    | INDEX_SDC_CHUNKS | Enable SDC chunking (semantic density chunking) | 0 (off) |
     | INDEX_CHUNK_LINES | Lines per chunk (non-micro mode) | 120 |
     | INDEX_CHUNK_OVERLAP | Overlap lines between chunks | 20 |
     | INDEX_BATCH_SIZE | Upsert batch size | 64 |
     | INDEX_PROGRESS_EVERY | Log progress every N files | 200 |
     | SMART_SYMBOL_REINDEXING | Reuse embeddings when only symbols change | 1 (enabled) |
     | MAX_CHANGED_SYMBOLS_RATIO | Threshold for full reindex vs smart update | 0.6 |
+    ### Chunking Strategies
+    Context Engine supports multiple chunking strategies. Only one can be active at a time. Priority order (first enabled wins):
+    1. **MICRO** (`INDEX_MICRO_CHUNKS=1`) - Token-based micro-chunking for ReFRAG. 16-token windows with 8-token stride.
+    2. **SOSC** (`INDEX_SOSC_CHUNKS=1`) - Search-Optimized Semantic Chunking. Uses tree-sitter + language mappings to extract concept-aware chunks (DEFINITION, BLOCK, COMMENT, IMPORT, STRUCTURE). Best for search quality - clean boundaries, respects symbol structure.
+    3. **CAST+** (`INDEX_CAST_CHUNKS=1`) - Hybrid chunking combining concept-aware grouping with density scoring. Merges compatible concepts aggressively (e.g., docstring + function). Best for token efficiency.
+    4. **SDC** (`INDEX_SDC_CHUNKS=1`) - Semantic Density Chunking. Token-aware chunking with density scoring.
+    5. **SEMANTIC** (`INDEX_SEMANTIC_CHUNKS=1`, default) - AST-aware chunking that preserves function/class boundaries.
+    6. **LINE-BASED** (fallback) - Simple line-based chunking with overlap.
+    **Recommended for search quality:** `INDEX_SOSC_CHUNKS=1`
+    **Recommended for token efficiency:** `INDEX_CAST_CHUNKS=1`
+    | SOSC Config | Description | Default |
+    |-------------|-------------|---------|
+    | SOSC_MAX_CHARS | Max non-whitespace chars per chunk | 1200 |
+    | SOSC_MIN_CHARS | Min chars to avoid tiny fragments | 50 |
+    | CAST+ Config | Description | Default |
+    |--------------|-------------|---------|
+    | CAST_MAX_SIZE | Max non-whitespace chars per chunk | 1200 |
+    | CAST_MIN_SIZE | Min chars to avoid tiny fragments | 50 |
     ## Query Optimization
     Dynamic HNSW_EF tuning and intelligent query routing for 2x faster simple queries.
@@ Expand Down @@
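The "first enabled wins" priority order documented in the `docs/CONFIGURATION.md` hunk above can be sketched as a lookup over environment flags. This is an illustrative sketch only: `STRATEGY_PRIORITY` and `resolve_chunking_strategy` are hypothetical names, not functions from the PR. The defaults for SOSC, CAST+, SDC, and SEMANTIC follow the table in the diff; the default of `0` for `INDEX_MICRO_CHUNKS` is an assumption, since the diff does not show it.

```python
import os

# (strategy name, environment variable, default) in the documented priority
# order. The INDEX_MICRO_CHUNKS default is assumed; the others follow the
# defaults table in docs/CONFIGURATION.md.
STRATEGY_PRIORITY = [
    ("micro", "INDEX_MICRO_CHUNKS", "0"),
    ("sosc", "INDEX_SOSC_CHUNKS", "0"),
    ("cast", "INDEX_CAST_CHUNKS", "0"),
    ("sdc", "INDEX_SDC_CHUNKS", "0"),
    ("semantic", "INDEX_SEMANTIC_CHUNKS", "1"),
]


def resolve_chunking_strategy(env=None) -> str:
    """Return the first enabled strategy, or "line" as the fallback."""
    env = os.environ if env is None else env
    for name, var, default in STRATEGY_PRIORITY:
        # A flag counts as enabled only when its value is exactly "1".
        if str(env.get(var, default)).strip() == "1":
            return name
    return "line"  # line-based chunking with overlap


if __name__ == "__main__":
    # With both flags set, SOSC outranks CAST+.
    print(resolve_chunking_strategy({"INDEX_SOSC_CHUNKS": "1", "INDEX_CAST_CHUNKS": "1"}))
```

Passing a plain dict instead of `os.environ` keeps the resolution logic testable without mutating the process environment.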
