## Bug Description

`semantic_code_search` and `semantic_navigate` fail with the error:

```
Unable to embed oversized input after adaptive retries
```

This happens on projects that contain large non-code files (JSON data files, GeoJSON, CSV, etc.) alongside source code. The tools appear to attempt embedding the entire file content, including these large data files, which exceeds the embedding model's context window.
## Environment

- Context+ version: latest via `bunx contextplus`
- OS: macOS (Apple Silicon, 64 GB RAM)
- Ollama: running locally
- Embed model: `nomic-embed-text` (context window: 2048 tokens)
- Chat model: `gemma2:27b`
## Reproduction

Any project that has both source code files and large data files (JSON > 100KB, GeoJSON, CSV, etc.) in the project tree.

### Steps

- Configure Context+ as an MCP server with Ollama (`nomic-embed-text`)
- Have a project with a few JS/TS source files and some large `.json` data files (100KB+)
- Run `semantic_code_search` with any query:

  ```js
  semantic_code_search({ query: "authentication logic", top_k: 3 })
  ```

- Result: `Unable to embed oversized input after adaptive retries`
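If a concrete fixture helps, a throwaway script along these lines (the file name and record shape are made up; any JSON well over 100KB triggers it) generates a data file large enough to hit the error:

```ts
// make-fixture.ts: generate a large JSON data file next to the source files.
// The name and record shape are arbitrary; the point is simply a file far over 100KB.
import { writeFileSync } from "node:fs";

const records = Array.from({ length: 5_000 }, (_, i) => ({
  id: i,
  name: `record-${i}`,
  coordinates: [Math.random() * 180 - 90, Math.random() * 360 - 180],
}));

// Produces a file of several hundred KB, far beyond a 2048-token context window.
writeFileSync("large-fixture.json", JSON.stringify(records, null, 2));
```

Running `semantic_code_search` in a project containing this file alongside a couple of `.ts` source files reproduces the failure above.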
## What works vs what doesn't

| Tool | Status | Notes |
| --- | --- | --- |
| `get_context_tree` | ✅ Works | AST-based, no embeddings |
| `get_file_skeleton` | ✅ Works | AST-based, no embeddings |
| `semantic_identifier_search` | ✅ Works | Embeds function signatures (small) |
| `get_blast_radius` | ✅ Works | Import/usage tracing |
| `semantic_code_search` | ❌ Fails | Tries to embed large data files |
| `semantic_navigate` | ❌ Fails | Same issue |
## Expected Behavior

Context+ should either:

- Skip non-code files (`.json`, `.geojson`, `.csv`, etc.) during embedding, or
- Chunk large files before embedding instead of sending the entire content, or
- Add a configurable `max_file_size` threshold (e.g., 50KB) beyond which files are skipped for embedding, or
- Gracefully degrade: skip files that exceed the model's context window and continue with the rest
## Suggested Fix

The `embeddings.ts` core module could:

- Filter out known data-only extensions (`.json`, `.geojson`, `.csv`, `.xlsx`) from the embedding pipeline (rough sketch below)
- Add a `CONTEXTPLUS_MAX_EMBED_FILE_SIZE` env var (default ~50KB)
- Or use the existing Tree-sitter parser to detect whether a file has meaningful code symbols, and skip it if not
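A minimal sketch of the first two ideas, assuming a guard the file-level pipeline would call before embedding; the function name, default size, and call site are hypothetical, not the existing Context+ API:

```ts
// embeddings.ts (sketch): guard to run before a file's content is sent to the embed model.
// Nothing here is the actual Context+ implementation; names and defaults are illustrative.
import { statSync } from "node:fs";
import { extname } from "node:path";

const DATA_ONLY_EXTENSIONS = new Set([".json", ".geojson", ".csv", ".xlsx"]);

// Hypothetical env var; defaults to ~50KB.
const MAX_EMBED_FILE_SIZE =
  Number(process.env.CONTEXTPLUS_MAX_EMBED_FILE_SIZE) || 50 * 1024;

export function shouldEmbedFile(filePath: string): boolean {
  // Skip known data-only formats outright.
  if (DATA_ONLY_EXTENSIONS.has(extname(filePath).toLowerCase())) return false;

  // Skip anything larger than the configured size threshold.
  if (statSync(filePath).size > MAX_EMBED_FILE_SIZE) return false;

  return true;
}
```

The pipeline behind `semantic_code_search` / `semantic_navigate` could then skip (or chunk) whatever this guard rejects and keep going, instead of failing the whole query.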
## Additional Context

`semantic_identifier_search` works because it only embeds function/class signatures (small strings). The bug is specifically in the file-level embedding pipeline used by `semantic_code_search` and `semantic_navigate`.

The `nomic-embed-text` model has a 2048-token context window. Large data files far exceed this limit.
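For a rough sense of scale (using the common heuristic of about 4 characters per token, an approximation rather than a measured figure for `nomic-embed-text`): a 100KB JSON file is on the order of 25,000 tokens, more than ten times the 2048-token window, so no amount of adaptive retrying can fit it into a single embedding call.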
Great project — the AST tools work beautifully. Looking forward to semantic search handling mixed codebases!