semantic_code_search and semantic_navigate fail with 'Unable to embed oversized input' on projects with large data files #15

@Morghot42

Bug Description

semantic_code_search and semantic_navigate fail with the error:

Unable to embed oversized input after adaptive retries

This happens on projects that contain large non-code files (JSON data files, GeoJSON, CSV, etc.) alongside source code. The tools appear to embed each file's entire content, including these large data files, and that input exceeds the embedding model's context window.

Environment

  • Context+ version: latest via bunx contextplus
  • OS: macOS (Apple Silicon, 64 GB RAM)
  • Ollama: running locally
  • Embed model: nomic-embed-text (context window: 2048 tokens)
  • Chat model: gemma2:27b

Reproduction

Any project that has both source code files and large data files (JSON > 100KB, GeoJSON, CSV, etc.) in the project tree.

Steps

  1. Configure Context+ as MCP server with Ollama (nomic-embed-text)
  2. Have a project with a few JS/TS source files and some large .json data files (100KB+)
  3. Run semantic_code_search with any query:
    semantic_code_search({ query: "authentication logic", top_k: 3 })
    
  4. Result: Unable to embed oversized input after adaptive retries

What works vs what doesn't

| Tool | Status | Notes |
| --- | --- | --- |
| get_context_tree | ✅ Works | AST-based, no embeddings |
| get_file_skeleton | ✅ Works | AST-based, no embeddings |
| semantic_identifier_search | ✅ Works | Embeds function signatures (small) |
| get_blast_radius | ✅ Works | Import/usage tracing |
| semantic_code_search | ❌ Fails | Tries to embed large data files |
| semantic_navigate | ❌ Fails | Same issue |

Expected Behavior

Context+ should either:

  1. Skip non-code files (.json, .geojson, .csv, etc.) during embedding, or
  2. Chunk large files before embedding instead of sending the entire content, or
  3. Add a configurable max_file_size threshold (e.g., 50KB) beyond which files are skipped for embedding, or
  4. Gracefully degrade — skip files that exceed the model's context window and continue with the rest
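
To make options 2 and 4 concrete, here is a minimal sketch of chunking plus graceful degradation. It is not taken from the Context+ source: `chunkText`, `embedFilesGracefully`, and the `EmbedFn` type are hypothetical names, and the 4-characters-per-token ratio is only a rough heuristic (dense JSON may tokenize less favorably, so a real implementation would probably want a safety margin).

```typescript
// Hypothetical embedding callback; the real Context+ / Ollama call may look different.
type EmbedFn = (input: string) => Promise<number[]>;

interface SourceFile {
  path: string;
  content: string;
}

interface EmbedResult {
  embedded: Array<{ path: string; vectors: number[][] }>;
  skipped: string[];
}

// Split text into fixed-size chunks that fit a rough token budget.
// Assumption: ~4 characters per token, which is a heuristic, not a tokenizer.
function chunkText(text: string, maxTokens = 2048, charsPerToken = 4): string[] {
  const maxChars = Math.max(1, Math.floor(maxTokens * charsPerToken));
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += maxChars) {
    chunks.push(text.slice(start, start + maxChars));
  }
  return chunks;
}

// Embed every file chunk by chunk. A file whose embedding throws is recorded
// in `skipped` and the loop continues with the remaining files.
async function embedFilesGracefully(
  files: SourceFile[],
  embed: EmbedFn,
  maxTokens = 2048,
): Promise<EmbedResult> {
  const result: EmbedResult = { embedded: [], skipped: [] };
  for (const file of files) {
    try {
      const vectors: number[][] = [];
      for (const chunk of chunkText(file.content, maxTokens)) {
        vectors.push(await embed(chunk));
      }
      result.embedded.push({ path: file.path, vectors });
    } catch {
      result.skipped.push(file.path);
    }
  }
  return result;
}
```

With this shape, one oversized or malformed file would end up in `skipped` instead of failing the whole search. Fixed-size character chunks are the simplest choice; splitting on symbol boundaries would likely give better search results for code.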

Suggested Fix

The embeddings.ts core module could:

  • Filter out known data-only extensions (.json, .geojson, .csv, .xlsx) from the embedding pipeline
  • Add a CONTEXTPLUS_MAX_EMBED_FILE_SIZE env var (default ~50KB)
  • Or use the existing Tree-sitter parser to detect if a file has meaningful code symbols — if not, skip it
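
The first two bullets could be sketched roughly as follows. This is an illustration under assumptions, not the actual `embeddings.ts` code: `shouldEmbedFile`, `resolveMaxEmbedFileSize`, and `extensionOf` are hypothetical helpers, and `CONTEXTPLUS_MAX_EMBED_FILE_SIZE` is the env var proposed above, not an existing setting.

```typescript
// Extensions proposed in this issue as data-only; the list is illustrative.
const DATA_ONLY_EXTENSIONS = new Set([".json", ".geojson", ".csv", ".xlsx"]);

// Proposed default of roughly 50KB.
const DEFAULT_MAX_EMBED_FILE_SIZE = 50 * 1024;

// Read the proposed env var, falling back to the default when it is
// missing, non-numeric, or not positive.
function resolveMaxEmbedFileSize(env: Record<string, string | undefined>): number {
  const raw = env["CONTEXTPLUS_MAX_EMBED_FILE_SIZE"];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_EMBED_FILE_SIZE;
}

// Return the lowercased extension including the dot, or "" if there is none.
// Dotfiles such as ".env" are treated as having no extension.
function extensionOf(filePath: string): string {
  const lastSeparator = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  const base = filePath.slice(lastSeparator + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Decide whether a file should enter the file-level embedding pipeline.
function shouldEmbedFile(filePath: string, sizeBytes: number, maxBytes: number): boolean {
  if (DATA_ONLY_EXTENSIONS.has(extensionOf(filePath))) return false;
  return sizeBytes <= maxBytes;
}
```

A caller would resolve the limit once, for example `resolveMaxEmbedFileSize(process.env)`, and then check each file before embedding it. One trade-off: a blanket `.json` filter also excludes small config files such as `package.json`, so the size threshold alone may be the less surprising default.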

Additional Context

semantic_identifier_search works because it only embeds function/class signatures (small strings). The bug is specifically in the file-level embedding pipeline used by semantic_code_search and semantic_navigate.

The nomic-embed-text model has a 2048-token context window. Large data files far exceed this limit.
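
For a sense of scale, a back-of-the-envelope estimate, assuming roughly 4 bytes per token (a common heuristic for English-like ASCII text, not a measured figure for this model; `estimateTokens` is a hypothetical helper):

```typescript
// Rough token estimate from file size. Assumption: ~4 bytes per token.
function estimateTokens(sizeBytes: number, bytesPerToken = 4): number {
  return Math.ceil(sizeBytes / bytesPerToken);
}

const contextWindowTokens = 2048;                       // window stated in this issue
const estimatedTokens = estimateTokens(100 * 1024);     // 100KB file -> 25600 tokens
const windowsNeeded = estimatedTokens / contextWindowTokens; // 12.5
```

Under that assumption, a single 100KB data file is already about 12.5 times the context window, so no amount of retrying with the whole file can succeed.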

Great project — the AST tools work beautifully. Looking forward to semantic search handling mixed codebases!
