
fix(tokenizer): document ACI_TOKENIZER and fix offline test failures caused by tiktoken network dependency#24

Merged
AperturePlus merged 2 commits into develop from copilot/fix-indexing-failure on Mar 17, 2026
Conversation

Contributor

Copilot AI commented Mar 14, 2026

ACI already supports configurable tokenizer strategies (tiktoken, character, simple) to address the BPE/WordPiece mismatch with Ollama/BERT models. However, ACI_TOKENIZER was undocumented, and 14 unit tests were failing because TiktokenTokenizer lazily downloads its vocabulary from openaipublic.blob.core.windows.net on first use, which breaks offline/sandboxed environments.

Changes

.env.example

  • Documents ACI_TOKENIZER with explanation of when to use each strategy — specifically calling out character and simple as the fix for Ollama/BERT users hitting "input length exceeds context length"
# Use "character" or "simple" for Ollama/BERT-based models (e.g. nomic-embed-text,
# mxbai-embed-large) to avoid the tiktoken/WordPiece mismatch.
#   tiktoken  - OpenAI BPE (default, accurate for OpenAI models)
#   character - len(text)/4 estimate (conservative, works with any model)
#   simple    - whitespace split (for generic non-BPE models)
ACI_TOKENIZER=tiktoken
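As a rough sketch of how this strategy selection might work, the snippet below wires `ACI_TOKENIZER` to the two network-free strategies. The class names mirror those quoted in the issue, but the factory logic (`tokenizer_from_env`) is an illustration for this PR description, not ACI's actual implementation; the tiktoken path is omitted so the sketch stays offline.

```python
import os

class CharacterTokenizer:
    """Conservative estimate: roughly one token per 4 characters."""
    def count_tokens(self, text: str) -> int:
        return max(1, len(text) // 4)

class SimpleTokenizer:
    """Whitespace split, for generic non-BPE models."""
    def count_tokens(self, text: str) -> int:
        return len(text.split())

def tokenizer_from_env():
    # Hypothetical factory: reads ACI_TOKENIZER, defaulting to "tiktoken".
    strategy = os.environ.get("ACI_TOKENIZER", "tiktoken")
    if strategy == "character":
        return CharacterTokenizer()
    if strategy == "simple":
        return SimpleTokenizer()
    raise NotImplementedError("tiktoken path omitted in this offline sketch")
```

Note that neither strategy touches the network, which is exactly why the test fixtures below can switch to `CharacterTokenizer`.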

Test fixtures (test_chunker_core.py, test_chunker_splitter.py)

  • Replace get_default_tokenizer() fixtures with CharacterTokenizer() — tests don't need tiktoken accuracy, just a functional tokenizer that works without network

test_incremental_update_scopes_to_root.py

  • Pass an explicit CharacterTokenizer-based chunker to IndexingService instead of relying on the default, which silently failed file processing when tiktoken couldn't fetch its encoding, causing the regression assertion to fail

Warning

Firewall rules blocked me from connecting to one or more addresses:

I tried to connect to the following addresses, but was blocked by firewall rules:

  • openaipublic.blob.core.windows.net
    • Triggering command: /home/REDACTED/work/augmented-codebase-indexer/augmented-codebase-indexer/.venv/bin/pytest pytest tests/ -v --tb=short -q --durations=10 (dns block)
    • Triggering command: /home/REDACTED/work/augmented-codebase-indexer/augmented-codebase-indexer/.venv/bin/pytest pytest tests/unit/ -v --tb=short -q --durations=10 (dns block)


Original prompt

This section details the original issue to resolve

<issue_title>Indexing always fails with Ollama/BERT-based embedding models due to tiktoken tokenizer mismatch</issue_title>
<issue_description>
ACI uses tiktoken with cl100k_base encoding (OpenAI's BPE tokenizer) to count chunk tokens, but when using Ollama with BERT-based embedding models (e.g., nomic-embed-text, mxbai-embed-large), these models use WordPiece tokenizers that produce significantly different token counts. This mismatch causes indexing to always fail with "the input length exceeds the context length" — no matter how low ACI_INDEXING_MAX_CHUNK_TOKENS is set.

Environment

  • ACI version: latest master (4b52424)
  • Embedding provider: Ollama (local) via http://localhost:11434/v1/embeddings
  • Models tested: nomic-embed-text (768 dim, 2048 context), mxbai-embed-large (1024 dim, 512 context)
  • OS: macOS (Darwin), Python 3.14
  • Qdrant: local Docker (auto-started by ACI)

Steps to Reproduce

  1. Install ACI and set up .env for Ollama:
ACI_EMBEDDING_API_URL=http://localhost:11434/v1/embeddings
ACI_EMBEDDING_API_KEY=ollama
ACI_EMBEDDING_MODEL=nomic-embed-text
ACI_EMBEDDING_DIMENSION=768
ACI_EMBEDDING_BATCH_SIZE=1
ACI_INDEXING_MAX_CHUNK_TOKENS=256  # Even extremely low values don't help
ACI_VECTOR_STORE_VECTOR_SIZE=768
  2. Run: aci index /path/to/any/codebase

  3. Observe the failure:

Chunk exceeds token limit (528 > 256), this may indicate a very long single line
Batch batch_xxx failed: API error: 400 - {"error":{"message":"the input length exceeds the context length","type":"api_error","param":null,"code":null}}

Root Cause

In src/aci/core/tokenizer.py, ACI hardcodes cl100k_base (OpenAI's BPE tokenizer):

class TiktokenTokenizer(TokenizerInterface):
    def __init__(self, encoding_name: str = "cl100k_base"):
        self._encoding_name = encoding_name
        # ...

    def count_tokens(self, text: str) -> int:
        return len(self.encoding.encode(text))

And in get_default_tokenizer():

def get_default_tokenizer() -> TokenizerInterface:
    return TiktokenTokenizer(encoding_name="cl100k_base")

The problem: cl100k_base and BERT WordPiece tokenizers can produce wildly different token counts for the same text. A code chunk that tiktoken counts as 256 tokens can easily be 600–1000+ tokens in a BERT WordPiece tokenizer, because:

  • BPE (tiktoken) merges common subword pairs aggressively — code identifiers like handleAuthenticationCallback might be 2-3 BPE tokens
  • WordPiece splits more conservatively — the same identifier could be 5-8 WordPiece tokens
  • Special characters, camelCase, and code syntax amplify the divergence

This means the chunker produces chunks that it thinks are within limits, but the actual embedding model rejects them.
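A toy illustration of this divergence: the same dense, single-line code chunk gets wildly different counts depending on the counting strategy. tiktoken and a real WordPiece tokenizer are deliberately omitted so the snippet runs offline; the point is only that a limit checked with one tokenizer says nothing about another.

```python
# One dense code-like chunk, counted two different ways.
chunk = "handleAuthenticationCallback(request, response, sessionToken)"

simple_count = len(chunk.split())        # whitespace split: very few "tokens"
char_estimate = max(1, len(chunk) // 4)  # conservative len/4 estimate

print(simple_count, char_estimate)  # 3 15
```

A 5x spread between two trivial strategies on one line; the BPE-vs-WordPiece gap described above behaves the same way, just with the embedding model on the losing side.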

Observed Behavior

| ACI_INDEXING_MAX_CHUNK_TOKENS | Result |
| --- | --- |
| 8192 (default) | Fails — chunks up to 42,982 tokens in the model's tokenizer |
| 2048 | Fails — chunks still exceed nomic's 2048 context |
| 512 | Fails — chunks reported as 528 by ACI, but actually much larger for the model |
| 256 | Fails — chunks reported as 256-528, still exceed context |

There is no safe value because the token count ratio is unpredictable and can be 2-4x.

Suggested Fix

Option A: Configurable tokenizer (minimal change)

Add an ACI_TOKENIZER env var that allows selecting the tokenizer strategy:

# .env
ACI_TOKENIZER=character  # or "tiktoken" (default), "simple" (whitespace split)

A simple character-based estimator (e.g., len(text) / 4) would be conservative enough for any model.

Option B: Auto-detect from embedding model (better UX)

Query the Ollama API (/api/show) at startup to get the model's actual context length and tokenizer type, then apply a safety margin.
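A minimal sketch of Option B, assuming the /api/show response carries a model_info map with a "<arch>.context_length" entry (e.g. "nomic-bert.context_length", as current Ollama versions return). The 0.5 safety margin and the function name are illustrative choices, not tuned or proposed API.

```python
def safe_chunk_budget(show_response: dict, margin: float = 0.5) -> int:
    """Derive a chunk-token budget from an Ollama /api/show response.

    Scans model_info for an architecture-prefixed context_length key and
    applies a safety margin to absorb the BPE/WordPiece count divergence.
    """
    model_info = show_response.get("model_info", {})
    for key, value in model_info.items():
        if key.endswith(".context_length"):
            return int(value * margin)
    return 256  # conservative fallback when context length is unknown

# Hypothetical response fragment shaped like Ollama's /api/show output.
response = {"model_info": {"nomic-bert.context_length": 2048}}
print(safe_chunk_budget(response))  # 1024
```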

Option C: Graceful skip on embedding failure (safety net)

The branch codex/fix-indexing-failure-for-oversized-items (commit 72e88b5) adds skip-on-failure logic. This should be merged to master as a safety net regardless of the tokenizer fix — one oversized chunk shouldn't abort the entire index.

Recommended approach

Combine Option A + Option C: let users pick a conservative tokenizer for non-OpenAI models, and always gracefully skip chunks that the embedding API rejects rather than aborting the whole batch.
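The skip-on-failure half (Option C) can be sketched as an embed loop that records rejected chunks instead of aborting. `embed` here is a stand-in for the real embedding API call; the error handling and return shape are illustrative, not the actual codex/fix-indexing-failure-for-oversized-items implementation.

```python
def embed_all(chunks, embed):
    """Embed chunk-by-chunk; collect failures rather than aborting the batch."""
    vectors, skipped = [], []
    for chunk in chunks:
        try:
            vectors.append(embed(chunk))
        except Exception as exc:  # e.g. HTTP 400 "input length exceeds..."
            skipped.append((chunk, str(exc)))
    return vectors, skipped

# Fake embedder standing in for the Ollama endpoint: rejects long inputs.
def fake_embed(chunk):
    if len(chunk) > 10:
        raise ValueError("the input length exceeds the context length")
    return [0.0]  # placeholder vector

vectors, skipped = embed_all(["short", "x" * 50, "tiny"], fake_embed)
print(len(vectors), len(skipped))  # 2 1
```

One oversized chunk costs one skip entry (which can be logged), not the whole index run.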

Additional Context

  • The README advertises "OpenAI-compatible API (OpenAI, SiliconFlow, etc.)" support, which implies Ollama should work since it exposes an OpenAI-compatible /v1/embeddings endpoint
  • Ollama is a very popular local alternative — many users who want a "free" setup (as documented in ACI's README) will hit this
  • The ACI_EMBEDDING_BATCH_SIZE=1 setting doesn't help because even individual chunks exceed the model's context
  • Th...


…e tests

Why: Tests using TiktokenTokenizer fail when network is unavailable (tiktoken
downloads its vocabulary from openaipublic.blob.core.windows.net on first use).
This caused 14 unit test failures in sandboxed/offline environments. The
ACI_TOKENIZER env var was also undocumented, leaving Ollama users with no
guidance on how to resolve the BPE/WordPiece mismatch.

What:
- .env.example: add ACI_TOKENIZER with documentation for Ollama/BERT users
- test_chunker_core.py: use CharacterTokenizer fixture (network-free)
- test_chunker_splitter.py: use CharacterTokenizer fixture (network-free)
- test_incremental_update_scopes_to_root.py: pass CharacterTokenizer-based
  chunker to IndexingService to avoid tiktoken download at test time

Test: uv run pytest tests/unit/ -> 226 passed, 0 failed

Co-authored-by: AperturePlus <146049978+AperturePlus@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix indexing failure with Ollama/BERT models due to tokenizer mismatch" to "fix(tokenizer): document ACI_TOKENIZER and fix offline test failures caused by tiktoken network dependency" on Mar 14, 2026
Copilot AI requested a review from AperturePlus March 14, 2026 11:52
@AperturePlus AperturePlus marked this pull request as ready for review March 14, 2026 12:18
@AperturePlus AperturePlus merged commit 9301964 into develop Mar 17, 2026
1 check passed