fix(tokenizer): document ACI_TOKENIZER and fix offline test failures caused by tiktoken network dependency #24
Merged
AperturePlus merged 2 commits into develop on Mar 17, 2026
Conversation
…e tests

Why: Tests using TiktokenTokenizer fail when the network is unavailable (tiktoken downloads its vocabulary from openaipublic.blob.core.windows.net on first use). This caused 14 unit test failures in sandboxed/offline environments. The ACI_TOKENIZER env var was also undocumented, leaving Ollama users with no guidance on how to resolve the BPE/WordPiece mismatch.

What:
- `.env.example`: add ACI_TOKENIZER with documentation for Ollama/BERT users
- `test_chunker_core.py`: use CharacterTokenizer fixture (network-free)
- `test_chunker_splitter.py`: use CharacterTokenizer fixture (network-free)
- `test_incremental_update_scopes_to_root.py`: pass CharacterTokenizer-based chunker to IndexingService to avoid tiktoken download at test time

Test: `uv run pytest tests/unit/` -> 226 passed, 0 failed

Co-authored-by: AperturePlus <146049978+AperturePlus@users.noreply.github.com>
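The fixture swap described in the commit message might look like the sketch below. The `CharacterTokenizer` interface is assumed from context (ACI's real class is not shown in this thread), so a minimal stand-in is defined inline to keep the sketch self-contained:

```python
import pytest

class CharacterTokenizer:
    """Stand-in for ACI's CharacterTokenizer (assumed interface): ~4 chars/token."""
    def count_tokens(self, text: str) -> int:
        return -(-len(text) // 4)  # ceiling division: 1-4 chars -> 1 token

@pytest.fixture
def tokenizer():
    # Previously the fixtures called get_default_tokenizer(), which lazily
    # downloads the tiktoken vocabulary. CharacterTokenizer needs no network.
    return CharacterTokenizer()

def test_count_tokens_without_network(tokenizer):
    assert tokenizer.count_tokens("abcdefgh") == 2
```

The point is not token-count accuracy; the chunker tests only need a deterministic tokenizer that works offline.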
Copilot (AI) changed the title from "[WIP] Fix indexing failure with Ollama/BERT models due to tokenizer mismatch" to "fix(tokenizer): document ACI_TOKENIZER and fix offline test failures caused by tiktoken network dependency" on Mar 14, 2026
ACI already supports configurable tokenizer strategies (`tiktoken`, `character`, `simple`) to address the BPE/WordPiece mismatch with Ollama/BERT models, but `ACI_TOKENIZER` was undocumented, and 14 unit tests were failing because `TiktokenTokenizer` lazily downloads its vocabulary from `openaipublic.blob.core.windows.net` on first use, breaking offline/sandboxed environments.

Changes

- `.env.example`: document `ACI_TOKENIZER` with an explanation of when to use each strategy, specifically calling out `character` and `simple` as the fix for Ollama/BERT users hitting `"input length exceeds context length"`
- Test fixtures (`test_chunker_core.py`, `test_chunker_splitter.py`): replace `get_default_tokenizer()` fixtures with `CharacterTokenizer()`; tests don't need tiktoken accuracy, just a functional tokenizer that works without network
- `test_incremental_update_scopes_to_root.py`: pass a `CharacterTokenizer`-based chunker to `IndexingService` instead of relying on the default, which silently failed file processing when tiktoken couldn't fetch its encoding, causing the regression assertion to fail

Warning
Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

- `openaipublic.blob.core.windows.net`
  - `/home/REDACTED/work/augmented-codebase-indexer/augmented-codebase-indexer/.venv/bin/pytest pytest tests/ -v --tb=short -q --durations=10` (dns block)
  - `/home/REDACTED/work/augmented-codebase-indexer/augmented-codebase-indexer/.venv/bin/pytest pytest tests/unit/ -v --tb=short -q --durations=10` (dns block)

If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
This section details the original issue you should resolve
<issue_title>Indexing always fails with Ollama/BERT-based embedding models due to tiktoken tokenizer mismatch</issue_title>
<issue_description>
ACI uses `tiktoken` with `cl100k_base` encoding (OpenAI's BPE tokenizer) to count chunk tokens, but when using Ollama with BERT-based embedding models (e.g., `nomic-embed-text`, `mxbai-embed-large`), these models use WordPiece tokenizers that produce significantly different token counts. This mismatch causes indexing to always fail with `"the input length exceeds the context length"`, no matter how low `ACI_INDEXING_MAX_CHUNK_TOKENS` is set.

Environment
- Commit: `4b52424`
- Endpoint: `http://localhost:11434/v1/embeddings`
- Models: `nomic-embed-text` (768 dim, 2048 context), `mxbai-embed-large` (1024 dim, 512 context)

Steps to Reproduce
1. Configure `.env` for Ollama
2. Run: `aci index /path/to/any/codebase`
3. Observe the failure
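The `.env` from the reproduction could look roughly like this. Only `ACI_INDEXING_MAX_CHUNK_TOKENS` and `ACI_EMBEDDING_BATCH_SIZE` are named in this issue; the endpoint and model come from the Environment section, and the remaining variable names are guesses at ACI's config keys:

```shell
# Hypothetical .env for the Ollama repro. The ACI_EMBEDDING_ENDPOINT and
# ACI_EMBEDDING_MODEL key names are illustrative placeholders, not confirmed
# config keys.
ACI_EMBEDDING_ENDPOINT=http://localhost:11434/v1/embeddings
ACI_EMBEDDING_MODEL=nomic-embed-text
ACI_INDEXING_MAX_CHUNK_TOKENS=256
ACI_EMBEDDING_BATCH_SIZE=1
```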
Root Cause
In `src/aci/core/tokenizer.py`, ACI hardcodes the `cl100k_base` encoding (OpenAI's BPE tokenizer), and `get_default_tokenizer()` returns that tiktoken-based tokenizer unconditionally.

The problem:
`cl100k_base` and BERT WordPiece tokenizers can produce wildly different token counts for the same text. A code chunk that tiktoken counts as 256 tokens can easily be 600-1000+ tokens in a BERT WordPiece tokenizer, because an identifier like `handleAuthenticationCallback` might be 2-3 BPE tokens but gets split into many more WordPiece pieces. This means the chunker produces chunks that it thinks are within limits, but the actual embedding model rejects them.
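The blow-up can be seen with a toy greedy longest-match WordPiece tokenizer. The vocabulary here is invented for illustration (real BERT vocabularies are learned): it knows the common word "handle" plus single letters, mimicking how code identifiers fall outside a natural-language vocabulary:

```python
# Toy vocabulary: one whole word plus "##"-prefixed single letters
# (the "##" marks a continuation piece, as in BERT's WordPiece).
VOCAB = {"handle"} | {"##" + c for c in "abcdefghijklmnopqrstuvwxyz"}

def wordpiece(word: str) -> list[str]:
    """Split `word` greedily into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:  # no piece matched at this position
            return ["[UNK]"]
        start = end
    return pieces

# "handleauthenticationcallback" (28 chars) becomes 1 + 22 = 23 pieces here,
# while a BPE tokenizer with a code-heavy vocabulary might emit only 2-3 tokens.
print(len(wordpiece("handleauthenticationcallback")))
```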
Observed Behavior
Lowering `ACI_INDEXING_MAX_CHUNK_TOKENS` does not help: there is no safe value, because the token count ratio is unpredictable and can be 2-4x.
Suggested Fix
Option A: Configurable tokenizer (minimal change)
Add an `ACI_TOKENIZER` env var that allows selecting the tokenizer strategy. A simple character-based estimator (e.g., `len(text) / 4`) would be conservative enough for any model.

Option B: Auto-detect from embedding model (better UX)
Query the Ollama API (`/api/show`) at startup to get the model's actual context length and tokenizer type, then apply a safety margin.

Option C: Graceful skip on embedding failure (safety net)
The branch `codex/fix-indexing-failure-for-oversized-items` (commit `72e88b5`) adds skip-on-failure logic. This should be merged to master as a safety net regardless of the tokenizer fix; one oversized chunk shouldn't abort the entire index.

Recommended approach
Combine Option A + Option C: let users pick a conservative tokenizer for non-OpenAI models, and always gracefully skip chunks that the embedding API rejects rather than aborting the whole batch.
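A minimal sketch of the A+C combination, under stated assumptions: only the `ACI_TOKENIZER` env var name and the `len(text) / 4` heuristic come from this issue; the class and function names, and restricting the sketch to the `character` strategy so it runs offline, are illustrative choices:

```python
import os

class CharacterTokenizer:
    """Option A: conservative character-based estimate (~4 chars per token)."""
    def count_tokens(self, text: str) -> int:
        return -(-len(text) // 4)  # ceiling division

def get_tokenizer():
    """Pick a strategy from the ACI_TOKENIZER env var (sketch: only
    'character' is implemented, so the example needs no network)."""
    strategy = os.environ.get("ACI_TOKENIZER", "tiktoken")
    if strategy == "character":
        return CharacterTokenizer()
    raise NotImplementedError(f"strategy {strategy!r} not sketched")

def embed_all(chunks, embed_fn, log=print):
    """Option C: skip chunks the embedding API rejects instead of aborting.

    embed_fn is whatever calls the embedding API; it just needs to raise on
    an oversized chunk. Returns (embeddings, skipped_chunks).
    """
    embeddings, skipped = [], []
    for chunk in chunks:
        try:
            embeddings.append(embed_fn(chunk))
        except Exception as exc:  # e.g. "input length exceeds context length"
            log(f"skipping chunk of {len(chunk)} chars: {exc}")
            skipped.append(chunk)
    return embeddings, skipped
```

With `ACI_TOKENIZER=character` the chunker overestimates token counts for almost any model, and `embed_all` guarantees that the rare chunk that still slips through costs one skipped chunk, not the whole index.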
Additional Context
- The failure occurs against the `/v1/embeddings` endpoint
- The `ACI_EMBEDDING_BATCH_SIZE=1` setting doesn't help because even individual chunks exceed the model's context