fix: don't reject UTF-8 text files dense in CJK as binary by he-yufeng · Pull Request #171 · OpenHands/openhands-aci

he-yufeng · 2026-04-22T14:44:10Z

Problem

OHEditor.view() (and str_replace, insert, etc.) raises FileValidationError: File appears to be binary… when operating on perfectly valid UTF-8 markdown files that contain dense CJK (Chinese/Japanese/Korean) characters.

Reported in OpenHands/OpenHands#13517.

Minimal repro:

from openhands_aci.editor.editor import OHEditor
import tempfile

with tempfile.NamedTemporaryFile(suffix='.md', delete=False, mode='w', encoding='utf-8') as f:
    f.write('中文测试内容。' * 50)   # 1050 bytes of valid UTF-8
    path = f.name

OHEditor()(command='view', path=path)
# => FileValidationError: File appears to be binary and this file type cannot be read or edited by this tool.

Root cause

validate_file defers the text/binary decision to binaryornot.check.is_binary, which uses chardet internally. Once a UTF-8 text file crosses ~1 KB with dense multi-byte characters (no ASCII dilution), chardet's byte-ratio heuristics cross the binary threshold and is_binary returns True — even though the bytes decode cleanly as UTF-8.

Dense CJK bytes	`is_binary()` result
< ~1000	False (correct)
>= ~1050	True (false positive)

This affects anyone working with non-trivial amounts of CJK text (also reproduces with dense Cyrillic, Arabic, etc.).
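To make the "decodes cleanly but looks binary" point concrete, here is a small stdlib-only sketch. It does not reproduce binaryornot's actual heuristic. `high_byte_ratio` is a hypothetical helper that only measures how much of the repro payload falls outside ASCII, and the strict decode confirms the bytes are valid UTF-8.

```python
def high_byte_ratio(data: bytes) -> float:
    """Fraction of bytes >= 0x80 (hypothetical helper, not part of binaryornot)."""
    if not data:
        return 0.0
    return sum(b >= 0x80 for b in data) / len(data)


# Same payload as the minimal repro: 7 characters x 3 bytes each x 50 repeats.
payload = ('中文测试内容。' * 50).encode('utf-8')

size = len(payload)
ratio = high_byte_ratio(payload)

# Strict decode: raises UnicodeDecodeError if the bytes are not valid UTF-8.
decoded = payload.decode('utf-8')

print(size, ratio)
```

Every byte in the payload is a non-ASCII byte, which is the "no ASCII dilution" case described above, yet the strict decode succeeds.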

Fix

Before raising FileValidationError, do a second-chance decode using the existing EncodingManager (which uses charset_normalizer, not chardet, and handles CJK correctly). If we can decode a 64 KB sample with the detected encoding, the file is text.

Genuinely binary files (null bytes, random bytes) still fail both checks and are rejected.
No new dependencies — charset_normalizer is already in use via EncodingManager.
Preserves the fast path: is_binary returning False still accepts immediately; the second-chance decode only runs on files is_binary flagged.
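The control flow described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code in this PR: the real change goes through EncodingManager and charset_normalizer, whereas here the detected encoding and the is_binary verdict are simply passed in. The names `sample_decodes_cleanly` and `is_text_file`, and the NUL-character check, are assumptions made for the example.

```python
import codecs

SAMPLE_SIZE = 64 * 1024  # 64 KB sample, as described in the Fix section


def sample_decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes strictly with `encoding` (hypothetical helper)."""
    try:
        decoder = codecs.getincrementaldecoder(encoding)(errors='strict')
        # final=False tolerates a multi-byte character cut off at the sample boundary.
        text = decoder.decode(data, final=False)
    except (UnicodeDecodeError, LookupError):
        return False
    # Assumption for this sketch: NUL characters indicate binary content.
    return '\x00' not in text


def is_text_file(path: str, flagged_binary: bool, detected_encoding: str) -> bool:
    """Fast path first; second-chance decode only for files flagged as binary."""
    if not flagged_binary:
        return True
    with open(path, 'rb') as f:
        sample = f.read(SAMPLE_SIZE)
    return sample_decodes_cleanly(sample, detected_encoding)
```

A caller would raise FileValidationError only when `is_text_file` returns False, so files that were already accepted never pay for the extra read.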

Testing

New regression tests in tests/unit/test_file_validation.py:

test_validate_dense_cjk_text_file — 1050-byte dense Chinese markdown, asserts is_binary still returns True so we're actually exercising the fix, then asserts validate_file accepts the file.
test_validate_bom_utf16_text_file — UTF-16 encoded Japanese/Chinese text.

All existing tests that were passing on main still pass. Two tests (test_validate_binary_file, test_validate_image_file) fail locally on both main and this branch due to a binaryornot/Python-3.12 compat issue (NameError: name 'unicode' is not defined inside binaryornot/helpers.py:106) — unrelated to this PR.
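For anyone who wants to build similar inputs by hand, fixtures like the ones the two tests describe could be created roughly as below. The helper name, the file names, and the mixed Japanese/Chinese sample string are assumptions for illustration and are not copied from tests/unit/test_file_validation.py.

```python
import codecs
import os
import tempfile


def write_fixtures(directory):
    """Write a dense-CJK UTF-8 file and a UTF-16 file (hypothetical helper)."""
    cjk_path = os.path.join(directory, 'dense_cjk.md')
    with open(cjk_path, 'w', encoding='utf-8') as f:
        f.write('中文测试内容。' * 50)  # 1050 bytes of valid UTF-8

    utf16_path = os.path.join(directory, 'utf16.md')
    # Python's 'utf-16' codec writes a byte order mark at the start of the file.
    with open(utf16_path, 'w', encoding='utf-16') as f:
        f.write('日本語のテキスト。中文内容。' * 20)

    return cjk_path, utf16_path


with tempfile.TemporaryDirectory() as tmp:
    cjk_path, utf16_path = write_fixtures(tmp)
    cjk_size = os.path.getsize(cjk_path)
    with open(utf16_path, 'rb') as f:
        utf16_head = f.read(2)
```

Each path can then be passed to validate_file or to OHEditor's view command to check that the file is accepted.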

Fixes OpenHands/OpenHands#13517

binaryornot's chardet-based heuristic has a known false positive on UTF-8 files that are dense in multi-byte characters (CJK, Cyrillic, etc.) once the sample crosses ~1 KB: the high-byte ratio trips the binary threshold even though the bytes decode cleanly as text.

Before raising FileValidationError, ask the encoding manager (which uses charset_normalizer, not chardet) whether the file decodes. A clean sample decode with the detected encoding means the file is text and should be viewable/editable.

Reproduction (from openhands-ai/OpenHands#13517):

    cjk = '中文测试内容。' * 50   # 1050 bytes, valid UTF-8
    editor.view(path)            # raised FileValidationError
    # after this fix: returns content

Tested on macOS with openhands-aci 0.3.3.

Fixes openhands-ai/OpenHands#13517

he-yufeng wants to merge 1 commit into OpenHands:main from he-yufeng:fix/validate-cjk-text-file
