This repository was archived by the owner on May 15, 2026. It is now read-only.

fix: don't reject UTF-8 text files dense in CJK as binary#171

Open
he-yufeng wants to merge 1 commit into
OpenHands:mainfrom
he-yufeng:fix/validate-cjk-text-file

@he-yufeng

Problem

OHEditor.view() (and str_replace, insert, etc.) raises FileValidationError: File appears to be binary… when operating on perfectly valid UTF-8 markdown files that contain dense CJK (Chinese/Japanese/Korean) characters.

Reported in OpenHands/OpenHands#13517.

Minimal repro:

from openhands_aci.editor.editor import OHEditor
import tempfile

with tempfile.NamedTemporaryFile(suffix='.md', delete=False, mode='w', encoding='utf-8') as f:
    f.write('中文测试内容。' * 50)   # 1050 bytes of valid UTF-8
    path = f.name

OHEditor()(command='view', path=path)
# => FileValidationError: File appears to be binary and this file type cannot be read or edited by this tool.

Root cause

validate_file defers the text/binary decision to binaryornot.check.is_binary, which uses chardet internally. Once a UTF-8 text file crosses ~1 KB with dense multi-byte characters (no ASCII dilution), chardet's byte-ratio heuristics cross the binary threshold and is_binary returns True — even though the bytes decode cleanly as UTF-8.

| Dense CJK bytes | is_binary() result  |
|-----------------|---------------------|
| < ~1000         | False (correct)     |
| >= ~1050        | True (false positive) |

This affects anyone working with non-trivial amounts of CJK text (also reproduces with dense Cyrillic, Arabic, etc.).
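
To make the "no ASCII dilution" point concrete, here is a small stdlib-only check of the repro string. It only inspects the bytes; it does not call binaryornot or chardet, so it says nothing about where exactly their threshold sits.

```python
text = "中文测试内容。" * 50          # the string from the minimal repro
data = text.encode("utf-8")

chars = len(text)                                 # 7 characters x 50 repetitions
ascii_bytes = sum(1 for b in data if b < 0x80)    # bytes in the 7-bit ASCII range
round_trips = data.decode("utf-8") == text        # strict decode succeeds

print(len(data), chars, ascii_bytes, round_trips)
# prints: 1050 350 0 True
```

Every character here encodes to three bytes, all of them at or above 0x80, which is the kind of input a high-byte ratio heuristic sees as suspicious even though a strict UTF-8 decode succeeds.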

Fix

Before raising FileValidationError, do a second-chance decode using the existing EncodingManager (which uses charset_normalizer, not chardet, and handles CJK correctly). If we can decode a 64 KB sample with the detected encoding, the file is text.

  • Genuinely binary files (null bytes, random bytes) still fail both checks and are rejected.
  • No new dependencies — charset_normalizer is already in use via EncodingManager.
  • Preserves the fast path: is_binary returning False still accepts immediately; the second-chance decode only runs on files is_binary flagged.
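
The control flow described above can be sketched roughly as follows. This is not the code in the PR: it is a stdlib-only illustration in which `looks_binary`, `decodes_as_text`, and `primary_check` are hypothetical names, the encoding is assumed to be UTF-8 instead of being detected by EncodingManager, and a NUL-byte check stands in for the real detector's rejection of binary data.

```python
import codecs
from typing import Callable

SAMPLE_SIZE = 64 * 1024  # 64 KB sample, the size named in the description


def decodes_as_text(sample: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the sample decodes cleanly with the given encoding."""
    if b"\x00" in sample:
        # Assumption for this sketch only: NUL bytes mean binary. This would
        # wrongly reject UTF-16 text, which the real fix is meant to accept.
        return False
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def looks_binary(path: str, primary_check: Callable[[str], bool]) -> bool:
    """Hypothetical wrapper: trust the primary check unless the sample decodes."""
    if not primary_check(path):
        return False  # fast path: accepted immediately, no second read
    with open(path, "rb") as f:
        sample = f.read(SAMPLE_SIZE)
    return not decodes_as_text(sample)
```

In the real change the role of `primary_check` is played by `binaryornot.check.is_binary`, and the decode goes through the existing EncodingManager. The incremental decoder is used here so that a sample ending in the middle of a character is not mistaken for invalid text.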

Testing

New regression tests in tests/unit/test_file_validation.py:

  • test_validate_dense_cjk_text_file — 1050-byte dense Chinese markdown, asserts is_binary still returns True so we're actually exercising the fix, then asserts validate_file accepts the file.
  • test_validate_bom_utf16_text_file — UTF-16 encoded Japanese/Chinese text.
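
As a rough illustration of why the UTF-16 case needs the detected encoding and not a hard-coded UTF-8 decode, the following stdlib-only snippet builds BOM-prefixed UTF-16 bytes and tries both encodings. The sample text and the `decodes` helper are made up for this example and are not taken from the test file.

```python
import codecs

text = "日本語と中文のテキスト。\n" * 40
data = text.encode("utf-16")   # Python's "utf-16" codec prepends a BOM

# The BOM is in the platform's native byte order, so accept either form.
has_bom = data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def decodes(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes strictly with the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(has_bom, decodes(data, "utf-16"), decodes(data, "utf-8"))
# prints: True True False
```

The BOM bytes 0xFF and 0xFE are never valid in UTF-8, so a strict UTF-8 decode fails on the first byte while the UTF-16 decode round-trips.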

All existing tests that were passing on main still pass. Two tests (test_validate_binary_file, test_validate_image_file) fail locally on both main and this branch due to a binaryornot/Python-3.12 compat issue (NameError: name 'unicode' is not defined inside binaryornot/helpers.py:106) — unrelated to this PR.

Fixes OpenHands/OpenHands#13517

binaryornot's chardet-based heuristic has a known false positive on
UTF-8 files that are dense in multi-byte characters (CJK, Cyrillic,
etc.) once the sample crosses ~1 KB — the high-byte ratio trips the
binary threshold even though the bytes decode cleanly as text.

Before raising FileValidationError, ask the encoding manager (which
uses charset_normalizer, not chardet) whether the file decodes. A
clean sample decode with the detected encoding means the file is
text and should be viewable/editable.

Reproduction (from openhands-ai/OpenHands#13517):

    cjk = '中文测试内容。' * 50        # 1050 bytes, valid UTF-8
    editor.view(path)                   # raised FileValidationError
                                        # after this fix: returns content

Tested on macOS with openhands-aci 0.3.3.

Fixes openhands-ai/OpenHands#13517
Development

Successfully merging this pull request may close these issues.

[Bug]: OHEditor view command fails on valid UTF-8 markdown files with dense CJK characters
