This repository was archived by the owner on May 15, 2026. It is now read-only.
fix: don't reject UTF-8 text files dense in CJK as binary #171
Open
he-yufeng wants to merge 1 commit into
binaryornot's chardet-based heuristic has a known false positive on
UTF-8 files that are dense in multi-byte characters (CJK, Cyrillic,
etc.) once the sample crosses ~1 KB — the high-byte ratio trips the
binary threshold even though the bytes decode cleanly as text.
Before raising FileValidationError, ask the encoding manager (which
uses charset_normalizer, not chardet) whether the file decodes. A
clean sample decode with the detected encoding means the file is
text and should be viewable/editable.
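The second-chance check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it substitutes a standard-library incremental decoder for the encoding manager and charset_normalizer, and the names `sample_decodes`, `validate_text_file`, `looks_binary`, and `SAMPLE_SIZE` are hypothetical. `ValueError` stands in for FileValidationError, and the encoding is passed in rather than detected.

```python
import codecs
from typing import Callable

# Assumption: sample size chosen for illustration only.
SAMPLE_SIZE = 64 * 1024


def sample_decodes(sample: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the sample decodes strictly with the given encoding.

    final=False lets a multi-byte character cut off at the end of the
    sample stay buffered instead of being reported as an error.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


def validate_text_file(
    path: str,
    looks_binary: Callable[[str], bool],
    encoding: str = "utf-8",
) -> None:
    """Accept the file unless it is flagged as binary AND fails to decode."""
    # Files the first heuristic already accepts are accepted immediately.
    if not looks_binary(path):
        return
    # Second chance: only flagged files pay for the extra read and decode.
    with open(path, "rb") as fh:
        sample = fh.read(SAMPLE_SIZE)
    if sample_decodes(sample, encoding):
        return
    # Stand-in for the project's FileValidationError.
    raise ValueError(f"File appears to be binary: {path}")
```

Passing the first-stage check in as a callable keeps the sketch independent of binaryornot; in the real code path that role is played by the existing binary check.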
Reproduction (from openhands-ai/OpenHands#13517):
cjk = '中文测试内容。' * 50 # 1050 bytes, valid UTF-8
editor.view(path) # raised FileValidationError
# after this fix: returns content
Tested on macOS with openhands-aci 0.3.3.
Fixes openhands-ai/OpenHands#13517
Problem
OHEditor.view() (and str_replace, insert, etc.) raises FileValidationError: File appears to be binary… when operating on perfectly valid UTF-8 markdown files that contain dense CJK (Chinese/Japanese/Korean) characters.
Reported in OpenHands/OpenHands#13517.
Minimal repro:
Root cause
validate_file defers the text/binary decision to binaryornot.check.is_binary, which uses chardet internally. Once a UTF-8 text file crosses ~1 KB with dense multi-byte characters (no ASCII dilution), chardet's byte-ratio heuristics cross the binary threshold and is_binary returns True, even though the bytes decode cleanly as UTF-8.
This affects anyone working with non-trivial amounts of CJK text (also reproduces with dense Cyrillic, Arabic, etc.).
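To make the byte-density point concrete, here is a small standard-library sketch. It does not reproduce binaryornot's or chardet's actual heuristic; `high_byte_ratio` and `decodes_as_utf8` are hypothetical helpers that only show why a dense CJK sample looks "binary" by byte statistics while still being valid UTF-8.

```python
def high_byte_ratio(data: bytes) -> float:
    """Fraction of bytes with the high bit set (>= 0x80)."""
    if not data:
        return 0.0
    return sum(b >= 0x80 for b in data) / len(data)


def decodes_as_utf8(data: bytes) -> bool:
    """True if the whole byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Same sample as the reported reproduction: 7 characters x 3 bytes x 50.
sample = ("中文测试内容。" * 50).encode("utf-8")

print(len(sample))              # 1050
print(high_byte_ratio(sample))  # 1.0, every byte is part of a multi-byte sequence
print(decodes_as_utf8(sample))  # True
```

Every byte of a multi-byte UTF-8 sequence has the high bit set, so text with no ASCII at all has a high-byte ratio of 1.0, which is the situation a byte-ratio heuristic is most likely to misjudge.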
Fix
Before raising FileValidationError, do a second-chance decode using the existing EncodingManager (which uses charset_normalizer, not chardet, and handles CJK correctly). If we can decode a 64 KB sample with the detected encoding, the file is text.
charset_normalizer is already in use via EncodingManager.
is_binary returning False still accepts immediately; the second-chance decode only runs on files is_binary flagged.
Testing
New regression tests in tests/unit/test_file_validation.py:
test_validate_dense_cjk_text_file: 1050-byte dense Chinese markdown, asserts is_binary still returns True so we're actually exercising the fix, then asserts validate_file accepts the file.
test_validate_bom_utf16_text_file: UTF-16 encoded Japanese/Chinese text.
All existing tests that were passing on main still pass. Two tests (test_validate_binary_file, test_validate_image_file) fail locally on both main and this branch due to a binaryornot/Python-3.12 compat issue (NameError: name 'unicode' is not defined inside binaryornot/helpers.py:106), unrelated to this PR.
Fixes OpenHands/OpenHands#13517