
[BUG] UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path #136

Closed
aytey opened this issue Nov 9, 2021 · 2 comments · Fixed by #137
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments


aytey commented Nov 9, 2021

Describe the bug

I have a file such as this:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ file temp.txt
temp.txt: C source, ASCII text
(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ du -hs temp.txt
9.6M    temp.txt
(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ wc temp.txt
  188585  1001674 10000082 temp.txt

and I'm trying to parse it with:

#!/usr/bin/env python3

from charset_normalizer import from_path

file = "temp.txt"

lines = [
    line.strip() for line in str(from_path(file).best()).split("\n")
]

using this version of charset_normalizer:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ echo -e "import charset_normalizer\nprint(charset_normalizer.version.VERSION)" | python
['2', '0', '7']

On the main file, I get this exception:

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ ./test.py
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    line.strip() for line in str(from_path(file).best()).split("\n")
  File "/home/avj/clones/compile_commands_processor/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 813828: ordinal not in range(128)
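The decode failure itself can be illustrated in isolation (a minimal stdlib sketch, not the library's actual code path; the payload is made up, with the byte 0xb5 taken from the traceback above):

```python
# Byte 0xb5 (the micro sign in latin-1) is not valid ASCII, so a strict
# decode raises exactly the exception seen in the traceback.
payload = b"delay of 5 \xb5s\n"

try:
    payload.decode("ascii", "strict")
except UnicodeDecodeError as exc:
    print(exc.reason)  # "ordinal not in range(128)"

# A tolerant decode succeeds by substituting U+FFFD for the bad byte.
print(payload.decode("ascii", "replace"))
```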

However, it seems that something "weird" goes on at around the 10000082 character mark:

This crashes (file size: 10000082 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188602 good.txt | tail -n 188585 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv timeout -k 5 -s 9 5 ./test.py
10000082 temp.txt
Traceback (most recent call last):
  File "./test.py", line 8, in <module>
    line.strip() for line in str(from_path(file).best()).split("\n")
  File "/home/avj/clones/compile_commands_processor/venv/lib64/python3.8/site-packages/charset_normalizer/models.py", line 114, in __str__
    self._string = str(self._payload, self._encoding, "strict")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 813522: ordinal not in range(128)
Command exited with non-zero status 1
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.04
        System time (seconds): 0.03
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.07
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 45904
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 5116
        Voluntary context switches: 3
        Involuntary context switches: 3
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1

whereas this does not finish within 5 seconds (maybe that's reasonable for a ~10 MiB file) (file size: 9999820 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188602 good.txt | tail -n 188584 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv timeout -k 5 -s 9 5 ./test.py
9999820 temp.txt
Command terminated by signal 9
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 806
        Voluntary context switches: 2
        Involuntary context switches: 1
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Now, it would be reasonable to ask "okay, but what happens in the one line you've removed?", so we take slightly more head and leave tail alone (file size: 9999847 chars):

(venv) avj@vistrrdslin0001 ~/clones/cvise_runners/avj/charset_normalizer$ head -n 188603 good.txt | tail -n 188585 > temp.txt && wc -c temp.txt && /usr/bin/time -vvv timeout -k 5 -s 9 5 ./test.py
9999847 temp.txt
Command terminated by signal 9
        Command being timed: "timeout -k 5 -s 9 5 ./test.py"
        User time (seconds): 0.00
        System time (seconds): 0.00
        Percent of CPU this job got: 0%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.00
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 10148
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 772
        Voluntary context switches: 2
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

To Reproduce

Unfortunately, I am not able to share this file immediately -- I tried to use cvise and halfempty on it to find the smallest reproducing file, but hit a roadblock at around the 10000082 character mark.

Expected behavior

I believe that charset_normalizer shouldn't crash with UnicodeDecodeError: 'ascii' codec can't decode byte when using from_path.

Desktop (please complete the following information):

  • OS: Linux
  • Python version: 3.8.12
  • Package version: 2.0.7
Ousret (Owner) commented Nov 9, 2021

Hi,

Thanks for the detailed report. Indeed, your observation is confirmed.
I have put together a small reproducible example that will serve to verify and fix the issue.

Feel free to verify against the branch bugfix-lazy-str-decode-error.
This bug concerns only files or byte sequences larger than 1 MB.
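The shape of such a reproducer can be sketched with the standard library alone (hypothetical: the line template and the trailing comment are invented to mimic the report, a mostly-ASCII C-like file of roughly 10 MB with a single stray latin-1 byte near the end):

```python
import os
import tempfile

# Build a ~10 MB mostly-ASCII file whose only non-ASCII byte (0xb5)
# sits deep inside, mimicking the file from the report.
line_template = b"int foo_%08d = 0;\n"

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as fh:
    written = 0
    i = 0
    while written < 10_000_000:
        chunk = line_template % i
        fh.write(chunk)
        written += len(chunk)
        i += 1
    fh.write(b"/* wait 5 \xb5s before retry */\n")  # the single stray byte
    path = fh.name

print(path, os.path.getsize(path))
```

Pointing the test.py script from the report at the generated path should exercise the same code path.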

aytey (Author) commented Nov 10, 2021

I tried it -- works! 🎉

I'll close this issue given the PR is up.

@aytey aytey closed this as completed Nov 10, 2021
Ousret added a commit that referenced this issue Nov 20, 2021
* ✔️ Add simple test case that show the problem (Issue #136)

* 🐛 Fix getting misled by large sequences (lazy str loading)

* 🐛 Ignore too insignificant extracted chunk

* 🔖 Bump to 2.0.8.dev3