
[BUG] Memory usage increase for big files #376

Closed
josteinl opened this issue Oct 31, 2023 · 1 comment · Fixed by #377
Labels
bug Something isn't working

Comments

@josteinl

josteinl commented Oct 31, 2023

Describe the bug
After upgrading from version 3.2.0 to either 3.3.0 or 3.3.1, I noticed a huge increase in memory usage. Running from_bytes() on a 25 MB file now uses almost 3 GB of memory.

To Reproduce
Run this file, placed inside the charset_normalizer folder, with the scalene memory profiler (Linux/WSL):

memory_profile_test.py:

"""
Run from the project root:

    poetry run python3 -m scalene charset_normalizer/memory_profile_test.py

or (with an activated virtual environment)

    pip install scalene
    scalene charset_normalizer/memory_profile_test.py
"""

from charset_normalizer.api import from_bytes

file_name = "data/memory_profile_test.txt"

with open(file_name, "rb") as file:
    data = file.read()
    result = from_bytes(data)
    best = result.best()
    print(f"{best=}")

Data file used (25 MB), placed in the data folder:
memory_profile_test.txt

Profiler result (download and view in browser):
profile_charset_normalizer_3.3.1.html

Expected behaviour
Expected the function to use only slightly more memory than the file passed into from_bytes().

Testing Environment

  • OS: Ubuntu on WSL
  • Python version 3.11.6
  • Package version 3.3.0 / 3.3.1

Additional context
We use charset-normalizer in a program running in containers with strict memory limits. We noticed the change in behaviour after our pods were killed for running out of memory (OOM).

Doing some debugging, it seems the increase in memory consumption comes from storing the decoded_payload on each CharsetMatch.
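For context, here is a minimal sketch (not charset_normalizer's actual internals) of why keeping a decoded copy alongside the original bytes inflates memory: CPython stores non-ASCII text at 1, 2, or 4 bytes per code point, so any object that holds its own decoded str keeps another near-full-size copy of the payload alive for as long as it is referenced.

```python
import sys

# Hedged illustration, not charset_normalizer's real code: decoding a large
# bytes payload to str and keeping both referenced roughly doubles the
# resident memory for that payload.
payload = ("中" * 1_000_000).encode("utf-8")  # ~3 MB of UTF-8 bytes
decoded = payload.decode("utf-8")             # ~2 MB (2 bytes per code point)

bytes_size = sys.getsizeof(payload)
str_size = sys.getsizeof(decoded)
print(f"bytes: {bytes_size:,} B, decoded str: {str_size:,} B, "
      f"both alive: {bytes_size + str_size:,} B")
```

If several candidate matches each hold a decoded copy of the same payload, the footprint compounds well beyond the single extra copy shown here.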

Finally
A big thank you to the authors and maintainers! This library is much needed, used and appreciated!

@Ousret
Owner

Ousret commented Oct 31, 2023

You are welcome.
The report you gave us helped us understand and fix the issue quickly.
We will publish a patch release soon.

Ousret added a commit that referenced this issue Oct 31, 2023
@Ousret Ousret removed the help wanted Extra attention is needed label Oct 31, 2023