
Don't inject unicodedata2 into sys.modules #57

Merged
merged 1 commit into from Jul 14, 2021

Conversation

akx
Contributor

@akx akx commented Jul 14, 2021

I noticed charset_normalizer meddles with sys.modules, causing this:

>>> import charset_normalizer
>>> import unicodedata
>>> unicodedata
<module 'unicodedata2' from '.../site-packages/unicodedata2.cpython-39-darwin.so'>

This PR fixes that by using a fairly standard try: except ImportError: guard instead of the sys.modules hook.

>>> import charset_normalizer
>>> import unicodedata
>>> unicodedata
<module 'unicodedata' from '.../python3.9/lib-dynload/unicodedata.cpython-39-darwin.so'>

@Ousret
Owner

Ousret commented Jul 14, 2021

Hi @akx

Thanks for your report and your PR proposal.

Why do you say this PR "fixes" something when this behavior is well documented and expected?
I would be open to changing that, considering recent events.

I have some further questions regarding this patch.
Why did you install unicodedata2 if you did not expect it to be used instead of CPython's unicodedata distribution?

Removing the charset_normalizer.hook and replacing it with a plain import won't be useful unless we create some intermediary compat-like solution.

It would be nicer to propose something that both keeps backward compatibility AND "fixes" your concern.

Thanks.

@akx
Contributor Author

akx commented Jul 14, 2021

Hi @Ousret.

As far as I can see, it's certainly not documented that simply importing charset_normalizer while having unicodedata2 installed will make it impossible to access unicodedata (unless you had imported unicodedata beforehand).

Sure, it's documented that charset_normalizer will internally use unicodedata2 if it's available, and that's fine, but I don't expect it to mess with the global state of my Python interpreter. If I wanted to patch in unicodedata2 for all unicodedatas in my interpreter, I'd want to be explicit about that. (Explicit is better than implicit.)

My proposal is exactly the same mechanism that patches in charset_normalizer for chardet in requests (which is how I stumbled upon this library in the first place):

https://github.com/psf/requests/blob/a1a6a549a0143d9b32717dbe3d75cd543ae5a4f6/requests/compat.py#L11-L14

@Ousret
Owner

Ousret commented Jul 14, 2021

I really do not understand why having a more up-to-date unicodedata is messing with your global Python environment. Anyway.
As I said earlier, I am open to changing that behavior, but I must be convinced that your method/patch actually does what is expected.

@akx
Contributor Author

akx commented Jul 14, 2021

I really do not understand why having a more up-to-date unicodedata is messing with your global Python environment.

It's a matter of principle. If I do import unicodedata, I want unicodedata, not another module just because I had happened to import an unrelated module (charset_normalizer) before importing unicodedata.

It's maybe slightly far-fetched, but without this patch, you can't easily write a program that compares the differences between unicodedata and unicodedata2 unless you're specific about your import order!

I must be convinced that using your method/patch actually does what is expected.

Well, I'm pretty sure the CI suite for charset_normalizer would show that it doesn't break things. Can you enable running the GitHub workflows for this PR?

@Ousret
Owner

Ousret commented Jul 14, 2021

I need another person's opinion on that. @sethmlarson What do you think?
Unfortunately, in the current state the test suite does not prove without a doubt that unicodedata2 is correctly used. I am working on it.

How would the following be the right thing for us?

import unicodedata2 as unicodedata

"a".isalpha()

What I am concerned about is methods like .isalpha() on a str instance. How do you point them to unicodedata2?

@akx
Contributor Author

akx commented Jul 14, 2021

import unicodedata2 as unicodedata

"a".isalpha()

What I am concerned about is methods like .isalpha() on a str instance. How do you point them to unicodedata2?

As mentioned in #57 (comment), I don't believe having unicodedata2 installed helps at all with str.isalpha() and similar calls. In other words, you'd need your own def isalpha(s)-style implementation that consults unicodedata2's tables, and doing that in pure Python will likely be slow.

@sethmlarson

@Ousret I don't know much about the unicodedata module or whether it even interacts with isalpha. From a brief reading of CPython, it doesn't seem to interact with this static table: https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Python/pyctype.c

If you chase down the implementation of str.isalpha, it ends up checking flags in a compiled-in table, at least from my reading.

If that is the case, we shouldn't be injecting unicodedata2, as it doesn't modify str.isalpha behavior.

@akx
Contributor Author

akx commented Jul 14, 2021

@sethmlarson

whether it even interacts with isalpha, from a brief reading of CPython it doesn't seem like it interacts with this static table: https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Python/pyctype.c

That's the non-Unicode table. See #57 (comment) for the Unicode isalpha chase.

@codecov-commenter

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #57 (a30cf9f) into master (929f13c) will increase coverage by 0.03%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   84.46%   84.49%   +0.03%     
==========================================
  Files          12       11       -1     
  Lines        1062     1058       -4     
==========================================
- Hits          897      894       -3     
+ Misses        165      164       -1     
Impacted Files                     Coverage Δ
charset_normalizer/__init__.py     100.00% <ø> (ø)
charset_normalizer/utils.py        76.25% <100.00%> (+0.52%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Ousret
Owner

Ousret commented Jul 14, 2021

Thanks, @akx @sethmlarson for your inputs.
I am okay with merging this. I don't know when a new tag will be available; it would be wise to wait for any remarks or concerns beforehand.

Do you have any more concerns @akx ?

@akx
Contributor Author

akx commented Jul 14, 2021

@Ousret I don't have any further concerns regarding this PR :) Thank you for your consideration!

@Ousret
Owner

Ousret commented Jul 14, 2021

You are welcome, thanks for your contribution as well.

@Ousret Ousret merged commit 53b2dab into Ousret:master Jul 14, 2021
@Ousret Ousret mentioned this pull request Jul 14, 2021
@Ousret
Owner

Ousret commented Jul 14, 2021

@dopplershift

I just found this while trying to debug a problem, and I wanted to make absolutely clear that the old way causes problems. With charset_normalizer 2.0.0 (and 2.0.1), I saw this behavior:

No import:

❯ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:59:23) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> compile(r's="$\N{DEGREE FAHRENHEIT}$"', 'foo.py', 'exec')
<code object <module> at 0x7fde000c4810, file "foo.py", line 1>
>>> 

With import charset_normalizer:

❯ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:59:23) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import charset_normalizer
>>> compile(r's="$\N{DEGREE FAHRENHEIT}$"', 'foo.py', 'exec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "foo.py", line 1
SyntaxError: (unicode error) \N escapes not supported (can't load unicodedata module)
>>> 

With the fix merged here in 2.0.2 (and the latest 2.0.7), the problem disappears. (I have an entirely different issue to solve where conda-forge is still giving me 2.0.0 🤷‍♂️ )

@Ousret
Owner

Ousret commented Nov 24, 2021

Thanks for the report. Indeed, I saw this weird behavior of conda-forge serving 2.0.0 by default instead of the latest release. I do not know why, or whom to contact about it. https://anaconda.org/conda-forge/charset-normalizer/files The download trend suggests something is wrong. Could it be that the requests version requirement ~=2.0.* for this library is being applied in the wrong way?

@dopplershift

Incorrect dependencies for requests was exactly the reason: conda-forge/requests-feedstock#48
