Encoding issue #5

Open
ghost opened this issue Apr 12, 2022 · 8 comments

Comments

@ghost

ghost commented Apr 12, 2022

OS: Windows 10 21H2
Python: Python 3.10.4

Steps:

  • pip install pyquotes

example.py (file is saved as UTF-8 with/without a BOM)

print('こんにちは世界')
  • pyquotes example.py
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 11: character maps to <undefined>

If read_text() and write_text() are replaced with their read_bytes() and write_bytes() equivalents, the quote processing works (not tested on Linux). However, extra newlines are then erroneously added; I haven't had a chance to look into why.

Thank you for your time, this library has saved me many hours.

@ThiefMaster
Owner

Is your default encoding UTF8? What does import sys; sys.getdefaultencoding() return?

@ghost
Author

ghost commented Apr 12, 2022

Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.getdefaultencoding()
'utf-8'

@ThiefMaster
Owner

Can you provide a git repo with your test file so I have the exact same content and not whatever I copy from the github issue?

@ThiefMaster
Owner

Also, please post full tracebacks, not just the last line...

@ghost
Author

ghost commented Apr 12, 2022

example.txt

Switched to Python 3.9 to match your environment

pyquotes example.txt
Error while processing example.txt
Traceback (most recent call last):
  File "C:\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\Scripts\pyquotes.exe\__main__.py", line 7, in <module>
  File "C:\Python39\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python39\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Python39\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Python39\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Python39\lib\site-packages\pyquotes\cli.py", line 89, in main
    changed = _process_file(file, config=config)
  File "C:\Python39\lib\site-packages\pyquotes\cli.py", line 113, in _process_file
    old_code = file.read_text()
  File "C:\Python39\lib\pathlib.py", line 1267, in read_text
    return f.read()
  File "C:\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 8: character maps to <undefined>

@ThiefMaster
Owner

For some weird reason it's using cp1252 instead of utf-8, even though utf-8 should be the default in Python 3...
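
The failure can be reproduced on any platform without relying on the system's default encoding. A minimal sketch, assuming the file contains exactly the one line from the original report: encode it as UTF-8 (with and without a BOM) and decode the bytes with cp1252, which is the codec the traceback shows being used. The helper name `first_bad_offset` is made up for illustration and is not part of pyquotes.

```python
# Reproduce the decode failure explicitly, independent of OS settings.
source = "print('こんにちは世界')"


def first_bad_offset(data: bytes, encoding: str = "cp1252"):
    """Return the byte offset at which decoding fails, or None if it succeeds."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None


without_bom = source.encode("utf-8")   # plain UTF-8
with_bom = source.encode("utf-8-sig")  # UTF-8 with a 3-byte BOM prepended

print(first_bad_offset(without_bom))  # 8
print(first_bad_offset(with_bom))     # 11
```

The offsets match the two tracebacks in this thread (position 8 without a BOM, position 11 with one): byte 0x81 is the second byte of the UTF-8 encoding of こ, and cp1252 has no character assigned to it.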

@ThiefMaster
Owner

ThiefMaster commented Apr 12, 2022

After reading https://discuss.python.org/t/pep-597-use-utf-8-for-default-text-file-encoding/1819 and asking on IRC, apparently text files default to whatever encoding the OS tells Python to use... and of course on Windows you seem to get random crap instead of consistent UTF-8 :)
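
For anyone checking their own machine: `sys.getdefaultencoding()` reports the default for str/bytes conversion, which is always UTF-8 on Python 3, while text-mode file I/O without an `encoding` argument uses the locale's preferred encoding on the Python versions discussed here. A small sketch that prints both; the second value depends on the system (cp1252 is what the traceback above shows), and enabling UTF-8 mode with `-X utf8` or `PYTHONUTF8=1` changes it.

```python
import locale
import sys

# Default for str.encode() / bytes.decode(); always 'utf-8' on Python 3.
str_default = sys.getdefaultencoding()

# Encoding used by open() / Path.read_text() when no encoding is passed
# (on the Python 3.9 / 3.10 versions used in this thread).
file_default = locale.getpreferredencoding(False)

print("str/bytes default:", str_default)
print("text file default:", file_default)
```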

A PR that forces UTF-8 there is welcome (read_text has an encoding arg). I don't have much interest in Windows to be honest, but if I get a good PR I don't mind fixing it.

@ghost
Author

ghost commented Apr 12, 2022

I'm not very familiar with GitHub's interface - apologies.

I have confirmed that implementing your suggestion fixes the issue:
cli.py:113 old_code = file.read_text(encoding='UTF-8')
cli.py:148 tmp_file.write_text(content, encoding='UTF-8')
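
A self-contained sketch of the same idea, outside pyquotes: passing `encoding` explicitly to `Path.write_text()` and `Path.read_text()` makes the round trip independent of the locale. The temporary directory and file name are only for illustration and are not taken from the pyquotes source.

```python
import tempfile
from pathlib import Path

code = "print('こんにちは世界')\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.py"  # illustrative file name
    # Explicit encoding on both sides, so the locale default is never consulted.
    path.write_text(code, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")

print(round_tripped == code)  # True
```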

@ghost ghost mentioned this issue Apr 12, 2022