
Detect encoding per yaml spec (fix #238) #240

Open · spyoungtech wants to merge 15 commits into adrienverge:master from spyoungtech:patch-238

Conversation

spyoungtech

Resolves #238

@spyoungtech spyoungtech reopened this Mar 25, 2020
@coveralls

coveralls commented Mar 25, 2020

Coverage Status

Coverage increased (+0.03%) to 97.903% when pulling 37c7af9 on spyoungtech:patch-238 into 46ed0c0 on adrienverge:master.

@spyoungtech spyoungtech changed the title offload file reading to pyyaml (fix #238) Detect encoding per yaml spec (fix #238) Mar 25, 2020
Contributor

@bz2 bz2 left a comment


Thanks! This is about what I had in mind in #238 (comment), option 4.

See a few inline comments with suggestions. It may also be possible to avoid opening the file twice with different modes, but that's probably fiddly and not worth the hassle.

(Inline review comments on yamllint/cli.py and tests/test_cli.py, since resolved.)
Owner

@adrienverge adrienverge left a comment


Sorry for the delay!

Thanks @spyoungtech and @bz2 for your help on this. This solution looks good, I wasn't aware that the YAML spec states that UTF-8 is the default.

In this case, we need to add a test for a non-UTF encoding. Could you add a fake file (e.g. non-ascii/iso-8859-1 containing éàöî) encoded in ISO-8859-1, and expect the UnicodeDecodeError?

@bz2 @spyoungtech do you know any easy solution to accept custom user encodings, depending on the system yamllint is run from? E.g. reading LC_ALL environment var?

(Inline review comments on tests/test_cli.py, since resolved.)
@spyoungtech
Author

Just wanted to say I'm still working on this, but have had some things come up. I should get around to fixing up the review items this weekend. Maintainers are also free to commit directly to my branch :-)

@adrienverge
Owner

Thanks @spyoungtech, no problem.

Sure, I could fix the few issues myself, but before merging I just want to be sure that:

@bz2
Contributor

bz2 commented Apr 3, 2020

@adrienverge I think the right thing on non-Unicode encodings, from a user perspective, is to add a new error type but otherwise parse as well as possible. We could implement that either as falling back to parsing via the locale encoding, and/or replacing non-UTF-8 byte sequences on parse.
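A minimal sketch of that fallback idea (the helper name is hypothetical, not code from this PR): try UTF-8 first, fall back to the locale's preferred encoding, and as a last resort replace undecodable bytes so linting can still proceed.

```python
import locale


def decode_with_fallback(data):
    """Decode YAML bytes: try UTF-8, then the locale's preferred
    encoding, and finally UTF-8 with undecodable bytes replaced
    by U+FFFD so the linter never crashes on bad input."""
    for encoding in ('utf-8', locale.getpreferredencoding(False)):
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return data.decode('utf-8', errors='replace')
```

A real implementation would also want to record which fallback was taken, so a new error type could flag the file as non-UTF-8.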

@spyoungtech
Author

spyoungtech commented Apr 4, 2020

@adrienverge I think the right thing on non-Unicode encodings, from a user perspective, is to add a new error type but otherwise parse as well as possible. We could implement that either as falling back to parsing via the locale encoding, and/or replacing non-UTF-8 byte sequences on parse.

Agree. Perhaps it would be reasonable then to use something like chardet for encoding detection? It should handle the BOMs the same way, if present.

It may also be worth noting that the spec draft of YAML 1.2 makes UTF-32 support mandatory.
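For reference, the YAML spec's own detection scheme is deterministic rather than statistical: section 5.2 of YAML 1.2 keys the encoding off the BOM, or off where null bytes fall in the first character (a YAML stream is assumed to start with an ASCII character). A rough sketch of that table, not code from this PR:

```python
import codecs


def detect_yaml_encoding(data):
    """Guess a YAML stream's encoding from its first bytes, following
    the table in section 5.2 of the YAML 1.2 spec; the default,
    absent any other signal, is UTF-8."""
    # UTF-32 BOMs must be checked before UTF-16: the UTF-32-LE BOM
    # starts with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF8, 'utf-8'),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No BOM: infer from the null-byte pattern of the first character.
    if data[:4].startswith(b'\x00\x00\x00'):
        return 'utf-32-be'
    if len(data) >= 4 and data[1:4] == b'\x00\x00\x00':
        return 'utf-32-le'
    if data.startswith(b'\x00'):
        return 'utf-16-be'
    if len(data) >= 2 and data[1:2] == b'\x00':
        return 'utf-16-le'
    return 'utf-8'
```

Under this scheme, chardet would only matter for files that match none of these patterns yet aren't valid UTF-8.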

@spyoungtech
Author

Okay, I went ahead and committed a new change adding a yamlopen function that uses chardet for character-encoding detection when no encoding is provided explicitly (giving you the option to add such a mechanism later, if you so choose!). Under the hood, io.TextIOWrapper is used (instead of StringIO, for example) so you can inspect the encoding attribute, allowing you to write rules based on the detected encoding if you want.
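Roughly, such a helper might look like this (a sketch based on the description above, not the PR's actual code; chardet.detect returns a dict whose 'encoding' key can be None when detection fails):

```python
import io


def yamlopen(path, encoding=None):
    """Open a YAML file as text.  When no encoding is given, guess one
    with chardet, falling back to UTF-8; return an io.TextIOWrapper
    so callers can inspect its .encoding attribute."""
    raw = open(path, 'rb')
    if encoding is None:
        try:
            import chardet
            encoding = chardet.detect(raw.read())['encoding'] or 'utf-8'
        except ImportError:
            # chardet not installed: fall back to the spec's default.
            encoding = 'utf-8'
        raw.seek(0)
    return io.TextIOWrapper(raw, encoding=encoding)
```

Because the result is a TextIOWrapper, it also works as a context manager and closes the underlying binary file on exit.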

@spyoungtech
Author

spyoungtech commented Apr 4, 2020

do you know any easy solution to accept custom user encodings, depending on the system yamllint is run from? E.g. reading LC_ALL environment var?

Simply put, UTF-8 is the most sensible default; it's also the spec's default. I had considered using locale.getpreferredencoding(), but it would basically guarantee errors on Windows and would often return encodings that are more likely to be wrong in cases where UTF-8 works fine. sys.getdefaultencoding() is another option, but has similar drawbacks.

The solution I put forward fixes the bug in #238 and also allows an explicit encoding to be provided, should you ever decide on a mechanism for supplying one (perhaps a command-line argument?).

Another option would be to pass an appropriate argument as the errors parameter to open(). For example, replace or surrogateescape would at least allow an attempt at linting despite encoding errors.
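To illustrate (using bytes.decode here, though open() accepts the same errors values):

```python
# 0xE9 is 'é' in ISO-8859-1, but an invalid byte sequence in UTF-8.
data = b'key: caf\xe9\n'

# The default, errors='strict', raises on the stray byte:
try:
    data.decode('utf-8')
except UnicodeDecodeError:
    pass  # this is the failure yamllint currently surfaces

# errors='replace' substitutes U+FFFD so linting can proceed:
assert data.decode('utf-8', errors='replace') == 'key: caf\ufffd\n'

# errors='surrogateescape' preserves the raw byte losslessly,
# so re-encoding recovers the original bytes:
text = data.decode('utf-8', errors='surrogateescape')
assert text.encode('utf-8', errors='surrogateescape') == data
```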

Owner

@adrienverge adrienverge left a comment


Hello, thanks for the update.

I'm very much against adding dependencies (like chardet).
Moreover, this last version complicates the code, and I'm pretty sure we'd hit other problems on some platforms. Also, the yamlopen() wrapper isn't really needed, I think.

Thinking about the problem, yamllint has always used open() to read input files, and since commit 91763f5 it uses io.open(). On Python 3 both these functions use locale.getpreferredencoding() as default encoding, if none provided. It seems like the correct way to do, and it's standard.

Problem: it's just not working with io.open() on Python 2. Since Python 2 will eventually be dropped, I propose to simply:

if sys.version_info.major < 3:
    # On Python 2, io.open() causes decoding problems:
    opener = open(file)
else:
    opener = io.open(file, newline='')
with opener as f:
    ...

@spyoungtech @bz2 what do you think?

About tests:

  • Again, could you add a test with a fake file non-ascii/iso-8859-1 containing éàöî encoded in ISO-8859-1?

  • Can you run yamllint on encoded files (not just with cli.yamlopen(path)), to check there isn't any crash from end to end?

path = os.path.join(self.wd, 'a.yaml')
with cli.yamlopen(path, encoding='windows-1252') as yaml_file:
    yaml_file.read()
    self.assertEqual(yaml_file.encoding, 'windows-1252')
adrienverge (Owner)


This does not test anything because it was already opened with encoding='windows-1252', does it?

spyoungtech (Author)


It tests that yamlopen correctly accepts an explicit encoding, instead of detecting the encoding with chardet.

There are tests that run the linter on all the new non-ascii files. The exception is the file made up of random, undecodable bytes, which exists just to be sure UTF-8 is used as the default when chardet can't detect any encoding.

I understand not wanting to add dependencies. If it's any consolation, chardet itself is pure Python with no external dependencies. I can also go back and revert to just checking the BOM per the YAML spec, expecting failure or malformed data in any other case.

The motivation for using chardet was the comment from bz2 about making a best-effort to parse files even if they're not an acceptable encoding.

The yamlopen context manager was also implemented per request on review. I can remove that, too, if you prefer not to use it.

adrienverge (Owner)


It tests that yamlopen correctly accepts an explicit encoding, instead of detecting the encoding with chardet.

OK, thanks for the explanation!

There are tests that run the linter on all the new non-ascii files, except for the file that is made up of random bytes that cannot be decoded and exists just to be sure UTF-8 is used as a default if chardet can't detect any encoding.

Sure. But still no tests for ISO-8859-1-encoded file, unless I missed something?

I understand not wanting to add dependencies. If it's any consolation chardet itself is pure python with no external dependencies. I can also go back and revert to just checking the BOM, per the yaml spec and expecting failure or malformed data in any other case.

The motivation for using chardet was the comment from bz2 about making a best-effort to parse files even if they're not an acceptable encoding.

Sure, I understood that. But I'm very much against adding new dependencies (pure Python or not).
Have you seen my proposal for a temporary Python-2-only solution? What do you think of it?
PS: I opened this related PR: #249

The yamlopen context manager was also implemented per request on review. I can remove that, too, if you prefer not to use it.

Yes, I've seen it, but I don't think it helps readability.

Successfully merging this pull request may close these issues.

Failure to read file in 1.21.0 with Python 2.7 and non-ascii yaml files.