
Detect encoding per yaml spec (fix #238) #240

Open · spyoungtech wants to merge 15 commits into adrienverge:master from spyoungtech:patch-238

Conversation

spyoungtech

Resolves #238

@spyoungtech spyoungtech reopened this Mar 25, 2020
@coveralls

coveralls commented Mar 25, 2020

Coverage Status

Coverage increased (+0.03%) to 97.903% when pulling 37c7af9 on spyoungtech:patch-238 into 46ed0c0 on adrienverge:master.

@spyoungtech spyoungtech changed the title offload file reading to pyyaml (fix #238) Detect encoding per yaml spec (fix #238) Mar 25, 2020
Contributor

@bz2 bz2 left a comment


Thanks! This is about what I had in mind in #238 (comment), option 4.

See a few inline comments with suggestions. It may also be possible to avoid opening the file twice with different modes, but that's probably fiddly and not worth the hassle.

(Inline review comments on yamllint/cli.py and tests/test_cli.py, since resolved.)
Owner

@adrienverge adrienverge left a comment


Sorry for the delay!

Thanks @spyoungtech and @bz2 for your help on this. This solution looks good, I wasn't aware that the YAML spec states that UTF-8 is the default.

In this case, we need to add a test for a non-UTF encoding. Could you add a fake file (e.g. non-ascii/iso-8859-1 containing éàöî) encoded in ISO-8859-1, and expect the UnicodeDecodeError?

@bz2 @spyoungtech do you know any easy solution to accept custom user encodings, depending on the system yamllint is run from? E.g. reading LC_ALL environment var?

(Inline review comments on tests/test_cli.py, since resolved.)
@spyoungtech
Author

Just wanted to say I'm still working on this, but have had some things come up. I should get around to fixing up the review items this weekend. Maintainers are also free to commit directly to my branch :-)

@adrienverge
Owner

Thanks @spyoungtech, no problem.

Sure, I could fix the few issues myself, but before merging I just want to be sure that:

@bz2
Contributor

bz2 commented Apr 3, 2020

@adrienverge I think the right thing on non-Unicode encodings, from a user perspective, is to add a new error type but otherwise parse as well as possible. We could implement that either as falling back to parsing via the locale encoding, and/or replacing non-UTF-8 byte sequences on parse.
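A minimal sketch of that fallback idea (the helper name is hypothetical, not code from this PR): try UTF-8 first, fall back to the locale's preferred encoding, and as a last resort replace undecodable bytes so linting can still proceed.

```python
import locale


def decode_with_fallback(data):
    """Decode YAML bytes: try UTF-8, then the locale's preferred
    encoding, and finally UTF-8 with undecodable bytes replaced
    by U+FFFD so the linter never crashes on bad input."""
    for encoding in ('utf-8', locale.getpreferredencoding(False)):
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    return data.decode('utf-8', errors='replace')
```

A real implementation would also want to record which fallback was taken, so a new error type could flag the file as non-UTF-8.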

@spyoungtech
Author

spyoungtech commented Apr 4, 2020

@adrienverge I think the right thing on non-Unicode encodings, from a user perspective, is to add a new error type but otherwise parse as well as possible. We could implement that either as falling back to parsing via the locale encoding, and/or replacing non-UTF-8 byte sequences on parse.

Agree. Perhaps it would be reasonable then to use something like chardet for encoding detection? It should handle the BOMs the same way, if present.

It may also be worth noting that the spec draft of YAML 1.2 makes UTF-32 support mandatory.
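For reference, the YAML spec's own detection scheme is deterministic rather than statistical: section 5.2 of YAML 1.2 keys the encoding off the BOM, or off where null bytes fall in the first character (a YAML stream is assumed to start with an ASCII character). A rough sketch of that table, not code from this PR:

```python
import codecs


def detect_yaml_encoding(data):
    """Guess a YAML stream's encoding from its first bytes, following
    the table in section 5.2 of the YAML 1.2 spec; the default,
    absent any other signal, is UTF-8."""
    # UTF-32 BOMs must be checked before UTF-16: the UTF-32-LE BOM
    # starts with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF8, 'utf-8'),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    # No BOM: infer from the null-byte pattern of the first character.
    if data[:4].startswith(b'\x00\x00\x00'):
        return 'utf-32-be'
    if len(data) >= 4 and data[1:4] == b'\x00\x00\x00':
        return 'utf-32-le'
    if data.startswith(b'\x00'):
        return 'utf-16-be'
    if len(data) >= 2 and data[1:2] == b'\x00':
        return 'utf-16-le'
    return 'utf-8'
```

Under this scheme, chardet would only matter for files that match none of these patterns yet aren't valid UTF-8.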

@spyoungtech
Author

Okay, I went ahead and committed a new change adding a yamlopen function that uses chardet for character-encoding detection when no encoding is provided explicitly (giving you the option to add such a mechanism later, if you so choose!). Under the hood, io.TextIOWrapper is used (instead of StringIO, for example) so you can inspect the encoding attribute, allowing you to write rules based on the detected encoding if you want.
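Roughly, such a helper might look like this (a sketch based on the description above, not the PR's actual code; chardet.detect returns a dict whose 'encoding' key can be None when detection fails):

```python
import io


def yamlopen(path, encoding=None):
    """Open a YAML file as text.  When no encoding is given, guess one
    with chardet, falling back to UTF-8; return an io.TextIOWrapper
    so callers can inspect its .encoding attribute."""
    raw = open(path, 'rb')
    if encoding is None:
        try:
            import chardet
            encoding = chardet.detect(raw.read())['encoding'] or 'utf-8'
        except ImportError:
            # chardet not installed: fall back to the spec's default.
            encoding = 'utf-8'
        raw.seek(0)
    return io.TextIOWrapper(raw, encoding=encoding)
```

Because the result is a TextIOWrapper, it also works as a context manager and closes the underlying binary file on exit.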

@spyoungtech
Author

spyoungtech commented Apr 4, 2020

do you know any easy solution to accept custom user encodings, depending on the system yamllint is run from? E.g. reading LC_ALL environment var?

Simply put, UTF-8 is the most sensible default; it's also the spec's default. I had considered using locale.getpreferredencoding(), but it would basically guarantee errors on Windows and would often return encodings that are more likely to be wrong in cases where UTF-8 works fine. sys.getdefaultencoding() is another option, but has similar drawbacks.

The solution I put forward fixes the bug in #238 and also allows an explicit encoding to be provided, should you ever decide on a mechanism for supplying one (perhaps a command-line argument?).

Another option would be to pass an appropriate argument as the errors parameter to open(). For example, replace or surrogateescape would at least allow an attempt at linting despite encoding errors.
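To illustrate (using bytes.decode here, though open() accepts the same errors values):

```python
# 0xE9 is 'é' in ISO-8859-1, but an invalid byte sequence in UTF-8.
data = b'key: caf\xe9\n'

# The default, errors='strict', raises on the stray byte:
try:
    data.decode('utf-8')
except UnicodeDecodeError:
    pass  # this is the failure yamllint currently surfaces

# errors='replace' substitutes U+FFFD so linting can proceed:
assert data.decode('utf-8', errors='replace') == 'key: caf\ufffd\n'

# errors='surrogateescape' preserves the raw byte losslessly,
# so re-encoding recovers the original bytes:
text = data.decode('utf-8', errors='surrogateescape')
assert text.encode('utf-8', errors='surrogateescape') == data
```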

Owner

@adrienverge adrienverge left a comment


Hello, thanks for the update.

I'm very much against adding dependencies (like chardet).
Moreover, this last version complicates the code, and I'm pretty sure we'd hit other problems on some platforms. Also, the yamlopen() wrapper isn't really needed, I think.

Thinking about the problem, yamllint has always used open() to read input files, and since commit 91763f5 it uses io.open(). On Python 3 both these functions use locale.getpreferredencoding() as default encoding, if none provided. It seems like the correct way to do, and it's standard.

Problem: it's just not working with io.open() on Python 2. Since Python 2 will eventually be dropped, I propose to simply:

if sys.version_info.major < 3:
    # On Python 2, io.open() causes decoding problems:
    opener = open(file)
else:
    opener = io.open(file, newline='')
with opener as f:
    ...

@spyoungtech @bz2 what do you think?

About tests:

  • Again, could you add a test with a fake file non-ascii/iso-8859-1 containing éàöî encoded in ISO-8859-1?

  • Can you run yamllint on encoded files (not just with cli.yamlopen(path)), to check there isn't any crash from end to end?

path = os.path.join(self.wd, 'a.yaml')
with cli.yamlopen(path, encoding='windows-1252') as yaml_file:
    yaml_file.read()
    self.assertEqual(yaml_file.encoding, 'windows-1252')
adrienverge (Owner)


This does not test anything because it was already opened with encoding='windows-1252', does it?

spyoungtech (Author)


It tests that yamlopen correctly accepts an explicit encoding, instead of detecting the encoding with chardet.

There are tests that run the linter on all the new non-ascii files. The exception is the file made up of random, undecodable bytes, which exists just to be sure UTF-8 is used as the default when chardet can't detect any encoding.

I understand not wanting to add dependencies. If it's any consolation, chardet itself is pure Python with no external dependencies. I can also go back and revert to just checking the BOM per the YAML spec, expecting failure or malformed data in any other case.

The motivation for using chardet was the comment from bz2 about making a best-effort to parse files even if they're not an acceptable encoding.

The yamlopen context manager was also implemented per request on review. I can remove that, too, if you prefer not to use it.

adrienverge (Owner)


It tests that yamlopen correctly accepts an explicit encoding, instead of detecting the encoding with chardet.

OK, thanks for the explanation!

There are tests that run the linter on all the new non-ascii files, except for the file that is made up of random bytes that cannot be decoded and exists just to be sure UTF-8 is used as a default if chardet can't detect any encoding.

Sure. But still no tests for ISO-8859-1-encoded file, unless I missed something?

I understand not wanting to add dependencies. If it's any consolation chardet itself is pure python with no external dependencies. I can also go back and revert to just checking the BOM, per the yaml spec and expecting failure or malformed data in any other case.

The motivation for using chardet was the comment from bz2 about making a best-effort to parse files even if they're not an acceptable encoding.

Sure, I understood that. But I'm very much against adding new dependencies (pure Python or not).
Have you seen my proposal for a temporary Python-2-only solution? What do you think of it?
PS: I opened this related PR: #249

The yamlopen context manager was also implemented per request on review. I can remove that, too, if you prefer not to use it.

Yes, I've seen it, but I don't think it helps readability.

Successfully merging this pull request may close these issues.

Failure to read file in 1.21.0 with Python 2.7 and non-ascii yaml files.