UnicodeDecodeError with mixed-character-set diffs #19
Comments
Thank you for the report, @jsm28. I'll look into it as soon as I can, but I wanted to say that I'm really busy right now, and I don't think I'll be able to find time to work on this for some weeks at least. Perhaps one way to work around this until it is fixed is to exploit the fact that you have a commit email reformatter hook. You could catch those issues there by always returning all pieces of the email, diff included, after having converted them to UTF-8. That way, there is no decoding on the git-hooks side that could trigger that unwanted exception. Would that work?
I don't think it's particularly important to have a workaround before then. This particular case of mixed-character-set diffs is unusual; even most .po updates wouldn't be affected (only two of the libcpp .po files are not in UTF-8; most .pot updates only affect gcc.pot, not cpplib.pot, and all the gcc .po files are in UTF-8, so typically it would be one commit per year that does a bulk .po update that involves mixed-character-set diffs).
Hi @jsm28, I was able to reproduce the problem above using commit 7d524a5de33 in the GCC Git repository. The good news is that the issue will be gone once we upgrade the hooks on sourceware to the current HEAD. This will require us to use a Python 3.8+ interpreter, however, so I wanted to do some additional testing directly on sourceware before we plan the deployment.

The way things are supposed to work is that the hooks will now decode git's output and store the decoded output in strings, which with Python 3.x are now unicode strings. For the diff, I believe we decode the output line by line, so a commit touching multiple files with different encodings can be handled. As for the export (to the project's hooks or to the mailer), we encode everything into UTF-8. So the email's body should use a consistent encoding, and the Content-Type field should provide the correct value for decoding the email's body.

The limitation, however, is that the git-hooks only try a limited number of encodings when decoding data: ASCII, UTF-8 and iso8859-15 (I could have dropped ASCII, but anyways). I felt that with UTF-8 and iso8859-15, we really were covered, but we can always add more if needed. Similarly, when trying to export, if some characters are not compatible with UTF-8, we're probably going to have problems. Same answer here: I really feel that UTF-8 is so prevalent that there is no need to worry about this limitation.

I'll try to come up with a small reproducer next, so as to make a regression testcase out of it.
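To make the line-by-line decoding described above concrete, here is a minimal sketch in Python. It is not the git-hooks implementation: the names `CANDIDATE_ENCODINGS`, `decode_line` and `decode_diff` are hypothetical, and the only details taken from this thread are the list of encodings tried and the fact that everything is re-encoded to UTF-8 on export.

```python
# Hypothetical sketch, not the actual git-hooks code.
# Encodings tried, in order, as listed in the comment above.
CANDIDATE_ENCODINGS = ("ascii", "utf-8", "iso8859-15")


def decode_line(raw: bytes) -> str:
    """Decode one line of git output with the first encoding that works."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumed fallback if no candidate matches: never raise, replace instead.
    return raw.decode("utf-8", errors="replace")


def decode_diff(raw_diff: bytes) -> str:
    """Decode a diff line by line, so each line may use a different encoding."""
    lines = raw_diff.splitlines(keepends=True)
    return "".join(decode_line(line) for line in lines)


# Two lines with the same text, stored in two different character sets.
mixed = 'msgstr "año"\n'.encode("utf-8") + 'msgstr "año"\n'.encode("iso8859-15")

text = decode_diff(mixed)   # one unicode string, no exception
body = text.encode("utf-8")  # consistent encoding for the email body
```

Decoding the whole of `mixed` in one go as UTF-8 would fail on the second line, whereas decoding per line lets each line fall through to the encoding that fits it.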
For the record, this is what it looks like when I call the
Note that I verified that this testcase reproduces the issue reported at #19 when using the legacy-python2 branch:

    | $ echo d408c32bba64a79fc60dbda8fa047524f353e7cd \
    |        28fbd651f8161bcba746f5dbcecf5ec757241c1d \
    |        refs/heads/master | \
    |     ./hooks/post-receive
    | [...]
    | UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in
    | position 425: invalid continuation byte

This testcase allows us to confirm that this issue was resolved during the transition to Python 3.x.

Change-Id: Ia94dc584577e3b4d145fad7b7f89429a1b031849
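As a rough illustration of the error message quoted above (this is not the regression testcase itself, and the sample text is made up), the same kind of failure can be triggered in Python 3 by decoding an ISO 8859-15 byte sequence as UTF-8. The byte 0xf1 encodes "ñ" in ISO 8859-15, but in UTF-8 it starts a multi-byte sequence, so a following ASCII byte is rejected:

```python
# Made-up sample: "año" encoded in ISO 8859-15 is b'a\xf1o'.
raw = "año".encode("iso8859-15")

error = None
try:
    # Treating the bytes as UTF-8 fails: 0xf1 announces a multi-byte
    # sequence, and the next byte ("o") is not a continuation byte.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```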
I added a regression testcase to confirm this is now fixed: see commit 7bc1c15.
I saw the following error pushing a GCC commit involving diffs (to .po files) in mixed character sets.
If diffs are in mixed character sets (different .po files in different character sets, in this case), it might be necessary to omit some of them or to mark the character set of the email inaccurately in its headers, but there should not be a UnicodeDecodeError, and a commit email should always be sent.
I don't have a self-contained reproducer (so the GCC hook postprocessing the email might be involved, but since that hook is in Python 3 it's unlikely to be producing anything badly encoded itself).