UnicodeDecodeError with mixed-character-set diffs #19

jsm28 · 2021-02-19T19:02:11Z

I saw the following error pushing a GCC commit involving diffs (to .po files) in mixed character sets.

$ git push origin master
Enumerating objects: 46, done.
Counting objects: 100% (46/46), done.
Delta compression using up to 12 threads
Compressing objects: 100% (46/46), done.
Writing objects: 100% (46/46), 5.63 MiB | 1.96 MiB/s, done.
Total 46 (delta 33), reused 0 (delta 0)
remote: Traceback (most recent call last):
remote:   File "hooks/post_receive.py", line 118, in <module>
remote:     post_receive(refs_data, args.submitter_email)
remote:   File "hooks/post_receive.py", line 65, in post_receive
remote:     submitter_email)
remote:   File "hooks/post_receive.py", line 47, in post_receive_one
remote:     update.send_email_notifications()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189, in send_email_notifications
remote:     self.__email_new_commits()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1031, in __email_new_commits
remote:     commit, self.get_standard_commit_email(commit))
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1011, in __send_commit_email
remote:     default_diff=email.diff)
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946, in __maybe_get_email_custom_contents
remote:     hook_input=json.dumps(hooks_data),
remote:   File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
remote:     return _default_encoder.encode(obj)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
remote:     chunks = self.iterencode(o, _one_shot=True)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
remote:     return _iterencode(o, 0)
remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 47260652: invalid continuation byte
To git+ssh://gcc.gnu.org/git/gcc.git
   3599ecb6a01..7d524a5de33  master -> master

If diffs are in mixed character sets (different .po file in different character sets, in this case) it might be necessary to omit some of them or inaccurately mark the character set of the email in its headers, but there should not be a UnicodeDecodeError and a commit email should always be sent.

I don't have a self-contained reproducer (so the GCC hook postprocessing the email might be involved, but since that hook is in Python 3 it's unlikely to be producing anything badly encoded itself).

The text was updated successfully, but these errors were encountered:

brobecke · 2021-02-20T07:20:59Z

Thank you for the report @jsm28 .

I'll look into it as soon as I can, but I wanted to say that I'm really busy right now, and I don't think I'll be able to find time to work on this for some weeks at least. Perhaps one way to work around this before this is fixed is to exploit the fact that you have a commit email reformatter hook. You could make catch those issues there by always returning all pieces of the email, diff included, after having doing the transformation to utf-8. That way, there is no decoding on the git-hooks side that could trigger than unwanted exception. Would that work?

jsm28 · 2021-02-22T18:58:47Z

I don't think it's particularly important to have a workaround before then. This particular case of mixed-character-set diffs is unusual; even most .po updates wouldn't be affected (only two of the libcpp .po files are not in UTF-8; most .pot updates only affect gcc.pot not cpplib.pot and all the gcc .po files are in UTF-8, so typically it would be one commit per year that does a bulk .po update that involves mixed-character-set diffs).

brobecke · 2021-11-27T14:15:07Z

Hi @jsm28 ,

I was able to reproduce the problem above using commit 7d524a5de33 in the GCC Git repository. The good news is that the issue will be gone once we upgrade the hooks on sourceware to the current HEAD. This will require us to use a Python 3.8+ interpreter, however, so I wanted to do some additional testing directly on sourceware before we plan the deployment.

The way things are supposed to work is that the hooks will now decode git's output and store the decoded output in strings, which with Python 3.x are now unicode strings. For the diff, I believe we decode the output line by line, so a commit touching multiple files with different encodings can be handled. As for the export (to the project's hooks or to the mailer), we encode everything into UTF-8. So the email's body should use a consistent encoding, and the Content-Type field should be providing the correct value for decoding of the email's body.

The limitation, however, is that the git-hooks only tries a limited number of encodings when decoding data: ASCII, UTF-8 and iso8859-15 (I could have dropped ASCII, but anyways). I felt that with UTF-8 and iso8859-15, we really were covered, but we can always add more if needed. Similarly, when trying to export, if some characters are not compatible with UTF-8, we're probably going to have problems. Same answer here: I really feel that UTF-8 is so prevalent that there is no need to worry about this limitation.

I'll try to come up with a small reproducer next, so as to make a regression testcase out of it.

brobecke · 2021-11-27T14:18:55Z

For the record, this is what it looks like when I call the post-receive hook by hand while in debug mode, confirm that the email sending works, and that the charset is "utf-8":

$ echo 3599ecb6a0145a428def5314d2d67d2e5a88f3c4 7d524a5de33910d7df7033c793afbad42444d2f5 refs/heads/master | ./hooks/post-receive --submitter-email='Joel Brobecker <brobecker@adacore.com>'
DEBUG: Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
From: Joel Brobecker <brobecker@adacore.com>
To: brobecker@adacore.com
Subject: [new-py3-gcc-public-bare] Update .po files.
X-Act-Checkin: new-py3-gcc-public-bare
X-Git-Author: Joseph Myers <joseph@codesourcery.com>
X-Git-Refname: refs/heads/master
X-Git-Oldrev: 3599ecb6a0145a428def5314d2d67d2e5a88f3c4
X-Git-Newrev: 7d524a5de33910d7df7033c793afbad42444d2f5

commit 7d524a5de33910d7df7033c793afbad42444d2f5
Author: Joseph Myers <joseph@codesourcery.com>
Date:   Fri Feb 19 18:23:36 2021 +0000

    Update .po files.

    gcc/po/
            * be.po, da.po, de.po, el.po, es.po, fi.po, fr.po, hr.po, id.po,
            ja.po, nl.po, ru.po, sr.po, sv.po, tr.po, uk.po, vi.po, zh_CN.po,
            zh_TW.po: Update.

    libcpp/po/
            * be.po, ca.po, da.po, de.po, el.po, eo.po, es.po, fi.po, fr.po,
            id.po, ja.po, nl.po, pt_BR.po, ru.po, sr.po, sv.po, tr.po, uk.po,
            vi.po, zh_CN.po, zh_TW.po: Update.

Note that I verified that this testcase reproduces the issue reported at #19 when using the legacy-python2 branch: | $ echo d408c32bba64a79fc60dbda8fa047524f353e7cd \ | 28fbd651f8161bcba746f5dbcecf5ec757241c1d \ | refs/heads/master | \ | ./hooks/post-receive | [...] | UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in | position 425: invalid continuation byte This testcase allows us to confirm that this issue was resolved during the transition to Python 3.x. Change-Id: Ia94dc584577e3b4d145fad7b7f89429a1b031849

brobecke · 2021-12-01T05:40:35Z

I added a regression testcase to confirm this is now fixed: See commit 7bc1c15.

brobecke closed this as completed Dec 1, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UnicodeDecodeError with mixed-character-set diffs #19

UnicodeDecodeError with mixed-character-set diffs #19

jsm28 commented Feb 19, 2021

brobecke commented Feb 20, 2021

jsm28 commented Feb 22, 2021 •

edited

brobecke commented Nov 27, 2021

brobecke commented Nov 27, 2021

brobecke commented Dec 1, 2021

UnicodeDecodeError with mixed-character-set diffs #19

UnicodeDecodeError with mixed-character-set diffs #19

Comments

jsm28 commented Feb 19, 2021

brobecke commented Feb 20, 2021

jsm28 commented Feb 22, 2021 • edited

brobecke commented Nov 27, 2021

brobecke commented Nov 27, 2021

brobecke commented Dec 1, 2021

jsm28 commented Feb 22, 2021 •

edited