Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UnicodeDecodeError with mixed-character-set diffs #19

Closed
jsm28 opened this issue Feb 19, 2021 · 5 comments
Closed

UnicodeDecodeError with mixed-character-set diffs #19

jsm28 opened this issue Feb 19, 2021 · 5 comments

Comments

@jsm28
Copy link

jsm28 commented Feb 19, 2021

I saw the following error pushing a GCC commit involving diffs (to .po files) in mixed character sets.

$ git push origin master
Enumerating objects: 46, done.
Counting objects: 100% (46/46), done.
Delta compression using up to 12 threads
Compressing objects: 100% (46/46), done.
Writing objects: 100% (46/46), 5.63 MiB | 1.96 MiB/s, done.
Total 46 (delta 33), reused 0 (delta 0)
remote: Traceback (most recent call last):
remote:   File "hooks/post_receive.py", line 118, in <module>
remote:     post_receive(refs_data, args.submitter_email)
remote:   File "hooks/post_receive.py", line 65, in post_receive
remote:     submitter_email)
remote:   File "hooks/post_receive.py", line 47, in post_receive_one
remote:     update.send_email_notifications()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 189, in send_email_notifications
remote:     self.__email_new_commits()
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1031, in __email_new_commits
remote:     commit, self.get_standard_commit_email(commit))
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 1011, in __send_commit_email
remote:     default_diff=email.diff)
remote:   File "/sourceware1/projects/src-home/git-hooks/hooks/updates/__init__.py", line 946, in __maybe_get_email_custom_contents
remote:     hook_input=json.dumps(hooks_data),
remote:   File "/usr/lib64/python2.7/json/__init__.py", line 244, in dumps
remote:     return _default_encoder.encode(obj)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 207, in encode
remote:     chunks = self.iterencode(o, _one_shot=True)
remote:   File "/usr/lib64/python2.7/json/encoder.py", line 270, in iterencode
remote:     return _iterencode(o, 0)
remote: UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 47260652: invalid continuation byte
To git+ssh://gcc.gnu.org/git/gcc.git
   3599ecb6a01..7d524a5de33  master -> master

If diffs are in mixed character sets (different .po file in different character sets, in this case) it might be necessary to omit some of them or inaccurately mark the character set of the email in its headers, but there should not be a UnicodeDecodeError and a commit email should always be sent.

I don't have a self-contained reproducer (so the GCC hook postprocessing the email might be involved, but since that hook is in Python 3 it's unlikely to be producing anything badly encoded itself).

@brobecke
Copy link
Member

Thank you for the report @jsm28 .

I'll look into it as soon as I can, but I wanted to say that I'm really busy right now, and I don't think I'll be able to find time to work on this for some weeks at least. Perhaps one way to work around this before this is fixed is to exploit the fact that you have a commit email reformatter hook. You could make catch those issues there by always returning all pieces of the email, diff included, after having doing the transformation to utf-8. That way, there is no decoding on the git-hooks side that could trigger than unwanted exception. Would that work?

@jsm28
Copy link
Author

jsm28 commented Feb 22, 2021

I don't think it's particularly important to have a workaround before then. This particular case of mixed-character-set diffs is unusual; even most .po updates wouldn't be affected (only two of the libcpp .po files are not in UTF-8; most .pot updates only affect gcc.pot not cpplib.pot and all the gcc .po files are in UTF-8, so typically it would be one commit per year that does a bulk .po update that involves mixed-character-set diffs).

@brobecke
Copy link
Member

Hi @jsm28 ,

I was able to reproduce the problem above using commit 7d524a5de33 in the GCC Git repository. The good news is that the issue will be gone once we upgrade the hooks on sourceware to the current HEAD. This will require us to use a Python 3.8+ interpreter, however, so I wanted to do some additional testing directly on sourceware before we plan the deployment.

The way things are supposed to work is that the hooks will now decode git's output and store the decoded output in strings, which with Python 3.x are now unicode strings. For the diff, I believe we decode the output line by line, so a commit touching multiple files with different encodings can be handled. As for the export (to the project's hooks or to the mailer), we encode everything into UTF-8. So the email's body should use a consistent encoding, and the Content-Type field should be providing the correct value for decoding of the email's body.

The limitation, however, is that the git-hooks only tries a limited number of encodings when decoding data: ASCII, UTF-8 and iso8859-15 (I could have dropped ASCII, but anyways). I felt that with UTF-8 and iso8859-15, we really were covered, but we can always add more if needed. Similarly, when trying to export, if some characters are not compatible with UTF-8, we're probably going to have problems. Same answer here: I really feel that UTF-8 is so prevalent that there is no need to worry about this limitation.

I'll try to come up with a small reproducer next, so as to make a regression testcase out of it.

@brobecke
Copy link
Member

For the record, this is what it looks like when I call the post-receive hook by hand while in debug mode, confirm that the email sending works, and that the charset is "utf-8":

$ echo 3599ecb6a0145a428def5314d2d67d2e5a88f3c4 7d524a5de33910d7df7033c793afbad42444d2f5 refs/heads/master | ./hooks/post-receive --submitter-email='Joel Brobecker <brobecker@adacore.com>'
DEBUG: Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64
From: Joel Brobecker <brobecker@adacore.com>
To: brobecker@adacore.com
Subject: [new-py3-gcc-public-bare] Update .po files.
X-Act-Checkin: new-py3-gcc-public-bare
X-Git-Author: Joseph Myers <joseph@codesourcery.com>
X-Git-Refname: refs/heads/master
X-Git-Oldrev: 3599ecb6a0145a428def5314d2d67d2e5a88f3c4
X-Git-Newrev: 7d524a5de33910d7df7033c793afbad42444d2f5

commit 7d524a5de33910d7df7033c793afbad42444d2f5
Author: Joseph Myers <joseph@codesourcery.com>
Date:   Fri Feb 19 18:23:36 2021 +0000

    Update .po files.
    
    gcc/po/
            * be.po, da.po, de.po, el.po, es.po, fi.po, fr.po, hr.po, id.po,
            ja.po, nl.po, ru.po, sr.po, sv.po, tr.po, uk.po, vi.po, zh_CN.po,
            zh_TW.po: Update.
    
    libcpp/po/
            * be.po, ca.po, da.po, de.po, el.po, eo.po, es.po, fi.po, fr.po,
            id.po, ja.po, nl.po, pt_BR.po, ru.po, sr.po, sv.po, tr.po, uk.po,
            vi.po, zh_CN.po, zh_TW.po: Update.

adacore-bot pushed a commit that referenced this issue Dec 1, 2021
Note that I verified that this testcase reproduces the issue
reported at #19
when using the legacy-python2 branch:

    | $ echo d408c32bba64a79fc60dbda8fa047524f353e7cd \
    |        28fbd651f8161bcba746f5dbcecf5ec757241c1d \
    |        refs/heads/master | \
    |        ./hooks/post-receive
    | [...]
    | UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in
    |   position 425: invalid continuation byte

This testcase allows us to confirm that this issue was resolved
during the transition to Python 3.x.

Change-Id: Ia94dc584577e3b4d145fad7b7f89429a1b031849
@brobecke
Copy link
Member

brobecke commented Dec 1, 2021

I added a regression testcase to confirm this is now fixed: See commit 7bc1c15.

@brobecke brobecke closed this as completed Dec 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants