Unhandled exception when cleaning message with unicode/emoji in (From:) headers. #433

Open
Leftium opened this issue Aug 14, 2023 · 3 comments

Leftium commented Aug 14, 2023

Full steps to reproduce the issue:

  1. Back up an email whose message is not saved in UTF-8 and has Unicode/emoji in the From: header.
  2. Restore the email using --cleanup.

Expected outcome: GYB handles Unicode/emoji in headers gracefully, by either:

  • Detecting and reading non-UTF-8 messages with the appropriate encoding.
  • Skipping the message.

Actual outcome: GYB exits with unhandled exception:

Traceback (most recent call last):
  File "gyb.py", line 2767, in <module>
  File "gyb.py", line 2239, in main
  File "gyb.py", line 1947, in message_hygiene
  File "gyb.py", line 1891, in cleanup_from
  File "email\utils.py", line 215, in parseaddr
  File "email\_parseaddr.py", line 517, in __init__
  File "email\_parseaddr.py", line 260, in getaddrlist
TypeError: object of type 'Header' has no len()
[31420] Failed to execute script 'gyb' due to unhandled exception!

Work-around:

  • Convert offending .eml file to UTF8 format. Doesn't always work...
  • Rename .eml file so GYB skips this message.

Suggested alternative fix: always convert non UTF8 files to UTF8 when saving backup.

Notes:

  • The offending email is restored without error if --cleanup is not used. (Did not confirm if text was mangled after restore.)
  • The .eml file was generated by gyb --action backup.
  • Vim tries to open the file with latin1 encoding, but the text is mangled.
  • Notepad.exe tries to open the file with UTF8 encoding, but the text is mangled.
  • The Gmail 'Download Original' file does not work: text is still mangled.
  • Instead, the Gmail 'View Original' text had to be manually copied and saved as a text file (with encoding UTF-16 LE).
  • (Creating a blank UTF8 file and pasting Gmail 'View Original' text seems to work, too)
  • The problematic text is in Korean.
  • I was able to create a minimal repro of this issue in the python REPL:
Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import email
>>> f = open('2021/8/14/17b437668e8b5c17.eml', 'rb')
>>> bytes = f.read()
>>> m = email.message_from_bytes(bytes)
>>> m['to']
'J***********y<j***@l*****m.com>'
>>> m['from']
<email.header.Header object at 0x000002B33DED8410>
>>> len(m['from'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'Header' has no len()
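
The same failure can be reproduced without the original .eml file. The sketch below builds a hypothetical message (the address and display name are made up for illustration) whose From: header carries raw UTF-8 bytes instead of an RFC 2047 encoded word. Under Python's default `compat32` policy, such a header comes back as an `email.header.Header` object rather than a `str`, and `Header` does not support `len()`.

```python
import email
from email.header import Header

# Hypothetical message: raw (unencoded) UTF-8 bytes in From:, plain ASCII in To:.
raw = (
    'From: "주" <help@example.com>\n'
    'To: someone@example.com\n'
    '\n'
    'body\n'
).encode('utf-8')

msg = email.message_from_bytes(raw)  # default policy is compat32

to_value = msg['to']      # pure ASCII header: returned as str
from_value = msg['from']  # raw 8-bit header: returned as a Header object

print(type(to_value).__name__, type(from_value).__name__)

try:
    len(from_value)
    len_failed = False
except TypeError as error:
    len_failed = True
    print(f'TypeError: {error}')
```

The To: header in the REPL session above behaves like `to_value` here, which is why only the From: header triggers the exception.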

Mangled text:

From: "(주)한웰�쇼핑"<help@daisomall.co.kr>
To: J***********y<j***@l*****m.com>
Subject: [´ÙÀ̼Ҹô] °³ÀÎÁ¤º¸ À¯È¿±â°£Á¦¿¡ µû¸¥ ÈÞ¸é°èÁ¤ Àüȯ ¾È³»µå¸³´Ï´Ù.

Proper text:

From: "(주)한웰이쇼핑" <help@daisomall.co.kr>
To: "J***********y" <j***@l*****m.com>
Subject: [다이소몰] 개인정보 유효기간제에 따른 휴면계정 전환 안내드립니다.
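
The mangled Subject above looks like EUC-KR bytes displayed as Latin-1. This is an inference from the characters shown, not something confirmed from the file itself. A minimal sketch under that assumption:

```python
original = '다이소몰'

# Assumption: the message bytes are EUC-KR and the viewer decodes them as Latin-1.
mangled = original.encode('euc-kr').decode('latin-1')
print(mangled)

# Reversing the wrong decode recovers the text, as long as no bytes were lost.
recovered = mangled.encode('latin-1').decode('euc-kr')
print(recovered)
```

If this holds for the real file, re-saving it as UTF-8 from an editor that guessed Latin-1 would bake in the mangled characters rather than repair them.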

Leftium commented Aug 14, 2023

update: This issue isn't limited to non-UTF8 files.

Some UTF8 encoded files also throw this exception. For example, if the From header has emoji:

From:🔥Keto_Rapid_Diet🔥 <xafnsbqsmgniwdztev@twhzbt.drivefact.org>

There were also more emails from the Korean address (From: "(주)한웰이쇼핑" <help@daisomall.co.kr>) that failed to restore even after converting the .eml file to UTF8 and ensuring there were no mangled characters.

The best work-around seems to be to rename these .eml files so gyb skips them.

Leftium changed the title from "Unhandled exception when cleaning non UTF8 message during restore." to "Unhandled exception when cleaning message with unicode/emoji in (From:) headers." on Aug 14, 2023

Leftium commented Aug 14, 2023

I modified my gyb.py to catch these exceptions, printing the problem message info and continuing with the remaining messages:

  if options.cleanup:
      try:
          full_message = message_hygiene(full_message)
      except TypeError as error:
          print(
              f'WARNING! error cleaning message {message_num} ({message_filename})')
          print(f'  {error}')
          print(f'  this message will be skipped.')
          continue

Compare to original code.
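
An alternative to skipping the message is to coerce the header to `str` before parsing it. The sketch below is not GYB's code: `safe_parseaddr` is a hypothetical helper and the message is made up. It relies on `str()` of a `Header` object, which replaces raw 8-bit bytes of unknown charset with U+FFFD, so the display name is lost but the address still parses.

```python
import email
from email.utils import parseaddr


def safe_parseaddr(value):
    """Hypothetical helper: parse an address header that may be a Header object."""
    if value is None:
        return ('', '')
    # str() turns a Header into text; raw 8-bit bytes become U+FFFD.
    return parseaddr(str(value))


raw = 'From: "주" <help@example.com>\n\nbody\n'.encode('utf-8')
msg = email.message_from_bytes(raw)  # default compat32 policy

name, address = safe_parseaddr(msg['from'])
print(address)
```

This keeps the message in the restore at the cost of a degraded display name, whereas the try/except above drops the message entirely.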


Leftium commented Aug 14, 2023

Got the fix on StackOverflow: policy=email.policy.SMTPUTF8

I confirmed the Korean text was restored without mangling, but the emoji ended up mangled, perhaps because the emoji From name was not wrapped in quotes. Not a big deal, since the emoji was from a spam email.

def message_hygiene(msg):
    '''Ensure Message-Id, Date and From headers are valid. Replace if not.'''
    omsg = email.message_from_bytes(msg, policy=email.policy.SMTPUTF8)
    orig_id = omsg['message-id']
    orig_date = omsg['date']
    orig_from = omsg['from']
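
To illustrate why the policy change helps, here is a small standalone sketch (the message is hypothetical, and this is not the full `message_hygiene` function). With `email.policy.SMTPUTF8`, header values come back as `str` subclasses, so `len()` works, and raw UTF-8 bytes are decoded. The explicit `import email.policy` is a precaution, on the assumption that `import email` alone may not load the submodule.

```python
import email
import email.policy

# Hypothetical message with raw UTF-8 bytes in the From: header.
raw = 'From: "주" <help@example.com>\n\nbody\n'.encode('utf-8')

msg = email.message_from_bytes(raw, policy=email.policy.SMTPUTF8)
from_value = msg['from']

# The header is now a str subclass, so len() and parseaddr-style code can use it.
print(isinstance(from_value, str), len(from_value))
print(from_value)
```

Header bytes that are not valid UTF-8 (for example, a file still in EUC-KR) may still come out as replacement characters under this policy.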
