Handle messages as bytes internally in order to support multiple encoding types #48

jishac · 2021-02-05T17:23:15Z

Feature Request

It would be great if OfflineIMAP could leverage Python's built-in email library to store messages as byte arrays and avoid unnecessary conversions to strings. Currently, OfflineIMAP reads the input file as text and then does a hard conversion to bytes, assuming UTF-8 encoding, when interfacing with the IMAP server. This results in exceptions when anything other than plain ASCII or UTF-8 is present in the email (see bugs #43 and #44). Furthermore, an email can contain multiple encodings (see the attached test mail), so a solution that searches for a single encoding would be problematic.

The built-in library can help keep track of the multiple encodings if needed and do conversions based on the capabilities of the IMAP server. I imagine it would even simplify the \r\n and \n conversions sprinkled through the code.

I have attached a test email and simple python script (both with txt extensions) that I used as a simple proof of concept.

Test Message containing several different encodings (mbox format for easy viewing in a mail client)

testmail-mime.txt

Simple Python3 script that attempts to keep the message intact while still adding the X-OfflineIMAP header

py3.email-copy.txt
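
For readers without the attachments, the idea in the request can be sketched roughly as follows. This is not the attached script (its contents are not reproduced here), only a minimal illustration that assumes the standard library's `email` package with the `compat32` policy. The function name `add_header_bytes` and the header value used in the usage note are hypothetical.

```python
import email
from email import policy


def add_header_bytes(raw: bytes, name: str, value: str,
                     linesep: str = "\r\n") -> bytes:
    """Add one header to a message held as bytes, never decoding the body.

    Hypothetical helper for illustration; not OfflineIMAP's actual code.
    """
    # Parse straight from bytes. Non-ASCII bytes in the body are carried
    # through internally instead of being decoded with a guessed charset.
    msg = email.message_from_bytes(raw, policy=policy.compat32)

    # Header manipulation happens on the message object, not on raw text.
    msg[name] = value

    # Serialize back to bytes, letting the library emit the line endings
    # the consumer expects (CRLF for IMAP, LF for local storage).
    return msg.as_bytes(policy=msg.policy.clone(linesep=linesep))
```

Called on a message whose body holds raw iso-8859-1 bytes, for example `add_header_bytes(raw, "X-OfflineIMAP", "123-456")`, this should return the same body bytes with the extra header and CRLF line endings, without any intermediate UTF-8 assumption.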

jishac · 2021-02-09T20:17:37Z

I have started a possible patch that makes the changes I was trying to describe, using the built-in email library to simplify processing the messages (adding, searching for, and deleting headers, etc.). It also handles the conversion between line break types when outputting the message, removing the need to mangle it for certain functions. The changes affect what is passed to some of the function calls, so they will require some testing, and I have yet to stand up a dummy email server to test with. Furthermore, I have not yet made the changes to the Gmail classes, and testing those changes will be more difficult for me.

A fork of offlineimap3 with the changes so far can be found here: https://github.com/jishac/offlineimap3/tree/multiple_encoding_support

Should finalize implementation of enhancement OfflineIMAP#48 and fix issues OfflineIMAP#43 and OfflineIMAP#44.

…s well and I reviewed the code several times. However, I cannot test it, testers wanted! This commit: Minor bug fixes from testing. Should finalize implementation of enhancement OfflineIMAP#48 and fix issues OfflineIMAP#43 and OfflineIMAP#44. Signed-off-by: Joseph Ishac <jishac@nasa.gov> Tested-by: Joseph Ishac <jishac@nasa.gov>

sudipm-mukherjee · 2021-02-25T23:20:29Z

Thanks for this @jishac. Your PR is definitely a huge improvement and this almost works. But I am seeing one problem in gmail. Not sure about other mail servers.
I sent this message using mutt with send_charset as us-ascii:iso-8859-1:utf-8:
Subject - Test
Message body:

This is
ä, ö, ü

à è á

This message is fetched properly by offlineimap3. If I open it in vim, I can see all the characters properly, and the header shows Content-Type: text/plain; charset=iso-8859-1

But if I reply to the above mail from the Gmail web interface, with this reply:
Subject - Re: test - ä, ö, ü
Message body:

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> ä, ö, ü
>
> à è á
>
>

Then it is not fetched properly, and when I view the fetched message in vim I get:

Subject: =?UTF-8?B?UmU6IHRlc3QgdW0gLSDDpCwgw7YsIMO8?=
To: XXXXX (removed on purpose)
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Keywords: \Important,\Inbox

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> =C3=A4, =C3=B6, =C3=BC
>
> =C3=A0 =C3=A8 =C3=A1
>
>

I tried sending the same reply using mutt. The message body is correct, but the subject is:
Subject: test - =?iso-8859-1?B?5Cwg9iwg/A==?=

But all the messages are displayed properly in the Gmail interface.
That means that if I am an offlineimap3 user and send a mail containing German umlauts or Spanish diéresis using mutt, and the recipient replies using the Gmail web interface, then I will not be able to read the reply after it's fetched using offlineimap3.

sudipm-mukherjee · 2021-02-25T23:27:02Z

I just checked with old offlineimap (python 2) and it had the same behaviour. So, I will say the regression is now fixed and the problem I mentioned can be a feature request.

jishac · 2021-02-26T01:48:00Z

@sudipm-mukherjee I'm not sure I understand the concern/issue. The content you posted:

Subject: =?UTF-8?B?UmU6IHRlc3QgdW0gLSDDpCwgw7YsIMO8?=
To: XXXXX (removed on purpose)
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Keywords: \Important,\Inbox

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> =C3=A4, =C3=B6, =C3=BC
>
> =C3=A0 =C3=A8 =C3=A1
>
>

I imagine if you view the source or original in gmail, you will probably see the same text verbatim. That source is valid and would open correctly in mutt.

This is what I get plugging the above into mutt:

From nobody Thu Feb 25 20:27:53 2021
To: removed on purpose <XXXXX>
Subject: Re: test um - ä, ö, ü

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> ä, ö, ü
>
> à è á
>
>

As for the difference in subject lines, both are valid encoded words as defined by RFC 2047 (which obsoleted RFC 1522), in the format =?charset?encoding?encoded-text?=. These encodings can change as the email traverses the network. I have often seen that when intermediaries add bracket tags (e.g. [list-name] or [SPAM]), but I have noticed manipulation even on non-mangled headers. I have even seen the encoding type of the message body altered.
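
To make the point concrete, here is a small sketch of how the two subject lines and a quoted-printable body line from this thread can be decoded with Python's standard library. A mail client such as mutt does the equivalent, while a plain text editor such as vim shows the raw encoded form. The helper names `decode_subject` and `decode_qp_line` are made up for illustration.

```python
import quopri
from email.header import decode_header, make_header


def decode_subject(raw_subject: str) -> str:
    """Decode =?charset?encoding?encoded-text?= words in a header value."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header reassembles them into a single readable string.
    return str(make_header(decode_header(raw_subject)))


def decode_qp_line(line: bytes, charset: str = "utf-8") -> str:
    """Decode one quoted-printable body line using the declared charset."""
    return quopri.decodestring(line).decode(charset)


# Both subjects from the thread, one base64-encoded UTF-8 and one
# base64-encoded iso-8859-1, decode to readable text.
print(decode_subject("=?UTF-8?B?UmU6IHRlc3QgdW0gLSDDpCwgw7YsIMO8?="))
print(decode_subject("test - =?iso-8859-1?B?5Cwg9iwg/A==?="))

# A body line as it is stored on disk, and as a mail client displays it.
print(decode_qp_line(b"> =C3=A4, =C3=B6, =C3=BC"))
```

The stored bytes are therefore intact in both cases. Only the on-the-wire representation differs between the Gmail and mutt replies.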

sudipm-mukherjee · 2021-02-26T10:51:12Z

Thanks @jishac. I am not seeing any issue if I open those mails using mutt, I only got the problem as I tried to open them using vim instead of mutt. So, no issues or concern. Please ignore what I said.
And with that I can confirm it works with my gmail.

thekix · 2021-08-09T08:45:09Z

Hello!!

@sudipm-mukherjee could we close this issue too?

Thanks a lot!
Regards,
kix

sudipm-mukherjee · 2021-08-09T09:24:49Z

@thekix I think so. And this is already merged so should be ok to close. @jishac ?

jishac · 2021-08-09T13:15:42Z

Correct, this has been completed/closed.

jishac mentioned this issue Feb 19, 2021

crash when uploading locally created no-utf-8 encoded message #43

Closed

jishac mentioned this issue Feb 24, 2021

Multiple encoding support #56

Merged

5 tasks

jishac closed this as completed Aug 9, 2021
