Handle messages as bytes internally in order to support multiple encoding types #48

Closed
jishac opened this issue Feb 5, 2021 · 8 comments

Comments

@jishac
Contributor

jishac commented Feb 5, 2021

Feature Request

It would be great if OfflineIMAP could leverage the built-in email libraries to store messages as byte arrays and avoid unnecessary conversions to strings. Currently, OfflineIMAP reads the input file as text and then does a hard conversion to bytes, assuming UTF-8 encoding, when interfacing with the IMAP server. This results in exceptions when anything other than plain ASCII or UTF-8 is present in the email (see bugs #43 and #44). Furthermore, an email can contain multiple encodings (see the attached test mail), so a solution that searches for a single encoding would be problematic.

The built-in library can help keep track of the multiple encodings if needed and do conversions based on the capabilities of the IMAP server. I imagine it would even simplify the \r\n and \n conversions sprinkled through the code.
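To illustrate the idea, here is a minimal sketch (not the attached script and not OfflineIMAP's actual code) that parses a message straight from bytes with Python's standard `email` package, adds a header, and writes it back out as bytes. The sample message and the header value are made up for the example; the body is ISO-8859-1 with an 8bit transfer encoding, so a blanket UTF-8 decode of the raw bytes would fail.

```python
from email import message_from_bytes, policy

# Hypothetical sample: an ISO-8859-1 body sent as raw 8-bit bytes.
# Calling raw.decode("utf-8") on this would raise UnicodeDecodeError.
raw = (
    b"From: a@example.com\n"
    b"Subject: Test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=iso-8859-1\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
    b"This is \xe4, \xf6, \xfc\n"
)

# Parse from bytes; compat32 keeps the message as close to the input as possible.
msg = message_from_bytes(raw, policy=policy.compat32)

# Add a header without touching the body (the value is a placeholder).
msg["X-OfflineIMAP"] = "1234567890-0987654321"

# Serialize back to bytes; the 8-bit body bytes are written out unchanged.
out = msg.as_bytes()
```

Because the message never passes through a single `str` decode, each part can keep whatever charset it declares.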

I have attached a test email and simple python script (both with txt extensions) that I used as a simple proof of concept.

Test Message containing several different encodings (mbox format for easy viewing in a mail client)

testmail-mime.txt

Simple Python3 script that attempts to keep the message intact while still adding the X-OfflineIMAP header

py3.email-copy.txt

@jishac
Contributor Author

jishac commented Feb 9, 2021

I have started a possible patch that makes the changes I was trying to describe, using the built-in email library to simplify processing the messages (adding, searching for, and deleting headers, etc.). It also handles the conversion between line break types when outputting the message, removing the need to mangle it for certain functions. The changes affect what is passed to some of the function calls, so it will require some testing, and I have yet to stand up a dummy email server to test with. Furthermore, I have not yet made the changes to the GMAIL classes, and testing those changes will be even more difficult for me.
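As a rough sketch of what that looks like with the standard `email` package (this is an illustration with a made-up message, not code from the patch): headers can be looked up and deleted through the message object, and the line break style is chosen at output time by cloning the policy with a different `linesep`.

```python
from email import message_from_bytes, policy

# Hypothetical message stored locally with LF line endings.
raw = (
    b"From: a@example.com\n"
    b"X-OfflineIMAP: 111-222\n"
    b"Subject: hi\n"
    b"\n"
    b"line one\n"
    b"line two\n"
)
msg = message_from_bytes(raw, policy=policy.compat32)

# Search for a header (returns None when it is absent).
uid_header = msg.get("X-OfflineIMAP")

# Delete every occurrence of the header.
del msg["X-OfflineIMAP"]

# Pick the line ending when serializing instead of rewriting the text by hand.
for_imap = msg.as_bytes(policy=msg.policy.clone(linesep="\r\n"))   # CRLF for the server
for_local = msg.as_bytes(policy=msg.policy.clone(linesep="\n"))    # LF for local storage
```

The names `for_imap` and `for_local` are only labels for the two outputs; which form each backend needs is up to the calling code.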

A fork of offlineimap3 with the changes so far can be found here: https://github.com/jishac/offlineimap3/tree/multiple_encoding_support

jishac added a commit to jishac/offlineimap3 that referenced this issue Feb 24, 2021
…s well and I reviewed the code several times. However, I cannot test it, testers wanted!

This commit: Minor bug fixes from testing

Should finalize implementation of enhancement OfflineIMAP#48

And fix issues OfflineIMAP#43 and OfflineIMAP#44

Signed-off-by: Joseph Ishac <jishac@nasa.gov>
Tested-by: Joseph Ishac <jishac@nasa.gov>
@jishac jishac mentioned this issue Feb 24, 2021
5 tasks
@sudipm-mukherjee
Contributor

Thanks for this @jishac. Your PR is definitely a huge improvement and this almost works. But I am seeing one problem in gmail. Not sure about other mail servers.
I sent this message using mutt with send_charset as us-ascii:iso-8859-1:utf-8:
Subject - Test
Message body:

This is
ä, ö, ü

à è á

This message is fetched properly by offlineimap3 and if I open using vim I can see all the characters properly and it also gives me Content-Type: text/plain; charset=iso-8859-1

But if I reply to that above mail from Gmail web interface and the reply is:
Subject - Re: test - ä, ö, ü
Message body:

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> ä, ö, ü
>
> à è á
>
>

Then it is not fetched properly and if I see the fetched message using vim I get:

Subject: =?UTF-8?B?UmU6IHRlc3QgdW0gLSDDpCwgw7YsIMO8?=
To: XXXXX (removed on purpose)
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Keywords: \Important,\Inbox

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> =C3=A4, =C3=B6, =C3=BC
>
> =C3=A0 =C3=A8 =C3=A1
>
>

Tried sending the same reply message using mutt and the message body is correct but the subject is:
Subject: test - =?iso-8859-1?B?5Cwg9iwg/A==?=

But all the messages are displayed properly in Gmail interface.
That means if I am an offlineimap3 user and send a mail containing German umlauts or Spanish diéresis using mutt, and the recipient replies using the Gmail web interface, then I will not be able to read the reply after it is fetched using offlineimap3.

@sudipm-mukherjee
Contributor

I just checked with old offlineimap (python 2) and it had the same behaviour. So, I will say the regression is now fixed and the problem I mentioned can be a feature request.

@jishac
Contributor Author

jishac commented Feb 26, 2021

@sudipm-mukherjee I'm not sure I understand the concern/issue. The content you posted:

Subject: =?UTF-8?B?UmU6IHRlc3QgdW0gLSDDpCwgw7YsIMO8?=
To: XXXXX (removed on purpose)
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Keywords: \Important,\Inbox

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> =C3=A4, =C3=B6, =C3=BC
>
> =C3=A0 =C3=A8 =C3=A1
>
>

I imagine if you view the source or original in gmail, you will probably see the same text verbatim. That source is valid and would open correctly in mutt.
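For anyone who wants to check this without a mail client, here is a small sketch using Python's standard `email` package on a trimmed copy of the body quoted above. It only shows that the quoted-printable text decodes to the expected characters; it is not part of OfflineIMAP.

```python
from email import message_from_bytes

# Trimmed copy of the fetched message: UTF-8 text, quoted-printable encoded.
raw = (
    b'Content-Type: text/plain; charset="UTF-8"\n'
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"\n"
    b"> This is\n"
    b"> =C3=A4, =C3=B6, =C3=BC\n"
    b">\n"
    b"> =C3=A0 =C3=A8 =C3=A1\n"
)
msg = message_from_bytes(raw)

# decode=True undoes the Content-Transfer-Encoding and returns bytes;
# the declared charset then turns those bytes into text.
body_bytes = msg.get_payload(decode=True)
text = body_bytes.decode(msg.get_content_charset())
print(text)
```

A plain text editor such as vim shows the stored, still-encoded form, which is why the `=C3=A4` sequences are visible there.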

This is what I get plugging the above into mutt:

From nobody Thu Feb 25 20:27:53 2021
To: removed on purpose <XXXXX>
Subject: Re: test um - ä, ö, ü

On Thu, Feb 25, 2021 at 7:26 PM Sudip Mukherjee
<sudipm.mukherjee@gmail.com> wrote:
>
> This is
> ä, ö, ü
>
> à è á
>
>

As for the difference in subject lines, both are valid encoded words in the format =?charset?encoding?encoded-text?= (originally defined in RFC 1522, now RFC 2047). These encodings can change as the email traverses the network. I have often seen that when intermediaries add bracket tags (e.g. [list-name] or [SPAM]), but I have noticed manipulation even on non-mangled headers. I have even seen the encoding type of the message body altered.
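A short sketch with the standard library's `email.header` helpers shows that the two subject lines from this thread decode to readable text even though they use different charsets. The variable names are just labels for this example.

```python
from email.header import decode_header, make_header

# The two Subject values quoted earlier in this thread.
gmail_subject = "=?UTF-8?B?UmU6IHRlc3QgdW0gLSDDpCwgw7YsIMO8?="
mutt_subject = "test - =?iso-8859-1?B?5Cwg9iwg/A==?="

# decode_header splits a header into (bytes, charset) chunks;
# make_header joins them back into a single readable string.
gmail_text = str(make_header(decode_header(gmail_subject)))
mutt_text = str(make_header(decode_header(mutt_subject)))

print(gmail_text)
print(mutt_text)
```

The first value is base64 over UTF-8 and the second is base64 over ISO-8859-1, yet both carry the same "ä, ö, ü" characters.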

@sudipm-mukherjee
Contributor

Thanks @jishac. I am not seeing any issue if I open those mails using mutt, I only got the problem as I tried to open them using vim instead of mutt. So, no issues or concern. Please ignore what I said.
And with that I can confirm it works with my gmail.

@thekix
Member

thekix commented Aug 9, 2021

Hello!!

@sudipm-mukherjee could we close this issue too?

Thanks a lot!
Regards,
kix

@sudipm-mukherjee
Contributor

@thekix I think so. And this is already merged so should be ok to close. @jishac ?

@jishac
Contributor Author

jishac commented Aug 9, 2021

Correct, this has been completed/closed.

@jishac jishac closed this as completed Aug 9, 2021