text_plain returns all parts that are not text/html #52

phoerious · 2019-04-24T11:45:19Z

Describe the bug
MailParser.text_plain returns all parts that are not text/html.

To Reproduce

>>> import mailparser

>>> mail = mailparser.parse_from_bytes(b'''From: example@example.com
Subject: Test
Date: Wed, 24 Apr 2019 10:05:02 +0200 (CEST)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8544575414772382491=="
To: rcpt@example.com

--===============8544575414772382491==
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<!doctype html>
<title>Foo</title>
<meta charset="utf-8">

HTML here

--===============8544575414772382491==
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Disposition: inline

UE5HIGhlcmU=
--===============8544575414772382491==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Plaintext here.
--===============8544575414772382491==--
''')

>>> mail.text_html
['<!doctype html>\n<title>Foo</title>\n<meta charset="utf-8">\n\nHTML here']

>>> mail.text_plain
['PNG here', 'Plaintext here.']

Expected behavior
text_plain should only return parts with Content-Type text/plain.

Raw mail

From: example@example.com
Subject: Test
Date: Wed, 24 Apr 2019 10:05:02 +0200 (CEST)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8544575414772382491=="
To: rcpt@example.com

--===============8544575414772382491==
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<!doctype html>
<title>Foo</title>
<meta charset="utf-8">

HTML here

--===============8544575414772382491==
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Disposition: inline

UE5HIGhlcmU=
--===============8544575414772382491==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Plaintext here.
--===============8544575414772382491==--

Environment:

OS: Linux
Docker: no
mail-parser version 3.9.3

Additional context
It is impossible to sort out non-text parts (without heuristics), because everything is parsed into a list of strings and Content-Type information is thrown away.
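
As an illustration of what keeping the Content-Type would allow, here is a minimal sketch that walks the message with Python's standard `email` package instead of mail-parser. It is not mail-parser's API: `typed_parts` is a hypothetical helper, and the boundary is shortened to `BOUNDARY` for readability.

```python
from email import message_from_bytes

RAW = b"""From: example@example.com
Subject: Test
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"
To: rcpt@example.com

--BOUNDARY
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<p>HTML here</p>
--BOUNDARY
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Disposition: inline

UE5HIGhlcmU=
--BOUNDARY
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Plaintext here.
--BOUNDARY--
"""


def typed_parts(raw):
    """Return (content_type, decoded_bytes) for every non-multipart part."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/mixed container itself
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64)
        parts.append((part.get_content_type(), part.get_payload(decode=True)))
    return parts


parts = typed_parts(RAW)

# Because the type travels with the payload, filtering needs no heuristics.
plain_only = [
    body.decode("ascii").strip()
    for content_type, body in parts
    if content_type == "text/plain"
]
```

Comparing the full content type (`text/plain`) rather than only the subtype also avoids matching a `plain` subtype under a different main type.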


fedelemantuano · 2019-05-12T21:07:45Z

Hi @phoerious,

I know, the text part currently contains every part that is not HTML. I have to change this.

mail-parser was born as the SpamScope core, and I wanted to collect all text parts to look for phishing.
But now it is right to move on.

Give me some time and I will change this part.

Pixel-Jack · 2019-12-10T14:30:07Z

Hi @fedelemantuano,
The issue seems to come from mailparser.py, line 395:

if payload:
    if p.get_content_subtype() == 'html':
        self._text_html.append(payload)
    else:
        self._text_plain.append(payload)

It could be changed to:

if payload:
    if p.get_content_subtype() == 'html':
        self._text_html.append(payload)
    elif p.get_content_subtype() == 'plain':
        self._text_plain.append(payload)
    else:
        log.warning(f'Email content {p.get_content_subtype()} not handled')

fedelemantuano · 2020-01-12T14:41:35Z

Hi @phoerious and @Pixel-Jack, thanks for the support.
I handled this issue as in the previous snippet, and I added a field that holds all text that is not managed.
The body field still contains all text, as in previous versions.
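
The three-way split described here can be pictured with a small standalone sketch. It is not the code from the commit: `sort_payloads` and its `(subtype, payload)` input are assumptions made for illustration, and only the branching mirrors the snippet discussed in this thread.

```python
import logging

log = logging.getLogger(__name__)


def sort_payloads(parts):
    """Split (subtype, payload) pairs into html, plain and unmanaged lists.

    Hypothetical helper: empty payloads are skipped, and anything that is
    neither 'html' nor 'plain' is kept separately instead of being mixed
    into the plain-text list.
    """
    text_html, text_plain, text_not_managed = [], [], []
    for subtype, payload in parts:
        if not payload:
            continue
        if subtype == "html":
            text_html.append(payload)
        elif subtype == "plain":
            text_plain.append(payload)
        else:
            log.warning("Email content %s not handled", subtype)
            text_not_managed.append(payload)
    return text_html, text_plain, text_not_managed


html, plain, not_managed = sort_payloads([
    ("html", "<p>HTML here</p>"),
    ("png", "PNG here"),
    ("plain", "Plaintext here."),
    ("plain", ""),  # empty payload, ignored
])
```

With this split, the `image/png` payload from the report lands in the third list rather than next to `'Plaintext here.'`.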

fedelemantuano self-assigned this Apr 29, 2019

fedelemantuano added the needs_triage label Apr 29, 2019

fedelemantuano added enhancement and removed needs_triage labels May 12, 2019

fedelemantuano added a commit that referenced this issue Jan 12, 2020

Added mail.text_not_managed: issue #52

da0d706

fedelemantuano closed this as completed Jan 12, 2020

reupen mentioned this issue Jan 13, 2020

Bump mail-parser from 3.9.3 to 3.11.0 uktrade/data-hub-api#2453

Merged
