text_plain returns all parts that are not text/html #52

Closed
phoerious opened this issue Apr 24, 2019 · 3 comments
@phoerious

Describe the bug
MailParser.text_plain returns all parts that are not text/html.

To Reproduce

>>> import mailparser

>>> mail = mailparser.parse_from_bytes(b'''From: example@example.com
Subject: Test
Date: Wed, 24 Apr 2019 10:05:02 +0200 (CEST)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8544575414772382491=="
To: rcpt@example.com

--===============8544575414772382491==
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<!doctype html>
<title>Foo</title>
<meta charset="utf-8">

HTML here

--===============8544575414772382491==
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Disposition: inline

UE5HIGhlcmU=
--===============8544575414772382491==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Plaintext here.
--===============8544575414772382491==--
''')

>>> mail.text_html
['<!doctype html>\n<title>Foo</title>\n<meta charset="utf-8">\n\nHTML here']

>>> mail.text_plain
['PNG here', 'Plaintext here.']
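
The 'PNG here' entry is not plain text from the mail: it is the decoded body of the image/png part. A quick check with the standard library (independent of mail-parser) shows that the base64 payload of that part decodes to exactly this string:

```python
import base64

# Base64 payload of the image/png part in the test message above.
encoded = "UE5HIGhlcmU="

decoded = base64.b64decode(encoded)
print(decoded)  # b'PNG here'
```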

Expected behavior
text_plain should only return parts with Content-Type text/plain.

Raw mail

From: example@example.com
Subject: Test
Date: Wed, 24 Apr 2019 10:05:02 +0200 (CEST)
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============8544575414772382491=="
To: rcpt@example.com

--===============8544575414772382491==
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

<!doctype html>
<title>Foo</title>
<meta charset="utf-8">

HTML here

--===============8544575414772382491==
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Disposition: inline

UE5HIGhlcmU=
--===============8544575414772382491==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Plaintext here.
--===============8544575414772382491==--

Environment:

  • OS: Linux
  • Docker: no
  • mail-parser version 3.9.3

Additional context
It is impossible to sort out non-text parts (without heuristics), because everything is parsed into a list of strings and Content-Type information is thrown away.
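
For comparison, here is a minimal sketch of what keeping the Content-Type next to each payload could look like. It uses only the standard library `email` package, not mail-parser, and the names `Part` and `typed_parts` as well as the shortened message are made up for illustration:

```python
import email
from typing import List, NamedTuple

# Shortened version of the test message (boundary simplified for readability).
RAW = (
    b"From: example@example.com\n"
    b"Subject: Test\n"
    b"Mime-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\n'
    b"\n"
    b"--XYZ\n"
    b"Content-Type: text/html; charset=UTF-8\n"
    b"\n"
    b"<p>HTML here</p>\n"
    b"--XYZ\n"
    b"Content-Type: image/png\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    b"UE5HIGhlcmU=\n"
    b"--XYZ\n"
    b'Content-Type: text/plain; charset="us-ascii"\n'
    b"\n"
    b"Plaintext here.\n"
    b"--XYZ--\n"
)


class Part(NamedTuple):
    """Hypothetical container: one leaf MIME part plus its Content-Type."""
    content_type: str
    payload: bytes


def typed_parts(raw: bytes) -> List[Part]:
    """Return every non-multipart part together with its Content-Type."""
    msg = email.message_from_bytes(raw)
    return [
        Part(p.get_content_type(), p.get_payload(decode=True) or b"")
        for p in msg.walk()
        if not p.is_multipart()
    ]


parts = typed_parts(RAW)

# With the type preserved, selecting text/plain needs no heuristics.
# The sample part is us-ascii, so decoding as ASCII is safe here.
plain = [p.payload.decode("ascii") for p in parts if p.content_type == "text/plain"]

print([p.content_type for p in parts])
print(plain)
```

Returning pairs instead of bare strings would let callers filter on the type themselves, at the cost of changing the shape of the result.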

@fedelemantuano
Contributor

Hi @phoerious,

I know, the text part currently contains all the non-HTML parts. I have to change this.

mail-parser was born as the SpamScope core, and I wanted to collect all text parts in order to look for phishing.
But now it's right to move on from that.

Give me some time and I will change this part.

@Pixel-Jack

Hi @fedelemantuano,
The issue seems to come from mailparser.py, line 395:

if payload:
    if p.get_content_subtype() == 'html':
        self._text_html.append(payload)
    else:
        self._text_plain.append(payload)

It could be changed to:

if payload:
    if p.get_content_subtype() == 'html':
        self._text_html.append(payload)
    elif p.get_content_subtype() == 'plain':
        self._text_plain.append(payload)
    else:
        # log.warn is a deprecated alias; log.warning is the supported name
        log.warning(f'Email content {p.get_content_subtype()} not handled')
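
To try the idea outside of mail-parser, here is a self-contained sketch of the same three-way dispatch, built on the standard library `email` package. The function name `split_by_subtype` and the third `other` list are assumptions for illustration and are not part of mail-parser:

```python
import logging
from email.message import EmailMessage

log = logging.getLogger(__name__)


def split_by_subtype(msg):
    """Sort leaf parts into (html, plain, other) lists by content subtype."""
    html, plain, other = [], [], []
    for p in msg.walk():
        if p.is_multipart():
            continue
        payload = p.get_payload(decode=True)
        if not payload:
            continue
        # Fall back to UTF-8 when the part declares no charset (e.g. images).
        charset = p.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        subtype = p.get_content_subtype()
        if subtype == "html":
            html.append(text)
        elif subtype == "plain":
            plain.append(text)
        else:
            # Hypothetical extra bucket so unhandled parts are not lost.
            log.warning("Email content %s not handled", subtype)
            other.append(text)
    return html, plain, other


# Build a message similar to the one in the report.
msg = EmailMessage()
msg["From"] = "example@example.com"
msg["To"] = "rcpt@example.com"
msg["Subject"] = "Test"
msg.set_content("Plaintext here.")
msg.add_alternative("<p>HTML here</p>", subtype="html")
msg.add_attachment(b"PNG here", maintype="image", subtype="png")

html, plain, other = split_by_subtype(msg)
print(html, plain, other)
```

Checking only the subtype mirrors the snippet above. A stricter variant could also compare `get_content_maintype()` with `'text'`, so that a type such as `application/html` is not treated as HTML.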

@fedelemantuano
Contributor

Hi @phoerious and @Pixel-Jack, thanks for the support.
I handled this issue as in the previous snippet, and I added a field that holds all the text parts that are not managed.
The body field still contains all the text, as in the previous version.
