Enhancements #84

nitishkansal · 2021-02-12T18:39:11Z

A few enhancements to the regexes used to parse header values.
Get the filename with the Python email module instead of deriving it from the Content-ID or Content-Disposition, because some email services don't follow the standard and the attachment part has no Content-ID at all.
Stopped some noise in the error logging: the error was already handled but still cluttered the logs. Ideally there should be a logger setting that suppresses specific log levels.
Changed how attachments are identified: some email services send the HTML and plain-text parts with a Content-ID, so they were treated as attachments when they should be treated as the email body.
Changed how the payload is handled. When get_payload() decodes a part whose Content-Transfer-Encoding is 7bit, 8bit, or missing, it encodes the text with raw-unicode-escape, which leaves the message unreadable. So after calling get_payload() with decode, we check whether the Content-Transfer-Encoding was one of those cases and, if so, decode the result with raw-unicode-escape to recover the message as it was sent before passing it to ported_string().
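
To illustrate the filename and attachment points above, here is a minimal sketch using only the standard library. The sample message and the `is_attachment` rule are assumptions made for illustration, not the code from this PR: a leaf part counts as an attachment only when `get_filename()` returns a name or the part carries an explicit `attachment` disposition, so a body part that merely has a Content-ID stays a body.

```python
import email

# Hypothetical sample: an HTML body that carries a Content-ID, plus a PDF
# whose name is given only in Content-Type (no Content-Disposition, no Content-ID).
RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-ID: <body-part@example.invalid>\n"
    "\n"
    "<p>Hello</p>\n"
    "--b\n"
    'Content-Type: application/pdf; name="report.pdf"\n'
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "JVBERi0=\n"
    "--b--\n"
)


def is_attachment(part):
    """Illustrative rule (an assumption, not the library's actual logic)."""
    if part.is_multipart():
        return False
    # get_filename() reads the Content-Disposition filename and falls back
    # to the Content-Type name parameter.
    if part.get_filename():
        return True
    return part.get_content_disposition() == "attachment"


msg = email.message_from_string(RAW)
leaves = [p for p in msg.walk() if not p.is_multipart()]
attachments = [p.get_filename() for p in leaves if is_attachment(p)]
bodies = [p.get_content_type() for p in leaves if not is_attachment(p)]
print(attachments, bodies)
```

Note that the presence of a Content-ID plays no role in the decision here, which is the distinction the description is after.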

We also checked the test suite included in the library, but it only checks that email parsing works, not that the encoding is preserved. So the test suite passed while emails were still left with gibberish characters.
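
A check of the kind the test suite lacks could look roughly like the sketch below. It assumes a message parsed from a `str` with `email.message_from_string`; `readable_text` is a hypothetical helper written for this example, not the function added in the PR. It reproduces the escaped payload and then asserts that the original text survives the round trip.

```python
import email

BODY = "Price: 10 \u20ac\n"  # contains a non-ASCII character (euro sign)
RAW = (
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
) + BODY


def readable_text(part):
    """Hypothetical helper: undo the raw-unicode-escape fallback for 7bit, 8bit, or missing CTE."""
    data = part.get_payload(decode=True)
    cte = str(part.get("Content-Transfer-Encoding", "")).strip().lower()
    if cte in ("", "7bit", "8bit"):
        try:
            return data.decode("raw-unicode-escape")
        except UnicodeDecodeError:
            pass  # fall back to the declared charset below
    charset = part.get_content_charset() or "utf-8"
    return data.decode(charset, errors="replace")


msg = email.message_from_string(RAW)
escaped = msg.get_payload(decode=True)
print(escaped)  # the euro sign arrives as a literal backslash-u escape

# Encoding check: the text must come back exactly as it was sent.
assert readable_text(msg) == BODY
```

This sketch only covers the string-parsed case. A message parsed from bytes can return the original 8bit bytes from `get_payload(decode=True)`, and decoding those with raw-unicode-escape would not be appropriate.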

PS: We have been using this library for almost 8-9 months now and it has been great to use. We were still facing encoding issues, though, so we investigated and made some changes to the library. We have been running with these updates for almost 6-7 months, and our encoding complaints are down by 95%-96%. The main problems were with Unicode characters and different Content-Type values, and they are mostly resolved now.

The-Hidden-Hand · 2021-02-13T22:01:09Z

@fedelemantuano
Hello, are you accepting pull requests?

fedelemantuano · 2021-02-15T17:08:37Z

The PR looks good. I'm reviewing it now.

nitishkansal added 12 commits June 26, 2020 17:32

Use Default python email module function to retrieve filename instead…

c7418bc

… of getting it from content id or content disposition as some of the emails are breaking because they send body with content-id

Regex update, Failing String: from ::1 port=44088 helo=mail.domain.co…

b214713

…m by mail.domain.com with esmtp envelope-from <support@domain.com> id 1jt7Nz-0000Da-by for xyz@domain.com; Wed, 08 Jul 2020 10:33:11 +0000

Regex update, Failing String: from 0.0.0.0 port=35756 helo=domain.co.…

98f1965

…id by mail.domain.com with esmtps TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256 envelope-from <local@domain.co.id> id 1jr0oT-0006e6-Mx for local@domain.com; Thu, 02 Jul 2020 15:07:51 +0000

Regex update, Failing String: from kk-worker4-prod unknown 172.16.0.5…

a901563

…6 by smtpd.kaskus.co.id Postfix with ESMTP id 8C02C2E063E for <formail@ctemplar.com>; Wed, 8 Jul 2020 18:40:03 +0700 WIB

Regex update, Failing String: from 0.0.0.0 1.1.1.1 :56905 by domain.c…

9cb9eb3

…om with XMail 1.2 password ESMTP Server id <S000000> for <local@domain.com> from <local@domain.com>; Mon, 6 Jul 2020 01:09:35 +0900

commented error logging in parsing received header as exception is ca…

66c3dae

…ught and handled already and we dont even need to parse this header, just stopping some noise in our error logger

handle decoding of payload separately which are not truly decoded by …

9b81bf9

…get_payload

handle special case of 8bit and 7bit instead of just trying to decode…

56494c6

… all encoding which are not decoded properly by python

content-id cant define if its really an attachment

d20dfba

make a distinction between attachments and body and reduce number of …

fe1b091

…failing emails and cover most of the cases

Typo Fixed: Needed to get the Content-Transfer-Encoding but we were p…

79e5732

…icking Content-Type

If CTE is not available then also we should decode the payload which …

0a8b2a4

…python encoded to keep encodings intact

fedelemantuano approved these changes Feb 15, 2021

fedelemantuano merged commit 3fc6b07 into SpamScope:develop Feb 15, 2021

fedelemantuano added a commit that referenced this pull request Feb 19, 2021

Fixed issue #83. Fixed PR #84 (received regex and filename attachs)

ece9b7f

Itay4 mentioned this pull request May 21, 2021

parse_from_bytes raises UnicodeDecodeError #88

Closed
