Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UUencoded attachment parsing #80

Closed
sylencecc opened this issue Nov 24, 2020 · 8 comments · Fixed by #81
Closed

UUencoded attachment parsing #80

sylencecc opened this issue Nov 24, 2020 · 8 comments · Fixed by #81

Comments

@sylencecc
Copy link
Contributor

When dealing with attachments encoded via uuencoding (Content-transfer-encoding is uuencode or x-uuencode), mail-parser treats them as text, as can be seen in parse() (mailparser.py:378):

if transfer_encoding == "base64" or (
  transfer_encoding == "quoted-\
  printable" and "application" in mail_content_type):
    ...
else:
  payload = ported_string(p.get_payload(decode=True), encoding=charset)
  log.debug("Filename {!r} part {!r} is not binary".format(filename, i))

Within the else block, the payload is correctly decoded with p.get_payload(decode=True), but then passed to ported_string() which attempts to encode the returned bytes to UTF-8 in utils.py:85:

def ported_string(raw_data, encoding='utf-8', errors='ignore'):
...
  try:
    return six.text_type(raw_data, encoding).strip()
  except (LookupError, UnicodeDecodeError):
    return six.text_type(raw_data, "utf-8", errors).strip()

Since errors are ignored, encoding doesn't fail, but returns a attachment stripped of all bytes that can't be encoded in utf-8 (that can be easily verified by attempting to write that binary to disk with write_attachments).

I encountered this issue while porting SpamScope to Python3, which has a test test_store_samples_unicode_error that parses and saves a uuencoded attachment. According to the test, the resulting file should have a MD5 checksum of 2ea90c996ca28f751d4841e6c67892b8. That test passes with Python2, because the incorrectly parsed payload does indeed have that hash. However, with Python3 the hash changes due to differences in unicode handling. However, the correct checksum is actually 4f2cf891e7cfb349fca812091f184ecc.

@sylencecc
Copy link
Contributor Author

sylencecc commented Nov 24, 2020

I added the respective uuencoded mail sample from the SpamScope repository (its name is mail_test_10 over there) and a test case that highlights the issue. That test fails with py2 while attempting to write the binary data as utf-8:

File "mailparser/utils.py", line 562, in write_sample
  f.write(payload)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0612' in position 57: ordinal not in range(128

SpamScope doesn't run into that exception, because it ships with its own write method that succeeds in writing the (now broken) file.
For py3, the test fails as expected due to an invalid hash:

AssertionError: 'c2d03e7cb36e48c625d9243ee3fdc401' != '4f2cf891e7cfb349fca812091f184ecc'
- c2d03e7cb36e48c625d9243ee3fdc401
+ 4f2cf891e7cfb349fca812091f184ecc

As for the fix, I decided to simply re-encode the uuencoded attachment in base64 and call it a day. That's because the binary flag is mostly treated as base64 all over the place (even in SpamScope), even though an attachment could also be encoded as uuencode (amongst others).
To highlight a few examples:

  • In mailparser/utils.py:557 binary-encoded attachments are always treated as if they were base64
if binary:
  with open(sample, "wb") as f:
    f.write(base64.b64decode(payload))
  • Same for SpamScope, e.g. in src/modules/attachments/attachments.py:259:
if raw_dict["binary"]:
  try:
    payload = base64.b64decode(raw_dict["payload"])

However, just keeping binary set to False causes other issues, because then the payload will be treated as utf-8 string in various places.

For a better solution it might be beneficial to simply treat payload as a byte string or bytearray if binary is True instead of expecting a base64-encoded string. Since that would require various rewrites all over mail-parser and SpamScope, I went with this easier fix. The obvious drawback of my solution is that the transfer encoding is now incorrectly reported as base64 even though the actual attachment was encoded with uuencode. That's still better than not being able to parse those attachments at all.

@fedelemantuano
Copy link
Contributor

Hi @sylencecc,

thanks a lot for your PR. Are you working on SpamScope porting?

@sylencecc
Copy link
Contributor Author

sylencecc commented Nov 25, 2020

Yeah, we have some py3-only dependencies, which is why I'm currently in the process of porting it over. If you're interested in the results, I'll happily send a pull request as soon as I'm done. However, I'm not keeping backwards compatibility: it won't run with py2 anymore. In addition to that, I'm only using the Docker-based version, so I won't touch the Ansible stuff for now. Docker-wise, since the SpamScope image depends on fmantuano/spamscope-deps (which I didn't find a repository for) and this again depends on fmantuano/apache-storm (which I DID find a repository for), I did the following:

  • Forked the repository for fmantuano/apache-storm and updated Apache Storm to version 2.2.0 (see here)
  • Merged the Dockerfile for spamscope-deps into the Dockerfile within the SpamScope repository and updated all dependencies. The drawback of that merge is that building the v8 engine for thug takes forever, however now users have the chance to manage their dependencies which I find better than depending on the outdated soamscope-deps from Docker Hub.
  • Conversion process itself is still underway (it's just a side project), but SpamScope tests are passing now with py3.

Let me know if I should send you PR requests for all that stuff. A separate branch might be appropriate.

@fedelemantuano
Copy link
Contributor

Yes you can send me all PRs.

I'm not keeping backwards compatibility: it won't run with py2 anymore.
Me too.

Good job.

@sylencecc
Copy link
Contributor Author

sylencecc commented Dec 1, 2020

Please have a look at my SpamScope fork and the Storm Dockerfile which the new SpamScope image depends on. So far, the included tests run fine, as does the default debug topology. The project we're using SpamScope in also seems to run on py3 without further issues. However, due to the mediocre test coverage we can't be all too confident that this update doesn't break anything. Moreover, I didn't update the Ansible playbooks due to a lack of time.

  • Do you want to merge that info master/develop or create a separate branch just for py3 support?
  • Should I squash my commits into a single "py3 update" commit?
  • Did you encounter any issues with my changes, anything I forgot etc.?

@fedelemantuano
Copy link
Contributor

Hi @sylencecc,

I will check your code. Now I can't.
I prefer to use a custum branch, git flow style feature/py3.
Yes, it better squash your commits.

For code I will test it.

@sylencecc
Copy link
Contributor Author

Any news on that? Did you find time to test the py3 changes?

@fedelemantuano
Copy link
Contributor

Hi @sylencecc I will test it. I'm working on it best effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants