Fixed a lot of parsing errors #11

terhechte · 2021-09-29T14:39:30Z

Hey! I recently ran email-parser on a batch of ~650,000 emails from 2004 - 2021. Initially, roughly half of those emails were marked as invalid. I slowly went through the issues and fixed them one after the other in order to be able to parse as many emails as possible. All the changes can be found in this PR.
For me, it was important that I'm able to parse the majority of my mails. With these changes, all mails except for ~100 are parsed properly (and those are broken beyond repair, I had a brief look). However, I'm not sure if all of my changes are sensible additions. I added one commit for each fix so that you can easily figure out which ones are interesting. I also added tests (in the last commit) that involve most of the fixed issues.

Finally, I've added two feature flags in order to improve parsing:

allow-duplicate-headers: Gmail seems to add multiple To fields if forwarding from one Gmail account to another. Similarly, I had many emails with multiple Message-ID fields and so on. I've encapsulated this into a flag. If active, multiple headers are allowed, and in case the value is a list of items (e.g. Reply-To) they're all merged together. If inactive, the previous behavior of not allowing multiple headers stays as before.
decode-mime-body: I do need mime support for the parsing of Subject fields, but I'm not interested in bodies. Given that parsing of bodies might be expensive, I added a flag to specifically disable the parsing of bodies if mime is active.
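To make the merging rule of allow-duplicate-headers concrete, here is a minimal sketch of the idea. It is not the code from this PR: the function name `merge_duplicate_headers`, the set of list-valued headers, and the choice that the first occurrence of a single-valued header wins are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not part of email-parser: collapses duplicate headers.
/// List-valued headers (assumed here: To, Cc, Bcc, Reply-To) are merged into
/// one list; for every other header the first occurrence is kept and later
/// occurrences are ignored (an assumption, the PR only says "others ignored").
pub fn merge_duplicate_headers(headers: &[(&str, &str)]) -> HashMap<String, Vec<String>> {
    const LIST_HEADERS: [&str; 4] = ["to", "cc", "bcc", "reply-to"];
    let mut merged: HashMap<String, Vec<String>> = HashMap::new();

    for (name, value) in headers {
        // Header names are case-insensitive, so normalise before comparing.
        let key = name.to_ascii_lowercase();
        let is_list = LIST_HEADERS.contains(&key.as_str());
        let entry = merged.entry(key).or_default();

        if is_list {
            // Naive split on commas; a real parser must respect quoted
            // display names that contain commas.
            entry.extend(
                value
                    .split(',')
                    .map(|item| item.trim().to_string())
                    .filter(|item| !item.is_empty()),
            );
        } else if entry.is_empty() {
            // Single-valued header: keep only the first occurrence.
            entry.push(value.trim().to_string());
        }
    }
    merged
}
```

With this sketch, two To headers end up as one combined address list, while a second Message-ID is dropped.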

I also added one dependency. This could also be made into a feature flag I guess. This library converts timezone abbreviations (such as GMT) to proper timezone information. I built that yesterday for the specific purpose of parsing timezone information in email-parser.
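As a rough illustration of what such a conversion involves, the sketch below maps a few abbreviations to UTC offsets in minutes. This is deliberately not the API of the timezone-abbreviations crate: `offset_minutes` is a hypothetical function, and the table only covers the zone names that RFC 5322 lists in its obsolete syntax.

```rust
/// Hypothetical lookup, NOT the API of the timezone-abbreviations crate.
/// Returns the UTC offset in minutes for the zone names listed in the
/// obsolete date syntax of RFC 5322 (inherited from RFC 822).
pub fn offset_minutes(abbreviation: &str) -> Option<i32> {
    match abbreviation.to_ascii_uppercase().as_str() {
        "UT" | "UTC" | "GMT" => Some(0),
        "EST" => Some(-5 * 60),
        "EDT" => Some(-4 * 60),
        "CST" => Some(-6 * 60),
        "CDT" => Some(-5 * 60),
        "MST" => Some(-7 * 60),
        "MDT" => Some(-6 * 60),
        "PST" => Some(-8 * 60),
        "PDT" => Some(-7 * 60),
        // Unknown abbreviation: let the caller decide how to recover.
        _ => None,
    }
}
```

Abbreviations seen in real mail are often ambiguous (CST is also used for China Standard Time, for example), so a fixed table like this one can only be a starting point.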

Cheers & Thanks for this nice library!

Mubelotix

Hey! Thank you very much for this PR! Your library crate for timezones is nice!
I'm quite surprised by the share of invalid emails in your dataset. Is this a public dataset I could download somewhere?
I made some review comments, mostly about cargo features. The thing is that the performance of parsing valid emails should not be affected by the workarounds required by non-compliant emails unless the compatibility-fixes feature is explicitly enabled.

Mubelotix · 2021-09-30T12:46:53Z

email-parser/src/parsing/common.rs

@@ -75,6 +88,16 @@ pub fn word(input: &[u8]) -> Res<Cow<str>> {
    )
 }

+pub fn in_quotes(input: &[u8]) -> Result<(&[u8], Vec<Cow<str>>), Error> {


Suggested change

-pub fn in_quotes(input: &[u8]) -> Result<(&[u8], Vec<Cow<str>>), Error> {
+pub fn in_quotes(input: &[u8]) -> Res<Vec<Cow<str>>> {
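
For readers unfamiliar with the alias, the suggestion only makes sense if `Res<T>` expands to "remaining input plus parsed value, or an error". The sketch below shows that assumed shape with a toy parser. The `Error` enum and the `until_linebreak` function are placeholders invented for this example; the alias definition is inferred from the two signatures in the suggestion, not copied from the crate.

```rust
use std::borrow::Cow;

/// Placeholder error type for this sketch; the crate has its own `Error`.
#[derive(Debug, PartialEq)]
pub enum Error {
    Unexpected(&'static str),
}

/// Assumed shape of the alias: the unparsed rest of the input plus the value.
pub type Res<'a, T> = Result<(&'a [u8], T), Error>;

/// Toy parser (hypothetical): returns everything before the first CR or LF
/// and leaves the line break itself in the remaining input.
pub fn until_linebreak(input: &[u8]) -> Res<Cow<str>> {
    let end = input
        .iter()
        .position(|&b| b == b'\r' || b == b'\n')
        .unwrap_or(input.len());
    let (head, rest) = input.split_at(end);
    match std::str::from_utf8(head) {
        Ok(text) => Ok((rest, Cow::Borrowed(text))),
        Err(_) => Err(Error::Unexpected("invalid UTF-8")),
    }
}
```

Both spellings of the return type describe the same thing under this assumption; the alias just keeps the parser signatures short and uniform.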

Mubelotix · 2021-09-30T12:47:42Z

email-parser/src/parsing/common.rs

@@ -121,12 +145,30 @@ pub fn unstructured(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> {
    Ok((input, output))
 }

+pub fn unstructured_until_linebreak(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> {


Suggested change

-pub fn unstructured_until_linebreak(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> {
+pub fn unstructured_until_linebreak(input: &[u8]) -> Res<Cow<str>> {

Mubelotix · 2021-09-30T12:57:03Z

email-parser/Cargo.toml

@@ -12,6 +12,7 @@ keywords = ["email", "mail", "mime", "parser"]

 [dependencies]
 textcode = {version="0.2", optional=true}
+timezone-abbreviations = "0.1.0"


Correct me if I am wrong but custom timezone is defined by RFC 822 and this crate is focusing on RFC 5322. There is a feature named compatibility-fixes to allow older syntaxes. Please put everything timezone-related under this feature gate (including the new dependency).
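
One way to do this, sketched under assumptions: in Cargo.toml the dependency would be marked `optional = true` and listed in the feature (the same pattern as `mime = ["textcode"]`), and the code path would sit behind `#[cfg(feature = "compatibility-fixes")]`. The function names below are hypothetical and the legacy branch is reduced to two literals so the example stays self-contained.

```rust
/// Hypothetical zone parser illustrating the gate; not the crate's code.
/// Numeric zones such as "+0200" are always accepted; named zones are only
/// tried when the `compatibility-fixes` feature is enabled.
pub fn parse_zone(zone: &str) -> Option<i32> {
    if let Some(minutes) = parse_numeric_zone(zone) {
        return Some(minutes);
    }
    legacy_zone(zone)
}

/// Parses the strict form: sign followed by exactly four digits (HHMM).
fn parse_numeric_zone(zone: &str) -> Option<i32> {
    let bytes = zone.as_bytes();
    if bytes.len() != 5 || !bytes[1..].iter().all(u8::is_ascii_digit) {
        return None;
    }
    let sign = match bytes[0] {
        b'+' => 1,
        b'-' => -1,
        _ => return None,
    };
    let hours: i32 = zone[1..3].parse().ok()?;
    let minutes: i32 = zone[3..5].parse().ok()?;
    Some(sign * (hours * 60 + minutes))
}

#[cfg(feature = "compatibility-fixes")]
fn legacy_zone(zone: &str) -> Option<i32> {
    // Only compiled with the feature on; a real build would call the
    // optional timezone dependency here instead of matching literals.
    match zone {
        "GMT" | "UT" => Some(0),
        _ => None,
    }
}

#[cfg(not(feature = "compatibility-fixes"))]
fn legacy_zone(_zone: &str) -> Option<i32> {
    // Strict build: no fallback, so compliant input pays nothing extra.
    None
}
```

With the feature off, the fallback compiles down to a constant `None`, which keeps the cost for compliant emails at zero.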

Mubelotix · 2021-09-30T13:00:12Z

email-parser/Cargo.toml

@@ -35,6 +36,8 @@ compatibility-fixes = []
 content-disposition = ["mime"]
 unrecognized-headers = ["mime"]
 mime = ["textcode"]
+allow-duplicate-headers = []


I don't think we need a feature for that. I would like to leave the Email struct untouched, and rather create a new PermissiveEmail struct storing headers as Vec of their values so that it allows duplicate and even missing headers. But Email should designate a compliant email.
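
A rough sketch of what I have in mind, to make the proposal concrete. Nothing here exists in the crate yet: the field layout and the method names `push_header`, `values` and `single` are assumptions, and a real version would store parsed values rather than raw strings.

```rust
use std::collections::HashMap;

/// Hypothetical shape for the proposed struct; names are assumptions.
#[derive(Debug, Default)]
pub struct PermissiveEmail {
    /// Lower-cased header name mapped to every value seen, in order.
    headers: HashMap<String, Vec<String>>,
}

impl PermissiveEmail {
    /// Records one header occurrence; duplicates simply accumulate.
    pub fn push_header(&mut self, name: &str, value: &str) {
        self.headers
            .entry(name.to_ascii_lowercase())
            .or_default()
            .push(value.to_string());
    }

    /// All values of a header; an empty slice when the header is missing.
    pub fn values(&self, name: &str) -> &[String] {
        self.headers
            .get(&name.to_ascii_lowercase())
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }

    /// Strict view: `Some` only if the header appears exactly once.
    pub fn single(&self, name: &str) -> Option<&str> {
        match self.values(name) {
            [only] => Some(only.as_str()),
            _ => None,
        }
    }
}
```

A conversion from `PermissiveEmail` into `Email` could then fail whenever `single` returns `None` for a header that must appear exactly once, which keeps `Email` reserved for compliant messages.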

terhechte · 2021-10-01T06:28:29Z

> Hey! Thank you very much for this PR! Your library crate for timezones is nice!
> I'm quite surprised by the share of invalid emails in your dataset. Is this a public dataset I could download somewhere?
> I made some review comments mostly about cargo features. The thing is the performance of parsing valid emails should not be affected by the workarounds required by non-compliant emails unless the compatibility-fixes is explicitly enabled.

Ah no, these are my personal emails. I did a gmail download with the intention of generating statistics on them. I can probably dig out a couple of the worst offenders and remove some personal information from them (e.g. the body) to share.

Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what keywords and trace do.

Regarding the PermissiveEmail, I see the appeal of that. I'll see how much time I can invest to implement this. Right now the parser works well enough for the problem I'm trying to solve, so I'll probably focus on that first and afterwards look into this PR again :)

Mubelotix · 2021-10-02T09:25:18Z

> Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what keywords and trace do.

That would indeed be nice. keywords enables parsing for the keywords header and trace was supposed to parse all the trace-related headers. Unfortunately, the RFC is insanely vague and wrong about these headers. So I have no idea how I am supposed to implement them.

> Regarding the PermissiveEmail I see the appeal of that. I'll see how much time I can invest to implement this. Right now the parser works good enough for the problem I'm trying to solve so I'll probably focus on that first and afterwards look into this PR again :)

I will do it if you can't :)

terhechte added 20 commits September 29, 2021 09:23

Added support for timezone abbreviations

c06ca03

Add support for the 9:23:47 time format

eef0baf

Fix compile issue when enabling MIME

c3607e7

Support for latin1, shiftjs, etc codepage chars in subjects

fd66dae

Support emails with no sender but multiple from addresses

ce4c06f

Support '.' character in from names (e.g. "From: hey.io <info@heo.io>"

9c2caf6

Support more address chars in From. Add '@'

6d8b94d

Support for years in dates with only two digits. E.g. 11 for 2011

1767c3a

Support display names in quotes with any characters inside

a0614de

Add support for '+00:00' timezone format

7c7acc0

Some emails add a comment at the end of the date to indicate a timezone

3e644dc

Allow multiple 'to' values in headers as this is something gmail does during forwarding

305329f

Add feature flag to allow multiple headers for all fields. Vecs are joined, others ignored.

c6199ef

Some addresses have additional whitespace after the closing angle

b2c93e0

Fix for an issue where unknown fields had weird unicode letters in their values

2cdebe6

Support broken headers by moving them into unsupported (making parsing more resilient)

ff8e2a5

Add feature to disable mime body decoding

2100e15

Add option to get the number for a month

88ba128

Added tests for many of the issues that were fixed in the past commits

21021c9

Update dependency

da8582a

Mubelotix requested changes Sep 30, 2021

Support Apple Mail messages which use LF instead of CRLF

dba59d8

I've moved this under compatibility-fixes, but maybe that could also move under its own feature flag for emlx
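
The commit itself is not shown here, so the following is only one possible approach, not a description of it: normalise bare LF to CRLF before handing the bytes to the strict parser. The function name `normalize_line_endings` is hypothetical.

```rust
use std::borrow::Cow;

/// Hypothetical pre-processing step: turns every bare LF into CRLF so that
/// a parser expecting CRLF line endings can read the message. Input that is
/// already normalised is returned borrowed, without copying.
pub fn normalize_line_endings(input: &[u8]) -> Cow<[u8]> {
    // Fast path: look for an LF that is not preceded by a CR.
    let has_bare_lf = input
        .iter()
        .enumerate()
        .any(|(i, &b)| b == b'\n' && (i == 0 || input[i - 1] != b'\r'));
    if !has_bare_lf {
        return Cow::Borrowed(input);
    }

    let mut out = Vec::with_capacity(input.len() + 16);
    let mut previous = 0u8;
    for &byte in input {
        if byte == b'\n' && previous != b'\r' {
            // Insert the missing CR in front of the bare LF.
            out.push(b'\r');
        }
        out.push(byte);
        previous = byte;
    }
    Cow::Owned(out)
}
```

Returning a `Cow` means messages that already use CRLF are not copied, which fits the goal of not slowing down compliant input.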
