Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed a lot of parsing errors #11

Open
wants to merge 21 commits into
base: master
Choose a base branch
from
Open

Conversation

terhechte
Copy link

@terhechte terhechte commented Sep 29, 2021

Hey! I recently ran email-parser on a batch of ~650.000 emails from 2004 - 2021. Initially roughly half of those emails were marked as invalid. I slowly went through the issues and fixed one after the other in order to be able to parse as many emails a possible. All the changes can be found in this PR.
For me, it was important that I'm able to parse the majority of my mails. With these changes, all mails except for ~100 are parsed properly (and those are broken beyond repair, I had a brief look). However, I'm not sure if all of my changes are sensible additions. I added one commit for each fix so that you can easily figure out which ones are interesting. I also added tests (in the last commit) that involve most of the fixed issues.

Finally, I've added two feature flags in order to improve parsing:

  • allow-duplicate-headers: Gmail seems to add multiple To fields if forwarding from one Gmail account to another. Similarly, I had many emails with multiple Message-ID fields and so on. I've encapsulated this into a flag. If active, multiple headers are allowed, and in case the value is a list of items (e.g. Reply-To) they're all merged together. If inactive, the previous behavior of not allowing multiple headers stays as before.
  • decode-mime-body: I do need mime support for the parsing of Subject fields, but I'm not interested in bodies. Given that parsing of bodies might be expensive, I added a flag to specifically disable the parsing of bodies if mime is active.

I also added one dependency. This could also be made into a feature flag I guess. This library converts timezone abbreviations (such as GMT) to proper timezone information. I build that yesterday for the specific purpose of parsing timezone information in email-parser.

Cheers & Thanks for this nice library!

Copy link
Owner

@Mubelotix Mubelotix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Thank you very much for this PR! Your library crate for timezones is nice!
I'm quite surprised by the share of invalid emails in your dataset. Is this a public dataset I could download somewhere?
I made some review comments mostly about cargo features. The thing is the performance of parsing valid emails should not be affected by the workarounds required by non-compliant emails unless the compatibility-fixes is explicitly enabled.

@@ -75,6 +88,16 @@ pub fn word(input: &[u8]) -> Res<Cow<str>> {
)
}

pub fn in_quotes(input: &[u8]) -> Result<(&[u8], Vec<Cow<str>>), Error> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn in_quotes(input: &[u8]) -> Result<(&[u8], Vec<Cow<str>>), Error> {
pub fn in_quotes(input: &[u8]) -> Res<Vec<Cow<str>>> {

@@ -121,12 +145,30 @@ pub fn unstructured(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> {
Ok((input, output))
}

pub fn unstructured_until_linebreak(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn unstructured_until_linebreak(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> {
pub fn unstructured_until_linebreak(input: &[u8]) -> Res<Cow<str>> {

@@ -12,6 +12,7 @@ keywords = ["email", "mail", "mime", "parser"]

[dependencies]
textcode = {version="0.2", optional=true}
timezone-abbreviations = "0.1.0"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I am wrong but custom timezone is defined by RFC 822 and this crate is focusing on RFC 5322. There is a feature named compatibility-fixes to allow older syntaxes. Please put everything timezone-related under this feature gate (including the new dependency).

@@ -35,6 +36,8 @@ compatibility-fixes = []
content-disposition = ["mime"]
unrecognized-headers = ["mime"]
mime = ["textcode"]
allow-duplicate-headers = []
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a feature for that. I would like to leave the Email struct untouched, and rather create a new PermissiveEmail struct storing headers as Vec of their values so that it allows duplicate and even missing headers. But Email should designate a compliant email.

@terhechte
Copy link
Author

Hey! Thank you very much for this PR! Your library crate for timezones is nice!
I'm quite surprised by the share of invalid emails in your dataset. Is this a public dataset I could download somewhere?
I made some review comments mostly about cargo features. The thing is the performance of parsing valid emails should not be affected by the workarounds required by non-compliant emails unless the compatibility-fixes is explicitly enabled.

Ah no, these are my personal emails. I did a gmail download with the intention of generating statistics on them. I can probably dig out a couple of the worst offenders and remove some personal information from them (e.g. the body) to share.

Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what keywords and trace do.

Regarding the PermissiveEmail I see the appeal of that. I'll see how much time I can invest to implement this. Right now the parser works good enough for the problem I'm trying to solve so I'll probably focus on that first and afterwards look into this PR again :)

@Mubelotix
Copy link
Owner

Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what keywords and trace do.

That would indeed be nice. keywords enables parsing for the keywords header and trace was supposed to parse all the trace-related headers. Unfortunately, the RFC is insanely vague and wrong about these headers. So I have no idea how I am supposed to implement them.

Regarding the PermissiveEmail I see the appeal of that. I'll see how much time I can invest to implement this. Right now the parser works good enough for the problem I'm trying to solve so I'll probably focus on that first and afterwards look into this PR again :)

I will do it if you can't :)

I've moved this under compatibility-fixes, but maybe that could
also move under its own feature flag for emlx
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants