New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed a lot of parsing errors #11
base: master
Are you sure you want to change the base?
Conversation
… during forwarding
…oined, others ignored.
…g more resilient)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! Thank you very much for this PR! Your library crate for timezones is nice!
I'm quite surprised by the share of invalid emails in your dataset. Is this a public dataset I could download somewhere?
I made some review comments mostly about cargo features. The thing is the performance of parsing valid emails should not be affected by the workarounds required by non-compliant emails unless the compatibility-fixes
is explicitly enabled.
@@ -75,6 +88,16 @@ pub fn word(input: &[u8]) -> Res<Cow<str>> { | |||
) | |||
} | |||
|
|||
pub fn in_quotes(input: &[u8]) -> Result<(&[u8], Vec<Cow<str>>), Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn in_quotes(input: &[u8]) -> Result<(&[u8], Vec<Cow<str>>), Error> { | |
pub fn in_quotes(input: &[u8]) -> Res<Vec<Cow<str>>> { |
@@ -121,12 +145,30 @@ pub fn unstructured(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> { | |||
Ok((input, output)) | |||
} | |||
|
|||
pub fn unstructured_until_linebreak(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn unstructured_until_linebreak(input: &[u8]) -> Result<(&[u8], Cow<str>), Error> { | |
pub fn unstructured_until_linebreak(input: &[u8]) -> Res<Cow<str>> { |
@@ -12,6 +12,7 @@ keywords = ["email", "mail", "mime", "parser"] | |||
|
|||
[dependencies] | |||
textcode = {version="0.2", optional=true} | |||
timezone-abbreviations = "0.1.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I am wrong but custom timezone is defined by RFC 822 and this crate is focusing on RFC 5322. There is a feature named compatibility-fixes
to allow older syntaxes. Please put everything timezone-related under this feature gate (including the new dependency).
@@ -35,6 +36,8 @@ compatibility-fixes = [] | |||
content-disposition = ["mime"] | |||
unrecognized-headers = ["mime"] | |||
mime = ["textcode"] | |||
allow-duplicate-headers = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a feature for that. I would like to leave the Email
struct untouched, and rather create a new PermissiveEmail
struct storing headers as Vec of their values so that it allows duplicate and even missing headers. But Email
should designate a compliant email.
Ah no, these are my personal emails. I did a gmail download with the intention of generating statistics on them. I can probably dig out a couple of the worst offenders and remove some personal information from them (e.g. the body) to share. Somehow I missed the compatibility-fixes feature flag. I suppose it would be nice to have a brief explanation of all the existing feature flags. I also wasn't sure what Regarding the |
That would indeed be nice.
I will do it if you can't :) |
I've moved this under compatibility-fixes, but maybe that could also move under its own feature flag for emlx
Hey! I recently ran
email-parser
on a batch of ~650.000 emails from 2004 - 2021. Initially roughly half of those emails were marked as invalid. I slowly went through the issues and fixed one after the other in order to be able to parse as many emails a possible. All the changes can be found in this PR.For me, it was important that I'm able to parse the majority of my mails. With these changes, all mails except for ~100 are parsed properly (and those are broken beyond repair, I had a brief look). However, I'm not sure if all of my changes are sensible additions. I added one commit for each fix so that you can easily figure out which ones are interesting. I also added tests (in the last commit) that involve most of the fixed issues.
Finally, I've added two feature flags in order to improve parsing:
allow-duplicate-headers
: Gmail seems to add multipleTo
fields if forwarding from one Gmail account to another. Similarly, I had many emails with multipleMessage-ID
fields and so on. I've encapsulated this into a flag. If active, multiple headers are allowed, and in case the value is a list of items (e.g.Reply-To
) they're all merged together. If inactive, the previous behavior of not allowing multiple headers stays as before.decode-mime-body
: I do need mime support for the parsing ofSubject
fields, but I'm not interested in bodies. Given that parsing of bodies might be expensive, I added a flag to specifically disable the parsing of bodies if mime is active.I also added one dependency. This could also be made into a feature flag I guess. This library converts timezone abbreviations (such as
GMT
) to proper timezone information. I build that yesterday for the specific purpose of parsing timezone information inemail-parser
.Cheers & Thanks for this nice library!