emailparser

A C extension for Python that splits email reply chains into individual segments and extracts structured data from each one.

This README was generated with Claude Code. The code itself was partially generated with Claude Code, especially the guards and error handling, as I find it highly efficient at those kinds of tasks. The core logic is hand-written, but Claude's modifications have not been thoroughly reviewed, so they may contain errors.

See CONTRIBUTING.md for a deep-dive into the code.


Table of contents

  0. Installation & Quick Start
  1. Building
  2. Concepts
  3. The Email iterator
  4. parse_headers
  5. extract_body
  6. find_signature
  7. strip_signature
  8. Putting it all together
  9. Source layout

0. Installation

pip install fastemailparser

Quick Start:

from emailparser import Email

# Read the whole file and pass the raw content to Email
with open('mail.html', 'r') as f:
    chain_mail = Email(f.read())

print(next(chain_mail))

See Section 8 for a more detailed usage example.


1. Building

pip install setuptools
python3 setup.py build_ext --inplace

This compiles emailparser.cpython-*.so into the project root. All tests can then be run with:

python3 -m pytest functional_tests.py test_emailparser.py -v

2. Concepts

An email reply chain looks like this on disk (plain text or HTML):

Latest reply body…

From: Alice <alice@example.com>
Sent: Monday, 2 June 2025 10:00 AM
To: Bob
Subject: RE: Project update

Previous reply body…

From: Bob <bob@example.com>
…

emailparser finds every From: / De: separator and yields the content between them as segments. Each segment begins with the separator header block and ends just before the next one.

Key facts:

  • Segment 0 is the content before the first separator — the body of the most recent (top) reply. It has no inline headers of its own.
  • Segments 1, 2, … each start with their own From:/Sent:/To:/Subject: block, followed by a blank line and then the reply body.
  • For raw MIME emails (.eml, Date: … at the top of the file), the outer header block is skipped automatically and exposed via outer_headers.
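
The splitting rule described above can be approximated in pure Python. This is only an illustration: the pattern and the split_chain name are assumptions made for this example, and the real separator regex is compiled into the C extension and may be stricter.

```python
import re

# Illustrative pattern only: a line starting with "From:" or "De :".
# The separator regex used by the C extension may differ.
SEPARATOR = re.compile(r"^(?:From|De) ?:", re.MULTILINE)

def split_chain(text):
    """Return the segments of a reply chain, segment 0 first."""
    starts = [m.start() for m in SEPARATOR.finditer(text)]
    bounds = [0] + starts + [len(text)]
    # Each segment runs from one separator up to (not including) the next.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]

chain = (
    "Latest reply body\n\n"
    "From: Alice <alice@example.com>\n"
    "Subject: RE: Project update\n\n"
    "Previous reply body\n\n"
    "From: Bob <bob@example.com>\n"
)
segments = split_chain(chain)
```

Here segments[0] is the text before the first separator, and every later segment starts with its own From: line.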

3. The Email iterator

import emailparser

for segment in emailparser.Email("path/to/email.html"):
    print(segment)

Email accepts a file path, a raw string, or bytes. It is an iterator: calling next() on it returns one segment at a time.

Input formats

Input                         Behaviour
Email("path/to/file.html")    Opens and reads the file
Email("<html>…</html>")       String with no matching file → used as raw content
Email(b"raw bytes")           Always treated as raw content

# File path
for seg in emailparser.Email("mail.txt"):
    ...

# Raw string
content = open("mail.txt").read()
for seg in emailparser.Email(content):
    ...

# Bytes
raw = open("mail.txt", "rb").read()
for seg in emailparser.Email(raw):
    ...

All three produce identical results.
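
One way to picture the dispatch between the three input kinds is the helper below. It is a sketch, not the extension's code: load_source is a hypothetical name, and decoding as UTF-8 with replacement characters is an assumption.

```python
import os

def load_source(source):
    """Return raw email content as str for a path, a string or bytes."""
    if isinstance(source, bytes):
        # Bytes are never interpreted as a path.
        return source.decode("utf-8", errors="replace")
    if os.path.isfile(source):
        # The string names an existing file: read it.
        with open(source, encoding="utf-8", errors="replace") as f:
            return f.read()
    # No file with that name: the string itself is the content.
    return source

from_bytes = load_source(b"raw bytes")
from_string = load_source("<html>hello</html>")
```

A consequence of this kind of rule is that a string which happens to match an existing file name is read as a file, so pass bytes when the content must never be treated as a path.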


plain_text

emailparser.Email(source, plain_text=True)

Strips HTML tags and decodes entities (&lt;, &nbsp;, …) via libxml2 before yielding each segment. Block-level elements (<p>, <div>, <br>, …) are replaced with newlines.

segs = list(emailparser.Email("mail.txt", plain_text=True))

# Without plain_text:
# '<div data-test-id="mailMessageBodyContainer">Dear Ms. De Pedro…'

# With plain_text:
# '\n\n\nDear Ms. De Pedro,\nGood day,\n…'

Use plain_text=True whenever you want to process the text content rather than render the HTML.
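
To give a feel for the transformation, here is a rough standard-library approximation. It uses html.parser rather than libxml2, the BLOCK_TAGS set is an illustrative subset, and html_to_text is a hypothetical name, so its output will not match the extension byte for byte.

```python
from html.parser import HTMLParser

# Illustrative subset of block-level tags that become newlines.
BLOCK_TAGS = {"p", "div", "br", "li", "tr"}

class _TextExtractor(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; and &nbsp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    """Drop tags, decode entities, turn block-level tags into newlines."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

text = html_to_text("<div>Dear Ms. De Pedro,<br>Good day &amp; thanks</div>")
# text == "\nDear Ms. De Pedro,\nGood day & thanks"
```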


standalone

emailparser.Email(source, standalone=True)

Wraps each segment in a complete, self-contained HTML document so it can be saved to a file and opened directly in a browser:

<!DOCTYPE html>
<html><head>
  <meta charset="UTF-8">
  <style>/* base CSS + all <style> blocks extracted from the source */</style>
</head>
<body>
  <!-- segment content -->
</body></html>

  • HTML segments are embedded as-is.
  • Plain-text / quoted-printable segments are decoded and wrapped in <pre>.
  • standalone is ignored when combined with plain_text=True.

segs = list(emailparser.Email("test_emails/test2.html", standalone=True))

with open("segment_0.html", "w") as f:
    f.write(segs[0])   # open in browser and it renders correctly
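
The wrapping step can be sketched in a few lines of Python. This is a simplified stand-in for illustration: wrap_standalone_sketch and its is_html and css parameters are hypothetical, and the real wrapper also injects base CSS and the <style> blocks extracted from the source.

```python
import html

def wrap_standalone_sketch(segment, css="", is_html=True):
    """Wrap one segment in a minimal self-contained HTML document."""
    # Plain text is escaped and put in <pre> so its line breaks survive.
    if is_html:
        content = segment
    else:
        content = "<pre>" + html.escape(segment) + "</pre>"
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        '  <meta charset="UTF-8">\n'
        f"  <style>{css}</style>\n"
        "</head>\n"
        f"<body>\n{content}\n</body></html>\n"
    )

doc = wrap_standalone_sketch("a < b\nsecond line", is_html=False)
```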

strip_headers

emailparser.Email(source, plain_text=True, strip_headers=True)

Removes the From:/To:/Subject:/Date: header block from the top of each segment, leaving only the reply body.

  • Has no effect on segment 0 (which starts with the reply body directly, not with a header block).
  • Works for both English headers (From:, Sent:) and French headers (De :, Envoyé :, À :).

# Without strip_headers:
# '\nFrom: D9A (Branko Olic)…\nSent: Wednesday…\nTo: docs\n\nHi Abby,…'

# With strip_headers:
# '\nHi Abby,…'

Flags can be combined, for example plain_text with strip_headers (standalone is ignored once plain_text=True is set):

for body in emailparser.Email("chain.html", plain_text=True, strip_headers=True):
    print(body)

outer_headers

For raw MIME emails the outer header block (the metadata of the most recent email) is skipped during iteration but remains accessible as a property:

email = emailparser.Email("test_emails/test2.html")

print(email.outer_headers)
# {
#   "from":    '"D9A (Branko Olic) Marlow CD-D9A" <d9a@marlowgroup.com>',
#   "to":      ['docs <docs@interportfrance.fr>'],
#   "cc":      ['"g2.mnph@marlowgroup.com" <g2.mnph@marlowgroup.com>', …],
#   "bcc":     [],
#   "subject": "MV RANGER - Schengen Visa - OS BALABAT",
#   "date":    "Wed, 11 Jun 2025 13:25:14 +0200"
# }

Returns None for pure HTML emails (e.g. mail.txt) that have no outer MIME header block.

if email.outer_headers:
    sender = email.outer_headers["from"]
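
For comparison, a dict of the same shape can be built from a raw MIME header block with Python's standard email package. This is an independent sketch, not how the extension does it: outer_headers_sketch is a hypothetical name and the sample message is made up.

```python
from email import message_from_string
from email.utils import formataddr, getaddresses

def outer_headers_sketch(raw_email):
    """Parse the top header block of a raw MIME email into a dict."""
    msg = message_from_string(raw_email)
    if not msg.keys():
        return None  # no header fields at the top of the input

    def addresses(name):
        # One entry per address, e.g. 'Bob <bob@example.com>'
        pairs = getaddresses(msg.get_all(name, []))
        return [formataddr(pair) for pair in pairs]

    return {
        "from": msg["From"],
        "to": addresses("To"),
        "cc": addresses("Cc"),
        "bcc": addresses("Bcc"),
        "subject": msg["Subject"],
        "date": msg["Date"],
    }

raw = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>, Carol <carol@example.com>\n"
    "Subject: Project update\n"
    "Date: Mon, 2 Jun 2025 10:00:00 +0200\n"
    "\n"
    "Latest reply body\n"
)
headers = outer_headers_sketch(raw)
```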

4. parse_headers

emailparser.parse_headers(segment) -> dict

Extracts the header fields from any segment string.

Returns a dict with these keys — always present, defaults shown:

Key          Type          Default
"from"       str | None    None
"to"         list[str]     []
"cc"         list[str]     []
"bcc"        list[str]     []
"subject"    str | None    None
"date"       str | None    None

Recognised field names (case-insensitive):

Language    Fields
English     From, To, CC, BCC, Subject, Date, Sent, Reply-To
French      De, À / à, Cci, Objet, Envoyé

Handles HTML segments and quoted-printable encoding automatically.

segs = list(emailparser.Email("test_emails/test2.html", plain_text=True))
seg  = segs[3]   # a quoted reply starting with "From: D9A…"

h = emailparser.parse_headers(seg)
# {
#   "from":    "D9A (Branko Olic) Marlow CD-D9A",
#   "to":      ["docs"],
#   "cc":      ["g2.mnph@marlowgroup.com", "Info - INTERPORT", …],
#   "bcc":     [],
#   "subject": "RE: MV RANGER - Schengen Visa - OS BALABAT",
#   "date":    "Wednesday, June 11, 2025 11:49 AM"
# }

print(h["from"])     # "D9A (Branko Olic) Marlow CD-D9A"
print(h["to"])       # ["docs"]
print(h["subject"])  # "RE: MV RANGER - Schengen Visa - OS BALABAT"

Note: parse_headers on segment 0 returns all None/[] because segment 0 is the latest reply body with no inline header block. Use email.outer_headers to access the metadata of that first email.
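
The mapping from English and French field names to the six canonical keys can be pictured with the pure-Python sketch below. It is a simplification under stated assumptions: parse_headers_sketch, ALIASES and the ";" address separator are all illustrative, Reply-To is left out because it has no key in the returned dict, and HTML or quoted-printable input is not handled.

```python
import re

# Canonical key for each recognised field name (lower-cased).
ALIASES = {
    "from": "from", "de": "from",
    "to": "to", "à": "to",
    "cc": "cc",
    "bcc": "bcc", "cci": "bcc",
    "subject": "subject", "objet": "subject",
    "date": "date", "sent": "date", "envoyé": "date",
}
LIST_KEYS = {"to", "cc", "bcc"}
FIELD = re.compile(r"^([^:\n]+?)\s*:\s*(.*)$")

def parse_headers_sketch(segment):
    headers = {"from": None, "to": [], "cc": [], "bcc": [],
               "subject": None, "date": None}
    for line in segment.lstrip("\n").split("\n"):
        if not line.strip():
            break  # the first blank line ends the header block
        m = FIELD.match(line)
        key = ALIASES.get(m.group(1).strip().lower()) if m else None
        if key is None:
            continue  # not a recognised field
        value = m.group(2).strip()
        if key in LIST_KEYS:
            # Assumption: recipients are separated by ";"
            headers[key] = [v.strip() for v in value.split(";") if v.strip()]
        else:
            headers[key] = value
    return headers

seg = (
    "From: Alice <alice@example.com>\n"
    "Sent: Monday, 2 June 2025 10:00 AM\n"
    "To: Bob; Carol\n"
    "Subject: RE: Project update\n"
    "\n"
    "Previous reply body\n"
)
h = parse_headers_sketch(seg)
```

A segment with no header block, such as segment 0, comes back with every key at its default value.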


5. extract_body

emailparser.extract_body(segment) -> str

Returns the segment with the header block stripped — symmetric counterpart to parse_headers.

  • If the segment starts with a recognised header field, scans to the first blank line and returns everything after it.
  • If the segment does not start with a recognised header (e.g. segment 0), it is returned unchanged.

seg  = segs[3]   # starts with "From: D9A…\nSent:…\nTo:…\n\nHi Abby,…"
body = emailparser.extract_body(seg)
# '\nHi Abby,\n\nGm,\n\nPls see blw…'

# Segment 0 has no header block — returned as-is
body_0 = emailparser.extract_body(segs[0])
# '\n\nDear Ms. De Pedro,\nGood day,…'

Pair it with parse_headers to access both parts independently:

headers = emailparser.parse_headers(seg)
body    = emailparser.extract_body(seg)
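
The two rules for extract_body can be restated as a short pure-Python sketch. The names extract_body_sketch and HEADER_START are hypothetical, the field list is copied from the table of recognised names, and returning an empty string when no blank line follows the headers is an assumption.

```python
import re

# A segment "starts with a header" if its first non-blank text is one
# of the recognised field names followed by a colon.
HEADER_START = re.compile(
    r"\s*(?:From|De|To|À|Cc|Cci|Bcc|Subject|Objet|Date|Sent|Envoyé|Reply-To) ?:",
    re.IGNORECASE,
)
BLANK_LINE = re.compile(r"\n[ \t]*\n")

def extract_body_sketch(segment):
    start = HEADER_START.match(segment)
    if start is None:
        return segment  # no header block (e.g. segment 0): unchanged
    blank = BLANK_LINE.search(segment, start.end())
    if blank is None:
        return ""  # assumption: headers only, no body
    # Keep the blank line's newline, as in the '\nHi Abby,…' example.
    return segment[blank.start() + 1:]

seg_with_headers = "From: Alice\nSent: Monday\nTo: Bob\n\nHi Bob,\nSee below.\n"
body_sketch = extract_body_sketch(seg_with_headers)
```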

6. find_signature

emailparser.find_signature(text) -> int

Returns the character index where the signature block starts, or -1 if no signature is found.

Detects:

Pattern               Example
RFC 3676 delimiter    "-- " (dash, dash, space) or "--" on its own line
English closings      Kind regards, Best regards, Sincerely, Thanks, …
French closings       Cordialement, Bien cordialement, Merci, Salutations

Works on both plain-text and raw HTML segments (libxml2 DOM path for HTML, line-scan fallback for plain text).

body = emailparser.extract_body(segs[3])

idx = emailparser.find_signature(body)
# 801

if idx >= 0:
    message   = body[:idx]   # reply body without signature
    signature = body[idx:]   # "Kind regards,\n\nBranko Olic…"
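
The line-scan fallback for plain text might look roughly like the sketch below. It is an assumption-laden illustration: find_signature_sketch is a hypothetical name, CLOSINGS only covers the closings listed in the table, and a closing must fill the whole line to count.

```python
CLOSINGS = (
    "kind regards", "best regards", "sincerely", "thanks",
    "cordialement", "bien cordialement", "merci", "salutations",
)

def find_signature_sketch(text):
    """Return the index where the signature starts, or -1."""
    offset = 0
    for line in text.splitlines(keepends=True):
        bare = line.rstrip("\r\n")
        closing = bare.strip().lower().rstrip(",.!")
        if bare in ("-- ", "--") or closing in CLOSINGS:
            return offset
        offset += len(line)  # advance past this line, newline included
    return -1

body = "Hi Abby,\n\nPlease see below.\n\nKind regards,\n\nBranko\n"
idx = find_signature_sketch(body)
message = body[:idx] if idx >= 0 else body
```

Because only whole-line closings match, a sentence such as "Thanks for the update" in the middle of the body is not mistaken for a signature.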

7. strip_signature

emailparser.strip_signature(text) -> str

Returns text with the signature block removed. Shorthand for:

idx = emailparser.find_signature(text)
clean = text[:idx] if idx >= 0 else text

Returns the input unchanged if no signature is found.

body  = emailparser.extract_body(segs[3])
clean = emailparser.strip_signature(body)
# reply body text only, no "Kind regards,\n\nBranko Olic…"

8. Putting it all together

Complete example — iterate over a MIME email chain and extract every piece of structured data:

import emailparser

source = "test_emails/test2.html"
email  = emailparser.Email(source, plain_text=True)

# ── Most recent email (segment 0 has no inline headers) ──────────────────
first_headers = email.outer_headers          # dict or None
first_body    = None                         # collected below

# ── Iterate over all segments ─────────────────────────────────────────────
for i, seg in enumerate(email):
    headers = emailparser.parse_headers(seg)
    body    = emailparser.extract_body(seg)
    clean   = emailparser.strip_signature(body)
    sig_idx = emailparser.find_signature(body)
    sig     = body[sig_idx:] if sig_idx >= 0 else ""

    if i == 0:
        first_body = body   # segment 0 is already just the body

    print(f"── Segment {i} ──────────────────────")
    if i == 0:
        # segment 0: headers come from outer_headers
        if first_headers:
            print(f"  From:    {first_headers['from']}")
            print(f"  Subject: {first_headers['subject']}")
    else:
        print(f"  From:    {headers['from']}")
        print(f"  Date:    {headers['date']}")
        print(f"  Subject: {headers['subject']}")
    print(f"  Body:    {clean[:60].strip()!r}…")
    if sig:
        print(f"  Sig:     {sig[:40].strip()!r}…")

9. Source layout

emailparser.c          Python type (EmailObject) and module init
email.h                email_t struct definition
setup.py               build script — compiles all src/*.c files
main.c                 minimal standalone C binary
src/
  buf.h                strbuf_t type and sb_push (header-only)
  mime.h / mime.c      decode_qp · skip_mime_headers · has_html_mime_part
  html.h / html.c      walk_text · segment_to_text · html_to_plain_c
  standalone.h / .c    extract_css · wrap_standalone
  email_iter.h / .c    SEPARATOR_REGEX · new_email · get_next_val
  headers.h / .c       canonical_key · py_parse_headers
  body.h / .c          find_body_start · py_extract_body
  signature.h / .c     py_find_signature · py_strip_signature
test_emailparser.py    unittest suite (mail.txt)
functional_tests.py    pytest parametrised suite (test_emails/)

Compile-time override

The separator regex can be overridden without changing the source:

python3 setup.py build_ext --inplace \
  build_ext --define SEPARATOR_REGEX='"(From|De) ?:"'
