Threading and knowledge

Everything on this page is computed from index.db — never by scripting Mail.app, which would be orders of magnitude slower for the same aggregate/graph queries.

JWZ threading

flowchart TD
    P["Phantom container<br/>mid=A<br/>(referenced, no row)"]:::phantom
    R1["Real message<br/>mid=B<br/>References: A"]:::real
    R2["Real message<br/>mid=C<br/>In-Reply-To: A"]:::real
    R3["Real reply<br/>mid=D<br/>References: A C"]:::real

    P --> R1
    P --> R2
    R2 --> R3

    N["get_or_create A<br/>row stays None -> phantom<br/>still groups its real children"]

    classDef phantom fill:#fff3cd,stroke:#b8860b,stroke-dasharray:5 3,color:#333
    classDef real fill:#d4edda,stroke:#28a745,color:#333

A phantom container (referenced Message-ID A with no indexed row) still groups its real children B, C, and D into one conversation tree.

read/threader.py implements the classic JWZ (Jamie Zawinski, 1997) algorithm, still the basis of most mail clients' thread reconstruction, run entirely against indexed message_id/ in_reply_to/references_ids columns (no .emlx reparse needed):

Build a container per message, keyed by Message-ID. A message's References (or, if absent, In-Reply-To) chain links containers parent→child in order; the message itself becomes a child of the last reference in its chain.
References to a Message-ID with no corresponding row become phantom containers — they still correctly group their real children together (e.g. two replies to a message we don't have, perhaps because it lives in an unindexed mailbox).
A root-level-only fallback (_merge_orphan_roots_by_subject()) merges orphan threads sharing a normalized subject (Re:/Fwd: stripped) — for mail with missing or broken References/In-Reply-To headers. Deliberately restricted to roots only; merging deeper in the tree risks conflating unrelated threads that happen to share a subject.
_link() refuses to attach a container as its own parent or ancestor (_creates_cycle() walks the candidate parent chain checking for the child before linking) — a guard against malformed/circular References data (real mail from misbehaving senders or gateways can produce a header chain that references itself) turning the tree into an infinite loop instead of just silently dropping the one bad link.

index_threads() recomputes thread_id/thread_root_id/thread_position for the whole index on every build that has any change (not just touched threads) — at personal-mailbox scale (well under a million messages) this is a few hundred milliseconds, so the complexity of a touched-threads-only incremental path wasn't worth it. thread_id is the emails.id of the earliest real message in the tree.

get_email_thread(message_id=... | thread_id=...) re-runs JWZ on just that thread's own rows (cheap — typically tens of messages) to reconstruct the actual parent/child tree for output, rather than persisting a separate parent-pointer column.

Triage heuristics

flowchart TD
    Start["Unread, unanswered,<br/>non-bulk inbox message"] --> Q{"text contains<br/>a question mark?"}
    Q -->|yes +3| Req
    Q -->|no| Req
    Req{"request cue?<br/>can you / please /<br/>review / confirm / deadline"} -->|yes +2| Urg
    Req -->|no| Urg
    Urg{"urgency cue or flagged?<br/>urgent / asap / eod"} -->|yes +3| Age
    Urg -->|no| Age
    Age["+ min 3, days unread"] --> Cmp{"score >= threshold<br/>(default 4)?"}
    Cmp -->|no| Drop["skip message"]
    Cmp -->|yes| Rank{"rank by score"}
    Rank -->|">= 7"| High["HIGH"]
    Rank -->|">= 5"| Med["MEDIUM"]
    Rank -->|else| Norm["NORMAL"]

get_needs_response accumulates +3 for a question, +2 for a request cue, +3 for urgency/flagged, plus min(3, days unread), then drops anything below the threshold and ranks the rest HIGH/MEDIUM/NORMAL.

No universally agreed definition of "needs a reply" exists in the literature — these are transparent, tunable heuristics, not a black-box classifier.

get_awaiting_reply(days_back=7, account=None) (knowledge/triage.py): scans Sent messages in the window, extracts the primary To recipient, skips anything that looks like a no-reply address (core/text.py::looks_like_noreply()), and checks whether any later message from that recipient has in_reply_to matching the sent message's id, references it, or shares its normalized subject — scoped to candidates from that recipient after the send date, to avoid a fragile substring LIKE over the references column. Sorted by longest-waiting first, capped at 20 results.

get_needs_response(days_back=7, account=None, threshold=4): scores unread, unanswered, non-bulk inbox messages —

+3 if the subject/snippet contains a question mark
+2 for a request-phrase cue (can you, please, review, confirm, deadline, ...)
+3 for an urgency cue (urgent, asap, eod, ...) or the message is flagged
+min(3, days unread)

— filters out anything from a no-reply address, ranks HIGH (≥7) / MEDIUM (≥5) / NORMAL, and reports the matched reasons alongside the score.

Both heuristics filter bulk/newsletter mail at parse time, not query time: read/emlx_parser.py:: _looks_bulk() checks List-Unsubscribe, List-Id, List-Post, Precedence: bulk|list|junk, and Auto-Submitted headers, persisted as emails.flag_bulk.

Analytics

knowledge/analytics.py — get_inbox_overview() (counts, top unread senders, needs-response/ awaiting-reply totals, newest unread), get_top_senders() (grouped/ranked by volume), get_statistics(scope ∈ {account_overview, sender_stats, mailbox_breakdown}) — all plain aggregate SQL over emails.

Contacts

knowledge/contacts.py::get_contact(address) — message count, last-contact date, and the 5 most recent messages from an address, derived purely from the index (no integration with macOS Contacts/AddressBook — out of scope).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Threading and knowledge

Threading and knowledge

JWZ threading

Triage heuristics

Analytics

Contacts

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally