-
Notifications
You must be signed in to change notification settings - Fork 0
Threading and knowledge
covers:
- src/cobos_apple_mail_mcp/read/threader.py
- src/cobos_apple_mail_mcp/knowledge/*.py last_verified: 2026-06-30
Everything on this page is computed from index.db — never by scripting Mail.app, which would
be orders of magnitude slower for the same aggregate/graph queries.
read/threader.py implements the classic JWZ (Jamie Zawinski, 1997) algorithm, still the basis
of most mail clients' thread reconstruction, run entirely against indexed message_id/
in_reply_to/references_ids columns (no .emlx reparse needed):
- Build a container per message, keyed by Message-ID. A message's
References(or, if absent,In-Reply-To) chain links containers parent→child in order; the message itself becomes a child of the last reference in its chain. - References to a Message-ID with no corresponding row become phantom containers — they still correctly group their real children together (e.g. two replies to a message we don't have, perhaps because it lives in an unindexed mailbox).
- A root-level-only fallback (
_merge_orphan_roots_by_subject()) merges orphan threads sharing a normalized subject (Re:/Fwd: stripped) — for mail with missing or broken References/In-Reply-To headers. Deliberately restricted to roots only; merging deeper in the tree risks conflating unrelated threads that happen to share a subject.
index_threads() recomputes thread_id/thread_root_id/thread_position for the whole
index on every build that has any change (not just touched threads) — at personal-mailbox scale
(well under a million messages) this is a few hundred milliseconds, so the complexity of a
touched-threads-only incremental path wasn't worth it. thread_id is the emails.id of the
earliest real message in the tree.
get_email_thread(message_id=... | thread_id=...) re-runs JWZ on just that thread's own rows
(cheap — typically tens of messages) to reconstruct the actual parent/child tree for output,
rather than persisting a separate parent-pointer column.
No universally agreed definition of "needs a reply" exists in the literature — these are transparent, tunable heuristics, not a black-box classifier.
get_awaiting_reply(days_back=7, account=None) (knowledge/triage.py): scans Sent messages
in the window, extracts the primary To recipient, skips anything that looks like a no-reply
address (core/text.py::looks_like_noreply()), and checks whether any later message from that
recipient has in_reply_to matching the sent message's id, references it, or shares its
normalized subject — scoped to candidates from that recipient after the send date, to avoid a
fragile substring LIKE over the references column. Sorted by longest-waiting first, capped at
20 results.
get_needs_response(days_back=7, account=None, threshold=4): scores unread, unanswered,
non-bulk inbox messages —
- +3 if the subject/snippet contains a question mark
- +2 for a request-phrase cue (
can you,please,review,confirm,deadline, ...) - +3 for an urgency cue (
urgent,asap,eod, ...) or the message is flagged - +min(3, days unread)
— filters out anything from a no-reply address, ranks HIGH (≥7) / MEDIUM (≥5) / NORMAL,
and reports the matched reasons alongside the score.
Both heuristics filter bulk/newsletter mail at parse time, not query time: read/emlx_parser.py:: _looks_bulk() checks List-Unsubscribe, List-Id, List-Post, Precedence: bulk|list|junk,
and Auto-Submitted headers, persisted as emails.flag_bulk.
knowledge/analytics.py — get_inbox_overview() (counts, top unread senders, needs-response/
awaiting-reply totals, newest unread), get_top_senders() (grouped/ranked by volume),
get_statistics(scope ∈ {account_overview, sender_stats, mailbox_breakdown}) — all plain
aggregate SQL over emails.
knowledge/contacts.py::get_contact(address) — message count, last-contact date, and the 5 most
recent messages from an address, derived purely from the index (no integration with macOS
Contacts/AddressBook — out of scope).