Performance and benchmarks

What's actually been measured vs. what's a design target

Honesty matters more than impressive numbers here. Early timing claims were verified only against synthetic .emlx fixtures (a handful to a few dozen messages); the code paths were real, but the numbers were extrapolated rather than stress-tested. That gap has since been closed: the figures below are real measurements against a real 7-account, 210,152-message mailbox (apple-mail-mcp index build --full run to completion, then queried live), listed alongside the numbers that are still synthetic-only or extrapolated.

Measured against a real 210,152-message mailbox

Full-mailbox search, realistic/selective query ("invoice", or "meeting" scoped to subject): 9-20ms, BM25-ranked. This is the common case, and it confirms the sub-100ms design target for queries that actually narrow the corpus.
Full-mailbox search, deliberately non-selective single-word query ("the", matching 82,893 of 210,152 messages, 39% of the whole mailbox): 0.3-1.6s, not sub-100ms. This is expected FTS5/BM25 behavior when a query barely narrows the candidate set: ranking tens of thousands of rows before returning the top page is inherently more expensive than ranking a few thousand. Real-world queries are essentially never this unselective, but the earlier unqualified "sub-100ms at any scale" claim was wrong and has been corrected.
First full index build: 210,152 messages in 679.6s (IndexBuildResult.duration_sec) — about 3.2ms/message, one-time, includes HTML→text conversion and a full JWZ re-thread. failed: 0 — see Apple Mail on-disk format for the malformed-header sanitization that made this possible; the first attempt against this same mailbox crashed partway through before that fix.
get_inbox_overview, get_needs_response, get_email_thread: all sub-second, computed entirely from the local index with zero Mail.app/AppleScript involvement.
list_accounts/account fields on every read tool: resolved from raw UUID directory names to real human display names via read/account_names.py, verified against all 7 accounts on this mailbox — see Apple Mail on-disk format.
Attachment text extraction (optional [attachments] extra): a bounded backfill over 100 real attachment-bearing messages took 3.15s, or ~31.5 ms/message on this mailbox. The average includes fast rejection of the many .pdf-named attachments that aren't real PDFs; 26 of the 100 yielded extractable PDF/DOCX text. That is roughly 10x slower per message than the ~3.2 ms/msg headline index build, which is exactly why extraction is a separate, explicitly triggered low-priority backfill (index extract-attachments, or a couple of batches per --watch tick) rather than part of every build. The mailbox has 6,665 attachment-bearing messages (2,193 with a PDF/DOCX name), so a full first extraction is a one-time cost on the order of a few minutes (6,665 × 31.5 ms ≈ 3.5 minutes if the sampled rate holds), after which it's incremental.
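
To see why selectivity, not mailbox size alone, drives search latency, here is a minimal sketch using SQLite FTS5 from Python. The table name, columns, and toy rows are assumptions made up for illustration (this is not the project's schema), and it needs a Python build whose SQLite includes FTS5. The toy corpus mirrors the 39% match rate above at a much smaller scale; the counts it prints are not measurements.

```python
import sqlite3

def build_demo_index(rows):
    """Create an in-memory FTS5 table (hypothetical schema) from (subject, body) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE msgs USING fts5(subject, body)")
    conn.executemany("INSERT INTO msgs (subject, body) VALUES (?, ?)", rows)
    return conn

def candidate_count(conn, query):
    """How many rows the query matches, i.e. the candidate-set size m."""
    sql = "SELECT count(*) FROM msgs WHERE msgs MATCH ?"
    return conn.execute(sql, (query,)).fetchone()[0]

def top_page(conn, query, limit=25):
    """Rowids of the best `limit` matches. FTS5's bm25() is lower-is-better."""
    sql = "SELECT rowid FROM msgs WHERE msgs MATCH ? ORDER BY bm25(msgs) LIMIT ?"
    return [row[0] for row in conn.execute(sql, (query, limit))]

# Toy corpus: 1,000 rows, 39% of which contain the common word "the".
rows = (
    [("weekly sync", "the notes from the call")] * 390
    + [("invoice 1042", "payment is due on friday")] * 10
    + [("hello", "nothing to report")] * 600
)
conn = build_demo_index(rows)

for query in ("the", "invoice", "subject:invoice"):
    print(query, candidate_count(conn, query), len(top_page(conn, query)))
```

Both kinds of query return at most one page of 25 results, but the engine has to score every candidate first, so the "the" query does about 39 times more ranking work than the "invoice" query in this toy corpus.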

Measured (synthetic fixtures, low message counts)

search() (SearchResult.timing_ms) on a handful of indexed messages: consistently sub-millisecond to low-single-digit milliseconds — e.g. 0.24ms–0.30ms observed in end-to-end MCP tool-call tests. Consistent with the low-thousands-of-messages case of the real mailbox above.
--read-only blocking a write tool: ~1-6ms (a real regression was caught and fixed here — an earlier version took ~20 seconds because it resolved the message via JXA before checking --read-only; see Safety, confirmation & undo).
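
The --read-only regression above comes down to the order of two steps. A minimal sketch of the fixed ordering follows; Config, ReadOnlyError, run_write_tool, and the resolve/apply_change callables are hypothetical names for illustration, not the project's actual API.

```python
from dataclasses import dataclass

class ReadOnlyError(Exception):
    """Raised when a write tool is invoked while read-only mode is on."""

@dataclass
class Config:
    read_only: bool = False

def run_write_tool(config, message_ref, resolve, apply_change):
    """Reject early in read-only mode, before any expensive resolution."""
    # Cheap O(1) check first: never pay for resolution if the write is forbidden.
    if config.read_only:
        raise ReadOnlyError("write tools are disabled by --read-only")
    # Only now do the potentially slow lookup (a JXA round trip in the real project).
    target = resolve(message_ref)
    return apply_change(target)
```

With the check placed after resolve(), a rejected call would still pay the full resolution cost, which is the shape of the ~20-second regression described above.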

Still a design target, not directly measured

Index build vs. scripting Mail.app for the same scan: this project has not run a literal head-to-head against an AppleScript-based scanner. The ~3.2ms/message figure above is real; the claim that this beats AppleScript scripting by roughly an order of magnitude or more rests on the architectural difference (one process-per-AppleScript-call vs. direct file I/O in a single process) rather than a benchmark run in this repo.
--watch latency: new mail typically reflected within a couple of seconds, bounded by the 500ms debounce window plus indexing time for the batch — not yet measured against sustained real mail arrival over time.
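
To make the "debounce window plus indexing time" bound concrete, here is a small sketch of trailing-edge debouncing over integer millisecond timestamps. Whether --watch uses exactly this trailing-edge policy is an assumption, and debounce_batches and fire_time_ms are hypothetical helpers written for this page.

```python
def debounce_batches(event_times_ms, window_ms=500):
    """Group sorted event timestamps into batches.

    A batch closes once `window_ms` passes with no new event, so a burst of
    file changes is indexed once rather than once per file.
    """
    batches = []
    current = []
    for t in event_times_ms:
        if current and t - current[-1] >= window_ms:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

def fire_time_ms(batch, window_ms=500):
    """When indexing for a batch would start: one quiet window after its last event."""
    return batch[-1] + window_ms

events = [0, 100, 300, 1000, 1200]  # two bursts of file-change events
for batch in debounce_batches(events):
    print(batch, "-> indexing starts at", fire_time_ms(batch), "ms")
```

Under this policy the last message of a burst waits exactly one window before indexing begins, while the first message of a long burst waits for the whole burst plus the window, which is one reason sustained arrival needs its own measurement.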

Verified for real (not synthetic): the `osascript` subprocess mechanics

tests/test_jxa_executor.py runs real osascript calls to verify that the timeout and process-group-kill behavior actually works. The scripts don't touch Application("Mail"), so there is no Mail.app interaction and no Automation permission is needed. A deliberately hung script (while (true) {}) is killed and control returns within the configured timeout, rather than after Apple Events' own ~2-minute default wait.
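
The general technique can be sketched with the standard library alone. This is a POSIX-only illustration, not the project's executor: run_with_timeout is a hypothetical name, and the stand-in for the hung script is a Python busy loop so the sketch runs without osascript.

```python
import os
import signal
import subprocess
import sys

def run_with_timeout(argv, timeout_sec):
    """Run argv in its own process group and kill the whole group on overrun.

    Returns (returncode, timed_out).
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child leads a new session, so its pgid equals its pid
    )
    try:
        proc.wait(timeout=timeout_sec)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        try:
            # Signal the group, not just the child, so helper processes die too.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited on its own between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return proc.returncode, True

# Stand-in for a hung script. With osascript the argv would be something like
# ["osascript", "-l", "JavaScript", "-e", "while (true) {}"].
HUNG = [sys.executable, "-c", "while True: pass"]
```

Calling run_with_timeout(HUNG, 0.5) should return control after about half a second with timed_out set to True, instead of blocking for as long as the child keeps spinning.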

Why it's fast — the complexity math

Not "trust us, it's fast" — every claim below either derives from the actual algorithm each subsystem runs, or is checked against the real 210,152-message measurements above. N = total indexed messages, M = mailbox size as Mail.app's own JXA layer would see it, m = a search's candidate-set size, K = requested result-page size (limit, default 25), Δ = messages changed since the last index build.

| Operation | Complexity | Why | Real evidence |
| --- | --- | --- | --- |
| `get_email` (single) | `O(log N)` index lookup + `O(1)` file read | B-tree row lookup by indexed column, then one bounded-size `.emlx` read; the read cost doesn't grow with mailbox size | sub-ms, measured |
| `search`, selective query | `O(m log K)`, `m ≪ N` | BM25 ranks only the m documents containing the (rare) term, via a size-K min-heap, not a full sort | 9-20ms, measured (`"invoice"`, `m`=2,745) |
| `search`, non-selective query | `O(N log K)` | a common term's candidate set ≈ the whole corpus (see Search for the IDF math) | 289.6ms-1.6s, measured (`"the"`, `m`=82,893) |
| `mode=hybrid` (RRF fusion) | `O(K log K)` | merging two already-small top-K lists by rank, not by re-scoring anything | negligible vs. either underlying search (see Search) |
| `index build --full` | `O(N)` | one parse + one UPSERT per message, batched | 3.23ms/message, measured (679.6s / 210,152) |
| `index build` (incremental / `--watch`) | `O(N)` diff + `O(Δ)` work | enumerating the current state is linear in mailbox size; only the changed subset gets (expensive) reparsing (see Indexing and watch) | design target (not yet load-tested against sustained arrivals) |
| JWZ re-threading (full) | `O(N)` average, `O(N log N)` worst case | one pass building containers + hash-map linking by Message-ID/References; the subject-fallback merge needs a sort of the orphan roots | see Threading and knowledge |
| `resolve()` (write-target lookup) | `O(1)` amortized (hint/cache/seed hit), `O(M)` worst case (broad scan) | a scoped JXA call costs Mail.app one mailbox's worth of searching; only the last-resort broad scan is unscoped | see Identity & resolution |
| `--read-only` rejection | `O(1)` | a config check runs before resolution or any JXA call | ~1-6ms, measured (was ~20s before a real regression fix) |
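
Two rows of the table can be sketched in a few lines of Python: size-K min-heap selection and reciprocal rank fusion (RRF). These are generic textbook versions, not the project's code. The function names are hypothetical, scores are assumed to be higher-is-better, and the RRF constant c=60 is a commonly used default rather than a value taken from this project.

```python
import heapq

def top_k(scored, k=25):
    """Keep the k best (score, doc_id) pairs with a size-k min-heap: O(m log k)."""
    heap = []
    for item in scored:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The heap root is the weakest of the current best k; replace it.
            heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)

def rrf_fuse(ranked_lists, k=25, c=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (c + rank of d)."""
    fused = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Only ranks are used; no document is re-scored against the query.
    return heapq.nlargest(k, fused, key=fused.get)

scored = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
print(top_k(scored, k=2))  # [(0.9, 'b'), (0.7, 'd')]
print(rrf_fuse([["a", "b", "c"], ["b", "d", "a"]], k=3))  # ['b', 'a', 'd']
```

The heap never holds more than k items, so the cost per candidate is O(log k) and the total scales with the candidate count m. Fusion touches only the two short input lists, which is why it is negligible next to either underlying search.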

Throughput, derived from the one real data point we have

c = T / N = 679.5977 s / 210,152 messages ≈ 3.234 ms/message
throughput = N / T ≈ 309.2 messages/second

Treating the indexer as T(N) ≈ c·N (justified by the O(N) complexity above, and by failed: 0 meaning no per-message retries skewed the constant) gives a linear projection — explicitly not a second measurement — for mailbox sizes not tested here:

xychart-beta
    title "First full-build time vs. mailbox size (1 measured point, rest linear projection)"
    x-axis ["100k (proj.)", "210k (measured)", "500k (proj.)", "1M (proj.)"]
    y-axis "minutes" 0 --> 60
    bar [5.4, 11.3, 26.9, 53.9]

If a real mailbox at one of the projected sizes is ever benchmarked, replace the corresponding bar with a measured value rather than trusting the line — c was fit to exactly one machine, one disk, one mailbox's content mix (attachment sizes, HTML-vs-plaintext ratio all affect it).
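
The projection arithmetic is easy to reproduce. The sketch below uses only the one measured data point quoted on this page; the function names are made up for illustration, and the output inherits every caveat above, since it is the same linear extrapolation and not a new measurement.

```python
MEASURED_SECONDS = 679.5977   # IndexBuildResult.duration_sec from the real run
MEASURED_MESSAGES = 210_152

def per_message_ms(seconds=MEASURED_SECONDS, messages=MEASURED_MESSAGES):
    """The constant c in T(N) ≈ c·N, in milliseconds per message."""
    return seconds / messages * 1000.0

def messages_per_second(seconds=MEASURED_SECONDS, messages=MEASURED_MESSAGES):
    return messages / seconds

def projected_minutes(n_messages, seconds=MEASURED_SECONDS, messages=MEASURED_MESSAGES):
    """Linear projection of first full-build time for a mailbox of n_messages."""
    return n_messages * (seconds / messages) / 60.0

print(round(per_message_ms(), 3))       # 3.234
print(round(messages_per_second(), 1))  # 309.2
for n in (100_000, 210_152, 500_000, 1_000_000):
    print(n, round(projected_minutes(n), 1))  # 5.4, 11.3, 26.9, 53.9
```

To project from your own run, pass your own seconds and messages instead of the defaults.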

Running your own benchmark

apple-mail-mcp index build --full --verbose
apple-mail-mcp index status        # total_indexed, dead_letter_count, embed coverage
time apple-mail-mcp search "some term" --highlight

If you benchmark against your own mailbox and the numbers differ meaningfully from the design targets above, that's genuinely useful signal — please open an issue with index status output and mailbox size.
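
A single `time` run is noisy, so it can help to repeat a command and report the spread. Here is a small sketch using only the Python standard library; time_command and summarize are hypothetical helpers, not part of apple-mail-mcp. Wall-clock time measured this way includes process startup, so it will read higher than the in-process SearchResult.timing_ms figures quoted above.

```python
import statistics
import subprocess
import time

def time_command(argv, runs=5):
    """Run argv `runs` times and return wall-clock durations in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations

def summarize(durations_ms):
    """Min, median, and max, which are more robust to report than a single run."""
    return {
        "min": min(durations_ms),
        "median": statistics.median(durations_ms),
        "max": max(durations_ms),
    }

# On a machine with the CLI installed, for example:
#   summarize(time_command(["apple-mail-mcp", "search", "some term", "--highlight"]))
```

Including the min, median, and max in an issue report alongside the index status output makes it easier to tell a real difference from run-to-run noise.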
