Improve onboarding imports and graph summaries #6362
Greptile Summary
This PR improves the onboarding Gmail import by merging a Gmail-bootstrap-page scan with per-label Atom feeds for higher email coverage, adds branded connector icons (Google Calendar, Gmail), and adds a "Who you are" footer panel to the second-brain graph pane.
Confidence Score: 4/5
Safe to merge after resolving the dedup key mismatch in readRecentEmails. One P1 logic issue: duplicate emails can survive the ID-based merge and be sent to the LLM synthesis step, producing inflated memory counts. All other findings are P2 style/cleanup items that don't affect correctness. GmailReaderService.swift: the readRecentEmails merge block and fetchGmailViaLabelFeeds.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant OC as OnboardingCoordinator
participant GRS as GmailReaderService
participant BS as Bootstrap (Gmail HTML)
participant LF as Label Feeds (13 Atom feeds)
participant LLM as LLM (synthesizeFromEmails)
OC->>GRS: readRecentEmails(maxResults:300, query:"newer_than:365d")
GRS->>BS: fetchGmailViaAtomFeedSingle(allowBootstrap:true)
BS-->>GRS: bootstrapEmails [real Gmail IDs]
GRS->>LF: fetchGmailViaLabelFeeds(maxResults:300)
loop each of 13 feed paths
LF->>BS: fetchGmailViaAtomFeedSingle(feedPath, allowBootstrap:false)
BS-->>LF: atom emails [atom_sha1 IDs if no message_id in URL]
end
LF-->>GRS: labelEmails [mixed ID formats]
GRS->>GRS: merge by email.id (ID mismatch = duplicates survive)
GRS-->>OC: merged emails (may contain duplicates)
OC->>LLM: synthesizeFromEmails(emails.prefix(120))
LLM-->>OC: memories, tasks, profileSummary
    var merged: [String: GmailEmail] = [:]
    for email in bootstrapEmails + labelEmails {
        let existing = merged[email.id]
        if existing == nil || existing!.date < email.date {
            merged[email.id] = email
        }
    }
    emails = Array(merged.values)
        .sorted { $0.date > $1.date }
        .prefix(maxResults)
        .map(\.self)
Dedup key mismatch between bootstrap and Atom-feed emails
The bootstrapEmails path extracts real Gmail thread/message IDs (hex strings like 1958abc…), while the Atom-feed path (fetchGmailViaLabelFeeds) generates atom_<sha1> IDs for any entry whose link URL lacks /message_id=. Because the two sources produce structurally incompatible IDs for the same email, the dictionary merge keyed on email.id will not detect those duplicates—the same message can end up in the result set twice. When the merged array is later passed to synthesizeFromEmails, the LLM receives duplicate content, inflating memory counts and wasting tokens.
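One way to close this gap is to key the merge on something both sources agree on instead of on email.id. The sketch below is illustrative only: `Email`, `dedupKey(for:)`, and `mergeDeduplicated(_:maxResults:)` are hypothetical names, not part of the PR, and it assumes that sender, subject, and a minute-level timestamp are enough to identify a message and that both sources report those fields consistently. Timestamps that straddle a minute boundary would still slip through, so a real fix may need a wider tolerance or a proper message identifier.

```swift
import Foundation

// Hypothetical stand-in for the PR's GmailEmail; only the fields used here.
struct Email {
    let id: String
    let sender: String
    let subject: String
    let date: Date
}

// Source-independent key: the same message maps to the same key whether its
// id is a real Gmail hex id or a synthetic "atom_<sha1>" id.
// Assumption: sender + subject + timestamp floored to the minute identify a message.
func dedupKey(for email: Email) -> String {
    let minute = Int(email.date.timeIntervalSince1970 / 60)
    let sender = email.sender.lowercased().trimmingCharacters(in: .whitespaces)
    let subject = email.subject.lowercased().trimmingCharacters(in: .whitespaces)
    return "\(sender)|\(subject)|\(minute)"
}

func mergeDeduplicated(_ emails: [Email], maxResults: Int) -> [Email] {
    var merged: [String: Email] = [:]
    for email in emails {
        let key = dedupKey(for: email)
        if let existing = merged[key] {
            // On a collision, keep the copy that carries a real Gmail id.
            if existing.id.hasPrefix("atom_") && !email.id.hasPrefix("atom_") {
                merged[key] = email
            }
        } else {
            merged[key] = email
        }
    }
    // Newest first, capped at maxResults.
    return Array(merged.values.sorted { $0.date > $1.date }.prefix(maxResults))
}
```

Preferring the non-synthetic id on a collision keeps downstream lookups by Gmail id working, at the cost of discarding whichever fields only the Atom copy carried.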
    private func fetchGmailViaLabelFeeds(maxResults: Int) throws -> [GmailEmail] {
        guard maxResults > 0 else { return [] }

        let feedPaths = [
            "atom/all",
            "atom/inbox",
            "atom/sent",
            "atom/starred",
            "atom/important",
            "atom/trash",
            "atom/spam",
            "atom/unread",
            "atom/social",
            "atom/promotions",
            "atom/updates",
            "atom/forums",
            "atom/personal",
        ]

        var merged: [String: GmailEmail] = [:]
        for feedPath in feedPaths {
            let feedEmails = try fetchGmailViaAtomFeedSingle(
                maxResults: min(20, maxResults),
                query: "newer_than:1d",
                feedPath: feedPath,
                allowBootstrap: false
            )
            for email in feedEmails {
fetchGmailViaLabelFeeds ignores the caller-supplied query
Every call inside the loop passes query: "newer_than:1d" as a hard-coded string. However, because feedPath is also set, the Python script takes the feedPath branch and builds the URL from the feed path alone—the query argument is never used. The hard-coded value is dead code but also misleading: a reader might assume these feeds are limited to the past day, whereas they actually return the N most recent items in each Gmail folder regardless of date. Consider removing the query argument from these calls (pass query: "" or add a dedicated parameter) to make the intent explicit.
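A dedicated parameter could take the form of a source type that makes "search query" and "feed path" mutually exclusive, so a feed request cannot carry a query that is silently ignored. This is a minimal sketch under that assumption: `GmailFetchSource`, `FetchRequest`, and `labelFeedRequests(feedPaths:maxResults:)` are hypothetical names and do not appear in the PR. The per-feed cap of 20 mirrors the `min(20, maxResults)` in the snippet above.

```swift
// Hypothetical request model: a fetch targets either a search query
// or a feed path, never both.
enum GmailFetchSource: Equatable {
    case search(query: String)
    case feed(path: String)
}

struct FetchRequest {
    let source: GmailFetchSource
    let maxResults: Int
}

// Builds one request per label feed. No query is attached, which matches
// what the review says actually happens: each feed returns its most
// recent items regardless of date.
func labelFeedRequests(feedPaths: [String], maxResults: Int) -> [FetchRequest] {
    guard maxResults > 0 else { return [] }
    return feedPaths.map { path in
        FetchRequest(source: .feed(path: path), maxResults: min(20, maxResults))
    }
}
```

With this shape, the call site reads as `labelFeedRequests(feedPaths: ["atom/inbox", "atom/sent"], maxResults: 300)`, and there is no query argument left for a reader to misread as a one-day limit.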
* Read Gmail bootstrap inbox before Atom fallback
* Improve onboarding imports and graph summaries