fix: prevent Sentry noise from embedding failures and fix surrogate pair truncation #344
Merged
…air truncation

Add isAvailable() guard to fire-and-forget embedding functions (embedKnowledgeEntry, embedDistillation, embedTemporalMessage) so they short-circuit when the provider is broken instead of spamming log.error on every call.

Break backfill loops on LocalProviderUnavailableError, since there is no point retrying the remaining batches when the provider is dead.

Extract safeLocalTruncate() helper that avoids splitting UTF-16 surrogate pairs at the truncation boundary.
Summary
Follow-up to #343. Addresses Sentry noise and a surrogate pair edge case found during self-review.
- Add `isAvailable()` guard to fire-and-forget embedding functions to short-circuit when provider is broken
- Break backfill loops on `LocalProviderUnavailableError` to avoid O(batches) Sentry events per startup
- Extract `safeLocalTruncate()` helper that avoids splitting UTF-16 surrogate pairs at the truncation boundary

Problem
PR #343 upgraded `log.info` to `log.error` in fire-and-forget embedding catches (`embedKnowledgeEntry`, `embedDistillation`, `embedTemporalMessage`). But when the local provider is broken, every single call to these functions would throw and fire `log.error` → `captureException()`, potentially 50-200+ Sentry events per session.

Similarly, the backfill loops would retry every batch even after the first one fails with `LocalProviderUnavailableError`, producing O(items/batchSize) Sentry events on startup.

The `String.slice()` truncation could also split a UTF-16 surrogate pair (emoji, CJK supplementary chars), producing an invalid lone surrogate passed to the tokenizer.

Changes
| Location | Change |
| --- | --- |
| `embedKnowledgeEntry()` | `if (!isAvailable()) return;` early exit |
| `embedDistillation()` | `if (!isAvailable()) return;` early exit |
| `embedTemporalMessage()` | `if (!isAvailable()) return;` early exit |
| `backfillEmbeddings()` catch | `break` on `LocalProviderUnavailableError` |
| `backfillDistillationEmbeddings()` catch | `break` on `LocalProviderUnavailableError` |
| `LocalProvider.embed()` | `safeLocalTruncate()` helper instead of raw `slice()` |
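For readers without the diff open, the guard and the loop break can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Provider` interface, the `embedFireAndForget` and `backfill` names, the `onError` callback (standing in for `log.error` → `captureException()`), and the local `LocalProviderUnavailableError` class are all hypothetical stand-ins, and reporting the error once before breaking is an assumption.

```typescript
// Stand-in for the real error class; the actual definition is not shown in this PR text.
class LocalProviderUnavailableError extends Error {
  constructor(message = "local provider unavailable") {
    super(message);
    this.name = "LocalProviderUnavailableError";
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical provider shape, assumed for illustration only.
interface Provider {
  isAvailable(): boolean;
  embed(texts: string[]): Promise<number[][]>;
}

type ErrorSink = (err: unknown) => void;

// Fire-and-forget embed: skip silently when the provider is known to be broken,
// so a dead provider does not produce one error report per call.
async function embedFireAndForget(
  provider: Provider,
  text: string,
  onError: ErrorSink,
): Promise<void> {
  if (!provider.isAvailable()) return; // early exit
  try {
    await provider.embed([text]);
  } catch (err) {
    onError(err);
  }
}

// Backfill: report the failure, then stop as soon as the provider is unavailable
// instead of retrying (and reporting) every remaining batch.
async function backfill(
  provider: Provider,
  items: string[],
  batchSize: number,
  onError: ErrorSink,
): Promise<number> {
  let embedded = 0;
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    try {
      await provider.embed(batch);
      embedded += batch.length;
    } catch (err) {
      onError(err);
      if (err instanceof LocalProviderUnavailableError) break;
    }
  }
  return embedded;
}
```

With a provider that always throws `LocalProviderUnavailableError`, this sketch reports one error per backfill run rather than one per batch, which is the behaviour the Changes table describes.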
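The surrogate pair fix can be illustrated with a small sketch. The signature below (a string plus a maximum length in UTF-16 code units) is an assumption, and the real `safeLocalTruncate()` may handle the boundary differently; the idea shown is only that a cut landing between a high and a low surrogate is moved back by one code unit.

```typescript
// Hypothetical sketch of a surrogate-safe truncation helper.
// maxLength is assumed to be measured in UTF-16 code units, like String.slice().
function safeLocalTruncate(text: string, maxLength: number): string {
  if (maxLength <= 0) return "";
  if (text.length <= maxLength) return text;

  let end = maxLength;
  // If the last kept code unit is a high surrogate (0xD800-0xDBFF),
  // its low surrogate falls past the cut, so drop the high surrogate too.
  const last = text.charCodeAt(end - 1);
  if (last >= 0xd800 && last <= 0xdbff) {
    end -= 1;
  }
  return text.slice(0, end);
}
```

For example, `"ab😀cd".slice(0, 3)` ends with a lone high surrogate, while `safeLocalTruncate("ab😀cd", 3)` returns `"ab"` and `safeLocalTruncate("ab😀cd", 4)` keeps the whole emoji.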