
fix: prevent Sentry noise from embedding failures and fix surrogate pair truncation #344

Merged: BYK merged 1 commit into main from fix/embedding-sentry-noise on May 15, 2026


BYK commented on May 15, 2026

Summary

Follow-up to #343. Addresses Sentry noise and a surrogate pair edge case found during self-review.

  • Add an isAvailable() guard to the fire-and-forget embedding functions so they short-circuit when the provider is broken
  • Break backfill loops on LocalProviderUnavailableError to avoid O(batches) Sentry events per startup
  • Extract safeLocalTruncate() helper that avoids splitting UTF-16 surrogate pairs at the truncation boundary

Problem

PR #343 upgraded log.info to log.error in the fire-and-forget embedding catch blocks (embedKnowledgeEntry, embedDistillation, embedTemporalMessage). But when the local provider is broken, every call to these functions throws and fires log.error → captureException(), potentially producing 50-200+ Sentry events per session.
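A minimal sketch of the guard, to make the short-circuit concrete. Only the names isAvailable(), embedKnowledgeEntry() and log.error come from this PR; the providerBroken flag, the embed() stand-in, the store callback and the reportedErrors array are assumptions made for illustration, and the real signatures may differ.

```typescript
// Hypothetical module state: set once the local provider is known to be broken.
let providerBroken = false;
let embedCalls = 0;

function isAvailable(): boolean {
  return !providerBroken;
}

// Stand-in for the real provider call; rejects while the provider is broken.
function embed(text: string): Promise<number[]> {
  embedCalls += 1;
  return providerBroken
    ? Promise.reject(new Error("local provider unavailable"))
    : Promise.resolve([text.length]);
}

// Stand-in for the project's logger, where log.error also reports to Sentry.
const reportedErrors: unknown[] = [];
const log = {
  error(message: string, err: unknown): void {
    reportedErrors.push({ message, err });
  },
};

// Fire-and-forget: callers do not await the result.
function embedKnowledgeEntry(
  text: string,
  store: (vector: number[]) => void,
): void {
  if (!isAvailable()) return; // broken provider: no call, no error event

  embed(text)
    .then(store)
    .catch((err: unknown) => log.error("embedding failed", err));
}
```

With the guard in place, a broken provider costs one boolean check per call rather than one thrown error and one reported event per call.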

Similarly, the backfill loops would keep attempting every remaining batch even after the first one failed with LocalProviderUnavailableError, producing O(items/batchSize) Sentry events on startup.
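The loop change can be sketched roughly as follows. The error class name and backfillEmbeddings() are taken from this PR, but the signature, the embedBatch callback, the reportError() stand-in and the choice to continue on other error types are assumptions for illustration only.

```typescript
// Hypothetical definition mirroring the error name used in this PR.
class LocalProviderUnavailableError extends Error {}

// Counts error reports; stands in for log.error, which also reports to Sentry.
let errorReports = 0;
function reportError(_err: unknown): void {
  errorReports += 1;
}

// Embeds items in batches and returns how many items were embedded.
async function backfillEmbeddings(
  items: string[],
  batchSize: number,
  embedBatch: (batch: string[]) => Promise<void>,
): Promise<number> {
  let done = 0;
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    try {
      await embedBatch(batch);
      done += batch.length;
    } catch (err) {
      reportError(err);
      // A dead provider would fail every remaining batch too,
      // so stop after this single report.
      if (err instanceof LocalProviderUnavailableError) break;
      // Other errors may be specific to this batch; move on to the next one.
    }
  }
  return done;
}
```

Under these assumptions a dead provider produces one report per backfill run instead of one per batch.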

The String.slice() truncation could also split a UTF-16 surrogate pair (emoji, CJK supplementary characters), leaving a lone surrogate, which is not well-formed Unicode text, to be passed to the tokenizer.
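One way to avoid the split is to back off by one code unit when the cut would land between the two halves of a pair. This is a sketch of that technique, not the body of safeLocalTruncate(), which this description does not show; the function name and the maxLength parameter (counted in UTF-16 code units) are assumptions.

```typescript
// Truncates to at most maxLength UTF-16 code units without leaving
// a dangling high surrogate at the end of the result.
function truncateAtCodePointBoundary(text: string, maxLength: number): string {
  if (text.length <= maxLength) return text;

  let end = Math.max(0, maxLength);
  if (end > 0) {
    const last = text.charCodeAt(end - 1);
    // High surrogates occupy 0xD800..0xDBFF. If the last kept unit is one,
    // its low surrogate falls outside the cut, so drop the high half as well.
    if (last >= 0xd800 && last <= 0xdbff) end -= 1;
  }
  return text.slice(0, end);
}

// "ab" followed by U+1F600, which is the pair \uD83D\uDE00 (4 code units total).
const sample = "ab\uD83D\uDE00";
const rawCut = sample.slice(0, 3); // "ab\uD83D": ends in a lone high surrogate
const safeCut = truncateAtCodePointBoundary(sample, 3); // "ab"
```

The result can be one code unit shorter than maxLength, which is harmless when the limit is an upper bound on input size.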

Changes

| Location | Fix |
| --- | --- |
| embedKnowledgeEntry() | Add if (!isAvailable()) return; early exit |
| embedDistillation() | Add if (!isAvailable()) return; early exit |
| embedTemporalMessage() | Add if (!isAvailable()) return; early exit |
| backfillEmbeddings() catch | break on LocalProviderUnavailableError |
| backfillDistillationEmbeddings() catch | break on LocalProviderUnavailableError |
| LocalProvider.embed() | Use safeLocalTruncate() helper instead of raw slice() |

…air truncation

Add isAvailable() guard to fire-and-forget embedding functions
(embedKnowledgeEntry, embedDistillation, embedTemporalMessage) so they
short-circuit when the provider is broken instead of spamming log.error
on every call.

Break backfill loops on LocalProviderUnavailableError — no point
retrying remaining batches when the provider is dead.

Extract safeLocalTruncate() helper that avoids splitting UTF-16
surrogate pairs at the truncation boundary.
@BYK BYK merged commit ee56309 into main May 15, 2026
7 checks passed
@BYK BYK deleted the fix/embedding-sentry-noise branch May 15, 2026 15:16
This was referenced May 15, 2026