feat(llma): unify provider sanitisation under a generic binary content redactor #3539
Conversation
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
packages/ai/tests/binary_content_redactor.test.ts:19-38
**Non-parameterised data-URL tests**
The four `it` blocks (image, audio, video, application) each verify the same invariant — "data URL is redacted and the MIME type is preserved" — with only the input MIME changing. Per the team's rule, these should be a single `it.each`. The same pattern repeats for the three sibling-MIME hint cases on lines 42–55 (`mediaType`, `media_type`, `mimeType`), which could similarly collapse into one parameterised block.
### Issue 2 of 2
packages/ai/src/sanitization/base64_recognizer.ts:1-9
**Parameterised data URLs with MIME params not recognised**
`DATA_URL_PREFIX_RE` matches `data:<type>;base64,<data>` but not data URLs where extra parameters appear before the `base64` token, e.g. `data:audio/L16;codec=pcm;rate=24000;base64,<data>`. For such a value the regex fails, the string falls through to the raw-base64 alphabet check, and if it's shorter than `minLength` it escapes redaction entirely. The test suite already contains `audio/L16;codec=pcm;rate=24000` as a real MIME hint (via a sibling `mimeType` field), so the practical exposure is low — but if a caller ever sends this media type embedded in a data URL, it will be silently missed.
Reviews (1): Last reviewed commit: "feat: changeset"
Size Change: +30.4 kB (+0.19%) Total Size: 15.9 MB
```ts
import { Base64Recognizer } from './base64_recognizer'
import { MediaTypeContext } from './media_type_context'

const STRONG_CONTEXT_MIN_LENGTH = 64
```
Why the minimum length? Could we have smaller but still base64?
It's very unlikely to have such small base64 files, and if we did it wouldn't be so bad to ingest them. On the flip side, without this we could accidentally redact fields that look like (or are) base64 but that we want to keep: IDs, encoded JSON payloads, or even alphanumeric strings without spaces.
```ts
export const sanitizeLangChain = (data: unknown): unknown => {
  return processMessages(data, sanitizeLangChainImage)
}

export const sanitizeOpenAI = (data: unknown): unknown => redactor.redact(data)
```
Is this being done for streaming OpenAI too?
Yes, I didn't change any callsites for these functions, just their internal implementation.
Here it is in chat completions and here it is in the responses API, both for streaming and in master.
We don't touch these files in this PR, so does this still look off?
Problem
`sanitization.ts` carried six near-duplicate provider walkers — one per shape we encounter (`image_url`, `inlineData`, `source.data`, etc.). Every new provider needed another walker and every new content shape needed each walker patched, so coverage gaps were inevitable. The placeholder also lied: every redaction came out as `[base64 image redacted]`, even for audio, video, or PDF data. Several users were already encountering cases where redaction was not happening, causing events to fail due to size limits.
Changes
New `sanitization/` module that consolidates the redaction logic:

- `Base64Recognizer` — single recognizer for data URLs (with media-type capture) and raw base64. Uses a confidence-prefix alphabet check on `minLength` chars, with a `startsWith('data:')` fast-reject up front.
- `MediaTypeContext` — derives the most specific MIME from sibling/parent metadata or the field name. The inference ladder is split into public methods (`inferFromSiblingMime`, `inferFromSiblingFormat`, `inferFromParentType`, `inferFromKey`).
- `BinaryContentRedactor` — generic walker. Redacts data URLs, raw base64 (length floor 64 with strong context, 1024 without), and `Uint8Array`/`Buffer` values. Cycle-safe via a `WeakSet` it owns. Skips redaction when `_INTERNAL_LLMA_MULTIMODAL` is set.

Provider sanitizers (`sanitizeOpenAI`, `sanitizeAnthropic`, `sanitizeGemini`, `sanitizeLangChain`, `sanitizeOpenAIResponse`, `redactBase64DataUrl`) collapse into thin wrappers over a shared redactor instance — call sites in `openai/`, `anthropic/`, `gemini/`, `langchain/`, `vercel/`, and `utils.ts` are unchanged.

The entire original test suite of these sanitisers has been kept almost intact and still passes. The only minor modifications accommodate the new, more informative placeholders and a fix for an issue where short alphanumeric strings, like single words, could be accidentally redacted.
Placeholders now carry the actual content type:
- `[base64 image/jpeg redacted]`, `[base64 image/png redacted]`, `[base64 video/mp4 redacted]` for data URLs
- `[base64 application/pdf redacted]`, `[base64 audio/L16;codec=pcm;rate=24000 redacted]` from sibling MIME hints
- `[base64 audio redacted]` when only the family is known
- `[base64 redacted]` for bare-string entry points with no parent context

Commits are split for review:
- `feat: implement generic binary content redactor` — adds the new classes + 37 unit tests in isolation, no callers changed.
- `feat: switch sanitisiation to generic redactor` — rewires `sanitization.ts` to delegate, updates legacy test placeholder assertions to the precise outputs.

Release info

Sub-libraries affected
Libraries affected
Checklist
If releasing new changes
- `pnpm changeset` to generate a changeset file

🤖 Agent context
Built collaboratively with Claude Code over multiple iterations. Notable design calls during the session:
- `BinaryContentRedactor` owns its `WeakSet` and resets it per `redact()` call rather than threading `visited` through `walk()`.
- `MediaTypeContext.forChild` was removed; callers use the constructor directly.
- Placeholders carry the detected media type instead of a fixed `[base64 image redacted]`, so audio/video/PDF data is no longer mislabelled "image".