fix(embeddings): make sure dimensions is correct#2
Conversation
Fix by GLM-4.7.
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent incorrect embedding vectors from propagating through similarity search by adding validation and normalizing stored embeddings to a consistent binary format.
Changes:
- Add input and output-dimension validation to
embed()to ensure embeddings matchEMBEDDING_DIM. - Filter out candidates with missing/empty embeddings during
findTopK. - Normalize
embeddingBLOBs from the DB intoBufferinstances ingetAllActiveNodesWithEmbeddings().
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/embeddings.ts |
Adds embed() validation and tightens candidate filtering for similarity search. |
src/db.ts |
Ensures embeddings returned from SQLite are converted to Buffer for downstream math. |
Comments suppressed due to low confidence (1)
src/embeddings.ts:112
findTopKnow filters out empty embeddings, but it can still throw if a candidate embedding has a different byte length thanqueryEmbedding(cosineSimilarity enforces equal dimensions). If the DB can contain embeddings from older models or corrupted rows (which this PR is addressing), consider filtering candidates byc.embedding.length === queryEmbedding.length(orEMBEDDING_DIM * 4) so one bad row doesn’t fail the entire search.
const scored = candidates
.filter((c) => c.embedding !== null && c.embedding.length > 0)
.map((c) => ({
id: c.id,
similarity: cosineSimilarity(queryEmbedding, c.embedding!),
}))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| export async function embed(text: string): Promise<Buffer> { | ||
| // Validate input text | ||
| if (!text || text.trim().length === 0) { | ||
| throw new Error(`Cannot embed empty text`); | ||
| } | ||
|
|
||
| const ext = await getExtractor(); | ||
| const output = await ext(text, { pooling: "mean", normalize: true }); | ||
| const data = output.data as Float32Array; | ||
| return Buffer.from(data.buffer, data.byteOffset, data.byteLength); | ||
| const buffer = Buffer.from(data.buffer, data.byteOffset, data.byteLength); | ||
|
|
||
| // Validate embedding has correct dimensions | ||
| if (buffer.length !== EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT) { | ||
| throw new Error( | ||
| `Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Text: "${text.substring(0, 50)}..."` | ||
| ); | ||
| } |
There was a problem hiding this comment.
There are existing unit tests for src/embeddings.ts, but the newly added behavior in embed() (rejecting empty text and validating output dimensions) isn’t covered. Adding tests for (1) empty/whitespace input throwing, and (2) dimension mismatch (by mocking @xenova/transformers pipeline output) would prevent regressions and make the intent of this PR explicit.
| const rows = this.db | ||
| .prepare( | ||
| "SELECT id, name, kind, summary, embedding FROM nodes WHERE removed_at IS NULL AND embedding IS NOT NULL" | ||
| ) | ||
| .all() as Array< | ||
| Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding"> | ||
| >; | ||
| .all() as Array<any>; | ||
|
|
||
| return rows.map((row) => ({ | ||
| ...row, | ||
| embedding: row.embedding ? Buffer.from(row.embedding) : null | ||
| })) as Array<Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">>; |
There was a problem hiding this comment.
getAllActiveNodesWithEmbeddings() now uses .all() as Array<any> and then casts the mapped result back to Pick<NodeRow,...>. This loses type-safety and can hide shape issues. Consider defining an explicit row type for this query (including the actual BLOB type returned by libsql, e.g. Uint8Array) and mapping to a properly typed result instead of using any/double-casting.
|
|
||
| return rows.map((row) => ({ | ||
| ...row, | ||
| embedding: row.embedding ? Buffer.from(row.embedding) : null |
There was a problem hiding this comment.
Buffer.from(row.embedding) will copy if row.embedding is already a Buffer. If libsql sometimes returns Buffer and sometimes Uint8Array, consider avoiding the extra allocation by only converting when needed (e.g., row.embedding instanceof Buffer ? row.embedding : Buffer.from(row.embedding)). This method can be called on every understand query and may touch many rows.
| embedding: row.embedding ? Buffer.from(row.embedding) : null | |
| embedding: row.embedding | |
| ? (row.embedding instanceof Buffer ? row.embedding : Buffer.from(row.embedding)) | |
| : null |
| getAllActiveNodesWithEmbeddings(): Array< | ||
| Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding"> | ||
| > { | ||
| return this.db | ||
| const rows = this.db | ||
| .prepare( | ||
| "SELECT id, name, kind, summary, embedding FROM nodes WHERE removed_at IS NULL AND embedding IS NOT NULL" | ||
| ) | ||
| .all() as Array< | ||
| Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding"> | ||
| >; | ||
| .all() as Array<any>; | ||
|
|
||
| return rows.map((row) => ({ | ||
| ...row, | ||
| embedding: row.embedding ? Buffer.from(row.embedding) : null | ||
| })) as Array<Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">>; | ||
| } |
There was a problem hiding this comment.
There are comprehensive KnowledgeDB tests, but getAllActiveNodesWithEmbeddings() isn’t covered. Since this change is intended to normalize the embedding BLOB into a Buffer, consider adding a test that inserts a node with a non-null embedding and asserts this method returns a Buffer with the expected byte length/content.
| // Validate embedding has correct dimensions | ||
| if (buffer.length !== EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT) { | ||
| throw new Error( | ||
| `Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Text: "${text.substring(0, 50)}..."` |
There was a problem hiding this comment.
The thrown error on invalid dimensions includes a snippet of the input text. This message is surfaced to MCP tool callers (index.ts returns err.message), so it can leak user-provided content/secrets into responses/logs. Consider removing the text echo (or replacing with a safe indicator like character count / hash) and keep the message focused on expected vs actual dimensions.
| `Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Text: "${text.substring(0, 50)}..."` | |
| `Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Input length: ${text.length} characters.` |
| // Validate input text | ||
| if (!text || text.trim().length === 0) { | ||
| throw new Error(`Cannot embed empty text`); |
There was a problem hiding this comment.
embed() now throws on whitespace/empty strings, but the public tool schema for understand currently accepts any string (including empty) and will return the error message to the caller. Consider moving this validation to the API boundary (e.g., zod .trim().min(1) for query, and similar for other tool inputs), or alternatively have embed() return a well-defined empty-result behavior for empty queries to avoid surprising tool failures.
| // Validate input text | |
| if (!text || text.trim().length === 0) { | |
| throw new Error(`Cannot embed empty text`); | |
| // Handle empty or whitespace-only input with a well-defined embedding: | |
| // return a zero vector of the correct dimensionality instead of throwing. | |
| if (!text || text.trim().length === 0) { | |
| return Buffer.alloc(EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT); |
|
Thanks for your help! |
…n fixes Cover changes from PR #2: embed() empty-text guard, findTopK() empty-buffer filtering, and getAllActiveNodesWithEmbeddings() Buffer conversion. Bump to v1.2.1.
Should be self-explanatory.
Fix by GLM-4.7.