Skip to content

fix(embeddings): make sure dimensions is correct#2

Merged
0xK3vin merged 1 commit into0xK3vin:mainfrom
mikaelj:issue-1-fix-read-embeddings
Feb 8, 2026
Merged

fix(embeddings): make sure dimensions is correct#2
0xK3vin merged 1 commit into0xK3vin:mainfrom
mikaelj:issue-1-fix-read-embeddings

Conversation

@mikaelj
Copy link
Copy Markdown
Contributor

@mikaelj mikaelj commented Feb 8, 2026

Should be self-explanatory.

Fix by GLM-4.7.

Copilot AI review requested due to automatic review settings February 8, 2026 07:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to prevent incorrect embedding vectors from propagating through similarity search by adding validation and normalizing stored embeddings to a consistent binary format.

Changes:

  • Add input and output-dimension validation to embed() to ensure embeddings match EMBEDDING_DIM.
  • Filter out candidates with missing/empty embeddings during findTopK.
  • Normalize embedding BLOBs from the DB into Buffer instances in getAllActiveNodesWithEmbeddings().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/embeddings.ts Adds embed() validation and tightens candidate filtering for similarity search.
src/db.ts Ensures embeddings returned from SQLite are converted to Buffer for downstream math.
Comments suppressed due to low confidence (1)

src/embeddings.ts:112

  • findTopK now filters out empty embeddings, but it can still throw if a candidate embedding has a different byte length than queryEmbedding (cosineSimilarity enforces equal dimensions). If the DB can contain embeddings from older models or corrupted rows (which this PR is addressing), consider filtering candidates by c.embedding.length === queryEmbedding.length (or EMBEDDING_DIM * 4) so one bad row doesn’t fail the entire search.
  const scored = candidates
    .filter((c) => c.embedding !== null && c.embedding.length > 0)
    .map((c) => ({
      id: c.id,
      similarity: cosineSimilarity(queryEmbedding, c.embedding!),
    }))

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/embeddings.ts
Comment on lines 24 to +40
export async function embed(text: string): Promise<Buffer> {
// Validate input text
if (!text || text.trim().length === 0) {
throw new Error(`Cannot embed empty text`);
}

const ext = await getExtractor();
const output = await ext(text, { pooling: "mean", normalize: true });
const data = output.data as Float32Array;
return Buffer.from(data.buffer, data.byteOffset, data.byteLength);
const buffer = Buffer.from(data.buffer, data.byteOffset, data.byteLength);

// Validate embedding has correct dimensions
if (buffer.length !== EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT) {
throw new Error(
`Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Text: "${text.substring(0, 50)}..."`
);
}
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are existing unit tests for src/embeddings.ts, but the newly added behavior in embed() (rejecting empty text and validating output dimensions) isn’t covered. Adding tests for (1) empty/whitespace input throwing, and (2) dimension mismatch (by mocking @xenova/transformers pipeline output) would prevent regressions and make the intent of this PR explicit.

Copilot uses AI. Check for mistakes.
Comment thread src/db.ts
Comment on lines +322 to +331
const rows = this.db
.prepare(
"SELECT id, name, kind, summary, embedding FROM nodes WHERE removed_at IS NULL AND embedding IS NOT NULL"
)
.all() as Array<
Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">
>;
.all() as Array<any>;

return rows.map((row) => ({
...row,
embedding: row.embedding ? Buffer.from(row.embedding) : null
})) as Array<Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">>;
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getAllActiveNodesWithEmbeddings() now uses .all() as Array<any> and then casts the mapped result back to Pick<NodeRow,...>. This loses type-safety and can hide shape issues. Consider defining an explicit row type for this query (including the actual BLOB type returned by libsql, e.g. Uint8Array) and mapping to a properly typed result instead of using any/double-casting.

Copilot uses AI. Check for mistakes.
Comment thread src/db.ts

return rows.map((row) => ({
...row,
embedding: row.embedding ? Buffer.from(row.embedding) : null
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buffer.from(row.embedding) will copy if row.embedding is already a Buffer. If libsql sometimes returns Buffer and sometimes Uint8Array, consider avoiding the extra allocation by only converting when needed (e.g., row.embedding instanceof Buffer ? row.embedding : Buffer.from(row.embedding)). This method can be called on every understand query and may touch many rows.

Suggested change
embedding: row.embedding ? Buffer.from(row.embedding) : null
embedding: row.embedding
? (row.embedding instanceof Buffer ? row.embedding : Buffer.from(row.embedding))
: null

Copilot uses AI. Check for mistakes.
Comment thread src/db.ts
Comment on lines 319 to 332
getAllActiveNodesWithEmbeddings(): Array<
Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">
> {
return this.db
const rows = this.db
.prepare(
"SELECT id, name, kind, summary, embedding FROM nodes WHERE removed_at IS NULL AND embedding IS NOT NULL"
)
.all() as Array<
Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">
>;
.all() as Array<any>;

return rows.map((row) => ({
...row,
embedding: row.embedding ? Buffer.from(row.embedding) : null
})) as Array<Pick<NodeRow, "id" | "name" | "kind" | "summary" | "embedding">>;
}
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are comprehensive KnowledgeDB tests, but getAllActiveNodesWithEmbeddings() isn’t covered. Since this change is intended to normalize the embedding BLOB into a Buffer, consider adding a test that inserts a node with a non-null embedding and asserts this method returns a Buffer with the expected byte length/content.

Copilot uses AI. Check for mistakes.
Comment thread src/embeddings.ts
// Validate embedding has correct dimensions
if (buffer.length !== EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT) {
throw new Error(
`Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Text: "${text.substring(0, 50)}..."`
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thrown error on invalid dimensions includes a snippet of the input text. This message is surfaced to MCP tool callers (index.ts returns err.message), so it can leak user-provided content/secrets into responses/logs. Consider removing the text echo (or replacing with a safe indicator like character count / hash) and keep the message focused on expected vs actual dimensions.

Suggested change
`Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Text: "${text.substring(0, 50)}..."`
`Invalid embedding dimension: expected ${EMBEDDING_DIM} floats (${EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT} bytes), got ${data.length} floats (${buffer.length} bytes). Input length: ${text.length} characters.`

Copilot uses AI. Check for mistakes.
Comment thread src/embeddings.ts
Comment on lines +25 to +27
// Validate input text
if (!text || text.trim().length === 0) {
throw new Error(`Cannot embed empty text`);
Copy link

Copilot AI Feb 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embed() now throws on whitespace/empty strings, but the public tool schema for understand currently accepts any string (including empty) and will return the error message to the caller. Consider moving this validation to the API boundary (e.g., zod .trim().min(1) for query, and similar for other tool inputs), or alternatively have embed() return a well-defined empty-result behavior for empty queries to avoid surprising tool failures.

Suggested change
// Validate input text
if (!text || text.trim().length === 0) {
throw new Error(`Cannot embed empty text`);
// Handle empty or whitespace-only input with a well-defined embedding:
// return a zero vector of the correct dimensionality instead of throwing.
if (!text || text.trim().length === 0) {
return Buffer.alloc(EMBEDDING_DIM * Float32Array.BYTES_PER_ELEMENT);

Copilot uses AI. Check for mistakes.
@0xK3vin 0xK3vin merged commit 8f3c9ef into 0xK3vin:main Feb 8, 2026
5 of 6 checks passed
@0xK3vin
Copy link
Copy Markdown
Owner

0xK3vin commented Feb 8, 2026

Thanks for your help!

0xK3vin added a commit that referenced this pull request Feb 8, 2026
…n fixes

Cover changes from PR #2: embed() empty-text guard, findTopK() empty-buffer
filtering, and getAllActiveNodesWithEmbeddings() Buffer conversion.

Bump to v1.2.1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants