feat: re-embed migration tool (PR-D of #1)#8

Merged: 42U merged 4 commits into master from feat/embedding-provider-reembed on Apr 25, 2026

42U (Owner) commented Apr 25, 2026

Re-opened after PR-C merged. Original PR (#5) was auto-closed when its base branch was deleted. Same 4 commits, rebased onto master.

Refs #1 — final PR in the configurable embedding provider stack. After this lands, issue #1 is fully addressed.

See #5 for the original review thread.

42U added 4 commits April 25, 2026 15:43

The final piece of #1: lets users actually move existing data into a new
provider's vector space when they swap embedding.provider.

src/migrate-reembed.ts — new file, ~250 lines
- reembedAll(store, embeddings, opts) iterates the 8 vector tables,
  batches 256 rows, embeds with the active provider, and writes back
  with the new embedding_provider tag in a single UPDATE.
- Resumable by virtue of the WHERE filter: each batch flips
  embedding_provider, so processed rows no longer match the filter on
  the --from provider. A re-run picks up where a previous run stopped;
  no checkpoint file is needed.
- Per-table text extractors that match the original write site's
  composition (e.g. skill = "name: description" matching skills.ts).
- Dry-run mode counts rows and estimates cost (chars/4 token heuristic,
  reports both text-embedding-3-small and text-embedding-3-large pricing).
- Handles blank-text rows by stripping the embedding and clearing the
  tag so the loop terminates instead of spinning on inputs the
  provider can't embed.
- Self-checks: refuses if from === to, if either service is
  uninitialized, or if the provider returns the wrong vector count.
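
The resumable loop described above can be sketched roughly as follows. This is an illustration only, not the code in src/migrate-reembed.ts: the `Row` shape, the `Embed` callback, and the in-memory array standing in for a vector table are all assumptions, and the real tool issues an UPDATE against SurrealDB instead of mutating objects.

```typescript
// Hypothetical row shape; the real table schema is not shown in this PR.
interface Row {
  id: string;
  text: string;
  embedding: number[] | null;
  embedding_provider: string | null;
}

// Hypothetical embedder signature: one vector per input text.
type Embed = (texts: string[]) => Promise<number[][]>;

async function reembedTable(
  rows: Row[], // stands in for one vector table
  from: string,
  to: string,
  embed: Embed,
  batchSize = 256,
): Promise<{ updated: number; blanked: number }> {
  if (from === to) throw new Error("from and to providers are identical");
  let updated = 0;
  let blanked = 0;
  for (;;) {
    // The WHERE filter: only rows still tagged with the old provider.
    const batch = rows
      .filter((r) => r.embedding_provider === from)
      .slice(0, batchSize);
    if (batch.length === 0) break;

    const live: Row[] = [];
    for (const r of batch) {
      if (r.text.trim() === "") {
        // Blank text cannot be embedded: strip the vector and clear the
        // tag so the row leaves the filter and the loop can terminate.
        r.embedding = null;
        r.embedding_provider = null;
        blanked++;
      } else {
        live.push(r);
      }
    }
    if (live.length === 0) continue;

    const vectors = await embed(live.map((r) => r.text));
    if (vectors.length !== live.length) {
      throw new Error(`expected ${live.length} vectors, got ${vectors.length}`);
    }
    live.forEach((r, i) => {
      // Writing the vector and the new tag together is what makes a
      // re-run skip this row.
      r.embedding = vectors[i];
      r.embedding_provider = to;
      updated++;
    });
  }
  return { updated, blanked };
}
```

Because the filter is re-evaluated on every pass, an interrupted run leaves only untouched rows tagged with the old provider, and calling the function again processes just those.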

bin/kongbrain-reembed.ts — new CLI
- Bootstraps SurrealStore + EmbeddingService from env/config the same
  way the plugin does.
- Flags: --from <provider-id>, --dry-run, --tables, --batch, --help.
- Live progress reporting per batch.
- package.json bin field exposes it as `npx kongbrain-reembed`.
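
For orientation, the flag set above could be parsed along these lines. This is a hand-rolled sketch, not the parser in bin/kongbrain-reembed.ts; the comma-separated format for --tables, the positive-integer rule for --batch, and the `ReembedArgs` shape are assumptions made for illustration.

```typescript
// Hypothetical parsed-argument shape.
interface ReembedArgs {
  from?: string;
  dryRun: boolean;
  tables?: string[];
  batch?: number;
  help: boolean;
}

function parseReembedArgs(argv: string[]): ReembedArgs {
  const out: ReembedArgs = { dryRun: false, help: false };

  // Returns the value that follows a value-taking flag, or throws.
  const valueAt = (index: number, flag: string): string => {
    const v = argv[index];
    if (v === undefined) throw new Error(`${flag} requires a value`);
    return v;
  };

  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    switch (flag) {
      case "--from":
        out.from = valueAt(++i, flag);
        break;
      case "--dry-run":
        out.dryRun = true;
        break;
      case "--tables":
        // Assumption: table names are passed as a comma-separated list.
        out.tables = valueAt(++i, flag)
          .split(",")
          .map((s) => s.trim())
          .filter((s) => s.length > 0);
        break;
      case "--batch": {
        const n = Number(valueAt(++i, flag));
        if (!Number.isInteger(n) || n <= 0) {
          throw new Error("--batch must be a positive integer");
        }
        out.batch = n;
        break;
      }
      case "--help":
        out.help = true;
        break;
      default:
        throw new Error(`unknown flag: ${flag}`);
    }
  }
  return out;
}
```

In a real entry point the input would typically be `process.argv.slice(2)`.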

src/embeddings-openai.ts — bonus fix surfaced by live testing
- OpenAI returns HTTP 429 for both transient rate limits AND
  "out of credits" (insufficient_quota). The original retry loop
  treated them identically and wasted 50 seconds on a hopeless retry
  cycle when a key had no credits.
- Now: peek at the response body on 429 and hard-fail immediately if
  insufficient_quota appears, with a clear message pointing at the
  provider's billing page. Real rate limits still retry as before.
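
A minimal sketch of the quota fast-fail, assuming a simplified response shape and an injected `send` callback in place of the real HTTP client. The backoff schedule, attempt limit, and error class name are placeholders, not the values used in src/embeddings-openai.ts.

```typescript
// Simplified response shape so the sketch does not depend on a specific client.
interface HttpResult {
  status: number;
  body: string;
}

// Hypothetical error type for the non-retryable case.
class QuotaExhaustedError extends Error {}

async function postWithRetry(
  send: () => Promise<HttpResult>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<HttpResult> {
  for (let attempt = 1; ; attempt++) {
    const res = await send();
    if (res.status !== 429) return res;

    // Peek at the body: waiting will not fix an account without credits.
    if (res.body.includes("insufficient_quota")) {
      throw new QuotaExhaustedError(
        "Provider reports insufficient_quota; check the provider's billing page.",
      );
    }

    if (attempt >= maxAttempts) {
      throw new Error(`still rate limited after ${attempt} attempts`);
    }
    // Placeholder exponential backoff for genuine rate limits.
    await sleep(100 * 2 ** attempt);
  }
}
```

The key point is the ordering: the body check runs before any sleep, so the quota case fails after exactly one request.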

Tests:
- test/migrate-reembed.test.ts — 8 tests covering full migration, table
  filter, dry-run, blank-text edge case, multi-batch chunking, refusal
  on identical from/to, empty-DB no-op, formatResult shape.
- test/embeddings-openai.test.ts — 1 new test for the quota fast-fail
  path (verifies it does NOT retry on insufficient_quota).
- test/embeddings-openai.live.test.ts — gated live smoke test against a
  real OpenAI-compat endpoint. Skipped unless KONGBRAIN_LIVE_OPENAI=1
  AND OPENAI_API_KEY are both set.
- 465 unit tests pass, 3 live tests skip cleanly.

Refs #1

…All)

Closes the last verification gap on PR-D. The unit suite covers the
migration logic with mocks, and the live OpenAI smoke test covers the
HTTP path, but neither exercises the full pipeline end-to-end.

This test seeds 4 memory rows tagged with a fake old provider, runs
reembedAll against the real OpenAI text-embedding-3-small endpoint with
dimensions=1024, then asserts:
  - dry-run counts the rows without writing
  - real run flips every row's tag and replaces the placeholder
    embedding with one in the new vector space
  - re-running on the now-clean DB is a no-op
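
For reference, a dry run of the kind asserted above (count rows, write nothing, estimate cost with the chars/4 heuristic) might look like the sketch below. The prices are assumed USD list prices per 1M tokens and are not taken from this PR; they are passed in as a parameter so they can be replaced, and the function and type names are hypothetical.

```typescript
// Assumed USD prices per 1M input tokens. Verify against the provider's
// pricing page before relying on them.
const ASSUMED_PRICES: Record<string, number> = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
};

interface DryRunEstimate {
  rows: number;
  estimatedTokens: number;
  costUsd: Record<string, number>;
}

function estimateDryRun(
  texts: string[],
  prices: Record<string, number> = ASSUMED_PRICES,
): DryRunEstimate {
  // chars/4 is a rough token heuristic, not a tokenizer.
  const chars = texts.reduce((sum, t) => sum + t.length, 0);
  const estimatedTokens = Math.ceil(chars / 4);

  const costUsd: Record<string, number> = {};
  for (const model of Object.keys(prices)) {
    costUsd[model] = (estimatedTokens / 1000000) * prices[model];
  }
  // Nothing is written: the function only reads the texts it is given.
  return { rows: texts.length, estimatedTokens, costUsd };
}
```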

Throwaway test database (kong_test/reembed_e2e_<timestamp>) is dropped
in afterAll so this never touches production data.

Gated by KONGBRAIN_LIVE_OPENAI=1 + OPENAI_API_KEY. Skipped by default,
so the regular test suite still passes 465/465 in CI without API access.
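
The gate itself is small. A sketch, assuming the check is a pure function over the environment (the helper name is hypothetical, and the test-runner call in the comment assumes a runner with a conditional-skip API such as Vitest's `describe.skipIf`):

```typescript
// Both variables must be set: the opt-in flag and an API key.
function liveOpenAiEnabled(env: Record<string, string | undefined>): boolean {
  return env.KONGBRAIN_LIVE_OPENAI === "1" && Boolean(env.OPENAI_API_KEY);
}

// Possible usage, assuming a Vitest-style runner:
//   describe.skipIf(!liveOpenAiEnabled(process.env))("reembed e2e", () => { ... });
```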

Verified against a real SurrealDB on localhost:8000 and a funded OpenAI
key — 3/3 tests pass, total cost under $0.001.

Refs #1

Adds 4 integration tests that specifically rehearse the upgrade path
existing v0.4.4 users will follow when this stack lands:

  1. Pre-existing rows with embedding_provider = NONE get tagged to
     "local-bge-m3" by the schema's idempotent UPDATE.
  2. Rows already carrying a custom provider tag are NEVER overwritten —
     the WHERE clause skips them.
  3. Re-running the backfill is a no-op (the WHERE filter excludes
     already-tagged rows on subsequent runs).
  4. Rows without an embedding stay untagged — we don't lie about
     vector spaces for rows that have no vector.
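
The four rules can be modelled in memory as below. This is a sketch of the semantics only: the real backfill is an UPDATE in the schema, SurrealDB's NONE is represented here as `null`, and the row shape and function name are assumptions.

```typescript
// Hypothetical minimal row: only the two fields the backfill looks at.
interface VectorRow {
  embedding: number[] | null;
  embedding_provider: string | null; // null stands in for NONE
}

// Tags untagged rows that have a vector; returns how many rows changed.
function backfillProviderTag(
  rows: VectorRow[],
  defaultTag = "local-bge-m3",
): number {
  let tagged = 0;
  for (const r of rows) {
    // Equivalent of the WHERE clause: has a vector, has no tag yet.
    if (r.embedding !== null && r.embedding_provider === null) {
      r.embedding_provider = defaultTag;
      tagged++;
    }
  }
  return tagged;
}
```

Running it twice on the same rows returns 0 the second time, which is the idempotence that test 3 checks.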

Tests run against a real SurrealDB on localhost:8000. Skipped (along
with the rest of the integration suite) when the DB is unavailable, so
they don't block contributors without docker. 19/19 pass against the
running surrealdb-aitest container.

Refs #1

GitHub README:
- Test count badge: 415 → 469
- Config table extended with provider / openaiCompat fields
- New "Embedding Providers" section comparing local vs openai-compat,
  with switching instructions for OpenAI, Ollama, and vLLM
- Intro paragraph reframed to mention the provider choice

npm README:
- Test count badge: 88 → 469
- Same config table additions
- Condensed "Embedding Providers" section appropriate for npm scope

Refs #1
42U merged commit c285da8 into master on Apr 25, 2026
2 checks passed
42U deleted the feat/embedding-provider-reembed branch on April 25, 2026 19:44