feat: re-embed migration tool (PR-D of #1) #8
Merged
The final piece of #1: lets users actually move existing data into a new provider's vector space when they swap embedding.provider.

src/migrate-reembed.ts (new file, ~250 lines)
- reembedAll(store, embeddings, opts) iterates the 8 vector tables in batches of 256 rows, embeds with the active provider, and writes back with the new embedding_provider tag in a single UPDATE.
- Resumable by virtue of the WHERE filter: each batch flips embedding_provider, so processed rows no longer match the filter on the old (--from) provider. A re-run picks up where a previous run stopped; no checkpoint file is needed.
- Per-table text extractors that match the original write site's composition (e.g. skill = "name: description", matching skills.ts).
- Dry-run mode counts rows and estimates cost (chars/4 token heuristic; reports both text-embedding-3-small and text-embedding-3-large pricing).
- Handles blank-text rows by stripping the embedding and clearing the tag, so the loop terminates instead of spinning on inputs the provider can't embed.
- Self-checks: refuses if from === to, if either service is uninitialized, or if the provider returns the wrong vector count.

bin/kongbrain-reembed.ts (new CLI)
- Bootstraps SurrealStore + EmbeddingService from env/config the same way the plugin does.
- Flags: --from <provider-id>, --dry-run, --tables, --batch, --help.
- Live progress reporting per batch.
- package.json bin field exposes it as `npx kongbrain-reembed`.

src/embeddings-openai.ts (bonus fix surfaced by live testing)
- OpenAI returns HTTP 429 for both transient rate limits AND "out of credits" (insufficient_quota). The original retry loop treated them identically and wasted 50 seconds on a hopeless retry cycle when a key had no credits.
- Now: peek at the response body on 429 and hard-fail immediately if insufficient_quota appears, with a clear message pointing at the provider's billing page. Real rate limits still retry as before.
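The resumable loop, dry-run, and blank-text handling described above can be sketched against an in-memory table. This is a minimal illustration under stated assumptions, not the code in src/migrate-reembed.ts: `Row`, `EmbedFn`, `ReembedResult`, and `reembedTable` are hypothetical names, the array filter stands in for the real WHERE clause, and summing characters before dividing by 4 is one plausible reading of the chars/4 heuristic.

```typescript
// Hypothetical in-memory stand-in for one vector table; not the real SurrealStore API.
interface Row {
  id: string;
  text: string;
  embedding: number[] | null;
  embedding_provider: string | null;
}

// Assumed provider shape: one vector per input text, in order.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

interface ReembedResult {
  processed: number;       // rows re-embedded (or matched, in dry-run)
  cleared: number;         // blank-text rows whose embedding and tag were stripped
  estimatedTokens: number; // only populated in dry-run
}

async function reembedTable(
  rows: Row[],
  embed: EmbedFn,
  from: string,
  to: string,
  opts: { batchSize?: number; dryRun?: boolean } = {},
): Promise<ReembedResult> {
  if (from === to) throw new Error("from and to providers are identical");
  const batchSize = opts.batchSize ?? 256;
  const result: ReembedResult = { processed: 0, cleared: 0, estimatedTokens: 0 };

  if (opts.dryRun) {
    // Count matching rows and estimate tokens without writing anything.
    const pending = rows.filter((r) => r.embedding_provider === from);
    const totalChars = pending.reduce((sum, r) => sum + r.text.length, 0);
    result.processed = pending.length;
    result.estimatedTokens = Math.ceil(totalChars / 4);
    return result;
  }

  for (;;) {
    // Stand-in for "WHERE embedding_provider = <from> LIMIT <batchSize>".
    const batch = rows.filter((r) => r.embedding_provider === from).slice(0, batchSize);
    if (batch.length === 0) return result; // nothing left: this is also the resume point

    const embeddable: Row[] = [];
    for (const row of batch) {
      if (row.text.trim() === "") {
        // Blank rows can't be embedded; clear them so they leave the filter.
        row.embedding = null;
        row.embedding_provider = null;
        result.cleared += 1;
      } else {
        embeddable.push(row);
      }
    }
    if (embeddable.length === 0) continue;

    const vectors = await embed(embeddable.map((r) => r.text));
    if (vectors.length !== embeddable.length) {
      throw new Error(`expected ${embeddable.length} vectors, got ${vectors.length}`);
    }
    embeddable.forEach((row, i) => {
      row.embedding = vectors[i];
      row.embedding_provider = to; // flipping the tag removes the row from the next batch
      result.processed += 1;
    });
  }
}
```

Because the loop's only state is the tag on each row, interrupting it and calling it again simply resumes with whatever rows still carry the old tag.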
Tests:
- test/migrate-reembed.test.ts: 8 tests covering full migration, table filter, dry-run, blank-text edge case, multi-batch chunking, refusal on identical from/to, empty-DB no-op, formatResult shape.
- test/embeddings-openai.test.ts: 1 new test for the quota fast-fail path (verifies it does NOT retry on insufficient_quota).
- test/embeddings-openai.live.test.ts: gated live smoke test against a real OpenAI-compat endpoint. Skipped unless KONGBRAIN_LIVE_OPENAI=1 AND OPENAI_API_KEY are both set.
- 465 unit tests pass, 3 live tests skip cleanly.

Refs #1
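The quota fast-fail that the new unit test exercises could be expressed as a small pure decision function. This is a sketch only: `RetryDecision` and `decideAfterResponse` are hypothetical names rather than identifiers from src/embeddings-openai.ts, and treating every other non-2xx status as an immediate failure is a simplification made here for brevity.

```typescript
// Hypothetical helper; names are illustrative, not taken from src/embeddings-openai.ts.
type RetryDecision =
  | { kind: "ok" }
  | { kind: "retry" }
  | { kind: "fail"; message: string };

function decideAfterResponse(
  status: number,
  bodyText: string,
  attempt: number,     // 1-based count of attempts made so far
  maxAttempts: number,
): RetryDecision {
  if (status >= 200 && status < 300) return { kind: "ok" };

  if (status === 429) {
    // Out-of-credits responses share the 429 status but will never succeed on retry.
    if (bodyText.includes("insufficient_quota")) {
      return {
        kind: "fail",
        message: "Provider reports insufficient_quota; check the account's billing page. Retrying will not help.",
      };
    }
    // A genuine rate limit: retry until the attempt budget runs out.
    if (attempt < maxAttempts) return { kind: "retry" };
    return { kind: "fail", message: `Rate limited after ${attempt} attempts.` };
  }

  // Simplification for this sketch: any other error status fails immediately.
  return { kind: "fail", message: `Unexpected HTTP status ${status}.` };
}
```

Keeping the decision separate from the HTTP call makes the "does NOT retry" property easy to assert without mocking timers or network delays.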
…All)
Closes the last verification gap on PR-D. The unit suite covers the
migration logic with mocks, and the live OpenAI smoke test covers the
HTTP path, but neither exercises the full pipeline end-to-end.
This test seeds 4 memory rows tagged with a fake old provider, runs
reembedAll against the real OpenAI text-embedding-3-small endpoint with
dimensions=1024, then asserts:
- dry-run counts the rows without writing
- real run flips every row's tag and replaces the placeholder
embedding with one in the new vector space
- re-running on the now-clean DB is a no-op
Throwaway test database (kong_test/reembed_e2e_<timestamp>) is dropped
in afterAll so this never touches production data.
Gated by KONGBRAIN_LIVE_OPENAI=1 + OPENAI_API_KEY. Skipped by default,
so the regular test suite still passes 465/465 in CI without API access.
Verified against a real SurrealDB on localhost:8000 and a funded OpenAI
key — 3/3 tests pass, total cost under $0.001.
Refs #1
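The gate described in the commit message above needs both environment variables before any live test runs. A minimal sketch, assuming a hypothetical helper named `liveTestsEnabled` (the actual test file may implement the check differently):

```typescript
// Hypothetical gate: live tests run only when the opt-in flag is exactly "1"
// AND a non-empty API key is present.
function liveTestsEnabled(env: Record<string, string | undefined>): boolean {
  const optedIn = env.KONGBRAIN_LIVE_OPENAI === "1";
  const hasKey = (env.OPENAI_API_KEY ?? "").length > 0;
  return optedIn && hasKey;
}
```

A suite would typically evaluate this once against `process.env` and skip the whole live block when it returns false, which is what keeps CI green without API access.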
Adds 4 integration tests that specifically rehearse the upgrade path
existing v0.4.4 users will follow when this stack lands:
1. Pre-existing rows with embedding_provider = NONE get tagged to
"local-bge-m3" by the schema's idempotent UPDATE.
2. Rows already carrying a custom provider tag are NEVER overwritten —
the WHERE clause skips them.
3. Re-running the backfill is a no-op (the WHERE filter excludes
already-tagged rows on subsequent runs).
4. Rows without an embedding stay untagged — we don't lie about
vector spaces for rows that have no vector.
Tests run against a real SurrealDB on localhost:8000. Skipped (along
with the rest of the integration suite) when the DB is unavailable, so
they don't block contributors without Docker. 19/19 pass against the
running surrealdb-aitest container.
Refs #1
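The four backfill behaviours listed in the commit message above follow from a single filter condition. The sketch below models that condition in memory; `StoredRow` and `backfillProviderTag` are hypothetical names, `null` stands in for SurrealDB's NONE, and the exact WHERE clause in the schema is assumed rather than quoted.

```typescript
// Hypothetical in-memory model of a stored row; null stands in for NONE.
interface StoredRow {
  embedding: number[] | null;
  embedding_provider: string | null;
}

// Tags only rows that have a vector but no provider tag.
// Returns how many rows were updated.
function backfillProviderTag(rows: StoredRow[], defaultTag = "local-bge-m3"): number {
  let updated = 0;
  for (const row of rows) {
    // Assumed filter shape: embedding_provider = NONE AND embedding != NONE.
    if (row.embedding_provider === null && row.embedding !== null) {
      row.embedding_provider = defaultTag;
      updated += 1;
    }
  }
  return updated;
}
```

Since a tagged row no longer satisfies the condition, a second pass finds nothing to update, which is the idempotence the integration tests check.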
GitHub README:
- Test count badge: 415 → 469
- Config table extended with provider / openaiCompat fields
- New "Embedding Providers" section comparing local vs openai-compat, with switching instructions for OpenAI, Ollama, and vLLM
- Intro paragraph reframed to mention the provider choice

npm README:
- Test count badge: 88 → 469
- Same config table additions
- Condensed "Embedding Providers" section appropriate for npm scope

Refs #1
Re-opened after PR-C merged. Original PR (#5) was auto-closed when its base branch was deleted. Same 4 commits, rebased onto master.
Refs #1 — final PR in the configurable embedding provider stack. After this lands, issue #1 is fully addressed.
See #5 for the original review thread.