feat: re-embed migration tool (PR-D of #1) by 42U · Pull Request #8 · 42U/kongbrain

42U · 2026-04-25T19:43:51Z

Re-opened after PR-C merged. Original PR (#5) was auto-closed when its base branch was deleted. Same 4 commits, rebased onto master.

Refs #1 — final PR in the configurable embedding provider stack. After this lands, issue #1 is fully addressed.

See #5 for the original review thread.

The final piece of #1: lets users actually move existing data into a new provider's vector space when they swap embedding.provider.

src/migrate-reembed.ts (new file, ~250 lines)
- reembedAll(store, embeddings, opts) iterates the 8 vector tables, batches 256 rows, embeds with the active provider, and writes back with the new embedding_provider tag in a single UPDATE.
- Resumable by virtue of the WHERE filter: each batch flips embedding_provider so processed rows leave the FROM filter. A re-run picks up where a previous run stopped, so no checkpoint file is needed.
- Per-table text extractors that match the original write site's composition (e.g. skill = "name: description" matching skills.ts).
- Dry-run mode counts rows and estimates cost (chars/4 token heuristic, reports both text-embedding-3-small and text-embedding-3-large pricing).
- Handles blank-text rows by stripping the embedding and clearing the tag so the loop terminates instead of spinning on inputs the provider can't embed.
- Self-checks: refuses if from === to, if either service is uninitialized, or if the provider returns the wrong vector count.

bin/kongbrain-reembed.ts (new CLI)
- Bootstraps SurrealStore + EmbeddingService from env/config the same way the plugin does.
- Flags: --from <provider-id>, --dry-run, --tables, --batch, --help.
- Live progress reporting per batch.
- package.json bin field exposes it as `npx kongbrain-reembed`.

src/embeddings-openai.ts (bonus fix surfaced by live testing)
- OpenAI returns HTTP 429 for both transient rate limits AND "out of credits" (insufficient_quota). The original retry loop treated them identically and wasted 50 seconds on a hopeless retry cycle when a key had no credits.
- Now: peek at the response body on 429 and hard-fail immediately if insufficient_quota appears, with a clear message pointing at the provider's billing page. Real rate limits still retry as before.
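The resumability argument above can be illustrated with a small in-memory model. This is a sketch, not the code in src/migrate-reembed.ts: `Row`, `EmbedFn`, `reembedTable`, and the `maxBatches` option are hypothetical names invented for the example, an array stands in for a table, and the embedder is synchronous for brevity (a real provider call would be async).

```typescript
// In-memory model of the resumable batch loop. All names are hypothetical.
interface Row {
  id: string;
  text: string;
  embedding?: number[];
  embeddingProvider?: string; // undefined models a cleared tag
}

// Synchronous for brevity; a real provider call would be async.
type EmbedFn = (texts: string[]) => number[][];

interface Options {
  from: string;
  to: string;
  batchSize: number;
  dryRun?: boolean;
  maxBatches?: number; // only here so the example can simulate an interrupted run
}

interface Result {
  migrated: number;
  stripped: number;
  pending: number;
  estimatedTokens: number;
}

function reembedTable(rows: Row[], embed: EmbedFn, opts: Options): Result {
  if (opts.from === opts.to) throw new Error("from and to must differ");
  const result: Result = { migrated: 0, stripped: 0, pending: 0, estimatedTokens: 0 };

  if (opts.dryRun) {
    // Count matching rows and estimate tokens with the chars/4 heuristic; write nothing.
    for (const row of rows) {
      if (row.embeddingProvider !== opts.from) continue;
      result.pending++;
      result.estimatedTokens += Math.ceil(row.text.length / 4);
    }
    return result;
  }

  for (let batchNo = 0; opts.maxBatches === undefined || batchNo < opts.maxBatches; batchNo++) {
    // Stand-in for the WHERE filter: only rows still tagged with the old provider match.
    const batch = rows.filter((r) => r.embeddingProvider === opts.from).slice(0, opts.batchSize);
    if (batch.length === 0) break;

    const blank = batch.filter((r) => r.text.trim() === "");
    const embeddable = batch.filter((r) => r.text.trim() !== "");

    // Blank rows lose both vector and tag, so they stop matching the filter too.
    for (const row of blank) {
      delete row.embedding;
      delete row.embeddingProvider;
      result.stripped++;
    }

    if (embeddable.length > 0) {
      const vectors = embed(embeddable.map((r) => r.text));
      if (vectors.length !== embeddable.length) {
        throw new Error("provider returned the wrong vector count");
      }
      embeddable.forEach((row, i) => {
        row.embedding = vectors[i];
        row.embeddingProvider = opts.to; // flipping the tag is the checkpoint
        result.migrated++;
      });
    }
  }

  result.pending = rows.filter((r) => r.embeddingProvider === opts.from).length;
  return result;
}
```

Because every processed row stops matching the filter, calling `reembedTable` again after an interruption simply continues with whatever is still tagged with the old provider, and a run on a fully migrated table does nothing.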
Tests:
- test/migrate-reembed.test.ts: 8 tests covering full migration, table filter, dry-run, blank-text edge case, multi-batch chunking, refusal on identical from/to, empty-DB no-op, formatResult shape.
- test/embeddings-openai.test.ts: 1 new test for the quota fast-fail path (verifies it does NOT retry on insufficient_quota).
- test/embeddings-openai.live.test.ts: gated live smoke test against a real OpenAI-compat endpoint. Skipped unless KONGBRAIN_LIVE_OPENAI=1 AND OPENAI_API_KEY are both set.
- 465 unit tests pass, 3 live tests skip cleanly.

Refs #1
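For readers who want to see the quota fast-fail behaviour in isolation, here is a minimal sketch. It is not the code in src/embeddings-openai.ts: `HttpResult`, `isQuotaExhausted`, and `callWithRetry` are hypothetical names, the request is injected as a synchronous function so the example runs without network access, and the backoff sleep between attempts is omitted.

```typescript
// Hypothetical shape of an already-read HTTP response.
interface HttpResult {
  status: number;
  body: string;
}

// A 429 whose body mentions insufficient_quota will not recover by waiting.
function isQuotaExhausted(res: HttpResult): boolean {
  return res.status === 429 && res.body.includes("insufficient_quota");
}

// Retries only genuine rate limits; fails immediately when the quota is exhausted.
function callWithRetry(send: () => HttpResult, maxAttempts: number): HttpResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = send();
    if (isQuotaExhausted(res)) {
      throw new Error("insufficient_quota: the key has no credits, check the provider's billing page");
    }
    if (res.status !== 429) return res; // success or a non-rate-limit response
    // A real implementation would sleep with backoff here before the next attempt.
  }
  throw new Error(`still rate limited after ${maxAttempts} attempts`);
}
```

The design point is that the status code alone cannot distinguish the two cases, so the decision to retry has to be made after reading the response body.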

…All)

Closes the last verification gap on PR-D. The unit suite covers the migration logic with mocks, and the live OpenAI smoke test covers the HTTP path, but neither exercises the full pipeline end-to-end.

This test seeds 4 memory rows tagged with a fake old provider, runs reembedAll against the real OpenAI text-embedding-3-small endpoint with dimensions=1024, then asserts:
- dry-run counts the rows without writing
- real run flips every row's tag and replaces the placeholder embedding with one in the new vector space
- re-running on the now-clean DB is a no-op

Throwaway test database (kong_test/reembed_e2e_<timestamp>) is dropped in afterAll so this never touches production data.

Gated by KONGBRAIN_LIVE_OPENAI=1 + OPENAI_API_KEY. Skipped by default, so the regular test suite still passes 465/465 in CI without API access.

Verified against a real SurrealDB on localhost:8000 and a funded OpenAI key: 3/3 tests pass, total cost under $0.001.

Refs #1

Adds 4 integration tests that specifically rehearse the upgrade path existing v0.4.4 users will follow when this stack lands:
1. Pre-existing rows with embedding_provider = NONE get tagged to "local-bge-m3" by the schema's idempotent UPDATE.
2. Rows already carrying a custom provider tag are NEVER overwritten: the WHERE clause skips them.
3. Re-running the backfill is a no-op (the WHERE filter excludes already-tagged rows on subsequent runs).
4. Rows without an embedding stay untagged: we don't lie about vector spaces for rows that have no vector.

Tests run against a real SurrealDB on localhost:8000. Skipped (along with the rest of the integration suite) when the DB is unavailable, so they don't block contributors without docker. 19/19 pass against the running surrealdb-aitest container.

Refs #1
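The four rules above reduce to a single guard condition, which the following in-memory sketch makes explicit. It is an illustration only: the real backfill is an UPDATE in the schema, and `StoredRow` and `backfillProviderTag` are hypothetical names, with `undefined` standing in for NONE.

```typescript
// Hypothetical in-memory stand-in for a stored row; undefined models NONE.
interface StoredRow {
  id: string;
  embedding?: number[];
  embeddingProvider?: string;
}

const DEFAULT_PROVIDER = "local-bge-m3";

// Tags only rows that have a vector but no provider tag; returns how many rows changed.
function backfillProviderTag(rows: StoredRow[]): number {
  let updated = 0;
  for (const row of rows) {
    const hasVector = row.embedding !== undefined;
    const untagged = row.embeddingProvider === undefined;
    if (hasVector && untagged) {
      row.embeddingProvider = DEFAULT_PROVIDER;
      updated++;
    }
  }
  return updated;
}
```

Idempotency follows from the guard: once a row is tagged it no longer satisfies `untagged`, so a second pass changes nothing, and rows with a custom tag or without a vector never satisfy the condition in the first place.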

GitHub README:
- Test count badge: 415 → 469
- Config table extended with provider / openaiCompat fields
- New "Embedding Providers" section comparing local vs openai-compat, with switching instructions for OpenAI, Ollama, and vLLM
- Intro paragraph reframed to mention the provider choice

npm README:
- Test count badge: 88 → 469
- Same config table additions
- Condensed "Embedding Providers" section appropriate for npm scope

Refs #1

42U added 4 commits April 25, 2026 15:43

42U merged commit c285da8 into master Apr 25, 2026
2 checks passed

42U deleted the feat/embedding-provider-reembed branch April 25, 2026 19:44

42U mentioned this pull request Apr 25, 2026

Feature request: configurable embedding provider (OpenAI, etc.) #1

Closed

42U merged 4 commits into master from feat/embedding-provider-reembed
