-
Notifications
You must be signed in to change notification settings - Fork 181
Feat: add Google AI embedding provider support for vector db #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: add Google AI embedding provider support for vector db #107
Conversation
Signed-off-by: SyedZawwarAhmed <zawwar.ahmed12@gmail.com>
WalkthroughAdds a GoogleEmbeds provider implementing Google Generative AI embeddings, registers it under a new SupportedProvider 'GoogleAI', and re-exports it from the core index. Implements credential resolution (getLLMCredentials or GOOGLE_AI_API_KEY), batching, response validation, and error handling. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Caller
participant Factory as EmbeddingsFactory
participant Provider as GoogleEmbeds
participant Creds as getLLMCredentials
participant Env as Env (GOOGLE_AI_API_KEY)
participant Google as GoogleGenerativeAI
Caller->>Factory: create({ provider: "GoogleAI", model? })
Factory-->>Caller: GoogleEmbeds instance
Caller->>Provider: embedTexts(texts[], candidate)
Note right of Provider: chunk texts into batches (preserve order)
loop per batch
Provider->>Creds: getLLMCredentials(provider="GoogleAI", model)
Creds-->>Provider: credentials? (maybe apiKey)
alt no creds.apiKey
Provider->>Env: read GOOGLE_AI_API_KEY
Env-->>Provider: apiKey?
end
alt missing apiKey
Provider-->>Caller: throw "Please provide an API key..."
else have apiKey
Provider->>Google: new GoogleGenerativeAI({apiKey}) / getGenerativeModel(model)
Provider->>Google: model.embedContent(text)
Google-->>Provider: { result.embedding.values } or error
alt invalid embedding
Provider-->>Caller: throw "Invalid embedding response..."
end
end
end
Provider-->>Caller: number[][] embeddings (ordered)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/core/src/subsystems/IO/VectorDB.service/embed/index.ts (1)
1-26: Update default connector dimensions to 3072
Defaultparams.dimensionsin the VectorDB connectors is hard-coded to 1024, but thetext-embedding-3-largemodel produces 3072-dimensional embeddings. Update these defaults (RAMVectorDB.class.ts:67–69; MilvusVectorDB.class.ts:66–67; PineconeVectorDB.class.ts:75–76) to 3072 or require passing the correctdimensionsin configuration to prevent runtime insert failures.
🧹 Nitpick comments (6)
packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts (5)
12-13: Avoid duplicating the model stringReference the
DEFAULT_MODELconstant to keep the list in sync.- public static models = ['gemini-embedding-001']; + public static models = [DEFAULT_MODEL];
27-35: Flatten batches more simplyThis loop can be replaced with a single flatten while preserving order.
- const embeddings: number[][] = []; - for (let i = 0; i < batchResponses.length; i += 1) { - const batch = batches[i]; - const batchResponse = batchResponses[i]; - for (let j = 0; j < batch.length; j += 1) { - embeddings.push(batchResponse[j]); - } - } - return embeddings; + return batchResponses.flat();
73-87: Batch inner calls concurrently (optional, with care for rate limits)Inside a batch, calls are sequential. If quotas allow, parallelize within the batch to cut latency.
- const embeddings: number[][] = []; - - for (const text of texts) { - const result = await model.embedContent(text); - if (result?.embedding?.values) { - embeddings.push(result.embedding.values); - } else { - throw new Error('Invalid embedding response from Google AI'); - } - } - - return embeddings; + const results = await Promise.all(texts.map((text) => model.embedContent(text))); + const embeddings = results.map((result) => { + if (result?.embedding?.values) return result.embedding.values; + throw new Error('Invalid embedding response from Google AI'); + }); + return embeddings;
88-90: Preserve original error as cause and type-narrowThis keeps stack/context while providing a clean message.
- } catch (e) { - throw new Error(`Google Embeddings API error: ${e.message || e}`); - } + } catch (e) { + const err = e instanceof Error ? e : new Error(String(e)); + throw new Error('Google Embeddings API error', { cause: err }); + }
61-63: Add fallback to generic GOOGLE_API_KEY in GoogleEmbedding.ts
Extend the existing fallback so thatapiKeyis set to:apiKey = process.env.GOOGLE_AI_API_KEY ?? process.env.GOOGLE_API_KEY;This aligns with the CLI’s use of
GOOGLE_API_KEYin create.index.ts:40 and reduces config friction.packages/core/src/subsystems/IO/VectorDB.service/embed/index.ts (1)
22-25: Constructor param consistency (optional)
EmbeddingsFactory.createpassesTEmbeddings, whileGoogleEmbedsacceptsPartial<TEmbeddings>. Consider aligning constructor signatures across embedders for uniformity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
packages/core/src/index.ts(1 hunks)packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts(1 hunks)packages/core/src/subsystems/IO/VectorDB.service/embed/index.ts(2 hunks)
🔇 Additional comments (3)
packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts (1)
49-56: Verify GoogleEmbedding model ID and credentials mapping
Confirm thatthis.modelis set togemini-embedding-001(3072-dim by default, adjustable viaoutput_dimensionality) rather than a legacy ID, and ensure the credentials helper maps provider"GoogleAI"to the correct Google GenAI connector.packages/core/src/index.ts (1)
169-169: Public re-export looks goodMakes the new embedder discoverable at the root index.
packages/core/src/subsystems/IO/VectorDB.service/embed/index.ts (1)
11-14: Provider registration is correct
GoogleAIwired toGoogleEmbedswith models from the class. Types flow viaas const.
| const DEFAULT_MODEL = 'gemini-embedding-001'; | ||
|
|
||
| export class GoogleEmbeds extends BaseEmbedding { | ||
| protected client: GoogleGenerativeAI; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Initialize client field safely
Mark client as optional to satisfy strict init checks and match the lazy init pattern used below.
- protected client: GoogleGenerativeAI;
+ protected client?: GoogleGenerativeAI;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| protected client: GoogleGenerativeAI; | |
| // packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts | |
| class GoogleEmbedding { | |
| protected client?: GoogleGenerativeAI; | |
| // …rest of class… | |
| } |
🤖 Prompt for AI Agents
In packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts
around line 10, the protected client is declared without allowing uninitialized
state which fails strict init checks and conflicts with the lazy-init pattern;
change the declaration to make it optional (e.g., use "protected client?:
GoogleGenerativeAI" or "protected client: GoogleGenerativeAI | undefined") and
ensure existing code paths that access this.client either initialize it first or
guard for undefined before use.
| public static models = ['gemini-embedding-001']; | ||
| public canSpecifyDimensions = true; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Dimensions flag: verify end-to-end behavior
canSpecifyDimensions = true implies callers may request non-default dims, but no dimension is passed to the API here. Ensure BaseEmbedding (or downstream) handles resizing/truncation to requested dims or pass the API parameter if supported.
🏁 Script executed:
#!/bin/bash
# Where are embedding dimensions enforced?
rg -n -C2 -e 'dimensions?|dimensionality|vector.?size|numDimensions' packages/core/srcLength of output: 13157
🏁 Script executed:
#!/bin/bash
# Show GoogleEmbedding class and its methods
rg -nP --type=ts 'class GoogleEmbedding' -C5 packages/core/src/subsystems/IO/VectorDB.service/embedLength of output: 101
🏁 Script executed:
#!/bin/bash
# List files in embed directory to locate GoogleEmbedding file
ls -1 packages/core/src/subsystems/IO/VectorDB.service/embedLength of output: 126
🏁 Script executed:
#!/bin/bash
# Display the contents of GoogleEmbedding.ts to inspect dimension handling
sed -n '1,200p' packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.tsLength of output: 3614
🏁 Script executed:
#!/bin/bash
sed -n '1,200p' packages/core/src/subsystems/IO/VectorDB.service/embed/BaseEmbedding.tsLength of output: 3912
Disable dimension override in GoogleEmbedding
- In
packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts:12,canSpecifyDimensions = trueimplies custom‐dim support but neither the API call norBaseEmbeddinguses or enforcesthis.dimensions, leading to mismatched vector lengths in downstream connectors. - Either set
canSpecifyDimensions = falseor add truncation/padding logic aftermodel.embedContentto enforcethis.dimensions.
🤖 Prompt for AI Agents
In packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts
around lines 12 to 14, the class sets canSpecifyDimensions = true but neither
the Google API call nor BaseEmbedding enforces this.dimensions, causing
potential vector-length mismatches; either set canSpecifyDimensions = false to
disallow dimension overrides, or keep it true and after receiving
model.embedContent apply deterministic truncation or zero-padding to the
returned embedding to exactly this.dimensions (and validate the value is numeric
and >0), then return the adjusted vector so downstream connectors always receive
vectors of the expected length.
| constructor(private settings?: Partial<TEmbeddings>) { | ||
| super({ model: settings?.model ?? DEFAULT_MODEL, ...settings }); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default model can be unintentionally overridden to undefined
Spreading settings after setting model can overwrite the default when settings.model is undefined. Put the spread first and the model last.
- constructor(private settings?: Partial<TEmbeddings>) {
- super({ model: settings?.model ?? DEFAULT_MODEL, ...settings });
- }
+ constructor(private settings?: Partial<TEmbeddings>) {
+ super({ ...settings, model: settings?.model ?? DEFAULT_MODEL });
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| constructor(private settings?: Partial<TEmbeddings>) { | |
| super({ model: settings?.model ?? DEFAULT_MODEL, ...settings }); | |
| } | |
| constructor(private settings?: Partial<TEmbeddings>) { | |
| super({ ...settings, model: settings?.model ?? DEFAULT_MODEL }); | |
| } |
🤖 Prompt for AI Agents
In packages/core/src/subsystems/IO/VectorDB.service/embed/GoogleEmbedding.ts
around lines 15 to 17, the constructor currently spreads settings after setting
model which allows settings.model === undefined to overwrite the default; change
the order so you spread settings first and then set model (or set model using
nullish-coalescing on the already-spread object) so the default model is applied
when settings.model is undefined.
|
Thank you, @SyedZawwarAhmed, for your valuable contribution! This will add value to vector db along with google ai embedding. |
|
@zubairirfan96 Sure, I will do it. |
Signed-off-by: SyedZawwarAhmed <zawwar.ahmed12@gmail.com>
|
@zubairirfan96 I have also added unit tests for this integration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (7)
packages/core/tests/unit/embeddings/GoogleEmbedding.test.ts (7)
7-16: Module mocks look good; consider hardening test isolation.Add
vi.resetModules()inafterEachto prevent any cached module state from leaking between tests (useful if the provider ever caches singletons at module scope).afterEach(() => { vi.restoreAllMocks(); + vi.resetModules(); delete process.env.GOOGLE_AI_API_KEY; });
61-65: Redundant custom-model test.This uses the default model value, providing no extra coverage beyond the previous test. Either remove it or change it to assert a non-default override if supported.
- it('should initialize with custom model', () => { - googleEmbeds = new GoogleEmbeds({ model: 'gemini-embedding-001' }); - expect(googleEmbeds.model).toBe('gemini-embedding-001'); - }); + // Optional: keep only if a non-default model is ever added.
88-91: Make models assertion future-proof.Hard-equality will break the test if more models are added. Assert membership instead.
- expect(GoogleEmbeds.models).toEqual(['gemini-embedding-001']); + expect(GoogleEmbeds.models).toContain('gemini-embedding-001');
137-151: Strengthen env-fallback test with an extra assertion.Also assert that
getLLMCredentialswas attempted before falling back.const result = await googleEmbeds.embedText('test text', mockAccessCandidate); expect(result).toEqual(mockEmbedding); + expect(getLLMCredentials).toHaveBeenCalledTimes(1); expect(GoogleGenerativeAI).toHaveBeenCalledWith('env-api-key');
217-243: Consider adding a newline-preservation batch test.Mirror the single-text newline opt-out test for batched inputs to ensure consistency.
it('should handle chunking correctly', async () => { @@ expect(mockModel.embedContent).toHaveBeenCalledTimes(5); }); + it('should preserve newlines in batch when stripNewLines is false', async () => { + googleEmbeds = new GoogleEmbeds({ params: { chunkSize: 2, stripNewLines: false } }); + const mockEmbedding = [0.1, 0.2, 0.3]; + mockModel.embedContent.mockResolvedValue({ embedding: { values: mockEmbedding } }); + await googleEmbeds.embedTexts(['line1\nline2'], mockAccessCandidate); + expect(mockModel.embedContent).toHaveBeenCalledWith('line1\nline2'); + });
289-301: Also assert model reuse.You assert client reuse; assert model reuse too to avoid repeated
getGenerativeModelcalls.await googleEmbeds.embedText('test1', mockAccessCandidate); await googleEmbeds.embedText('test2', mockAccessCandidate); // GoogleGenerativeAI constructor should only be called once expect(GoogleGenerativeAI).toHaveBeenCalledTimes(1); + expect(mockClient.getGenerativeModel).toHaveBeenCalledTimes(1); expect(mockModel.embedContent).toHaveBeenCalledTimes(2);
303-319: Assert the SDK receives the custom API key.This makes the intent unambiguous when explicit credentials are passed.
await googleEmbeds.embedText('test', mockAccessCandidate); expect(getLLMCredentials).toHaveBeenCalledWith(mockAccessCandidate, { provider: 'GoogleAI', modelId: 'gemini-embedding-001', credentials: customCredentials, }); + expect(GoogleGenerativeAI).toHaveBeenCalledWith('custom-key');
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
packages/core/tests/unit/embeddings/GoogleEmbedding.test.ts(1 hunks)
🔇 Additional comments (5)
packages/core/tests/unit/embeddings/GoogleEmbedding.test.ts (5)
114-135: Newline handling tests are clear and accurate.These assertions validate default stripping and the opt-out path well.
190-215: Batch order and mapping are well-covered.Good verification that outputs map to inputs despite concurrent processing.
322-347: Error-wrapping coverage is solid.Good coverage for Error objects, plain objects, and non-Error rejections, with consistent messaging.
349-388: Text edge cases well handled.Empty, whitespace-only, and very long inputs are exercised appropriately.
66-81: Dimensions parameter is not applied in GoogleEmbedding; skip dimension assertion
GoogleEmbedding does not passdimensionsto the SDK (nooutputDimensionality) nor apply client-side trimming/padding. Remove thedimensionssetting and itsexpect(googleEmbeds.dimensions).toBe(512)assertion from this test—or implement dimension support before adding any checks.Likely an incorrect or invalid review comment.
|
Thank you for this contribution, it's merged to dev branch and will be published in the next release :) |
|
@alaa-eddine-k You're welcome. I'd love to contribute more to the project in the future. |
📝 Description
Adds Google AI embedding provider support to SmythOS, enabling users to use
gemini-embedding-001embeddings alongside existing OpenAI providers. The integration includes proper provider registration, environment variable fallback for API keys, and support for 3072-dimensional vectors.Key changes:
GoogleAIprovider in embeddings factoryGOOGLE_AI_API_KEYenvironment variable fallback🔗 Related Issues
🔧 Type of Change
✅ Checklist
Summary by CodeRabbit
New Features
Tests