.Net: Integrate IEmbeddingGenerator #11682

roji · 2025-04-22T23:04:13Z

Adds support for using IEmbeddingGenerator to generate embeddings from arbitrary user property types.

An IEmbeddingGenerator can be configured at the property, collection or store level. The DI methods also look up an IEmbeddingGenerator in DI, and if they find it, configure it as the store level one.
When calling SearchAsync and UpsertAsync, we detect vector properties that have an IEmbeddingGenerator configured, and call it (note that for batching UpsertAsync, we call it once for all records, per property). When upserting, the generated embeddings are then passed into the mapper, and are used instead of the .NET property on the user's CLR type.
Connectors are in charge of specifying what embedding types are supported (float, Half...), in order of priority. On the other hand, the VectorStoreRecordVectorPropertyModel is responsible for specifying which input types are supported (string, DataContent). This means that the user can simply have a string property on their record type, have an IEmbeddingGenerator registered in DI, and as long as that generator implements IEmbeddingGenerator for string inputs, and for an embedding output that's supported by their connector, everything will just work.
The default VectorStoreRecordVectorPropertyModel supports string and DataContent, which are the "base" types expected to be implemented by actual embedding generation services. To support arbitrary input types, the user must use the newly-introduced generic VectorStoreRecordVectorPropertyModel<TInput> (the generic part is necessary to make everything work in a trimming-safe manner). This allows arbitrary user types to be used for vector properties, and as long as the correct IEmbeddingGenerator is registered, everything will work.

Various notes

Various bits here are going to looking a bit strange - because I've made sure to make everything compatible with trimming and NativeAOT. In older times I'd have passed around a bunch of Types for embedding types and done reflection to check what interfaces they implement - but we can't do that. This makes the code more a bit more convoluted.
There's also definitely potential for deduplicating the added code. However, this is more complicated than it looks - I've left this for a later cleanup.
Azure AI Search does not currently support IEmbeddingGenerator; this is because it uses an Azure API to directly handle the .NET CLR type, so we have no place to integrate the embedding generation. We should explore this in the future.
- However, Azure AI Search's built-in embedding generation is of course still supported via SearchAsync (though this is text-only, not sure what happens with e.g. images).
InMemory required extra work: we now have a wrapper around the user POCO, which holds the generated embeddings in a separate Dictionary. This is because when using an IEmbeddingGenerator, the user POCO only has a source property (e.g. string), and we need to put the generated embeddings somewhere.
Added VectorDataStrings.resx for exception messages; these are checked in the conformance tests.
Made the SupportedTypes properties on VectorStoreRecordModelBuildingOptions non-nullable, which required cleaning up how connectors populate these.

Closes #10492
Closes #11527

dotnet/samples/Concepts/Memory/VectorStoreEmbeddingGeneration/GenerateTextEmbeddingAttribute.cs

roji · 2025-04-23T07:16:47Z

dotnet/samples/Concepts/Search/VectorStore_TextSearch.cs

@@ -1,5 +1,7 @@
 // Copyright (c) Microsoft. All rights reserved.

+#if DISABLED


@markwallace-microsoft @westey-m ITextSearch over MEVD will also need to be updated to react to this change. This should make things simpler: you currently need to supply an IEmbeddingGenerationService to VectorStoreTextSearch, but after this PR all IVectorStoreRecordCollections themselves support embedding generation - if properly configure.

So I think we can simply remove embedding generation as a concern of the ITextSearch implementation, and simply call the new IVectorSearch.SearchAsync, passing the text; as long as the user has properly configured the store with an IEmbeddingGeneration, everything should just work out of the box.

I'm leaving relevant samples commented out via #if DISABLED for now.

Shouldn't we just use Skip in the facts or does it introduces compilation errors? If the second maybe Exclude from project with the comment in the .csproj xml would be better.

@roji, @markwallace-microsoft, The textsearch implementations and features still work as is though right? Is the intention to Disable these as a marker to update later? Not sure I like the approach of using #if, since we do want these to still be compiled and work, since some users are currently using this.

OK, I've done another pass and re-enabled all VectorStoreTextSearch functionality and tests, adapting them to the new embedding generator support in this PR; that work is in a separate commit in this PR for easier reviewing.

Please take a look... I'm proposing to obsolete the functionality of handling embedding generation in VectorStoreTextSearch, as we've now pushed that concern down into MEVD itself. So rather than creating a VectorStoreTextSearch wrapper over an MEVD IVectorizedSearch and supplying an IEmbeddingGenerationService, users can simply configure an IEmbeddingGenerator with their MEVD collection, and VectorStoreTextSearch can just call SearchAsync and be oblivious of embedding generation.

Let me know what you think and if you're with things as they're proposed.

Looks good, thanks!

...ors/Connectors.AzureCosmosDBNoSQL.UnitTests/AzureCosmosDBNoSQLDynamicDataModelMapperTests.cs

dotnet/src/Connectors/Connectors.InMemory.UnitTests/InMemoryVectorStoreRecordCollectionTests.cs

roji · 2025-04-23T07:30:30Z

...rs/Connectors.Memory.AzureCosmosDBMongoDB/AzureCosmosDBMongoDBVectorStoreRecordCollection.cs

+
+            // TODO: Ideally we'd group together vector properties using the same generator (and with the same input and output properties),
+            // and generate embeddings for them in a single batch. That's some more complexity though.
+            if (vectorProperty.TryGenerateEmbedding<TRecord, Embedding<float>, ReadOnlyMemory<float>>(record, cancellationToken, out var floatTask))


This is the somewhat tricky part of the design, and the one that makes it a bit difficult to refactor logic into one centralized location (though something should be possible). Possibly look at how SearchAsync is implemented below first (UpsertAsync adds another layer of complexity above it).

First, the connector is the one who knows which embedding types are supported; when an embedding generator supports multiple embedding types (which I assume most do), the connector also determines the order of priority.

Now, since we have a non-generic IEmbeddingGenerator (what the user configured or we got from DI), we need to down-cast it to the appropriate generic IEmbeddingGenerator<TInput, TEmbedding>. We can't use reflection to enumerate the interfaces implemented by the generator, since that's not trimming compatible. So the connector performs multiple calls to TryGenerateEmbedding on the property, once for each supported embedding type. Inside TryGenerateEmbedding, the VectorPropertyModel "adds" the knowledge of the TInput side (e.g. string), and performs the actual check (is IEmbeddingGenerator<string, TEmbedding>). Note that TInput can be a completely arbitrary user type.

This design allows locating and downcasting to the correct generic IEmbeddingGenerator interface, given that the two type parameters (TInput, TEmbedding) are determined by different places (the former from the vector property configured by the user, the latter from the connector).

roji · 2025-04-23T07:35:11Z

...rs/Connectors.Memory.AzureCosmosDBMongoDB/AzureCosmosDBMongoDBVectorStoreRecordCollection.cs

@@ -278,7 +386,10 @@ public async IAsyncEnumerable<VectorSearchResult<TRecord>> VectorizedSearchAsync
        };

        var searchOptions = options ?? s_defaultVectorSearchOptions;
-        var vectorProperty = this._model.GetVectorPropertyOrSingle(searchOptions);
+        if (searchOptions.IncludeVectors && this._model.VectorProperties.Any(p => p.EmbeddingGenerator is not null))


@westey-m at some point we discussed various advanced scenarios, where users do embedding generation but still want to be able to access the embedding; this check disables that for now, we can add it in the future.

dotnet/src/Connectors/Connectors.Memory.InMemory/InMemoryVectorStore.cs

roji · 2025-04-23T08:41:55Z

.../Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordVectorPropertyModel.cs

+    /// The implementation on this non-generic <see cref="VectorStoreRecordVectorPropertyModel"/> checks for <see cref="string"/>
+    /// and <see cref="DataContent"/> as input types for <see cref="EmbeddingGenerator"/>.
+    /// </summary>
+    public virtual bool TrySetupEmbeddingGeneration<TEmbedding, TUnwrappedEmbedding>(IEmbeddingGenerator embeddingGenerator, Type? embeddingType)


See previous comments on how this all works. Note that this method gets overridden by the generic subclass of VectorStoreRecordVectorPropertyModel - that's how we can support arbitrary user input types.

Note that if we transition to using Embedding as the representation for raw embeddings (we currently use e.g. ReadOnlyMemory<float>), TUnwrappedEmbedding can go away.

roji · 2025-04-23T08:45:11Z

dotnet/src/Connectors/VectorData.Abstractions/Properties/AssemblyInfo.cs

@@ -0,0 +1,3 @@
+// Copyright (c) Microsoft. All rights reserved.
+
+[assembly: System.Resources.NeutralResourcesLanguage("en-US")]


Added this because of a compiler warning, since this PR adds string resources.

subjective: we could add <NeutralLanguage>en-US</NeutralLanguage> to the project file instead of creating a new file

No strong feeling either way... I'm used to having AssemblyInfo.cs forr assembly-wide attributes but am good for doing it via the csproj (is this what you see in e.g. runtime projects?)

roji · 2025-04-23T08:45:56Z

dotnet/src/Connectors/VectorData.Abstractions/Properties/VectorDataStrings.resx

+    <resheader name="writer">
+        <value>System.Resources.ResXResourceWriter, System.Windows.Forms, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</value>
+    </resheader>
+    <data name="IncompatibleEmbeddingGenerator" xml:space="preserve">


The idea here is to start standardizing error strings across providers, and to enforce the right message in tests a well.

Closes microsoft#10492 Closes microsoft#11527

...t/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordVectorProperty.cs

dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorizedSearch.cs

dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs

dotnet/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordDefinition.cs

.../Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordVectorPropertyModel.cs

westey-m · 2025-04-23T16:57:50Z

dotnet/src/Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordModelBuilder.cs

+                vectorProperty.Dimensions = vectorAttribute.Dimensions;
+                vectorProperty.IndexKind = vectorAttribute.IndexKind;
+                vectorProperty.DistanceFunction = vectorAttribute.DistanceFunction;


Shouldn't this only be done if no definition property was available? Definition should always override attribute, but here we may be overriding the definition with the attribute values.

Yeah, I can see how this is a bit confusing (and maybe in need of cleanup). After we process the .NET property with the attributes (which is what's done here), we then proceed to process the record definition, and will overwrite whatever's done here once again.

This is how it was done before this PR as well (and AFAICT everything should work correctly), but now we also use VectorStoreRecordVectorProperty to actually create the VectorPropertyModel, so it gets a bit more convoluted. Note that we still must take the attribute settings into account, since the record definition might not define e.g. the DistanceFunction at all...

Maybe it would be better for CreatePropertyModel to only instantiate the model, and not to populate the index kind, distance function, etc. This way we've got the instantiation step (CreatePropertyModel), then the attribute settings get set, and finally the record definition. Though it would seem a bit odd for CreatePropertyModel() to not actually set any of these things on the property model it returns...

All this is completely internal and can be cleaned up later (and there's indeed cleanup to be done here regardless...) - let me know if you think this is important enough to do now or whether it's OK to defer.

The way I was thinking about this was that we have two potential cases:

Dimensions, IndexKind and DistanceFunction is not used until after we process the record definition, in which case we don't need to set them here.

Dimensions, IndexKind and DistanceFunction is used before we process the record definition, in which case we will be using the values from the attributes instead of the overrides from the definition.

Aside, we should still clarify how overriding behaves for property types. E.g. do we allow overriding on a setting by setting basis, or only the entire Property with all it's settings in one go. I lean towards the second, since it allows unsetting properties via null.

Dimensions, IndexKind and DistanceFunction is used before we process the record definition, in which case we will be using the values from the attributes instead of the overrides from the definition.

I don't think there's any case where we "use" any of these settings before both attributes and record definitions have been fully processed and applied - are you seeing something else?

Other than that, yeah - the current implementation indeed allows overriding on a setting-by-setting basis, meaning that unsetting via null isn't possible.

The EF solution to this problem is to have Fluent API with builders rather than a record definition type. So instead of passing in a VectorStoreRecordProperty (and then there's the question of whether null overrides, etc.), users invoke APIs for each setting. The MEVD equivalent would look something like the following:

recordBuilder .HasKeyProperty("id") .HasVectorProperty("embedding", vb => vb .HasStorageName("Emb") .HasDimensions(3) .UseSimilarityFunction(null)); // This unsets what was set by the attribute // Since the above doesn't mention e.g. the index type, whatever was set by the attribute is untouched

Apart from allowing very clean merging of configuration with the attributes, this also aligns nicely with the way DI/ASP.NET and modern .NET config generally works. I've thought about this several times already, but have not raised it as we already have a lot going on.

Assuming we don't do the above fluent/builder API... I understand the desire to have the record definition fully override the attributes (in order to allow unsetting via null). I can see some people who may want to merge configuration from both, e.g. some universal config via attributes, and then different config for different connectors via record definitions (but that's likely an edge case, and they can just do everything via record definitions instead).

In any case, the current merging behavior was already there before this PR - can you open an issue for us to change it after this PR?

dotnet/src/Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordModelBuilder.cs

westey-m · 2025-04-23T17:29:43Z

dotnet/src/Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordModelBuilder.cs

-                if (this.Options.SupportedVectorPropertyTypes is not null)
+                Debug.Assert(vectorProperty.EmbeddingGenerator is null ^ vectorProperty.Type != vectorProperty.EmbeddingType);
+
+                if (!this.Options.SupportedVectorPropertyTypes.Contains(vectorProperty.EmbeddingType))


Are we missing a check here to also see if the embedding generator is not null?

I'm not sure... EmbeddingType contains the type of the embedding (e.g. ReadOnlyMemory<float> regardless of whether an embedding generator is configured or not, so if that's not in the supported types list, we need to throw in any case (just like the validation for data/key properties).

But there's indeed the question of the wording of the exception - for key/data properties we can just say "this type isn't supported", whereas for vector properties there'll be the case where users try to have a string, but forget to configure the embedding generator...

Let me know if you think this can be improved...

Yeah, the error message doesn't match the check here: Property '{0}' has non-Embedding type '{1}', but no embedding generator is configured.. My expectation here is that we want to alert folks who are using a non-embedding vector property, e.g. type string, that they need to provide an embedding generator otherwise non-embedding vectors is not supported.

And yes, for other property types, and for vector types (both traditional and the output type for a non-embedding vector) we should have a separate error message that says they are not support if they don't match our allow list.

Ah, I see now. I've modified the code accordingly, take a look.

Do we want to throw for the .EmbeddingGenerator is null else case?

Maybe I'm misunderstanding something, but doesn't this block our main scenario, where someone has defined their type as string, and they have provided an EmbeddingGenerator that can take string as input.

AFAICT we only want to throw if someone provided a non-vector type and they didn't provide an embedding generator somewhere, since this is the only scenario where we should generate embeddings, and of course then we cannot.

I'm assuming EmbeddingType is string in the following case right?

[VectorStoreVectorProperty(1536)] string Embedding {get; set;}

If it is, the second error doesn't make sense, but if it's ReadonlyMemory then the first error doesn't make any sense.

No, EmbeddingType always represents the output/embedding type, so it's ReadOnlyMemory<float> in this case - the regular Type is string (that represents the input/.NET type), whereas Embedding type is the output.

Does it make more sense now?

As discussed, EmbeddingType can actually be string or ReadonlyMemory<float> in this case, depending on whether an EmbeddingGenerator was defined (assuming the EmbeddingGenerator output type is ReadonlyMemory<float>, and then this code does make sense.

dotnet/SK-dotnet.sln

dotnet/src/VectorDataIntegrationTests/PostgresIntegrationTests/PostgresIntegrationTests.csproj

dotnet/src/VectorDataIntegrationTests/QdrantIntegrationTests/QdrantEmbeddingGenerationTests.cs

westey-m · 2025-04-24T16:12:14Z

...src/VectorDataIntegrationTests/RedisIntegrationTests/RedisHashSetEmbeddingGenerationTests.cs

+        public override Func<IServiceCollection, IServiceCollection>[] DependencyInjectionStoreRegistrationDelegates =>
+        [
+            // TODO: This doesn't work because if a RedisVectorStoreOptions is provided (and it needs to be for HashSet), the embedding generator
+            // isn't looked up in DI. The options are also immutable so we can't inject an embedding generator into them.


I think that immutability is irrelevant here. Even if the options were mutable, I'm not sure we should be updating them. We aren't doing it anywhere else after all. If the user provided options without an embedding generator, we can only assume they wanted options without an embedding generator. Assuming anything else is probably an assumption too far.

I can see that... Though if they've specifically registered an IEmbeddingGenerator in DI, it seems like there's a pretty explicit intention is for that to get used by MEVD (we'd obviously not overwrite a non-null generator in the user-provided options - only null).

Another related thought is that if they don't want embedding generation, their properties are ReadOnlyMemory<float> (and not string), so whether the embedding generator gets picked up or not has no significance (it'd never get used). So a long as we only overwrite null it may be a bit academic...

One last way to think about it: if IEmbeddingGenerator happened to be a separate parameter on the VectorStore constructor (and not part of the options bag), people would definitely expect it to get injected; so it's a bit odd that we'd refrain from doing so only because it happens to be part of the option bag.

Anyway, something to discuss later (possibly in #11715, will make a note there).

.../Connectors/Connectors.Memory.AzureCosmosDBNoSQL/AzureCosmosDBNoSQLDynamicDataModelMapper.cs

dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresVectorStoreRecordCollection.cs

adamsitnik

I have added two design question, namely:

should IVectorSearch be extended with public IEmbeddingGenerator? EmbeddingGenerator { get; } property?
Shoud SearchAsync method allow for proving IEmbeddingGenerator instance?

dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs

roji · 2025-04-25T13:00:04Z

@adamsitnik

From the pure coding perspective, I would like to request changes in the regard of InMemoryCollection: the static cache should be removed.

As discussed offline, I'm removing the static cache and in general the (new) ability to specify a store name/share data between multiple InMemoryVectorStore instances. This very slightly reduces InMemory test coverage but nothing meaningful - thanks again for insisting on this, I believe it's the better way.

On the other stuff:

From the usability perspective, I am not sure about how we should approach the problem of storing both the input value and its embedding (generated by the generator on the fly).

First of all, it seems to me that it makes no sense to perist only one of them (as there is no easy way of reversing the process).

We just had a design discussion with you and @eavanvalkenburg, I think we more or less reached a consensus, and I summarized the discussion and possible future work in #11734 and #11736. If my understanding is correct, we're OK to proceed with the approach in this PR, and think that the possible improvements can be added in the future (without breaking).

roji · 2025-04-26T17:48:28Z

@westey-m @adamsitnik I've pushed more commits and addressed all outstanding comments. Note that there's a specific separate commit which does extensive work on VectorStoreTextSearch (see #11682 (comment)).

adamsitnik

LGTM, thank you for addressing my feedback and concerns @roji !

dotnet/src/Connectors/Connectors.Memory.InMemory/InMemoryVectorStoreRecordCollection.cs

dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteVectorStoreRecordCollection.cs

adamsitnik · 2025-04-27T20:50:51Z

.../Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordVectorPropertyModel.cs

+    /// and <see cref="DataContent"/> as input types for <see cref="EmbeddingGenerator"/>.
+    /// </para>
+    /// </remarks>
+    public virtual bool TryGenerateEmbedding<TRecord, TEmbedding, TUnwrappedEmbedding>(TRecord record, CancellationToken cancellationToken, [NotNullWhen(true)] out Task<TEmbedding>? task)


For me it would be fine to have one helper for float, another for float and double and what we have now in the connectors that support more types.

It's fine to postpone it to next PR/release.

dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs

roji · 2025-04-28T10:38:02Z

FYI pushed another commit to address comments. I also did a quick pass to make sure all the options are used and coalesced correctly in the search APIs, as suggested by @adamsitnik above.

TsengSR · 2025-05-20T21:22:54Z

@roji

Sorry to butt in, but what is the supposed migration path if we want to keep generating the embeddings as vectorized and not use the automatic embedding generation?

The migration docs only metion the replacement, but shows no concrete example of how.

Assuming I have

        ITextEmbeddingGenerationService embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

        ReadOnlyMemory<float> embeddings = await embeddingGenerator.GenerateEmbeddingAsync(term, kernel, HttpContext.RequestAborted);
        VectorSearchResults<VectorModels.Employee> empoyeesResult = await employeeCollection.VectorizedSearchAsync(embeddings, 10);

And want to keep it that way with IEmbeddingGenerator<string, Embedding<float>> w/o using the automatic embeddings ? previously I used to pass IReadOnlyMemory<float> to the VectorizedSearchAsync (now SearchAsync)?

I'm in the middle of upgrading and want do as little changes as possible to not change behavior until I can test out the new embeddings generator stuff. Neiter April 2025 nor the May 2025 vector store changes guides outline that case specifically.

roji · 2025-05-21T05:40:35Z

@TsengSR you can pass embeddings (e.g. as ReadOnlyMemory<float>) directly to the new SearchAsync method - this is covered in this section of the May 2025 notes.

TsengSR · 2025-05-21T07:27:28Z

@roji

@TsengSR you can pass embeddings (e.g. as ReadOnlyMemory<float>) directly to the new SearchAsync method - this is covered in this section of the May 2025 notes.

Yea that is how it roked before, but GenerateAsyncin the new api returns Embeddings<float>. Does it work with that too, since it's a type residing in Microsoft.Extensions.Ai? The new usage is confusing at best, since the methods aren't strong typed anymore (Search<T> not being constraint to something)

westey-m · 2025-05-21T09:35:51Z

@roji

@TsengSR you can pass embeddings (e.g. as ReadOnlyMemory<float>) directly to the new SearchAsync method - this is covered in this section of the May 2025 notes.

Yea that is how it roked before, but GenerateAsyncin the new api returns Embeddings<float>. Does it work with that too, since it's a type residing in Microsoft.Extensions.Ai? The new usage is confusing at best, since the methods aren't strong typed anymore (Search<T> not being constraint to something)

@TsengSR, All vector stores now support Embedding, ReadonlyMemory and float[] both for Upsert and for Search. Some implementations also support additional types. The documentation pages for each vector store has been updated to reflect the newly supported list of vector types.
See e.g. https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/out-of-the-box-connectors/azure-ai-search-connector?pivots=programming-language-csharp

markwallace-microsoft added .NET kernel memory labels Apr 22, 2025

github-actions bot changed the title ~~Integrate Embedding and IEmbeddingGenerator~~ .Net: Integrate Embedding and IEmbeddingGenerator Apr 22, 2025

roji had a problem deploying to integration April 22, 2025 23:04 — with GitHub Actions Failure

roji force-pushed the EmbeddingGeneration branch from d903a35 to dbb1628 Compare April 23, 2025 07:08

roji had a problem deploying to integration April 23, 2025 07:09 — with GitHub Actions Failure

roji force-pushed the EmbeddingGeneration branch from dbb1628 to 525a135 Compare April 23, 2025 07:55

roji temporarily deployed to integration April 23, 2025 07:55 — with GitHub Actions Inactive

roji commented Apr 23, 2025

View reviewed changes

roji marked this pull request as ready for review April 23, 2025 08:46

roji requested a review from a team as a code owner April 23, 2025 08:46

roji requested review from adamsitnik, dmytrostruk and westey-m April 23, 2025 08:46

Integrate Embedding and IEmbeddingGenerator

298b8f4

Closes microsoft#10492 Closes microsoft#11527

roji force-pushed the EmbeddingGeneration branch from 525a135 to 298b8f4 Compare April 23, 2025 08:49

roji temporarily deployed to integration April 23, 2025 08:50 — with GitHub Actions Inactive