-
Notifications
You must be signed in to change notification settings - Fork 3.9k
.Net: Integrate IEmbeddingGenerator #11682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.Net: Integrate IEmbeddingGenerator #11682
Conversation
d903a35
to
dbb1628
Compare
dbb1628
to
525a135
Compare
dotnet/samples/Concepts/Memory/VectorStoreEmbeddingGeneration/GenerateTextEmbeddingAttribute.cs
Show resolved
Hide resolved
@@ -1,5 +1,7 @@ | |||
// Copyright (c) Microsoft. All rights reserved. | |||
|
|||
#if DISABLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@markwallace-microsoft @westey-m ITextSearch over MEVD will also need to be updated to react to this change. This should make things simpler: you currently need to supply an IEmbeddingGenerationService to VectorStoreTextSearch, but after this PR all IVectorStoreRecordCollections themselves support embedding generation - if properly configure.
So I think we can simply remove embedding generation as a concern of the ITextSearch implementation, and simply call the new IVectorSearch.SearchAsync
, passing the text; as long as the user has properly configured the store with an IEmbeddingGeneration, everything should just work out of the box.
I'm leaving relevant samples commented out via #if DISABLED
for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just use Skip
in the facts or does it introduces compilation errors? If the second maybe Exclude
from project with the comment in the .csproj
xml would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@roji, @markwallace-microsoft, The textsearch implementations and features still work as is though right? Is the intention to Disable these as a marker to update later? Not sure I like the approach of using #if, since we do want these to still be compiled and work, since some users are currently using this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've done another pass and re-enabled all VectorStoreTextSearch functionality and tests, adapting them to the new embedding generator support in this PR; that work is in a separate commit in this PR for easier reviewing.
Please take a look... I'm proposing to obsolete the functionality of handling embedding generation in VectorStoreTextSearch, as we've now pushed that concern down into MEVD itself. So rather than creating a VectorStoreTextSearch wrapper over an MEVD IVectorizedSearch and supplying an IEmbeddingGenerationService, users can simply configure an IEmbeddingGenerator with their MEVD collection, and VectorStoreTextSearch can just call SearchAsync and be oblivious of embedding generation.
Let me know what you think and if you're with things as they're proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks!
...ors/Connectors.AzureCosmosDBNoSQL.UnitTests/AzureCosmosDBNoSQLDynamicDataModelMapperTests.cs
Show resolved
Hide resolved
dotnet/src/Connectors/Connectors.InMemory.UnitTests/InMemoryVectorStoreRecordCollectionTests.cs
Show resolved
Hide resolved
|
||
// TODO: Ideally we'd group together vector properties using the same generator (and with the same input and output properties), | ||
// and generate embeddings for them in a single batch. That's some more complexity though. | ||
if (vectorProperty.TryGenerateEmbedding<TRecord, Embedding<float>, ReadOnlyMemory<float>>(record, cancellationToken, out var floatTask)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the somewhat tricky part of the design, and the one that makes it a bit difficult to refactor logic into one centralized location (though something should be possible). Possibly look at how SearchAsync is implemented below first (UpsertAsync adds another layer of complexity above it).
First, the connector is the one who knows which embedding types are supported; when an embedding generator supports multiple embedding types (which I assume most do), the connector also determines the order of priority.
Now, since we have a non-generic IEmbeddingGenerator (what the user configured or we got from DI), we need to down-cast it to the appropriate generic IEmbeddingGenerator<TInput, TEmbedding>
. We can't use reflection to enumerate the interfaces implemented by the generator, since that's not trimming compatible. So the connector performs multiple calls to TryGenerateEmbedding on the property, once for each supported embedding type. Inside TryGenerateEmbedding, the VectorPropertyModel "adds" the knowledge of the TInput side (e.g. string), and performs the actual check (is IEmbeddingGenerator<string, TEmbedding>
). Note that TInput can be a completely arbitrary user type.
This design allows locating and downcasting to the correct generic IEmbeddingGenerator interface, given that the two type parameters (TInput, TEmbedding) are determined by different places (the former from the vector property configured by the user, the latter from the connector).
@@ -278,7 +386,10 @@ public async IAsyncEnumerable<VectorSearchResult<TRecord>> VectorizedSearchAsync | |||
}; | |||
|
|||
var searchOptions = options ?? s_defaultVectorSearchOptions; | |||
var vectorProperty = this._model.GetVectorPropertyOrSingle(searchOptions); | |||
if (searchOptions.IncludeVectors && this._model.VectorProperties.Any(p => p.EmbeddingGenerator is not null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westey-m at some point we discussed various advanced scenarios, where users do embedding generation but still want to be able to access the embedding; this check disables that for now, we can add it in the future.
/// The implementation on this non-generic <see cref="VectorStoreRecordVectorPropertyModel"/> checks for <see cref="string"/> | ||
/// and <see cref="DataContent"/> as input types for <see cref="EmbeddingGenerator"/>. | ||
/// </summary> | ||
public virtual bool TrySetupEmbeddingGeneration<TEmbedding, TUnwrappedEmbedding>(IEmbeddingGenerator embeddingGenerator, Type? embeddingType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comments on how this all works. Note that this method gets overridden by the generic subclass of VectorStoreRecordVectorPropertyModel - that's how we can support arbitrary user input types.
Note that if we transition to using Embedding as the representation for raw embeddings (we currently use e.g. ReadOnlyMemory<float>
), TUnwrappedEmbedding can go away.
@@ -0,0 +1,3 @@ | |||
// Copyright (c) Microsoft. All rights reserved. | |||
|
|||
[assembly: System.Resources.NeutralResourcesLanguage("en-US")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this because of a compiler warning, since this PR adds string resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subjective: we could add <NeutralLanguage>en-US</NeutralLanguage>
to the project file instead of creating a new file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong feeling either way... I'm used to having AssemblyInfo.cs forr assembly-wide attributes but am good for doing it via the csproj (is this what you see in e.g. runtime projects?)
<resheader name="writer"> | ||
<value>System.Resources.ResXResourceWriter, System.Windows.Forms, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</value> | ||
</resheader> | ||
<data name="IncompatibleEmbeddingGenerator" xml:space="preserve"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is to start standardizing error strings across providers, and to enforce the right message in tests a well.
525a135
to
298b8f4
Compare
...t/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordVectorProperty.cs
Outdated
Show resolved
Hide resolved
...t/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordVectorProperty.cs
Outdated
Show resolved
Hide resolved
...t/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordVectorProperty.cs
Show resolved
Hide resolved
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorizedSearch.cs
Outdated
Show resolved
Hide resolved
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs
Outdated
Show resolved
Hide resolved
dotnet/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordDefinition.cs
Outdated
Show resolved
Hide resolved
dotnet/src/Connectors/VectorData.Abstractions/RecordDefinition/VectorStoreRecordDefinition.cs
Outdated
Show resolved
Hide resolved
.../Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordVectorPropertyModel.cs
Outdated
Show resolved
Hide resolved
vectorProperty.Dimensions = vectorAttribute.Dimensions; | ||
vectorProperty.IndexKind = vectorAttribute.IndexKind; | ||
vectorProperty.DistanceFunction = vectorAttribute.DistanceFunction; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this only be done if no definition property was available? Definition should always override attribute, but here we may be overriding the definition with the attribute values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can see how this is a bit confusing (and maybe in need of cleanup). After we process the .NET property with the attributes (which is what's done here), we then proceed to process the record definition, and will overwrite whatever's done here once again.
This is how it was done before this PR as well (and AFAICT everything should work correctly), but now we also use VectorStoreRecordVectorProperty to actually create the VectorPropertyModel, so it gets a bit more convoluted. Note that we still must take the attribute settings into account, since the record definition might not define e.g. the DistanceFunction at all...
Maybe it would be better for CreatePropertyModel to only instantiate the model, and not to populate the index kind, distance function, etc. This way we've got the instantiation step (CreatePropertyModel), then the attribute settings get set, and finally the record definition. Though it would seem a bit odd for CreatePropertyModel() to not actually set any of these things on the property model it returns...
All this is completely internal and can be cleaned up later (and there's indeed cleanup to be done here regardless...) - let me know if you think this is important enough to do now or whether it's OK to defer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I was thinking about this was that we have two potential cases:
- Dimensions, IndexKind and DistanceFunction is not used until after we process the record definition, in which case we don't need to set them here.
- Dimensions, IndexKind and DistanceFunction is used before we process the record definition, in which case we will be using the values from the attributes instead of the overrides from the definition.
Aside, we should still clarify how overriding behaves for property types. E.g. do we allow overriding on a setting by setting basis, or only the entire Property
with all it's settings in one go. I lean towards the second, since it allows unsetting properties via null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dimensions, IndexKind and DistanceFunction is used before we process the record definition, in which case we will be using the values from the attributes instead of the overrides from the definition.
I don't think there's any case where we "use" any of these settings before both attributes and record definitions have been fully processed and applied - are you seeing something else?
Other than that, yeah - the current implementation indeed allows overriding on a setting-by-setting basis, meaning that unsetting via null isn't possible.
The EF solution to this problem is to have Fluent API with builders rather than a record definition type. So instead of passing in a VectorStoreRecordProperty (and then there's the question of whether null overrides, etc.), users invoke APIs for each setting. The MEVD equivalent would look something like the following:
recordBuilder
.HasKeyProperty("id")
.HasVectorProperty("embedding", vb => vb
.HasStorageName("Emb")
.HasDimensions(3)
.UseSimilarityFunction(null)); // This unsets what was set by the attribute
// Since the above doesn't mention e.g. the index type, whatever was set by the attribute is untouched
Apart from allowing very clean merging of configuration with the attributes, this also aligns nicely with the way DI/ASP.NET and modern .NET config generally works. I've thought about this several times already, but have not raised it as we already have a lot going on.
Assuming we don't do the above fluent/builder API... I understand the desire to have the record definition fully override the attributes (in order to allow unsetting via null). I can see some people who may want to merge configuration from both, e.g. some universal config via attributes, and then different config for different connectors via record definitions (but that's likely an edge case, and they can just do everything via record definitions instead).
In any case, the current merging behavior was already there before this PR - can you open an issue for us to change it after this PR?
dotnet/src/Connectors/VectorData.Abstractions/ConnectorSupport/VectorStoreRecordModelBuilder.cs
Outdated
Show resolved
Hide resolved
if (this.Options.SupportedVectorPropertyTypes is not null) | ||
Debug.Assert(vectorProperty.EmbeddingGenerator is null ^ vectorProperty.Type != vectorProperty.EmbeddingType); | ||
|
||
if (!this.Options.SupportedVectorPropertyTypes.Contains(vectorProperty.EmbeddingType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we missing a check here to also see if the embedding generator is not null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure... EmbeddingType contains the type of the embedding (e.g. ReadOnlyMemory<float>
regardless of whether an embedding generator is configured or not, so if that's not in the supported types list, we need to throw in any case (just like the validation for data/key properties).
But there's indeed the question of the wording of the exception - for key/data properties we can just say "this type isn't supported", whereas for vector properties there'll be the case where users try to have a string, but forget to configure the embedding generator...
Let me know if you think this can be improved...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the error message doesn't match the check here: Property '{0}' has non-Embedding type '{1}', but no embedding generator is configured.
. My expectation here is that we want to alert folks who are using a non-embedding vector property, e.g. type string, that they need to provide an embedding generator otherwise non-embedding vectors is not supported.
And yes, for other property types, and for vector types (both traditional and the output type for a non-embedding vector) we should have a separate error message that says they are not support if they don't match our allow list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see now. I've modified the code accordingly, take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to throw for the .EmbeddingGenerator is null else case?
Maybe I'm misunderstanding something, but doesn't this block our main scenario, where someone has defined their type as string, and they have provided an EmbeddingGenerator that can take string as input.
AFAICT we only want to throw if someone provided a non-vector type and they didn't provide an embedding generator somewhere, since this is the only scenario where we should generate embeddings, and of course then we cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming EmbeddingType is string in the following case right?
[VectorStoreVectorProperty(1536)]
string Embedding {get; set;}
If it is, the second error doesn't make sense, but if it's ReadonlyMemory then the first error doesn't make any sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, EmbeddingType always represents the output/embedding type, so it's ReadOnlyMemory<float>
in this case - the regular Type is string (that represents the input/.NET type), whereas Embedding type is the output.
Does it make more sense now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, EmbeddingType
can actually be string
or ReadonlyMemory<float>
in this case, depending on whether an EmbeddingGenerator
was defined (assuming the EmbeddingGenerator
output type is ReadonlyMemory<float>
, and then this code does make sense.
dotnet/src/VectorDataIntegrationTests/PostgresIntegrationTests/PostgresIntegrationTests.csproj
Outdated
Show resolved
Hide resolved
dotnet/src/VectorDataIntegrationTests/QdrantIntegrationTests/QdrantEmbeddingGenerationTests.cs
Outdated
Show resolved
Hide resolved
public override Func<IServiceCollection, IServiceCollection>[] DependencyInjectionStoreRegistrationDelegates => | ||
[ | ||
// TODO: This doesn't work because if a RedisVectorStoreOptions is provided (and it needs to be for HashSet), the embedding generator | ||
// isn't looked up in DI. The options are also immutable so we can't inject an embedding generator into them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that immutability is irrelevant here. Even if the options were mutable, I'm not sure we should be updating them. We aren't doing it anywhere else after all. If the user provided options without an embedding generator, we can only assume they wanted options without an embedding generator. Assuming anything else is probably an assumption too far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see that... Though if they've specifically registered an IEmbeddingGenerator in DI, it seems like there's a pretty explicit intention is for that to get used by MEVD (we'd obviously not overwrite a non-null generator in the user-provided options - only null).
Another related thought is that if they don't want embedding generation, their properties are ReadOnlyMemory<float>
(and not string), so whether the embedding generator gets picked up or not has no significance (it'd never get used). So a long as we only overwrite null it may be a bit academic...
One last way to think about it: if IEmbeddingGenerator happened to be a separate parameter on the VectorStore constructor (and not part of the options bag), people would definitely expect it to get injected; so it's a bit odd that we'd refrain from doing so only because it happens to be part of the option bag.
Anyway, something to discuss later (possibly in #11715, will make a note there).
.../Connectors/Connectors.Memory.AzureCosmosDBNoSQL/AzureCosmosDBNoSQLDynamicDataModelMapper.cs
Outdated
Show resolved
Hide resolved
.../Connectors/Connectors.Memory.AzureCosmosDBNoSQL/AzureCosmosDBNoSQLDynamicDataModelMapper.cs
Outdated
Show resolved
Hide resolved
dotnet/src/Connectors/Connectors.Memory.Postgres/PostgresVectorStoreRecordCollection.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added two design question, namely:
- should
IVectorSearch
be extended withpublic IEmbeddingGenerator? EmbeddingGenerator { get; }
property? - Shoud
SearchAsync
method allow for provingIEmbeddingGenerator
instance?
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs
Show resolved
Hide resolved
As discussed offline, I'm removing the static cache and in general the (new) ability to specify a store name/share data between multiple InMemoryVectorStore instances. This very slightly reduces InMemory test coverage but nothing meaningful - thanks again for insisting on this, I believe it's the better way. On the other stuff:
We just had a design discussion with you and @eavanvalkenburg, I think we more or less reached a consensus, and I summarized the discussion and possible future work in #11734 and #11736. If my understanding is correct, we're OK to proceed with the approach in this PR, and think that the possible improvements can be added in the future (without breaking). |
@westey-m @adamsitnik I've pushed more commits and addressed all outstanding comments. Note that there's a specific separate commit which does extensive work on VectorStoreTextSearch (see #11682 (comment)). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you for addressing my feedback and concerns @roji !
dotnet/src/Connectors/Connectors.Memory.InMemory/InMemoryVectorStoreRecordCollection.cs
Outdated
Show resolved
Hide resolved
dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteVectorStoreRecordCollection.cs
Outdated
Show resolved
Hide resolved
dotnet/src/Connectors/Connectors.Memory.Sqlite/SqliteVectorStoreRecordCollection.cs
Outdated
Show resolved
Hide resolved
/// and <see cref="DataContent"/> as input types for <see cref="EmbeddingGenerator"/>. | ||
/// </para> | ||
/// </remarks> | ||
public virtual bool TryGenerateEmbedding<TRecord, TEmbedding, TUnwrappedEmbedding>(TRecord record, CancellationToken cancellationToken, [NotNullWhen(true)] out Task<TEmbedding>? task) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me it would be fine to have one helper for float
, another for float
and double
and what we have now in the connectors that support more types.
It's fine to postpone it to next PR/release.
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs
Show resolved
Hide resolved
dotnet/src/Connectors/VectorData.Abstractions/VectorSearch/IVectorSearch.cs
Outdated
Show resolved
Hide resolved
FYI pushed another commit to address comments. I also did a quick pass to make sure all the |
Sorry to butt in, but what is the supposed migration path if we want to keep generating the embeddings as vectorized and not use the automatic embedding generation? The migration docs only metion the replacement, but shows no concrete example of how. Assuming I have ITextEmbeddingGenerationService embeddingGenerator = kernel.GetRequiredService<ITextEmbeddingGenerationService>();
ReadOnlyMemory<float> embeddings = await embeddingGenerator.GenerateEmbeddingAsync(term, kernel, HttpContext.RequestAborted);
VectorSearchResults<VectorModels.Employee> empoyeesResult = await employeeCollection.VectorizedSearchAsync(embeddings, 10); And want to keep it that way with I'm in the middle of upgrading and want do as little changes as possible to not change behavior until I can test out the new embeddings generator stuff. Neiter April 2025 nor the May 2025 vector store changes guides outline that case specifically. |
@TsengSR you can pass embeddings (e.g. as |
Yea that is how it roked before, but |
@TsengSR, All vector stores now support Embedding, ReadonlyMemory and float[] both for Upsert and for Search. Some implementations also support additional types. The documentation pages for each vector store has been updated to reflect the newly supported list of vector types. |
Adds support for using IEmbeddingGenerator to generate embeddings from arbitrary user property types.
IEmbeddingGenerator
for string inputs, and for an embedding output that's supported by their connector, everything will just work.VectorStoreRecordVectorPropertyModel<TInput>
(the generic part is necessary to make everything work in a trimming-safe manner). This allows arbitrary user types to be used for vector properties, and as long as the correct IEmbeddingGenerator is registered, everything will work.Various notes
Closes #10492
Closes #11527