I've noticed an issue with the way collections are structured in Chroma DB that makes data retrieval less efficient and more complex than it needs to be. When I retrieve a collection, I expect a collection of entities, but instead, I get many collections of entity components.
Here's an example of how I currently have to retrieve ids and some metadata from a collection:
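result = collection.get()  # fetch once; the result holds parallel lists, one per field
for x in range(len(result["ids"])):
    item_id = result["ids"][x]  # renamed from `id` to avoid shadowing the builtin
    metadata = result["metadatas"][x]
    source = str(x) + "-" + item_id + "-" + metadata["title"]
    print(source)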
This approach is not ideal from a syntactic point of view, and possibly from a performance perspective as well, because to project some features of an item, I need to retrieve the whole collection, then grab some items according to the ordinal position.
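For comparison, here is a minimal sketch of a client-side workaround that turns the columnar result into one dict per item. The helper name `iter_records`, the singular row keys, and the hand-written `result` dict (standing in for what `collection.get()` returns) are assumptions for illustration, not part of Chroma.

```python
from typing import Any, Dict, Iterator

# Hypothetical mapping from columnar keys to per-item keys (not a Chroma API).
FIELDS = {"metadatas": "metadata", "documents": "document", "embeddings": "embedding"}


def iter_records(result: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per item from a result made of parallel lists."""
    # Keep only the columns that are actually present (not missing, not None).
    present = {
        row_key: result[col_key]
        for col_key, row_key in FIELDS.items()
        if result.get(col_key) is not None
    }
    for position, item_id in enumerate(result["ids"]):
        record = {"id": item_id}
        for row_key, values in present.items():
            record[row_key] = values[position]
        yield record


# Hand-written stand-in for the dict returned by collection.get().
result = {
    "ids": ["a1", "b2"],
    "embeddings": None,
    "metadatas": [{"title": "First"}, {"title": "Second"}],
    "documents": ["doc one", "doc two"],
}

for position, record in enumerate(iter_records(result)):
    print(f"{position}-{record['id']}-{record['metadata']['title']}")
```

This still fetches the whole result before reshaping it, so it only addresses the syntax, not the possible performance cost.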
Conceptually, it feels like going to a car dealership to choose a car, but instead of seeing complete cars, you’re shown all the doors in one place and all the wheels in another. In the end, you can’t mix and match parts—you still have to choose items that belong to the same car.
I’m aware that it’s possible to decide whether to include embeddings or filter against features, but this doesn’t fully address the issue. I believe a more intuitive and efficient approach would be to structure collections as collections of entities, rather than collections of entity components.
Has anyone else experienced this issue, or can anyone provide insight into why the data structure is designed this way?
I think this is a comment on a row-based vs column-based return format.
The main reason Chroma works this way is that at ingest time most users already have a columnar data structure, since that's how the embeddings are generated. Rather than munging that into a row format, the thought was that it would be nice if it could be dumped directly into Chroma. We felt it would be a bit odd to accept columnar inputs but return row-based outputs.
I think this has been raised a couple of times: #282, #420.
We are open to ideas here! We just think it's important that we are consistent.
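To make the columnar-versus-row distinction concrete, here is a small sketch of the two shapes and a transpose between them. The helper `records_to_columns` and the sample data are hypothetical. The keyword names `ids`, `documents` and `metadatas` mirror the ones `collection.add` takes, but the call itself is only shown in a comment so the example runs without a Chroma installation.

```python
from typing import Any, Dict, List

# Columnar shape: parallel lists, typically what an embedding pipeline produces.
texts = ["doc one", "doc two"]
titles = ["First", "Second"]
columns = {
    "ids": [f"doc-{i}" for i in range(len(texts))],
    "documents": texts,
    "metadatas": [{"title": title} for title in titles],
}
# With a real collection, this could be passed straight through:
#     collection.add(**columns)

# Row shape: one dict per item, which is what the issue asks to get back.
records = [
    {"id": "doc-0", "document": "doc one", "metadata": {"title": "First"}},
    {"id": "doc-1", "document": "doc two", "metadata": {"title": "Second"}},
]


def records_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Transpose row-shaped records into parallel lists (hypothetical helper)."""
    return {
        "ids": [row["id"] for row in rows],
        "documents": [row["document"] for row in rows],
        "metadatas": [row["metadata"] for row in rows],
    }


print(records_to_columns(records) == columns)
```

Accepting rows on input would require this transpose somewhere, which is the consistency trade-off mentioned above.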
Describe the proposed solution
Seems like a proposal has been made:
https://github.com/amikos-tech/chroma-go/blob/main/types/record.go
The solution should be as simple as a standard dictionary retrieval pattern.
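A hedged sketch of what a dictionary-style retrieval pattern could look like on the client side follows. The helper `index_by_id` is hypothetical and not part of Chroma, and the `result` dict is a hand-written stand-in for the output of `collection.get()`.

```python
from typing import Any, Dict


def index_by_id(result: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Build {id: {"metadata": ..., "document": ...}} from parallel lists."""
    ids = result["ids"]
    # Columns that were not requested may be None; pad them so zip lines up.
    metadatas = result.get("metadatas")
    if metadatas is None:
        metadatas = [None] * len(ids)
    documents = result.get("documents")
    if documents is None:
        documents = [None] * len(ids)
    return {
        item_id: {"metadata": metadata, "document": document}
        for item_id, metadata, document in zip(ids, metadatas, documents)
    }


# Hand-written stand-in for the dict returned by collection.get().
result = {
    "ids": ["a1", "b2"],
    "metadatas": [{"title": "First"}, {"title": "Second"}],
    "documents": None,
}

items = index_by_id(result)

# Iterate over whole entities, or look one up directly by its id.
for item_id, item in items.items():
    print(item_id, item["metadata"]["title"])
print(items["b2"]["metadata"]["title"])
```

Each entity is then addressed by key rather than by ordinal position across several lists.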
Alternatives considered
No response
Importance
I cannot use Chroma without it
Additional Information
No response