This repository has been archived by the owner on May 5, 2023. It is now read-only.
Thanks for the suggestion! The thing is, I feel the current SOTA encoders that can handle multi-modal data (i.e., both text and images) are a bit behind text-only encoders, especially when working mainly with text.
One way of combining the stronger text encoders with image support is to store two embeddings per text: one from a text-only encoder and one from an image-capable encoder like CLIP. When working with text, the search would use the cleaner text-only embedding; when working with images, it would use the CLIP one.
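A minimal sketch of that two-embedding scheme, with hypothetical `text_encode`/`clip_encode` stand-ins (deterministic dummy vectors) in place of real models such as a sentence-transformers text encoder and CLIP:

```python
import numpy as np

# Hypothetical stand-ins for the two encoders. In practice these would be
# a text-only model (e.g. sentence-transformers) and a CLIP-style model;
# here we derive deterministic dummy vectors so the sketch is self-contained.
def text_encode(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

def clip_encode(item: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(("clip", item))) % (2**32))
    v = rng.standard_normal(8)
    return v / np.linalg.norm(v)

class DualIndex:
    """Stores two embeddings per text: one text-only, one CLIP-style."""

    def __init__(self):
        self.docs, self.text_vecs, self.clip_vecs = [], [], []

    def add(self, doc: str):
        # Each document is embedded twice, once per encoder.
        self.docs.append(doc)
        self.text_vecs.append(text_encode(doc))
        self.clip_vecs.append(clip_encode(doc))

    def search(self, query: str, modality: str = "text", k: int = 3):
        # Text queries use the cleaner text-only space; image queries
        # fall back to the shared CLIP space.
        q = text_encode(query) if modality == "text" else clip_encode(query)
        vecs = self.text_vecs if modality == "text" else self.clip_vecs
        sims = np.array(vecs) @ q  # cosine similarity (vectors are unit-norm)
        top = np.argsort(-sims)[:k]
        return [(self.docs[i], float(sims[i])) for i in top]
```

The cost is the one this comment flags: double the embedding storage and two encoder passes per document at index time, in exchange for better text-to-text retrieval quality.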
In your view, would the slightly better performance with text justify the increase in complexity, @LifeIsStrange ?
The state of the art can be found here:
https://paperswithcode.com/sota/semantic-textual-similarity-on-sts-benchmark
Other benchmarks:
https://paperswithcode.com/task/semantic-textual-similarity