Training a model for ViT-L/14 image embeddings #10

Closed
rom1504 opened this issue Apr 10, 2022 · 1 comment

rom1504 (Collaborator) commented Apr 10, 2022

Hey,
Thanks for providing this awesome multilingual CLIP-aligned text encoder.
We used it to filter the 3 billion (image, text) pairs of LAION-5B https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ and it worked well.
I'm also using this model to provide multilingual search in https://rom1504.github.io/clip-retrieval/.
For LAION-400M we used OpenAI's ViT-B/32 model to produce the index, but for LAION-5B we went with ViT-L/14, which is much more powerful.
To provide the same multilingual search feature, it would be really helpful to have a multilingual text encoder aligned with CLIP ViT-L/14.

Would you advise running https://github.com/FreddeFrallan/Multilingual-CLIP#training-a-new-model to align such a text encoder? (Now that I'm writing this, I guess I could use a subset of the multilingual part of LAION-5B for this.)

FreddeFrallan (Owner) commented

Hi there,
I'm happy that you found a good use case for these models.
A multilingual ViT-L/14 sounds very interesting to me, and I'm fond of the idea of making large-scale models available to people.

My main advice for creating a good multilingual encoder would be to increase the number of translated data points. For example, on the Swedish CLIP encoder there is a quantifiable difference between 500K and 2M samples. (A short M-CLIP paper has been accepted but not yet released; I could share it with you if you want more details.)
Therefore, I would suggest machine translating as many texts from your collected dataset as possible.

The code and models in this GitHub repo were created during a single weekend, so you can expect better results with more data and compute.
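
For reference, the alignment is essentially teacher learning: a frozen CLIP text encoder produces the target embedding for the original English caption, and a multilingual student encoder is trained to reproduce that embedding from the translated caption. Below is a minimal sketch of that setup; the choice of XLM-RoBERTa as the student, the mean pooling, the learning rate, and the toy data pair are illustrative assumptions, not this repo's exact training code.

```python
# Minimal teacher-learning sketch: align a multilingual text encoder
# with the frozen CLIP ViT-L/14 text encoder via MSE on embeddings.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Teacher: frozen CLIP ViT-L/14 text encoder (768-d joint embedding space).
teacher, _ = clip.load("ViT-L/14", device=device)
teacher.eval()

# Student: a multilingual transformer plus a projection into CLIP space.
student_name = "xlm-roberta-large"  # assumption; any multilingual encoder works
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModel.from_pretrained(student_name).to(device)
proj = nn.Linear(student.config.hidden_size, 768).to(device)

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5
)
loss_fn = nn.MSELoss()

# Toy (english, translated) pair; in practice, millions of
# machine-translated captions from the collected dataset.
pairs = [("a photo of a dog", "une photo d'un chien")]

for english, translated in pairs:
    # Target: the teacher's embedding of the original English caption.
    with torch.no_grad():
        target = teacher.encode_text(clip.tokenize([english]).to(device)).float()

    # Student prediction: mean-pool token states of the translation,
    # then project into CLIP's embedding space.
    tokens = tokenizer([translated], return_tensors="pt", padding=True).to(device)
    hidden = student(**tokens).last_hidden_state
    mask = tokens["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    pred = proj(pooled)

    loss = loss_fn(pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Scaling this loop to as many machine-translated pairs as possible is exactly where the data advice above comes in.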
