Training a model for ViT-L/14 image embeddings #10

Closed
rom1504 opened this issue Apr 10, 2022 · 1 comment

rom1504 (Collaborator) commented Apr 10, 2022

Hey,
Thanks for providing this awesome multilingual CLIP-aligned text encoder.
We used it to filter the 3 billion (image, text) pairs of LAION-5B https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/ and it worked well.
I'm also using this model to provide multilingual search in https://rom1504.github.io/clip-retrieval/.
For LAION-400M we used OpenAI's ViT-B/32 model to produce the index, but for LAION-5B we went with ViT-L/14, which is much more powerful.
To provide the same multilingual search feature, it would be really helpful to have a multilingual text encoder aligned with CLIP ViT-L/14.

Would you advise running https://github.com/FreddeFrallan/Multilingual-CLIP#training-a-new-model to align such a text encoder? (Now that I'm writing this, I guess I could use a subset of the multilingual part of LAION-5B for this.)

FreddeFrallan (Owner) commented

Hi there,
I'm happy that you found a good use case for these models.
A multilingual ViT-L/14 sounds very interesting to me, and I'm fond of the idea of making large-scale models available to people.

My main advice for creating a good multilingual encoder would be to increase the number of translated data points. For example, on the Swedish CLIP encoder there is a quantifiable difference between 500K and 2M samples. (A short M-CLIP paper has been accepted but not yet released; I could share it with you if you want more details.)
Therefore, I would suggest machine translating as many texts from your collected dataset as possible.

The code and models in this GitHub repo were created during a single weekend, so you can expect better results with more data and compute.
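
For reference, the alignment is essentially teacher learning: a frozen CLIP text encoder produces the target embedding for the original English caption, and a multilingual student encoder is trained to reproduce that embedding from the translated caption. Below is a minimal sketch of that setup; the choice of XLM-RoBERTa as the student, the mean pooling, the learning rate, and the toy data pair are illustrative assumptions, not this repo's exact training code.

```python
# Minimal teacher-learning sketch: align a multilingual text encoder
# with the frozen CLIP ViT-L/14 text encoder via MSE on embeddings.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Teacher: frozen CLIP ViT-L/14 text encoder (768-d joint embedding space).
teacher, _ = clip.load("ViT-L/14", device=device)
teacher.eval()

# Student: a multilingual transformer plus a projection into CLIP space.
student_name = "xlm-roberta-large"  # assumption; any multilingual encoder works
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModel.from_pretrained(student_name).to(device)
proj = nn.Linear(student.config.hidden_size, 768).to(device)

optimizer = torch.optim.AdamW(
    list(student.parameters()) + list(proj.parameters()), lr=1e-5
)
loss_fn = nn.MSELoss()

# Toy (english, translated) pair; in practice, millions of
# machine-translated captions from the collected dataset.
pairs = [("a photo of a dog", "une photo d'un chien")]

for english, translated in pairs:
    # Target: the teacher's embedding of the original English caption.
    with torch.no_grad():
        target = teacher.encode_text(clip.tokenize([english]).to(device)).float()

    # Student prediction: mean-pool token states of the translation,
    # then project into CLIP's embedding space.
    tokens = tokenizer([translated], return_tensors="pt", padding=True).to(device)
    hidden = student(**tokens).last_hidden_state
    mask = tokens["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    pred = proj(pooled)

    loss = loss_fn(pred, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Scaling this loop to as many machine-translated pairs as possible is exactly where the data advice above comes in.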
