This folder contains notebooks regarding Idefics2, a powerful vision-language model developed by Hugging Face.
- Idefics2 docs
- Idefics2 blog post
- See also this nice blog post: https://medium.com/google-developer-experts/ml-story-fine-tune-vision-language-model-on-custom-dataset-8e5f5dace7b1

I also uploaded a similar notebook for LLaVa. It works just as well, and I removed the step that adds special tokens to keep the logic simpler. The same can be done for Idefics2.

The notebook I currently include here is aimed at extraction use cases (image -> text or JSON).
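
For the image -> JSON case, one simple option that needs no extra special tokens is to train on a plain JSON string as the target text and parse it back after generation. The snippet below is only a sketch of that idea, not necessarily what the notebook does: the helper names `to_target_text` and `parse_generated_text` and the example record are made up for illustration.

```python
import json
from typing import Any, Dict, Optional


def to_target_text(record: Dict[str, Any]) -> str:
    """Serialize a ground-truth record into the text the model should generate."""
    # sort_keys gives a stable key order, so identical records map to identical targets
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def parse_generated_text(text: str) -> Optional[Dict[str, Any]]:
    """Recover a dict from generated text, or None if no valid JSON object is found."""
    # Keep only the span between the first "{" and the last "}" to drop any
    # prompt text or chat prefix surrounding the JSON
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical receipt-style record, used only to show the round trip
record = {"total": "12.50", "vendor": "ACME"}
target = to_target_text(record)
roundtrip = parse_generated_text("Assistant: " + target)
```

Returning `None` on malformed output keeps evaluation code simple: a failed parse can be counted as a wrong prediction instead of raising an exception.
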
If you have a chatbot use case, I'd recommend taking a look at the experimental support for VLMs in the TRL library:
- example script for fine-tuning LLaVa for chat: https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py
- example script for fine-tuning Idefics2 for chat: https://gist.github.com/edbeeching/228652fc6c2b29a1641be5a5778223cb