Hey @zucchini-nlp and @NielsRogge 👋,
I created a notebook for fine-tuning Llava-OneVision-0.5b-ov-hf on the BLINK benchmark, based on the existing LLaVA-NeXT notebook.
This notebook could help other folks get started with multi-image tasks using Llava-OneVision.
A few questions came up during the implementation:
- How can I pass 384x384 images only once, without the additional global patch (this would be especially helpful with multiple images)?
- Why does the trained model need to be loaded with input type f32 instead of f16?
- And last but not least, do you have any tips on reducing the size of the input_ids? (I saw some interesting parameters such as vision_feature_select_strategy and num_image_tokens.)
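For context on the global-patch and input_ids questions, here is a back-of-the-envelope token estimator. This is a sketch under stated assumptions, not the exact Hugging Face implementation: it assumes a SigLIP-style vision tower at 384x384 with patch size 14 (so 27×27 = 729 tokens per tile) and ignores newline/unpad tokens, so real counts will differ slightly.

```python
# Rough estimate of visual-token count for LLaVA-OneVision-style anyres tiling.
# ASSUMPTIONS (not confirmed by this thread): 384x384 tiles, patch size 14,
# i.e. (384 // 14) ** 2 = 729 tokens per tile, and anyres adds one extra
# global (base) patch on top of the grid tiles. Newline/unpad tokens ignored.

BASE = 384
PATCH = 14
TOKENS_PER_TILE = (BASE // PATCH) ** 2  # 27 * 27 = 729


def estimate_image_tokens(width: int, height: int, anyres: bool = True) -> int:
    """Estimate the number of visual tokens one image contributes."""
    if not anyres:
        # Single resized 384x384 view, no tiling and no global patch.
        return TOKENS_PER_TILE
    # Number of 384x384 grid tiles needed to cover the image (ceil division).
    tiles_w = -(-width // BASE)
    tiles_h = -(-height // BASE)
    # Grid tiles plus one extra global patch of the whole image.
    return (tiles_w * tiles_h + 1) * TOKENS_PER_TILE


# A 384x384 image passed once, without the global patch:
print(estimate_image_tokens(384, 384, anyres=False))  # 729
# The same image with anyres still pays for the global patch:
print(estimate_image_tokens(384, 384))  # 1458
```

This illustrates why the global patch matters for multi-image inputs: even an image that already matches the base resolution roughly doubles its token cost under anyres, so with several images the input_ids grow quickly.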