Hey @zucchini-nlp and @NielsRogge 👋,
I created a notebook for fine-tuning Llava-OneVision-0.5b-ov-hf on the BLINK benchmark, based on the existing LLaVA-NeXT notebook.
This notebook could help other folks get started with multi-image tasks using Llava-OneVision.
A few questions came up during the implementation:
- How can I pass 384x384 images only once, without the additional global patch (this would be especially helpful with multiple images)?
- Why does the trained model need to be loaded with input type f32 instead of f16?
- And last but not least, do you have any tips on reducing the size of the input_ids? (I saw some interesting parameters such as vision_feature_select_strategy and num_image_tokens.)
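For context on the global-patch and input_ids questions, here is a back-of-the-envelope token estimator. This is a sketch under stated assumptions, not the exact Hugging Face implementation: it assumes a SigLIP-style vision tower at 384x384 with patch size 14 (so 27×27 = 729 tokens per tile) and ignores newline/unpad tokens, so real counts will differ slightly.

```python
# Rough estimate of visual-token count for LLaVA-OneVision-style anyres tiling.
# ASSUMPTIONS (not confirmed by this thread): 384x384 tiles, patch size 14,
# i.e. (384 // 14) ** 2 = 729 tokens per tile, and anyres adds one extra
# global (base) patch on top of the grid tiles. Newline/unpad tokens ignored.

BASE = 384
PATCH = 14
TOKENS_PER_TILE = (BASE // PATCH) ** 2  # 27 * 27 = 729


def estimate_image_tokens(width: int, height: int, anyres: bool = True) -> int:
    """Estimate the number of visual tokens one image contributes."""
    if not anyres:
        # Single resized 384x384 view, no tiling and no global patch.
        return TOKENS_PER_TILE
    # Number of 384x384 grid tiles needed to cover the image (ceil division).
    tiles_w = -(-width // BASE)
    tiles_h = -(-height // BASE)
    # Grid tiles plus one extra global patch of the whole image.
    return (tiles_w * tiles_h + 1) * TOKENS_PER_TILE


# A 384x384 image passed once, without the global patch:
print(estimate_image_tokens(384, 384, anyres=False))  # 729
# The same image with anyres still pays for the global patch:
print(estimate_image_tokens(384, 384))  # 1458
```

This illustrates why the global patch matters for multi-image inputs: even an image that already matches the base resolution roughly doubles its token cost under anyres, so with several images the input_ids grow quickly.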