Adding notebook for Llava-OneVision on multi-image task #470

@nicokossmann

Description

Hey @zucchini-nlp and @NielsRogge 👋,

I created a notebook for fine-tuning Llava-OneVision-0.5b-ov-hf on the BLINK benchmark, based on the existing LLaVA-NeXT notebook.
This notebook could help other folks get an introduction to multi-image tasks with Llava-OneVision.
During the implementation, a few questions arose:

  1. How can I pass a 384x384 image through the vision tower only once, without additionally encoding the global patch? (This would be especially helpful with multiple images.)
  2. Why do the inputs need to be float32 instead of float16 when we load the trained model?
  3. And last but not least, do you have any tips on how to reduce the size of the `input_ids`? I saw there are some interesting parameters like `vision_feature_select_strategy` and `num_image_tokens`.
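For context on questions 1 and 2, here is a small self-contained sketch of the arithmetic involved. The patch geometry (384x384 input, 14x14 patches, one extra global crop under the "anyres" scheme) is an assumption for illustration, not values read from the model config; the float16 round-trip just demonstrates why half precision can lose information:

```python
import struct

# Rough token-count arithmetic for a LLaVA-OneVision-style image encoding.
# Assumed geometry (illustrative): SigLIP-style tower, 384x384 input,
# 14x14 patches -> 27x27 = 729 tokens per crop; the "anyres" scheme adds
# one global crop on top of the grid crops.
def image_token_count(grid_crops=1, include_global=True,
                      image_size=384, patch_size=14):
    side = image_size // patch_size            # 27 patches per side
    per_crop = side * side                     # 729 tokens per crop
    crops = grid_crops + (1 if include_global else 0)
    return crops * per_crop

# For a single 384x384 image, the extra global crop doubles the token
# budget -- this is the overhead question 1 asks about avoiding.
with_global = image_token_count()                         # 1458
without_global = image_token_count(include_global=False)  # 729

# Why float16 can be lossy (question 2): round-trip a value through IEEE
# half precision using struct's 'e' (binary16) format.
def to_fp16_and_back(x):
    return struct.unpack('e', struct.pack('e', x))[0]

# Above 2048, adjacent float16 values are 2 apart, so 2049 is not
# representable and rounds to 2048.
assert to_fp16_and_back(2049.0) == 2048.0
```

With several images per sample, the per-image token count multiplies quickly, which is also why question 3 (shrinking `input_ids`) matters for multi-image fine-tuning.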
