[train] Single- or multi-round multi-image training #234

Open
codybum opened this issue Aug 4, 2023 · 7 comments
Assignees
Luodian, ZhangYuanhan-AI
Labels
area:model (code of model), area:train (code of training)

Comments

@codybum

codybum commented Aug 4, 2023

I was very pleased to run across this very impressive project; thanks for the contribution.

In some domains, such as pathology and radiology, it can take more than one image or resolution to describe a region of interest because of image size or count. In this case we don't need to compare images, but rather allow several images to represent one thing. This would be similar in concept to MIL visual modeling (https://github.com/Project-MONAI/tutorials/tree/main/pathology/multiple_instance_learning).

I have run across several posts [1-2] discussing multi-image conversations, but I could not find any information on how a model might be trained with multiple images. A multi-round solution might work, but from a training perspective I would like to explore training on multiple images for a single response. With larger context sizes, including 15-20 images along with a narrative report would be possible.

Any help exploring this topic would be appreciated.

[1] #150
[2] #89

@Luodian Luodian self-assigned this Aug 4, 2023
@Luodian Luodian added area:train code of training area:model code of model labels Aug 4, 2023
@ZhangYuanhan-AI
Collaborator

ZhangYuanhan-AI commented Aug 4, 2023

Thank you for your interest.

Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here:

def process_spot_the_difference(self, instruction_id, instruction, answer, image_ids, in_context_example_ids):

To achieve this, you may follow these steps:

  1. Format your data following the guidelines provided here: https://github.com/Luodian/Otter/tree/main/mimic-it. Assume the prefix of your instruction id is "MED", like so (a small sketch of building these files follows the steps below):
"MED_INS_00001": {
            "instruction":"XXX",
            "answer":"XXX.",
            "image_ids":["XXX", "..."], # The multiple images corresponding to this instruction
            "rel_ins_ids":[], # This value can be []. If you have a multi-round conversation, it should be filled with the instruction ids of the other rounds.
        },
  2. Modify this line from:
elif cur_train_id.startswith("SD"): 

to:

elif cur_train_id.startswith("SD") or cur_train_id.startswith("MED"): 

This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.

  3. Begin tuning your data with Otter by altering your specific instruction/image/train configuration from:
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \

to:

--mimicit_vt_path="path/to/MED_instruction.json" \
--images_vt_path="path/to/MED.json" \
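
For concreteness, here is a minimal, hypothetical sketch of how the two files above could be assembled. The base64 encoding of images and the {"data": ...} wrapper of the instruction file are assumptions based on my reading of the MIMIC-IT format and should be checked against the mimic-it documentation before use.

# Hypothetical helper for assembling MED_instruction.json and MED.json.
# Assumption: images are stored base64-encoded and instructions live under a "data" key,
# per my reading of https://github.com/Luodian/Otter/tree/main/mimic-it.
import base64
import json

def build_med_files(samples, instruction_path="MED_instruction.json", images_path="MED.json"):
    """samples: list of dicts with keys 'instruction', 'answer', 'image_paths'."""
    instructions, images = {}, {}
    for idx, sample in enumerate(samples, start=1):
        image_ids = []
        for jdx, path in enumerate(sample["image_paths"], start=1):
            image_id = f"MED_IMG_{idx:05d}_{jdx:02d}"
            with open(path, "rb") as f:
                images[image_id] = base64.b64encode(f.read()).decode("utf-8")
            image_ids.append(image_id)
        instructions[f"MED_INS_{idx:05d}"] = {
            "instruction": sample["instruction"],
            "answer": sample["answer"],
            "image_ids": image_ids,  # all images describing this one case
            "rel_ins_ids": [],       # empty for single-round data
        }
    with open(instruction_path, "w") as f:
        json.dump({"data": instructions}, f)
    with open(images_path, "w") as f:
        json.dump(images, f)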

If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.

@ZhangYuanhan-AI ZhangYuanhan-AI self-assigned this Aug 4, 2023
@codybum
Author

codybum commented Aug 4, 2023

@ZhangYuanhan-AI Wow! Thank you for the quick response. I will let you know how training goes, and will of course cite your project in any resulting works.

@LarsDoorenbos

I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them?

@Luodian
Owner

Luodian commented Aug 11, 2023

> I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them?

The Otter/Flamingo model's vision_x has shape [B, T, F, C, H, W], where T is the number of in-context examples and F is the number of frames (details can be found in the Flamingo paper).

If you need to train with multiple images, there are two scenarios:

  • Regard them as frames in sequential order. The model has time embeddings that assign the sequential relationship to the frames.
    You then treat them like video input and organize your dataset like DC/TVC/E4D. In this way, your training prompt should be designed as <image>User: {instruction} GPT:<answer>, where a single <image> token denotes the whole set of frames. This is how we treat our SD subset.

  • Regard them as in-context examples. You can refer to the LA_I2I/T2T part to organize your dataset. In this way, when training with two images, the prompt should be <image><image>User: {instruction} GPT:<answer>.
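
To make the two layouts concrete, here is a small, hypothetical PyTorch sketch (not Otter's actual dataloader) of how a list of preprocessed images maps onto the [B, T, F, C, H, W] layout in each scenario:

import torch

# Four preprocessed [C, H, W] images belonging to one training sample (dummy tensors here).
image_tensors = [torch.randn(3, 224, 224) for _ in range(4)]

# Scenario 1: treat the images as frames of one "video" (SD/DC-style): T = 1, F = 4.
vision_x_frames = torch.stack(image_tensors).unsqueeze(0).unsqueeze(0)
print(vision_x_frames.shape)  # torch.Size([1, 1, 4, 3, 224, 224])

# Scenario 2: treat each image as its own in-context example (LA-style): T = 4, F = 1.
vision_x_in_context = torch.stack(image_tensors).unsqueeze(1).unsqueeze(0)
print(vision_x_in_context.shape)  # torch.Size([1, 4, 1, 3, 224, 224])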

@helloword12345678

In scenario 1, does that mean the model can't do video frame localization?

@xmc-andy

xmc-andy commented Aug 24, 2023

> (quoting @ZhangYuanhan-AI's reply above)

Thanks for your work. Is there any related inference/evaluation code (where the input is multiple images and a single prompt in the SD format)?

@iz2late

iz2late commented Mar 13, 2024

> (quoting @ZhangYuanhan-AI's reply above)

This is really helpful, but it seems the code has been updated and there is no "process_spot_the_difference()" function in that file anymore. Any ideas how to fine-tune the model to handle multiple images for a single response in the current version?
