[train] Single- or multi-round multi-image training #234

Open
codybum opened this issue Aug 4, 2023 · 7 comments
Assignees
Luodian, ZhangYuanhan-AI
Labels
area:model (code of model), area:train (code of training)

Comments

@codybum

codybum commented Aug 4, 2023

I was very pleased to run across this very impressive project; thanks for the contribution.

In some domains, such as pathology and radiology, it can take more than one image or resolution to describe a region of interest because of image size or count. In this case we don't need to compare images, but rather allow several images to represent one thing. This would be similar in concept to MIL visual modeling (https://github.com/Project-MONAI/tutorials/tree/main/pathology/multiple_instance_learning).

I have run across several posts [1-2] discussing multi-image conversations, but I could not find any information on how a model might be trained with multiple images. A multi-round solution might work, but from a training perspective I would like to explore training on multiple images for a single response. With larger context sizes, including 15-20 images along with a narrative report would be possible.

Any help exploring this topic would be appreciated.

[1] #150
[2] #89

@Luodian Luodian self-assigned this Aug 4, 2023
@Luodian Luodian added area:train code of training area:model code of model labels Aug 4, 2023
@ZhangYuanhan-AI
Collaborator

ZhangYuanhan-AI commented Aug 4, 2023

Thank you for your interest.

Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here:

def process_spot_the_difference(self, instruction_id, instruction, answer, image_ids, in_context_example_ids):

To achieve this, you may follow these steps:

  1. Format your data following the guidelines provided here: https://github.com/Luodian/Otter/tree/main/mimic-it. Assume the prefix of your instruction id is "MED", like so (a small sketch of building these files follows the steps below):
"MED_INS_00001": {
            "instruction":"XXX",
            "answer":"XXX.",
            "image_ids":["XXX", "..."], # The multiple images corresponding to this instruction
            "rel_ins_ids":[], # This value can be []. If you have a multi-round conversation, it should be filled with the instruction ids of the other rounds.
        },
  2. Modify this line from:
elif cur_train_id.startswith("SD"): 

to:

elif cur_train_id.startswith("SD") or cur_train_id.startswith("MED"): 

This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.

  3. Begin tuning your data with Otter by altering your specific instruction/image/train configuration from:
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \

to:

--mimicit_vt_path="path/to/MED_instruction.json" \
--images_vt_path="path/to/MED.json" \
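
For concreteness, here is a minimal, hypothetical sketch of how the two files above could be assembled. The base64 encoding of images and the {"data": ...} wrapper of the instruction file are assumptions based on my reading of the MIMIC-IT format and should be checked against the mimic-it documentation before use.

# Hypothetical helper for assembling MED_instruction.json and MED.json.
# Assumption: images are stored base64-encoded and instructions live under a "data" key,
# per my reading of https://github.com/Luodian/Otter/tree/main/mimic-it.
import base64
import json

def build_med_files(samples, instruction_path="MED_instruction.json", images_path="MED.json"):
    """samples: list of dicts with keys 'instruction', 'answer', 'image_paths'."""
    instructions, images = {}, {}
    for idx, sample in enumerate(samples, start=1):
        image_ids = []
        for jdx, path in enumerate(sample["image_paths"], start=1):
            image_id = f"MED_IMG_{idx:05d}_{jdx:02d}"
            with open(path, "rb") as f:
                images[image_id] = base64.b64encode(f.read()).decode("utf-8")
            image_ids.append(image_id)
        instructions[f"MED_INS_{idx:05d}"] = {
            "instruction": sample["instruction"],
            "answer": sample["answer"],
            "image_ids": image_ids,  # all images describing this one case
            "rel_ins_ids": [],       # empty for single-round data
        }
    with open(instruction_path, "w") as f:
        json.dump({"data": instructions}, f)
    with open(images_path, "w") as f:
        json.dump(images, f)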

If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.

@ZhangYuanhan-AI ZhangYuanhan-AI self-assigned this Aug 4, 2023
@codybum
Author

codybum commented Aug 4, 2023

@ZhangYuanhan-AI Wow! Thank you for the quick response. I will let you know how training goes, and will of course cite your project in any resulting works.

@LarsDoorenbos

I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them?

@Luodian
Owner

Luodian commented Aug 11, 2023

> I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them?

The Otter/Flamingo model's vision_x has shape [B, T, F, C, H, W], where T is the number of in-context examples and F is the number of frames (details can be found in the Flamingo paper).

If you need to train with multiple images, there are two scenarios:

  • Regard them as frames in sequential order. The model has time embeddings that assign the sequential relationship to the frames.
    You then treat them like video input and organize your dataset like DC/TVC/E4D. In this way, your training prompt should be designed as <image>User: {instruction} GPT:<answer>, where a single <image> token denotes the whole set of frames. This is how we treat our SD subset.

  • Regard them as in-context examples. You can refer to the LA_I2I/T2T part to organize your dataset. In this way, when training with two images, the prompt should be <image><image>User: {instruction} GPT:<answer>.
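
To make the two layouts concrete, here is a small, hypothetical PyTorch sketch (not Otter's actual dataloader) of how a list of preprocessed images maps onto the [B, T, F, C, H, W] layout in each scenario:

import torch

# Four preprocessed [C, H, W] images belonging to one training sample (dummy tensors here).
image_tensors = [torch.randn(3, 224, 224) for _ in range(4)]

# Scenario 1: treat the images as frames of one "video" (SD/DC-style): T = 1, F = 4.
vision_x_frames = torch.stack(image_tensors).unsqueeze(0).unsqueeze(0)
print(vision_x_frames.shape)  # torch.Size([1, 1, 4, 3, 224, 224])

# Scenario 2: treat each image as its own in-context example (LA-style): T = 4, F = 1.
vision_x_in_context = torch.stack(image_tensors).unsqueeze(1).unsqueeze(0)
print(vision_x_in_context.shape)  # torch.Size([1, 4, 1, 3, 224, 224])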

@helloword12345678

In scenario 1, does that mean the model can't do video frame localization?

@xmc-andy

xmc-andy commented Aug 24, 2023

> (quoting @ZhangYuanhan-AI's reply above)

Thanks for your work. Is there any related inference/evaluation code (where the input is multiple images and a single prompt in the SD format)?

@iz2late

iz2late commented Mar 13, 2024

> (quoting @ZhangYuanhan-AI's reply above)

This is really helpful, but it seems the code has been updated and there is no "process_spot_the_difference()" function in that file anymore. Any ideas how to fine-tune the model to handle multiple images for a single response in the current version?
