[train] Single- or multi-round multi-image training #234
Thank you for your interest. Producing a single response from multiple images with one instruction can be done easily by adhering to the dataset format found here:
To achieve this, you may follow these steps:
to:
This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.
to:
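The steps above boil down to preparing data in the same shape as the "Spot-the-difference" entries. As a rough illustration, a multi-image, single-round entry could look like the sketch below; the field names (`instruction`, `answer`, `image_ids`, `rel_ins_ids`) are assumptions modeled on that style, not a confirmed schema from this repository.

```python
# Hypothetical sketch of a multi-image, single-instruction training entry,
# modeled on the "Spot-the-difference" (SD) style; field names are assumptions.
example = {
    "instruction": "Describe the differences between these two images.",
    "answer": "The second image has a red car parked on the left.",
    "image_ids": ["SD_IMG_000001", "SD_IMG_000002"],  # several images, one turn
    "rel_ins_ids": [],  # empty: single-round conversation, no follow-ups
}

# The key point: one instruction/answer pair references a *list* of images.
assert len(example["image_ids"]) == 2
```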
If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication. |
@ZhangYuanhan-AI Wow! Thank you for the quick response, I will let you know how training goes, and will of course cite your project on any works. |
I am also interested in using multiple images for a single response. Could you expand upon how this works under the hood? Does it just concatenate the input images in the prompt? Or is there a special way it combines them? |
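A common pattern in LLaVA-style vision-language models (an assumption about this project, not a confirmed detail of its code) is exactly that: each image is represented by a placeholder token in the text prompt, and the image features are spliced in at those positions by the model. A minimal sketch of how such a prompt might be assembled:

```python
# Sketch of the common "placeholder token per image" approach; the token
# string and function are illustrative assumptions, not this repo's API.
IMAGE_TOKEN = "<image>"

def build_prompt(num_images: int, instruction: str) -> str:
    """Prepend one placeholder per image, then append the text instruction."""
    return "".join(IMAGE_TOKEN for _ in range(num_images)) + "\n" + instruction

prompt = build_prompt(2, "Spot the differences between the images.")
# prompt == "<image><image>\nSpot the differences between the images."
```

Under this scheme the images are not concatenated pixel-wise; each one is encoded separately and its embeddings replace the corresponding placeholder token.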
in scenario 1, the model can't do video frame location ? |
Thanks for your work, is there any related inference/evaluation code (input is multi-images && single prompt for SD format)? |
This is really helpful. But it seems the code has been updated and there is no "process_spot_the_difference()" function in the same file. Any ideas on how to finetune the model to handle multiple images for a single response in the current version?
I was very pleased to run across this very impressive project; thanks for the contribution.
In some domains, such as pathology and radiology, it can take more than one image/resolution to describe a region of interest due to image size or count. In this case we don't need to compare images, but rather allow several images to represent one thing. This would be similar in concept to MIL visual modeling (https://github.com/Project-MONAI/tutorials/tree/main/pathology/multiple_instance_learning).
I have run across several posts [1-2] discussing multi-image conversations, but I could not find any information on how a model might be trained with multiple images. A multi-round solution might work, but from a training perspective I would like to explore training on multiple images for a single response. With larger context sizes, including 15-20 images along with a narrative report would be possible.
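For the MIL-style use case described above, a training entry could pair many image patches with one narrative report. This is only a sketch under assumed field names and file naming, not a format this project defines:

```python
# Hypothetical sketch: one narrative report paired with many patches of a
# region of interest (field names and file names are illustrative only).
case = {
    # 18 patches, within the 15-20 range mentioned above
    "images": [f"slide_patch_{i:02d}.png" for i in range(18)],
    "instruction": "Write a narrative report for this region of interest.",
    "answer": "The region shows ...",  # single report covering all patches
}

assert len(case["images"]) == 18
```

The structural difference from the spot-the-difference case is only the number of images per entry; the single-instruction, single-response shape is the same.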
Any help exploring this topic would be appreciated.
[1] #150
[2] #89