Thank you for sharing. This question has also been raised in the context of LRV.
Using a visual encoder requires resizing the input images. Since no pre-processing steps are predefined, a concern arises: how can we keep the image coordinates in referring questions and answers consistent when images are resized differently across models?
Hi @zhang-jr, thanks for raising this issue. The coordinates in referring QAs are based on the size of original images.
In LLaVA, images are padded to square before being passed to the visual encoder (by setting --image_aspect_ratio pad), so the coordinates also change due to the padding. Here we provide a script that expands the bounding boxes to square accordingly. After running the script, you can feed the output data to LLaVA.
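As a rough illustration of what such a script does, here is a minimal sketch of the coordinate shift, assuming LLaVA-style centered square padding (the image is pasted at the center of a square canvas whose side equals the longer edge); the function name and bbox format `(x1, y1, x2, y2)` are illustrative, not the repository's actual API:

```python
def pad_bbox_to_square(bbox, width, height):
    """Map a bounding box from original-image pixel coordinates to the
    coordinate frame of the square-padded image.

    Assumes centered padding: the original image is pasted at
    ((side - width) // 2, (side - height) // 2) on a side x side canvas.
    """
    side = max(width, height)
    dx = (side - width) // 2   # horizontal offset added by padding
    dy = (side - height) // 2  # vertical offset added by padding
    x1, y1, x2, y2 = bbox
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
```

For example, a 100x200 portrait image is padded to 200x200 with 50 pixels added on each side horizontally, so every x coordinate shifts by 50 while y coordinates are unchanged.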
For other image resizing strategies, the referring QAs will need to be preprocessed accordingly, just as in the steps above.
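For a plain (non-padding) resize, the corresponding adjustment is a per-axis rescale. A minimal sketch, again with an illustrative function name and `(x1, y1, x2, y2)` bbox convention:

```python
def scale_bbox(bbox, orig_w, orig_h, new_w, new_h):
    """Rescale a bounding box when the image is directly resized
    from (orig_w, orig_h) to (new_w, new_h)."""
    sx = new_w / orig_w  # horizontal scale factor
    sy = new_h / orig_h  # vertical scale factor
    x1, y1, x2, y2 = bbox
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```

The same idea applies to normalized coordinates: if a model expects coordinates in [0, 1], divide by the original width and height instead.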