Describe the bug
microsoft/Phi-4-multimodal-instruct describes the image as corrupted beyond recognition, even though its contents can clearly be described.
To Reproduce
Adapted from the phi4-mm.py example:
Expected behavior
Similar to the output in the HF space: "The image shows a group of people around a table with a green tablecloth. There are playing cards on the table, and the people are engaged in a card game. The room has a yellowish wall and a window with blinds."
Actual behavior
CUDA: "The image appears to be a highly distorted or corrupted photo. It contains various colors and shapes that do not correspond to a recognizable scene or object. The image lacks clarity and coherence, making it difficult to provide a detailed description."
CPU: "The image appears to be a highly distorted or corrupted photo. The central figure is a person wearing a white shirt and dark pants, standing in front of a green background. There is visible color fringing and pixelation throughout the image."
CPU can provide at least some detail, but still performs much worse than expected.
Screenshots
The image in question is from the Flickr30k dataset (zipped to avoid compression):
36979.jpg.zip
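Since the model reports the photo as corrupted, it may be worth ruling out a damaged file first. The following is a minimal sketch using only the Python standard library; the helper names `looks_like_intact_jpeg` and `check_zipped_jpeg` are hypothetical, and the member name `36979.jpg` inside the archive is an assumption. The marker check is only a heuristic (a file can pass it and still fail to decode), not a full validation.

```python
import sys
import zipfile


def looks_like_intact_jpeg(data: bytes) -> bool:
    """Heuristic: a JPEG stream starts with the SOI marker (FF D8)
    and normally ends with the EOI marker (FF D9)."""
    return len(data) >= 4 and data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")


def check_zipped_jpeg(archive, member="36979.jpg"):
    """Verify the archive's CRCs, then apply the marker heuristic to one member.

    `archive` may be a path or a binary file-like object.
    `member` is an assumed file name inside the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        bad_member = zf.testzip()  # name of the first member with a bad CRC, or None
        if bad_member is not None:
            return False
        return looks_like_intact_jpeg(zf.read(member))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_image.py 36979.jpg.zip
    print("intact" if check_zipped_jpeg(sys.argv[1]) else "possibly corrupted")
```

If this reports the file as intact, the distortion described by the model is more likely introduced during preprocessing or inference than by the file itself.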
Desktop (please complete the following information):