Phi-4 multi-modal produces unusually poor results #1383

mertalev opened this issue Apr 7, 2025 · 1 comment

mertalev commented Apr 7, 2025

Describe the bug

microsoft/Phi-4-multimodal-instruct describes the image as corrupted beyond recognition, even though the image is clear and easy to describe.

To Reproduce
Adapted from the phi4-mm.py example:

import onnxruntime_genai as og

model_path = "gpu/gpu-int4-rtn-block-32"
execution_provider = "cuda"
image = "36979.jpg"
text = "Describe the image."

config = og.Config(model_path)
config.clear_providers()
if execution_provider != "cpu":
    config.append_provider(execution_provider)
model = og.Model(config)

processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

images = og.Images.open(image)
prompt = "<|user|>\n"
prompt += "<|image_1|>\n"
prompt += f"{text}<|end|>\n<|assistant|>\n"
inputs = processor(prompt, images=images, audios=None)
params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=768)

generator = og.Generator(model, params)

while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

Expected behavior

Similar to the output in the HF space: "The image shows a group of people around a table with a green tablecloth. There are playing cards on the table, and the people are engaged in a card game. The room has a yellowish wall and a window with blinds."

Actual behavior

CUDA: "The image appears to be a highly distorted or corrupted photo. It contains various colors and shapes that do not correspond to a recognizable scene or object. The image lacks clarity and coherence, making it difficult to provide a detailed description."

CPU: "The image appears to be a highly distorted or corrupted photo. The central figure is a person wearing a white shirt and dark pants, standing in front of a green background. There is visible color fringing and pixelation throughout the image."

The CPU output contains at least some detail, but it is still much worse than expected.
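
To make the CUDA/CPU comparison easier to repeat, the reproduction script can be wrapped in a function that returns the generated text instead of printing it. This is only a sketch: `build_prompt` and `describe` are hypothetical helper names, the `onnxruntime_genai` calls are the same ones used in the script above, and reusing one model directory for both providers is an assumption.

```python
def build_prompt(text):
    """Build the single-image chat prompt used in the reproduction script."""
    return f"<|user|>\n<|image_1|>\n{text}<|end|>\n<|assistant|>\n"


def describe(model_path, execution_provider, image_path, text, max_length=768):
    """Run one generation and return the decoded output as a single string."""
    # Imported lazily so build_prompt can be used without the package installed.
    import onnxruntime_genai as og

    config = og.Config(model_path)
    config.clear_providers()
    if execution_provider != "cpu":
        config.append_provider(execution_provider)
    model = og.Model(config)

    processor = model.create_multimodal_processor()
    tokenizer_stream = processor.create_stream()

    images = og.Images.open(image_path)
    inputs = processor(build_prompt(text), images=images, audios=None)
    params = og.GeneratorParams(model)
    params.set_inputs(inputs)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    pieces = []
    while not generator.is_done():
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        pieces.append(tokenizer_stream.decode(new_token))
    return "".join(pieces)
```

Calling `describe("gpu/gpu-int4-rtn-block-32", "cuda", "36979.jpg", "Describe the image.")` and then the same call with `"cpu"` should give the two outputs side by side, assuming the model directory loads under both providers.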

Screenshots

The image in question is from the Flickr30k dataset (zipped to avoid compression):

36979.jpg.zip
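
Since the image is zipped to keep the bytes intact, anyone reproducing this may want to confirm they are testing the same file. A minimal sketch using only the standard library follows; `extract_and_hash` is a hypothetical helper name, and no reference digest is given in this issue, so the hash is only useful for comparing between machines.

```python
import hashlib
import zipfile
from pathlib import Path


def extract_and_hash(zip_path, member, out_dir="."):
    """Extract one file from a zip archive and return (path, SHA-256 hex digest)."""
    with zipfile.ZipFile(zip_path) as archive:
        # extract() returns the path of the file it wrote.
        extracted = Path(archive.extract(member, path=out_dir))
    digest = hashlib.sha256(extracted.read_bytes()).hexdigest()
    return extracted, digest
```

For example, `extract_and_hash("36979.jpg.zip", "36979.jpg")` would write the image next to the script, assuming the archive member is named `36979.jpg`.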

Desktop:

  • OS: Arch
  • GPU: RTX 4090
  • onnxruntime-genai-cuda: 0.7.0

@avinash31d

It seems the quantization was not done properly.
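
For context on why quantization is a plausible suspect: the model directory name `gpu-int4-rtn-block-32` suggests 4-bit round-to-nearest quantization with a block size of 32. The sketch below shows that general technique in plain Python, assuming a symmetric scheme with one scale per block. It is not the code used to produce this model, and the function names are hypothetical.

```python
def quantize_rtn_int4(weights, block_size=32):
    """Symmetric round-to-nearest int4 quantization with one scale per block."""
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the int4 range [-8, 7].
        quantized.extend(max(-8, min(7, round(w / scale))) for w in block)
    return quantized, scales


def dequantize(quantized, scales, block_size=32):
    """Reconstruct approximate float weights from int4 values and block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


def max_abs_error(weights, block_size=32):
    """Largest absolute round-trip error over all weights."""
    quantized, scales = quantize_rtn_int4(weights, block_size)
    restored = dequantize(quantized, scales, block_size)
    return max(abs(a - b) for a, b in zip(weights, restored))
```

In this scheme a single large outlier in a block inflates the scale, so the small weights sharing that block round to zero. Whether something like this is what degrades the image description here is unconfirmed.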
