Phi-4 multi-modal produces unusually poor results #1383

mertalev · 2025-04-07T21:32:14Z

Describe the bug

microsoft/Phi-4-multimodal-instruct describes the image as corrupted beyond recognition, even though its content is clearly recognizable.

To Reproduce
Adapted from the phi4-mm.py example:

import onnxruntime_genai as og

model_path = "gpu/gpu-int4-rtn-block-32"
execution_provider = "cuda"  # set to "cpu" to reproduce the CPU output
image = "36979.jpg"
text = "Describe the image."

# Load the model with only the selected execution provider.
config = og.Config(model_path)
config.clear_providers()
if execution_provider != "cpu":
    config.append_provider(execution_provider)
model = og.Model(config)

processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Build the chat prompt with a single image placeholder.
images = og.Images.open(image)
prompt = "<|user|>\n"
prompt += "<|image_1|>\n"
prompt += f"{text}<|end|>\n<|assistant|>\n"
inputs = processor(prompt, images=images, audios=None)
params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=768)

generator = og.Generator(model, params)

# Stream the response token by token until generation finishes.
while not generator.is_done():
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)

Expected behavior

Similar to the output in the HF space: "The image shows a group of people around a table with a green tablecloth. There are playing cards on the table, and the people are engaged in a card game. The room has a yellowish wall and a window with blinds."

Actual behavior

CUDA: "The image appears to be a highly distorted or corrupted photo. It contains various colors and shapes that do not correspond to a recognizable scene or object. The image lacks clarity and coherence, making it difficult to provide a detailed description."

CPU: "The image appears to be a highly distorted or corrupted photo. The central figure is a person wearing a white shirt and dark pants, standing in front of a green background. There is visible color fringing and pixelation throughout the image."

CPU can provide at least some detail, but still performs much worse than expected.

Screenshots

The image in question is from the Flickr30k dataset (zipped to avoid compression):

36979.jpg.zip

Desktop

OS: Arch
GPU: RTX 4090
onnxruntime-genai-cuda: 0.7.0

avinash31d · 2025-04-17T15:01:04Z

It seems the quantization was not done properly.
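
For context, the model directory name in the report (gpu-int4-rtn-block-32) suggests 4-bit round-to-nearest (RTN) quantization with a block size of 32, though that reading is an assumption. Below is a minimal pure-Python sketch of what such a scheme generally looks like and how much rounding error it introduces. It assumes symmetric per-block scaling and is not the code used to produce this model; the function names are hypothetical.

```python
import random


def quantize_rtn_int4(weights, block_size=32):
    """Symmetric round-to-nearest int4 quantization with one scale per block.

    Illustrative only: real exporters may use asymmetric zero points or
    different scale choices.
    """
    quantized, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        max_abs = max(abs(w) for w in block)
        # Map the largest magnitude in the block to 7; avoid dividing by zero.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the signed 4-bit range.
        quantized.extend(max(-8, min(7, round(w / scale))) for w in block)
    return quantized, scales


def dequantize(quantized, scales, block_size=32):
    """Reconstruct approximate float weights from int4 values and block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(128)]  # synthetic weights
q, s = quantize_rtn_int4(weights)
restored = dequantize(q, s)
print("max abs error:", max_abs_error(weights, restored))
```

In this scheme the per-weight error is bounded by half of the block's scale, so a single outlier in a block coarsens every other weight in that block. One way to test the hypothesis would be to run the same image and prompt through a higher-precision export of the model and compare the outputs.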
