Description
I built onnxruntime_genai from source with the CUDA execution provider and then installed the Python wheel.
I tried to run the microsoft/phi-2 model, but it seems there is a problem with the GroupQueryAttention node.
Here is the command to build the phi-2 model from Hugging Face:
python -m onnxruntime_genai.models.builder -m microsoft/phi-2 -e cuda -p int4 -o ./example-models/phi2-int4-cuda
Here is the Python code to reproduce the error:
import onnxruntime_genai as og
import time
prompt = '''def is_prime(n):
"""
Determine if n is prime or not
"""'''
model = og.Model('example-models/phi2-int4-cuda')
tokenizer = og.Tokenizer(model)
tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.set_search_options(max_length=100)
params.input_ids = tokens
start_time = time.time()
output_tokens = model.generate(params)[0]
end_time = time.time()
text = tokenizer.decode(output_tokens)
print(text)
Here is the error that I get:
onnxruntime_genai.onnxruntime_genai.OrtException: Non-zero status code returned while running GroupQueryAttention node. Name:'/model/layers.0/attn/GroupQueryAttention' Status Message: cos_cache dimension 1 must be <= head_size / 2 and a multiple of 8.
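For reference, here is a small sketch I used to check which rotary cache shapes actually ended up in the exported model, to compare against the constraint in the error message. It assumes the builder wrote model.onnx into the output folder and that the cos/sin caches are stored as graph initializers whose names contain cos_cache / sin_cache; adjust if the naming differs:
import onnx

# Load only the graph structure; external weight data is not needed for shape inspection.
m = onnx.load('example-models/phi2-int4-cuda/model.onnx', load_external_data=False)
for init in m.graph.initializer:
    # Print the dimensions of anything that looks like a rotary cos/sin cache.
    if 'cos_cache' in init.name or 'sin_cache' in init.name:
        print(init.name, list(init.dims))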
OS: Windows 10
Architecture: x64
Language: Python
ONNX Runtime version: 1.17.1
onnxruntime-genai version: 0.3.0-dev
CUDA version: 12.3