Describe the bug
When the "max_tokens" in the payload is higher than --inference_max_seq_length passed to the server (in my case: 8192 vs 4096) the server responds with empty assistant message.
Steps/Code to reproduce bug
Deployment snippet (Eos cluster):
python \
/opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
--megatron_checkpoint /lustre/fsw/coreai_dlalgo_ci/nemo_export_deploy_eval_checkpoints/mbridge/meta-llama/Llama-3.1-8B-Instruct/iter_0000000/ \
--model_id megatron_model \
--port 8886 \
--host 0.0.0.0 \
--num_gpus 8 \
--tensor_model_parallel_size 1 \
--pipeline_model_parallel_size 1 \
--expert_model_parallel_size 1 \
--max_batch_size 2 \
--num_replicas 8 \
--inference_max_seq_length 4096 \
--runtime_env '{"py_executable": "/opt/venv/bin/python"}' &
Then to send a request to the model:
import requests
model_name="megatron_model"
endpoint_url="http://0.0.0.0:8886/v1/chat/completions"
payload = {"model": model_name, "max_tokens": 8192, "top_p": 0.9999999, "temperature": 1e-07, "messages": [{"role": "user", "content": "## Instruction:\n\nPlease answer this question by first reasoning and then selecting the correct choice.\nPresent your reasoning and solution in the following json format.\nPlease show your choice in the `answer` field with only the choice letter, e.g.,`\"answer\": \"C\"`.\n\n```json\n{\n \"reasoning\": \"___\",\n \"answer\": \"___\"\n}\n```\n\n## Question:\n\nWhich of the following is a disorder characterized by uncontrollable episodes of falling asleep during the day?\n\n## Choices:\n\n- (A) Dyslexia\n- (B) Epilepsy\n- (C) Hydrocephalus\n- (D) Narcolepsy\n\n## Answer:"}]}
response = requests.post(endpoint_url, json=payload)
response.json()
Expected behavior
The server should respond with descriptive error
Describe the bug
When the
"max_tokens"in the payload is higher than--inference_max_seq_lengthpassed to the server (in my case: 8192 vs 4096) the server responds with empty assistant message.Steps/Code to reproduce bug
Deployment snippet (Eos cluster):
Then to send a request to the model:
Expected behavior
The server should respond with descriptive error