Server returns empty answer when too many tokens requested

**Describe the bug**

When `"max_tokens"` in the request payload is higher than the `--inference_max_seq_length` passed to the server (in my case: 8192 vs 4096), the server responds with an empty assistant message.

**Steps/Code to reproduce bug**

Deployment snippet (Eos cluster):

```
python \
  /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
  --megatron_checkpoint /lustre/fsw/coreai_dlalgo_ci/nemo_export_deploy_eval_checkpoints/mbridge/meta-llama/Llama-3.1-8B-Instruct/iter_0000000/ \
  --model_id megatron_model \
  --port 8886 \
  --host 0.0.0.0 \
  --num_gpus 8 \
  --tensor_model_parallel_size 1 \
  --pipeline_model_parallel_size 1 \
  --expert_model_parallel_size 1 \
  --max_batch_size 2 \
  --num_replicas 8 \
  --inference_max_seq_length 4096 \
  --runtime_env '{"py_executable": "/opt/venv/bin/python"}' &
```

Then send a request to the model:

```
import requests
model_name="megatron_model"
endpoint_url="http://0.0.0.0:8886/v1/chat/completions"

payload = {"model": model_name, "max_tokens": 8192, "top_p": 0.9999999, "temperature": 1e-07, "messages": [{"role": "user", "content": "## Instruction:\n\nPlease answer this question by first reasoning and then selecting the correct choice.\nPresent your reasoning and solution in the following json format.\nPlease show your choice in the `answer` field with only the choice letter, e.g.,`\"answer\": \"C\"`.\n\n```json\n{\n    \"reasoning\": \"___\",\n    \"answer\": \"___\"\n}\n```\n\n## Question:\n\nWhich of the following is a disorder characterized by uncontrollable episodes of falling asleep during the day?\n\n## Choices:\n\n- (A) Dyslexia\n- (B) Epilepsy\n- (C) Hydrocephalus\n- (D) Narcolepsy\n\n## Answer:"}]}

response = requests.post(endpoint_url, json=payload)
# Print the parsed body; the assistant message in it comes back empty.
print(response.json())
```
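To make the symptom easier to spot, here is a minimal sketch of a client-side check. It assumes the endpoint returns an OpenAI-style chat completion body (`choices[0].message.content`), which this issue does not confirm; `assistant_text` and `empty_body` are hypothetical names used only for illustration.

```python
def assistant_text(body: dict) -> str:
    """Return the assistant text from an OpenAI-style body, or "" if absent.

    Assumption: the server follows the OpenAI chat completions schema.
    """
    choices = body.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Hypothetical body illustrating the symptom (not captured server output).
empty_body = {"choices": [{"message": {"role": "assistant", "content": ""}}]}

if not assistant_text(empty_body):
    print("assistant message is empty")
```

In the reproduction above, the same check could be applied to `response.json()`.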

**Expected behavior**

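The server should respond with a descriptive error (for example, one stating that `max_tokens` exceeds `--inference_max_seq_length`) instead of an empty assistant message.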
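A rough sketch of the kind of validation I have in mind follows. This is not the server's actual code: `validate_max_tokens` is a hypothetical helper, and it only compares `max_tokens` against the configured limit as described in this issue. A real check might also need to account for the prompt length.

```python
def validate_max_tokens(max_tokens: int, inference_max_seq_length: int) -> None:
    """Raise a descriptive error when the request asks for too many tokens.

    Hypothetical helper; the names mirror the request field and the CLI flag.
    """
    if max_tokens > inference_max_seq_length:
        raise ValueError(
            f"max_tokens ({max_tokens}) exceeds the server's "
            f"--inference_max_seq_length ({inference_max_seq_length})"
        )


# Values from this report: 8192 requested vs 4096 configured.
try:
    validate_max_tokens(8192, 4096)
except ValueError as err:
    print(err)
```

On the HTTP side, an error like this would presumably map to a 4xx response with the message in the body, so the client can tell a rejected request apart from a genuinely empty completion.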

