Description
System Info
Debian 11
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 75C P0 62W / 72W | 20585MiB / 23034MiB | 75% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 75C P0 66W / 72W | 20585MiB / 23034MiB | 76% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am seeking to use a set of LoRA weights (trained with linear 1.75 RoPE scaling and a rotary base of 875000) on a Llama3-8B base model. I am planning to deploy to 2x L4 GPUs and would like to support 14,000-token sequences.
I compiled the rel branch of the Triton Inference Server TensorRT-LLM backend (which also uses the rel branch of TensorRT-LLM). I took this path to ensure that the container I serve with is identical to the one I use for compilation.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout b92bdd79b6c50fb67203b6064e73662163012fe3
git lfs install
git submodule update --init --recursive
# Build the container. It will be used both for serving and for engine compilation
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm:b92bdd79 -f dockerfile/Dockerfile.trt_llm_backend .
I am updating the config.json file within the Llama3-8B base model for the RoPE scaling parameters used when training the LoRA adapters:
...
# Update rope_scaling key in config.json to {"type": "dynamic", "factor": 1.75}
config_path = os.path.join(BASE_MODEL_DIR, "config.json")
with open(config_path, "r") as f:
    config = json.load(f)
config["rope_scaling"] = {"type": "dynamic", "factor": 1.75}
config["rope_theta"] = 875000
with open(config_path, "w") as f:
    json.dump(config, f)
I am then using this container to compile the Llama3-8B base model for tensor parallelism 2 using the following convert/build commands.
python3 /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${BASE_MODEL_DIR} \
--output_dir /converted_base_model \
--rotary_base 875000 \
--dtype bfloat16 \
--tp_size 2
trtllm-build \
--max_input_len=14000 \
--max_num_tokens=14000 \
--max_seq_len=14000 \
--tp_size 2 \
--max_batch_size 4 \
--max_beam_width 3 \
--lora_plugin bfloat16 \
--gemm_plugin bfloat16 \
--lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h \
--max_lora_rank 32 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--multi_block_mode enable \
--remove_input_padding enable \
--checkpoint_dir /converted_base_model \
--use_custom_all_reduce enable \
--cluster_key L4 \
--workers=2 \
--use_paged_context_fmha enable \
--context_fmha enable \
--lookup_plugin bfloat16 \
--enable_xqa enable \
--output_dir ${ENGINE_DIR}
I have additional conversions that turn my LoRA adapter weights into warmup files, which I am using to initialize the LoRA weights. I am leaving those details out here (though I might open a PR to provide them in the backend repo).
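Before starting the server, I run a quick sanity check that the config.json rewrite from earlier round-trips correctly. This is a minimal sketch of my own (the `apply_rope_overrides` helper and the temporary-directory harness are illustration only, not part of TensorRT-LLM):

```python
import json
import os
import tempfile

def apply_rope_overrides(config_path, factor=1.75, theta=875000):
    """Rewrite rope_scaling / rope_theta in an HF-style config.json in place."""
    with open(config_path, "r") as f:
        config = json.load(f)
    config["rope_scaling"] = {"type": "dynamic", "factor": factor}
    config["rope_theta"] = theta
    with open(config_path, "w") as f:
        json.dump(config, f)

# Exercise the helper against a throwaway config rather than the real
# BASE_MODEL_DIR, then confirm the values survive a reload.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.json")
    with open(path, "w") as f:
        json.dump({"model_type": "llama"}, f)
    apply_rope_overrides(path)
    with open(path, "r") as f:
        reloaded = json.load(f)
    assert reloaded["rope_theta"] == 875000
    assert reloaded["rope_scaling"] == {"type": "dynamic", "factor": 1.75}
```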
I then start my inference server and warmup runs successfully.
When I send sequential single-inference traffic, all adapters produce high-quality results. When I run several concurrent requests (and in-flight batching begins), the results degrade: the same input produces different results when it is the only request in flight than when other inferences are in flight.
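For reference, this is the shape of the comparison I run. The `infer` stub below stands in for the real Triton client call (so the sketch is self-contained); against my engine the concurrent set diverges from the sequential one, whereas with the stub both trivially agree:

```python
from concurrent.futures import ThreadPoolExecutor

def infer(prompt, seed=12345):
    # Stand-in for the real client call; a deterministic engine should
    # return the same completion for the same (prompt, seed) under any load.
    return f"completion:{prompt}:{seed}"

def run_sequential(prompt, n):
    # One request at a time: never more than one inference in flight.
    return [infer(prompt) for _ in range(n)]

def run_concurrent(prompt, n):
    # n requests at once: exercises in-flight batching on the server.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: infer(prompt), range(n)))

prompt = "Summarize the maintenance report."
seq = run_sequential(prompt, 4)
conc = run_concurrent(prompt, 4)
# Expected for a deterministic engine: one unique completion, sets match.
assert len(set(seq)) == 1
assert set(seq) == set(conc)
```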
Expected behavior
Inference results should be deterministic (beam width 3, and I am passing a fixed random seed) and should not change when in-flight batching is active.
actual behavior
Results are deterministic only when a request is the sole inference in flight.
additional notes
I am willing to repost in the https://github.com/triton-inference-server/tensorrtllm_backend repo if the root cause is in that code.