Description
System Info
Debian 11
nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 75C P0 62W / 72W | 20585MiB / 23034MiB | 75% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 75C P0 66W / 72W | 20585MiB / 23034MiB | 76% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am seeking to use a set of LoRA weights (trained with linear 1.75 RoPE scaling and a rotary base of 875000) on a Llama3-8B base model. I am planning to deploy to 2x L4 GPUs and would like to support 14,000-token sequences.
I compiled the rel branch of the Triton Inference Server TensorRT-LLM backend (which also uses the rel branch of TensorRT-LLM). I took this path to ensure that the container I serve with is identical to the one I use for compilation.
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout b92bdd79b6c50fb67203b6064e73662163012fe3
git lfs install
git submodule update --init --recursive
# Build the container. It will be used both for serving and for engine compilation
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm:b92bdd79 -f dockerfile/Dockerfile.trt_llm_backend .
I am updating the config.json file within the Llama3-8B base model for the RoPE scaling parameters used when training the LoRA adapters:
...
# Update rope_scaling key in config.json to {"type": "dynamic", "factor": 1.75}
config_path = os.path.join(BASE_MODEL_DIR, "config.json")
with open(config_path, "r") as f:
    config = json.load(f)
config["rope_scaling"] = {"type": "dynamic", "factor": 1.75}
config["rope_theta"] = 875000
with open(config_path, "w") as f:
    json.dump(config, f)
I am then using this container to compile the Llama3-8B base model for tensor parallelism 2 using the following convert/build commands.
python3 /app/tensorrt_llm/examples/llama/convert_checkpoint.py \
--model_dir ${BASE_MODEL_DIR} \
--output_dir /converted_base_model \
--rotary_base 875000 \
--dtype bfloat16 \
--tp_size 2
trtllm-build \
--max_input_len=14000 \
--max_num_tokens=14000 \
--max_seq_len=14000 \
--tp_size 2 \
--max_batch_size 4 \
--max_beam_width 3 \
--lora_plugin bfloat16 \
--gemm_plugin bfloat16 \
--lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_gate mlp_4h_to_h \
--max_lora_rank 32 \
--gpt_attention_plugin bfloat16 \
--paged_kv_cache enable \
--multi_block_mode enable \
--remove_input_padding enable \
--checkpoint_dir /converted_base_model \
--use_custom_all_reduce enable \
--cluster_key L4 \
--workers=2 \
--use_paged_context_fmha enable \
--context_fmha enable \
--lookup_plugin bfloat16 \
--enable_xqa enable \
--output_dir ${ENGINE_DIR}
I have additional conversions that turn my LoRA adapter weights into warmup files, which I am using to initialize the LoRA weights. I am leaving those details out here (though I might open a PR to provide them in the backend repo).
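Before starting the server, I run a quick sanity check that the config.json rewrite from earlier round-trips correctly. This is a minimal sketch of my own (the `apply_rope_overrides` helper and the temporary-directory harness are illustration only, not part of TensorRT-LLM):

```python
import json
import os
import tempfile

def apply_rope_overrides(config_path, factor=1.75, theta=875000):
    """Rewrite rope_scaling / rope_theta in an HF-style config.json in place."""
    with open(config_path, "r") as f:
        config = json.load(f)
    config["rope_scaling"] = {"type": "dynamic", "factor": factor}
    config["rope_theta"] = theta
    with open(config_path, "w") as f:
        json.dump(config, f)

# Exercise the helper against a throwaway config rather than the real
# BASE_MODEL_DIR, then confirm the values survive a reload.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.json")
    with open(path, "w") as f:
        json.dump({"model_type": "llama"}, f)
    apply_rope_overrides(path)
    with open(path, "r") as f:
        reloaded = json.load(f)
    assert reloaded["rope_theta"] == 875000
    assert reloaded["rope_scaling"] == {"type": "dynamic", "factor": 1.75}
```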
I then start my inference server and warmup runs successfully.
When I send sequential single-inference traffic, all adapters produce high-quality results. When I run several concurrent requests (and in-flight batching begins), the results degrade: the same input produces different results when it is the only request in flight than when other inferences are in flight.
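For reference, this is the shape of the comparison I run. The `infer` stub below stands in for the real Triton client call (so the sketch is self-contained); against my engine the concurrent set diverges from the sequential one, whereas with the stub both trivially agree:

```python
from concurrent.futures import ThreadPoolExecutor

def infer(prompt, seed=12345):
    # Stand-in for the real client call; a deterministic engine should
    # return the same completion for the same (prompt, seed) under any load.
    return f"completion:{prompt}:{seed}"

def run_sequential(prompt, n):
    # One request at a time: never more than one inference in flight.
    return [infer(prompt) for _ in range(n)]

def run_concurrent(prompt, n):
    # n requests at once: exercises in-flight batching on the server.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: infer(prompt), range(n)))

prompt = "Summarize the maintenance report."
seq = run_sequential(prompt, 4)
conc = run_concurrent(prompt, 4)
# Expected for a deterministic engine: one unique completion, sets match.
assert len(set(seq)) == 1
assert set(seq) == set(conc)
```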
Expected behavior
Inference results should be deterministic (beam width 3, and I am passing a fixed random seed) and should not change when in-flight batching is active.
actual behavior
Results are deterministic only when a request is the sole inference in flight.
additional notes
I am willing to repost in the https://github.com/triton-inference-server/tensorrtllm_backend repo if the root cause is in that code.