
Problem running Triton Server with TensorRT-LLM backend and Llama 2 in Kubernetes #674

Closed
onlygo opened this issue Dec 16, 2023 · 2 comments


onlygo commented Dec 16, 2023

We are trying to deploy a Llama 2 model with Triton Inference Server and the TensorRT-LLM backend, using the nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 container image.

In Docker everything works fine. However, when we run the same container image in Kubernetes, we get a bus error:

[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I1215 18:15:20.401963 1034 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1215 18:15:20.402378 1034 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[tensor-rt-llm:1034 :0:1040] Caught signal 7 (Bus error: nonexistent physical address)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4b95 vs 0x436758)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4b95 vs 0x436758)

We tried the workaround mentioned in NVIDIA/nccl-tests#143 of increasing /dev/shm to 1 GiB (the dshm volume in the manifest below), but it didn't help.

Here is our deployment YAML:

apiVersion: v1
kind: Pod
metadata:
  name: tensor-rt-llm
spec:
  hostIPC: true
  nodeSelector:
    kubernetes.io/hostname: c300-11 
  volumes:
    - name: model-store
      persistentVolumeClaim:
        claimName: pv-claim
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
  
  containers:
    - name: tensor-rt-llm-dev
      image: nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
      #securityContext:
      #  privileged: true #doesn't help
      command: [ "sleep" ]
      args: [ "infinity" ]
      volumeMounts:
        - mountPath: "/mnt/pvc"
          name: model-store
        - mountPath: /dev/shm
          name: dshm
      resources:
        limits:
          memory: "64Gi"
          cpu: "8"
          nvidia.com/gpu: "4"

Full error log: trt_llm_bus_error_k8s.txt

Good log when running with Docker: trt_llm_good_docker.txt


onlygo commented Dec 17, 2023

We figured out that the issue was caused by huge pages not being enabled in our k8s cluster. Closing the issue.
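
For anyone hitting the same thing, here is a minimal sketch of what requesting huge pages from a pod can look like, assuming 2 MiB pages have already been pre-allocated on the node (for example via sysctl -w vm.nr_hugepages=1024 followed by a kubelet restart, so that hugepages-2Mi shows up as allocatable). The pod name and sizes below are placeholders, not necessarily our exact configuration:

apiVersion: v1
kind: Pod
metadata:
  name: tensor-rt-llm-hugepages # hypothetical name
spec:
  volumes:
    - name: hugepage
      emptyDir:
        medium: HugePages
  containers:
    - name: tensor-rt-llm-dev
      image: nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
      command: [ "sleep" ]
      args: [ "infinity" ]
      volumeMounts:
        - mountPath: /dev/hugepages
          name: hugepage
      resources:
        limits:
          # hugepage requests and limits must be equal; the amount is a placeholder
          hugepages-2Mi: 2Gi
          memory: "64Gi"
          cpu: "8"
          nvidia.com/gpu: "4"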

onlygo closed this as completed Dec 17, 2023
@Wenhan-Tan

Hi @onlygo, could you please explain in more detail how you fixed the issue? I'm having the same problem in Kubernetes as well.
