
How to run two engines on one GPU? #3169

Closed
ArtemisZGL opened this issue Jul 28, 2023 · 5 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@ArtemisZGL

Description

I tried to run two engines on one GPU. When running one engine, the inference time is 10 ms with 10-20% GPU utilization, but when I start another Docker container and run a second engine at the same time, the inference time for each engine rises to 20 ms, with GPU utilization still at 10-20%.

I also tried multiprocessing in Python, creating a separate CUDA context in each process; the result is the same as above.

However, when I do the same in PyTorch, the inference time stays at 10 ms and the GPU utilization doubles.

Environment

TensorRT Version: 8.6

NVIDIA GPU: 4070

NVIDIA Driver Version: 525.125.06

CUDA Version: 11.3

CUDNN Version:

Operating System: ubuntu 20.04

Python Version (if applicable): 3.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Steps To Reproduce

common.py (https://github.com/NVIDIA/TensorRT/blob/96e23978cd6e4a8fe869696d3d8ec2b47120629b/samples/python/common.py)

Inference code:

import time

import tensorrt as trt
import torch

import common  # samples/python/common.py from the TensorRT repo

import pycuda.driver as cuda  # used by the commented-out per-process context experiment


def get_engine(engine_file_path):
    """Deserialize a TensorRT engine from disk."""
    TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


def get_rand_inputs(device, batch_size):
    """Return a random input batch for the model."""
    return torch.rand((batch_size, 255, 255, 3)).to(device)


def trt_infer(engine, context, inputs_):
    # Select the optimization profile and bind the actual batch size
    # (the engine was built with a dynamic batch dimension).
    context.active_optimization_profile = 0
    origin_inputshape = context.get_binding_shape(0)
    origin_inputshape[0] = inputs_.shape[0]
    context.set_binding_shape(0, origin_inputshape)

    # allocate_buffers_v2 is presumably a local variant of common.allocate_buffers
    # that also takes the execution context, so buffers match the bound shapes
    # (the linked common.py only defines allocate_buffers).
    inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
    inputs[0].host = to_numpy(inputs_)

    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    return trt_outputs


def test_trt_one_replica():
    # cuda.init()
    # cuda_ctx = cuda.Device(0).make_context()
    engine_model_path = "model.trt"
    engine = get_engine(engine_model_path)
    context = engine.create_execution_context()

    total_time = 0
    infer_times = 100

    for _ in range(infer_times):
        batch_size = 5
        inputs_ = get_rand_inputs("cpu", batch_size)

        _start = time.perf_counter()
        # cuda_ctx.push()
        trt_outputs = trt_infer(engine, context, inputs_)
        # cuda_ctx.pop()
        total_time += time.perf_counter() - _start

    # cuda_ctx.pop()
    print(f"latency@finished: {total_time / infer_times:.6f}s.")


if __name__ == "__main__":
    test_trt_one_replica()

@zerollzeng
Collaborator

Please use multi-threading. Multiprocessing creates multiple CUDA contexts, and CUDA context switching on the GPU is time-sliced.
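
A minimal sketch of that multi-threaded pattern, assuming the engine from the reproduction above (dynamic batch, 255x255x3 FP32 input at binding 0, a single FP32 output at binding 1 — all assumptions), with PyTorch used only for convenient CUDA allocations and streams:

import threading

import tensorrt as trt
import torch  # used here only for CUDA allocations and streams

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
with open("model.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

def worker(shape):
    # One execution context and one CUDA stream per thread. Both threads
    # share the process's single CUDA context, so their kernels can overlap
    # on the GPU instead of being time-sliced. If the engine has multiple
    # optimization profiles, each context should select a distinct one.
    context = engine.create_execution_context()
    context.set_binding_shape(0, shape)
    x = torch.rand(shape, device="cuda")                                 # assumed FP32 input
    y = torch.empty(tuple(context.get_binding_shape(1)), device="cuda")  # assumed FP32 output
    stream = torch.cuda.Stream()
    context.execute_async_v2([x.data_ptr(), y.data_ptr()], stream.cuda_stream)
    stream.synchronize()

threads = [threading.Thread(target=worker, args=((5, 255, 255, 3),)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()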

zerollzeng self-assigned this Jul 29, 2023
zerollzeng added the triaged label Jul 29, 2023
@ArtemisZGL
Author

Please use multi-threading. Multiprocessing creates multiple CUDA contexts, and CUDA context switching on the GPU is time-sliced.

But what if I start two Docker containers sharing one GPU? That still cannot utilize the GPU fully. Also, I tried the torch2trt repo with multiprocessing: its running time does not double, but the GPU utilization does. As far as I know, torch2trt is built on TensorRT, so a plain TensorRT engine should work as well.

@zerollzeng
Collaborator

Let me make it simpler: two processes use the GPU in time slices; that's expected.

You can explore MPS as an option: https://docs.nvidia.com/deploy/mps/index.html

@ArtemisZGL
Author

ArtemisZGL commented Aug 14, 2023

Let me make it simpler: two processes use the GPU in time slices; that's expected.

You can explore MPS as an option: https://docs.nvidia.com/deploy/mps/index.html

Thanks. I found a simple way in Python: wrap the TRT engine in a torch.nn.Module, following the practice of torch2trt.
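
A minimal sketch of such a wrapper, in the spirit of torch2trt's TRTModule (the binding names, single input/output, and FP32 dtype below are assumptions, not the author's actual code):

import tensorrt as trt
import torch

class TRTModule(torch.nn.Module):
    """Run a deserialized TensorRT engine inside PyTorch's own CUDA
    context and stream, the way torch2trt does."""

    def __init__(self, engine_path, input_name="input", output_name="output"):
        super().__init__()
        logger = trt.Logger(trt.Logger.ERROR)
        with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Binding names are assumptions; verify with engine.get_binding_name(i).
        self.input_idx = self.engine.get_binding_index(input_name)
        self.output_idx = self.engine.get_binding_index(output_name)

    def forward(self, x):
        # Bind the actual input shape (dynamic batch dimension).
        self.context.set_binding_shape(self.input_idx, tuple(x.shape))
        out_shape = tuple(self.context.get_binding_shape(self.output_idx))
        y = torch.empty(out_shape, dtype=torch.float32, device=x.device)  # assumed FP32

        # Pass raw device pointers and run on PyTorch's current stream:
        # no second CUDA context and no host<->device staging buffers.
        bindings = [0] * self.engine.num_bindings
        bindings[self.input_idx] = x.contiguous().data_ptr()
        bindings[self.output_idx] = y.data_ptr()
        self.context.execute_async_v2(bindings, torch.cuda.current_stream().cuda_stream)
        return y

Usage would then look like model = TRTModule("model.trt") followed by y = model(x), with x already on the GPU. Because everything runs in PyTorch's CUDA context, multiple processes behave the way the PyTorch baseline in this issue does.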

@susnato

susnato commented Feb 16, 2024

Hi @ArtemisZGL, if possible, could you please share the solution?
