
How to run two engines on one GPU? #3169

Closed
ArtemisZGL opened this issue Jul 28, 2023 · 5 comments
Labels: triaged (Issue has been triaged by maintainers)

Comments

@ArtemisZGL

Description

I tried to run two engines on one GPU. When running one engine, the inference time is 10 ms with 10-20% GPU utilization, but when I start another Docker container and run a second engine at the same time, the inference time for each engine rises to 20 ms, with GPU utilization still at 10-20%.

I also tried multiprocessing in Python, creating a separate CUDA context in each process; the result is the same as above.

However, when I do the same in PyTorch, the inference time stays at 10 ms and the GPU utilization doubles.

Environment

TensorRT Version: 8.6

NVIDIA GPU: 4070

NVIDIA Driver Version: 525.125.06

CUDA Version: 11.3

CUDNN Version:

Operating System: ubuntu 20.04

Python Version (if applicable): 3.10

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Steps To Reproduce

common.py (https://github.com/NVIDIA/TensorRT/blob/96e23978cd6e4a8fe869696d3d8ec2b47120629b/samples/python/common.py)

Inference code:

import time

import tensorrt as trt
import torch

import common  # samples/python/common.py from the TensorRT repo

import pycuda.driver as cuda  # used by the commented-out per-process context experiment


def get_engine(engine_file_path):
    """Deserialize a TensorRT engine from disk."""
    TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
    print("Reading engine from file {}".format(engine_file_path))
    with open(engine_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


def get_rand_inputs(device, batch_size):
    """Return a random input batch for the model."""
    return torch.rand((batch_size, 255, 255, 3)).to(device)


def trt_infer(engine, context, inputs_):
    # Select the optimization profile and bind the actual batch size
    # (the engine was built with a dynamic batch dimension).
    context.active_optimization_profile = 0
    origin_inputshape = context.get_binding_shape(0)
    origin_inputshape[0] = inputs_.shape[0]
    context.set_binding_shape(0, origin_inputshape)

    # allocate_buffers_v2 is presumably a local variant of common.allocate_buffers
    # that also takes the execution context, so buffers match the bound shapes
    # (the linked common.py only defines allocate_buffers).
    inputs, outputs, bindings, stream = common.allocate_buffers_v2(engine, context)
    inputs[0].host = to_numpy(inputs_)

    trt_outputs = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
    return trt_outputs


def test_trt_one_replica():
    # cuda.init()
    # cuda_ctx = cuda.Device(0).make_context()
    engine_model_path = "model.trt"
    engine = get_engine(engine_model_path)
    context = engine.create_execution_context()

    total_time = 0
    infer_times = 100

    for _ in range(infer_times):
        batch_size = 5
        inputs_ = get_rand_inputs("cpu", batch_size)

        _start = time.perf_counter()
        # cuda_ctx.push()
        trt_outputs = trt_infer(engine, context, inputs_)
        # cuda_ctx.pop()
        total_time += time.perf_counter() - _start

    # cuda_ctx.pop()
    print(f"latency@finished: {total_time / infer_times:.6f}s.")


if __name__ == "__main__":
    test_trt_one_replica()

@zerollzeng
Collaborator

Please use multi-threading. Multiprocessing creates multiple CUDA contexts, and CUDA context switching on the GPU is time-sliced.
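
A minimal sketch of that multi-threaded pattern, assuming the engine from the reproduction above (dynamic batch, 255x255x3 FP32 input at binding 0, a single FP32 output at binding 1 — all assumptions), with PyTorch used only for convenient CUDA allocations and streams:

import threading

import tensorrt as trt
import torch  # used here only for CUDA allocations and streams

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
with open("model.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

def worker(shape):
    # One execution context and one CUDA stream per thread. Both threads
    # share the process's single CUDA context, so their kernels can overlap
    # on the GPU instead of being time-sliced. If the engine has multiple
    # optimization profiles, each context should select a distinct one.
    context = engine.create_execution_context()
    context.set_binding_shape(0, shape)
    x = torch.rand(shape, device="cuda")                                 # assumed FP32 input
    y = torch.empty(tuple(context.get_binding_shape(1)), device="cuda")  # assumed FP32 output
    stream = torch.cuda.Stream()
    context.execute_async_v2([x.data_ptr(), y.data_ptr()], stream.cuda_stream)
    stream.synchronize()

threads = [threading.Thread(target=worker, args=((5, 255, 255, 3),)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()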

zerollzeng self-assigned this Jul 29, 2023
zerollzeng added the triaged label Jul 29, 2023
@ArtemisZGL
Author

Please use multi-threading. Multiprocessing creates multiple CUDA contexts, and CUDA context switching on the GPU is time-sliced.

But what if I start two Docker containers sharing one GPU? That still cannot utilize the GPU fully. Also, I tried the torch2trt repo with multiprocessing: its running time does not double, but the GPU utilization does. As far as I know, torch2trt is built on TensorRT, so a plain TensorRT engine should work as well.

@zerollzeng
Collaborator

Let me make it simpler: two processes use the GPU in time slices; that's expected.

You can explore MPS as an option: https://docs.nvidia.com/deploy/mps/index.html

@ArtemisZGL
Author

ArtemisZGL commented Aug 14, 2023

Let me make it simpler: two processes use the GPU in time slices; that's expected.

You can explore MPS as an option: https://docs.nvidia.com/deploy/mps/index.html

Thanks. I found a simple way in Python: wrap the TRT engine in a torch.nn.Module, following the practice of torch2trt.
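
A minimal sketch of such a wrapper, in the spirit of torch2trt's TRTModule (the binding names, single input/output, and FP32 dtype below are assumptions, not the author's actual code):

import tensorrt as trt
import torch

class TRTModule(torch.nn.Module):
    """Run a deserialized TensorRT engine inside PyTorch's own CUDA
    context and stream, the way torch2trt does."""

    def __init__(self, engine_path, input_name="input", output_name="output"):
        super().__init__()
        logger = trt.Logger(trt.Logger.ERROR)
        with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Binding names are assumptions; verify with engine.get_binding_name(i).
        self.input_idx = self.engine.get_binding_index(input_name)
        self.output_idx = self.engine.get_binding_index(output_name)

    def forward(self, x):
        # Bind the actual input shape (dynamic batch dimension).
        self.context.set_binding_shape(self.input_idx, tuple(x.shape))
        out_shape = tuple(self.context.get_binding_shape(self.output_idx))
        y = torch.empty(out_shape, dtype=torch.float32, device=x.device)  # assumed FP32

        # Pass raw device pointers and run on PyTorch's current stream:
        # no second CUDA context and no host<->device staging buffers.
        bindings = [0] * self.engine.num_bindings
        bindings[self.input_idx] = x.contiguous().data_ptr()
        bindings[self.output_idx] = y.data_ptr()
        self.context.execute_async_v2(bindings, torch.cuda.current_stream().cuda_stream)
        return y

Usage would then look like model = TRTModule("model.trt") followed by y = model(x), with x already on the GPU. Because everything runs in PyTorch's CUDA context, multiple processes behave the way the PyTorch baseline in this issue does.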

@susnato

susnato commented Feb 16, 2024

Hi @ArtemisZGL, if possible, could you please share the solution?
