
any engine for inference subgraph acceleration naive design #10028

Closed
Superjomn opened this issue Apr 18, 2018 · 7 comments

Superjomn (Contributor) commented Apr 18, 2018

architecture

  • frontend to mark the subgraphs that should be optimized by x engine
  • inference preparation to get the subgraph, change program desc by replacing the subgraph with an x engine op
  • engine op to build an x engine and execute it like a normal operator.

phases

frontend

  • manually partition the graph
  • TODO LATER: automatically partition the graph

Some initial ideas: just add a special with-block for inference

a = op0(b, c)

with infer.accelerate_by_tensorrt:
    a1 = op1(b, c)
    a2 = op1(b, c)

c = op2(b, c)

with infer.accelerate_by_some_other_engine:
    a3 = op3(c)
    ...

backend

  • inference prepare
    • partition graph and transform the infer program desc
    • tensorrt_engine_op.build
      • convert: take the subgraph's block desc as input, and add TensorRT layers into the TensorRT engine
        • transform weight format from fluid to tensorrt
        • add tensorrt layer
  • inference execute
    • for op in the infer program desc
      • op.run

x engine

  • get a subgraph's block desc (from attribute)
  • build x engine once
  • execute the x engine any number of times

x op

  • construct x op from x network

convert

construct x network from a subgraph's block desc
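
To make the x engine / x op behavior above concrete, here is a minimal sketch of how such an engine op could build the engine once from the sub-block attribute and then execute it any number of times. XEngine, Convert, and the member names are placeholders for illustration, not an agreed API.

// rough sketch only: XEngine and Convert(block, scope, engine) are placeholders
class XEngineOp {
 public:
  explicit XEngineOp(const framework::BlockDesc& sub_block)
      : sub_block_(sub_block) {}

  void Run(const framework::Scope& scope) {
    if (engine_ == nullptr) {
      // build the x engine exactly once from the subgraph's block desc
      engine_.reset(new XEngine());
      Convert(sub_block_, scope, engine_.get());
    }
    // the already-built engine can be executed any number of times
    engine_->Execute();
  }

 private:
  // the subgraph's block desc, taken from the op's attribute
  const framework::BlockDesc& sub_block_;
  std::unique_ptr<XEngine> engine_;
};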

@Xreki Xreki added the 预测 (originally named "Inference"; covers C-API inference issues, etc.) label Apr 19, 2018
@Superjomn Superjomn added this to To do in Inference on Engine via automation Apr 19, 2018
luotao1 (Contributor) commented Apr 19, 2018

Convert

construct a TensorRT network from a sub-block's desc

Converter class and member

using OpConverter = std::function<void(const framework::OpDesc&)>;

class Converter {
  // registered op converters, keyed by fluid op type
  std::unordered_map<std::string, OpConverter> op_registry_;
  // tensorrt input/output tensor list, whose key is the fluid variable name,
  // and value is the pointer to the corresponding tensorrt tensor.
  std::map<std::string, nvinfer1::ITensor*> tr_tensors_;
  // block: the sub-block to convert
  const framework::BlockDesc& block_;
  // scope: fluid inference scope
  const framework::Scope& scope_;
  // network: tensorrt network
  nvinfer1::INetworkDefinition* network_;

 public:
  // the reference members must be bound in the member initializer list
  Converter(const framework::BlockDesc& block,
            const framework::Scope& scope,
            nvinfer1::INetworkDefinition* network)
      : block_(block), scope_(scope), network_(network) {
    this->register_op_converters();
  }

  // register different op converters
  void register_op_converters();

  // construct a TensorRT network from a sub-block's desc
  void ConvertSubBlockToTensorRTNetwork();

  // convert the inputs of an op to tensorrt inputs
  void ConvertInput(const framework::OpDesc& first_op);

  // convert a fluid Mul op to a tensorrt fc layer
  void ConvertMul(const framework::OpDesc& mul_op);
  ...

  // convert tensorrt outputs to fluid
  void ConvertOutput(const framework::OpDesc& last_op);
};

function details

void register_op_converters() {
  // keys are fluid op types; use lambdas so member functions can be stored in std::function
  op_registry_["mul"] = [this](const framework::OpDesc& op) { ConvertMul(op); };
  op_registry_["conv2d"] = [this](const framework::OpDesc& op) { ConvertConv2d(op); };
  ...
}

void ConvertSubBlockToTensorRTNetwork() {
   auto all_ops = block_.AllOps();

   // convert the fluid inputs of the first op to tensorrt inputs.
   ConvertInput(*all_ops.front());

   for (auto* op : all_ops) {
     // convert each fluid op to a tensorrt layer
     const OpConverter& op_converter = op_registry_.at(op->Type());
     op_converter(*op);
   }

   // convert the tensorrt outputs of the last op to fluid outputs.
   ConvertOutput(*all_ops.back());
}

void ConvertInput(const framework::OpDesc& first_op) {
  auto var_names = first_op.InputArgumentNames();
  for (const auto& var_name : var_names) {
    auto* fluid_tensor = scope_.FindVar(var_name)->GetMutable<framework::LoDTensor>();

    // do some transformation for the input fluid tensor, and get its type and dims.
    auto shape_tensor = transformation(fluid_tensor);
    nvinfer1::DataType type = shape_tensor.type();
    nvinfer1::DimsCHW dim = shape_tensor.dims();

    // add the input into the tensorrt network (addInput takes a C string name)
    nvinfer1::ITensor* input_tensor =
        network_->addInput(var_name.c_str(), type, dim);

    // insert the input tensor into tensorrt's tensor list
    tr_tensors_[var_name] = input_tensor;
  }
}

void ConvertMul(const framework::OpDesc& op) {
   // get the input tensor from tensorrt's tensor list
   std::string x_var_name = op.Input("X").front();
   auto* x_tensor = tr_tensors_[x_var_name];

   // get the weight from the fluid inference scope (Y is an input of mul)
   std::string y_var_name = op.Input("Y").front();
   auto* y_tensor = scope_.FindVar(y_var_name)->GetMutable<framework::LoDTensor>();

   // do some weight transformation (e.g. into nvinfer1::Weights)
   auto y_weights = transformation(y_tensor);

   // add the layer into the tensorrt network
   nvinfer1::IFullyConnectedLayer* layer = network_->addFullyConnected(
      *x_tensor, 1 /*output size*/, y_weights, nvinfer1::Weights{} /*bias*/);

   // get the output tensor, and insert it into tensorrt's tensor list
   std::string out_var_name = op.Output("Out").front();
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
   tr_tensors_[out_var_name] = output_tensor;
}

void ConvertOutput(const framework::OpDesc& last_op) {
  auto var_names = last_op.OutputArgumentNames();
  for (const auto& var_name : var_names) {
    // get the output tensor from tensorrt's tensor list, and mark it as a network output
    auto* tr_tensor = tr_tensors_[var_name];
    network_->markOutput(*tr_tensor);

    // do some transformation for the output tensorrt tensor,
    // and modify the fluid tensor in the inference scope
    auto* fluid_tensor = scope_.FindVar(var_name)->GetMutable<framework::LoDTensor>();
    transformation(tr_tensor, fluid_tensor);
  }
}

how to call the convert function

auto executor = paddle::framework::Executor(place);
auto* scope = new paddle::framework::Scope();
auto inference_program = paddle::inference::Load(
    &executor, scope, program_path, parameter_path);

nvinfer1::IBuilder* builder = createInferBuilder(logger);
nvinfer1::INetworkDefinition* network = builder->createNetwork();
Converter converter(inference_program->Block(0), *scope, network);
converter.ConvertSubBlockToTensorRTNetwork();

// build and execute the tensorrt engine
auto engine = builder->buildCudaEngine(*network);
...
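
Continuing from the engine built above, executing it with the TensorRT API could look roughly like the following sketch; batch_size and buffers (an array of device pointers, one per network binding) are illustrative assumptions, and builder->setMaxBatchSize(batch_size) is assumed to have been called before buildCudaEngine.

nvinfer1::IExecutionContext* context = engine->createExecutionContext();
// buffers: one device pointer per network binding (inputs and outputs),
// ordered by binding index
context->execute(batch_size, buffers);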

@Xreki Xreki added this to Integrate TensorRT in Inference Framework Apr 20, 2018
Superjomn (Contributor, Author) commented:

  1. use a TensorrtEngine class to operate on TensorRT (a rough interface is sketched below)
  2. convert Tensor to ITensor, and ITensor back to Tensor
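
A rough sketch of what such a TensorrtEngine wrapper could expose, consistent with the engine_->network() and TRT_ENGINE_ADD_LAYER usage further down; all method names here are assumptions rather than a fixed interface.

class TensorrtEngine {
 public:
  // declare a network input (wraps nvinfer1::INetworkDefinition::addInput)
  nvinfer1::ITensor* DeclareInput(const std::string& name,
                                  nvinfer1::DataType type,
                                  const nvinfer1::DimsCHW& dims);
  // mark a network output (wraps nvinfer1::INetworkDefinition::markOutput)
  void DeclareOutput(const std::string& name);
  // freeze the network and build the cuda engine; called once after conversion
  void FreezeNetwork();
  // run the built engine; can be called any number of times
  void Execute(int batch_size);
  // expose the underlying network so op converters can add layers,
  // e.g. through TRT_ENGINE_ADD_LAYER(engine, FullyConnected, ...)
  nvinfer1::INetworkDefinition* network();
};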

luotao1 (Contributor) commented Apr 20, 2018

motivation

Why adopt the subgraph approach to call TensorRT, and how large is the performance gap compared with using TensorRT directly?
Since benchmarking both would take some time, we first asked the TensorFlow community: tensorflow/models#4028

The main takeaways from that issue:

  1. "Using TensorRT directly" is certainly faster than the "subgraph approach", but for the segments that can be converted the two are very close:
    Our tests indicate that, for the converted segments, TFTRT performance is pretty close to TensorRT performance.
  2. The "subgraph approach" is easier than "using TensorRT directly", because TensorRT does not support many ops:
    Even though TensorRT supports large fraction of ops encountered in the models it doesn't support everything. TF-TRT approach enables you to get supported parts of the networks run in TensorRT and unsupported part to run in TensorFlow.
  3. They will add plugin support in the future:
    In the future we will add plugin support so a larger fraction of the networks could be transformed to TensorRT engine.

luotao1 (Contributor) commented Apr 20, 2018

TensorRTConverter

convert fluid block desc to tensorrt network

class and member

class TensorRTConverter {
  // registered op converters, keyed by fluid op type
  std::unordered_map<std::string, std::function<void(const framework::OpDesc&)>> op_registry_;
  // fluid inference scope
  const framework::Scope& scope_;
  // engine: tensorrt engine wrapper
  TensorrtEngine* engine_;
  // tensorrt input/output tensor list, whose key is the fluid variable name,
  // and value is the pointer to the corresponding tensorrt tensor.
  std::map<std::string, nvinfer1::ITensor*> tr_tensors_;

 public:
  // the reference member must be bound in the member initializer list
  TensorRTConverter(const framework::Scope& scope,
                    TensorrtEngine* engine,
                    const std::map<std::string, nvinfer1::ITensor*>& tr_tensors)
      : scope_(scope), engine_(engine), tr_tensors_(tr_tensors) {
    this->RegisterOpConverters();
  }

  // convert a fluid op to a tensorrt layer
  void ConvertOp(const framework::OpDesc& op);

  // convert a fluid block to a tensorrt network
  void ConvertBlock(const framework::BlockDesc& block);

 private:
  // register different op converters
  void RegisterOpConverters();

  // convert a fluid Mul op to a tensorrt fc layer
  void ConvertMul(const framework::OpDesc& op);

  // convert other fluid ops to tensorrt layers
  ...
};

function details

void RegisterOpConverters() {
  // keys are fluid op types; use lambdas so member functions can be stored in std::function
  op_registry_["mul"] = [this](const framework::OpDesc& op) { ConvertMul(op); };
  op_registry_["conv2d"] = [this](const framework::OpDesc& op) { ConvertConv2d(op); };
  ...
}

void ConvertOp(const framework::OpDesc& op) {
  const auto& op_converter = op_registry_.at(op.Type());
  op_converter(op);
}

void ConvertBlock(const framework::BlockDesc& block) {
  for (auto* op : block.AllOps()) {
   // convert each fluid op to a tensorrt layer
   ConvertOp(*op);
  }
}

void ConvertMul(const framework::OpDesc& op) {
   // get the input tensor from tensorrt's tensor list
   std::string x_var_name = op.Input("X").front();
   auto* x_tensor = tr_tensors_[x_var_name];

   // get the weight from the fluid inference scope (Y is an input of mul)
   std::string y_var_name = op.Input("Y").front();
   auto* y_tensor = scope_.FindVar(y_var_name)->GetMutable<framework::LoDTensor>();

   // do some weight transformation (e.g. into nvinfer1::Weights)
   auto y_weights = transformation(y_tensor);

   // add the layer into the tensorrt network
   auto* fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x_tensor,
                                         1 /*output size*/, y_weights,
                                         nvinfer1::Weights{} /*bias*/);

   // get the output tensor, and insert it into tensorrt's tensor list
   std::string out_var_name = op.Output("Out").front();
   nvinfer1::ITensor* output_tensor = fc_layer->getOutput(0);
   tr_tensors_[out_var_name] = output_tensor;
}

usage

TensorRTConverter tensorrt_converter(scope, engine, tr_tensors);
tensorrt_converter.ConvertBlock(block);

ITensorConverter

convert between fluid Tensor and nvinfer1::ITensor

class and member

class ITensorConverter {
  // fluid inference scope
  const framework::Scope& scope_;
  // engine: tensorrt engine wrapper
  TensorrtEngine* engine_;
  // tensorrt input/output tensor list, whose key is the fluid variable name,
  // and value is the pointer to the corresponding tensorrt tensor.
  std::map<std::string, nvinfer1::ITensor*> tr_tensors_;

 public:
  // the reference member must be bound in the member initializer list
  ITensorConverter(const framework::Scope& scope,
                   TensorrtEngine* engine,
                   const std::map<std::string, nvinfer1::ITensor*>& tr_tensors)
      : scope_(scope), engine_(engine), tr_tensors_(tr_tensors) {}

  // copy a fluid tensor to a tensorrt tensor
  void TensorToNVITensor(const std::string& tensor_name);

  // copy a tensorrt tensor to a fluid tensor
  void NVITensorToTensor(const std::string& tensor_name);
};

function details

void TensorToNVITensor(const std::string& tensor_name) {
  auto* fluid_tensor = scope_.FindVar(tensor_name)->GetMutable<framework::LoDTensor>();

  // do some transformation for the input fluid tensor, and get its type and dims.
  auto shape_tensor = transformation(fluid_tensor);

  nvinfer1::DataType type = shape_tensor.type();
  nvinfer1::DimsCHW dim = shape_tensor.dims();

  // declare the tensorrt input (addInput takes a C string name)
  nvinfer1::ITensor* input_tensor =
      engine_->network()->addInput(tensor_name.c_str(), type, dim);

  // insert the tensorrt input into tensorrt's tensor map
  tr_tensors_[tensor_name] = input_tensor;
}

void NVITensorToTensor(const std::string& tensor_name) {
  auto* tr_tensor = tr_tensors_[tensor_name];

  // do some transformation for the output tensorrt tensor,
  // and modify the fluid tensor in the inference scope
  auto* fluid_tensor = scope_.FindVar(tensor_name)->GetMutable<framework::LoDTensor>();
  transformation(tr_tensor, fluid_tensor);
}
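
For symmetry with the TensorRTConverter usage above, calling this converter might look like the following sketch; the variable names "x" and "out" are only examples.

ITensorConverter itensor_converter(scope, engine, tr_tensors);
// copy a fluid input variable into the tensorrt network before execution
itensor_converter.TensorToNVITensor("x");
// after execution, copy a tensorrt output back into the fluid scope
itensor_converter.NVITensorToTensor("out");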

Superjomn (Contributor, Author) commented:

TensorToNVITensor and NVITensorToTensor might need different logic for different fluid op / TensorRT layer pairs.

For example:

  • fluid rnn -> TensorRT rnn: LoDTensor -> Tensor-like ITensor
  • fluid fc -> TensorRT fc: Tensor -> Tensor-like ITensor

So different ops need different conversion logic.

It might be something like TensorRTConverter, with a register; a sketch follows below.
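
A minimal sketch of that register idea, keyed by fluid op type; the names below (TensorConverter, TensorConverterRegistry) are placeholders for illustration, not an agreed API.

#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// maps a fluid op type (e.g. "fc", "rnn") to the tensor conversion logic it
// needs, so LoDTensor-based and plain Tensor-based ops can register
// different converters.
using TensorConverter = std::function<void(const std::string& var_name)>;

class TensorConverterRegistry {
 public:
  void Register(const std::string& op_type, TensorConverter converter) {
    registry_[op_type] = std::move(converter);
  }
  const TensorConverter& Get(const std::string& op_type) const {
    return registry_.at(op_type);
  }

 private:
  std::unordered_map<std::string, TensorConverter> registry_;
};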

@Superjomn Superjomn moved this from To do to In progress in Inference on Engine Apr 25, 2018
@Superjomn Superjomn moved this from In progress to Done in Inference on Engine May 12, 2018
shanyi15 (Collaborator) commented:

Hello, this issue has not been updated in the past month. We will close it today for the sake of other users' experience. If you still need to follow up on this question after closing, please feel free to reopen it. In that case, we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you so much for your support of PaddlePaddle Group!

marsbzp commented Apr 26, 2022

Hello, I'd like to ask: in int8 mode, will the TensorRT subgraph perform the conv+bn fusion?
