
any engine for inference subgraph acceleration naive design #10028

Closed
Superjomn opened this issue Apr 18, 2018 · 7 comments

Superjomn (Contributor) commented Apr 18, 2018

architecture

  • frontend to mark the subgraphs that should be optimized by x engine
  • inference preparation to get the subgraph, change program desc by replacing the subgraph with an x engine op
  • engine op to build an x engine and execute it like a normal operator.

phases

frontend

  • manually partition the graph
  • TODO LATER: automatically partition the graph

Some initial ideas: just add a special with-block for inference

a = op0(b, c)

with infer.accelerate_by_tensorrt:
    a1 = op1(b, c)
    a2 = op1(b, c)

c = op2(b, c)

with infer.accelerate_by_some_other_engine:
    a3 = op3(c)
    ...

backend

  • inference prepare
    • partition graph and transform the infer program desc
    • tensorrt_engine_op.build
      • convert: take the subgraph's block desc as input, and add TensorRT layers into the TensorRT engine
        • transform weight format from fluid to tensorrt
        • add tensorrt layer
  • inference execute
    • for op in the infer program desc
      • op.run

x engine

  • get a subgraph's block desc (from attribute)
  • build x engine once
  • execute the x engine any number of times

x op

  • construct x op from x network

convert

construct x network from a subgraph's block desc
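
To make the x engine / x op behavior above concrete, here is a minimal sketch of how such an engine op could build the engine once from the sub-block attribute and then execute it any number of times. XEngine, Convert, and the member names are placeholders for illustration, not an agreed API.

// rough sketch only: XEngine and Convert(block, scope, engine) are placeholders
class XEngineOp {
 public:
  explicit XEngineOp(const framework::BlockDesc& sub_block)
      : sub_block_(sub_block) {}

  void Run(const framework::Scope& scope) {
    if (engine_ == nullptr) {
      // build the x engine exactly once from the subgraph's block desc
      engine_.reset(new XEngine());
      Convert(sub_block_, scope, engine_.get());
    }
    // the already-built engine can be executed any number of times
    engine_->Execute();
  }

 private:
  // the subgraph's block desc, taken from the op's attribute
  const framework::BlockDesc& sub_block_;
  std::unique_ptr<XEngine> engine_;
};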

@Xreki Xreki added the 预测 (originally named "Inference"; covers C-API inference issues, etc.) label Apr 19, 2018
@Superjomn Superjomn added this to To do in Inference on Engine via automation Apr 19, 2018
luotao1 (Contributor) commented Apr 19, 2018

Convert

construct a TensorRT network from a sub-block's desc

Converter class and member

using OpConverter = std::function<void(const framework::OpDesc&)>;

class Converter {
  // registered op converters, keyed by fluid op type
  std::unordered_map<std::string, OpConverter> op_registry_;
  // tensorrt input/output tensor list, whose key is the fluid variable name,
  // and value is the pointer to the corresponding tensorrt tensor.
  std::map<std::string, nvinfer1::ITensor*> tr_tensors_;
  // block: the sub-block to convert
  const framework::BlockDesc& block_;
  // scope: fluid inference scope
  const framework::Scope& scope_;
  // network: tensorrt network
  nvinfer1::INetworkDefinition* network_;

 public:
  // the reference members must be bound in the member initializer list
  Converter(const framework::BlockDesc& block,
            const framework::Scope& scope,
            nvinfer1::INetworkDefinition* network)
      : block_(block), scope_(scope), network_(network) {
    this->register_op_converters();
  }

  // register different op converters
  void register_op_converters();

  // construct a TensorRT network from a sub-block's desc
  void ConvertSubBlockToTensorRTNetwork();

  // convert the inputs of an op to tensorrt inputs
  void ConvertInput(const framework::OpDesc& first_op);

  // convert a fluid Mul op to a tensorrt fc layer
  void ConvertMul(const framework::OpDesc& mul_op);
  ...

  // convert tensorrt outputs to fluid
  void ConvertOutput(const framework::OpDesc& last_op);
};

function details

void register_op_converters() {
  // keys are fluid op types; use lambdas so member functions can be stored in std::function
  op_registry_["mul"] = [this](const framework::OpDesc& op) { ConvertMul(op); };
  op_registry_["conv2d"] = [this](const framework::OpDesc& op) { ConvertConv2d(op); };
  ...
}

void ConvertSubBlockToTensorRTNetwork() {
   auto all_ops = block_.AllOps();

   // convert the fluid inputs of the first op to tensorrt inputs.
   ConvertInput(*all_ops.front());

   for (auto* op : all_ops) {
     // convert each fluid op to a tensorrt layer
     const OpConverter& op_converter = op_registry_.at(op->Type());
     op_converter(*op);
   }

   // convert the tensorrt outputs of the last op to fluid outputs.
   ConvertOutput(*all_ops.back());
}

void ConvertInput(const framework::OpDesc& first_op) {
  auto var_names = first_op.InputArgumentNames();
  for (const auto& var_name : var_names) {
    auto* fluid_tensor = scope_.FindVar(var_name)->GetMutable<framework::LoDTensor>();

    // do some transformation for the input fluid tensor, and get its type and dims.
    auto shape_tensor = transformation(fluid_tensor);
    nvinfer1::DataType type = shape_tensor.type();
    nvinfer1::DimsCHW dim = shape_tensor.dims();

    // add the input into the tensorrt network (addInput takes a C string name)
    nvinfer1::ITensor* input_tensor =
        network_->addInput(var_name.c_str(), type, dim);

    // insert the input tensor into tensorrt's tensor list
    tr_tensors_[var_name] = input_tensor;
  }
}

void ConvertMul(const framework::OpDesc& op) {
   // get the input tensor from tensorrt's tensor list
   std::string x_var_name = op.Input("X").front();
   auto* x_tensor = tr_tensors_[x_var_name];

   // get the weight from the fluid inference scope (Y is an input of mul)
   std::string y_var_name = op.Input("Y").front();
   auto* y_tensor = scope_.FindVar(y_var_name)->GetMutable<framework::LoDTensor>();

   // do some weight transformation (e.g. into nvinfer1::Weights)
   auto y_weights = transformation(y_tensor);

   // add the layer into the tensorrt network
   nvinfer1::IFullyConnectedLayer* layer = network_->addFullyConnected(
      *x_tensor, 1 /*output size*/, y_weights, nvinfer1::Weights{} /*bias*/);

   // get the output tensor, and insert it into tensorrt's tensor list
   std::string out_var_name = op.Output("Out").front();
   nvinfer1::ITensor* output_tensor = layer->getOutput(0);
   tr_tensors_[out_var_name] = output_tensor;
}

void ConvertOutput(const framework::OpDesc& last_op) {
  auto var_names = last_op.OutputArgumentNames();
  for (const auto& var_name : var_names) {
    // get the output tensor from tensorrt's tensor list, and mark it as a network output
    auto* tr_tensor = tr_tensors_[var_name];
    network_->markOutput(*tr_tensor);

    // do some transformation for the output tensorrt tensor,
    // and modify the fluid tensor in the inference scope
    auto* fluid_tensor = scope_.FindVar(var_name)->GetMutable<framework::LoDTensor>();
    transformation(tr_tensor, fluid_tensor);
  }
}

how to call the convert function

auto executor = paddle::framework::Executor(place);
auto* scope = new paddle::framework::Scope();
auto inference_program = paddle::inference::Load(
    &executor, scope, program_path, parameter_path);

nvinfer1::IBuilder* builder = createInferBuilder(logger);
nvinfer1::INetworkDefinition* network = builder->createNetwork();
Converter converter(inference_program->Block(0), *scope, network);
converter.ConvertSubBlockToTensorRTNetwork();

// build and execute the tensorrt engine
auto engine = builder->buildCudaEngine(*network);
...
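
Continuing from the engine built above, executing it with the TensorRT API could look roughly like the following sketch; batch_size and buffers (an array of device pointers, one per network binding) are illustrative assumptions, and builder->setMaxBatchSize(batch_size) is assumed to have been called before buildCudaEngine.

nvinfer1::IExecutionContext* context = engine->createExecutionContext();
// buffers: one device pointer per network binding (inputs and outputs),
// ordered by binding index
context->execute(batch_size, buffers);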

@Xreki Xreki added this to Integrate TensorRT in Inference Framework Apr 20, 2018
Superjomn (Contributor, Author) commented:

  1. use a TensorrtEngine class to operate on TensorRT (a rough interface is sketched below)
  2. convert Tensor to ITensor, and ITensor back to Tensor
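
A rough sketch of what such a TensorrtEngine wrapper could expose, consistent with the engine_->network() and TRT_ENGINE_ADD_LAYER usage further down; all method names here are assumptions rather than a fixed interface.

class TensorrtEngine {
 public:
  // declare a network input (wraps nvinfer1::INetworkDefinition::addInput)
  nvinfer1::ITensor* DeclareInput(const std::string& name,
                                  nvinfer1::DataType type,
                                  const nvinfer1::DimsCHW& dims);
  // mark a network output (wraps nvinfer1::INetworkDefinition::markOutput)
  void DeclareOutput(const std::string& name);
  // freeze the network and build the cuda engine; called once after conversion
  void FreezeNetwork();
  // run the built engine; can be called any number of times
  void Execute(int batch_size);
  // expose the underlying network so op converters can add layers,
  // e.g. through TRT_ENGINE_ADD_LAYER(engine, FullyConnected, ...)
  nvinfer1::INetworkDefinition* network();
};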

luotao1 (Contributor) commented Apr 20, 2018

motivation

Why adopt the subgraph approach to call TensorRT, and how large is the performance gap compared with using TensorRT directly?
Since benchmarking both would take some time, we first asked the TensorFlow community: tensorflow/models#4028

The main takeaways from that issue:

  1. "Using TensorRT directly" is certainly faster than the "subgraph approach", but for the segments that can be converted the two are very close:
    Our tests indicate that, for the converted segments, TFTRT performance is pretty close to TensorRT performance.
  2. The "subgraph approach" is easier than "using TensorRT directly", because TensorRT does not support many ops:
    Even though TensorRT supports large fraction of ops encountered in the models it doesn't support everything. TF-TRT approach enables you to get supported parts of the networks run in TensorRT and unsupported part to run in TensorFlow.
  3. They will add plugin support in the future:
    In the future we will add plugin support so a larger fraction of the networks could be transformed to TensorRT engine.

luotao1 (Contributor) commented Apr 20, 2018

TensorRTConverter

convert fluid block desc to tensorrt network

class and member

class TensorRTConverter {
  // registered op converters, keyed by fluid op type
  std::unordered_map<std::string, std::function<void(const framework::OpDesc&)>> op_registry_;
  // fluid inference scope
  const framework::Scope& scope_;
  // engine: tensorrt engine wrapper
  TensorrtEngine* engine_;
  // tensorrt input/output tensor list, whose key is the fluid variable name,
  // and value is the pointer to the corresponding tensorrt tensor.
  std::map<std::string, nvinfer1::ITensor*> tr_tensors_;

 public:
  // the reference member must be bound in the member initializer list
  TensorRTConverter(const framework::Scope& scope,
                    TensorrtEngine* engine,
                    const std::map<std::string, nvinfer1::ITensor*>& tr_tensors)
      : scope_(scope), engine_(engine), tr_tensors_(tr_tensors) {
    this->RegisterOpConverters();
  }

  // convert a fluid op to a tensorrt layer
  void ConvertOp(const framework::OpDesc& op);

  // convert a fluid block to a tensorrt network
  void ConvertBlock(const framework::BlockDesc& block);

 private:
  // register different op converters
  void RegisterOpConverters();

  // convert a fluid Mul op to a tensorrt fc layer
  void ConvertMul(const framework::OpDesc& op);

  // convert other fluid ops to tensorrt layers
  ...
};

function details

void RegisterOpConverters() {
  // keys are fluid op types; use lambdas so member functions can be stored in std::function
  op_registry_["mul"] = [this](const framework::OpDesc& op) { ConvertMul(op); };
  op_registry_["conv2d"] = [this](const framework::OpDesc& op) { ConvertConv2d(op); };
  ...
}

void ConvertOp(const framework::OpDesc& op) {
  const auto& op_converter = op_registry_.at(op.Type());
  op_converter(op);
}

void ConvertBlock(const framework::BlockDesc& block) {
  for (auto* op : block.AllOps()) {
   // convert each fluid op to a tensorrt layer
   ConvertOp(*op);
  }
}

void ConvertMul(const framework::OpDesc& op) {
   // get the input tensor from tensorrt's tensor list
   std::string x_var_name = op.Input("X").front();
   auto* x_tensor = tr_tensors_[x_var_name];

   // get the weight from the fluid inference scope (Y is an input of mul)
   std::string y_var_name = op.Input("Y").front();
   auto* y_tensor = scope_.FindVar(y_var_name)->GetMutable<framework::LoDTensor>();

   // do some weight transformation (e.g. into nvinfer1::Weights)
   auto y_weights = transformation(y_tensor);

   // add the layer into the tensorrt network
   auto* fc_layer = TRT_ENGINE_ADD_LAYER(engine_, FullyConnected, *x_tensor,
                                         1 /*output size*/, y_weights,
                                         nvinfer1::Weights{} /*bias*/);

   // get the output tensor, and insert it into tensorrt's tensor list
   std::string out_var_name = op.Output("Out").front();
   nvinfer1::ITensor* output_tensor = fc_layer->getOutput(0);
   tr_tensors_[out_var_name] = output_tensor;
}

usage

TensorRTConverter tensorrt_converter(scope, engine, tr_tensors);
tensorrt_converter.ConvertBlock(block);

ITensorConverter

convert between fluid Tensor and nvinfer1::ITensor

class and member

class ITensorConverter {
  // fluid inference scope
  const framework::Scope& scope_;
  // engine: tensorrt engine wrapper
  TensorrtEngine* engine_;
  // tensorrt input/output tensor list, whose key is the fluid variable name,
  // and value is the pointer to the corresponding tensorrt tensor.
  std::map<std::string, nvinfer1::ITensor*> tr_tensors_;

 public:
  // the reference member must be bound in the member initializer list
  ITensorConverter(const framework::Scope& scope,
                   TensorrtEngine* engine,
                   const std::map<std::string, nvinfer1::ITensor*>& tr_tensors)
      : scope_(scope), engine_(engine), tr_tensors_(tr_tensors) {}

  // copy a fluid tensor to a tensorrt tensor
  void TensorToNVITensor(const std::string& tensor_name);

  // copy a tensorrt tensor to a fluid tensor
  void NVITensorToTensor(const std::string& tensor_name);
};

function details

void TensorToNVITensor(const std::string& tensor_name) {
  auto* fluid_tensor = scope_.FindVar(tensor_name)->GetMutable<framework::LoDTensor>();

  // do some transformation for the input fluid tensor, and get its type and dims.
  auto shape_tensor = transformation(fluid_tensor);

  nvinfer1::DataType type = shape_tensor.type();
  nvinfer1::DimsCHW dim = shape_tensor.dims();

  // declare the tensorrt input (addInput takes a C string name)
  nvinfer1::ITensor* input_tensor =
      engine_->network()->addInput(tensor_name.c_str(), type, dim);

  // insert the tensorrt input into tensorrt's tensor map
  tr_tensors_[tensor_name] = input_tensor;
}

void NVITensorToTensor(const std::string& tensor_name) {
  auto* tr_tensor = tr_tensors_[tensor_name];

  // do some transformation for the output tensorrt tensor,
  // and modify the fluid tensor in the inference scope
  auto* fluid_tensor = scope_.FindVar(tensor_name)->GetMutable<framework::LoDTensor>();
  transformation(tr_tensor, fluid_tensor);
}
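
For symmetry with the TensorRTConverter usage above, calling this converter might look like the following sketch; the variable names "x" and "out" are only examples.

ITensorConverter itensor_converter(scope, engine, tr_tensors);
// copy a fluid input variable into the tensorrt network before execution
itensor_converter.TensorToNVITensor("x");
// after execution, copy a tensorrt output back into the fluid scope
itensor_converter.NVITensorToTensor("out");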

Superjomn (Contributor, Author) commented:

TensorToNVITensor and NVITensorToTensor might need different logic for different fluid op / TensorRT layer pairs.

For example:

  • fluid rnn -> TensorRT rnn: LoDTensor -> Tensor-like ITensor
  • fluid fc -> TensorRT fc: Tensor -> Tensor-like ITensor

So different ops need different conversion logic.

It might be something like TensorRTConverter, with a register; a sketch follows below.
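
A minimal sketch of that register idea, keyed by fluid op type; the names below (TensorConverter, TensorConverterRegistry) are placeholders for illustration, not an agreed API.

#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// maps a fluid op type (e.g. "fc", "rnn") to the tensor conversion logic it
// needs, so LoDTensor-based and plain Tensor-based ops can register
// different converters.
using TensorConverter = std::function<void(const std::string& var_name)>;

class TensorConverterRegistry {
 public:
  void Register(const std::string& op_type, TensorConverter converter) {
    registry_[op_type] = std::move(converter);
  }
  const TensorConverter& Get(const std::string& op_type) const {
    return registry_.at(op_type);
  }

 private:
  std::unordered_map<std::string, TensorConverter> registry_;
};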

@Superjomn Superjomn moved this from To do to In progress in Inference on Engine Apr 25, 2018
@Superjomn Superjomn moved this from In progress to Done in Inference on Engine May 12, 2018
shanyi15 (Collaborator) commented:

Hello, this issue has not been updated in the past month. We will close it today for the sake of other users' experience. If you still need to follow up on this question after closing, please feel free to reopen it. In that case, we will get back to you within 24 hours. We apologize for the inconvenience caused by the closure and thank you so much for your support of PaddlePaddle Group!

marsbzp commented Apr 26, 2022

Hello, I'd like to ask: in int8 mode, will the TensorRT subgraph perform the conv+bn fusion?
