
inference engine related design #10198

Closed

Conversation

@Superjomn (Contributor) commented Apr 25, 2018:

fixes: #10028

@Xreki added the "预测" label ("Inference"; covers C-API inference issues, etc.) on Apr 25, 2018
@panyx0718 (Contributor) commented:
Have we verified the performance of using tensorrt as a sub-graph?

@Superjomn (Contributor, Author) commented Apr 25, 2018:

We will get a benchmark next week. @panyx0718

The inference phase needs to support some special hardware for acceleration,
such as GPU, FPGA, and ARM.
Special software powers some of this hardware while hiding its inner states. For example, TensorRT is released by NVIDIA to improve inference performance on GPUs; it takes a computation graph as input,
optimizes and executes it, but users can't directly modify its internal logic.
Review comment (Contributor):

Special softwares power some of these hardwares and the inner states are hidden. For example, TensorRT is released by NVIDIA to improve the inference performance on GPUs. It takes a computation graph as input, optimizes and executes it, while users can't directly modify its internal logic.


## Use Engines to Execute Sub-blocks

Compared to Paddle Fluid, the engines cover a limited number of operators and can only power several kinds of models. In other words, the engines can only support a part of Fluid.
Review comment (Contributor):

Motivation of the sub-blocks method

line 13, plus some information from tensorflow/models#4028, to explain why we use the sub-blocks method instead of using TensorRT directly.

Use Engines to Execute Sub-blocks

line 14
...
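To make the point above concrete, here is a minimal, hypothetical sketch of why sub-blocks are needed: only runs of engine-supported operators can be handed to an engine such as TensorRT, and the rest must stay with the Fluid executor. The supported-op set and the function below are illustrative assumptions, not Fluid code.

```python
# Hypothetical: an engine supports only a subset of Fluid's operators, so a
# program splits into engine-powered and Fluid-powered segments.
TENSORRT_SUPPORTED = {'conv2d', 'relu', 'pool2d', 'fc'}  # assumed subset

def split_into_segments(op_types):
    """Group consecutive op types into (engine_supported, ops) segments."""
    segments = []
    for op in op_types:
        supported = op in TENSORRT_SUPPORTED
        if segments and segments[-1][0] == supported:
            segments[-1][1].append(op)
        else:
            segments.append((supported, [op]))
    return segments

# split_into_segments(['conv2d', 'relu', 'lookup_table', 'fc']) ->
# [(True, ['conv2d', 'relu']), (False, ['lookup_table']), (True, ['fc'])]
```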



It is easy to parallelize the computation by scheduling several engines on different devices; for example, the CPU and GPU engines can be dispatched at the same time.
Review comment (Contributor):

Add a period after "meantime".

We use a `with-statement` to mark the sub-block as follows.

```python
with infer.power_by_engine('tensorrt'):
    ...
```
Review comment (Contributor):

What's the type of `infer`, a ProgramDesc? The following is the current transpiler interface, whose parameter is a ProgramDesc.

```python
t = fluid.InferenceTranspiler()
t.transpile(inference_transpiler_program, place)
```

In my mind, the interface for automatic detection mode is:

```python
t = fluid.InferenceTranspiler()
t.transpile(inference_transpiler_program, place, engine='tensorrt')

def transpile(inference_transpiler_program, place, engine):
    if engine == "tensorrt":
        power_by_tensorrt_engine(inference_transpiler_program)
    else:
        ...
```

@Superjomn (Contributor, Author) commented Apr 26, 2018:

`infer` is a module:

```python
import paddle.inference as infer
```

```python
with infer.power_by_engine('tensorrt'):
    o = some_op()
    o = some_op()
```
Review comment (Contributor):

What's the meaning of `o = some_op()`?

@Superjomn (Contributor, Author) replied:

No practical meaning, just shows that there are several operators there.
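For illustration, here is a minimal sketch of how a marking context manager like `power_by_engine` could be implemented; the `_engine_stack` mechanism is an assumption for this sketch, not the actual `paddle.inference` implementation.

```python
import contextlib

_engine_stack = []  # which engine, if any, should power ops created now

@contextlib.contextmanager
def power_by_engine(engine_name):
    # Every op created inside the `with` block is marked as belonging to
    # `engine_name`, so a transpiler can later group them into a sub-block.
    _engine_stack.append(engine_name)
    try:
        yield
    finally:
        _engine_stack.pop()

def current_engine():
    # Op constructors would consult this to record their engine marking.
    return _engine_stack[-1] if _engine_stack else None
```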


### EngineOp

`EngineOp` is just a normal Fluid operator, which has an attribute called `subblock` that holds the Fluid description of a sub-block.
Review comment (Contributor):

subblock->sub_block
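As a simplified, hypothetical model of the design (plain Python stand-ins, not the real Fluid `OpDesc`/`BlockDesc` classes): an `EngineOp` is an ordinary operator whose attribute points at the sub-block description the engine should execute.

```python
from dataclasses import dataclass, field

@dataclass
class OpDesc:
    type: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    attrs: dict = field(default_factory=dict)

@dataclass
class BlockDesc:
    ops: list = field(default_factory=list)

# The sub-block holds the operators the engine will execute.
sub_block = BlockDesc(ops=[OpDesc('conv2d'), OpDesc('relu')])

# EngineOp is a normal operator; its attribute carries the sub-block.
engine_op = OpDesc(
    type='tensorrt_engine',            # assumed op type name
    inputs={'Xs': ['x']},
    outputs={'Ys': ['y']},
    attrs={'sub_block': sub_block})
```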

```c++
enum class DeviceType {
  CPU = 0,
  GPU
};
```
Review comment (Contributor):

GPU=1?

@Superjomn (Contributor, Author) replied:

The enum syntax only requires setting the first element; the following elements increment automatically.
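For illustration, the same auto-increment behavior sketched in Python (the C++ enum above gives `GPU` the implicit value 1):

```python
from enum import IntEnum, auto

class DeviceType(IntEnum):
    CPU = 0
    GPU = auto()  # increments from the previous member, so GPU == 1

assert int(DeviceType.GPU) == 1
```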


The `EngineOutputConvertOp` is similar.

### Optimizer for sub-block
Review comment (Contributor):

Optimizer->Transpiler

@Superjomn (Contributor, Author) replied:

An optimizer is not a Transpiler; it corresponds to the optimization passes in a compiler.

### Optimizer for sub-block

```c++
// The InferenceOptimizers input a program desc and output a block desc.
```
Review comment (Contributor):

It inputs a program desc, but the output may be a series of sub-block descs.

@Superjomn (Contributor, Author) replied:

It inputs a program desc and outputs a program desc with several newly inserted EngineOps, whose attributes are set with the sub-blocks.

```c++
// Different implementations will rewrite the original program desc with different logics.
// There might be many different optimizers, such as
// - CleanUselessOptimizer
// - PruneOpOptimizer
```
@luotao1 (Contributor) commented Apr 26, 2018:

What are CleanUselessOptimizer and PruneOpOptimizer?
We already have a prune method for inference; see paddle/fluid/framework/prune.cc

@Superjomn (Contributor, Author) replied:

Yes, I think a factory pattern of Operators is a better interface; maybe we'd better refactor those codes.
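For illustration, a minimal sketch of the optimizer interface being discussed, with a pipeline that applies each pass in order. The class and method names here are assumptions, not existing Fluid code.

```python
class InferenceOptimizer:
    """Base class: each optimizer rewrites a program desc and returns it."""
    def run(self, program_desc):
        raise NotImplementedError

class CleanUselessOptimizer(InferenceOptimizer):
    def run(self, program_desc):
        # e.g. drop ops whose outputs are never consumed
        return program_desc

class PruneOpOptimizer(InferenceOptimizer):
    def run(self, program_desc):
        # e.g. prune training-only ops before extracting engine sub-blocks
        return program_desc

def optimize(program_desc, optimizers):
    # Apply each optimization pass in order, like compiler passes.
    for opt in optimizers:
        program_desc = opt.run(program_desc)
    return program_desc
```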

@luotao1 (Contributor) commented Apr 26, 2018:

Are all of the above implemented and run on the C++ end?
The following is the current C++ inference logic:

```c++
inference_program = paddle::inference::Load(&executor, scope, dirname);
executor.Run(*inference_program, scope, ...);
```

Then, how would we use the inference engine?

```c++
inference_program = paddle::inference::Load(&executor, scope, dirname);
inference_engine_program = paddle::inference::transpiler(inference_program, engine="tensorrt");
executor.Run(*inference_engine_program, scope, ...);
```

@Superjomn (Contributor, Author) replied:

The inference might have its own executor implementation, so there might be some more considerations about the SDK.

The Anakin and MDL teams will join together to design the inference SDK, and there might be some further designs about these issues. @luotao1

@wangkuiyi (Collaborator) left a review comment:

Sorry, from this design PR and the other code PR, I haven't been able to grasp the intent of this design. Let's have a video call.

@@ -0,0 +1,254 @@
# Utilize Engines to Accelerate Inference
Review comment (Collaborator):

What do the "engines" here refer to?

Review comment (Collaborator):

It looks like they refer to TensorRT? I see a base class proposed later, and I also saw this base class in the other code PR. Is this so that classes corresponding to "engines" other than TensorRT can be derived in the future?

@Superjomn (Contributor, Author) replied:

TensorRT, Anakin, or other similar libraries that come with their own complete optimization.

@@ -0,0 +1,254 @@
# Utilize Engines to Accelerate Inference

The inference phase needs to support some special hardware for acceleration,
Review comment (Collaborator):

The inference phase need to support some special hardware

=>

We want to utilize DL chips to accelerate the inference of Fluid models.

@Xreki added this to Integrate TensorRT in Inference Framework on May 21, 2018
@luotao1 (Contributor) commented Feb 1, 2019:

Thanks for contributing to PaddlePaddle! Since documents have been moved to the FluidDoc repo, we are closing this PR. Welcome to contribute to the FluidDoc repo.

@luotao1 closed this on Feb 1, 2019
Labels
预测 ("Inference"; covers C-API inference issues, etc.)
Projects
Inference Framework: Integrate TensorRT
Development
Successfully merging this pull request may close these issues:
any engine for inference subgraph acceleration naive design
5 participants