
how about the speed in inference #8

Open
zzhanq opened this issue Aug 26, 2021 · 18 comments

@zzhanq

zzhanq commented Aug 26, 2021

Thanks for your great work!

What is the inference speed like? I use a GTX-1080 and it takes a few seconds (almost 10 s) per image. I'd like to know whether that is normal.

@JiaquanYe
Owner


I think 10 s for the end-to-end inference (text-line detection, text-line recognition, table structure reconstruction, and the match process) is normal on a GTX-1080.

@delveintodetail
Collaborator

The released code does not include the speedup module; you can refer to our MASTER paper for it. I expect end-to-end inference can be sped up to 2-4 s.

@sonack

sonack commented Sep 1, 2021


But I found that the speed bottleneck is not the MASTER encoder backbone but the auto-regressive decoder, whose max length can be up to 500 steps. Do you have any ideas about that?

@delveintodetail
Collaborator

In the auto-regressive decoder there are many repeated operations. In the MASTER paper we use a memory-cached mechanism to speed up inference; the speedup is especially effective for long decoding lengths. Please check the paper. In effect, the O(n^2) decoding complexity can be reduced to O(n).
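
For illustration, a minimal sketch of the memory-cached (K/V-cache) idea for a single self-attention layer during auto-regressive decoding; `q_proj`/`k_proj`/`v_proj` and the cache layout are assumptions for this example, not the repo's actual code:

```python
import torch

def cached_decode_step(self_attn, x_t, kv_cache):
    """One decoding step with a self-attention K/V cache.

    self_attn : module with linear projections q_proj/k_proj/v_proj (hypothetical names)
    x_t       : (B, 1, D) embedding of the token produced at the previous step
    kv_cache  : dict holding previously computed keys/values of shape (B, T-1, D)
    """
    q = self_attn.q_proj(x_t)        # only the newest token's query is computed
    k_new = self_attn.k_proj(x_t)
    v_new = self_attn.v_proj(x_t)

    # Append the new key/value to the cache instead of recomputing all T of them.
    k = torch.cat([kv_cache["k"], k_new], dim=1) if kv_cache else k_new
    v = torch.cat([kv_cache["v"], v_new], dim=1) if kv_cache else v_new
    kv_cache["k"], kv_cache["v"] = k, v

    attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v, kv_cache
```

Because only one new query/key/value is projected per step and the rest is reused, the per-step cost stays roughly constant instead of growing with the decoded length.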

@delveintodetail
Collaborator

In the competition we used our own internal tool (FastOCR) to implement the algorithm; we did not implement it in the mmocr framework. If you fully understand the MASTER paper, I believe you can implement it yourself.

@sonack

sonack commented Sep 6, 2021


I have implemented the "memory-cached inference", but the speedup ratio is not as large as I expected.
I had already slimmed the decoder down to a 1-block transformer decoder, which brought the inference speed to ~2.5 s/img; after the memory-cached optimization it reaches 2.2 s/img, about a 12% improvement. Do you think such a small improvement is normal for a lightweight decoder?
I found that a single decode step costs only about 6 ms, so the 500 steps sum up to ~3 s. Is there any room for further improvement?
BTW, I found something unnecessary at code lines 42~43; when I removed those lines, the inference speed reached 1.3 s/img, which is a pretty large gain.
If I add an early-stop mechanism that halts on the EOS symbol, the average time cost is a few hundred milliseconds, which would satisfy the requirements of a typical industrial application.

cc @zzhanq @JiaquanYe @delveintodetail
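
For reference, a minimal sketch of a greedy decoding loop with EOS early stopping, as discussed above; `model.decode_step`, `sos_id`, and `eos_id` are hypothetical names, not the repo's API:

```python
import torch

@torch.no_grad()
def greedy_decode(model, memory, sos_id, eos_id, max_len=500):
    """Greedy decoding that stops as soon as every sequence in the batch hits EOS.

    `model.decode_step(tokens, memory)` is a hypothetical interface returning
    logits over the vocabulary for the next token.
    """
    device = memory.device
    tokens = torch.full((memory.size(0), 1), sos_id, dtype=torch.long, device=device)
    finished = torch.zeros(memory.size(0), dtype=torch.bool, device=device)

    for _ in range(max_len):
        logits = model.decode_step(tokens, memory)      # (B, vocab)
        next_tok = logits.argmax(dim=-1, keepdim=True)  # (B, 1)
        tokens = torch.cat([tokens, next_tok], dim=1)
        finished |= next_tok.squeeze(1) == eos_id
        if finished.all():                              # early stop: skip the remaining steps
            break
    return tokens
```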

@JiaquanYe
Owner

JiaquanYe commented Sep 6, 2021


Hi, sonack.
Code lines 42-43 are unnecessary; I have fixed this bug.
In our experience, "memory-cached inference" speeds things up by about 40-50% when the max length is 100, with a normal-size MASTER decoder. We haven't tried "memory-cached inference" with a lightweight decoder.
An early-stop mechanism is a useful speed-up in sequence decoding.

@delveintodetail
Collaborator


I expect the speedup ratio to be much larger than you reported; it highly depends on your implementation.
If you prefer, you can send me your code and I will check it for you.
If you care a lot about speed, I would suggest decreasing the resolution of the CNN output feature map. For example, if the input image size is 400×400 and the output feature map is 50×50×C, you can further decrease it to 25×25×C with convolution and pooling; this will largely speed up inference, but may decrease performance by about 1%.
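
A rough sketch of what such an extra downsampling stage might look like, assuming a 50×50×C backbone output (C = 512 here is only an example, not the repo's actual configuration):

```python
import torch
import torch.nn as nn

# Hypothetical extra stage appended to the CNN backbone: reduces a (B, C, 50, 50)
# feature map to (B, C, 25, 25), so the flattened memory the decoder attends over
# shrinks from 2500 to 625 positions.
class ExtraDownsample(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, feat):
        return self.block(feat)

feat = torch.randn(1, 512, 50, 50)       # assumed backbone output shape
print(ExtraDownsample(512)(feat).shape)  # torch.Size([1, 512, 25, 25])
```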

@sonack

sonack commented Sep 9, 2021


Sorry, I can't provide the code directly, because I wrote it at my company. Do you have WeChat? Maybe we can discuss it in more depth there. Thank you very much.

@sonack

sonack commented Sep 13, 2021

Hi @JiaquanYe @delveintodetail, I tested the inference speed of the large model. With max_len=500 and no early stopping at EOS:

  1. Without removing the redundant computation at code lines 42-43: about 11 s/img.
  2. After removing the redundant computation: about 6 s/img.
  3. With memory-cached inference added on top: about 4.5 s/img.

With the same model config, how fast does your internal memory-cached inference implementation get at best? I profiled my current memory-cached implementation, and the main bottleneck is the K/Q/V matrix computation, so it should not be an issue with the low-level PyTorch implementation. Everything currently runs in plain PyTorch, without TorchScript or other engineering optimizations.
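
For what it's worth, one way to check where the decode loop spends its time is PyTorch's built-in profiler; `run_inference` below is only a stand-in for the real end-to-end call:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_inference(img):
    # Placeholder for the real end-to-end call (detection + recognition + structure).
    return img @ img.transpose(0, 1)

img = torch.randn(512, 512)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        run_inference(img)

# Sort by self CPU time (use "cuda_time_total" when profiling on GPU)
# to see whether the Q/K/V projections really dominate.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```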

@delveintodetail
Collaborator


In the decoder, the computation of K and V is done only once (this is important for efficiency) and the result is cached for later use; you don't need to recompute them at each time step.
I believe this is the problem in your code. Let me know once you fix this issue. Thanks.

@sonack

sonack commented Sep 14, 2021


That's not quite it. For the cross-attention between encoder and decoder, K and V indeed only need to be computed once, but Q has to be computed at every step (Q comes from the token predicted at the previous time step: at step T you only compute the multi-head linear projection of the token predicted at step T-1, rather than recomputing it for steps 1 through T), similar to the part in the red box in the figure below:
[attached image]

I will try to re-implement it privately later and see whether I can open a PR; I hope we can look at it together then. @JiaquanYe @delveintodetail
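
A minimal sketch of the point above, with hypothetical projection modules: the cross-attention K/V are projected from the encoder memory once and cached, while at step T only the single query from the previous step's token is projected:

```python
import torch

def cross_attention_cached(q_proj, k_proj, v_proj, memory, y_t, cache):
    """Cross-attention for one decode step (single-head, illustrative only).

    memory : (B, S, D) encoder output; its K/V projections are computed once.
    y_t    : (B, 1, D) embedding of the token predicted at the previous step;
             only this single query is projected at step T.
    """
    if "k" not in cache:                 # done once, then reused at every step
        cache["k"] = k_proj(memory)
        cache["v"] = v_proj(memory)
    q = q_proj(y_t)                      # per-step cost: one 1×D projection
    scores = q @ cache["k"].transpose(1, 2) / memory.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]
```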

@delveintodetail
Collaborator


Line 11: q: 1×d, M^k_b: (T-1)×d, ops: (T-1)×d
Line 12: ops: (T-1)×d
Line 13: ops: (T-1)×d + (T-1)×d
Line 14: ops: d×d + (T-1)×d + (T-1)×d

I roughly list the computational operations of each line above. Please check if it is right for your implementation.

@Sanster

Sanster commented Oct 20, 2021

I implemented the decoder part of MASTER-pytorch in FasterTransformer; compared to the original PyTorch it's much faster. Example code. It should be easy to add to TableMASTER with minor changes.

@JiaquanYe
Owner


That's great work! I will try it in TableMASTER.

@WenmuZhou

mark

@LiuDong777

How exactly are memory-cached inference and EOS early stopping implemented to speed up inference?

@delveintodetail
Collaborator

Essentially, memory-cache means caching the past K and V so they can be reused when computing attention for later queries. I see Hugging Face has also added support for this recently; you can take a look. When I came up with this for the MASTER paper I thought of it as a memory cache, but what we actually did is the same thing as today's K-V cache. See: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.use_cache
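
For reference, the Hugging Face equivalent mentioned here is just the `use_cache` flag of `generate` (shown with a generic causal LM; TableMASTER itself is not a transformers model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("a b c", return_tensors="pt")
# use_cache=True enables the K-V cache during generation (it is also the default).
out = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tok.decode(out[0]))
```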
