How about the speed in inference? #8
I think 10 s for the end-to-end inference (text-line detection, text-line recognition, table structure reconstruction, and the match process) is normal on a GTX-1080.
The released code does not include the speedup module. You can refer to our MASTER paper for it. I expect end-to-end inference can be sped up to 2-4 s.
But I found the speed bottleneck is not the encoder backbone (MASTER) but the autoregressive decoder, whose max length can be up to 500 steps. Do you have any ideas about it?
In the autoregressive decoder there are many repeated operations. In the MASTER paper we use a memory-cached mechanism to speed up inference; it is especially effective for long decoding lengths. Please check the paper. The O(n^2) complexity can be reduced to O(n).
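The idea can be sketched in a few lines. This is a minimal single-head NumPy stand-in (not the actual MASTER/TableMASTER code): the naive loop re-projects K and V for the whole prefix at every step, while the cached version projects only the newest token and appends it, so total work drops from O(T^2) to O(T) projections. All names here (`naive_step`, `K_cache`, etc.) are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((5, d))          # embeddings of decoded tokens

def naive_step(x):
    # Recompute K and V for the entire prefix at every step: O(T^2) overall.
    q = x[-1:] @ Wq                           # only the newest query matters
    K, V = x @ Wk, x @ Wv
    return softmax(q @ K.T / np.sqrt(d)) @ V

# Memory-cached decoding: project only the new token, append to the cache.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
out = None
for t in range(len(tokens)):
    x_t = tokens[t:t + 1]
    K_cache = np.vstack([K_cache, x_t @ Wk])
    V_cache = np.vstack([V_cache, x_t @ Wv])
    q = x_t @ Wq
    out = softmax(q @ K_cache.T / np.sqrt(d)) @ V_cache
    # Cached result matches the naive recomputation at every step.
    assert np.allclose(out, naive_step(tokens[:t + 1]))
```

The assertion shows the cache changes only the cost, not the attention output.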
In the competition, we used our own internal tool (FastOCR) to implement our algorithm. We did not implement it in the mmocr framework. If you fully understand the MASTER paper, I believe you can implement it.
I have implemented the "memory-cached inference", but I found the speedup ratio is not as large as I expected.
Hi, sonack.
I expect the speedup ratio should be much larger than you reported. It highly depends on your implementation.
Sorry, I can't provide the code directly because I wrote it at my company. Do you have WeChat? Maybe we can discuss it in more depth there. Thank you very much.
Hi @JiaquanYe @delveintodetail, I tested the inference speed of this large model.
Under the same config, how fast can your internal memory-cached inference implementation get? I profiled the current memory-cached inference implementation, and the main bottleneck is the K/Q/V matrix computations, so it should not be a problem with PyTorch's implementation details. Everything currently runs in native PyTorch, without TorchScript or other engineering optimizations.
In the decoder, the computation of K and V is done only once (this is important for efficiency) and then cached for further use. You don't need to compute them at each time step.
Not quite. For the cross-attention between the encoder and the decoder, K and V indeed only need to be computed once, but Q has to be computed at every step (because Q comes from the token predicted at the previous time step: at step T you only compute the multi-head linear projection of the token from step T-1, the one just predicted, rather than recomputing steps 1 through T), similar to the red box in the figure below. I will try to re-implement it on my own later and see whether I can open a PR; I hope we can look at it together then. @JiaquanYe @delveintodetail
Line 11: q: 1×d, M^k_b: (T-1)×d, ops: (T-1)·d. I roughly listed the computational operations of each line above. Please check whether it matches your implementation.
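The point under discussion, that cross-attention K/V are fixed while only the newest token's Q is recomputed each step, can be sketched as follows. This is an illustrative single-head NumPy toy, not the MASTER/TableMASTER code; the names `memory`, `K_mem`, and `cross_attn_step` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d, src_len = 8, 12
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
memory = rng.standard_normal((src_len, d))    # encoder output, fixed during decoding

# Cross-attention K and V depend only on the encoder memory: compute them once,
# outside the decoding loop.
K_mem, V_mem = memory @ Wk, memory @ Wv

def cross_attn_step(last_token):              # last_token: (1, d)
    # Q is cheap: one (1, d) x (d, d) projection of the newest token only.
    q = last_token @ Wq
    return softmax(q @ K_mem.T / np.sqrt(d)) @ V_mem

# Per-step cost is O(src_len * d), independent of how many steps have run.
outs = [cross_attn_step(rng.standard_normal((1, d))) for _ in range(4)]
```

So the per-step Q projection is a single 1×d matrix-vector product; the remaining cost is the score/value multiply against the fixed `K_mem`/`V_mem`, which is why the overall bottleneck can still sit in those matrix computations even with the cache in place.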
I implemented the decoder part of MASTER-pytorch in FasterTransformer; compared to the original PyTorch it is much faster. Example code. It should be easy to add to TableMASTER with minor changes.
It's a great job! I will try it in TableMASTER. |
mark |
How exactly are memory-cached inference and early stopping on EOS implemented to speed up inference?
Essentially, memory-cache means caching the past K and V so that later Q computations can reuse them. I saw that Hugging Face also recently supports this; you can take a look. When I came up with this for the MASTER paper I thought of it as a memory cache; what we actually did is the same thing as today's K-V cache. See: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationConfig.use_cache
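The EOS half of the question is independent of the K-V cache: since the decoder's max length can be as large as 500 steps, exiting the loop as soon as every sequence in the batch has emitted EOS avoids wasted steps. A minimal sketch, with a hypothetical `decode_step` standing in for the real decoder:

```python
import numpy as np

EOS, MAX_LEN = 0, 500
rng = np.random.default_rng(2)

def decode_step(prev_tokens):
    # Hypothetical stand-in for the real decoder: returns one next token per
    # sequence, drawn at random (EOS with probability 0.2).
    return rng.choice([EOS, 1, 2, 3, 4], size=len(prev_tokens))

batch = 4
tokens = np.full((batch, 1), 1)               # start token for each sequence
finished = np.zeros(batch, dtype=bool)
steps = 0
for _ in range(MAX_LEN):
    nxt = decode_step(tokens)
    nxt[finished] = EOS                       # keep finished rows padded with EOS
    tokens = np.hstack([tokens, nxt[:, None]])
    finished |= nxt == EOS
    steps += 1
    if finished.all():                        # early exit: skip remaining steps
        break
```

In practice the early exit and the K-V cache compose: each saved step saves exactly one cached attention step.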
Thanks for your great work!
How about the inference speed? I use a GTX-1080 and it costs a few seconds (almost 10 s) per image, so I want to know whether that is normal.