[onert] CPU backend getTensorShape overhead #4544
It does not seem like there is a way of doing it in the current interface. Let me give a rough example. Define a raw pointer format: a simple vector where the size (rank) comes first. Say we have a pointer to such a buffer.
Then introduce a getter that returns that pointer. Plus, it might be further optimized if the pointer is cached.
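A minimal sketch of that raw-pointer shape layout, assuming hypothetical helper names (nothing here is existing onert API):

```cpp
#include <cstdint>

// Hypothetical flat shape encoding: the first element is the rank,
// the remaining `rank` elements are the dimensions.
static const int32_t kExampleShape[] = {4, 1, 224, 224, 3}; // a 1x224x224x3 tensor

inline int32_t rankOf(const int32_t *shape) { return shape[0]; }
inline const int32_t *dimsOf(const int32_t *shape) { return shape + 1; }

// A kernel could then read rank and dims directly from the pointer,
// with no per-run conversion into a separate Shape object.
```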
Since we know that the cpu backend always gets backend::cpu_common::Tensor as its input, we could make a static cast for each tensor first, then get a constant reference to its _info field with a special Tensor-specific method and use it directly. By unifying ir::Shape and cker::Shape, the conversion overhead may be reduced even more. I don't get why our ITensor interface is so abstract, with an individual getter for everything, when we could just keep some general tensor descriptor structure common to all tensor types and return a constant reference to it with a single, probably even non-virtual, call. That is the approach used in MNN, for example.
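A rough sketch of that idea under stated assumptions (the class and method names below are placeholders, not the actual onert types):

```cpp
// Stand-in for the abstract ITensor interface.
struct ITensorBase
{
  virtual ~ITensorBase() = default;
};

// Placeholder descriptor kept common for all tensor types:
// shape, data type, layout, ... gathered in one plain struct.
struct TensorDescriptor
{
};

// Stand-in for backend::cpu_common::Tensor.
struct CpuTensor : ITensorBase
{
  const TensorDescriptor &descriptor() const { return _info; } // non-virtual, single call
  TensorDescriptor _info;
};

// Since the cpu backend only ever receives CpuTensor, a kernel could downcast
// once and keep the const reference, instead of calling an individual virtual
// getter per field on every run.
inline const TensorDescriptor &descriptorOf(ITensorBase *t)
{
  return static_cast<CpuTensor *>(t)->descriptor();
}
```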
Besides, getTensorShape is not the only source of tensor-size-specific preprocessing overhead. Since dynamic tensors are not common in neural networks, we had better add some onTensorInfoChanged method in addition to configure() and do all the necessary preprocessing there, including caching the needed tensor shapes. When a tensor's shape, data type, layout, or anything like that changes, some mechanism would trigger the onTensorInfoChanged call of the affected operators before calling run().
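A hedged sketch of how such a hook could look (hypothetical interface, not existing onert code):

```cpp
// Placeholder operation-kernel interface with the proposed extra hook.
struct IOperationKernel
{
  virtual ~IOperationKernel() = default;
  virtual void configure() = 0;           // one-time setup
  virtual void onTensorInfoChanged() = 0; // re-derive all shape-dependent data
  virtual void run() = 0;                 // uses cached data only
};

// The runtime would call onTensorInfoChanged() on affected kernels before
// run(), but only when a shape / data type / layout actually changed, so the
// common static-shape path pays no per-run shape query cost.
```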
Yes, that could be a way. But please also note that it gets
That's all because we separate backends as plugins. If we had only the cpu backend, it would have been much simpler. We first started this project binding with ARM ComputeLibrary as the primary backend and then extended it to support other kernels (cpu), so we introduced the abstract interface.
That's somewhat what I meant by #4544 (comment), but keeping the current structure. It would be great to do that with structural changes if it keeps the backend generality too. If you would like to talk about something specific, you may talk to me directly as well. 😄
I respectfully disagree. I think these days dynamic tensors are getting popular. And we are working on many dynamic tensor models.
I'm not sure I got it right; I left some comments:
If I understand the dynamic tensor logic right, all dynamic-shape input tensors should return true on a call to is_dynamic. Hence, it is possible to call is_dynamic for every input and update the corresponding getTensorShape cache only when it returns true. Moreover, the static shape inferer always sets the dynamic flag of the op output tensor once any of the op input tensors is dynamic, so it may be enough to check that the output tensor returns false on is_dynamic to be sure all the input tensors are static as well. It is also possible to assume that once a tensor has been set to dynamic, it will never become static again. Static tensor shapes are already known when configure() runs. I'm going to try that approach in binary ops: once the output's is_dynamic returns false in run(), I will reuse the results of getTensorShape and ProcessBroadcastShapes previously cached in configure(). What do you think? Would that be the correct logic?
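For illustration, a minimal sketch of that plan (the layer class and its members are hypothetical; getTensorShape, is_dynamic and ProcessBroadcastShapes follow the discussion above, and the exact onert/cker signatures may differ):

```cpp
// Assumes the relevant onert / cker headers are included.
class CachedBinaryOpLayer // hypothetical, not an actual onert layer
{
public:
  void configure(const IPortableTensor *lhs, const IPortableTensor *rhs, IPortableTensor *output)
  {
    _lhs = lhs;
    _rhs = rhs;
    _output = output;
    refreshShapeCache(); // static shapes are already known at configure() time
  }

  void run()
  {
    // If any input were dynamic, the static shape inferer would have flagged
    // the output as dynamic too, so checking the output alone is enough.
    if (_output->is_dynamic())
      refreshShapeCache();
    // ... invoke the cker kernel with _lhs_shape, _rhs_shape and _op_params ...
  }

private:
  void refreshShapeCache()
  {
    _lhs_shape = getTensorShape(_lhs);
    _rhs_shape = getTensorShape(_rhs);
    _need_broadcast = nnfw::cker::ProcessBroadcastShapes(_lhs_shape, _rhs_shape, &_op_params);
  }

  const IPortableTensor *_lhs = nullptr;
  const IPortableTensor *_rhs = nullptr;
  IPortableTensor *_output = nullptr;
  nnfw::cker::Shape _lhs_shape, _rhs_shape;
  nnfw::cker::BinaryArithmeticOpParam _op_params{};
  bool _need_broadcast = false;
};
```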
@krayzemli I am not sure it is always right. We sometimes run the same model with different input.
I was also thinking of caching
Please see
With some models, the app does not call that API. For models with dynamic tensors, there are some tricky cases:
In this case, during runs B) and C), the hit ratio of the tensor shape cache built at run A) will be 0%. Then in run Z) the ratio will be 100%. Another case to consider is the following. You mention:
I am not sure if I understand you correctly, but maybe there is a case of
If a model always runs without dynamic tensors, I think you're correct. However, the situation with dynamic tensors is complex to handle. For optimization, we may need some API or environment variable limiting such complex situations caused by dynamic tensors, e.g.
@krayzemli I get what you are trying to do, and I would be grateful for that. However, I am not sure it would work out. By the way, regarding the thing that you mentioned:
Rather than the cache work, wouldn't this solve almost everything? I think with that there would be almost no overhead, so we wouldn't need the cache stuff.
@krayzemli @hyunsik-yoon Back to the cache work, is it possible to check whether the cache has been invalidated or not? I'm not sure if
Tensor shape may not be the only thing we would want to cache. For example,
PR #4611
Good point. Sounds reasonable. 😄
I get what you meant. However, as I'm more interested in dynamic models, I still think it would be worthwhile doing this, since your work is focused on static models only. And tensor shapes are something that most operation kernels are interested in.
In #4544 (comment), this means that once
Since each tensor is typically used twice (once as an output and once as an input), caching
@wateret (beyond this PR)
Please note that getTensorShape overhead is significant for models which have a lot of ops and no 1~2 dominant kernels. It would be good if we could remove the overhead of getTensorShape. I think the complex hierarchy hinders the compiler's optimization. Do you have any idea or suggestion?

Originally posted by @glistening in #4538 (comment)