New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using RBIT Instruction for bit reversal memory reordering #1278
Comments
@richardallenatgarmin Thanks for the report. Nice to see new optimizations. It is not simple. Ideally, I'd like to be able to have this optimization. But I would need to detect if we are compiling for M0/M0+/M23 (RBIT not available) or for M55 (vectorized code is not using RBIT and still relying on a table). Today, CMSIS-DSP is only checking for the existence of DSP extensions. And RBIT is not part of it. I don't want to test for a specific architecture because it is adding too much complexity will all the variants which must be supported by the library. When I am looking at the ARM C language extension specification, I see that RBIT is available as a C intrinsic:
but only on the targets with a rbit instruction. And there is no specific compiler define to check for the presence of the instruction or not (architecture must be tested). I am switching the tag of this github issue to "enhancement" and I'll think about it for the future and how we can support instructions like this one which are not available everywhere and which are not tagged by a specific compiler define like the DSP extensions. |
Initially I worked on arm_bitreversal_q15, but arm_bitreversal_15 seems more commonly used. Some of the optimizations apply there too, but table-less will be a slower as well, for AC5 arm_bitreversal_15 total change wasn't large(but AC5 is very sensitive to loop alignment, so more testing is needed). I'll rebase the other optimization so it can be applied separately from table-less reversal, but I agree that without RBIT, table-less reversal would be quite a bit slower. Because the functions are small compared to the tables, for now our fork builds in both functions - callers can specify a table pointer to use the upstream algorithm or NULL to use table-less in their arm_cfft_instance_q15. |
We are moving CMSIS-DSP to a new repository : https://github.com/ARM-software/CMSIS-DSP I have recreated the github issues on the new repository (sometimes merging a few related ones). You can see the link just above this comment (... mentioned this issue ...). So, I am closing the github issue here. |
I found that at least on my current platform(nRF52/Cortex-M4F,-Otime,-O3), arm_bitreversal_q15() can be made about 10% faster and 2KB smaller by using RBIT and shifting the result appropriately, compared to the current approach of using a bit-reversal table.
I've got a patch started for this, but the RBIT instruction does not appear to be available for M0/M0+/M23. Before I proposed a patch I wanted to ask your thoughts on how best to handle them and how to tell if __RBIT intrinsic is available or not, or if it might be better to implement a fallback __RBIT in C and remove the table.
The text was updated successfully, but these errors were encountered: