Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using RBIT Instruction for bit reversal memory reordering #1278

Closed
richardallenatgarmin opened this issue Aug 10, 2021 · 3 comments
Closed
Assignees
Labels
DSP enhancement moved moved to new github repository

Comments

@richardallenatgarmin
Copy link

I found that at least on my current platform(nRF52/Cortex-M4F,-Otime,-O3), arm_bitreversal_q15() can be made about 10% faster and 2KB smaller by using RBIT and shifting the result appropriately, compared to the current approach of using a bit-reversal table.

I've got a patch started for this, but the RBIT instruction does not appear to be available for M0/M0+/M23. Before I proposed a patch I wanted to ask your thoughts on how best to handle them and how to tell if __RBIT intrinsic is available or not, or if it might be better to implement a fallback __RBIT in C and remove the table.

@christophe0606
Copy link
Contributor

@richardallenatgarmin Thanks for the report. Nice to see new optimizations.

It is not simple. Ideally, I'd like to be able to have this optimization. But I would need to detect if we are compiling for M0/M0+/M23 (RBIT not available) or for M55 (vectorized code is not using RBIT and still relying on a table).

Today, CMSIS-DSP is only checking for the existence of DSP extensions. And RBIT is not part of it. I don't want to test for a specific architecture because it is adding too much complexity will all the variants which must be supported by the library.

When I am looking at the ARM C language extension specification, I see that RBIT is available as a C intrinsic:

uint32_t __rbit(uint32_t x);

but only on the targets with a rbit instruction.

And there is no specific compiler define to check for the presence of the instruction or not (architecture must be tested).

I am switching the tag of this github issue to "enhancement" and I'll think about it for the future and how we can support instructions like this one which are not available everywhere and which are not tagged by a specific compiler define like the DSP extensions.

@richardallenatgarmin
Copy link
Author

Initially I worked on arm_bitreversal_q15, but arm_bitreversal_15 seems more commonly used. Some of the optimizations apply there too, but table-less will be a slower as well, for AC5 arm_bitreversal_15 total change wasn't large(but AC5 is very sensitive to loop alignment, so more testing is needed). I'll rebase the other optimization so it can be applied separately from table-less reversal, but I agree that without RBIT, table-less reversal would be quite a bit slower.

Because the functions are small compared to the tables, for now our fork builds in both functions - callers can specify a table pointer to use the upstream algorithm or NULL to use table-less in their arm_cfft_instance_q15.

@christophe0606
Copy link
Contributor

We are moving CMSIS-DSP to a new repository : https://github.com/ARM-software/CMSIS-DSP

I have recreated the github issues on the new repository (sometimes merging a few related ones). You can see the link just above this comment (... mentioned this issue ...).

So, I am closing the github issue here.

@christophe0606 christophe0606 added the moved moved to new github repository label Jun 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DSP enhancement moved moved to new github repository
Projects
None yet
Development

No branches or pull requests

3 participants