feature request - support for Zc* extensions #633

Open
biosbob opened this issue Jun 21, 2023 · 12 comments
Labels: enhancement, good first issue, help wanted, HW (hardware-related)

Comments

@biosbob (Collaborator) commented Jun 21, 2023:

the proposed Zc* extensions described here have recently been ratified....

the Zca extension is of particular interest in "small cores" with limited memory resources....

just a placeholder for what appears to be a non-trivial improvement....

@stnolting (Owner) commented:

The standard C extension is already implemented (-> CPU_EXTENSION_RISCV_C generic). In this particular case C = Zcf, but all the compressed floating-point operations are mapped to normal integer load/store because Zfinx is used instead of F.
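
A minimal C sketch of that Zfinx mapping (illustrative only; actual code generation depends on the compiler and register allocation):

```c
/* With Zfinx, float operands live in the integer register file, so a float
 * memory access compiles to a plain integer load -- the same (compressible)
 * lw/c.lw that Zca already covers. With a separate F register file it would
 * be an flw/c.flw instead, which is what Zcf compresses. */
float scale_first(const float *p)
{
    return p[0] * 2.0f;   /* Zfinx: c.lw + fmul.s on x-registers */
}
```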

I had a closer look at the Zc* ISA extensions. I think they are quite promising. However, in terms of the NEORV32, I am not fully convinced that it would be a good idea to implement all of them.

  • Zca - this is what we have if the FPU is disabled (Zfinx disabled)
  • Zcf - this is what we have if the FPU is enabled
  • Zcd makes no sense as the FPU is single-precision only
  • Zcmp (list-based push/pop similar to ARM) is quite interesting, but would require a lot of additional hardware. Furthermore, precise exception trapping is complex here as there are several memory loads/stores invoked by a single instruction.
  • Zcmt (table-based jumps) might be a nice thing to have. But this would have a high latency - so the only gain would be further code size reduction (not performance).
  • Zcb - I really like this extension because it adds 16-bit variants of common operations (like multiplication; see the sketch below). This would be quite easy to implement, I think. So, yeah, maybe this sub-extension might be integrated in the future. 😉

This is just my opinion. Any thoughts?
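
For reference, a small C sketch of the kind of code Zcb targets (the comments mark which 16-bit Zcb encodings a supporting compiler could use; whether it actually does depends on register allocation):

```c
#include <stdint.h>

/* Each marked operation has a 16-bit Zcb encoding, provided its operands end
 * up in the x8-x15 registers favored by the compressed instruction formats. */
uint32_t mix(const uint8_t *buf, uint16_t *out, uint32_t seed)
{
    uint32_t b = buf[2];            /* c.lbu    : compressed byte load        */
    uint32_t s = (uint8_t)seed;     /* c.zext.b : compressed zero-extension   */
    uint32_t x = b * s;             /* c.mul    : compressed multiplication   */
    out[0] = (uint16_t)x;           /* c.sh     : compressed halfword store   */
    return ~x;                      /* c.not    : compressed bitwise NOT      */
}
```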

stnolting added the enhancement and HW (hardware-related) labels on Jun 23, 2023
@biosbob (Collaborator, Author) commented Jun 23, 2023:

Zcb looks promising, in that it is relatively easy to implement.... with disciplined declarations of integer types in EM (uint8, int16, uint32, etc) this would mesh quite well.... future CPU implementations that use (say) an internal 8- or 16-bit ALU would also benefit; reducing ALU width obviously saves gates....

as for Zcmp, the EM runtime would generally place some common push/pop code fragments (used by LLVM) into the boot ROM.... even with the smallest boot ROM (say, 2K), there is plenty of "free and fast" instruction memory....

i'm not entirely sure what motivates Zcmt.... but honestly, reducing code size without a performance gain doesn't seem worth it anyway....

bottom line -- Zcb would be first, when we get around to it....

@stnolting (Owner) commented:

> bottom line -- Zcb would be first, when we get around to it....

I agree! But before we start implementing that, we should wait for GCC support. Unfortunately, the upcoming GCC 13(.1) does not include Zcb (https://gcc.gnu.org/gcc-13/changes.html).

@biosbob (Collaborator, Author) commented Jun 30, 2023:

i'm finding that LLVM is much more current with risc-v extensions.... looks like they have lots of Zb* support -- as requested in #640

since EM supports both compilers, comparative benchmarks are trivial....

@kimstik commented Jan 23, 2024:

As per the benchmark, Zcmp saves up to 35% (~6.5% on average) of code footprint.
GCC looks set to adopt it too.

@stnolting (Owner) commented:

Interesting results! Thanks for sharing!

35% would be quite amazing, but I'm not sure what the "cost" of that might be (additional hardware resources, impact on the critical path, etc.). Zcmp adds push and pop operations that would require modifying the CPU's pipeline, as several memory accesses are triggered by a single instruction.

But the NEORV32's execution stage is a multi-cycle architecture... so maybe the additional hardware overhead would be quite small... I think I'll need to have a closer look at this again.
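
To make the "several memory accesses per instruction" point concrete, here is a rough behavioral sketch (not the actual NEORV32 RTL, and with the register save order simplified) of what a multi-cycle execute stage would have to sequence for a Zcmp-style push:

```c
#include <stdint.h>

typedef struct {
    uint32_t x[32];   /* integer register file (x1 = ra, x2 = sp) */
} cpu_state_t;

/* Behavioral model of a cm.push-like operation: a single 16-bit instruction
 * that stores ra plus the first 'n_sregs' callee-saved registers below sp and
 * then adjusts sp. In hardware this means one bus write per saved register,
 * issued back-to-back by the execute stage -- and a bus fault on, say, the
 * third write must still be trapped precisely, which is the expensive part. */
void model_zcmp_push(cpu_state_t *cpu, uint32_t *mem /* word-addressed RAM */,
                     int n_sregs, uint32_t stack_adj)
{
    /* ABI register numbers: s0/s1 = x8/x9, s2..s11 = x18..x27 */
    static const uint8_t sregs[12] = {8, 9, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27};
    uint32_t addr = cpu->x[2];              /* sp before adjustment          */

    addr -= 4;
    mem[addr / 4] = cpu->x[1];              /* store ra                      */
    for (int i = 0; (i < n_sregs) && (i < 12); i++) {
        addr -= 4;
        mem[addr / 4] = cpu->x[sregs[i]];   /* one memory write per s-reg    */
    }
    cpu->x[2] -= stack_adj;                 /* final stack-pointer adjust    */
}
```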

@kimstik commented Jan 26, 2024:

Moreover, Zcb has become mandatory for the RVA23 profile.

@stnolting (Owner) commented:

Oh, I did not expect that. However, RVA is the application-class profile (MMU, 64-bit, ...), which is out of scope of this project right now 🙈

I had another look at the Zcb specs. Basically, it just adds 11 new compressed instructions. Adding the memory operations should be quite easy and I think the performance benefit might be noticeable. Adding the remaining instructions (bit-manip, multiplication, inversion) is a little bit more complex but still doable.

Anybody volunteering to do a PR? 😅

stnolting added the help wanted and good first issue labels on Jan 28, 2024
@kimstik commented Apr 30, 2024:

btw, I tested LLVM 18 with/without Zcmp:
[image]

@stnolting (Owner) commented:

Zcmp's push/pop instructions are quite powerful, as they can "compress" up to 13 loads/stores and an addition into a single 16-bit instruction! They might even increase performance a little bit, as there will be less traffic on the CPU's instruction fetch interface.

However, the big problem with these two instructions is that they do not decompress into a single 32-bit counterpart. Instead, they expand into several different instructions, which would require a lot of hardware overhead. So I think that the "costs" clearly exceed the benefits here.

What do you think? 🤔

@kimstik commented May 7, 2024:

The main advantages, from my biased point of view, are reducing the load on the instruction fetch channel, less cache pollution, and a positive impact on interrupt handler latency. However, the most valuable aspect is the reduction of code size to at least the level of the Cortex-M0.

I think technical difficulties are unavoidable, and it's hard to evaluate the trade-offs objectively until they come into play :).
If the overhead is truly enormous, the number of configurations where Zcmp would be useful will shrink to a minimum. But I'd like to hope that the overhead won't be so huge that it ruins the whole idea.

@stnolting (Owner) commented:

> The main advantages, from my biased point of view, are reducing the load on the instruction fetch channel

That's true! In its best case, this instruction saves up to 27 further 16-bit words from being fetched.

> less cache pollution

Also true. However, embedded single-core systems might not need any kind of cache if you use fast on-chip memory.

> and a positive impact on interrupt handler latency

I'm not sure about this. Execution time would be identical. However, with less cache pollution / bus congestion there might be a relevant speedup.

> However, the most valuable aspect is the reduction of code size to at least the level of the Cortex-M0.

Maybe, but technically such a complex instruction isn't "RISC" anymore, right? 😅

> If the overhead is truly enormous, the number of configurations where Zcmp would be useful will shrink to a minimum. But I'd like to hope that the overhead won't be so huge that it ruins the whole idea.

I think there are several benchmark examples provided by the people who designed these extended compressed instructions. The benefit (looking purely at code size and performance) is quite impressive!
