Implement a new JIT for Arm devices #6057
Nintendo 64 apps now start on firmware 13; firmware 14 and above still crash.
On x64, this is because ARMeilleure's implementation of these instructions doesn't fuse mul-add/mul-sub.
Other titles tested include... BoTW: Red Dead Redemption: Pokemon Mystery Dungeon: Rescue Team DX: Mario Kart 8 Deluxe: tagged random dude sorry
IIRC it's due to the hacks to make host mapped memory manager mode work with 16KB pages.
Wow thank you so much for your hard work!
I have a Thinkpad X13s, one of the rare few Windows on ARM laptops that can reasonably run Linux. On Vulkan with PPTC disabled on both master and PR build
(I measure boot time as time between clicking a game and getting to the menu, so if the time looks long, it's probably because the intro is long) This is not accounting for the poor performance or major graphical errors present on this device on most titles (either Turnip not knowing what to do with Ryujinx, or Ryujinx just not being optimized for Adreno, or a mix of both). The JIT doesn't really affect performance apart from maybe 1 or 2 games of mine, but the difference is felt even on this random laptop.
Luigi's Mansion 3 now works for me! Thanks.
I found that if the address space allocated by the emulator is smaller than what the guest application requires, the emulator crashes with a bus error.
Which game does this?
This is an absolutely insane amount of work, major kudos... and for A32 too. Compilation is definitely way faster, and loading times seem much shorter than ARMeilleure with PPTC in general. We would probably get some fantastic numbers for in-game performance if the GPU weren't the bottleneck. It'll be a big improvement when a system has weaker multicore performance even with a GPU bottleneck.
I've read through all files, some have just been skimmed because I'd be cross referencing a datasheet otherwise. My only complaint is that InstEmit for A32 is super long, but I don't see a better way to do it and keep the instruction table readable. A64 approach is really elegant since most instructions can just rewrite the in/out registers.
I've tested across a number of games and haven't seen any problems yet. Note that I haven't really played games for an extensive period, so I'm unsure if any of the HV softlocks or similar are present in any form (they definitely don't show up in the first 15 minutes).
After an hour of running in a circle it does look like at least this instance of HV softlock also occurs in JIT. As usual nothing strange in logs. Breath of the Wild does not experience the same HV deadlocks that occurred before the ordering revert and seems very stable.
Closes #1884?
That request is not very clear. I'm not sure what would be categorized as a "simple" ARM instruction. It sounds more like a request for something that can run the code directly like hypervisor, or something like "NCE".
Impressive work
* Implement a new JIT for Arm devices
* Auto-format
* Make a lot of Assembler members read-only
* More read-only
* Fix more warnings
* ObjectDisposedException.ThrowIf
* New JIT cache for platforms that enforce W^X, currently unused
* Remove unused using
* Fix assert
* Pass memory manager type around
* Safe memory manager mode support + other improvements
* Actual safe memory manager mode masking support
* PR feedback
Overview
This is a new JIT compiler that is designed for Arm CPUs. The main reason for creating a new JIT is that the current one (ARMeilleure) can't easily take advantage of the fact that when translating Arm code to run on another Arm CPU, most code can remain untouched since it's the same architecture.
The new JIT skips several steps that the current JIT would do, and as a result it can produce code much faster (and with even better quality).
It is important to note that the new JIT currently does not support x86 host, and there is no plan for x86 support. While I do think we can add support for other RISC architectures in the future, I think x86 is different enough from Arm that it would not be worth it, and the approach used by the current JIT is good enough.
Motivation
Arm devices are becoming more popular, and we want to ensure we have good support for those devices in the future. Having a decent JIT that can target Arm CPUs is important for that. While it is always possible to improve the existing JIT, making it the best it can be would require a completely different approach, one that can take maximum advantage of the similarity between Arm and... well, Arm.
While on macOS, we can use hypervisor to run the guest code directly, there are other platforms where we don't have access to a hypervisor and would still depend on the JIT.
How it works
On both the 32-bit and 64-bit modes, the original register allocation is preserved. This has several benefits:
For the 32-bit mode, this is pretty simple since AArch64 has (almost) 32 general purpose registers, while AArch32 has (almost) 16. So we have double the number of registers to work with. Guest registers R0-R14 are mapped directly to the 32-bit host registers W0-W14. R15 is special (PC); it is also mapped to W15, but handled in a special way.
The 64-bit mode is more challenging, because the number of general purpose registers is the same in this case. Some of those registers are special (SP, LR, FP, etc.). We can't use special registers in translated guest code directly. So instead, if a function uses one of those special registers, it is remapped to a regular register that is not in use in that function. If there are not enough unused registers available in the function to remap all special registers, the function is truncated at the point where the number of registers in use reaches the limit. Truncated functions are basically split into smaller "sub-functions". At the end of the truncated block, it jumps to the next part of the function, which has been translated separately. It would also have been possible to spill the registers instead, but that would be considerably more complex for little benefit. The number of truncated functions is very small, usually around 2%. Currently only one register is reserved as a temporary register. Increasing the number of temporary registers it needs to reserve would also increase the number of truncated functions.
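The remap-or-truncate idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the register numbers, the deliberately tiny host register pool, and the helper names `fits`, `build_mapping` and `split_function` are all hypothetical and do not come from the JIT's source.

```python
# Illustrative only: register numbers, the tiny host pool, and all helper
# names are assumptions, not the JIT's actual data structures.

def fits(used, pool, temps):
    """Every guest register in use needs one host register from the pool,
    in addition to the reserved temporaries."""
    return len(used) + temps <= len(pool)


def build_mapping(used, special, pool, temps):
    """Keep regular registers where they are and move special ones to host
    registers that the chunk does not otherwise use."""
    free = [r for r in pool if r not in used]
    needed = sorted(r for r in used if r in special)
    if len(free) < len(needed) + temps:
        raise ValueError("not enough free host registers")
    mapping = {r: r for r in used if r not in special}
    # zip stops after the special registers; leftover free registers
    # remain available as temporaries.
    mapping.update(zip(needed, free))
    return mapping


def split_function(inst_regs, special, pool, temps=1):
    """inst_regs holds, per instruction, the set of guest registers it uses.
    Returns a list of (start, end, mapping) chunks, with end exclusive."""
    chunks, start, used = [], 0, set()
    for i, regs in enumerate(inst_regs):
        regs = set(regs)
        if not fits(regs, pool, temps):
            raise ValueError("a single instruction exceeds the register budget")
        if not fits(used | regs, pool, temps):
            # Truncate here: the rest becomes a separately translated sub-function.
            chunks.append((start, i, build_mapping(used, special, pool, temps)))
            start, used = i, set()
        used |= regs
    if start < len(inst_regs):
        chunks.append((start, len(inst_regs), build_mapping(used, special, pool, temps)))
    return chunks


# Toy example: 4 regular host registers, registers 30 and 31 treated as special.
chunks = split_function(
    inst_regs=[{0, 1}, {1, 31}, {2, 30}, {0, 3}],
    special={30, 31},
    pool=[0, 1, 2, 3],
)
for start, end, mapping in chunks:
    print(f"instructions [{start}, {end}) use mapping {mapping}")
```

With such a small pool the toy function is split into three chunks. With the real register count the limit is rarely hit, which matches the small share of truncated functions mentioned above.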
Like ARMeilleure, this JIT is capable of translating entire functions (rather than a single basic block). However, it may not be able to detect the end of a function perfectly. This is not really a problem, but it can cause overlapping parts of the same function to appear in the JIT cache. It can also potentially have performance implications. The function block decoding is more limited than the one in ARMeilleure: the new JIT does not extend functions backwards (if there is a backwards jump, from a loop for example), nor forwards for jumps that are too far away. That said, it can handle most common cases, and can handle loops just fine. As long as the function is well formed and doesn't have weird control flow, it can detect the entire function.
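One plausible way to picture this kind of forward-only function detection is shown below. It is a simplified model, not the JIT's decoder: the instruction kinds, the fixed 4-byte step, the `max_forward` limit and the function name `find_function_end` are all assumptions made for the sake of the example.

```python
# Hypothetical decoder model: instruction kinds, the 4-byte step, and the
# max_forward limit are assumptions made for illustration.
INST_SIZE = 4


def find_function_end(insts, entry, max_forward=0x1000):
    """insts maps address -> (kind, target), where kind is one of
    "other", "cond_branch", "branch" or "ret".
    Returns the exclusive end address of the detected function."""
    addr = entry
    furthest_target = entry  # furthest forward branch target seen so far
    while addr in insts:
        kind, target = insts[addr]
        if kind in ("branch", "cond_branch"):
            # Only nearby forward targets extend the function. Backward
            # targets (loops, or code before the entry) never move the bounds.
            if addr < target <= addr + max_forward:
                furthest_target = max(furthest_target, target)
        # A return or unconditional branch ends the function only when no
        # pending forward target lies beyond it.
        if kind in ("branch", "ret") and addr >= furthest_target:
            return addr + INST_SIZE
        addr += INST_SIZE
    return addr


example = {
    0x1000: ("other", None),
    0x1004: ("cond_branch", 0x1010),  # forward jump over the first return
    0x1008: ("other", None),
    0x100C: ("ret", None),
    0x1010: ("other", None),
    0x1014: ("cond_branch", 0x1000),  # loop back to the start
    0x1018: ("ret", None),
}
print(hex(find_function_end(example, 0x1000)))
print(hex(find_function_end(example, 0x1000, max_forward=8)))
```

In the second call the forward jump is treated as too far away, so the function is considered to end at the first return. Translating the code after that return separately is the kind of situation that can leave overlapping parts of the same function in the JIT cache.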
The guest state is loaded at the entry point of the function and saved at all exit points. When another guest function is called, the guest state is saved to the context. When the call returns, the address of the next function to execute (returned in X0) is compared with the next address this function would execute. If they match, it loads the state from the context again and keeps executing; otherwise it returns, until it finds the right function or reaches the dispatcher.
Like ARMeilleure, it performs data flow analysis to find the registers that are used and modified at each entry/exit point of the function. That way it only needs to save and restore from context what is really used. However, the results might not be optimal, as this is not a simple problem to solve. The data flow analysis is also currently only available for Arm64 mode; Arm32 instead always saves and restores all registers that are used in the entire function. This is simpler and produces worse results. The reason is that I planned to skip saving/restoring the context entirely for Arm32 (as it can just keep the state in registers all the time). But in the end, I didn't do it and here we are.
All register usage information is derived from the instruction table. The table says which registers each instruction uses, and the translator uses that to extract the register numbers from the instruction itself and build register usage information. This is also used for register remapping. Since Arm32 is not doing data flow analysis, I also did not include most of the register usage information in the instruction table; instead it only tracks register writes (which is required to know if an instruction changes the PC register, which is considered a jump).
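To make the table-driven register tracking more concrete, here is a minimal sketch. It uses a made-up 16-bit toy encoding (4-bit opcode followed by three 4-bit register fields), not real Arm encodings, and every name in it (`Entry`, `TABLE`, `lookup`, `register_usage`, `save_state`) is hypothetical. The `excluded` field models a per-entry list of disallowed encodings. The usage masks are computed for the whole function, which is closer to the simpler Arm32 behaviour than to a real data flow analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    mask: int                  # bits that identify the instruction
    value: int                 # expected value of those bits
    read_fields: tuple = ()    # bit offsets of 4-bit register fields that are read
    write_fields: tuple = ()   # bit offsets of 4-bit register fields that are written
    excluded: tuple = ()       # (mask, value) pairs that must NOT match


# Toy layout (an assumption): [15:12] opcode, [11:8] Rd, [7:4] Rn, [3:0] Rm.
TABLE = (
    Entry("ADD", 0xF000, 0x1000, read_fields=(4, 0), write_fields=(8,)),
    # In this toy ISA, MOV with Rn == 0xF is an undefined encoding.
    Entry("MOV", 0xF000, 0x2000, read_fields=(0,), write_fields=(8,),
          excluded=((0x00F0, 0x00F0),)),
)


def lookup(opcode):
    """Return the matching table entry, or None if the encoding is undefined."""
    for entry in TABLE:
        if (opcode & entry.mask) != entry.value:
            continue
        if any((opcode & m) == v for m, v in entry.excluded):
            continue
        return entry
    return None


def register_usage(opcodes):
    """Build (read_mask, write_mask) bitmasks for a sequence of opcodes."""
    reads = writes = 0
    for opcode in opcodes:
        entry = lookup(opcode)
        if entry is None:
            raise ValueError(f"undefined encoding {opcode:#06x}")
        for offset in entry.read_fields:
            reads |= 1 << ((opcode >> offset) & 0xF)
        for offset in entry.write_fields:
            writes |= 1 << ((opcode >> offset) & 0xF)
    return reads, writes


def save_state(context, regs, mask):
    """Store only the registers selected by mask into the guest context."""
    for index, value in enumerate(regs):
        if (mask >> index) & 1:
            context[index] = value


# ADD r3, r1, r2 followed by MOV r4, r3 in the toy encoding.
reads, writes = register_usage([0x1312, 0x2403])
context = [0] * 16
save_state(context, list(range(100, 116)), writes)
print(f"reads={reads:#06b} writes={writes:#07b} context={context}")
```

Only the two written registers end up in the context, which is the point of tracking usage: everything the function never touches can stay where it is.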
Speaking of the instruction table, it uses an approach similar to ARMeilleure for finding the instruction from the opcode. One difference is the way constraints are handled. In ARMeilleure, they are generally handled where it's most convenient: sometimes in the table, sometimes while decoding the instruction fields, sometimes in the implementation itself. If necessary, it would call the Undefined instruction implementation, if a specific encoding is undefined. In the new JIT, all constraints are added to the table. To make this possible, each entry in the table has a blacklist of encodings that are not allowed. So if a specific encoding is "Undefined" or another instruction, it is added to that list. This approach allows us to know if an instruction is undefined up-front. It also allows filtering out instructions that require specific extensions or ISA versions if we need.
Performance improvements
I didn't really notice frame rate improvements on the few tests I did, but there are some noticeable improvements for startup time and code size. The measurements below have been done on a Ryzen 9 7900X CPU (with the JIT generating Arm64 code; obviously it can't actually execute it, being an x86 CPU).
The chart below shows the measurement of the time taken to compile all the game code in the main binary. The games used for the test are Shantae and the Pirate's Curse, Neptunia Game Maker R:Evolution and New Super Mario Bros U Deluxe.
Neptunia has the largest code size among them, so naturally it was also the one that took the most time to compile. Its code size when uncompressed is 75 MB. Shantae's code size is 2 MB, and New Super Mario Bros U Deluxe's is 9 MB.
Here we can see the code size increase ratio when compared to the original code. A ratio of 1x means that it has the same size as the original code, 2x would be double the size, etc. Generally, the new JIT can produce code that is between 4x and 5x the size of the original (in a few cases managing to be around the 3.5x mark), while ARMeilleure is between 7x and 8x with HighCq mode. ARMeilleure also has the LowCq mode, which trades code quality for compilation speed, and as a result, is faster but produces significantly worse code (usually being 16x or 17x larger than the original).
The takeaway is that the new JIT can produce better code, and do so much faster than the old JIT. The most dramatic difference is seen on Neptunia, which has the largest code size. It took 4 minutes and 26 seconds to compile with the old JIT on HighCq mode, while on the new JIT it took only 4.64 seconds. The old JIT produced 633MB of code in the HighCq mode, while the new one produced 348MB. This is almost half the size, while taking a fraction of the time.
For 32-bit games, the code size difference is not as large for a few reasons:
Regardless, it still manages to win against the old JIT on code size, even with those limitations. It is also much faster, with New Super Mario Bros U Deluxe taking 8.7 seconds to translate in LowCq mode, 17.3 in HighCq, and 3 seconds with the new JIT.
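As a quick sanity check, the timing and size figures quoted above can be turned into ratios. The snippet below only uses numbers already stated in this description; the variable names are mine, and the 3 second figure for New Super Mario Bros U Deluxe is taken as given even though it is rounded.

```python
# All inputs are figures quoted in the description above.
neptunia_old_highcq_s = 4 * 60 + 26   # 4 minutes and 26 seconds
neptunia_new_s = 4.64
neptunia_old_highcq_mb = 633
neptunia_new_mb = 348

nsmbu_old_lowcq_s = 8.7
nsmbu_old_highcq_s = 17.3
nsmbu_new_s = 3

neptunia_speedup = neptunia_old_highcq_s / neptunia_new_s       # about 57x
neptunia_size_ratio = neptunia_new_mb / neptunia_old_highcq_mb  # about 0.55
nsmbu_speedup_highcq = nsmbu_old_highcq_s / nsmbu_new_s         # about 5.8x
nsmbu_speedup_lowcq = nsmbu_old_lowcq_s / nsmbu_new_s           # about 2.9x

print(f"Neptunia: {neptunia_speedup:.1f}x faster than HighCq, "
      f"{neptunia_size_ratio:.0%} of its output size")
print(f"NSMBU: {nsmbu_speedup_highcq:.1f}x faster than HighCq, "
      f"{nsmbu_speedup_lowcq:.1f}x faster than LowCq")
```

The gap is far larger for the 64-bit title, which is in line with the remark that the 32-bit path benefits less.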
The much faster compilation means that we do not need PPTC at all with the new JIT, and can avoid all the problems that come with it. It means no more disk space used for PPTC caches, no more issues with mods that change code, and no more slow-ish first time gameplay experience. To reiterate what I said earlier, those benefits are only for Arm CPUs; this JIT does not support x86.
There are visible improvements on boot times. Below we can see a comparison with Mario Kart 8 Deluxe.
Old JIT, PPTC disabled:
mk8d_oldjit_no_ptc.mp4
Old JIT, PPTC enabled:
mk8d_oldjit_with_ptc.mp4
New JIT:
mk8d_newjit.mp4
Not surprisingly, the new JIT can beat the old one with PPTC disabled. Even with PPTC enabled, it manages to be a little bit faster (by roughly 2 seconds). One thing I did notice is that the second boot seems to be noticeably faster than the first one with the new JIT. I did not have time to investigate why yet, but it might be the cost of the .NET JIT. For all tests above, I ran the game at least once beforehand to ensure .NET JIT compilation time is not showing.
Compatibility improvements
There are some games that did not work with the old JIT on Arm. They did work fine with hypervisor on macOS, so it's not a big deal, but it could be problematic for other platforms if we used JIT for them.
Luigi's Mansion 3, which would crash before showing the title screen:
Pokémon Scarlet, which would softlock before getting in-game:
Future improvements
Both Arm32 and Arm64 can be improved in the future:
While working on this, I also found some problems on the old JIT:
Alternative approaches
Switch emulators on Android have been using an approach that they call "NCE" (Native Code Execution). It can run the guest code (almost) directly without hypervisor, but it comes with limitations (or rather, the need to hack things around to make it work):
pthread_kill on Unix-like OS, but Windows has no such thing.
The last one deserves its own section given the number of problems it introduces:
That said, we could consider implementing it in the future.
Testing
Since this is a completely new JIT, there are a lot of things that have the potential to be broken, so testing is welcome. Note that it should be compared to the old JIT (hypervisor disabled). If the issue also happens with the old JIT, it is not a regression and it's probably not JIT related. 32-bit game testing is more important since we can't use hypervisor for that, so problems there would be more visible.
Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.