Implement a new JIT for Arm devices #6057

gdkchan · 2023-12-24T15:37:20Z

Overview

This is a new JIT compiler designed for Arm CPUs. The main reason for creating a new JIT is that the current one (ARMeilleure) can't easily take advantage of the fact that, when translating Arm code to run on another Arm CPU, most code can remain untouched since it's the same architecture.

The new JIT skips several steps that the current JIT would do, and as a result it can produce code much faster (and with even better quality).

It is important to note that the new JIT currently does not support x86 host, and there is no plan for x86 support. While I do think we can add support for other RISC architectures in the future, I think x86 is different enough from Arm that it would not be worth it, and the approach used by the current JIT is good enough.

Motivation

Arm devices are becoming more popular, and we want to ensure we have good support for those devices in the future. Having a decent JIT that can target Arm CPUs is important for that. While it is always possible to improve the existing JIT, making it the best it can be would require a completely different approach, one that can take maximum advantage of the similarity between Arm and... well, Arm.

While on macOS, we can use hypervisor to run the guest code directly, there are other platforms where we don't have access to a hypervisor and would still depend on the JIT.

How it works

On both the 32-bit and 64-bit modes, the original register allocation is preserved. This has several benefits:

Easier debugging as the generated code can be compared to the original without the need for a mental remapping of the registers.
Much faster compilation. Register allocation is slow, even more so if you use an algorithm that produces good allocations.
Generally lower code size and better code. The original binary most likely was compiled using a good register allocation with no time constraints, so it already has good allocation.

For the 32-bit mode, this is pretty simple since AArch64 has (almost) 32 general purpose registers, while AArch32 has (almost) 16. So we have double the amount of registers to work with. Registers W0-W14 are mapped to 32-bit registers R0-R14 directly. R15 is special (PC), it is also mapped to W15, but handled in a special way.
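The direct mapping for 32-bit mode can be illustrated with a minimal sketch. The helper names below are hypothetical and only restate the rule from the paragraph above (R0-R15 live in W0-W15, with R15 flagged for special handling); they are not taken from the PR code.

```python
# Hypothetical helpers illustrating the AArch32 -> AArch64 register mapping
# described above. Names are illustrative, not from the actual implementation.
PC_INDEX = 15  # R15 is the program counter on AArch32


def host_register_for(guest_index: int) -> str:
    """Return the AArch64 W register that holds guest register R<guest_index>."""
    if not 0 <= guest_index <= 15:
        raise ValueError(f"AArch32 has no register R{guest_index}")
    # The mapping is the identity: Rn lives in Wn.
    return f"W{guest_index}"


def needs_special_handling(guest_index: int) -> bool:
    """R15 (PC) is mapped like the others but must be treated specially."""
    return guest_index == PC_INDEX


mapping = {f"R{i}": host_register_for(i) for i in range(16)}
```

Because the mapping is the identity, generated code can be read side by side with the guest code without any mental renaming.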

The 64-bit mode is more challenging, because the number of general purpose registers is the same in this case. Some of those registers are special (SP, LR, FP, etc). We can't use special registers in translated guest code directly. So instead, if a function uses one of those special registers, it is remapped to a regular register that is not in use in that function. If there are not enough unused registers available in the function to remap all special registers, the function is truncated at the point where the number of registers reaches the limit. Truncated functions are basically split into smaller "sub-functions". At the end of the truncated block, it jumps to the next part of the function, which has been translated separately. It would also have been possible to spill the registers instead, but it would be considerably more complex for little benefit. The number of truncated functions is very small, usually around 2%. Currently there is only one register reserved as a temporary register. Increasing the number of temporary registers it needs to reserve would also increase the number of truncated functions.
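The remap-or-truncate decision described above could be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's implementation: the set of special registers (29, 30, 31 for FP, LR, SP), the pool of regular registers (0 to 28), and the `plan_function` name are all hypothetical; only the single reserved temporary comes from the text.

```python
# Simplified model: each instruction is represented only by the set of
# register numbers it touches. All constants below are assumptions.
SPECIAL = frozenset({29, 30, 31})   # assumed: FP, LR, SP
REGULAR = frozenset(range(0, 29))   # assumed pool of remappable registers
TEMP_COUNT = 1                      # one reserved temporary, as stated above


def plan_function(insts):
    """Return (end, remap, temps).

    end   -- number of leading instructions that fit in one translated block
    remap -- special register -> unused regular register
    temps -- regular registers reserved as temporaries
    """
    used_regular, used_special = set(), set()
    end = 0
    for regs in insts:
        new_regular = used_regular | (set(regs) & REGULAR)
        new_special = used_special | (set(regs) & SPECIAL)
        # Every special register and every temporary needs its own free slot.
        if len(new_regular) + len(new_special) + TEMP_COUNT > len(REGULAR):
            break  # truncate: the rest becomes a separate "sub-function"
        used_regular, used_special = new_regular, new_special
        end += 1
    free = sorted(REGULAR - used_regular, reverse=True)
    temps = free[:TEMP_COUNT]
    remap = dict(zip(sorted(used_special), free[TEMP_COUNT:]))
    return end, remap, temps


# A small function touching LR and SP fits entirely and gets two remaps.
end, remap, temps = plan_function([{0, 1, 31}, {2, 30}])
```

In this model, raising `TEMP_COUNT` shrinks the pool available for remapping, which matches the remark that reserving more temporaries would increase the number of truncated functions.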

Like ARMeilleure, this JIT is capable of translating entire functions (rather than a single basic block). However, it may not be able to detect the end of a function perfectly. This is not really a problem, but it can cause overlapping parts of the same function to appear in the JIT cache. It can also potentially have performance implications. The function block decoding is more limited than the one in ARMeilleure: the new JIT does not extend functions backwards (if there is a backwards jump, from a loop for example), nor does it follow forward jumps that are too far away. That said, it can handle most common cases, and can handle loops just fine. As long as the function is well formed and doesn't have weird control flow, it can detect the entire function.

The guest state is loaded at the entry point of the function, and saved at all exit points. When another guest function is called, the guest state is saved to the context. When the call returns, the address of the next function to execute (returned in X0) is compared against the next address this function would execute. If they match, the state is loaded from the context again and execution continues. Otherwise the function returns, and this repeats until the right function is found or the dispatcher is reached.

Like ARMeilleure, it performs data flow analysis to find the registers that are used and modified at each entry/exit point of the function. That way it only needs to save and restore from context what is really used. However, the results might not be optimal; this is not a simple problem to solve. The data flow analysis is also currently only available for Arm64 mode. Arm32 instead always saves and restores all registers that are used in the entire function. This is simpler but produces worse results. The reason is that I planned to skip saving/restoring the context entirely for Arm32 (as it can just keep the state in registers all the time). But in the end, I didn't do it and here we are.

All register usage information is derived from the instruction table. The table says which registers each instruction uses, and the translator uses that to extract the register numbers from the instruction itself and build register usage information. This is also used for register remapping. Since Arm32 is not doing data flow analysis, I also did not include most of the register usage information in its instruction table. Instead it only tracks register writes (which is required to know if an instruction changes the PC register, which is considered a jump).
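To illustrate how register usage can be derived from a table entry plus the instruction word, here is a minimal sketch. The flag names and the `register_usage` function are hypothetical. The field positions (Rd in bits 0-4, Rn in bits 5-9, Rm in bits 16-20) are the usual ones for AArch64 register-operand data processing instructions, and the example word encodes `ADD X0, X1, X2`.

```python
# Hypothetical per-instruction flags, as a table entry might carry them.
RD_WRITE = 1 << 0  # instruction writes the register named by the Rd field
RN_READ = 1 << 1   # instruction reads the register named by the Rn field
RM_READ = 1 << 2   # instruction reads the register named by the Rm field


def register_usage(encoding: int, flags: int):
    """Return (read_mask, write_mask), one bit per general purpose register."""
    rd = encoding & 0x1F
    rn = (encoding >> 5) & 0x1F
    rm = (encoding >> 16) & 0x1F
    read_mask = 0
    write_mask = 0
    if flags & RN_READ:
        read_mask |= 1 << rn
    if flags & RM_READ:
        read_mask |= 1 << rm
    if flags & RD_WRITE:
        write_mask |= 1 << rd
    return read_mask, write_mask


ADD_X0_X1_X2 = 0x8B020020
reads, writes = register_usage(ADD_X0_X1_X2, RD_WRITE | RN_READ | RM_READ)
# reads has bits 1 and 2 set (X1, X2); writes has bit 0 set (X0).
```

Masks like these can be OR-ed over the instructions of a function to decide which registers need to be loaded at entry and saved at exit, and which register numbers are free for remapping.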

Speaking of the instruction table, it uses an approach similar to ARMeilleure for finding the instruction from the opcode. One difference is the way constraints are handled. On ARMeilleure, they are generally handled where it's most convenient: sometimes in the table, sometimes while decoding the instruction fields, sometimes in the implementation itself. If necessary, it would call the Undefined instruction implementation if a specific encoding is undefined. On the new JIT, all constraints are added to the table. To make this possible, each entry in the table has a blacklist of encodings that are not allowed. So if a specific encoding is "Undefined" or another instruction, it is added to that list. This approach allows us to know if an instruction is undefined up-front. It also allows filtering out instructions that require specific extensions or ISA versions if we need to.
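A table lookup with per-entry excluded encodings might look like the sketch below. The class names, the `lookup` function and the 8-bit toy encodings are all invented for illustration; they are not real Arm encodings and not the PR's data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    mask: int
    value: int

    def matches(self, opcode: int) -> bool:
        return (opcode & self.mask) == self.value


@dataclass(frozen=True)
class TableEntry:
    name: str
    encoding: Encoding
    excluded: tuple = ()  # encodings that must NOT decode as this entry


def lookup(table, opcode: int) -> str:
    """Return the name of the first matching entry, or "Undefined"."""
    for entry in table:
        if not entry.encoding.matches(opcode):
            continue
        if any(bad.matches(opcode) for bad in entry.excluded):
            continue  # constraint hit: undefined here, or another instruction
        return entry.name
    return "Undefined"


# Toy 8-bit ISA: 0x1N is ADD, except 0x1F (undefined) and 0x1E (really MOV).
TOY_TABLE = [
    TableEntry("ADD", Encoding(0xF0, 0x10),
               excluded=(Encoding(0xFF, 0x1F), Encoding(0xFF, 0x1E))),
    TableEntry("MOV", Encoding(0xFF, 0x1E)),
]
```

With all constraints stored as data, whether an opcode is undefined is known from the lookup alone, before any emitter code runs.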

Performance improvements

I didn't really notice frame rate improvements on the few tests I did, but there are some noticeable improvements for startup time and code size. The measurements below have been done on a Ryzen 9 7900X CPU (with the JIT generating Arm64 code; obviously it can't actually execute it, being an x86 CPU).

The chart below shows the measurement of the time taken to compile all the game code in the main binary. The games used for the test are Shantae and the Pirate's Curse, Neptunia Game Maker R:Evolution and New Super Mario Bros U Deluxe.

Neptunia has the largest code size among them, so naturally it was also the one that took the most time to compile. Its code size when uncompressed is 75 MB. The Shantae code size is 2 MB, and New Super Mario Bros U Deluxe is 9 MB.

Here we can see the code size increase ratio when compared to the original code. A ratio of 1x means that it has the same size as the original code, 2x would be double the size, etc. Generally, the new JIT can produce code that is between 4x and 5x the size of the original (in a few cases managing to be around the 3.5x mark), while ARMeilleure is between 7x and 8x with HighCq mode. ARMeilleure also has the LowCq mode, which trades code quality for compilation speed, and as a result, is faster but produces significantly worse code (usually being 16x or 17x larger than the original).

The takeaway is that the new JIT can produce better code, and do so much faster than the old JIT. The most dramatic difference is seen on Neptunia, which has the largest code size. It took 4 minutes and 26 seconds to compile with the old JIT on HighCq mode, while on the new JIT it took only 4.64 seconds. The old JIT produced 633MB of code in the HighCq mode, while the new one produced 348MB. This is almost half the size, while taking a fraction of the time.
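As a quick check of the Neptunia figures, the ratios can be recomputed from the rounded numbers quoted in the text (75 MB original, 633 MB and 348 MB generated, 4 minutes 26 seconds versus 4.64 seconds). The variable names are only for this illustration, and the results inherit the rounding of the inputs.

```python
# Rounded figures quoted in the text for Neptunia.
ORIGINAL_MB = 75
OLD_JIT_MB = 633               # ARMeilleure, HighCq
NEW_JIT_MB = 348               # new JIT
OLD_JIT_SECONDS = 4 * 60 + 26  # 4 minutes 26 seconds
NEW_JIT_SECONDS = 4.64

old_size_ratio = OLD_JIT_MB / ORIGINAL_MB    # about 8.4x the original
new_size_ratio = NEW_JIT_MB / ORIGINAL_MB    # about 4.6x the original
relative_size = NEW_JIT_MB / OLD_JIT_MB      # about 0.55 of the old output
speedup = OLD_JIT_SECONDS / NEW_JIT_SECONDS  # about 57x faster to compile

print(f"size: {old_size_ratio:.2f}x vs {new_size_ratio:.2f}x "
      f"({relative_size:.0%} of old), compile speedup {speedup:.1f}x")
```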

For 32-bit games, the code size difference is not as large for a few reasons:

As mentioned before, it is not doing data flow analysis for more precise register usage information, which means in many cases, it saves and restores registers that are not actually used on a given control flow path.
32-bit code does not map as naturally to Arm64 as 64-bit code does. That means it requires more instructions per guest instruction.

Regardless, it still manages to win against the old JIT on code size, even with those limitations. It is also much faster, with New Super Mario Bros U Deluxe taking 8.7 seconds to translate in LowCq mode, 17.3 in HighCq, and 3 seconds with the new JIT.

The much faster compilation means that we do not need PPTC at all with the new JIT, and can avoid all the problems that come with it. It means no more disk space used for PPTC caches, no more issues with mods that change code, and no more slow-ish first time gameplay experience. To reiterate what I said earlier, those benefits are only for Arm CPUs; this JIT does not support x86.

There are visible improvements on boot times. Below we can see a comparison with Mario Kart 8 Deluxe.

Old JIT, PPTC disabled:

mk8d_oldjit_no_ptc.mp4

Old JIT, PPTC enabled:

mk8d_oldjit_with_ptc.mp4

New JIT:

mk8d_newjit.mp4

Not surprisingly, the new JIT can beat the old one with PPTC disabled. Even with PPTC enabled, it manages to be a little bit faster (by roughly 2 seconds). One thing I did notice is that the second boot seems to be noticeably faster than the first one with the new JIT. I did not have time to investigate why yet, but it might be the cost of the .NET JIT. For all tests above, I ran the game at least once beforehand to ensure .NET JIT compilation time is not showing.

Compatibility improvements

There are some games that did not work with the old JIT on Arm. They did work fine with hypervisor on macOS, so it's not a big deal, but it could be problematic for other platforms if we used JIT for them.

Luigi's Mansion 3, which would crash before showing the title screen:

Pokémon Scarlet, which would softlock before getting in-game:

Future improvements

Both Arm32 and Arm64 can be improved in the future:

Investigate improvements to register state load/store.
Re-use pointers for memory access when the guest register is the same.
Some peephole optimization opportunities on Arm32.
Endian flag is currently ignored on Arm32. We should implement it in the future, but no game actually uses this. The old JIT does not actually use it in all places either (for example, it is missing for VLDn/VSTn instructions, and also VLDM/VSTM iirc).

While working on this, I also found some problems on the old JIT:

NEON comparison instructions (VCGE_Z and friends) ignore the size field, and assume that the floating point size is FP32 even when it is FP16. Not a big deal since the Switch CPU does not support FP16, but it should be an Undefined instruction.
VMAXNM and friends ignore the sz field. Just like the above.
There is a difference of 1ulp for VRECPS/VRSQRTS results compared to the native instruction on Arm64. Not sure what is up with that...
VTBL and VTBX appear to produce wrong results if the table and destination registers are the same. Weirdly enough, it seems that this is intentional (see https://github.com/Ryujinx/Ryujinx/blob/master/src/ARMeilleure/Instructions/InstEmitSimdMove32.cs#L294C25-L294C41). x86 has an SSSE3 path that might not have this issue, but Arm has no fast path.
VLD2_1/VST2_1 with size = 0 and index_align<0> = 1. The register increment value is incorrect in this case, but I was not able to test it against Unicorn (it just freezes when I add this test case, for some reason...)

Alternative approaches

Switch emulators on Android have been using an approach that they call "NCE" (Native Code Execution). It can run the guest code (almost) directly without hypervisor, but it comes with limitations (or rather, the need to hack things around to make it work):

We can't run system instructions such as SVC, MRS, MSR etc directly, so patching the code to make it call the emulator is required. Those modifications are visible to the game.
Interrupting the guest threads becomes somewhat complicated, since we can't insert "interruption points" in the code. We can use pthread_kill on Unix-like OSes, but Windows has no such thing.
Worse by far, the code will access the emulator address space directly, so we need to set it up in a way that makes the game happy.

The last one deserves its own section given the number of problems it introduces:

On many devices, the address space size is 39-bit. The Switch also has a 39-bit address space, which means it's not possible to fit it within the emulator address space. We need to shrink the guest address space to make it fit, which has the potential to break games as we differ from the Switch. This problem also affects the JIT when using the host mapped memory manager modes.
The allocated guest region must be contained entirely inside the guest address space. This means that 36-bit games won't work, as we most likely won't be able to allocate 36-bit worth of virtual memory right at the start of the emulator address space. There are not many of those, so it is not considered a big deal.
For platforms with 16KB page size, there are additional challenges as we can't map the text segment and data segments as RX and RW respectively, since they are 4KB aligned, not 16KB aligned. We also can't just map them all as RWX if the platform has W^X. Making this work requires more hacks.
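The 16KB page problem in the last item can be shown with a small sketch. The segment layout is made up for illustration and the `mixed_permission_pages` helper is hypothetical: it lists host pages that would need more than one permission set because a 4KB-aligned segment boundary falls inside them.

```python
def mixed_permission_pages(segments, page_size):
    """Return base addresses of host pages covered by segments whose
    permissions differ. Each segment is (start, end, perms), end exclusive."""
    perms_by_page = {}
    for start, end, perms in segments:
        first_page = start - start % page_size
        for page in range(first_page, end, page_size):
            perms_by_page.setdefault(page, set()).add(perms)
    return sorted(p for p, s in perms_by_page.items() if len(s) > 1)


# Made-up layout: text ends and data begins at a 4KB-aligned address (0x5000).
SEGMENTS = [
    (0x0000, 0x5000, "RX"),  # text
    (0x5000, 0x9000, "RW"),  # data
]

with_4k_pages = mixed_permission_pages(SEGMENTS, 4 * 1024)    # no conflicts
with_16k_pages = mixed_permission_pages(SEGMENTS, 16 * 1024)  # page 0x4000
```

With 16KB host pages, the page starting at 0x4000 holds both text and data in this layout, so it cannot be mapped as just RX or just RW.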

That said, we could consider implementing it in the future.

Testing

Since this is a completely new JIT, there are a lot of things that have the potential to be broken, so testing is welcome. Note that it should be compared to the old JIT (hypervisor disabled). If the issue also happens with the old JIT, it is not a regression and it's probably not JIT related. 32-bit game testing is more important since we can't use the hypervisor for that, so problems there would be more visible.

Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.

github-actions · 2023-12-24T16:47:18Z

Download the artifacts for this pull request:

Experimental GUI (Avalonia)

GUI-less (SDL2)

Only for Developers

piplup55 · 2023-12-24T17:14:51Z

Nintendo 64 Apps now start on firmware 13, firmware 14 and above still crashes

Ryujinx_1.1.0+b264f3e_2023-12-24_17-05-49.log
Ryujinx_1.1.1100_2023-12-24_17-09-30.log
related issue #6021

merryhime · 2023-12-24T21:23:51Z

There is a difference of 1ulp for VRECPS/VRSQRTS results compared to the native instruction on Arm64. Not sure what is up with that...

On x64, this is because ARMeilleure's implementation of these instructions doesn't fuse mul-add/mul-sub.

MutantAura · 2023-12-27T13:38:19Z

Tears of the Kingdom seems to be crashing on boot with the new JIT.
Ryujinx_1.1.0+45478f4_2023-12-27_13-10-25.log
Ignore me, old JIT does this too. Not sure if it's easier to debug with the new one?

Other titles tested include...

BoTW:
Old JIT - 36FPS, 10s boot, 9W (@30fps)
New JIT - 40FPS, 4s boot, 7.7W (@30fps)
HV - 49FPS, 2s boot, 7.3W (@30fps)

Red Dead Redemption:
Old JIT - 30FPS, 3s boot, 15.2W (@30fps)
New JIT - 53FPS, 2.5s boot, 8.7W (@30fps)
HV - 56FPS, <1s boot, 8.3W (@30fps)

Pokemon Mystery Dungeon: Rescue Team DX:
Old JIT - 47FPS, 8s boot, 7.9W (@30fps)
New JIT - 64FPS, 1.5s boot (fine), 4.8W (@30fps)
HV - 67FPS, <1s boot (inconsistent), 4.6W (@30fps)

Mario Kart 8 Deluxe:
Old JIT - 60FPS cap, 9s boot, 8.7W (@60fps)
New JIT - 60FPS cap, 1.5s boot, 7.7W (@60fps)

tagged random dude sorry

gdkchan · 2023-12-27T13:47:45Z

Tears of the Kingdom seems to be crashing on boot with the new JIT.
Ryujinx_1.1.0+45478f4_2023-12-27_13-10-25.log
Ignore me, old JIT does this too. Not sure if it's easier to debug with the new one?

IIRC it's due to the hacks to make host mapped memory manager mode work with 16KB pages.

iMonZ · 2023-12-28T02:47:27Z

Wow thank you so much for your hard work!

BoofOof32 · 2023-12-29T06:28:35Z

I have a Thinkpad X13s, one of the rare few Windows on ARM laptops that can reasonably run Linux.
I've tested Ryujinx on this device before, and got varying results

On Vulkan with PPTC disabled on both master and PR build

SoC: Snapdragon 8cx Gen 3
RAM: 32GB LPDDR4x 4266MHz
GPU: Adreno 690
Driver: Mesa 23.2.1-ubuntu3.1
Distro: Ubuntu 23.10

Good Job!
Old JIT - 30s boot time
New JIT - 17s boot time
Mario Kart 8 Deluxe
Old JIT - 22s boot time
New JIT - 12s boot time
Mario Party Superstars
Old JIT - perpetually loads
New JIT - 13s boot time
Mario + Rabbids Kingdom Battle
Old JIT - 41s boot time (22-28fps)
New JIT - 25s boot time (roughly same perf)
Metroid Prime Remastered
Old JIT - 12s boot time
New JIT - 10s boot time
Red Dead Redemption
Old JIT - 42s boot time (4fps)
New JIT - 36s boot time
Splatoon 2
Old JIT - 21s boot time
New JIT - 18s boot time
Super Mario 3D World + Bowser's Fury
Old JIT - 9s boot time
New JIT - 5s boot time
Super Mario Odyssey
Old JIT - 20s boot time
New JIT - 17s boot time
Super Mario Party
Old JIT - 32s boot time (40-60fps)
New JIT - 13s boot time (60fps)
Super Smash Bros. Ultimate
Old JIT - 28s boot time
New JIT - 11s boot time
Tears of The Kingdom
Old JIT - 42s boot time
New JIT - 17s boot time

(I measure boot time as time between clicking a game and getting to the menu, so if the time looks long, it's probably because the intro is long)

This is not accounting for the poor performance or major graphical errors present on this device on most titles (either Turnip not knowing what to do with Ryujinx, or Ryujinx just not being optimized for Adreno, or a mix of both). The JIT doesn't really affect performance apart from maybe 1 or 2 games of mine, but the difference is felt even on this random laptop.

iMonZ · 2023-12-29T07:36:06Z

Luigi's Mansion 3 now works for me! Thanks.
What is still missing before a merge?

mrcmunir · 2024-01-03T13:16:59Z

Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.

I found that if the address space allocated by the emulator is smaller than what the guest application requires, the emulator crashes with a bus error.
On which line is the allocated space defined?
This problem only affects the Host and the fastest Unsafe options; it does not happen with the old JIT under software memory management.

gdkchan · 2024-01-04T03:10:47Z

Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.
I found that if the address space allocated by the emulator is smaller than what the guest application requires, the emulator crashes with a bus error. On which line is the allocated space defined? This problem only affects the Host and the fastest Unsafe options; it does not happen with the old JIT under software memory management.

Which game does this?
It adjusts the size of various memory regions based on the address space size here: https://github.com/Ryujinx/Ryujinx/blob/master/src/Ryujinx.HLE/HOS/Kernel/Memory/KPageTableBase.cs#L214
Some games might break as a result of that.
I guess you're not testing this on macOS? Since the address space size is 48-bit on macOS, it should never need to shrink the guest address space.

riperiperi

This is an absolutely insane amount of work, major kudos... and for A32 too. Compilation is definitely way faster, and loading times seem much shorter than ARMeilleure with PPTC in general. We would probably get some fantastic numbers for in-game performance if the GPU weren't the bottleneck. It'll be a big improvement when a system has weaker multicore performance, even with a GPU bottleneck.

I've read through all files, some have just been skimmed because I'd be cross referencing a datasheet otherwise. My only complaint is that InstEmit for A32 is super long, but I don't see a better way to do it and keep the instruction table readable. A64 approach is really elegant since most instructions can just rewrite the in/out registers.

I've tested across a number of games and haven't seen any problems yet. Note that I haven't really played games for an extensive period, so I'm unsure if any of the HV softlocks or similar are present in any form (they definitely don't show up in the first 15 minutes).

src/Ryujinx.Cpu/LightningJit/Cache/CacheEntry.cs

src/Ryujinx.Cpu/LightningJit/Cache/NoWxCache.cs

src/Ryujinx.Cpu/LightningJit/Arm32/Target/Arm64/InstEmitSystem.cs

src/Ryujinx.Cpu/LightningJit/Arm32/Decoder.cs

src/Ryujinx.Cpu/LightningJit/Arm64/Target/Arm64/InstEmitSystem.cs

src/Ryujinx.Cpu/LightningJit/Arm32/Target/Arm64/InstEmitNeonMemory.cs

MutantAura · 2024-01-18T19:56:01Z

I'm unsure if any of the HV softlocks or similar are present in any form

After an hour of running in a circle it does look like at least this instance of HV softlock also occurs in JIT.

As usual nothing strange in logs.
Ryujinx_1.1.0+5756661_2024-01-18_19-09-37.log

Breath of the Wild does not experience the same HV deadlocks that occurred before the ordering revert and seems very stable.

MetrosexualGarbodor · 2024-01-19T10:20:10Z

Closes #1884?

The basic idea is to support "direct" execution of the simple ARM instructions to make the code work faster on the "native" platform

gdkchan · 2024-01-19T15:29:15Z

Closes #1884?

The basic idea is to support "direct" execution of the simple ARM instructions to make the code work faster on the "native" platform

That request is not very clear. I'm not sure what would be categorized as a "simple" ARM instruction. It sounds more like a request for something that can run the code directly like hypervisor, or something like "NCE".

marysaka

Impressive work

* Implement a new JIT for Arm devices * Auto-format * Make a lot of Assembler members read-only * More read-only * Fix more warnings * ObjectDisposedException.ThrowIf * New JIT cache for platforms that enforce W^X, currently unused * Remove unused using * Fix assert * Pass memory manager type around * Safe memory manager mode support + other improvements * Actual safe memory manager mode masking support * PR feedback

github-actions bot added cpu Related to ARMeilleure horizon Related to Ryujinx.HLE labels Dec 24, 2023

ryujinx-mako bot requested review from AcK77, LDj3SNuD, marysaka, riperiperi, TSRBerry and a team December 24, 2023 15:37

gdkchan mentioned this pull request Dec 28, 2023

Revert Apple hypervisor force ordered memory change #6068

Merged

Shihta mentioned this pull request Dec 28, 2023

Mario + Rabbids® Sparks of Hope - 0100317013770000 Ryujinx/Ryujinx-Games-List#4089

Open

gdkchan mentioned this pull request Jan 14, 2024

Move most of signal handling to Ryujinx.Cpu project #6128

Merged

riperiperi approved these changes Jan 17, 2024

gdkchan added 8 commits January 18, 2024 14:18

Implement a new JIT for Arm devices

0a9d524

Auto-format

da74694

Make a lot of Assembler members read-only

5bb867e

More read-only

e66fcb7

Fix more warnings

aa4e5a2

ObjectDisposedException.ThrowIf

771ec73

New JIT cache for platforms that enforce W^X, currently unused

c7e453f

Remove unused using

3e40d8e

gdkchan added 5 commits January 18, 2024 14:23

Fix assert

e6981a0

Pass memory manager type around

7c5084d

Safe memory manager mode support + other improvements

ab82135

Actual safe memory manager mode masking support

c7025c8

PR feedback

0915127

gdkchan force-pushed the fast-jit-is-no-joke branch from aa0af01 to 0915127 Compare January 18, 2024 17:24

Shihta mentioned this pull request Jan 19, 2024

Super Mario RPG - 0100BC0018138000 Ryujinx/Ryujinx-Games-List#4811

Open

marysaka approved these changes Jan 20, 2024

gdkchan merged commit 427b7d0 into Ryujinx:master Jan 20, 2024
9 checks passed

gdkchan deleted the fast-jit-is-no-joke branch January 20, 2024 14:11

gdkchan mentioned this pull request Mar 31, 2024

Disable hypervisor by default on macOS #6585

Open
