Implement a new JIT for Arm devices #6057

Merged
merged 13 commits into from Jan 20, 2024

gdkchan
Member

@gdkchan gdkchan commented Dec 24, 2023

Overview

This is a new JIT compiler that is designed for Arm CPUs. The main reason for creating a new JIT is that the current one (ARMeilleure) can't easily take advantage of the fact that when translating Arm code to another Arm CPU, most code can remain untouched since it's the same architecture.

The new JIT skips several steps that the current JIT would do, and as a result it can produce code much faster (and with even better quality).

It is important to note that the new JIT currently does not support x86 hosts, and there is no plan for x86 support. While I do think we can add support for other RISC architectures in the future, I think x86 is different enough from Arm that it would not be worth it, and the approach used by the current JIT is good enough.

Motivation

Arm devices are becoming more popular, and we want to ensure we have good support for those devices in the future. Having a decent JIT that can target Arm CPUs is important for that. While it is always possible to improve the existing JIT, making it the best it can be would require a completely different approach, one that can take maximum advantage of the similarity between Arm and... well, Arm.

While on macOS we can use the hypervisor to run the guest code directly, there are other platforms where we don't have access to a hypervisor and would still depend on the JIT.

How it works

On both the 32-bit and 64-bit modes, the original register allocation is preserved. This has several benefits:

  • Easier debugging as the generated code can be compared to the original without the need for a mental remapping of the registers.
  • Much faster compilation: register allocation is slow, even more so if you use an algorithm that produces good allocations.
  • Generally lower code size and better code. The original binary most likely was compiled using a good register allocation with no time constraints, so it already has good allocation.

For the 32-bit mode, this is pretty simple since AArch64 has (almost) 32 general purpose registers, while AArch32 has (almost) 16. So we have double the number of registers to work with. Registers W0-W14 are mapped to 32-bit registers R0-R14 directly. R15 is special (PC); it is also mapped to W15, but handled in a special way.

The 64-bit mode is more challenging, because the number of general purpose registers is the same in this case. Some of those registers are special (SP, LR, FP, etc). We can't use special registers in translated guest code directly. So instead, if a function uses one of those special registers, it is remapped to a regular register that is not in use in that function. If there are not enough unused registers available in the function to remap all special registers, the function is truncated at the point where the number of registers reaches the limit. Truncated functions are basically split into smaller "sub-functions". At the end of the truncated block, it jumps to the next part of the function, which has been translated separately. It would also have been possible to spill the registers instead, but it would be considerably more complex for little benefit. The number of truncated functions is very small, usually around 2%. Currently only one register is reserved as a temporary register. Increasing the number of temporary registers it needs to reserve would also increase the number of truncated functions.
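The remap-or-truncate idea can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: registers are plain integers, `reserved` stands for whichever host registers the translated code must not touch, and the names `build_mapping` and `split_and_remap` are hypothetical. It also assumes that a single instruction never needs more registers than are allocatable.

```python
def build_mapping(used, reserved, allocatable):
    """Map every guest register in `used` to a host register outside `reserved`."""
    # Host registers that this part of the function does not use at all.
    free = [reg for reg in allocatable if reg not in used]
    mapping = {}
    for reg in sorted(used):
        # A guest register that collides with a reserved host register is
        # moved to a free one; every other register keeps its number.
        mapping[reg] = free.pop(0) if reg in reserved else reg
    return mapping


def split_and_remap(instructions, reserved, total_regs=31):
    """Split a function into parts that each fit in the allocatable registers.

    `instructions` is a list of sets with the guest registers each
    instruction touches. Returns a list of (start, end, mapping) tuples,
    where `end` is exclusive.
    """
    allocatable = [reg for reg in range(total_regs) if reg not in reserved]
    parts, start, used = [], 0, set()
    for index, regs in enumerate(instructions):
        # Truncate just before the instruction that would exceed the limit.
        if used and len(used | regs) > len(allocatable):
            parts.append((start, index, build_mapping(used, reserved, allocatable)))
            start, used = index, set()
        used |= regs
    parts.append((start, len(instructions), build_mapping(used, reserved, allocatable)))
    return parts


# Toy machine with 8 registers, of which 6 and 7 are reserved by the host.
toy_function = [{0, 1}, {6, 2}, {7, 3}, {4, 5}, {0}]
for part in split_and_remap(toy_function, reserved={6, 7}, total_regs=8):
    print(part)
```

In the toy run, the first three instructions fit (guest registers 6 and 7 move to the unused registers 4 and 5), and the fourth instruction starts a new sub-function because it would push the register count past the limit.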

Like ARMeilleure, this JIT is capable of translating entire functions (rather than a single basic block). However, it may not be able to detect the end of a function perfectly. This is not really a problem, but it can cause overlapping parts of the same function to appear in the JIT cache. It can also potentially have performance implications. The function block decoding is more limited than the one in ARMeilleure: the new JIT does not extend functions backwards (if there is a backwards jump, from a loop for example), or follow forward jumps that are too far away. That said, it can handle most common cases, and can handle loops just fine. As long as the function is well formed and doesn't have weird control flow, it can detect the entire function.

The guest state is loaded at the entry point of the function, and saved at all exit points. When another guest function is called, the guest state is saved to the context. When the call returns, the address of the next function to execute (returned in X0) is compared with the next address this function would execute. If they match, the state is loaded from the context again and execution continues; otherwise it returns, until it finds the right function or reaches the dispatcher.

Like ARMeilleure, it performs data flow analysis to find the registers that are used and modified at each entry/exit point of the function. That way it only needs to save and restore from context what is really used. However, the results might not be optimal, as this is not a simple problem to solve. The data flow analysis is also currently only available for Arm64 mode; Arm32 instead always saves and restores all registers that are used in the entire function. This is simpler but produces worse results. The reason is that I planned to skip saving/restoring the context entirely for Arm32 (as it can just keep the state in registers all the time). But in the end, I didn't do it and here we are.

All register usage information is derived from the instruction table. The table says which registers each instruction uses, and the translator uses that to extract the register numbers from the instruction itself, and build register usage information. This is also used for register remapping. Since Arm32 is not doing data flow, I also did not include most of the register usage information in the instruction table; instead it only tracks register writes (which is required to know if an instruction changes the PC register, which is considered a jump).
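To make the difference between the two strategies concrete, here is a minimal sketch for a single straight-line block. It is an assumption-laden simplification: the real analysis works across control flow, and the function names and the `(reads, writes)` representation are hypothetical rather than taken from this PR.

```python
def context_traffic(block):
    """Return (registers to load at entry, registers to save at exit).

    `block` is a list of (reads, writes) pairs of register-number sets,
    one pair per instruction, in program order.
    """
    to_load, written = set(), set()
    for reads, writes in block:
        # A register only needs loading if it is read before this block
        # has produced a value for it.
        to_load |= set(reads) - written
        written |= set(writes)
    # Only registers the block actually modified need to be saved.
    return to_load, written


def whole_function_usage(block):
    """Coarser strategy: treat every register that appears as both loaded and saved."""
    used = set()
    for reads, writes in block:
        used |= set(reads) | set(writes)
    return used


block = [({1, 2}, {0}), ({0, 3}, {1}), ({1}, {4})]
print(context_traffic(block))       # precise load and save sets
print(whole_function_usage(block))  # everything that appears anywhere
```

For this block the precise version loads three registers and saves three, while the coarse version loads and saves all five, which is the kind of extra traffic the Arm32 path currently pays for.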

Speaking of the instruction table, it uses an approach similar to ARMeilleure for finding the instruction from the opcode. One difference is the way constraints are handled. In ARMeilleure, they are generally handled wherever it's most convenient: sometimes in the table, sometimes while decoding the instruction fields, sometimes in the implementation itself. If necessary, it would call the Undefined instruction implementation when a specific encoding is undefined. In the new JIT, all constraints are added to the table. To make this possible, each entry in the table has a blacklist of encodings that are not allowed. So if a specific encoding is "Undefined" or another instruction, it is added to that list. This approach allows us to know if an instruction is undefined up-front. It also allows filtering out instructions that require specific extensions or ISA versions if we need to.
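A table entry with a list of excluded encodings could look roughly like the sketch below. The 8-bit opcodes, the entry names and the `Encoding`/`TableEntry`/`lookup` helpers are all made up for illustration; they are not real Arm encodings and not the data structures used in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    mask: int
    value: int

    def matches(self, opcode: int) -> bool:
        return (opcode & self.mask) == self.value


@dataclass(frozen=True)
class TableEntry:
    name: str
    encoding: Encoding
    # Encodings that match the pattern above but are not this instruction
    # (either undefined, or claimed by a different table entry).
    excluded: tuple = ()


def lookup(table, opcode):
    """Return the instruction name, or None if the opcode is undefined."""
    for entry in table:
        if entry.encoding.matches(opcode) and not any(
            bad.matches(opcode) for bad in entry.excluded
        ):
            return entry.name
    return None


# Toy 8-bit instruction set: "ADD" owns 0x1_, except when the two low bits are 0b11.
TABLE = [
    TableEntry("ADD", Encoding(0xF0, 0x10), excluded=(Encoding(0x03, 0x03),)),
    TableEntry("SPECIAL", Encoding(0xFF, 0x13)),
    TableEntry("MOV", Encoding(0xF0, 0x20)),
]

for opcode in (0x12, 0x13, 0x17, 0x25):
    print(hex(opcode), lookup(TABLE, opcode))
```

Because the exclusions live in the table, `lookup` alone is enough to tell that 0x17 is undefined, without running any per-instruction decoding code.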

Performance improvements

I didn't really notice frame rate improvements in the few tests I did, but there are some noticeable improvements in startup time and code size. The measurements below were done on a Ryzen 9 7900X CPU (with the JIT generating Arm64 code; obviously it can't actually execute it, being an x86 CPU).

The chart below shows the measurement of the time taken to compile all the game code in the main binary. The games used for the test are Shantae and the Pirate's Curse, Neptunia Game Maker R:Evolution and New Super Mario Bros U Deluxe.

image

Neptunia has the largest code size among them, so naturally it was also the one that took the most time to compile. The code size when uncompressed is 75 MB. Shantae's code size is 2 MB, and New Super Mario Bros U Deluxe's is 9 MB.

image

Here we can see the code size increase ratio when compared to the original code. A ratio of 1x means that it has the same size as the original code, 2x would be double the size, etc. Generally, the new JIT can produce code that is between 4x and 5x the size of the original (in a few cases managing to be around the 3.5x mark), while ARMeilleure is between 7x and 8x with HighCq mode. ARMeilleure also has the LowCq mode, which trades code quality for compilation speed, and as a result, is faster but produces significantly worse code (usually being 16x or 17x larger than the original).

The takeaway is that the new JIT can produce better code, and do so much faster than the old JIT. The most dramatic difference is seen on Neptunia, which has the largest code size. It took 4 minutes and 26 seconds to compile with the old JIT on HighCq mode, while on the new JIT it took only 4.64 seconds. The old JIT produced 633MB of code in the HighCq mode, while the new one produced 348MB. This is almost half the size, while taking a fraction of the time.
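For reference, the ratios implied by the Neptunia figures quoted above can be worked out directly. The snippet only uses numbers from this description (75 MB of original code, 633 MB and 348 MB of generated code, 4 minutes 26 seconds versus 4.64 seconds); the variable names are arbitrary.

```python
original_mb = 75                   # uncompressed guest code
old_jit_mb = 633                   # ARMeilleure, HighCq
new_jit_mb = 348                   # new JIT
old_jit_seconds = 4 * 60 + 26      # 4 minutes and 26 seconds
new_jit_seconds = 4.64

old_ratio = old_jit_mb / original_mb
new_ratio = new_jit_mb / original_mb
relative_size = new_jit_mb / old_jit_mb
speedup = old_jit_seconds / new_jit_seconds

print(f"old JIT code size: {old_ratio:.2f}x the original")
print(f"new JIT code size: {new_ratio:.2f}x the original")
print(f"new output is {relative_size:.0%} of the old output")
print(f"compilation is about {speedup:.0f}x faster")
```

This gives roughly 8.4x versus 4.6x the original size, an output about 55% the size of the old one, and a compilation speedup of about 57x for this game.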

For 32-bit games, the code size difference is not as large for a few reasons:

  • As mentioned before, it is not doing data flow analysis for more precise register usage information, which means in many cases, it saves and restores registers that are not actually used on a given control flow path.
  • 32-bit code does not map as naturally to Arm64 as 64-bit code does. That means it requires more instructions per guest instruction.

Regardless, it still manages to win against the old JIT on code size, even with those limitations. It is also much faster, with New Super Mario Bros U Deluxe taking 8.7 seconds to translate in LowCq mode, 17.3 in HighCq, and 3 seconds with the new JIT.

The much faster compilation means that we do not need PPTC at all with the new JIT, and can avoid all the problems that come with it. It means no more disk space used for PPTC caches, no more issues with mods that change code, and no more slow-ish first-time gameplay experience. To reiterate what I said earlier, those benefits are only for Arm CPUs; this JIT does not support x86.

There are visible improvements on boot times. Below we can see a comparison with Mario Kart 8 Deluxe.

Old JIT, PPTC disabled:

mk8d_oldjit_no_ptc.mp4

Old JIT, PPTC enabled:

mk8d_oldjit_with_ptc.mp4

New JIT:

mk8d_newjit.mp4

Not surprisingly, the new JIT can beat the old one with PPTC disabled. Even with PPTC enabled, it manages to be a little bit faster (by roughly 2 seconds). One thing I did notice is that the second boot seems to be noticeably faster than the first one with the new JIT. I did not have time to investigate why yet, but it might be the cost of the .NET JIT. For all tests above, I ran the game at least once beforehand to ensure .NET JIT compilation time is not showing.

Compatibility improvements

There are some games that did not work with the old JIT on Arm. They did work fine with hypervisor on macOS, so it's not a big deal, but it could be problematic for other platforms if we used JIT for them.

Luigi's Mansion 3, which would crash before showing the title screen:
Screenshot 2023-12-24 at 02 43 06
Pokémon Scarlet, which would softlock before getting in-game:
Screenshot 2023-12-24 at 02 42 26

Future improvements

Both Arm32 and Arm64 can be improved in the future:

  • Investigate improvements to register state load/store.
  • Re-use pointers for memory access when the guest register is the same.
  • Some peephole optimization opportunities on Arm32.
  • Endian flag is currently ignored on Arm32. We should implement it in the future, but no game actually uses this. The old JIT does not actually use it in all places either (for example, it is missing for VLDn/VSTn instructions, and also VLDM/VSTM iirc).

While working on this, I also found some problems on the old JIT:

  • NEON comparison instructions (VCGE_Z and friends) ignore the size field, and assume that the floating point size is FP32 even when it is FP16. Not a big deal since the Switch CPU does not support FP16, but it should be an Undefined instruction.
  • VMAXNM and friends ignore the sz field. Just like the above.
  • There is a difference of 1ulp for VRECPS/VRSQRTS results compared to the native instruction on Arm64. Not sure what is up with that...
  • VTBL and VTBX appear to produce wrong results if the table and destination registers are the same. Weirdly enough, it seems that this is intentional (see https://github.com/Ryujinx/Ryujinx/blob/master/src/ARMeilleure/Instructions/InstEmitSimdMove32.cs#L294C25-L294C41). x86 has an SSSE3 path that might not have this issue, but Arm has no fast path.
  • VLD2_1/VST2_1 with size = 0 and index_align<0> = 1. The register increment value is incorrect in this case, but I was not able to test it against Unicorn (it just freezes when I add this test case, for some reason...)

Alternative approaches

Switch emulators on Android have been using an approach that they call "NCE" (Native Code Execution). It can run the guest code (almost) directly without hypervisor, but it comes with limitations (or rather, the need to hack things around to make it work):

  • We can't run system instructions such as SVC, MRS, MSR etc directly, so patching the code to make it call the emulator is required. Those modifications are visible to the game.
  • Interrupting the guest threads becomes somewhat complicated, since we can't insert "interruption points" in the code. We can use pthread_kill on Unix-like OSes, but Windows has no such thing.
  • Worse by far, the code will access the emulator address space directly, so we need to set it up in a way that makes the game happy.

The last one deserves its own section given the number of problems it introduces:

  • On many devices, the address space size is 39-bit. The Switch also has a 39-bit address space, which means it's not possible to fit it within the emulator address space. We need to shrink the guest address space to make it fit, which has the potential to break games as we differ from the Switch. This problem also affects the JIT when using the host mapped memory manager modes.
  • The allocated guest region must be contained entirely inside the guest address space. This means that 36-bit games won't work, as we most likely won't be able to allocate 36-bit worth of virtual memory right at the start of the emulator address space. There are not many of those, so it is not considered a big deal.
  • For platforms with 16KB page size, there are additional challenges as we can't map the text segment and data segments as RX and RW respectively, since they are 4KB aligned, not 16KB aligned. We also can't just map them all as RWX if the platform has W^X. Making this work requires more hacks.
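The 16KB page size problem in the last point can be illustrated with a small check that looks for host pages shared by segments with different permissions. The segment addresses below are invented for the example, and `pages_with_mixed_permissions` is a hypothetical helper, not part of the emulator.

```python
def pages_with_mixed_permissions(segments, page_size):
    """Return start addresses of host pages covered by more than one permission.

    `segments` is a list of (start, end, permissions) tuples, `end` exclusive.
    """
    permissions_by_page = {}
    for start, end, permissions in segments:
        first_page = start // page_size
        last_page = (end - 1) // page_size
        for page in range(first_page, last_page + 1):
            permissions_by_page.setdefault(page, set()).add(permissions)
    return sorted(
        page * page_size
        for page, permissions in permissions_by_page.items()
        if len(permissions) > 1
    )


# Hypothetical layout: text ends and data begins on a 4KB boundary (0x5000).
segments = [(0x0000, 0x5000, "RX"), (0x5000, 0x9000, "RW")]

print(pages_with_mixed_permissions(segments, page_size=0x1000))  # 4KB host pages
print(pages_with_mixed_permissions(segments, page_size=0x4000))  # 16KB host pages
```

With 4KB host pages no page is shared, but with 16KB host pages the page starting at 0x4000 holds both text and data, so it cannot be mapped as RX and RW at the same time.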

That said, we could consider implementing it in the future.

Testing

Since this is a completely new JIT, there are a lot of things that have the potential to be broken, so testing is welcome. Note that it should be compared to the old JIT (hypervisor disabled). If the issue also happens with the old JIT, it is not a regression and it's probably not JIT related. 32-bit game testing is more important since we can't use the hypervisor for that, so problems there would be more visible.

Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.

@github-actions bot added the cpu (Related to ARMeilleure) and horizon (Related to Ryujinx.HLE) labels on Dec 24, 2023
@piplup55
Contributor

Nintendo 64 apps now start on firmware 13; firmware 14 and above still crash
image
Ryujinx_1.1.0+b264f3e_2023-12-24_17-05-49.log
Ryujinx_1.1.1100_2023-12-24_17-09-30.log
related issue #6021

@merryhime
Contributor

merryhime commented Dec 24, 2023

There is a difference of 1ulp for VRECPS/VRSQRTS results compared to the native instruction on Arm64. Not sure what is up with that...

On x64, this is because ARMeilleure's implementation of these instructions doesn't fuse mul-add/mul-sub.

@MutantAura
Collaborator

MutantAura commented Dec 27, 2023

Tears of the Kingdom seems to be crashing on boot with the new JIT.
Ryujinx_1.1.0+45478f4_2023-12-27_13-10-25.log

Ignore me, old JIT does this too. Not sure if it's easier to debug with the new one?

Other titles tested include...

BoTW:
Old JIT - 36FPS, 10s boot, 9W (@30fps)
New JIT - 40FPS, 4s boot, 7.7W (@30fps)
HV - 49FPS, 2s boot, 7.3W (@30fps)

Red Dead Redemption:
Old JIT - 30FPS, 3s boot, 15.2W (@30fps)
New JIT - 53FPS, 2.5s boot, 8.7W (@30fps)
HV - 56FPS, <1s boot, 8.3W (@30fps)

Pokemon Mystery Dungeon: Rescue Team DX:
Old JIT - 47FPS, 8s boot, 7.9W (@30fps)
New JIT - 64FPS, 1.5s boot (fine), 4.8W (@30fps)
HV - 67FPS, <1s boot (inconsistent), 4.6W (@30fps)

Mario Kart 8 Deluxe:
Old JIT - 60FPS cap, 9s boot, 8.7W (@60fps)
New JIT - 60FPS cap, 1.5s boot, 7.7W (@60fps)

tagged random dude sorry

@gdkchan
Member Author

gdkchan commented Dec 27, 2023

Tears of the Kingdom seems to be crashing on boot with the new JIT.
Ryujinx_1.1.0+45478f4_2023-12-27_13-10-25.log
Ignore me, old JIT does this too. Not sure if it's easier to debug with the new one?

IIRC it's due to the hacks to make host mapped memory manager mode work with 16KB pages.

@iMonZ

iMonZ commented Dec 28, 2023

Wow thank you so much for your hard work!

@BoofOof32

I have a Thinkpad X13s, one of the rare few Windows on ARM laptops that can reasonably run Linux.
I've tested Ryujinx on this device before, and got varying results

On Vulkan with PPTC disabled on both master and PR build

SoC: Snapdragon 8cx Gen 3
RAM: 32GB LPDDR4x 4266MHz
GPU: Adreno 690
Driver: Mesa 23.2.1-ubuntu3.1
Distro: Ubuntu 23.10
  • Good Job!
    Old JIT - 30s boot time
    New JIT - 17s boot time

  • Mario Kart 8 Deluxe
    Old JIT - 22s boot time
    New JIT - 12s boot time

  • Mario Party Superstars
    Old JIT - perpetually loads
    New JIT - 13s boot time

  • Mario + Rabbids Kingdom Battle
    Old JIT - 41s boot time (22-28fps)
    New JIT - 25s boot time (roughly same perf)

  • Metroid Prime Remastered
    Old JIT - 12s boot time
    New JIT - 10s boot time

  • Red Dead Redemption
    Old JIT - 42s boot time (4fps)
    New JIT - 36s boot time

  • Splatoon 2
    Old JIT - 21s boot time
    New JIT - 18s boot time

  • Super Mario 3D World + Bowser's Fury
    Old JIT - 9s boot time
    New JIT - 5s boot time

  • Super Mario Odyssey
    Old JIT - 20s boot time
    New JIT - 17s boot time

  • Super Mario Party
    Old JIT - 32s boot time (40-60fps)
    New JIT - 13s boot time (60fps)

  • Super Smash Bros. Ultimate
    Old JIT - 28s boot time
    New JIT - 11s boot time

  • Tears of The Kingdom
    Old JIT - 42s boot time
    New JIT - 17s boot time

(I measure boot time as time between clicking a game and getting to the menu, so if the time looks long, it's probably because the intro is long)

This is not accounting for the poor performance or major graphical errors present on this device in most titles (either Turnip not knowing what to do with Ryujinx, or Ryujinx just not being optimized for Adreno, or a mix of both). The JIT doesn't really affect performance apart from maybe 1 or 2 games of mine, but the difference is felt even on this random laptop.

@iMonZ

iMonZ commented Dec 29, 2023

Luigi's Mansion 3 now works for me! Thanks.
What is still missing before a merge?

@mrcmunir

mrcmunir commented Jan 3, 2024

Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.

I found that if the address space allocated by the emulator is smaller than what the guest application requires, the emulator crashes with a bus error.
On which line is the allocated space defined?
This problem only affects the Host and fastest (Unsafe) options; it does not happen with the old JIT under the software memory manager.

@gdkchan
Member Author

gdkchan commented Jan 4, 2024

Note that the new JIT does not support the software memory manager mode, so it should be tested with the host mapped mode. Using the software mode will still use the old JIT.

I found that if the address space allocated by the emulator is smaller than what the guest application requires, the emulator crashes with a bus error. On which line is the allocated space defined? This problem only affects the Host and fastest (Unsafe) options; it does not happen with the old JIT under the software memory manager.

Which game does this?
It adjusts the size of various memory regions based on the address space size here: https://github.com/Ryujinx/Ryujinx/blob/master/src/Ryujinx.HLE/HOS/Kernel/Memory/KPageTableBase.cs#L214
Some games might break as a result of that.
I guess you're not testing this on macOS? Since the address space size is 48-bit on macOS, it should never need to shrink the guest address space.

Member

@riperiperi riperiperi left a comment

This is an absolutely insane amount of work, major kudos... and for A32 too. Compilation is definitely way faster, and loading times seem much shorter than ARMeilleure with PPTC in general. We would probably get some fantastic numbers for in-game performance if the GPU weren't the bottleneck. It'll be a big improvement when a system has weaker multicore performance even with a GPU bottleneck.

I've read through all files, some have just been skimmed because I'd be cross referencing a datasheet otherwise. My only complaint is that InstEmit for A32 is super long, but I don't see a better way to do it and keep the instruction table readable. A64 approach is really elegant since most instructions can just rewrite the in/out registers.

I've tested across a number of games and haven't seen any problems yet. Note that I haven't really played games for an extensive period, so I'm unsure if any of the HV softlocks or similar are present in any form (they definitely don't show up in the first 15 minutes).

@MutantAura
Collaborator

MutantAura commented Jan 18, 2024

I'm unsure if any of the HV softlocks or similar are present in any form

After an hour of running in a circle it does look like at least this instance of HV softlock also occurs in JIT.
image

As usual nothing strange in logs.
Ryujinx_1.1.0+5756661_2024-01-18_19-09-37.log

Breath of the Wild does not experience the same HV deadlocks that occurred before the ordering revert and seems very stable.

@MetrosexualGarbodor
Collaborator

Closes #1884?

The basic idea is to support "direct" execution of the simple ARM instructions to make the code work faster on the "native" platform

@gdkchan
Member Author

gdkchan commented Jan 19, 2024

Closes #1884?

The basic idea is to support "direct" execution of the simple ARM instructions to make the code work faster on the "native" platform

That request is not very clear. I'm not sure what would be categorized as a "simple" ARM instruction. It sounds more like a request for something that can run the code directly like hypervisor, or something like "NCE".

Contributor

@marysaka marysaka left a comment

Impressive work

@gdkchan gdkchan merged commit 427b7d0 into Ryujinx:master Jan 20, 2024
9 checks passed
@gdkchan gdkchan deleted the fast-jit-is-no-joke branch January 20, 2024 14:11
amurgshere pushed a commit to amurgshere/Ryujinx that referenced this pull request Jan 28, 2024
* Implement a new JIT for Arm devices

* Auto-format

* Make a lot of Assembler members read-only

* More read-only

* Fix more warnings

* ObjectDisposedException.ThrowIf

* New JIT cache for platforms that enforce W^X, currently unused

* Remove unused using

* Fix assert

* Pass memory manager type around

* Safe memory manager mode support + other improvements

* Actual safe memory manager mode masking support

* PR feedback
Labels: cpu (Related to ARMeilleure), horizon (Related to Ryujinx.HLE)
10 participants