Skip to content

Write barrier without any RWX pages #114982

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 37 commits into from
May 15, 2025

Conversation

davidwrighton
Copy link
Member

@davidwrighton davidwrighton commented Apr 23, 2025

Add a new config switch DOTNET_UseGCWriteBarrierCopy to control whether or not to use a copy of the WriteBarrier code instead of assembly code embedded in the coreclr binary. Typically we do this to improve performance, but this change enables a new path where we just have the barrier as assembly code in the coreclr binary that does not require any RWX behavior. Also, with this change, the 1 copy of the code which is used for this scenario is shared between NativeAOT and coreclr.

This change will change the write barrier used in some situations on CoreCLR to be the same as the write barrier used on NativeAOT.

In addition, we now have a copy of NativeAOT variant of the write barrier for Linux X86, although the other NativeAOT support is not present.

Finally, the Linux Arm32 version of the NativeAOT write barrier could now be enabled to use software write watch, like CoreCLR does. This is not enabled in this change, although the code has been written into the write barrier (as it was needed for CoreCLR).

Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@@ -179,6 +179,10 @@ GLOBAL_LABEL RhpCheckedAssignRefAVLocation
LEAF_END RhpCheckedAssignRef\EXPORT_REG_NAME, _TEXT
.endm

LEAF_ENTRY RhpWriteBarriers, _TEXT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this would not work anyway... we compile all code on Apple platforms as .subsections_via_symbols and the Apple linker is free to reorder the functions (or discard unused ones if told to). Compiling without subsections_via_symbols is not an option due to various bugs in different linker versions. The only other option to keep the desired order is to mark the successive symbols as .alt_entry, which also happens to run into linker bugs if not done extremely carefully...

@am11 am11 mentioned this pull request May 1, 2025
.intel_syntax noprefix
#include "AsmMacros_Shared.h"

// TODO! This is implemented, but not tested.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll test it since I already have the setup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It passed some basic smoke tests. 👍🏿

Comment on lines +6 to +12
#if defined(__APPLE__)
// Currently the build is failing without this due to an issue if the first method in the assembly file has an alternate entry at the start of the file.
// Fix, but adding an empty, unused method
LEAF_ENTRY RhpWriteBarriersDoNotFailToBuild, _TEXT
ret
LEAF_END RhpWriteBarriersDoNotFailToBuild, _TEXT
#endif
Copy link
Member

@filipnavara filipnavara May 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For posterity, this was reported as FB14743667 to Apple. It should be fixed in Xcode 16.3. We should still keep the workaround until we bump the minimum Xcode requirement past this version.

The general issue is that the new linker (ld-prime) didn't correctly handle combination of symbols at overlapping addresses and alt-entry symbols. There are some cases where section symbols are implicitly generated at the beginning of each section in object file so it's easy to run into this. Old linker (ld64) doesn’t handle alt-entry symbols at a start of a section so it's problematic there too.

@davidwrighton davidwrighton changed the title [DRAFT] Write barrier without any RWX pages Write barrier without any RWX pages May 5, 2025
@davidwrighton
Copy link
Member Author

/azp list

@davidwrighton
Copy link
Member Author

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dotnet dotnet deleted a comment from azure-pipelines bot May 5, 2025
@davidwrighton davidwrighton marked this pull request as ready for review May 5, 2025 22:56
@davidwrighton davidwrighton requested a review from janvorli May 5, 2025 23:56
#endif

#ifdef TARGET_ARM
if ((writeBarrierAVLocations[i] | THUMB_CODE) == (uControlPc | THUMB_CODE))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are one of these missing the THUMB bit? I would assume that both would have it set.
Anyways, we have PCODEToPINSTR(addr) to strip the thumb bit if it is necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a bunch of unnecessary casts in this space. I'll get rid of them, and use PCODEToPINSTR, and see if it all works. If not, the fix will be simple..

ASSERT(*(uint8_t*)writeBarrierAVLocations[i] != 0xE9); // jmp XXXXXXXX
#endif

if (writeBarrierAVLocations[i] == PCODEToPINSTR(uControlPc))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still strange to me that the writeBarrierAVLocations[i] would not have the THUMB bit set when it is a pointer to thumb code. Are you sure it is really the case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They really don't have it:

nm ./artifacts/obj/coreclr/linux.arm.Debug/vm/wks/CMakeFiles/cee_wks_core.dir/__/__/runtime/arm/WriteBarriers.S.o
00000000 t $t
00000001 T RhpAssignRef
00000004 T RhpAssignRefAVLocation
00000004 T RhpAssignRefAvLocationr1
00000001 T RhpAssignRefr1
00000111 T RhpByRefAssignRef
00000114 T RhpByRefAssignRefAVLocation1
00000116 T RhpByRefAssignRefAVLocation2
00000077 T RhpCheckedAssignRef
0000007a T RhpCheckedAssignRefAVLocation
0000007a T RhpCheckedAssignRefAvLocationr1
00000077 T RhpCheckedAssignRefr1
         U __aeabi_unwind_cpp_pr0
         U g_card_table
         U g_ephemeral_high
         U g_ephemeral_low
         U g_highest_address
         U g_lowest_address
         U g_write_watch_table

I think it boils down to how ALTERNATE_ENTRY is defined:

.macro PATCH_LABEL Name
        .thumb_func
        .global C_FUNC(\Name)
C_FUNC(\Name):
.endm

.macro ALTERNATE_ENTRY Name
        .global C_FUNC(\Name)
        .type \Name, %function
C_FUNC(\Name):
.endm

I think the assembler treats it as ARM function label.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the macro doesn't have the .thumb_func, that's why. I wonder if that's intentional.

Copy link
Member

@filipnavara filipnavara May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's somewhat intentional. The definition is used in NativeAOT and there we are comparing it to already stripped instruction pointer. I don't have any strong option on unifying it one way or the other but when I was working on the linux-arm NativeAOT port I thought it makes sense to represent CODE_LOCATION as the actual memory location (ie. abstract out the Thumb weirdness early on)...

@davidwrighton davidwrighton requested a review from jkotas May 8, 2025 01:33
Copy link
Member

@am11 am11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! 🎉

Fixes #64253
Unblocks #115339 (the last HMF dependency)

cc @jkotas, @janvorli

@@ -15,14 +24,14 @@

LEAF_ENTRY RhpInterfaceDispatch\entries, _TEXT

// r11 currently contains the indirection cell address.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess that this was intentionally done first so that the data dependency below gets satisfied first. Is this really needed with the workaround above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the value of having a specific memory ordering between loading the MethodTable pointer and loading the cache block. I think we can safely load those in any order.

Copy link
Member

@jkotas jkotas May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is functionally correct to load them in any order.

I meant that the existing order might have been chosen intentionally as a performance micro-optimization, so that [r11 + OFFSETOF__InterfaceDispatchCell__m_pCache] indirection gets handled a bit sooner or be more likely to be parallelized with [rdi] indirection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the order and it still produced correct symbols and unwind info on Xcode 16.3 on ARM64, at least in the object file: https://gist.github.com/filipnavara/bdd6dc77727fa68782bdb314a117f6d3.

That's why I asked for specific Xcode version so I could get to the bottom of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I see the issue now in the linked .dylib:

[593] funcOffset=0x0065504C, encoding[  6]=0x02000000 (no frame, no saved registers                            ) _RhpInterfaceDispatch2
[594] funcOffset=0x00655050, encoding[  8]=0x00000000 (no frame, no saved registers                            ) _RhpInterfaceDispatchAVLocation2
[595] funcOffset=0x00655080, encoding[  6]=0x02000000 (no frame, no saved registers                            ) _RhpInterfaceDispatch4
[596] funcOffset=0x00655084, encoding[  8]=0x00000000 (no frame, no saved registers                            ) _RhpInterfaceDispatchAVLocation4

Copy link
Member

@filipnavara filipnavara May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. Both ld64 and ld-prime produce this. unwinddump treats the null opcode as continuation of the previous entry (in this case it prints "no frame, no saved registers", but if I force some DWARF sequence it still shows the DWARF pointer for both entries) has a bug where it's missing a case for the null opcode on ARM64 (the bug is not present for other archs). llvm-libunwind doesn't seem to have any special treatment for the null opcode.

In summary, this really does look like a linker bug. I'll do a bit more poking and file Apple feedback if necessary. One possible workaround in this specific case is to start a new unwind info at the alt_entry location by issuing .cfi_endproc in front of it and .cfi_startproc behind it. Unfortunately that doesn't work for the general case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed with Apple as FB17568654
Repro code: Archive.zip
Let's see if they can provide some clarification or workarounds.

#include "AsmMacros_Shared.h"

#if defined(__APPLE__)
// Currently the build is failing without this due to an issue if the first method in the assembly file has an alternate entry at the start of the file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not failing in NativeAOT build today?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I look at it, this probably isn't necessary in the amd64 write barrier since it wasn't complaining before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This leads me to wonder if maybe the issue on MacOS arm64 with having an alternate entry as the first symbol in the object file was that since our alternate entry macro doesn't set the thumb bit, the offset for the alternate entry was before that of the normal function symbol for the first symbol in the object file. Time for an experiment. Sigh.

Copy link
Member

@filipnavara filipnavara May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thumb bit is only relevant for arm32 so that should not matter here.

Also, the issue with first symbol in object file is known ld64 bug, see #114982 (comment).

@@ -261,6 +269,14 @@ LEAF_END RhpCheckedXchg, _TEXT
LEAF_ENTRY RhpByRefAssignRef, _TEXT
ALTERNATE_ENTRY RhpByRefAssignRefAVLocation1
mov rcx, [rsi]
#ifdef TARGET_APPLE
// Apple's linker has issues which break unwind info if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the manifestation of the bad unwind info?

If it is just when single stepping in lldb, do we care?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The manifestation is that if we AV on this location, we are unable to unwind to the managed frame, so we failfast the process instead of producing a NullReferenceException which can be caught.

Copy link
Member

@jkotas jkotas May 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to use the same manual uwnind routine in regular CoreCLR as we use in NAOT to avoid this problem?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another approach where we don't use libunwind for unwinding these entries, instead we use our own unwinder which is special cased to know what to do. (This is what NativeAot does.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Unlike NativeAOT, CoreCLR implements the Apple compact unwinding through a modified copy of the llvm-libunwind code; libunwind should not even come into play for this since it's covered by the compact codes.)

Copy link
Member

@filipnavara filipnavara May 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I was doing fixes of the osx-arm64 NativeAOT port I found a bug in llvm-libunwind comparison of the boundary comparison for IP addresses and function start/end (cb6fb60#diff-fa7a2ccbf98ccd844bfc21e48f2bc700137c2903cf97d487fa011e16b1f9b3c0).

I looked at the custom unwinding code in CoreCLR and there are similar comparisons that look sketchy:

if (ip < funcStart || ip > funcEnd) {
ERROR("ip %p not in regular second level\n", (void*)ip);
return false;
}

if (ip < funcStart || ip > funcEnd) {
ERROR("ip %p not in compressed second level\n", (void*)ip);
return false;
}

(in both cases the condition should presumably be ip >= funcEnd, not ip > funcEnd; shouldn't really cause an error for alt_entry though..)

@davidwrighton davidwrighton merged commit 3aa00d7 into dotnet:main May 15, 2025
108 checks passed
@a74nh
Copy link
Contributor

a74nh commented May 16, 2025

I've been working in a similar area in #111636. Essentially I'm making Arm64 use writebarriers in the same way as AMD64. I've just been rebasing off the top of your work.

AIUI, with this PR, when IsWriteBarrierCopyEnabled() returns true it means that none of the writebarrrier code in src/vm/ is used (specifically, I care about JIT_WriteBarrier in src/vm/arm64/patchedcode.S). Instead it uses a read only write barrier in runtime/<target>/WriteBarriers.S.

If so, then I think we're good and I don't need to worry about my PR :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants