gh-129069: make list ass_slice and memory_repeat safe in free-threading #131882
base: main
Conversation
Ping @colesbury, @Yhg1s — Windows doesn't like:
void *v = _Py_atomic_load_ptr_relaxed(s);
_Py_atomic_store_ptr_relaxed(d, v);
I don't think we need a relaxed load here (and in memmove), at least not for the list/tuple cases (writes always happen with the lock held for lists, and tuples are supposed to be immutable after they're shared). I'm not sure if we're going to end up using this in situations where we do need relaxed loads though, but I guess we could document it appropriately and revisit if we come across a need.
I'm not sure if the stores should be relaxed, though. We use release order when inserting new items in list_ass_slice_lock_held, but I'm not sure I understand why. I can't think of reorderings that would affect loads from those items. @colesbury what do you think? Do we need a single release fence, or do we need a strict ordering among the stores?
(The function name will need changing if we don't use relaxed loads/stores, and it should probably be something like _Py_atomic_memcpy_ptr_unsafe_release to indicate it reads unsafely but release-stores.)
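To make that suggestion concrete, here is a minimal sketch of the semantics being described. The name _Py_atomic_memcpy_ptr_unsafe_release is only the rename proposed above, not an existing CPython function; the body assumes the usual pyatomic helpers and is not the patch itself:

```c
/* Hypothetical sketch only: copy n bytes as pointer-sized chunks, reading
 * each pointer with a plain (non-atomic) load and publishing it with a
 * release-ordered atomic store. */
static inline void
_Py_atomic_memcpy_ptr_unsafe_release(void *dest, void *src, size_t n)
{
    void **d = (void **)dest;
    void **s = (void **)src;
    void **e = d + n / sizeof(void *);
    for (; d != e; d++, s++) {
        void *v = *s;                         /* plain read: writers hold the list lock */
        _Py_atomic_store_ptr_release(d, v);   /* release store for lock-free readers */
    }
}
```

The other option raised above would be a plain copy followed by a single release fence (something like _Py_atomic_fence_release()) before the new items become reachable, rather than release ordering on every store.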
Some context: as I said in the header, I did something similar in the array module, and there I had to make the reads atomic in resize because there were lock-free writes. I guess that doesn't apply here, but making the reads non-atomic still compiles to essentially the same thing (on Intel), which makes sense, since the data still needs to be written in pointer-sized chunks for the atomic writes. So basically nothing is really gained by making the reads non-atomic rather than relaxed in this case (at least on Intel).
Non-atomic read with atomic write inner loop:
.L2929:
mov rdi, QWORD PTR [rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144: for (void **d = (void **)dest, **s = (void **)src, **e = d + n / sizeof(void *); d != e; d++, s++) {
add rsi, 64 # d,
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144: for (void **d = (void **)dest, **s = (void **)src, **e = d + n / sizeof(void *); d != e; d++, s++) {
add rax, 64 # s,
# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void **)obj, value, __ATOMIC_RELAXED); }
mov QWORD PTR -64[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -56[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -56[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -48[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -48[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -40[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -40[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -32[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -32[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -24[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -24[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -16[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -16[rsi], rdi #,, MEM[(void * *)s_289]
mov rdi, QWORD PTR -8[rax] # MEM[(void * *)s_289], MEM[(void * *)s_289]
mov QWORD PTR -8[rsi], rdi #,, MEM[(void * *)s_289]
cmp rax, QWORD PTR 16[rsp] # s, %sfp
jne .L2929 #,
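For reference, the C shape this loop corresponds to, reconstructed from the source lines quoted in the asm comments (pycore_pyatomic_ft_wrappers.h:144 in this branch). The function name and signature here are approximations, not necessarily what the patch uses:

```c
/* Approximate reconstruction of the copy loop shown in the assembly above:
 * plain pointer-sized reads, relaxed atomic stores. */
static inline void
_Py_atomic_memcpy_ptr_relaxed(void *dest, void *src, size_t n)
{
    for (void **d = (void **)dest, **s = (void **)src,
              **e = d + n / sizeof(void *);
         d != e; d++, s++) {
        void *v = *s;                        /* non-atomic read */
        _Py_atomic_store_ptr_relaxed(d, v);  /* relaxed atomic store (pyatomic_gcc.h) */
    }
}
```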
Waiting on RELEASE decision.
This PR exists because of #129069 and Tools/tsan/suppressions_free_threading.txt.

It seems to hit the requirements: performance, atomicity assured (per pointer), and TSAN shuts up. Pretty sure the list_ass_slice_lock_held and list_inplace_repeat_lock_held suppressions can be removed, and if not yet, then it should be possible to add the atomic memcpy to list_resize where needed in order to do so.

The "atomic memcpy" (atomic per pointer, not the whole memcpy) functions are in the atomic wrappers header because they can be reused; if wanted otherwise, they can move into listobject.c or somewhere else. They could also be made non-inline, callable real functions. Or the inline funcs could go into pyatomic.h?

@colesbury, you may recognize this: I did something similar for the lock-free array module PR, but this is simpler. PyObject pointers are a bit more fragile than pure values though, so I assume you want this here?
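As an illustration of the wrapper-header placement: FT_ATOMIC_* macros in that header conventionally reduce to the plain libc call in the default build, so a macro like the ones this PR adds would presumably look roughly like this. The definition below is hypothetical (reusing the _Py_atomic_memcpy_ptr_relaxed name from the sketch above), not necessarily the PR's exact one:

```c
/* Sketch of the usual FT_ATOMIC_* wrapper convention (hypothetical): use the
 * per-pointer atomic copy in the free-threaded build, plain memcpy otherwise. */
#ifdef Py_GIL_DISABLED
#  define FT_ATOMIC_MEMCPY_PTR_RELAXED(dest, src, n) \
        _Py_atomic_memcpy_ptr_relaxed((dest), (src), (n))
#else
#  define FT_ATOMIC_MEMCPY_PTR_RELAXED(dest, src, n) \
        memcpy((dest), (src), (n))
#endif
```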
Here is an example of what the inner loop of FT_ATOMIC_MEMMOVE_PTR_RELAXED compiles to; it's no rep movsq or other specialized move instructions, but it's fast:

Simple benchmark. Note that the decrementing-address copy case in FT_ATOMIC_MEMMOVE_PTR_RELAXED (a sketch of that backward copy appears at the end of this post) seems to hurt a bit cache-wise in the 'tins' simple bench, but it doesn't seem to make a difference in the large move or overall in pyperformance. I've had this one jump around from parity with current to this (which is the worst case, so I put it here). The difference disappears if the real memmove is used in this case, but the difference seems to come from the real memmove using special instructions (which are not atomic), a la: https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html. The other two simple benches (tdel and tins big) are solid where they are.

Test script:

Times, average of 10 runs each:
pyperformance benchmark. Difference in average performance is essentially nonexistent, though individual tests can vary a bit (AMD 7950x running in VirtualBox):
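For reference, a sketch of the decrementing-address (backward) copy path mentioned in the benchmark notes: a per-pointer atomic memmove has to pick a copy direction so an overlapping destination never overwrites source pointers that have not been copied yet. This is a hypothetical shape reusing the relaxed-store helper from the earlier sketches, not necessarily the PR's exact code:

```c
/* Sketch: per-pointer memmove chooses a copy direction based on overlap.
 * The backward branch is the one noted above as slightly less cache-friendly
 * than glibc's vectorized memmove. */
static inline void
_Py_atomic_memmove_ptr_relaxed(void *dest, void *src, size_t n)
{
    size_t count = n / sizeof(void *);
    void **d = (void **)dest;
    void **s = (void **)src;
    if (d == s || count == 0) {
        return;
    }
    if (d < s) {
        /* Forward copy: same inner loop as the memcpy variant. */
        for (void **e = d + count; d != e; d++, s++) {
            _Py_atomic_store_ptr_relaxed(d, *s);
        }
    }
    else {
        /* Backward (decrementing-address) copy, for when dest overlaps src
         * from above: start at the last pointer and walk down. */
        d += count;
        s += count;
        while (count--) {
            d--;
            s--;
            _Py_atomic_store_ptr_relaxed(d, *s);
        }
    }
}
```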