
gh-129069: make list ass_slice and memory_repeat safe in free-threading #131882

Open · wants to merge 6 commits into main

Conversation

@tom-pytel (Contributor) commented Mar 29, 2025

This PR exists because of #129069 and this comment in Tools/tsan/suppressions_free_threading.txt:

# List resizing happens through different paths ending in memcpy or memmove
# (for efficiency), which will probably need to be rewritten as explicit loops
# of ptr-sized copies to be thread-safe.

It seems to meet the requirements: performance is preserved, atomicity is assured (per pointer), and TSAN is quiet. I'm fairly sure the list_ass_slice_lock_held and list_inplace_repeat_lock_held suppressions can be removed, and if not yet, it should be possible to also use the atomic memcpy in list_resize where needed so that they can be.
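
The entries that should become removable use TSAN's standard suppression syntax and would look like this (illustrative; the exact lines in the file may differ):

race:list_ass_slice_lock_held
race:list_inplace_repeat_lock_held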

The "atomic memcpy" (atomic per ptr, not whole memcpy) functions are in the atomic wrappers header because they can be reused, if want otherwise though can move into listobject.c, or somewhere else. Can also make non-inline callable real functions. Or the inline funcs could go into pyatomic.h?

@colesbury, you may recognize this: I did something similar for the lock-free array module PR, but this is simpler. PyObject pointers are a bit more fragile than plain values, though, so I assume you want this here?

Here is an example of what the inner loop of FT_ATOMIC_MEMMOVE_PTR_RELAXED compiles to. It's no rep movsq or other specialized move instruction, but it's fast:

.L2971:
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR [rdi]	# _160,* s
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143:         for (void **d = (void **)dest + n, **s = (void **)src + n, **e = (void **)dest - 1; d != e; d--, s--) {
	sub	rsi, 64	# d,
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143:         for (void **d = (void **)dest + n, **s = (void **)src + n, **e = (void **)dest - 1; d != e; d--, s--) {
	sub	rdi, 64	# s,
# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void **)obj, value, __ATOMIC_RELAXED); }
	mov	QWORD PTR 64[rsi], rax	#,, _160
	mov	rax, QWORD PTR 56[rdi]	# _160,
	mov	QWORD PTR 56[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 48[rdi]	# _160,
	mov	QWORD PTR 48[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 40[rdi]	# _160,
	mov	QWORD PTR 40[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 32[rdi]	# _160,
	mov	QWORD PTR 32[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 24[rdi]	# _160,
	mov	QWORD PTR 24[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 16[rdi]	# _160,
	mov	QWORD PTR 16[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 8[rdi]	# _160,
	mov	QWORD PTR 8[rsi], rax	#,, _160
	cmp	QWORD PTR 32[rsp], rsi	# %sfp, d
	jne	.L2971	#,

Simple benchmark. Note that the decrementing-address copy case in FT_ATOMIC_MEMMOVE_PTR_RELAXED seems to hurt a bit cache-wise in the 'tins' micro-benchmark, but it doesn't seem to make a difference in the large move or in pyperformance overall. I've seen this one jump around from parity with current main to the numbers below (the worst case, which is what I show here). The difference disappears if the real memmove is used for this case, but the gain seems to come from the real memmove using specialized move instructions (which are not atomic), a la: https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html. The other two micro-benchmarks (tdel and tins big) are solid where they are. Test script:

from timeit import timeit
from time import time

# Slice deletion (the shrink path of list_ass_slice).
tdel = timeit('del dst[:1024]', 'dst = [None] * 4096', number=10000000)
print(f't del: {tdel}')

# Front insertion (the grow path, decrementing-address copy).
tins = timeit('dst[:0] = src', 'dst = [None] * 4096; dst.pop(); src = [None]', number=100000)  # pop() to make sure no realloc in timing
print(f't ins: {tins}')

# One very large front insertion, timed directly.
dst = [None] * 0x10000000
src = [None]
dst.pop()  # leave spare capacity so the insert below doesn't realloc

t0 = time()
dst[:0] = src
t1 = time()

print(f't ins big: {t1 - t0}')

Times, average of 10 runs each:

current main:
-------------
t del: 0.2191115931245804
t ins: 0.3050231862498549
t ins big: 0.0799584686756134

fix-129069:
-----------
t del: 0.21582267370067712
t ins: 1.072859044800134
t ins big: 0.08154952526092529

pyperformance benchmark. The difference in average performance is essentially nonexistent, though individual tests can vary a bit (AMD 7950X running in VirtualBox):

                           current main   fix-129069    norm diff

async_generators                    250          254      -0.0157
asyncio_tcp                         248          247       0.0040
asyncio_tcp_ssl                    1.08         1.06       0.0189
asyncio_websockets                  344          342       0.0058
chaos                              42.3         42.4      -0.0024
comprehensions                     11.1           11       0.0091
bench_mp_pool                      17.5         17.4       0.0057
bench_thread_pool                  1.16         1.09       0.0642
coroutines                         12.8         12.8       0.0000
coverage                             57         54.5       0.0459
crypto_pyaes                       51.9         51.6       0.0058
deepcopy                            182          180       0.0111
deepcopy_reduce                    1.91         1.94      -0.0155
deepcopy_memo                      20.2         21.7      -0.0691
deltablue                           2.3          2.3       0.0000
django_template                    25.2           25       0.0080
docutils                           1.61         1.58       0.0190
dulwich_log                        26.3         25.8       0.0194
fannkuch                            260          257       0.0117
float                              42.4         43.5      -0.0253
create_gc_cycles                    494          512      -0.0352
gc_traversal                       1.17         1.27      -0.0787
generators                         20.9         21.4      -0.0234
genshi_text                        16.6         16.7      -0.0060
genshi_xml                         34.3         34.9      -0.0172
go                                 80.3         80.8      -0.0062
hexiom                             4.17         4.22      -0.0118
html5lib                           33.2           34      -0.0235
json_dumps                         7.43         7.31       0.0164
json_loads                         14.2         13.9       0.0216
logging_format                     4.58         4.61      -0.0065
logging_silent                     64.8         63.7       0.0173
logging_simple                      4.2         4.22      -0.0047
mako                               7.95         7.97      -0.0025
mdp                                1.78         1.98      -0.1010
meteor_contest                     68.2         69.7      -0.0215
nbody                              82.6         78.8       0.0482
nqueens                            58.6         61.1      -0.0409
pathlib                            13.8         13.9      -0.0072
pickle                             6.12         6.13      -0.0016
pickle_dict                        15.3         15.2       0.0066
pickle_list                         2.3         2.18       0.0550
pickle_pure_python                  204          203       0.0049
pidigits                            116          117      -0.0085
pprint_safe_repr                    499          513      -0.0273
pprint_pformat                     1.02         1.05      -0.0286
pyflate                             296          304      -0.0263
python_startup                     9.29         9.29       0.0000
python_startup_no_site             7.78         7.62       0.0210
raytrace                            192          193      -0.0052
regex_compile                      74.4         74.1       0.0040
regex_dna                           111          110       0.0091
regex_effbot                       1.69          1.7      -0.0059
regex_v8                           13.8         13.7       0.0073
richards                           32.1         32.4      -0.0093
richards_super                     36.8         37.3      -0.0134
scimark_fft                         232          231       0.0043
scimark_lu                           81         80.7       0.0037
scimark_monte_carlo                45.9         47.9      -0.0418
scimark_sor                        77.9         76.3       0.0210
scimark_sparse_mat_mult            3.83         3.78       0.0132
spectral_norm                      68.5           67       0.0224
sqlglot_normalize                   197          196       0.0051
sqlglot_optimize                   36.8         36.9      -0.0027
sqlglot_parse                       887          888      -0.0011
sqlglot_transpile                   1.1          1.1       0.0000
sqlite_synth                       1.43         1.44      -0.0069
sympy_expand                        277          276       0.0036
sympy_integrate                    12.8         12.7       0.0079
sympy_sum                          88.1         85.3       0.0328
sympy_str                           164          162       0.0123
telco                              5.47         5.39       0.0148
tomli_loads                        1.37         1.37       0.0000
typing_runtime_protocols            112          112       0.0000
unpack_sequence                    28.7         28.6       0.0035
unpickle                              8         8.18      -0.0220
unpickle_list                      2.87         2.75       0.0436
unpickle_pure_python                144          140       0.0286
xml_etree_parse                      74         72.6       0.0193
xml_etree_iterparse                49.6         47.8       0.0377
xml_etree_generate                 56.2         56.9      -0.0123
xml_etree_process                  40.4         40.3       0.0025

AVERAGE                                                   -0.0001
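
(For reproduction: tables like the one above are typically produced with pyperformance/pyperf, roughly as follows; the output file names here are illustrative.)

pyperformance run -o main.json          # on current main
pyperformance run -o fix-129069.json    # on the PR branch
python -m pyperf compare_to main.json fix-129069.json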

@tom-pytel (Contributor, Author) commented:

Ping @colesbury, @Yhg1s: Windows doesn't like size_t as a parameter, will fix, but I would like a thumbs up or down on this PR in general?

@tom-pytel tom-pytel changed the title gh-129069: make list ass_slice and memory_repeat safe gh-129069: make list ass_slice and memory_repeat safe in free-threading Mar 29, 2025
Comment on lines 124 to 125
void *v = _Py_atomic_load_ptr_relaxed(s);
_Py_atomic_store_ptr_relaxed(d, v);
Member commented:

I don't think we need a relaxed load here (and in memmove), at least not for the list/tuple cases (writes always happen with the lock held for lists, and tuples are supposed to be immutable after they're shared). I'm not sure if we're going to end up using this in situations where we do need relaxed loads though, but I guess we could document it appropriately and revisit if we come across a need.

I'm not sure if the stores should be relaxed, though. We use release order when inserting new items in list_ass_slice_lock_held, but I'm not sure I understand why. I can't think of reorderings that would affect loads from those items. @colesbury what do you think? Do we need a single release fence, or do we need a strict ordering among the stores?

(The function name will need changing if we don't use relaxed loads/stores, and it should probably be something like _Py_atomic_memcpy_ptr_unsafe_release to indicate it reads unsafely but release-stores.)
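
(For illustration, a hedged sketch of what such a function might look like, following the naming suggested above; this is hypothetical, not code from the PR: plain loads from the source, release-ordered stores to the destination.)

/* Hypothetical: non-atomic ("unsafe") reads, release stores, so a
   reader that acquire-loads a slot also observes the writes that
   initialized the object the stored pointer refers to. */
static inline void
_Py_atomic_memcpy_ptr_unsafe_release(void *dest, void *src, size_t n_ptrs)
{
    void **d = (void **)dest;
    void **s = (void **)src;
    for (size_t i = 0; i < n_ptrs; i++) {
        __atomic_store_n(&d[i], s[i], __ATOMIC_RELEASE);
    }
}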

@tom-pytel (Contributor, Author) replied Apr 1, 2025:

Some context: as I said in the description, I did something similar in the array module, and there I had to make the reads atomic in resize because there were lock-free writes. I guess that doesn't apply here, but making the reads non-atomic still compiles to essentially the same thing (on Intel), which makes sense since the data still has to be written in pointer-sized chunks for the atomic writes. So basically nothing is really gained by making the reads non-atomic instead of relaxed in this case (at least on Intel).

Non-atomic read with atomic write inner loop:

.L2929:
	mov	rdi, QWORD PTR [rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144:         for (void **d = (void **)dest, **s = (void **)src, **e = d + n / sizeof(void *); d != e; d++, s++) {
	add	rsi, 64	# d,
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144:         for (void **d = (void **)dest, **s = (void **)src, **e = d + n / sizeof(void *); d != e; d++, s++) {
	add	rax, 64	# s,
# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void **)obj, value, __ATOMIC_RELAXED); }
	mov	QWORD PTR -64[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -56[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -56[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -48[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -48[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -40[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -40[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -32[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -32[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -24[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -24[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -16[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -16[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -8[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -8[rsi], rdi	#,, MEM[(void * *)s_289]
	cmp	rax, QWORD PTR 16[rsp]	# s, %sfp
	jne	.L2929	#,

Waiting on the RELEASE ordering decision.
