
gh-129069: make list ass_slice and memory_repeat safe in free-threading #131882

Open · wants to merge 6 commits into main

Conversation

@tom-pytel (Contributor) commented Mar 29, 2025

This PR exists because of #129069 and this comment in Tools/tsan/suppressions_free_threading.txt:

# List resizing happens through different paths ending in memcpy or memmove
# (for efficiency), which will probably need to be rewritten as explicit loops
# of ptr-sized copies to be thread-safe.

It seems to meet the requirements: performance is preserved, atomicity is assured (per pointer), and TSAN is quiet. I'm fairly sure the list_ass_slice_lock_held and list_inplace_repeat_lock_held suppressions can be removed, and if not yet, it should be possible to also use the atomic memcpy in list_resize where needed so that they can be.
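
The entries that should become removable use TSAN's standard suppression syntax and would look like this (illustrative; the exact lines in the file may differ):

race:list_ass_slice_lock_held
race:list_inplace_repeat_lock_held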

The "atomic memcpy" (atomic per ptr, not whole memcpy) functions are in the atomic wrappers header because they can be reused, if want otherwise though can move into listobject.c, or somewhere else. Can also make non-inline callable real functions. Or the inline funcs could go into pyatomic.h?

@colesbury, you may recognize this: I did something similar for the lock-free array module PR, but this is simpler. PyObject pointers are a bit more fragile than plain values, though, so I assume you want this here?

Here is an example of what the inner loop of FT_ATOMIC_MEMMOVE_PTR_RELAXED compiles to. It's no rep movsq or other specialized move instruction, but it's fast:

.L2971:
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR [rdi]	# _160,* s
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143:         for (void **d = (void **)dest + n, **s = (void **)src + n, **e = (void **)dest - 1; d != e; d--, s--) {
	sub	rsi, 64	# d,
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:143:         for (void **d = (void **)dest + n, **s = (void **)src + n, **e = (void **)dest - 1; d != e; d--, s--) {
	sub	rdi, 64	# s,
# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void **)obj, value, __ATOMIC_RELAXED); }
	mov	QWORD PTR 64[rsi], rax	#,, _160
	mov	rax, QWORD PTR 56[rdi]	# _160,
	mov	QWORD PTR 56[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 48[rdi]	# _160,
	mov	QWORD PTR 48[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 40[rdi]	# _160,
	mov	QWORD PTR 40[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 32[rdi]	# _160,
	mov	QWORD PTR 32[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 24[rdi]	# _160,
	mov	QWORD PTR 24[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 16[rdi]	# _160,
	mov	QWORD PTR 16[rsi], rax	#,, _160
# ./Include/cpython/pyatomic_gcc.h:387: { return (void *)__atomic_load_n((void * const *)obj, __ATOMIC_RELAXED); }
	mov	rax, QWORD PTR 8[rdi]	# _160,
	mov	QWORD PTR 8[rsi], rax	#,, _160
	cmp	QWORD PTR 32[rsp], rsi	# %sfp, d
	jne	.L2971	#,

Simple benchmark. Note that the decrementing-address copy case in FT_ATOMIC_MEMMOVE_PTR_RELAXED seems to hurt a bit cache-wise in the 'tins' micro-benchmark, but it doesn't seem to make a difference in the large move or in pyperformance overall. I've seen this one jump around from parity with current main to the numbers below (the worst case, which is what I show here). The difference disappears if the real memmove is used for this case, but the gain seems to come from the real memmove using specialized move instructions (which are not atomic), a la: https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html. The other two micro-benchmarks (tdel and tins big) are solid where they are. Test script:

from timeit import timeit
from time import time

# Slice deletion (the shrink path of list_ass_slice).
tdel = timeit('del dst[:1024]', 'dst = [None] * 4096', number=10000000)
print(f't del: {tdel}')

# Front insertion (the grow path, decrementing-address copy).
tins = timeit('dst[:0] = src', 'dst = [None] * 4096; dst.pop(); src = [None]', number=100000)  # pop() to make sure no realloc in timing
print(f't ins: {tins}')

# One very large front insertion, timed directly.
dst = [None] * 0x10000000
src = [None]
dst.pop()  # leave spare capacity so the insert below doesn't realloc

t0 = time()
dst[:0] = src
t1 = time()

print(f't ins big: {t1 - t0}')

Times, average of 10 runs each:

current main:
-------------
t del: 0.2191115931245804
t ins: 0.3050231862498549
t ins big: 0.0799584686756134

fix-129069:
-----------
t del: 0.21582267370067712
t ins: 1.072859044800134
t ins big: 0.08154952526092529

pyperformance benchmark. The difference in average performance is essentially nonexistent, though individual tests can vary a bit (AMD 7950X running in VirtualBox):

                           current main   fix-129069    norm diff

async_generators                    250          254      -0.0157
asyncio_tcp                         248          247       0.0040
asyncio_tcp_ssl                    1.08         1.06       0.0189
asyncio_websockets                  344          342       0.0058
chaos                              42.3         42.4      -0.0024
comprehensions                     11.1           11       0.0091
bench_mp_pool                      17.5         17.4       0.0057
bench_thread_pool                  1.16         1.09       0.0642
coroutines                         12.8         12.8       0.0000
coverage                             57         54.5       0.0459
crypto_pyaes                       51.9         51.6       0.0058
deepcopy                            182          180       0.0111
deepcopy_reduce                    1.91         1.94      -0.0155
deepcopy_memo                      20.2         21.7      -0.0691
deltablue                           2.3          2.3       0.0000
django_template                    25.2           25       0.0080
docutils                           1.61         1.58       0.0190
dulwich_log                        26.3         25.8       0.0194
fannkuch                            260          257       0.0117
float                              42.4         43.5      -0.0253
create_gc_cycles                    494          512      -0.0352
gc_traversal                       1.17         1.27      -0.0787
generators                         20.9         21.4      -0.0234
genshi_text                        16.6         16.7      -0.0060
genshi_xml                         34.3         34.9      -0.0172
go                                 80.3         80.8      -0.0062
hexiom                             4.17         4.22      -0.0118
html5lib                           33.2           34      -0.0235
json_dumps                         7.43         7.31       0.0164
json_loads                         14.2         13.9       0.0216
logging_format                     4.58         4.61      -0.0065
logging_silent                     64.8         63.7       0.0173
logging_simple                      4.2         4.22      -0.0047
mako                               7.95         7.97      -0.0025
mdp                                1.78         1.98      -0.1010
meteor_contest                     68.2         69.7      -0.0215
nbody                              82.6         78.8       0.0482
nqueens                            58.6         61.1      -0.0409
pathlib                            13.8         13.9      -0.0072
pickle                             6.12         6.13      -0.0016
pickle_dict                        15.3         15.2       0.0066
pickle_list                         2.3         2.18       0.0550
pickle_pure_python                  204          203       0.0049
pidigits                            116          117      -0.0085
pprint_safe_repr                    499          513      -0.0273
pprint_pformat                     1.02         1.05      -0.0286
pyflate                             296          304      -0.0263
python_startup                     9.29         9.29       0.0000
python_startup_no_site             7.78         7.62       0.0210
raytrace                            192          193      -0.0052
regex_compile                      74.4         74.1       0.0040
regex_dna                           111          110       0.0091
regex_effbot                       1.69          1.7      -0.0059
regex_v8                           13.8         13.7       0.0073
richards                           32.1         32.4      -0.0093
richards_super                     36.8         37.3      -0.0134
scimark_fft                         232          231       0.0043
scimark_lu                           81         80.7       0.0037
scimark_monte_carlo                45.9         47.9      -0.0418
scimark_sor                        77.9         76.3       0.0210
scimark_sparse_mat_mult            3.83         3.78       0.0132
spectral_norm                      68.5           67       0.0224
sqlglot_normalize                   197          196       0.0051
sqlglot_optimize                   36.8         36.9      -0.0027
sqlglot_parse                       887          888      -0.0011
sqlglot_transpile                   1.1          1.1       0.0000
sqlite_synth                       1.43         1.44      -0.0069
sympy_expand                        277          276       0.0036
sympy_integrate                    12.8         12.7       0.0079
sympy_sum                          88.1         85.3       0.0328
sympy_str                           164          162       0.0123
telco                              5.47         5.39       0.0148
tomli_loads                        1.37         1.37       0.0000
typing_runtime_protocols            112          112       0.0000
unpack_sequence                    28.7         28.6       0.0035
unpickle                              8         8.18      -0.0220
unpickle_list                      2.87         2.75       0.0436
unpickle_pure_python                144          140       0.0286
xml_etree_parse                      74         72.6       0.0193
xml_etree_iterparse                49.6         47.8       0.0377
xml_etree_generate                 56.2         56.9      -0.0123
xml_etree_process                  40.4         40.3       0.0025

AVERAGE                                                   -0.0001
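
(For reproduction: tables like the one above are typically produced with pyperformance/pyperf, roughly as follows; the output file names here are illustrative.)

pyperformance run -o main.json          # on current main
pyperformance run -o fix-129069.json    # on the PR branch
python -m pyperf compare_to main.json fix-129069.json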

@tom-pytel (Contributor, Author) commented:

Ping @colesbury, @Yhg1s: Windows doesn't like size_t as a parameter, will fix, but I would like a thumbs up or down on this PR in general?

@tom-pytel tom-pytel changed the title gh-129069: make list ass_slice and memory_repeat safe gh-129069: make list ass_slice and memory_repeat safe in free-threading Mar 29, 2025
Comment on lines 124 to 125
void *v = _Py_atomic_load_ptr_relaxed(s);
_Py_atomic_store_ptr_relaxed(d, v);
Member commented:

I don't think we need a relaxed load here (and in memmove), at least not for the list/tuple cases (writes always happen with the lock held for lists, and tuples are supposed to be immutable after they're shared). I'm not sure if we're going to end up using this in situations where we do need relaxed loads though, but I guess we could document it appropriately and revisit if we come across a need.

I'm not sure if the stores should be relaxed, though. We use release order when inserting new items in list_ass_slice_lock_held, but I'm not sure I understand why. I can't think of reorderings that would affect loads from those items. @colesbury what do you think? Do we need a single release fence, or do we need a strict ordering among the stores?

(The function name will need changing if we don't use relaxed loads/stores, and it should probably be something like _Py_atomic_memcpy_ptr_unsafe_release to indicate it reads unsafely but release-stores.)
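
(For illustration, a hedged sketch of what such a function might look like, following the naming suggested above; this is hypothetical, not code from the PR: plain loads from the source, release-ordered stores to the destination.)

/* Hypothetical: non-atomic ("unsafe") reads, release stores, so a
   reader that acquire-loads a slot also observes the writes that
   initialized the object the stored pointer refers to. */
static inline void
_Py_atomic_memcpy_ptr_unsafe_release(void *dest, void *src, size_t n_ptrs)
{
    void **d = (void **)dest;
    void **s = (void **)src;
    for (size_t i = 0; i < n_ptrs; i++) {
        __atomic_store_n(&d[i], s[i], __ATOMIC_RELEASE);
    }
}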

@tom-pytel (Contributor, Author) replied Apr 1, 2025:

Some context: as I said in the description, I did something similar in the array module, and there I had to make the reads atomic in resize because there were lock-free writes. I guess that doesn't apply here, but making the reads non-atomic still compiles to essentially the same thing (on Intel), which makes sense since the data still has to be written in pointer-sized chunks for the atomic writes. So basically nothing is really gained by making the reads non-atomic instead of relaxed in this case (at least on Intel).

Non-atomic read with atomic write inner loop:

.L2929:
	mov	rdi, QWORD PTR [rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144:         for (void **d = (void **)dest, **s = (void **)src, **e = d + n / sizeof(void *); d != e; d++, s++) {
	add	rsi, 64	# d,
# ./Include/internal/pycore_pyatomic_ft_wrappers.h:144:         for (void **d = (void **)dest, **s = (void **)src, **e = d + n / sizeof(void *); d != e; d++, s++) {
	add	rax, 64	# s,
# ./Include/cpython/pyatomic_gcc.h:509: { __atomic_store_n((void **)obj, value, __ATOMIC_RELAXED); }
	mov	QWORD PTR -64[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -56[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -56[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -48[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -48[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -40[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -40[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -32[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -32[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -24[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -24[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -16[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -16[rsi], rdi	#,, MEM[(void * *)s_289]
	mov	rdi, QWORD PTR -8[rax]	# MEM[(void * *)s_289], MEM[(void * *)s_289]
	mov	QWORD PTR -8[rsi], rdi	#,, MEM[(void * *)s_289]
	cmp	rax, QWORD PTR 16[rsp]	# s, %sfp
	jne	.L2929	#,

Waiting on the RELEASE ordering decision.
