
Conversation

@zoybai zoybai commented Jun 7, 2018

Also need to fix the redefinition warnings in osq_lock and tbb.

zoybai and others added 16 commits May 22, 2018 16:28
Sync to upstream ARM-software synchronization-benchmarks latest master
TODO: There is an optimization bug for osq_lock: gcc would optimize
out the lock pointer variable and cause a segmentation fault. Even if we
use volatile, this identifier is discarded after the WRITE_ONCE
macro. Adding an extraneous printf("%llx", lock) could solve this
issue; however, we chose the more conservative method, #pragma
GCC optimize ("O0"), in the end.
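
As a rough sketch of that workaround (hypothetical names, not the actual
osq_lock.h code), placing the pragma ahead of the affected definitions
compiles them at -O0 so the lock pointer is never optimized away:

#pragma GCC optimize ("O0")   /* compile every function below at -O0 */

/* Hypothetical spin helper illustrating the pattern that broke at higher
 * optimization levels: without the pragma, gcc could elide the 'lock'
 * pointer after the WRITE_ONCE-style store. */
static void osq_spin_example(volatile unsigned long *lock)
{
    __atomic_store_n(lock, 1UL, __ATOMIC_RELEASE);         /* WRITE_ONCE-style store */
    while (__atomic_load_n(lock, __ATOMIC_ACQUIRE) != 0UL)
        ;                                                   /* spin until cleared */
}
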
Linux kernel restricts llsc cmpxchg functions to use x16/x17
Tell the compiler to treat all general purpose registers (with the
exception of the IP registers, which are already handled by the caller
in case of a PLT) as callee-saved, which allows for efficient runtime
patching of the bl instruction in the caller with an atomic instruction
when supported by the CPU. Result and argument registers are handled
correctly, based on the function prototype.

lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o
CFLAGS_atomic_ll_sc.o   := -fcall-used-x0 -ffixed-x1 -ffixed-x2         \
                   -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6          \
                   -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9           \
                   -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12   \
                   -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15   \
                   -fcall-saved-x18 -fomit-frame-pointer

It also defines __LL_SC_CLOBBERS as "x16", "x17", "x30" to keep x16/x17
from being used as local registers across these calls.

Therefore the Linux kernel's __ll_sc___cmpxchg_case_4 variants will not use
registers such as x3 or x4 as temporaries, e.g.:

Dump of assembler code for function __ll_sc___cmpxchg_case_acq_4:
0xffff000008a91950 <+0>:     11 00 80 f9     prfm pstl1strm, [x0]
0xffff000008a91954 <+4>:     10 fc 5f 88     ldaxr w16, [x0]
0xffff000008a91958 <+8>:     11 02 01 4a     eor  w17, w16, w1
0xffff000008a9195c <+12>:    71 00 00 35     cbnz w17, 0xffff000008a91968 <+24>
0xffff000008a91960 <+16>:    02 7c 11 88     stxr w17, w2, [x0]
0xffff000008a91964 <+20>:    91 ff ff 35     cbnz w17, 0xffff000008a91954 <+4>
0xffff000008a91968 <+24>:    e0 03 10 aa     mov  x0, x16
0xffff000008a9196c <+28>:    c0 03 5f d6     ret
End of assembler dump.

When we first ported the kernel cmpxchg.h to lockhammer's lk_cmpxchg.h, we
didn't add the above gcc parameters to restrict register allocation. The default
allocation scheme will use x3/x4 instead of x16/x17, which may conflict
with the caller's local variables. This matters because the Arm64 LSE implementation
uses a runtime binary-patching technique to substitute 3 instructions
depending on whether the system supports Armv8.1 LSE.

These 3 instructions may be bl/nop/nop on an Armv8.0 LL/SC core or mov/cas/mov
on an Armv8.1 LSE core. Not restricting register allocation while mixing a hard
BL (branch and link) with other compiler-allocated registers may corrupt the
caller's variables or pointers.

We have modified our cmpxchg implementation to disable runtime binary patching
and use the LL/SC assembly directly, without a hard branch and link. This may
perform better than the kernel implementation and is more representative
of userspace applications.
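
A minimal sketch of that direct LL/SC form (illustrative only, not the actual
lk_cmpxchg.h code): the exclusive-load/store loop is emitted inline, so there
is no BL to patch and the compiler may pick any scratch registers without
colliding with a fixed register convention.

static inline unsigned int
llsc_cmpxchg32_acq(unsigned int *ptr, unsigned int old, unsigned int new)
{
    unsigned int oldval, tmp;

    asm volatile(
    "1: ldaxr   %w[oldval], %[v]\n"             /* load-acquire exclusive     */
    "   eor     %w[tmp], %w[oldval], %w[old]\n"
    "   cbnz    %w[tmp], 2f\n"                  /* value changed: give up     */
    "   stxr    %w[tmp], %w[new], %[v]\n"       /* try to store the new value */
    "   cbnz    %w[tmp], 1b\n"                  /* lost exclusivity: retry    */
    "2:"
    : [oldval] "=&r" (oldval), [tmp] "=&r" (tmp), [v] "+Q" (*ptr)
    : [old] "r" (old), [new] "r" (new)
    : "memory");

    return oldval;
}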

We also made some minor modifications to the random backoff sleep time.

There are two extended parameters which can be tuned:
-u: unqueue_retry, the maximum number of spin retries before the waiter unqueues
    itself from the lock
-s: max_sleep_us, the maximum sleep time in microseconds after an osq node
    unqueues itself
Because the backoff sleep after unqueueing significantly changes osq_lock
performance, this parameter needs to be fine-tuned via '-s'. The previous
default of 10us may not reflect real performance, so the sleep is now
disabled by default.
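
A rough sketch of how such a backoff could be implemented (hypothetical
names; not the benchmark's actual code): a max_sleep_us of zero, the new
default, skips the sleep entirely, otherwise the node sleeps a random
duration of up to max_sleep_us microseconds after unqueueing.

#include <stdlib.h>
#include <unistd.h>

static unsigned long max_sleep_us;   /* set from '-s'; 0 (the default) disables the sleep */

static void backoff_after_unqueue(void)
{
    if (max_sleep_us == 0)
        return;                                        /* sleep disabled by default */
    usleep((useconds_t)(rand() % max_sleep_us) + 1);   /* sleep 1..max_sleep_us us  */
}
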
Merged initial lh_osq_lock; unqueue_retry and the backoff max_sleep time can be tuned to emulate the kernel mutex optimistic_spin_queue.
@geoffreyblake (Contributor) commented:

Can one of the admins verify this patch?

@geoffreyblake (Contributor) commented:

This is fine as long as it works with all the other software in the repository that was merged recently.

zoybai added 3 commits June 8, 2018 17:52
1. A detailed introduction comment has been added before the osq_lock.h
implementation.

2. Redefined all functions and variables as static to limit their scope.

3. Modified variable names to better reflect their use.

zoybai commented Jun 8, 2018

All comments have been addressed and a new introduction has been added to osq_lock.h.

@zoybai zoybai merged commit ec2760c into master Jun 8, 2018
