
Conversation

@zoybai zoybai commented Jun 7, 2018

Also need to fix the redefinition warnings in osq_lock and tbb.

zoybai and others added 16 commits May 22, 2018 16:28
Sync to upstream ARM-software synchronization-benchmarks latest master
TODO: There is an optimization bug for osq_lock: gcc would optimize
out the lock pointer variable and cause a segmentation fault. Even if we
use volatile, this identifier is discarded after the WRITE_ONCE
macro. Adding an extraneous printf("%llx", lock) could solve this
issue; however, we chose the more conservative method, #pragma
GCC optimize ("O0"), in the end.
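
As a rough sketch of that workaround (hypothetical names, not the actual
osq_lock.h code), placing the pragma ahead of the affected definitions
compiles them at -O0 so the lock pointer is never optimized away:

#pragma GCC optimize ("O0")   /* compile every function below at -O0 */

/* Hypothetical spin helper illustrating the pattern that broke at higher
 * optimization levels: without the pragma, gcc could elide the 'lock'
 * pointer after the WRITE_ONCE-style store. */
static void osq_spin_example(volatile unsigned long *lock)
{
    __atomic_store_n(lock, 1UL, __ATOMIC_RELEASE);         /* WRITE_ONCE-style store */
    while (__atomic_load_n(lock, __ATOMIC_ACQUIRE) != 0UL)
        ;                                                   /* spin until cleared */
}
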
Linux kernel restricts llsc cmpxchg functions to use x16/x17
Tell the compiler to treat all general purpose registers (with the
exception of the IP registers, which are already handled by the caller
in case of a PLT) as callee-saved, which allows for efficient runtime
patching of the bl instruction in the caller with an atomic instruction
when supported by the CPU. Result and argument registers are handled
correctly, based on the function prototype.

lib-$(CONFIG_ARM64_LSE_ATOMICS) += atomic_ll_sc.o
CFLAGS_atomic_ll_sc.o   := -fcall-used-x0 -ffixed-x1 -ffixed-x2         \
                   -ffixed-x3 -ffixed-x4 -ffixed-x5 -ffixed-x6          \
                   -ffixed-x7 -fcall-saved-x8 -fcall-saved-x9           \
                   -fcall-saved-x10 -fcall-saved-x11 -fcall-saved-x12   \
                   -fcall-saved-x13 -fcall-saved-x14 -fcall-saved-x15   \
                   -fcall-saved-x18 -fomit-frame-pointer

It also defines __LL_SC_CLOBBERS as "x16", "x17", "x30" to keep x16/x17
from being used as local registers across these calls.

Therefore the Linux kernel's __ll_sc___cmpxchg_case_4 variants will not use
registers such as x3 or x4 as temporaries, e.g.:

Dump of assembler code for function __ll_sc___cmpxchg_case_acq_4:
0xffff000008a91950 <+0>:     11 00 80 f9     prfm pstl1strm, [x0]
0xffff000008a91954 <+4>:     10 fc 5f 88     ldaxr w16, [x0]
0xffff000008a91958 <+8>:     11 02 01 4a     eor  w17, w16, w1
0xffff000008a9195c <+12>:    71 00 00 35     cbnz w17, 0xffff000008a91968 <+24>
0xffff000008a91960 <+16>:    02 7c 11 88     stxr w17, w2, [x0]
0xffff000008a91964 <+20>:    91 ff ff 35     cbnz w17, 0xffff000008a91954 <+4>
0xffff000008a91968 <+24>:    e0 03 10 aa     mov  x0, x16
0xffff000008a9196c <+28>:    c0 03 5f d6     ret
End of assembler dump.

When we first ported the kernel cmpxchg.h to lockhammer's lk_cmpxchg.h, we
didn't add the above gcc parameters to restrict register allocation. The default
allocation scheme will use x3/x4 instead of x16/x17, which may conflict
with the caller's local variables. This matters because the Arm64 LSE implementation
uses a runtime binary-patching technique to substitute 3 instructions
depending on whether the system supports Armv8.1 LSE.

These 3 instructions may be bl/nop/nop on an Armv8.0 LL/SC core or mov/cas/mov
on an Armv8.1 LSE core. Not restricting register allocation while mixing a hard
BL (branch and link) with other compiler-allocated registers may corrupt the
caller's variables or pointers.

We have modified our cmpxchg implementation to disable runtime binary patching
and use the LL/SC assembly directly, without a hard branch and link. This may
perform better than the kernel implementation and is more representative
of userspace applications.
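
A minimal sketch of that direct LL/SC form (illustrative only, not the actual
lk_cmpxchg.h code): the exclusive-load/store loop is emitted inline, so there
is no BL to patch and the compiler may pick any scratch registers without
colliding with a fixed register convention.

static inline unsigned int
llsc_cmpxchg32_acq(unsigned int *ptr, unsigned int old, unsigned int new)
{
    unsigned int oldval, tmp;

    asm volatile(
    "1: ldaxr   %w[oldval], %[v]\n"             /* load-acquire exclusive     */
    "   eor     %w[tmp], %w[oldval], %w[old]\n"
    "   cbnz    %w[tmp], 2f\n"                  /* value changed: give up     */
    "   stxr    %w[tmp], %w[new], %[v]\n"       /* try to store the new value */
    "   cbnz    %w[tmp], 1b\n"                  /* lost exclusivity: retry    */
    "2:"
    : [oldval] "=&r" (oldval), [tmp] "=&r" (tmp), [v] "+Q" (*ptr)
    : [old] "r" (old), [new] "r" (new)
    : "memory");

    return oldval;
}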

We also made some minor modifications to the random backoff sleep time.

There are two extended parameters which can be tuned:
-u: unqueue_retry, the maximum number of spin retries before the waiter unqueues
    itself from the lock
-s: max_sleep_us, the maximum sleep time in microseconds after an osq node
    unqueues itself
Because the backoff sleep after unqueueing significantly changes osq_lock
performance, this parameter needs to be fine-tuned via '-s'. The previous
default of 10us may not reflect real performance, so the sleep is now
disabled by default.
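
A rough sketch of how such a backoff could be implemented (hypothetical
names; not the benchmark's actual code): a max_sleep_us of zero, the new
default, skips the sleep entirely, otherwise the node sleeps a random
duration of up to max_sleep_us microseconds after unqueueing.

#include <stdlib.h>
#include <unistd.h>

static unsigned long max_sleep_us;   /* set from '-s'; 0 (the default) disables the sleep */

static void backoff_after_unqueue(void)
{
    if (max_sleep_us == 0)
        return;                                        /* sleep disabled by default */
    usleep((useconds_t)(rand() % max_sleep_us) + 1);   /* sleep 1..max_sleep_us us  */
}
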
Merged initial lh_osq_lock; unqueue_retry and the backoff max_sleep time can be tuned to emulate the kernel mutex optimistic_spin_queue.
@geoffreyblake (Contributor) commented:

Can one of the admins verify this patch?

@geoffreyblake (Contributor) commented:

This is fine as long as it works with all the other software in the repository that was merged recently.

zoybai added 3 commits June 8, 2018 17:52
1. A detailed introduction comment has been added before the osq_lock.h
implementation.

2. Redefined all functions and variables as static to limit their scope.

3. Modified variable names to better reflect their use.

zoybai commented Jun 8, 2018

All comments have been addressed and a new introduction has been added to osq_lock.h.

@zoybai zoybai merged commit ec2760c into master Jun 8, 2018
