Add <mstd_tuple> and ARMC5 <tuple> #11265

Merged
0xc0170 merged 3 commits into ARMmbed:master from kjbracey-arm:tuple on Aug 28, 2019

Conversation

@kjbracey-arm
Contributor

commented Aug 20, 2019

Description

tuples will be useful for things like mbed::Event and mbed::Callback - storing parameter packs from variadic templates.

Create a C++14(ish) <tuple> for ARMC5, and a <mstd_tuple> that adds apply and make_from_tuple from C++17.
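
As a rough illustration of the use case described above, the sketch below stores a parameter pack in a tuple and later unpacks it with `apply`, and builds an object with `make_from_tuple`. It uses the C++17 `std::` versions; it is an assumption here that the `mstd::` equivalents from `<mstd_tuple>` have the same interface. `DeferredCall`, `defer`, `Point` and `add` are hypothetical names, not part of Mbed OS.

```cpp
#include <cassert>
#include <tuple>
#include <utility>

// Hypothetical deferred call: stores a callable plus its arguments by value.
template <typename F, typename... Args>
class DeferredCall {
public:
    explicit DeferredCall(F f, Args... args)
        : f_(std::move(f)), args_(std::move(args)...) {}

    // Unpack the stored tuple back into an ordinary argument list.
    auto operator()() { return std::apply(f_, args_); }

private:
    F f_;
    std::tuple<Args...> args_;  // the parameter pack, stored as a tuple
};

// Helper so callers do not have to spell out the template arguments.
template <typename F, typename... Args>
DeferredCall<F, Args...> defer(F f, Args... args)
{
    return DeferredCall<F, Args...>(std::move(f), std::move(args)...);
}

struct Point {
    int x;
    int y;
    Point(int x_, int y_) : x(x_), y(y_) {}
};

inline int add(int a, int b) { return a + b; }

inline void tuple_usage_example()
{
    auto call = defer(add, 40, 2);  // F is deduced as int (*)(int, int)
    assert(call() == 42);

    // make_from_tuple passes the tuple's elements to Point's constructor.
    Point p = std::make_from_tuple<Point>(std::make_tuple(3, 4));
    assert(p.x == 3 && p.y == 4);
}
```

Storing the arguments by value keeps the sketch simple; a real callback type would have to decide how to treat references and move-only arguments.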

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[X] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Release Notes

This is an extension to #11039, and doesn't need any further notes beyond that.

@ciarmcom ciarmcom requested review from ARMmbed/mbed-os-maintainers Aug 20, 2019
@ciarmcom

Member

commented Aug 20, 2019

@kjbracey-arm, thank you for your changes.
@ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

Member

left a comment

Early review, it looks very good. I still need to review the infamous tuple_cat.

platform/cxxsupport/TOOLCHAIN_ARMC5/tuple (outdated)
struct is_tuple : std::false_type { };

template <typename... T>
struct is_tuple<tuple<int, T...>> : std::true_type { };

@pan-

pan- Aug 21, 2019

Member

I'm not sure I understand why we need the int or index_sequence as the first template parameter; a tuple can begin with a different template argument, or even have no arguments at all: tuple<>

@kjbracey-arm

kjbracey-arm Aug 21, 2019

Author Contributor

I think that was intended to be tuple_base<index_sequence<I...>, T...>
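
To illustrate the point raised in this review thread, here is a minimal sketch of an `is_tuple` trait that matches every tuple, including the empty one. It is written against `std::tuple` rather than the ARMC5 header under review, and it is not the fix that was actually applied in the PR (the author's reply suggests the original was meant to match an internal `tuple_base`).

```cpp
#include <tuple>
#include <type_traits>

// Primary template: anything that is not a std::tuple.
template <typename T>
struct is_tuple : std::false_type { };

// Partial specialisation: matches any tuple, with any number of element
// types, so tuple<> and tuple<char, int> are both covered.
template <typename... Types>
struct is_tuple<std::tuple<Types...>> : std::true_type { };

static_assert(is_tuple<std::tuple<>>::value, "empty tuple is a tuple");
static_assert(is_tuple<std::tuple<char, int>>::value, "need not start with int");
static_assert(!is_tuple<int>::value, "int is not a tuple");
```

A specialisation on `tuple<int, T...>` would only match tuples whose first element type is `int`, which is the problem the comment points out.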

@kjbracey-arm

Contributor Author

commented Aug 21, 2019

the infamous tuple_cat

It was fun, but I don't think you want to ever use that implementation. It's there for completeness, but without constexpr, and with ARMC5's rather limited optimisation past inlining, it doesn't generate nice code. Indeed that applies to some extent to the entirety of tuple.

Other compilers can do quite a good job with tuple, but even their tuple_cat is rarely going to be run-time efficient. Best only used with constant inputs for compile-time evaluation.
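
A small sketch of the "constant inputs, compile-time evaluation" case mentioned above. It assumes a C++14 standard library in which `make_tuple`, `tuple_cat` and `get` are `constexpr`; that is not the ARMC5 implementation from this PR, which has no `constexpr`. The variable names are illustrative only.

```cpp
#include <tuple>

// All inputs are constants, so the concatenation can be done by the compiler.
constexpr auto head = std::make_tuple(1, 2);
constexpr auto tail = std::make_tuple(3L, 'x');
constexpr auto joined = std::tuple_cat(head, tail);

// The result is tuple<int, int, long, char>, checked at compile time.
static_assert(std::tuple_size<decltype(joined)>::value == 4, "four elements");
static_assert(std::get<0>(joined) == 1, "first element comes from head");
static_assert(std::get<3>(joined) == 'x', "last element comes from tail");
```

Because everything here is evaluated during compilation, no run-time concatenation code needs to be generated for it.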

One remaining oddity and potentially slight inefficiency is that tuple ends up not using the same calling ABI as a normal struct, in any ARM compiler; fn(tuple<int, int>) doesn't pass the tuple in R0 and R1, but instead passes a pointer to the tuple in R0.

Seems weird, and contrary to all the ARM EABI docs I can find, which say that C/C++ structures are always passed embedded in the parameter list - only languages with non-constant size structures are supposed to pass them by-value via a pointer. (puzzled emoji here)

@pan-

Member

commented Aug 21, 2019

Code generated with a regular struct is very similar (somehow tuple code is smaller 😕 ): https://godbolt.org/z/puSLsq

Edit: similar when using a temporary: https://godbolt.org/z/Gg_GRt
Edit2: I need some rest, ldmdb is used for the struct version; I'm surprised compilers are not able to optimise that.

@kjbracey-arm

Contributor Author

commented Aug 22, 2019

Code generated with a regular struct is very similar

Largely because GCC is bad at handling structs. With the normal EABI, the actual tuple or struct on the stack can in principle be eliminated altogether.

ARMC5 would happily do this - it has a structure-splitting pass that can break a structure v into separate int v_a and int v_b variables, letting them be subject to full scalar-object optimisation. (As long as v is only ever processed by value, so no-one ever needs to address the structure as a whole).

So with the normal ABI, I would expect that to compile to:

 MOV r0, #42
 MOV r1, #87
 B foo

That would be impossible with the modified tuple ABI you see.
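
For reference, the kind of source under discussion might look like the sketch below. The Godbolt listings are not reproduced in this thread, so `Pair`, `foo` and `bar` are assumed names and this is only a guess at their shape. `foo` is given a body here purely so that the example links; in the discussion it is an external function.

```cpp
#include <cassert>

// A small uniform structure: two words that could live in R0 and R1.
struct Pair {
    int a;
    int b;
};

// Stand-in callee that takes the structure by value.
int foo(Pair v)
{
    return v.a + v.b;
}

// v is only ever handled by value, so a structure-splitting pass could
// treat v.a and v.b as two independent scalars and never build v in memory.
int bar()
{
    Pair v = {42, 87};
    return foo(v);
}

inline void pair_usage_example()
{
    assert(bar() == 129);
}
```

Whether a given compiler actually reduces `bar` to two register loads and a tail call depends on the toolchain and target, as the rest of the thread explores.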

The ARM compiler added this optimisation at the same time as it added 64-bit integers. They are treated as two-word structures initially, but that loses loads of optimisations. The structure-splitting pass re-optimises them, as long as you don't take their address, and is generalised to do this for all small uniform "tuples", not just long longs.

GCC lacks this optimisation, afaict, and produces bad small struct and 64-bit int code by comparison.

Not sure how clang/ARMC6 copes.

@pan-

Member

commented Aug 22, 2019

after some tests it looks like this optimisation is present with GCC on ARM64 and x86 😞 .

Thanks for the explanation, after some tests it looks like this optimisation is present with GCC on ARM64 and x86 😞 . I've spotted more things that need to be addressed. I will list them later today.

Edit: There's also an overhead in defining the move constructor by hand (tried on GCC ARM 64). I'm not sure why and if it impacts ARMCC or not.

@kjbracey-arm

Contributor Author

commented Aug 22, 2019

Whoops - edited your comment rather than replying to it. Silly UI. That's quite confusing now :)

@kjbracey-arm

Contributor Author

commented Aug 22, 2019

There's also an overhead in defining the move constructor by hand (tried on GCC ARM 64). I'm not sure why and if it impacts ARMCC or not.

Well, I have to define it by hand for ARMC5 - it doesn't support = default for move constructor or copy.

But I should be able to avoid defining it at all for trivial structures, so it falls back to the built-in copy.

@pan-

Member

commented Aug 22, 2019

You sure it's not just fitting that 64-bit structure into a single 64-bit register? You'd need to compare with a two-int64_t structure/tuple.

Yes, that's what I tried: https://godbolt.org/z/koUtFJ (you can also try for yourself the impact of default move constructor). I guess that optimisation is in the backend and not present for 32 bits arm.

Well, I have to define it by hand for ARMC5 - it doesn't support = default for move constructor or copy.

Wouldn't it be possible to not declare it explicitly and use the old implicit instantiation mechanism ? Of course overloads that swallow a copy/move constructors have to be disabled for the type.

@kjbracey-arm

Contributor Author

commented Aug 22, 2019

I guess that optimisation is in the backend and not present for 32 bits arm.

Well, that's horrible. I imagine the optimisation may be there, but is being inhibited by sub-optimal tracking of "address-taken" attributes. It needs to be confident that no-one really needs the address. Maybe something in the ARM32 ABI leaves "address-taken" set due to an internal "memcpy" of the structure into position for the function call.

Well, I have to define it by hand for ARMC5 - it doesn't support = default for move constructor or copy.

Wouldn't it be possible to not declare it explicitly and use the old implicit instantiation mechanism ? Of course overloads that swallow a copy/move constructors have to be disabled for the type.

If I don't declare it, then I believe you just get the copy constructor. The move constructor or assignment are never implicitly generated, right? So I think I can avoid declaring it when I detect that I am both trivially_move_assignable|constructible and trivially_copy_assignable|constructible

@pan-

Member

commented Aug 22, 2019

Rules for implicitly declared move constructors are:

If no user-defined move constructors are provided for a class type (struct, class, or union), and all of the following is true:
- there are no user-declared copy constructors;
- there are no user-declared copy assignment operators;
- there are no user-declared move assignment operators;
- there are no user-declared destructors;

then the compiler will declare a move constructor as a non-explicit inline public member of its class.

So yes it can be implicitly generated.
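
The rule quoted above can be observed with a small sketch. `Tracker`, `RuleOfZero`, `WithDestructor` and `moved_not_copied` are hypothetical names used only for this illustration; none of them come from the PR.

```cpp
#include <cassert>
#include <utility>

// Records how it was constructed, so a copy can be told apart from a move.
struct Tracker {
    int copies = 0;
    int moves = 0;
    Tracker() = default;
    Tracker(const Tracker &o) : copies(o.copies + 1), moves(o.moves) {}
    Tracker(Tracker &&o) noexcept : copies(o.copies), moves(o.moves + 1) {}
};

// No special members declared: the move constructor is implicitly declared.
struct RuleOfZero {
    Tracker t;
};

// A user-declared destructor suppresses the implicit move constructor, so
// constructing from an rvalue falls back to the implicit copy constructor.
struct WithDestructor {
    Tracker t;
    ~WithDestructor() {}
};

// Returns true if constructing a T from an rvalue moved its Tracker member.
template <typename T>
bool moved_not_copied()
{
    T a;
    T b(std::move(a));
    return b.t.moves == 1 && b.t.copies == 0;
}

inline void implicit_move_example()
{
    assert(moved_not_copied<RuleOfZero>());
    assert(!moved_not_copied<WithDestructor>());
}
```

This is the behaviour the rule-of-zero approach relies on: a class that declares none of the special members gets moves from its members without writing any.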

@kjbracey-arm

Contributor Author

commented Aug 22, 2019

Hmm, I'm violating the rule of 3/5/0, aren't I? None of the tuple machinery should need to actually declare any of those, right?

I think I've only ended up like this because the standard itself violates the rule, by declaring 4 of them - copy/move assignment/construction as = default. There's no need to = default those in real code, right?

@pan-

Member

commented Aug 22, 2019

Hmm, I'm violating the rule of 3/5/0, aren't I? None of the tuple machinery should need to actually declare any of those, right?

I suppose we can remove copy/move constructors, assignment operators and follow the rule of 0 route as ARMCC doesn't allow us to use the rule of 5.

@kjbracey-arm kjbracey-arm force-pushed the kjbracey-arm:tuple branch from d4c9c30 to 849a3d0 Aug 22, 2019
@kjbracey-arm

Contributor Author

commented Aug 22, 2019

Adjusted following comments - T->Types or Ts, etc, make copy/move implicit, minor tidies.

@pan-
pan- approved these changes Aug 27, 2019
Member

left a comment

Besides minor import issues, it all looks good to me. Some tests that exercise the implementation would be useful.

*
* http://blogs.microsoft.co.il/sasha/2015/01/12/implementing-tuple-part-1/
*
* tuple_cat based on Peter Dimov's article "Simple C++11 metaprogramming",

@pan-

pan- Aug 27, 2019

Member

Thanks for including the references; they could help future maintainers. Dimov's article really is brilliant and easy to understand.

@kjbracey-arm

kjbracey-arm Aug 27, 2019

Author Contributor

Wasn't going to claim I'd devised all that myself :)

future maintainers

This is only a bridge for as long as ARMC5 limps on, so I'm not expecting a particularly extended lifetime in Mbed OS anyway.

@0xc0170 0xc0170 added needs: CI and removed needs: review labels Aug 27, 2019
tuples will be useful for things like `mbed::Event` and
`mbed::Callback` - storing parameter packs from variadic templates.

Create a C++14(ish) `<tuple>` for ARMC5, and a `<mstd_tuple>` that
adds `apply` and `make_from_tuple` from C++17.
@kjbracey-arm kjbracey-arm force-pushed the kjbracey-arm:tuple branch from 849a3d0 to 4b4859c Aug 27, 2019
@pan-
pan- approved these changes Aug 27, 2019
Member

left a comment

LGTM

@0xc0170

Member

commented Aug 27, 2019

CI started

@mbed-ci


commented Aug 27, 2019

Test run: FAILED

Summary: 1 of 4 test jobs failed
Build number : 1
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_build-IAR
@0xc0170

Member

commented Aug 28, 2019

IAR returned error -11 but no error msg? I restarted to confirm, please check.

@mbed-ci


commented Aug 28, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 2
Build artifacts

@0xc0170 0xc0170 added ready for merge and removed needs: CI labels Aug 28, 2019
@0xc0170 0xc0170 merged commit 4cdca93 into ARMmbed:master Aug 28, 2019
25 checks passed
continuous-integration/jenkins/pr-head This commit looks good
jenkins-ci/build-ARM Success
jenkins-ci/build-GCC_ARM Success
jenkins-ci/build-IAR Success
jenkins-ci/cloud-client-test Success
jenkins-ci/dynamic-memory-usage RTOS ROM(+0 bytes) RAM(+0 bytes)
jenkins-ci/exporter Success
jenkins-ci/greentea-test Success
jenkins-ci/mbed2-build-ARM Success
jenkins-ci/mbed2-build-GCC_ARM Success
jenkins-ci/mbed2-build-IAR Success
jenkins-ci/unittests Success
travis-ci/astyle Success!
travis-ci/docs Success!
travis-ci/doxy-spellcheck Success!
travis-ci/events Success! Runtime is 8625 cycles.
travis-ci/gitattributestest Success!
travis-ci/include_check Success!
travis-ci/licence_check Success!
travis-ci/littlefs Success! Code size is 8464B.
travis-ci/psa-autogen Success!
travis-ci/tools-py2.7 Success!
travis-ci/tools-py3.5 Success!
travis-ci/tools-py3.6 Success!
travis-ci/tools-py3.7 Success!
@kjbracey-arm

Contributor Author

commented Aug 29, 2019

For the record, @pan- and I did a bit more digging on the efficiency issues, and it's now understood.

The generic (i.e. Itanium) C++ ABI that ARM uses says:

If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference

A type is considered non-trivial for the purposes of calls if:

  • it has a non-trivial copy constructor, move constructor, or destructor, or
  • all of its copy and move constructors are deleted.

So, if a tuple implementation has non-trivial copy or move constructors, then the compiler has to pass it by reference. That then inhibits any structure-splitting optimisation on a tuple within a function that passes it by value to another function.

The std::tuple in GCC's library has an explicitly-declared move constructor. This makes it non-trivial, so inhibiting that optimisation. My std::tuple implementation and clang's both have default copy/move constructors, so that optimisation is not inhibited.

Therefore GCC std::tuple<long,long> performs worse than struct { long; long; } when passing by value, but clang and this one do not.
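
The distinction can be approximated with standard type traits, as in the sketch below. `ImplicitPair` and `HandWrittenPair` are hypothetical types. The traits only indicate whether the relevant constructors are trivial; the ABI's "non-trivial for the purposes of calls" test is not exposed directly by the standard library, so treat this as an indicator rather than a definitive check.

```cpp
#include <type_traits>

// Implicit copy/move constructors and destructor: all trivial, so the ABI
// rule quoted above does not force this type to be passed by reference.
struct ImplicitPair {
    long first;
    long second;
};

// Same data, but a user-provided move constructor is never trivial, so
// under that rule the caller must pass a reference to a temporary.
struct HandWrittenPair {
    long first;
    long second;
    HandWrittenPair(long a, long b) : first(a), second(b) {}
    HandWrittenPair(const HandWrittenPair &) = default;
    HandWrittenPair(HandWrittenPair &&o) : first(o.first), second(o.second) {}
};

static_assert(std::is_trivially_copy_constructible<ImplicitPair>::value, "");
static_assert(std::is_trivially_move_constructible<ImplicitPair>::value, "");
static_assert(std::is_trivially_destructible<ImplicitPair>::value, "");
static_assert(!std::is_trivially_move_constructible<HandWrittenPair>::value, "");
```

Applying the same traits to a particular library's `std::tuple<long, long>` should give a hint about which side of the rule that implementation falls on, though the result is implementation-specific.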

However, GCC has another issue, which means that the structure-splitting optimisation apparently never activates for ARM32; it does for ARM64 or x86-64. This means that tuple and struct code are broadly equivalent in GCC ARM32, but only because they're both bad. This bad code generation also affects uint64_t.

Phew.

@kjbracey-arm kjbracey-arm deleted the kjbracey-arm:tuple branch Aug 29, 2019
6 participants