Add <mstd_tuple> and ARMC5 <tuple> #11265

Merged
merged 3 commits into from
Aug 28, 2019

Conversation

kjbracey
Contributor

Description

tuples will be useful for things like mbed::Event and mbed::Callback - storing parameter packs from variadic templates.

Create a C++14(ish) <tuple> for ARMC5, and a <mstd_tuple> that adds apply and make_from_tuple from C++17.
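
As a rough illustration of the use case described above (storing a parameter pack and replaying it later), here is a minimal C++14 sketch. The names `apply_sketch`, `apply_impl`, `DeferredCall`, `defer` and `add3` are hypothetical and are not taken from the PR; the helper only mimics what C++17 `std::apply` does for plain function objects and function pointers, and it is written against the standard `<tuple>` rather than the ARMC5 replacement.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical helper: expand the tuple into a call, one index per element.
template <typename F, typename Tuple, std::size_t... I>
constexpr decltype(auto) apply_impl(F &&f, Tuple &&t, std::index_sequence<I...>)
{
    return std::forward<F>(f)(std::get<I>(std::forward<Tuple>(t))...);
}

// C++14 stand-in for C++17 std::apply (no member-pointer support).
template <typename F, typename Tuple>
constexpr decltype(auto) apply_sketch(F &&f, Tuple &&t)
{
    return apply_impl(std::forward<F>(f), std::forward<Tuple>(t),
                      std::make_index_sequence<std::tuple_size<std::decay_t<Tuple>>::value>{});
}

// Stores a callable plus a variadic argument pack in a tuple for later use.
template <typename F, typename... Args>
class DeferredCall {
public:
    explicit DeferredCall(F f, Args... args) : _f(std::move(f)), _args(std::move(args)...) { }

    // Replays the stored arguments into the stored callable.
    decltype(auto) operator()() { return apply_sketch(_f, _args); }

private:
    F _f;
    std::tuple<Args...> _args;
};

// Deduces the decayed types so the call site needs no template arguments.
template <typename F, typename... Args>
DeferredCall<std::decay_t<F>, std::decay_t<Args>...> defer(F &&f, Args &&... args)
{
    return DeferredCall<std::decay_t<F>, std::decay_t<Args>...>(std::forward<F>(f),
                                                                 std::forward<Args>(args)...);
}

inline int add3(int a, int b, int c) { return a + b + c; }

// Usage: capture now, call later.
inline void deferred_call_demo()
{
    auto call = defer(add3, 1, 2, 3);
    assert(call() == 6);
}
```

The arguments are stored by value here, which is an assumption made for simplicity; an event or callback class may well want different ownership rules.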

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[X] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Release Notes

This is an extension to #11039, and doesn't need any further notes beyond that.

@ciarmcom ciarmcom requested review from a team August 20, 2019 15:00
@ciarmcom
Member

@kjbracey-arm, thank you for your changes.
@ARMmbed/mbed-os-core @ARMmbed/mbed-os-maintainers please review.

Member

@pan- pan- left a comment

Early review, it looks very good. I still need to review the infamous tuple_cat.

platform/cxxsupport/TOOLCHAIN_ARMC5/tuple
struct is_tuple : std::false_type { };

template <typename... T>
struct is_tuple<tuple<int, T...>> : std::true_type { };
Member

I'm not sure I understand why we need the int or index_sequence as the first template parameter; a tuple can begin with a different template argument, or even have no arguments at all: tuple<>

Contributor Author

I think that was intended to be tuple_base<index_sequence<I...>, T...>
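
For illustration only, a trait that recognises every tuple, whatever its first element and including the empty `tuple<>`, could look like the sketch below. It is written against `std::tuple` as an assumption; the code under review specialises on the PR's internal types instead, and that version is not reproduced here.

```cpp
#include <tuple>
#include <type_traits>

// Primary template: an arbitrary type is not a tuple.
template <typename T>
struct is_tuple : std::false_type { };

// Partial specialisation: matches any std::tuple, including the empty one.
template <typename... Types>
struct is_tuple<std::tuple<Types...>> : std::true_type { };

static_assert(is_tuple<std::tuple<>>::value, "the empty tuple is a tuple");
static_assert(is_tuple<std::tuple<char, int>>::value, "first element need not be int");
static_assert(!is_tuple<int>::value, "a plain int is not a tuple");
```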

@kjbracey
Contributor Author

the infamous tuple_cat

It was fun, but I don't think you want to ever use that implementation. It's there for completeness, but without constexpr, and with ARMC5's rather limited optimisation past inlining, it doesn't generate nice code. Indeed that applies to some extent to the entirety of tuple.

Other compilers can do quite a good job with tuple, but even their tuple_cat is rarely going to be run-time efficient. Best only used with constant inputs for compile-time evaluation.
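
A small sketch of the "constant inputs, compile-time evaluation" case, assuming a C++14 toolchain whose standard library provides a constexpr `std::tuple_cat` (so not ARMC5, which lacks constexpr as noted above). The variable name `joined` is purely illustrative.

```cpp
#include <tuple>

// Both inputs are constant expressions, so the concatenation can be done
// entirely by the compiler and no run-time tuple shuffling is needed.
constexpr auto joined = std::tuple_cat(std::make_tuple(1, 2),
                                       std::make_tuple(3L, 'x'));

static_assert(std::tuple_size<decltype(joined)>::value == 4, "four elements in total");
static_assert(std::get<0>(joined) == 1, "first element comes from the first tuple");
static_assert(std::get<3>(joined) == 'x', "last element comes from the second tuple");
```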

One remaining oddity and potentially slight inefficiency is that tuple ends up not using the same calling ABI as a normal struct, in any ARM compiler; fn(tuple<int, int>) doesn't pass the tuple in R0 and R1, but instead passes a pointer to the tuple in R0.

Seems weird, and contrary to all the ARM EABI docs I can find, which say that C/C++ structures are always passed embedded in the parameter list - only languages with non-constant size structures are supposed to pass them by-value via a pointer. (puzzled emoji here)

@pan-
Member

pan- commented Aug 21, 2019

Code generated with a regular struct is very similar (somehow tuple code is smaller 😕 ): https://godbolt.org/z/puSLsq

Edit: similar when using a temporary: https://godbolt.org/z/Gg_GRt
Edit2: I need some rest, ldmdb is used for the struct version; I'm surprised compilers are not able to optimise that.

@kjbracey
Contributor Author

kjbracey commented Aug 22, 2019

Code generated with a regular struct is very similar

Largely because GCC is bad at handling structs. With the normal EABI, the actual tuple or struct on the stack can in principle be eliminated altogether.

ARMC5 would happily do this - it has a structure-splitting pass that can break a structure v into separate int v_a and int v_b variables, letting them be subject to full scalar-object optimisation. (As long as v is only ever processed by value, so no-one ever needs to address the structure as a whole).

So with the normal ABI, I would expect that to compile to:

 MOV r0, #42
 MOV r1, #87
 B foo

That would be impossible with the modified tuple ABI you see.
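
The kind of source this refers to might look like the sketch below. This is an assumed shape, not the contents of the Compiler Explorer links above; `IntPair`, `last_seen` and `pass_by_value_demo` are hypothetical names, and `foo` is given a trivial body only so the snippet is self-contained.

```cpp
#include <cassert>

// Hypothetical two-int aggregate standing in for tuple<int, int>.
struct IntPair {
    int a;
    int b;
};

// Records what the callee received. In a real code-generation experiment
// foo() would live in another translation unit so that it cannot be inlined.
static IntPair last_seen = {0, 0};

void foo(IntPair v)
{
    last_seen = v;
}

// v is only ever handled by value, so nothing needs its address; with the
// normal struct ABI the two members can travel in R0 and R1.
void bar()
{
    IntPair v = {42, 87};
    foo(v);
}

inline void pass_by_value_demo()
{
    bar();
    assert(last_seen.a == 42 && last_seen.b == 87);
}
```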

The ARM compiler added this optimisation at the same time as it added 64-bit integers. They are treated as two-word structures initially, but that loses loads of optimisations. The structure-splitting pass re-optimises them, as long as you don't take their address, and is generalised to do this for all small uniform "tuples", not just long longs.

GCC lacks this optimisation, afaict, and produces bad small struct and 64-bit int code by comparison.

Not sure how clang/ARMC6 copes.

@pan-
Member

pan- commented Aug 22, 2019

after some tests it looks like this optimisation is present with GCC on ARM64 and x86 😞 .

Thanks for the explanation, after some tests it looks like this optimisation is present with GCC on ARM64 and x86 😞 . I've spotted more things that need to be addressed. I will list them later today.

Edit: There's also an overhead in defining the move constructor by hand (tried on GCC ARM 64). I'm not sure why and if it impacts ARMCC or not.

@kjbracey
Contributor Author

Whoops - edited your comment rather than replying to it. Silly UI. That's quite confusing now :)

@kjbracey
Contributor Author

There's also an overhead in defining the move constructor by hand (tried on GCC ARM 64). I'm not sure why and if it impacts ARMCC or not.

Well, I have to define it by hand for ARMC5 - it doesn't support = default for move constructor or copy.

But I should be able to avoid defining it at all for trivial structures, so it falls back to the built-in copy.

@pan-
Member

pan- commented Aug 22, 2019

You sure it's not just fitting that 64-bit structure into a single 64-bit register? You'd need to compare with a two-int64_t structure/tuple.

Yes, that's what I tried: https://godbolt.org/z/koUtFJ (you can also try for yourself the impact of default move constructor). I guess that optimisation is in the backend and not present for 32 bits arm.

Well, I have to define it by hand for ARMC5 - it doesn't support = default for move constructor or copy.

Wouldn't it be possible to not declare it explicitly and use the old implicit instantiation mechanism? Of course, overloads that swallow copy/move constructors have to be disabled for the type.

@kjbracey
Contributor Author

kjbracey commented Aug 22, 2019

I guess that optimisation is in the backend and not present for 32 bits arm.

Well, that's horrible. I imagine the optimisation may be there, but is being inhibited by sub-optimal tracking of "address-taken" attributes. It needs to be confident that no-one really needs the address. Maybe something in the ARM32 ABI leaves "address-taken" set due to an internal "memcpy" of the structure into position for the function call.

Well, I have to define it by hand for ARMC5 - it doesn't support = default for move constructor or copy.

Wouldn't it be possible to not declare it explicitly and use the old implicit instantiation mechanism? Of course, overloads that swallow copy/move constructors have to be disabled for the type.

If I don't declare it, then I believe you just get the copy constructor. The move constructor or assignment are never implicitly generated, right? So I think I can leave it undeclared when I detect that I am both trivially_move_assignable|constructible and trivially_copy_assignable|constructible.

@pan-
Member

pan- commented Aug 22, 2019

Rules for implicitly declared move constructors are:

If no user-defined move constructors are provided for a class type (struct, class, or union), and all of the following is true:
- there are no user-declared copy constructors;
- there are no user-declared copy assignment operators;
- there are no user-declared move assignment operators;
- there are no user-declared destructors;

then the compiler will declare a move constructor as a non-explicit inline public member of its class.

So yes it can be implicitly generated.
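
A small check of those rules, as a sketch with hypothetical type names: a class that declares none of the special members gets an implicit move constructor, while declaring just a destructor suppresses it. A move-only member is used so that the difference is observable through the standard type traits.

```cpp
#include <memory>
#include <type_traits>

// No user-declared special members: the move constructor is implicit.
struct RuleOfZero {
    std::unique_ptr<int> p;
};

// A user-declared destructor suppresses the implicit move constructor. The
// implicit copy constructor is deleted because unique_ptr is not copyable,
// so this type ends up neither movable nor copyable.
struct WithDestructor {
    std::unique_ptr<int> p;
    ~WithDestructor() { }
};

static_assert(std::is_move_constructible<RuleOfZero>::value, "implicit move constructor");
static_assert(!std::is_copy_constructible<RuleOfZero>::value, "move-only member");
static_assert(!std::is_move_constructible<WithDestructor>::value, "no implicit move");
```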

@kjbracey
Contributor Author

Hmm, I'm violating the rule of 3/5/0, aren't I? None of the tuple machinery should need to actually declare any of those, right?

I think I've only ended up like this because the standard itself violates the rule, by declaring 4 of them - copy/move assignment/construction as = default. There's no need to = default those in real code, right?

@pan-
Member

pan- commented Aug 22, 2019

Hmm, I'm violating the rule of 3/5/0, aren't I? None of the tuple machinery should need to actually declare any of those, right?

I suppose we can remove copy/move constructors, assignment operators and follow the rule of 0 route as ARMCC doesn't allow us to use the rule of 5.

@kjbracey
Contributor Author

Adjusted following comments - T->Types or Ts, etc, make copy/move implicit, minor tidies.

Member

@pan- pan- left a comment

Besides minor import issues, it all looks good to me. Some tests that exercise the implementation would be useful.

*
* http://blogs.microsoft.co.il/sasha/2015/01/12/implementing-tuple-part-1/
*
* tuple_cat based on Peter Dimov's article "Simple C++11 metaprogramming",
Member

Thanks for putting the references; it could help future maintainers. Dimov's article really is brilliant and easy to understand.

Contributor Author

Wasn't going to claim I'd devised all that myself :)

future maintainers

This is only a bridge for as long as ARMC5 limps on, so I'm not expecting a particularly extended lifetime in Mbed OS anyway.

Member

@pan- pan- left a comment

LGTM

@0xc0170
Contributor

0xc0170 commented Aug 27, 2019

CI started

@mbed-ci

mbed-ci commented Aug 27, 2019

Test run: FAILED

Summary: 1 of 4 test jobs failed
Build number : 1
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_build-IAR

@0xc0170
Contributor

0xc0170 commented Aug 28, 2019

IAR returned error -11 but no error msg? I restarted to confirm, please check.

@mbed-ci

mbed-ci commented Aug 28, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 2
Build artifacts

@0xc0170 0xc0170 merged commit 4cdca93 into ARMmbed:master Aug 28, 2019
@kjbracey
Contributor Author

For the record, @pan- and I did a bit more digging on the efficiency issues, and it's now understood.

The generic (ie Itanium) C++ ABI that ARM uses says:

If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference

A type is considered non-trivial for the purposes of calls if:

  • it has a non-trivial copy constructor, move constructor, or destructor, or
  • all of its copy and move constructors are deleted.

So, if a tuple implementation has non-trivial copy or move constructors, then the compiler has to pass it by reference. That then inhibits any structure-splitting optimisation on a tuple within a function that passes it by value to another function.

The std::tuple in GCC's library has an explicitly-declared move constructor. This makes it non-trivial, so inhibiting that optimisation. My std::tuple implementation and clang's both have default copy/move constructors, so that optimisation is not inhibited.

Therefore GCC std::tuple<long,long> performs worse than struct { long; long; } when passing by value, but clang and this one do not.
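
To make the distinction concrete, here is a sketch with two hypothetical pair types. The standard type traits are used as a proxy for the ABI's "non-trivial for the purposes of calls" test; they are close to it for this case but are not the ABI definition itself.

```cpp
#include <type_traits>

// Plain aggregate: every special member is implicit and trivial, so the
// calling convention may pass it like any other small struct.
struct PlainPair {
    long a;
    long b;
};

// Same layout, but with a hand-written (user-provided) move constructor.
// That makes the type non-trivial for calls, so the caller has to create a
// temporary and pass a reference to it instead.
struct HandMovedPair {
    long a;
    long b;
    HandMovedPair(long x, long y) : a(x), b(y) { }
    HandMovedPair(const HandMovedPair &) = default;
    HandMovedPair(HandMovedPair &&other) : a(other.a), b(other.b) { }
};

static_assert(std::is_trivially_copy_constructible<PlainPair>::value, "trivial copy");
static_assert(std::is_trivially_move_constructible<PlainPair>::value, "trivial move");
static_assert(std::is_trivially_copy_constructible<HandMovedPair>::value, "defaulted copy");
static_assert(!std::is_trivially_move_constructible<HandMovedPair>::value, "user-provided move");
```

Compiling a function that takes each type by value and comparing the generated code is one way to see the difference in practice.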

However, GCC has another issue, which means that the structure-splitting optimisation apparently never activates for ARM32; it does for ARM64 or x86-64. This means that tuple and struct code are broadly equivalent in GCC ARM32, but only because they're both bad. This bad code generation also affects uint64_t.

Phew.

@kjbracey kjbracey deleted the tuple branch August 29, 2019 08:43