-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reorder SpatialVelocity::Shift to avoid issues with ARM #21436
Reorder SpatialVelocity::Shift to avoid issues with ARM #21436
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+@sherm1 for feature review
cc @jwnimmer-tri for further comment
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)
multibody/math/spatial_velocity.h
line 114 at r1 (raw file):
auto ret = SpatialVelocity<T>(*this); ret.ShiftInPlace(offset); return ret;
Thanks for figuring out a workaround -- I'm sure it wasn't easy!
If this spelling is required to keep some version of a compiler from generating bad ARM code, we can include it but it's very important to explain why it is here in detail (in comments next to the code). Otherwise the next programmer to come along might put it back the way it was to tidy up the code! Also, compiler bugs are usually specific to particular compilers and particular versions of those compilers. We need to know what was broken when so we can safely remove the hack later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)
multibody/math/spatial_velocity.h
line 114 at r1 (raw file):
Previously, sherm1 (Michael Sherman) wrote…
Thanks for figuring out a workaround -- I'm sure it wasn't easy!
If this spelling is required to keep some version of a compiler from generating bad ARM code, we can include it but it's very important to explain why it is here in detail (in comments next to the code). Otherwise the next programmer to come along might put it back the way it was to tidy up the code! Also, compiler bugs are usually specific to particular compilers and particular versions of those compilers. We need to know what was broken when so we can safely remove the hack later.
My current intuition is that the old code here was fine and it's just the tests that are broken (by requiring a too-tight tolerance). Next steps here for me are to collect the specific evidence of failures (a more specific inventory of exactly failures and their out-of-tolerance violations) for more careful study.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)
multibody/math/spatial_velocity.h
line 114 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
My current intuition is that the old code here was fine and it's just the tests that are broken (by requiring a too-tight tolerance). Next steps here for me are to collect the specific evidence of failures (a more specific inventory of exactly failures and their out-of-tolerance violations) for more careful study.
Well okay, looking more closely at the test failures this fixes, now it does smell more like compiler bug than a tolerance thing. I'll continue exploring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)
multibody/math/spatial_velocity.h
line 114 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Well okay, looking more closely at the test failures this fixes, now it does smell more like compiler bug than a tolerance thing. I'll continue exploring.
Oh, hmm. Possibly it's because of the reinterpret_casts in SpatialVector's translational() and rotational() accessors? Those smell like UB.
They date back from 2017, but maybe are finally biting us now:
https://reviewable.io/reviews/robotlocomotion/drake/5143#-Kcskz2q3qjn-hZpvEOL:-Kcskz2rHvw3J2D5EG58:b9r61t6
I'll see if I can figure out any way to check if those are the problem or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)
multibody/math/spatial_velocity.h
line 114 at r1 (raw file):
@sherm1
... tidy up the code ...
The current version on master is actually the untidy one. It's calling the copy constructor twice for no apparent reason. One copy happens from SpatialVelocity<T>(*this)
as directly apparent in the code, but a second copy happens as the SpatialVelocity<T>&
temporary is copied into the return value.
Even without the seeming miscompilation on arm64, this new spelling is shorter by 2 mov instructions (on both x86_64 and arm64). With fat objects like autodiff the improvement is probably even more apparent.
The pattern return foo.BooInPlace()
is absurd in every circumstance, so we should just nuke it across the entire source tree. I'll open a separate PR myself with that cleanup. There's only about 15 instances, all in this same general area.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)
multibody/math/spatial_velocity.h
line 114 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
... tidy up the code ...
The current version on master is actually the untidy one. It's calling the copy constructor twice for no apparent reason. One copy happens from
SpatialVelocity<T>(*this)
as directly apparent in the code, but a second copy happens as theSpatialVelocity<T>&
temporary is copied into the return value.Even without the seeming miscompilation on arm64, this new spelling is shorter by 2 mov instructions (on both x86_64 and arm64). With fat objects like autodiff the improvement is probably even more apparent.
The pattern
return foo.BooInPlace()
is absurd in every circumstance, so we should just nuke it across the entire source tree. I'll open a separate PR myself with that cleanup. There's only about 15 instances, all in this same general area.
=> #21438
Closing in favor of #21438 |
The method
Shift
fromSpatialVelocity
seems to have unexpected behavior in some specific situations when running underarm64
. This seemed to be the root cause of several tests failing, even tho the actual tests formath/spatial_velocity
wouldn't fail.The way this has been found was through some failing test cases in
//multibody/plant:energy_and_power_test
(for example this one). The failure would come from the correct results coming out from the simulation, but the expected value wrongly calculated. By debugging and comparing results between thex86
andarm64
runs, the value of the variablev_QW
in this line was found different between architectures.I couldn't figure out if there is a compiler optimization/reorder of operations going on (although this would still happen when building in debug) but something interesting to note is that calculating the same value once again (meaning, duplicating the line) would cause different results. The variable that then is used along the tests would hold a wrong value, whereas the duplicated one would hold the expected value and no other call to the same operation over the same
SpatialVelocity
would cause any other unexpected behavior. Another option would be an issue with lifetime, since we're mixing a temporal, with a change to a reference and then returning by value, but that wouldn't explain why it works forx86
and even more, why it works as expected when called repeatedly.The proposed solution solves several tests in the codebase (again, for arm64) but there's no really foundation on why the previous approach wouldn't work, so any feedback is welcomed.
The lists of tests that, at least locally, get solved by this change are:
//multibody/plant:box_test
//multibody/plant:compliant_contact_manager_test
//multibody/plant:energy_and_power_test
//multibody/plant:inclined_plane_test
//multibody/plant:multibody_plant_com_test
//multibody/plant:multibody_plant_jacobians_test
//multibody/plant:multibody_plant_reaction_forces_test
This change is