Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder SpatialVelocity::Shift to avoid issues with ARM #21436

Conversation

cbarcelo-bdai
Copy link
Contributor

@cbarcelo-bdai cbarcelo-bdai commented May 15, 2024

The method Shift from SpatialVelocity seems to have unexpected behavior in some specific situations when running under arm64. This seemed to be the root cause of several tests failing, even tho the actual tests for math/spatial_velocity wouldn't fail.

The way this has been found was through some failing test cases in //multibody/plant:energy_and_power_test (for example this one). The failure would come from the correct results coming out from the simulation, but the expected value wrongly calculated. By debugging and comparing results between the x86 and arm64 runs, the value of the variable v_QW in this line was found different between architectures.
I couldn't figure out if there is a compiler optimization/reorder of operations going on (although this would still happen when building in debug) but something interesting to note is that calculating the same value once again (meaning, duplicating the line) would cause different results. The variable that then is used along the tests would hold a wrong value, whereas the duplicated one would hold the expected value and no other call to the same operation over the same SpatialVelocity would cause any other unexpected behavior. Another option would be an issue with lifetime, since we're mixing a temporal, with a change to a reference and then returning by value, but that wouldn't explain why it works for x86 and even more, why it works as expected when called repeatedly.

The proposed solution solves several tests in the codebase (again, for arm64) but there's no really foundation on why the previous approach wouldn't work, so any feedback is welcomed.

The lists of tests that, at least locally, get solved by this change are:

  • //multibody/plant:box_test
  • //multibody/plant:compliant_contact_manager_test
  • //multibody/plant:energy_and_power_test
  • //multibody/plant:inclined_plane_test
  • //multibody/plant:multibody_plant_com_test
  • //multibody/plant:multibody_plant_jacobians_test
  • //multibody/plant:multibody_plant_reaction_forces_test

This change is Reviewable

@jwnimmer-tri jwnimmer-tri self-assigned this May 15, 2024
@sherm1 sherm1 self-assigned this May 15, 2024
Copy link
Member

@sherm1 sherm1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+@sherm1 for feature review
cc @jwnimmer-tri for further comment

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)


multibody/math/spatial_velocity.h line 114 at r1 (raw file):

    auto ret = SpatialVelocity<T>(*this);
    ret.ShiftInPlace(offset);
    return ret;

Thanks for figuring out a workaround -- I'm sure it wasn't easy!
If this spelling is required to keep some version of a compiler from generating bad ARM code, we can include it but it's very important to explain why it is here in detail (in comments next to the code). Otherwise the next programmer to come along might put it back the way it was to tidy up the code! Also, compiler bugs are usually specific to particular compilers and particular versions of those compilers. We need to know what was broken when so we can safely remove the hack later.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)


multibody/math/spatial_velocity.h line 114 at r1 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

Thanks for figuring out a workaround -- I'm sure it wasn't easy!
If this spelling is required to keep some version of a compiler from generating bad ARM code, we can include it but it's very important to explain why it is here in detail (in comments next to the code). Otherwise the next programmer to come along might put it back the way it was to tidy up the code! Also, compiler bugs are usually specific to particular compilers and particular versions of those compilers. We need to know what was broken when so we can safely remove the hack later.

My current intuition is that the old code here was fine and it's just the tests that are broken (by requiring a too-tight tolerance). Next steps here for me are to collect the specific evidence of failures (a more specific inventory of exactly failures and their out-of-tolerance violations) for more careful study.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)


multibody/math/spatial_velocity.h line 114 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

My current intuition is that the old code here was fine and it's just the tests that are broken (by requiring a too-tight tolerance). Next steps here for me are to collect the specific evidence of failures (a more specific inventory of exactly failures and their out-of-tolerance violations) for more careful study.

Well okay, looking more closely at the test failures this fixes, now it does smell more like compiler bug than a tolerance thing. I'll continue exploring.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)


multibody/math/spatial_velocity.h line 114 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Well okay, looking more closely at the test failures this fixes, now it does smell more like compiler bug than a tolerance thing. I'll continue exploring.

Oh, hmm. Possibly it's because of the reinterpret_casts in SpatialVector's translational() and rotational() accessors? Those smell like UB.

They date back from 2017, but maybe are finally biting us now:
https://reviewable.io/reviews/robotlocomotion/drake/5143#-Kcskz2q3qjn-hZpvEOL:-Kcskz2rHvw3J2D5EG58:b9r61t6

I'll see if I can figure out any way to check if those are the problem or not.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)


multibody/math/spatial_velocity.h line 114 at r1 (raw file):
@sherm1

... tidy up the code ...

The current version on master is actually the untidy one. It's calling the copy constructor twice for no apparent reason. One copy happens from SpatialVelocity<T>(*this) as directly apparent in the code, but a second copy happens as the SpatialVelocity<T>& temporary is copied into the return value.

Even without the seeming miscompilation on arm64, this new spelling is shorter by 2 mov instructions (on both x86_64 and arm64). With fat objects like autodiff the improvement is probably even more apparent.

The pattern return foo.BooInPlace() is absurd in every circumstance, so we should just nuke it across the entire source tree. I'll open a separate PR myself with that cleanup. There's only about 15 instances, all in this same general area.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @cbarcelo-bdai)


multibody/math/spatial_velocity.h line 114 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

@sherm1

... tidy up the code ...

The current version on master is actually the untidy one. It's calling the copy constructor twice for no apparent reason. One copy happens from SpatialVelocity<T>(*this) as directly apparent in the code, but a second copy happens as the SpatialVelocity<T>& temporary is copied into the return value.

Even without the seeming miscompilation on arm64, this new spelling is shorter by 2 mov instructions (on both x86_64 and arm64). With fat objects like autodiff the improvement is probably even more apparent.

The pattern return foo.BooInPlace() is absurd in every circumstance, so we should just nuke it across the entire source tree. I'll open a separate PR myself with that cleanup. There's only about 15 instances, all in this same general area.

=> #21438

@cbarcelo-bdai
Copy link
Contributor Author

Closing in favor of #21438

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants