
tracking measurement data from //examples/multibody/cassie_benchmark #13902

Open
rpoyner-tri opened this issue Aug 18, 2020 · 51 comments
Assignees: rpoyner-tri
Labels: component: multibody plant (MultibodyPlant and supporting code), status: tracker (Perpetually open)

@rpoyner-tri
Contributor

rpoyner-tri commented Aug 18, 2020

/cc @sherm1
We would like to capture some sort of baseline data from the cassie benchmark program to help guide development of performance optimizations. I have argued it should be stored out-of-tree because

  • it is probably only useful for local, same-machine comparisons
  • could be discovered to be invalid after the fact
  • will become irrelevant as machine targets evolve

Even with those challenges, I think it is useful to have a place to store and perhaps discuss baseline data. I propose to use this issue for that purpose until we arrive at better solutions.

cc'ing myself @amcastro-tri so that I can follow this.

@sherm1
Member

sherm1 commented Aug 18, 2020

Sounds good! What info should we record here? For my own purposes (making changes to MBP and verifying that I'm moving in the right direction) I would like to note the machine & compiler info, plus the best time over a series of runs for each of the 6 tests. I'd be happy to record more than that if you think it's useful, but that's all I need. Can we standardize on the data to be posted here, perhaps by running the benchmark in a prescribed manner and collecting the output?

@jwnimmer-tri jwnimmer-tri added the component: multibody plant, status: tracker, and unused team: dynamics labels Aug 18, 2020
@rpoyner-tri
Contributor Author

My first proposal is to post a handwritten summary of the data set, including just the single-repetition output of cassie_bench, and to attach the zip file from :record_results. This probably still has some holes, but it goes in the right direction. I'll post an example and we can discuss it from there.

@rpoyner-tri
Contributor Author

Experiment: cassie_bench on rico's Puget, early baseline.
Software: master at hash bc8daa4
Environment: Ubuntu 18.04, gcc 7.5
Hardware: rico's Puget workstation

Quick results:

rico@Puget-161804-10:~/checkout/drake$ bazel-bin/examples/multibody/cassie_benchmark/cassie_bench
2020-08-18 17:22:18
Running bazel-bin/examples/multibody/cassie_benchmark/cassie_bench
Run on (48 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 256 KiB (x24)
  L3 Unified 30720 KiB (x2)
Load Average: 0.13, 0.18, 0.12
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix               26141 ns        26141 ns        25051
CassieDoubleFixture/DoubleInverseDynamics          19460 ns        19461 ns        36137
CassieDoubleFixture/DoubleForwardDynamics          39425 ns        39425 ns        17750
CassieAutodiffFixture/AutodiffMassMatrix         4213121 ns      4213172 ns          166
CassieAutodiffFixture/AutodiffInverseDynamics    5118358 ns      5118326 ns          138
CassieAutodiffFixture/AutodiffForwardDynamics    7525557 ns      7525703 ns           93

Zip file from record_results:
outputs.zip

@rpoyner-tri
Contributor Author

Experiment: cassie_bench on rico's T460p laptop, early baseline
Software: master at hash bc8daa4
Environment: Ubuntu 18.04, gcc 7.5
Hardware: rico's Lenovo T460p laptop

Quick results:

2020-08-18 17:29:55
Running bazel-bin/examples/multibody/cassie_benchmark/cassie_bench
Run on (8 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.58, 0.85, 1.37
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------------
Benchmark                                              Time             CPU   Iterations
----------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix               27105 ns        27038 ns        26175
CassieDoubleFixture/DoubleInverseDynamics          21544 ns        21492 ns        33749
CassieDoubleFixture/DoubleForwardDynamics          39815 ns        39729 ns        15053
CassieAutodiffFixture/AutodiffMassMatrix         4325573 ns      4306426 ns          161
CassieAutodiffFixture/AutodiffInverseDynamics    5176258 ns      5159467 ns          135
CassieAutodiffFixture/AutodiffForwardDynamics    7662577 ns      7635458 ns           85

Zip file from record_results:
outputs.zip

@rpoyner-tri
Contributor Author

rpoyner-tri commented Aug 18, 2020

So, the above reports used a built tree at the listed hash. Quick results were just the output of:

$ bazel-bin/examples/multibody/cassie_benchmark/cassie_bench

The outputs.zip is produced by running:

$ bazel run //examples/multibody/cassie_benchmark:record_results
$ examples/multibody/cassie_benchmark/copy_results_to [SOME_NEW_DIRECTORY]

and then using drag-and-drop or paste to attach [SOME_NEW_DIRECTORY]/outputs.zip to the issue comment.

@rpoyner-tri rpoyner-tri self-assigned this Aug 18, 2020
@rpoyner-tri rpoyner-tri added this to the autodiff speedup milestone Aug 18, 2020
@rpoyner-tri
Contributor Author

Fun fact: DWARF debug tables contain the compiler version and command-line switches in some cases, at least for gcc. I'm checking whether clang has a similar feature. If so, it might be a way to automate capturing the compiler info currently recorded by hand above.

@sherm1
Member

sherm1 commented Aug 18, 2020

I like the idea of capturing the default output of cassie_bench here in issue comments. But for me the one-off output captured above is too variable to be representative (even on my own machine). I'm anticipating that many of the changes I'll be making will be modest improvements, say 5-10% ish. Even with CPU scaling disabled on my Puget I see 6% variation from run to run in the cassie_bench output. For the at-a-glance summary here in the issue, what would you think about reporting instead the min times from the multiple-run output, ideally with CPU scaling disabled? I think that is the most stable report we have, but at the moment it is not easy to extract in a quotable format.

@rpoyner-tri
Contributor Author

That's certainly a thing we could do, and we could probably get :record_results to craft it for us eventually. Disabling CPU scaling is both platform-specific (macOS???) and remarkably resistant to automation (it needs sudo, and there are no state save/restore commands).
So if we want to care about wall-time numbers, we probably have to add some additional semi-automation scripts.

@jwnimmer-tri
Collaborator

jwnimmer-tri commented Aug 18, 2020

I find 6% variance astonishing. I would suggest trying to debug why that is happening first, before just blindly taking the min. Anything that severe could end up confounding the results anyway beyond what the min can correct for. Perhaps not all scaling was disabled, or the cpu affinity was not pinned, or chrome ate your gpu and locked the bus, or ...

@sherm1
Member

sherm1 commented Aug 18, 2020

Really? I'm disappointed, but not surprised by that much variance on recent machines -- would be great to see repeatable timings but it's been a long time since I've seen any.

@jwnimmer-tri
Collaborator

jwnimmer-tri commented Aug 19, 2020

Really?

Yes, really.

First, I did:

$ sudo cpupower frequency-set --governor performance
$ sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'

With that, I am seeing around 0.5% typical run-to-run inconsistency, topping out at 1.5% rarely.

Even a 3% improvement in throughput from a code change under consideration would easily stand out in that background. And you're no longer at the mercy of your cpu temperature while compiling and then executing right after.

@sherm1
Member

sherm1 commented Aug 19, 2020

Thanks -- "no_turbo" helps a lot! Runs a lot slower but with +/- 1% ish variation on my Puget. Looking at the report_results summary (9 runs with stats) I'm still seeing >2% difference between the reported min and max run times. The min times are more repeatable across multiple 9x runs, I haven't seen more than 0.5%.

(BTW timings on my Ubuntu VM were hopelessly variable.)

@jwnimmer-tri
Collaborator

Great! So for me, that crosses off the "something else is going wrong" risk, and now you have at least moderately robust numbers. I don't have any strong opinions on how you take it from here in terms of taking the min, capturing data, etc.

@sherm1
Member

sherm1 commented Aug 19, 2020

What is the correct way to capture the compiler in use? With Bazel I'm not sure even whether I'm using gcc or clang.

@jwnimmer-tri
Collaborator

jwnimmer-tri commented Aug 19, 2020

Call drake/tools/workspace/cc/identify_compiler and save its output. (You'll need to add a BUILD rule for it, first.)

@rpoyner-tri
Contributor Author

As for compiler capture, I should probably open a separate issue. There are a couple of techniques possible, but they all have holes, and the MacOS situation is often the sticking point.

#13905 opened for compiler-in-use capture.

rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Aug 20, 2020
Relevant to RobotLocomotion#13902.

Add a script to run controlled experiments, and refuse to run on unsupported
configurations. Update documentation to describe the new capability.
rpoyner-tri added a commit to rpoyner-tri/drake that referenced this issue Aug 24, 2020
Relevant to RobotLocomotion#13902.

Add a script to run controlled experiments, and refuse to run on unsupported
configurations. Update documentation to describe the new capability.
@sherm1
Member

sherm1 commented Aug 24, 2020

Here's a baseline data point for my own use (before any attempted speedups).

  • Drake master at 3da9e75
  • Sherm's Puget, Ubuntu 18.04, gcc 7.5, 2.9 GHz max but run in no_turbo/powersave using conduct_experiment
  • Note: slowing the CPU clock for repeatability distorts the relative speed of floating point vs memory access; it will underestimate the actual benefit of eliminating heap allocations, for example.

Best of 36 runs (4x conduct_experiment, nothing else running). Each of those experiments produces a "best of 9" minᵢ; we report min(min₁,min₂,min₃,min₄). To give a sense of the variance, %var shows the percentage by which the "maximum min" max(min₁,min₂,min₃,min₄) exceeds the "minimum min". Times are in μs from the CPU column of summary.txt, rounded to three significant digits.

| test | base (μs) | %var | autodiff (μs) | %var |
| --- | --- | --- | --- | --- |
| mass matrix | 41.3 | 3% | 6690 | 0.5% |
| inverse dynamics | 30.6 | 0.6% | 7920 | 1.9% |
| forward dynamics | 60.8 | 6% | 11900 | 1.3% |

@amcastro-tri
Contributor

To be honest, it is difficult for me to find the data I need to understand the improvements as a timeline within this issue. I wonder if you happen to have some sort of spreadsheet, @rpoyner-tri?

Also, I'm surprised by these latest results. The questions in my mind are:

  1. How many allocations did we get rid of since you started working on this? What was the speedup?
  2. How many allocations does the hacky version get rid of? What is the speedup?

Then from that I'd like to see whether things are linear? It seems like they are not?

@rpoyner-tri
Contributor Author

rpoyner-tri commented Oct 12, 2020

@amcastro-tri things are definitely not linear. What I'm seeing now is that the remaining hot spots are within Eigen, and (based on our current build settings) not very effectively vectorized. We could perhaps work on vectorization more, but all of that work will likely be very platform-specific.

I'll have to do some digging to give a complete history, but here are some highlights. I'll be using numbers based on the reorganized benchmark of #14146, and various restated results based on that. All results are from my Puget, using the conduct_experiment script, with --benchmark_repetitions=100.

Allocations baseline, as remeasured above:

| case | allocations | min time (ns) |
| --- | --- | --- |
| mass matrix | 61,405 | 3,001,181 |
| inverse dynamics | 68,798 | 3,770,812 |
| forward dynamics | 102,904 | 5,563,365 |

Sherm did some cache and temp-value reductions that helped MassMatrix only (#13928):

| case | allocations | min time (ns) |
| --- | --- | --- |
| mass matrix | 61,197 | 2,938,893 |
| inverse dynamics | 68,798 | 3,770,812 |
| forward dynamics | 102,904 | 5,563,365 |

My first rewrite of AutoDiffXd arithmetic (#13988):

| case | allocations | min time (ns) |
| --- | --- | --- |
| mass matrix | 34,922 | 2,223,561 |
| inverse dynamics | 41,523 | 3,339,226 |
| forward dynamics | 61,233 | 4,439,808 |

Some additional gains came from Sherm's #13962, but went unnoticed until #14115:

| case | allocations | min time (ns) |
| --- | --- | --- |
| mass matrix | 31,426 | 2,012,921 |
| inverse dynamics | 38,027 | 2,969,139 |
| forward dynamics | 57,693 | 4,238,981 |

Note that my #14045 (move-aware rewrites of less-used math) improved neither run-times nor allocation count.

My #14171 improved running times without changing allocation counts at all:

| case | allocations | min time (ns) |
| --- | --- | --- |
| mass matrix | 31,426 | 1,864,845 |
| inverse dynamics | 38,027 | 2,720,218 |
| forward dynamics | 57,693 | 3,862,702 |

To compare with my little tables above, here is a similar table for the fixed-size autodiff experiment:

| case | allocations | min time (ns) |
| --- | --- | --- |
| mass matrix | n/a | n/a |
| inverse dynamics | 4 | 2,774,637 |
| forward dynamics | n/a | n/a |

None of this should be a surprise; memory allocation is just one of many kinds of potential performance problems. It should be apparent that any improvement of a single kind of problem would be subject to the law of diminishing returns.

@sherm1
Member

sherm1 commented Oct 12, 2020

This is still surprising and not easily explained by diminishing returns. Here is a quick analysis:

Assumptions:
(1) each operation i ∈ {mass, inverse, forward} consists of: time to do heap allocations plus time for other work.
(2) average heap allocation time h is the same for all operations.
(3) eliminating heap allocations doesn't change the amount of other work.

For each operation i, we can write that tᵢ = h xᵢ + cᵢ where xᵢ is how many heap allocations we do, and cᵢ accounts for everything else. h and the three cᵢ are unknown but xᵢ and tᵢ are given by your first and second-to-last tables above, with two equations for each operation. That gives us 6 equations for 4 unknowns (h, c_mass, c_inverse, c_forward). Each pair of equations gives a value for h and fortunately they are all very close -- .035, .037, .037 μs. As a least squares problem in Octave I got h=.037 μs / heap allocation (that is, 37ns which seems about right).

So with h=.037μs the reduction in inverse dynamics heap allocations from 38,027 to 4 in the last table should have removed 38023*h=1407μs reducing the time by about 50% from 2720μs to 1313μs. But no time was actually saved. What happened to that 1400μs of time saved not doing heap allocation? Or what is wrong with my analysis above?

To be precise here is the system I set up and least-squares solved:

A =
    31426        1        0        0
    61405        1        0        0
    38027        0        1        0
    68798        0        1        0
    57693        0        0        1
   102904        0        0        1

b =
   1864
   3001
   2720
   3770
   3863
   5563

A\b
      0.036830       h
    723.025225       c_mass
   1277.826934       c_inverse
   1755.620240       c_forward

@rpoyner-tri
Contributor Author

I could be wrong, but here's my hot take: you can't apply continuous-domain physics to computation steps the way you are trying to do above. Most likely, not every allocation has uniform cost, depending on heap state, cache state, etc. Certainly, not every free (likely the more costly operation) has uniform cost, for the same reasons.

Just for completeness, there is another possibility worth exploring: perhaps the measured statistics aren't particularly stable, even with 100x repetitions and all the other measures we've implemented. I can take a look at that if you think it is of interest.

@sherm1
Member

sherm1 commented Oct 12, 2020

Although I agree about the likely time variations, I think when we're talking about 30,000 heap allocations those should wash out. I think there is a mystery here that bears investigating.

@sherm1
Member

sherm1 commented Oct 13, 2020

BTW it occurs to me that (unless we're leaking) each heap allocation is really a malloc/free pair. So that 37ns should include one of each.

@amcastro-tri
Contributor

Thank you for the very detailed data @rpoyner-tri and great analysis @sherm1.

I actually ran the benchmark locally and did a little math. I simply computed the quotient

S = "cost of operation (AutoDiffXd)"/"cost of operation (double)"/"number of derivatives"

The reason I computed this number is that if, for instance, I used simple forward differences, I'd expect the cost of the derivatives to be "(number of derivatives) x (cost of operation (double))". Therefore, with this super simple ideal-case analysis, I'd expect convergence to S = 1 (not sure if that is achievable in software, but it's at least good as a reference, I believe).

Ok, so Cassie has nq = 23, nv = 22 and nu = 10. The tests are set up so that the number of derivatives nd is: Mass Matrix nd = nq+nv = 45, Inverse Dynamics nd = nq+2*nv = 67 and Forward Dynamics nd = nq+nv+nu = 55. Using the timings on my laptop I get:

Mass Matrix. S = 1950380/14530/45 = 2.98.
Inverse Dynamics. S = 2838891/21058/67 = 2.01
Forward Dynamics. S = 4014724/43338/55 = 1.68.

So very close to one. It could be that we did hit the bottom. Not sure if we could drive them to S = 1.

@sherm1
Member

sherm1 commented Oct 13, 2020

Thanks for that insightful analysis, Alejandro! I do think we should be able to get closer to S=1 than we are, though. In theory at least, we should be able to squeeze this down towards S=0.25 using SIMD instructions, since the derivative part consists of repetitive calculations. All the machines we currently use support AVX2 and FMA instructions that allow 4 double operations per cycle, with FMA able to do 8 under optimal circumstances. It would take some work to put those to use, but the results could be great.
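
For a concrete picture of what that would look like (a hand-written sketch, not Drake or Eigen code; the function name and raw-pointer layout are made up), the derivative update for a product z = x*y is dz = y*dx + x*dy, which maps directly onto AVX2/FMA over the packed derivative arrays:

```c++
#include <immintrin.h>

// Sketch only: compute dz = y*dx + x*dy over an n-long derivative vector,
// four doubles per iteration using AVX2 + FMA (compile with -mavx2 -mfma).
void AccumulateProductDerivatives(const double* dx, const double* dy,
                                  double x, double y, double* dz, int n) {
  const __m256d vx = _mm256_set1_pd(x);
  const __m256d vy = _mm256_set1_pd(y);
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    const __m256d a = _mm256_loadu_pd(dx + i);
    const __m256d b = _mm256_loadu_pd(dy + i);
    __m256d acc = _mm256_mul_pd(vy, a);  // y * dx
    acc = _mm256_fmadd_pd(vx, b, acc);   // + x * dy (fused multiply-add)
    _mm256_storeu_pd(dz + i, acc);
  }
  for (; i < n; ++i) dz[i] = y * dx[i] + x * dy[i];  // scalar tail
}
```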

@rpoyner-tri
Contributor Author

FWIW, the Autodiff67d benchmark case spends a large fraction of its time in the Eigen dense_assignment_loop code, which makes me wonder if the fixed-size implementation is just somehow trading allocation for more copying.

Here are some snapshots from perf record [command]; perf report, isolating just the 67d inverse dynamics and the Xd inverse dynamics.

Xd inverse dynamics hotspots, as found by sampling with perf:
[screenshot: perf-xd-inverse-dyn]

67d inverse dynamics hotspots, as found by sampling with perf:
[screenshot: perf-67d-inverse-dyn]

Perhaps it is worth noting that my opaque-seeming math fiddles in #14171 had the effect of trading calls to the Eigen dense-assignment kernel for inlined instructions. We might pick up another sizeable win by convincing Eigen to vectorize better.

@jwnimmer-tri
Collaborator

Just in case anyone is unclear -- #14171 was not relevant for 67d. All of the changes to autodiffxd.h are only for the Xd specialization. The 67d code is using only upstream's code.

Also note that in the Xd benchmark, there could be some scalars with zero derivatives (when you have constants) which would short-circuit the chain rule. Because the 67d benchmark used RowsAtCompileTime instead of MaxRowsAtCompileTime, you're using the chain rule everywhere even if there are no partials. It's not apples to apples with the Xd timings.

@rpoyner-tri
Contributor Author

@jwnimmer-tri agreed on 67d using only upstream code.

I don't understand your comment about RowsAtCompileTime() -- can you clarify?

@jwnimmer-tri
Collaborator

jwnimmer-tri commented Oct 13, 2020

I'm referring to the distinction between

  • fixed size of 67 (RowsAtCompileTime == MaxRowsAtCompileTime == 67) vs
  • dynamic size with max of 67 (RowsAtCompileTime == Dynamic, MaxRowsAtCompileTime == 67).

In both cases, we never allocate the derivatives on the heap.

But in the latter case, we still check .derivatives().size() during the chain rule and short-circuit when empty.

In most of the proposals I've seen to use "fixed size" autodiff, we would only set MaxRows, not Rows. Most users will not need all 67 all the time.
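
To make the distinction concrete, here is a sketch (the alias names are illustrative, not necessarily the ones Drake would use) of the two flavors using Eigen's unsupported AutoDiffScalar module:

```c++
#include <Eigen/Core>
#include <unsupported/Eigen/AutoDiff>

// Fixed size: derivatives() is always 67 long, so the chain rule touches all
// 67 slots even for constants that carry no partials.
using AutoDiff67d = Eigen::AutoDiffScalar<Eigen::Matrix<double, 67, 1>>;

// Dynamic size with a compile-time max: the storage still lives inline (no
// heap allocation), but derivatives().size() may be 0, so the chain rule can
// short-circuit for constants. Matrix template arguments are
// <Scalar, Rows, Cols, Options, MaxRows, MaxCols>.
using AutoDiffUpTo67d = Eigen::AutoDiffScalar<
    Eigen::Matrix<double, Eigen::Dynamic, 1, 0, 67, 1>>;
```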

@amcastro-tri
Contributor

Because the 67d benchmark used RowsAtCompileTime instead of MaxRowsAtCompileTime, you're using the chain rule everywhere even if there are no partials.

@jwnimmer-tri, I haven't seen the upstream implementation for 67d. So you believe that the upstream implementation is only provided for fixed sizes, but not optimized for fixed max sizes (by checking .derivatives().size() to short-circuit computations)?

@sherm1
Member

sherm1 commented Oct 13, 2020

Good points, Jeremy. Rico, it should be easy to try this with UpTo67d rather than 67d to see if that changes anything.
Also, with either of those, all your nice move optimizations switch back to copies.

This makes me think we would see much better results by specializing AutoDiffScalar<VectorXd> to use customized memory allocation, where we would have a hope of getting that 37 ns new/delete time down to 1 or 2 ns.
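
As a sketch of that idea (illustrative only; none of these names exist in Drake), a thread-local free list of fixed-capacity derivative buffers would turn the common-case new/delete into a pointer pop/push:

```c++
#include <cstddef>

namespace sketch {

constexpr int kMaxDerivatives = 128;  // assumed upper bound for this sketch

struct DerivBuffer {
  double data[kMaxDerivatives];
  DerivBuffer* next{nullptr};
};

// Thread-local free list: Acquire() pops a recycled buffer when one is
// available (a few ns) and only falls back to operator new when empty.
// Buffers are never returned to the OS in this sketch.
class DerivPool {
 public:
  DerivBuffer* Acquire() {
    if (free_list_ == nullptr) return new DerivBuffer;  // cold path
    DerivBuffer* buf = free_list_;
    free_list_ = buf->next;
    return buf;                                         // hot path
  }
  void Release(DerivBuffer* buf) {
    buf->next = free_list_;
    free_list_ = buf;
  }
 private:
  DerivBuffer* free_list_{nullptr};
};

inline DerivPool& ThreadPool() {
  thread_local DerivPool pool;  // one pool per thread; no locking needed
  return pool;
}

}  // namespace sketch
```

A specialized scalar's derivative storage could Acquire() on construction and Release() on destruction, while still keeping the move and empty-derivatives optimizations.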

@sherm1
Member

sherm1 commented Oct 13, 2020

(And continue to benefit from move and all-zero-derivatives optimizations.)

@sherm1
Member

sherm1 commented Oct 13, 2020

@amcastro-tri for AutoDiffScalar<Vector67d>, derivatives().size() is always 67.

@amcastro-tri
Contributor

oh, I see! I thought the whole time @rpoyner-tri was using UpTo67d. My bad. Yes, we should be using UpTo67d instead I believe.

@rpoyner-tri
Contributor Author

Initial experiments with UpTo67d show maybe a 20% speedup over Xd, but it is more painful to work with. I still haven't got the forward dynamics case working; it dies trying to evaluate an expression tree containing an empty derivative deep inside Eigen.

I'll have more to show from those experiments tomorrow.

@rpoyner-tri
Contributor Author

In order to get all the cassie_bench cases working with UpTo67d autodiff, I had to locally install and use Eigen 3.3.8, which has a fix for its make_coherent mechanism in the presence of 0-size derivative vectors. Bottom line: we remove essentially all memory allocation and gain ~20% over the AutoDiffXd currently in master.

Comparison of timings from AutoDiffXd to AutoDiff67d (implemented as up-to):
[chart: up-to-67d-eigen-3.3.8]

Summary output, showing allocation counts and selected object sizes:

rico@Puget-161804-10:~/checkout/drake$ bazel-bin/examples/multibody/cassie_benchmark/cassie_bench --benchmark_counters_tabular
2020-10-14T14:26:06-04:00
Running bazel-bin/examples/multibody/cassie_benchmark/cassie_bench
Run on (48 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 256 KiB (x24)
  L3 Unified 30720 KiB (x2)
Load Average: 0.12, 0.11, 0.09
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations Allocs.max Allocs.mean Allocs.min Allocs.stddev
------------------------------------------------------------------------------------------------------------------------------------------
CassieDoubleFixture/DoubleMassMatrix                 12482 ns        12481 ns        47478          0           0          0             0
CassieDoubleFixture/DoubleInverseDynamics            18090 ns        18088 ns        38678          3           3          3             0
CassieDoubleFixture/DoubleForwardDynamics            39421 ns        39416 ns        18192         22          22         22             0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations Allocs.max Allocs.mean Allocs.min Allocs.stddev ads_sizeof autodiff_size
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
CassieAutodiffFixture/AutodiffMassMatrix           1619013 ns      1618843 ns          436    31.426k     31.426k    31.426k             0         24            45
CassieAutodiffFixture/AutodiffInverseDynamics      2325750 ns      2325501 ns          303    38.027k     38.027k    38.027k             0         24            67
CassieAutodiffFixture/AutodiffForwardDynamics      3317453 ns      3317033 ns          211    57.693k     57.693k    57.693k             0         24            55
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations Allocs.max Allocs.mean Allocs.min Allocs.stddev autodiff_size
--------------------------------------------------------------------------------------------------------------------------------------------------------
CassieAutodiff67Fixture/AutodiffMassMatrix         1336231 ns      1336041 ns          533          0           0          0             0            45
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations Allocs.max Allocs.mean Allocs.min Allocs.stddev ads_sizeof autodiff_size
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
CassieAutodiff67Fixture/AutodiffInverseDynamics    1833856 ns      1833632 ns          382          4           4          4             0        552            67
--------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations Allocs.max Allocs.mean Allocs.min Allocs.stddev autodiff_size
--------------------------------------------------------------------------------------------------------------------------------------------------------
CassieAutodiff67Fixture/AutodiffForwardDynamics    2782219 ns      2781852 ns          252         22          22         22             0            55

Limitations: why aren't we just doing UpTo128d or something?

  • we've fixed a bunch of bugs and instabilities in Eigen's implementation
    • upgrading to 3.3.8 only addresses the worst of them
    • upgrading to 3.3.8 is probably far-off owing to compatibility constraints and Ubuntu upgrade schedule
    • various math methods would still have bugs we care about
  • using a MaxRowsAtCompileTime type is not as fluent an API
    • probably need a ton of Eigen::Ref<> hacks or similar
  • replicating the support level of AutoDiffXd across Drake is probably a multi-thousand-line change
    • (maybe avoidable by cleverness)

@sherm1
Member

sherm1 commented Jun 29, 2021

Current cassie_bench performance snapshot. This includes the effects of the AVX transform composition functions.

  • Drake master at c272fa4.
  • Sherm's Puget, Ubuntu 18.04, gcc 7.5, 2.9 GHz max but run in no_turbo/powersave/cpu1 using conduct_experiment.
  • see above comment for test conditions. Times are in microseconds.

| test | base (μs) | %var | autodiff (μs) | %var |
| --- | --- | --- | --- | --- |
| mass matrix | 18.5 | 0.6% | 2583 | 0.7% |
| inverse dynamics | 25.1 | 0.3% | 3740 | 0.3% |
| forward dynamics | 57.0 | 0.5% | 5404 | 0.5% |

From the initial measurements, this is a 41.3/18.5 = 2.2X speedup for the mass matrix (or a 55% discount if you prefer). Inverse dynamics is 18% faster and forward dynamics is 6% faster (those haven't been worked on yet). The AutoDiff timings here improved by around 5% over the previous times reported here, or about 30% overall.

@rpoyner-tri
Contributor Author

I don't foresee doing much further work here. Handing back to @sherm1 for re-triage.

@xuchenhan-tri
Contributor

It would be nice to see an update to this once the changes from #16877 land.

@sherm1
Member

sherm1 commented May 24, 2022

Will do.
