tracking measurement data from //examples/multibody/cassie_benchmark #13902
Sounds good! What info should we record here? For my own purposes (making changes to MBP and verifying that I'm moving in the right direction) I would like to note the machine & compiler info, plus the best time over a series of runs for each of the 6 tests. I'd be happy to record more than that if you think it's useful, but that's all I need. Can we standardize on the data to be posted here, perhaps by running the benchmark in a prescribed manner and collecting the output?
My first proposal is to have a handwritten summary of the data set, including just the single repetition output of cassie_bench, and also attach the zip file from :record_results. This probably still has some holes, but goes in the right direction. I'll post an example and we can discuss that, maybe.
Experiment: cassie_bench on rico's Puget, early baseline. Quick results:
Zip file from record_results:
Experiment: cassie_bench on rico's T460p laptop, early baseline. Quick results:
Zip file from record_results:
So, the above reports used a built tree at the listed hash. Quick results were just the output of:
The outputs.zip is produced by running:
and then using drag-and-drop or paste to attach [SOME_NEW_DIRECTORY]/outputs.zip to the issue comment.
Fun fact, DWARF debug tables have compiler and command line switches in some cases, at least for gcc. Checking whether clang has a similar feature. If so, it might be a way to automate the compiler info currently captured by hand above.
I like the idea of capturing the default output of cassie_bench here in issue comments. But for me the one-off output captured above is too variable to be representative (even on my own machine). I'm anticipating that many of the changes I'll be making will be modest improvements, say 5-10% ish. Even with CPU scaling disabled on my Puget I see 6% variation from run to run in the cassie_bench output. For the at-a-glance summary here in the issue, what would you think about reporting instead the min times from the multiple-run output, ideally with CPU scaling disabled? I think that is the most stable report we have, but at the moment it is not easy to extract in a quotable format.
That's certainly a thing we could do, and probably get :record_results to craft it for us eventually. Automating the disablement of CPU scaling is both platform specific (macos???) and remarkably resistant to automation (sudo, no state save/restore commands).
I find 6% variance astonishing. I would suggest trying to debug why that is happening first, before just blindly taking the min. Anything that severe could end up confounding the results anyway beyond what the min can correct for. Perhaps not all scaling was disabled, or the cpu affinity was not pinned, or chrome ate your gpu and locked the bus, or ...
Really? I'm disappointed, but not surprised by that much variance on recent machines -- would be great to see repeatable timings but it's been a long time since I've seen any.
Yes, really. First, I did:
$ sudo cpupower frequency-set --governor performance
$ sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'
With that, I am seeing around 0.5% typical run-to-run inconsistency, topping out at 1.5% rarely. Even a 3% improvement in throughput from a code change under consideration would easily stand out in that background. And you're no longer at the mercy of your cpu temperature while compiling and then executing right after.
Thanks -- "no_turbo" helps a lot! Runs a lot slower but with +/- 1% ish variation on my Puget. (BTW timings on my Ubuntu VM were hopelessly variable.)
Great! So for me, that crosses off the "something else is going wrong" risk, and now you have at least moderately robust numbers. I don't have any strong opinions on how you take it from here in terms of taking the min, capturing data, etc.
What is the correct way to capture the compiler in use? With Bazel I'm not sure even whether I'm using gcc or clang.
Call
As for compiler capture, I should probably open a separate issue. There are a couple of techniques possible, but they all have holes, and the MacOS situation is often the sticking point. #13905 opened for compiler-in-use capture. |
Relevant to RobotLocomotion#13902. Add a script to run controlled experiments, and refuse to run on unsupported configurations. Update documentation to describe the new capability.
Here's a baseline data point for my own use (before any attempted speedups).
Best of 36 runs (4x conduct_experiment, nothing else running). Each of those experiments produces a "best of 9" minᵢ; we report min(min₁, min₂, min₃, min₄). To give a sense of variance, max(min₁, min₂, min₃, min₄) is also shown, as the percent by which the "maximum min" exceeds the "minimum min". Times are in μs from the CPU column of summary.txt, rounded to three significant digits.
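For concreteness, a small sketch of the aggregation just described; the timings below are placeholders, not the measured Cassie numbers.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Placeholder timings (µs): four conduct_experiment runs, best-of-9 each.
  std::vector<double> per_experiment_min = {413.0, 417.0, 415.0, 414.0};
  const double best = *std::min_element(per_experiment_min.begin(),
                                        per_experiment_min.end());
  const double worst_min = *std::max_element(per_experiment_min.begin(),
                                             per_experiment_min.end());
  // Report the min of the per-experiment mins; the spread is how far the worst
  // per-experiment min sits above it, as a percentage.
  std::printf("best %.3g us, spread +%.1f%%\n", best,
              100.0 * (worst_min - best) / best);
  return 0;
}
```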
To be honest it is difficult for me to find the data I need to understand the improvements in some sort of timeline within this issue. I wonder if you happen to have some sort of spreadsheet @rpoyner-tri ? Also surprised by these latest results. Questions in my mind are:
Then from that I'd like to see whether things are linear? It seems like not.
@amcastro-tri things are definitely not linear. What I'm seeing now is that the remaining hot spots are within Eigen, and (based on our current build settings) not very effectively vectorized. We could perhaps work on vectorization more, but all of that work will likely be very platform-specific. I'll have to do some digging to give a complete history, but here are some highlights. I'll be using numbers from the reorganized benchmark of #14146, with earlier results restated in those terms. All results are from my Puget, using the Allocations baseline, as remeasured above:
Sherm did some cache and temp-value reductions that helped MassMatrix only (#13928):
My first rewrite of AutoDiffXd arithmetic (#13988):
Some additional gains came from Sherm's #13962, but went unnoticed until #14115:
Note that my #14045 (move-aware rewrites of less-used math) improved neither run-times nor allocation count. My #14171 improved running times without changing allocation counts at all:
To compare with my little tables above, here is a similar table for the fixed-size autodiff experiment:
None of this should be a surprise; memory allocation is just one of many kinds of potential performance problems. It should be apparent that improvements targeting any single kind of problem are subject to the law of diminishing returns.
This is still surprising and not easily explained by diminishing returns. Here is a quick analysis. Assumptions: for each operation i, we can write tᵢ ≈ cᵢ + h·aᵢ, where aᵢ is the number of heap allocations, cᵢ is everything else, and h is an (assumed uniform) cost per allocation. So with h = 0.037μs, the reduction in inverse dynamics heap allocations from 38,027 to 4 in the last table should have removed 38023·h ≈ 1407μs, reducing the time by about 50%, from 2720μs to 1313μs. But no time was actually saved. What happened to that 1400μs of time saved by not doing heap allocation? Or what is wrong with my analysis above? To be precise, here is the system I set up and least-squares solved:
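A minimal sketch of that kind of fit, assuming the model above with one unknown constant per operation plus a shared h. The operations, allocation counts, and times below are placeholders, not the system sherm1 actually solved.

```cpp
// Hypothetical least-squares fit of a shared per-allocation cost h,
// assuming time_i = c_op + h * allocs_i. Placeholder data only.
#include <cstdio>
#include <Eigen/Dense>

int main() {
  struct Sample { int op; double allocs; double time_us; };
  // Two operations, each measured before and after an allocation reduction.
  const Sample samples[] = {
      {0, 38027.0, 2720.0}, {0, 4.0, 2720.0},
      {1, 15000.0, 1250.0}, {1, 6.0,  700.0},
  };
  constexpr int kNumOps = 2;
  constexpr int n = sizeof(samples) / sizeof(samples[0]);

  // Unknowns: [c_0, c_1, h]; one row per measurement.
  Eigen::MatrixXd A = Eigen::MatrixXd::Zero(n, kNumOps + 1);
  Eigen::VectorXd b(n);
  for (int i = 0; i < n; ++i) {
    A(i, samples[i].op) = 1.0;          // per-operation constant cost
    A(i, kNumOps) = samples[i].allocs;  // allocation count, multiplied by h
    b(i) = samples[i].time_us;
  }
  const Eigen::VectorXd x = A.colPivHouseholderQr().solve(b);  // least squares
  std::printf("fitted h = %.4f us per malloc/free pair\n", x(kNumOps));
  return 0;
}
```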
I could be wrong, but here's my hot take: you can't do continuous-domain physics with computation steps like you are trying to do above. Most likely, not every allocation has uniform cost, based on heap state, cache states, etc. Certainly, not every free (likely the more costly) has uniform cost, for the same reasons. Just for completeness, there is another possibility worth exploring: perhaps measured statistics aren't particularly stable, even with 100x repetitions, and all the other measures we've implemented. I can take a look at that if you think it is of interest.
Although I agree about the likely time variations, I think when we're talking about 30,000 heap allocations those should wash out. I think there is a mystery here that bears investigating.
BTW it occurs to me that (unless we're leaking) each heap allocation is really a malloc/free pair. So that 37ns should include one of each.
Thank you for the very detailed data @rpoyner-tri and great analysis @sherm1. I actually ran the benchmark locally and did a little math. I simply computed, for each operation, the quotient S of the AutoDiff time to nd times the double time (written out below).
The reason I computed this number is that if, for instance, I used simple forward differences, I'd expect the cost of the derivatives to be "(number of derivatives) x (cost of operation (double))". Therefore, with this super simple ideal-case analysis I'd expect convergence to S = 1 (not sure that's achievable in software, but at least it's a good reference, I believe). Ok, so Cassie has nq = 23, nv = 22, and nu = 10. The tests are set up so that the number of derivatives nd is: Mass matrix nd = nq+nv = 45, Inverse Dynamics nd = nq+2*nv = 67, and Forward Dynamics nd = nq+nv+nu = 55. Using the timings on my laptop I get:
So very close to one. It could be that we did hit the bottom. Not sure if we could drive them to S = 1.
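Writing out the quotient described above, with the Cassie derivative counts:

```latex
S = \frac{T_{\text{AutoDiff}}}{n_d \, T_{\text{double}}},\qquad
n_d = \begin{cases}
  n_q + n_v = 45 & \text{mass matrix}\\
  n_q + 2\,n_v = 67 & \text{inverse dynamics}\\
  n_q + n_v + n_u = 55 & \text{forward dynamics}
\end{cases}
```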
Thanks for that insightful analysis, Alejandro! I do think we should be able to get closer to S=1 than we are, though. In theory at least, we should be able to squeeze this down towards S=0.25 using SIMD instructions, since the derivative part consists of repetitive calculations. All the machines we currently use support AVX2 and FMA instructions that allow for 4 double operations per cycle, with FMA able to do 8 under optimal circumstances. It would take some work to put those to use, but the results could be great.
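As a concrete illustration of why the derivative part is SIMD-friendly, here is the chain rule for a product written as a plain loop: it is a pure elementwise multiply-add over the partials, exactly the shape AVX2/FMA can process four doubles at a time. A sketch only, not Drake's implementation.

```cpp
// Chain rule for c = a*b over n partials: dc[i] = a*db[i] + b*da[i].
// This elementwise fused multiply-add is the pattern AVX2 + FMA vectorizes.
void PropagateProductDerivatives(double a, const double* da,
                                 double b, const double* db,
                                 int n, double* dc) {
  for (int i = 0; i < n; ++i) {
    dc[i] = a * db[i] + b * da[i];
  }
}
```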
FWIW, the Autodiff67d benchmark case spends a large fraction of its time in Eigen's dense assignment kernel. Here are some snapshots of the hotspots, as found by sampling:
Xd inverse dynamics hotspots:
67d inverse dynamics hotspots:
Perhaps it is worth noting that my opaque-seeming math fiddles in #14171 had the effect of trading calls to the Eigen dense assignment kernel for inlined instructions. It could be that we could pick up another sizeable win by convincing Eigen to do better vectorizing.
Just in case anyone is unclear -- #14171 was not relevant for 67d. All of the changes there touch only Drake's AutoDiffXd scalar; the 67d benchmark exercises only upstream Eigen code. Also note that in the Xd benchmark, there could be some scalars with zero derivatives (when you have constants) which would short-circuit the chain rule. Because the 67d benchmark used RowsAtCompileTime instead of MaxRowsAtCompileTime, you're using the chain rule everywhere even if there are no partials. It's not apples to apples with the Xd timings.
@jwnimmer-tri agreed on 67d using only upstream code. I don't understand your comment about RowsAtCompileTime() -- can you clarify?
I'm referring to the distinction between a derivative vector whose RowsAtCompileTime is 67 (always exactly 67 partials) and one whose RowsAtCompileTime is Dynamic but whose MaxRowsAtCompileTime is 67 (anywhere from 0 to 67 partials, stored inline); see the sketch below.
In both cases, we never allocate the derivatives on the heap. But in the latter case, we still check the runtime derivative count, so a scalar with no partials does essentially no chain-rule work. In most of the proposals I've seen to use "fixed size" autodiff, we would only set MaxRows, not Rows. Most users will not need all 67 all the time.
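Roughly, the two flavors being contrasted look like this (the alias names are illustrative, not necessarily the ones used in the benchmark):

```cpp
#include <Eigen/Core>
#include <unsupported/Eigen/AutoDiff>

// Fixed size: the derivative vector always has exactly 67 rows, so the chain
// rule always runs over all 67 partials, even for effectively-constant scalars.
using AutoDiff67d = Eigen::AutoDiffScalar<Eigen::Matrix<double, 67, 1>>;

// Fixed maximum size ("up to 67"): storage for 67 partials lives inline, so
// there is still no heap allocation, but the runtime size can be 0..67 and the
// derivative work scales with the number of partials actually present.
using AutoDiffUpTo67d = Eigen::AutoDiffScalar<
    Eigen::Matrix<double, Eigen::Dynamic, 1, Eigen::ColMajor, 67, 1>>;
```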
@jwnimmer-tri, I haven't seen the upstream implementation for 67d. So you believe that the upstream implementation is only provided for fixed-sizes but not optimized for fixed-max-sizes? (by checking .derivatives().size() to short circuit computations).
Good points, Jeremy. Rico, should be easy to try this with UpTo67d rather than 67d to see if that changes anything. This makes me think we would see much better results by specializing
(And continue to benefit from move and all-zero-derivatives optimizations.)
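The kind of short-circuiting being discussed looks roughly like this (a sketch only, not Eigen's or Drake's actual arithmetic):

```cpp
#include <Eigen/Core>
#include <unsupported/Eigen/AutoDiff>

// Sketch of a chain-rule multiply that skips derivative work when an operand
// carries no partials, in the spirit of the all-zero-derivatives optimization.
template <typename DerType>
Eigen::AutoDiffScalar<DerType> Multiply(
    const Eigen::AutoDiffScalar<DerType>& a,
    const Eigen::AutoDiffScalar<DerType>& b) {
  Eigen::AutoDiffScalar<DerType> result(a.value() * b.value());
  const bool a_has = a.derivatives().size() != 0;
  const bool b_has = b.derivatives().size() != 0;
  if (a_has && b_has) {
    result.derivatives() =
        a.value() * b.derivatives() + b.value() * a.derivatives();
  } else if (a_has) {
    result.derivatives() = b.value() * a.derivatives();
  } else if (b_has) {
    result.derivatives() = a.value() * b.derivatives();
  }
  // If neither operand has partials, the result keeps an empty (size-0)
  // derivative vector and no chain-rule work is done at all.
  return result;
}
```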
@amcastro-tri for
oh, I see! I thought the whole time @rpoyner-tri was using UpTo67d. My bad. Yes, we should be using UpTo67d instead I believe.
Initial experiments with UpTo67d show maybe a 20% speedup over Xd, but are more painful to work with. I still haven't got the forward dynamics case working; it dies trying to evaluate an expression tree containing an empty derivative deep inside Eigen. I'll have more to show from those experiments tomorrow.
In order to get all the cassie_bench cases working with UpTo67d autodiff, I had to locally install and use eigen 3.3.8, which has a fix for its
Comparison of timings from AutoDiffXd to AutoDiff67d (implemented as up-to):
Summary output, showing allocation counts and selected object sizes:
Limitations: why aren't we just doing Upto128d or something?
Current cassie_bench performance snapshot. This includes the effects of the AVX transform composition functions.
From the initial measurements, this is a 41.3/18.5 = 2.2X speedup for the mass matrix (or a 55% discount if you prefer). Inverse dynamics is 18% faster and forward dynamics is 6% faster (those haven't been worked on yet). The AutoDiff timings improved by around 5% over the times previously reported here, or about 30% overall.
I don't foresee doing much further work here. Handing back to @sherm1 for re-triage.
It would be nice to see an update to this once the changes from #16877 land.
Will do.
/cc @sherm1
We would like to capture some sort of baseline data from the cassie benchmark program to help guide development of performance optimizations. I have argued it should be stored out-of-tree because
Even with those challenges, I think it is useful to have a place to store and perhaps discuss baseline data. I propose to use this issue for that purpose until we arrive at better solutions.
cc'ing myself @amcastro-tri so that I can follow this.