tracking measurement data from //examples/multibody/cassie_benchmark #13902
Sounds good! What info should we record here? For my own purposes (making changes to MBP and verifying that I'm moving in the right direction) I would like to note the machine & compiler info, plus the best time over a series of runs for each of the 6 tests. I'd be happy to record more than that if you think it's useful, but that's all I need. Can we standardize on the data to be posted here, perhaps by running the benchmark in a prescribed manner and collecting the output?
My first proposal is to have a handwritten summary of the data set, including just the single repetition output of cassie_bench, and also attach the zip file from :record_results. This probably still has some holes, but goes in the right direction. I'll post an example and we can discuss that, maybe.
Experiment: cassie_bench on rico's Puget, early baseline. Quick results:
Zip file from record_results:
Experiment: cassie_bench on rico's T460p laptop, early baseline. Quick results:
Zip file from record_results:
So, the above reports used a built tree at the listed hash. Quick results were just the output of:
The outputs.zip is produced by running:
and then using drag-and-drop or paste to attach [SOME_NEW_DIRECTORY]/outputs.zip to the issue comment.
Fun fact, DWARF debug tables have compiler and command line switches in some cases, at least for gcc. Checking whether clang has a similar feature. If so, it might be a way to automate the compiler info currently captured by hand above.
I like the idea of capturing the default output of cassie_bench here in issue comments. But for me the one-off output captured above is too variable to be representative (even on my own machine). I'm anticipating that many of the changes I'll be making will be modest improvements, say 5-10% ish. Even with CPU scaling disabled on my Puget I see 6% variation from run to run in the cassie_bench output. For the at-a-glance summary here in the issue, what would you think about reporting instead the min times from the multiple-run output, ideally with CPU scaling disabled? I think that is the most stable report we have, but at the moment it is not easy to extract in a quotable format.
That's certainly a thing we could do, and probably get :record_results to craft it for us eventually. Automating the disablement of CPU scaling is both platform specific (macos???) and remarkably resistant to automation (sudo, no state save/restore commands).
I find 6% variance astonishing. I would suggest trying to debug why that is happening first, before just blindly taking the min. Anything that severe could end up confounding the results anyway beyond what the min can correct for. Perhaps not all scaling was disabled, or the cpu affinity was not pinned, or chrome ate your gpu and locked the bus, or ...
Really? I'm disappointed, but not surprised by that much variance on recent machines -- would be great to see repeatable timings but it's been a long time since I've seen any.
Yes, really. First, I did:
$ sudo cpupower frequency-set --governor performance
$ sudo sh -c 'echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo'
With that, I am seeing around 0.5% typical run-to-run inconsistency, topping out at 1.5% rarely. Even a 3% improvement in throughput from a code change under consideration would easily stand out in that background. And you're no longer at the mercy of your cpu temperature while compiling and then executing right after.
Thanks -- "no_turbo" helps a lot! Runs a lot slower but with +/- 1% ish variation on my Puget. (BTW timings on my Ubuntu VM were hopelessly variable.)
Great! So for me, that crosses off the "something else is going wrong" risk, and now you have at least moderately robust numbers. I don't have any strong opinions on how you take it from here in terms of taking the min, capturing data, etc.
What is the correct way to capture the compiler in use? With Bazel I'm not sure even whether I'm using gcc or clang.
Call
As for compiler capture, I should probably open a separate issue. There are a couple of techniques possible, but they all have holes, and the MacOS situation is often the sticking point. #13905 opened for compiler-in-use capture. |
Relevant to RobotLocomotion#13902. Add a script to run controlled experiments, and refuse to run on unsupported configurations. Update documentation to describe the new capability.
Here's a baseline data point for my own use (before any attempted speedups).
Best of 36 runs (4x conduct_experiment, nothing else running). Each of those experiments produces a "best of 9" minᵢ; we report min(min₁, min₂, min₃, min₄). To give a sense of variance, max(min₁, min₂, min₃, min₄) is also shown, as the percent by which the "maximum min" exceeds the "minimum min". Times are in μs from the CPU column of summary.txt, rounded to three significant digits.
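For concreteness, a small sketch of the aggregation just described; the timings below are placeholders, not the measured Cassie numbers.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Placeholder timings (µs): four conduct_experiment runs, best-of-9 each.
  std::vector<double> per_experiment_min = {413.0, 417.0, 415.0, 414.0};
  const double best = *std::min_element(per_experiment_min.begin(),
                                        per_experiment_min.end());
  const double worst_min = *std::max_element(per_experiment_min.begin(),
                                             per_experiment_min.end());
  // Report the min of the per-experiment mins; the spread is how far the worst
  // per-experiment min sits above it, as a percentage.
  std::printf("best %.3g us, spread +%.1f%%\n", best,
              100.0 * (worst_min - best) / best);
  return 0;
}
```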
To be honest it is difficult for me to find the data I need to understand the improvements in some sort of timeline within this issue. I wonder if you happen to have some sort of spreadsheet @rpoyner-tri ? Also surprised by these latest results. Questions in my mind are:
Then from that I'd like to see whether things are linear? It seems like not.
@amcastro-tri things are definitely not linear. What I'm seeing now is that the remaining hot spots are within Eigen, and (based on our current build settings) not very effectively vectorized. We could perhaps work on vectorization more, but all of that work will likely be very platform-specific. I'll have to do some digging to give a complete history, but here are some highlights. I'll be using numbers from the reorganized benchmark of #14146, with earlier results restated in those terms. All results are from my Puget, using the Allocations baseline, as remeasured above:
Sherm did some cache and temp-value reductions that helped MassMatrix only (#13928):
My first rewrite of AutoDiffXd arithmetic (#13988):
Some additional gains came from Sherm's #13962, but went unnoticed until #14115:
Note that my #14045 (move-aware rewrites of less-used math) improved neither run-times nor allocation count. My #14171 improved running times without changing allocation counts at all:
To compare with my little tables above, here is a similar table for the fixed-size autodiff experiment:
None of this should be a surprise; memory allocation is just one of many kinds of potential performance problems. It should be apparent that improvements targeting any single kind of problem are subject to the law of diminishing returns.
This is still surprising and not easily explained by diminishing returns. Here is a quick analysis. Assumptions: for each operation i, we can write tᵢ ≈ cᵢ + h·aᵢ, where aᵢ is the number of heap allocations, cᵢ is everything else, and h is an (assumed uniform) cost per allocation. So with h = 0.037μs, the reduction in inverse dynamics heap allocations from 38,027 to 4 in the last table should have removed 38023·h ≈ 1407μs, reducing the time by about 50%, from 2720μs to 1313μs. But no time was actually saved. What happened to that 1400μs of time saved by not doing heap allocation? Or what is wrong with my analysis above? To be precise, here is the system I set up and least-squares solved:
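A minimal sketch of that kind of fit, assuming the model above with one unknown constant per operation plus a shared h. The operations, allocation counts, and times below are placeholders, not the system sherm1 actually solved.

```cpp
// Hypothetical least-squares fit of a shared per-allocation cost h,
// assuming time_i = c_op + h * allocs_i. Placeholder data only.
#include <cstdio>
#include <Eigen/Dense>

int main() {
  struct Sample { int op; double allocs; double time_us; };
  // Two operations, each measured before and after an allocation reduction.
  const Sample samples[] = {
      {0, 38027.0, 2720.0}, {0, 4.0, 2720.0},
      {1, 15000.0, 1250.0}, {1, 6.0,  700.0},
  };
  constexpr int kNumOps = 2;
  constexpr int n = sizeof(samples) / sizeof(samples[0]);

  // Unknowns: [c_0, c_1, h]; one row per measurement.
  Eigen::MatrixXd A = Eigen::MatrixXd::Zero(n, kNumOps + 1);
  Eigen::VectorXd b(n);
  for (int i = 0; i < n; ++i) {
    A(i, samples[i].op) = 1.0;          // per-operation constant cost
    A(i, kNumOps) = samples[i].allocs;  // allocation count, multiplied by h
    b(i) = samples[i].time_us;
  }
  const Eigen::VectorXd x = A.colPivHouseholderQr().solve(b);  // least squares
  std::printf("fitted h = %.4f us per malloc/free pair\n", x(kNumOps));
  return 0;
}
```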
I could be wrong, but here's my hot take: you can't do continuous-domain physics with computation steps like you are trying to do above. Most likely, not every allocation has uniform cost, based on heap state, cache states, etc. Certainly, not every free (likely the more costly) has uniform cost, for the same reasons. Just for completeness, there is another possibility worth exploring: perhaps measured statistics aren't particularly stable, even with 100x repetitions, and all the other measures we've implemented. I can take a look at that if you think it is of interest.
Although I agree about the likely time variations, I think when we're talking about 30,000 heap allocations those should wash out. I think there is a mystery here that bears investigating.
BTW it occurs to me that (unless we're leaking) each heap allocation is really a malloc/free pair. So that 37ns should include one of each.
Thank you for the very detailed data @rpoyner-tri and great analysis @sherm1. I actually ran the benchmark locally and did a little math. I simply computed, for each operation, the quotient S of the AutoDiff time to nd times the double time (written out below).
The reason I computed this number is that if, for instance, I used simple forward differences, I'd expect the cost of the derivatives to be "(number of derivatives) x (cost of operation (double))". Therefore, with this super simple ideal-case analysis I'd expect convergence to S = 1 (not sure that's achievable in software, but at least it's a good reference, I believe). Ok, so Cassie has nq = 23, nv = 22, and nu = 10. The tests are set up so that the number of derivatives nd is: Mass matrix nd = nq+nv = 45, Inverse Dynamics nd = nq+2*nv = 67, and Forward Dynamics nd = nq+nv+nu = 55. Using the timings on my laptop I get:
So very close to one. It could be that we did hit the bottom. Not sure if we could drive them to S = 1.
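Writing out the quotient described above, with the Cassie derivative counts:

```latex
S = \frac{T_{\text{AutoDiff}}}{n_d \, T_{\text{double}}},\qquad
n_d = \begin{cases}
  n_q + n_v = 45 & \text{mass matrix}\\
  n_q + 2\,n_v = 67 & \text{inverse dynamics}\\
  n_q + n_v + n_u = 55 & \text{forward dynamics}
\end{cases}
```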
Thanks for that insightful analysis, Alejandro! I do think we should be able to get closer to S=1 than we are, though. In theory at least, we should be able to squeeze this down towards S=0.25 using SIMD instructions, since the derivative part consists of repetitive calculations. All the machines we currently use support AVX2 and FMA instructions that allow for 4 double operations per cycle, with FMA able to do 8 under optimal circumstances. It would take some work to put those to use, but the results could be great.
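As a concrete illustration of why the derivative part is SIMD-friendly, here is the chain rule for a product written as a plain loop: it is a pure elementwise multiply-add over the partials, exactly the shape AVX2/FMA can process four doubles at a time. A sketch only, not Drake's implementation.

```cpp
// Chain rule for c = a*b over n partials: dc[i] = a*db[i] + b*da[i].
// This elementwise fused multiply-add is the pattern AVX2 + FMA vectorizes.
void PropagateProductDerivatives(double a, const double* da,
                                 double b, const double* db,
                                 int n, double* dc) {
  for (int i = 0; i < n; ++i) {
    dc[i] = a * db[i] + b * da[i];
  }
}
```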
FWIW, the Autodiff67d benchmark case spends a large fraction of its time in Eigen's dense assignment kernel. Here are some snapshots of the hotspots, as found by sampling:
Xd inverse dynamics hotspots:
67d inverse dynamics hotspots:
Perhaps it is worth noting that my opaque-seeming math fiddles in #14171 had the effect of trading calls to the Eigen dense assignment kernel for inlined instructions. It could be that we could pick up another sizeable win by convincing Eigen to do better vectorizing.
Just in case anyone is unclear -- #14171 was not relevant for 67d. All of the changes there touch only Drake's AutoDiffXd scalar; the 67d benchmark exercises only upstream Eigen code. Also note that in the Xd benchmark, there could be some scalars with zero derivatives (when you have constants) which would short-circuit the chain rule. Because the 67d benchmark used RowsAtCompileTime instead of MaxRowsAtCompileTime, you're using the chain rule everywhere even if there are no partials. It's not apples to apples with the Xd timings.
@jwnimmer-tri agreed on 67d using only upstream code. I don't understand your comment about RowsAtCompileTime() -- can you clarify?
I'm referring to the distinction between a derivative vector whose RowsAtCompileTime is 67 (always exactly 67 partials) and one whose RowsAtCompileTime is Dynamic but whose MaxRowsAtCompileTime is 67 (anywhere from 0 to 67 partials, stored inline); see the sketch below.
In both cases, we never allocate the derivatives on the heap. But in the latter case, we still check the runtime derivative count, so a scalar with no partials does essentially no chain-rule work. In most of the proposals I've seen to use "fixed size" autodiff, we would only set MaxRows, not Rows. Most users will not need all 67 all the time.
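Roughly, the two flavors being contrasted look like this (the alias names are illustrative, not necessarily the ones used in the benchmark):

```cpp
#include <Eigen/Core>
#include <unsupported/Eigen/AutoDiff>

// Fixed size: the derivative vector always has exactly 67 rows, so the chain
// rule always runs over all 67 partials, even for effectively-constant scalars.
using AutoDiff67d = Eigen::AutoDiffScalar<Eigen::Matrix<double, 67, 1>>;

// Fixed maximum size ("up to 67"): storage for 67 partials lives inline, so
// there is still no heap allocation, but the runtime size can be 0..67 and the
// derivative work scales with the number of partials actually present.
using AutoDiffUpTo67d = Eigen::AutoDiffScalar<
    Eigen::Matrix<double, Eigen::Dynamic, 1, Eigen::ColMajor, 67, 1>>;
```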
@jwnimmer-tri, I haven't seen the upstream implementation for 67d. So you believe that the upstream implementation is only provided for fixed-sizes but not optimized for fixed-max-sizes? (by checking .derivatives().size() to short circuit computations).
Good points, Jeremy. Rico, should be easy to try this with UpTo67d rather than 67d to see if that changes anything. This makes me think we would see much better results by specializing
(And continue to benefit from move and all-zero-derivatives optimizations.)
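The kind of short-circuiting being discussed looks roughly like this (a sketch only, not Eigen's or Drake's actual arithmetic):

```cpp
#include <Eigen/Core>
#include <unsupported/Eigen/AutoDiff>

// Sketch of a chain-rule multiply that skips derivative work when an operand
// carries no partials, in the spirit of the all-zero-derivatives optimization.
template <typename DerType>
Eigen::AutoDiffScalar<DerType> Multiply(
    const Eigen::AutoDiffScalar<DerType>& a,
    const Eigen::AutoDiffScalar<DerType>& b) {
  Eigen::AutoDiffScalar<DerType> result(a.value() * b.value());
  const bool a_has = a.derivatives().size() != 0;
  const bool b_has = b.derivatives().size() != 0;
  if (a_has && b_has) {
    result.derivatives() =
        a.value() * b.derivatives() + b.value() * a.derivatives();
  } else if (a_has) {
    result.derivatives() = b.value() * a.derivatives();
  } else if (b_has) {
    result.derivatives() = a.value() * b.derivatives();
  }
  // If neither operand has partials, the result keeps an empty (size-0)
  // derivative vector and no chain-rule work is done at all.
  return result;
}
```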
@amcastro-tri for
oh, I see! I thought the whole time @rpoyner-tri was using UpTo67d. My bad. Yes, we should be using UpTo67d instead I believe.
Initial experiments with UpTo67d show maybe a 20% speedup over Xd, but are more painful to work with. I still haven't got the forward dynamics case working; it dies trying to evaluate an expression tree containing an empty derivative deep inside Eigen. I'll have more to show from those experiments tomorrow.
In order to get all the cassie_bench cases working with UpTo67d autodiff, I had to locally install and use eigen 3.3.8, which has a fix for its
Comparison of timings from AutoDiffXd to AutoDiff67d (implemented as up-to):
Summary output, showing allocation counts and selected object sizes:
Limitations: why aren't we just doing Upto128d or something?
Current cassie_bench performance snapshot. This includes the effects of the AVX transform composition functions.
From the initial measurements, this is a 41.3/18.5 = 2.2X speedup for the mass matrix (or a 55% discount if you prefer). Inverse dynamics is 18% faster and forward dynamics is 6% faster (those haven't been worked on yet). The AutoDiff timings improved by around 5% over the times previously reported here, or about 30% overall.
I don't foresee doing much further work here. Handing back to @sherm1 for re-triage.
It would be nice to see an update to this once the changes from #16877 land.
Will do.
/cc @sherm1
We would like to capture some sort of baseline data from the cassie benchmark program to help guide development of performance optimizations. I have argued it should be stored out-of-tree because
Even with those challenges, I think it is useful to have a place to store and perhaps discuss baseline data. I propose to use this issue for that purpose until we arrive at better solutions.
cc'ing myself @amcastro-tri so that I can follow this.