Performance: significant speedup and memory usage improvement for PCA().run() #5352
jberg5 wants to merge 2 commits into MDAnalysis:develop from
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##           develop    #5352      +/-   ##
===========================================
- Coverage    93.84%   93.83%    -0.01%
===========================================
  Files          182      182
  Lines        22492    22495        +3
  Branches      3199     3199
===========================================
+ Hits         21107    21109        +2
- Misses         923      924        +1
  Partials      462      462
```
```diff
@@ -290,6 +295,7 @@ def _prepare(self):
         )
         n_dim = self._n_atoms * 3
         self.cov = np.zeros((n_dim, n_dim))
```
Technically we can remove this, but I wasn't sure if anyone out there would expect it to be accessible from `_prepare` onwards. No strong feelings either way: it's relatively cheap to initialize and maximizes backwards compatibility, but future readers might be confused by it.
BradyAJohnston left a comment
I'm going to block this until we have some core dev discussion about it.
Our AI Policy is currently quite restrictive on AI usage. It's this way to help us deal with the wave of slop PRs that we get, and it would currently exclude this PR.
Given the performance improvements, limited scope, and demonstrated understanding, I would argue this PR is worthy of an exception, but that is a different discussion that we need to have.
@BradyAJohnston thanks for the context, totally understand. Just to give a little more background on me and how I have been using Claude: I work in quant finance, where I've picked up some practical experience with using the scientific computing stack to try to pull signal out of noise. My thesis with Claude is that if I use it as a translator, it can help me map my experience onto different domains (perhaps with a little more benefit to society than my day job lol). Turns out PCA is good for more than just looking for factors driving equity prices!!!

The workflow I'm using looks something like:

0. clone a cool project like MDAnalysis
1. find or write suitable benchmarks of realistic workflows
2. tell Claude "here are some patterns I want you to look for" (Python loops that can be vectorized, linear algebra operations that don't fully exploit the problem structure, etc.)
3. sift through the slop to find legit improvements

I try really hard to stick to small diffs that I can completely stand behind. For this particular PR I ended up rewriting most of what Claude proposed anyway, and I think the end result is pretty much identical to what I would have done by hand; the only difference is that AI helped me find the opportunity and iterate much more quickly than I otherwise would have.

Anyway, I know that there's a lot of garbage out there these days, and it's gotta be a lot of work to wade through it, so I'll respect whatever decision you make. Really appreciate your time!
```python
        else:
            x = self._atoms.positions.ravel()
            x -= self._xmean
            self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
```
One reason for building the covariance matrix incrementally is that it bounds memory usage. In your approach, memory consumption grows linearly with trajectory length because you copy all coordinates into memory, so a long enough trajectory will eventually exhaust it. The test trajectory is very short; real trajectories are easily 1,000-100,000 times longer.
If there's really an advantage to multiplying huge matrices, then perhaps a batched approach is a good middle ground.
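A batched accumulation along those lines could look like the following sketch. This is a hypothetical illustration, not MDAnalysis code: the function name, `frames` iterable, and `batch_size` are all made up. Frames are buffered into a fixed-size array and flushed with a single GEMM, so the extra memory beyond the covariance matrix is bounded by `batch_size * n_dim` regardless of trajectory length.

```python
import numpy as np

def batched_covariance(frames, n_dim, batch_size=512):
    """Accumulate the sum of outer products x @ x.T in fixed-size batches.

    `frames` yields flat, already mean-centered coordinate vectors of
    length n_dim. Each full buffer is folded into `cov` with one GEMM
    (buf.T @ buf) instead of one rank-1 update per frame.
    """
    cov = np.zeros((n_dim, n_dim))
    buf = np.empty((batch_size, n_dim))
    n = 0
    for x in frames:
        buf[n] = x
        n += 1
        if n == batch_size:
            cov += buf.T @ buf   # one GEMM per batch
            n = 0
    if n:                        # flush the partial final batch
        cov += buf[:n].T @ buf[:n]
    return cov
```

This keeps the streaming property of the current implementation (frames are consumed one at a time) while still getting most of the GEMM throughput.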
Some manual benchmarking:

```python
import numpy as np
import MDAnalysis as mda
from MDAnalysis.tests import datafiles as data

u = mda.Universe(data.PSF, data.DCD)
x = u.atoms.positions.ravel()
```

shows that replacing the `np.dot` with transposes with `np.outer` will help already:

```
>>> %timeit np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
61.4 ms ± 466 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np.outer(x, x)
35.4 ms ± 249 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

However, the addition update is actually slow:

```
%%timeit cov = np.zeros((len(x), len(x))); xx = np.outer(x, x)
...: cov += xx
...:
97.5 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

so it makes sense to do the addition as part of the matrix multiplication.
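The algebraic identity behind folding the addition into the matrix multiplication can be checked directly with made-up data: summing per-frame rank-1 outer products gives the same matrix as one GEMM over the stacked frames.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_dim = 50, 12
X = rng.standard_normal((n_frames, n_dim))  # one centered frame per row

# Incremental: one rank-1 outer-product update per frame.
cov_loop = np.zeros((n_dim, n_dim))
for x in X:
    cov_loop += np.outer(x, x)

# Single GEMM over the stacked frames.
cov_gemm = X.T @ X

assert np.allclose(cov_loop, cov_gemm)
```

The GEMM form lets BLAS reuse each loaded cache line across many frames, which is where the speedup comes from.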
Batching is probably the way to go (but that makes streaming more difficult).
Fixes #5351
Changes made in this Pull Request:

- Build the covariance matrix with a single matrix multiplication in `_conclude`, rather than incrementally on every frame. This uses a much better optimized OpenBLAS GEMM routine. This is where the bulk of the performance improvement comes from.
- Switched `np.linalg.eig` to `np.linalg.eigh` to take advantage of the covariance matrix being symmetric and real-valued by construction. This is where the bulk of the memory usage improvement comes from: the `eigh` LAPACK routine allocates a smaller workspace.

Results from running a PCA analysis on adk equilibrium data on a gcloud c2d-standard-4:
Here's the gist I used, heads up though Claude wrote it: https://gist.github.com/jberg5/522fd1892f02453145e59e90e5a41e8b
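As a small aside on the `eig` → `eigh` part of the change, here's a minimal illustration with random data (not the PR's actual code): for a real symmetric matrix, `eigh` returns real eigenvalues in ascending order, so PCA-style code reverses them to get components ordered by descending variance.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
cov = A @ A.T                    # symmetric PSD by construction, like a covariance matrix

w, v = np.linalg.eigh(cov)       # real eigenvalues, ascending order
w_desc = w[::-1]                 # descending variance, as PCA wants
v_desc = v[:, ::-1]

# Each column of v_desc is an eigenvector: cov @ v = lambda * v.
assert np.allclose(cov @ v_desc, v_desc * w_desc)
```

With `np.linalg.eig` the results come back complex and unordered, so they would need an extra real-cast and sort on top of the larger workspace.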
LLM / AI generated code disclosure
LLMs or other AI-powered tools (beyond simple IDE use cases) were used in this contribution: yes
First pass of this was done by Claude (with opus 4.6). I've modified it a little bit to remove some unnecessary changes, improve variable names, etc. I also used Claude extensively to write benchmarking scripts.
PR Checklist

- `package/CHANGELOG` file updated?
- Is your name in `package/AUTHORS`? (If it is not, add it!)

Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5352.org.readthedocs.build/en/5352/