Performance: significant speedup and memory usage improvement for PCA().run() #5352
jberg5 wants to merge 2 commits into MDAnalysis:develop from
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff            @@
##           develop    #5352      +/-   ##
===========================================
- Coverage    93.84%   93.83%    -0.01%
===========================================
  Files          182      182
  Lines        22492    22495        +3
  Branches      3199     3199
===========================================
+ Hits         21107    21109        +2
- Misses         923      924        +1
  Partials      462      462
```
```diff
@@ -290,6 +295,7 @@ def _prepare(self):
         )
         n_dim = self._n_atoms * 3
         self.cov = np.zeros((n_dim, n_dim))
```
Technically we can remove this, but I wasn't sure if anyone out there would expect it to be accessible from `_prepare` onwards. No strong feelings either way: it's relatively cheap to initialize and maximizes backwards compatibility, but future readers might be confused by it.
BradyAJohnston left a comment
I'm going to block this until we have some core dev discussion about it.
Our AI Policy is currently quite restrictive on AI usage. It's this way to help us deal with the wave of slop PRs that we get, and it would currently exclude this PR.
Given the performance improvements, limited scope, and demonstrated understanding, I would argue this PR is worthy of an exception, but that is a different discussion that we need to have.
@BradyAJohnston thanks for the context, totally understand. Just to give a little more background on me and how I have been using Claude: I work in quant finance, where I've picked up some practical experience with using the scientific computing stack to try to pull signal out of noise. My thesis with Claude is that if I use it as a translator, it can help me map my experience onto different domains (perhaps with a little more benefit to society than my day job lol). Turns out PCA is good for more than just looking for factors driving equity prices!!!

The workflow I'm using looks something like:

0. clone a cool project like MDAnalysis
1. find or write suitable benchmarks of realistic workflows
2. tell Claude "here are some patterns I want you to look for" (Python loops that can be vectorized, linear algebra operations that don't fully exploit the problem structure, etc.)
3. sift through the slop to find legit improvements

I try really hard to stick to small diffs that I can completely stand behind. For this particular PR I ended up rewriting most of what Claude proposed anyway, and I think the end result is pretty much identical to what I would have done by hand; the only difference is that AI helped me find the opportunity and iterate much more quickly than I otherwise would have.

Anyway, I know that there's a lot of garbage out there these days, and it's gotta be a lot of work to wade through it, so I'll respect whatever decision you make. Really appreciate your time!
```python
        else:
            x = self._atoms.positions.ravel()
            x -= self._xmean
            self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
```
One reason for building the covariance matrix incrementally is that it bounds memory usage. In your approach, memory consumption grows linearly with trajectory length because you copy all coordinates into memory, so a long enough trajectory will eventually exhaust it. The test trajectory is very short; real trajectories are easily 1,000-100,000 times longer.
If there's really an advantage to multiplying huge matrices, then perhaps a batched approach is a good middle ground.
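A batched accumulation along those lines could look like the following sketch. This is a hypothetical illustration, not MDAnalysis code: the function name, `frames` iterable, and `batch_size` are all made up. Frames are buffered into a fixed-size array and flushed with a single GEMM, so the extra memory beyond the covariance matrix is bounded by `batch_size * n_dim` regardless of trajectory length.

```python
import numpy as np

def batched_covariance(frames, n_dim, batch_size=512):
    """Accumulate the sum of outer products x @ x.T in fixed-size batches.

    `frames` yields flat, already mean-centered coordinate vectors of
    length n_dim. Each full buffer is folded into `cov` with one GEMM
    (buf.T @ buf) instead of one rank-1 update per frame.
    """
    cov = np.zeros((n_dim, n_dim))
    buf = np.empty((batch_size, n_dim))
    n = 0
    for x in frames:
        buf[n] = x
        n += 1
        if n == batch_size:
            cov += buf.T @ buf   # one GEMM per batch
            n = 0
    if n:                        # flush the partial final batch
        cov += buf[:n].T @ buf[:n]
    return cov
```

This keeps the streaming property of the current implementation (frames are consumed one at a time) while still getting most of the GEMM throughput.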
Some manual benchmarking:

```python
import numpy as np
import MDAnalysis as mda
from MDAnalysis.tests import datafiles as data

u = mda.Universe(data.PSF, data.DCD)
x = u.atoms.positions.ravel()
```

shows that replacing the `np.dot` with transposes with `np.outer` will help already:

```
>>> %timeit np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
61.4 ms ± 466 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit np.outer(x, x)
35.4 ms ± 249 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

However, the addition update is actually slow:

```
%%timeit cov = np.zeros((len(x), len(x))); xx = np.outer(x, x)
...: cov += xx
...:
97.5 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

so it makes sense to do the addition as part of the matrix multiplication.
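The algebraic identity behind folding the addition into the matrix multiplication can be checked directly with made-up data: summing per-frame rank-1 outer products gives the same matrix as one GEMM over the stacked frames.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_dim = 50, 12
X = rng.standard_normal((n_frames, n_dim))  # one centered frame per row

# Incremental: one rank-1 outer-product update per frame.
cov_loop = np.zeros((n_dim, n_dim))
for x in X:
    cov_loop += np.outer(x, x)

# Single GEMM over the stacked frames.
cov_gemm = X.T @ X

assert np.allclose(cov_loop, cov_gemm)
```

The GEMM form lets BLAS reuse each loaded cache line across many frames, which is where the speedup comes from.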
Batching is probably the way to go (but that makes streaming more difficult).
Fixes #5351
Changes made in this Pull Request:

- Build the covariance matrix with a single matrix multiplication in `_conclude`, rather than incrementally on every frame. This uses a much better optimized OpenBLAS GEMM routine. This is where the bulk of the performance improvement comes from.
- Switched `np.linalg.eig` to `np.linalg.eigh` to take advantage of the covariance matrix being symmetric and real-valued by construction. This is where the bulk of the memory usage improvement comes from: the `eigh` LAPACK routine allocates a smaller workspace.

Results from running a PCA analysis on adk equilibrium data on a gcloud c2d-standard-4:
Here's the gist I used, heads up though Claude wrote it: https://gist.github.com/jberg5/522fd1892f02453145e59e90e5a41e8b
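As a small aside on the `eig` → `eigh` part of the change, here's a minimal illustration with random data (not the PR's actual code): for a real symmetric matrix, `eigh` returns real eigenvalues in ascending order, so PCA-style code reverses them to get components ordered by descending variance.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
cov = A @ A.T                    # symmetric PSD by construction, like a covariance matrix

w, v = np.linalg.eigh(cov)       # real eigenvalues, ascending order
w_desc = w[::-1]                 # descending variance, as PCA wants
v_desc = v[:, ::-1]

# Each column of v_desc is an eigenvector: cov @ v = lambda * v.
assert np.allclose(cov @ v_desc, v_desc * w_desc)
```

With `np.linalg.eig` the results come back complex and unordered, so they would need an extra real-cast and sort on top of the larger workspace.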
LLM / AI generated code disclosure
LLMs or other AI-powered tools (beyond simple IDE use cases) were used in this contribution: yes
First pass of this was done by Claude (with opus 4.6). I've modified it a little bit to remove some unnecessary changes, improve variable names, etc. I also used Claude extensively to write benchmarking scripts.
PR Checklist

- `package/CHANGELOG` file updated?
- Is your name in `package/AUTHORS`? (If it is not, add it!)

Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5352.org.readthedocs.build/en/5352/