Description
Is your feature request related to a problem?
tl;dr: `PCA().run()` could be much faster if we computed the covariance matrix in one shot rather than building it incrementally per frame, and it would also benefit a lot from switching from `np.linalg.eig` to `np.linalg.eigh`.
Taking the ADK equilibrium trajectory data as an example workflow, we might do something like:

```python
import MDAnalysis as mda
from MDAnalysisData import datasets
from MDAnalysis.analysis import pca

adk = datasets.fetch_adk_equilibrium()
universe = mda.Universe(adk.topology, adk.trajectory)
pc = pca.PCA(universe, align=True)
pc.run()
```

This takes 881 seconds (~15 min) on a GCloud c2d-standard-4 (only 5.2 s, though, if you restrict the analysis to the Cα atoms or a similarly small selection).
Among the largest contributors to the runtime is the incremental covariance matrix computation (one `np.dot` call per frame):

`mdanalysis/package/MDAnalysis/analysis/pca.py`, line 332 in 35e4334:

```python
self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
```
Another is the use of `eig` to calculate the eigenvalues and eigenvectors of that covariance matrix:

`mdanalysis/package/MDAnalysis/analysis/pca.py`, line 336 in 35e4334:

```python
e_vals, e_vects = np.linalg.eig(self.cov)
```
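To illustrate why `eigh` is the right tool here, a minimal sketch with synthetic data (not the MDAnalysis internals): `eigh` exploits symmetry and returns real eigenvalues in ascending order, so a descending sort is needed to match PCA's variance-ranked components.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((50, 5))
cov = np.cov(a, rowvar=False)  # symmetric and real by construction

# eigh: guaranteed-real outputs, eigenvalues in ascending order
e_vals, e_vecs = np.linalg.eigh(cov)
idx = np.argsort(e_vals)[::-1]          # PCA wants descending variance
e_vals, e_vecs = e_vals[idx], e_vecs[:, idx]

# cross-check against the general solver (same spectrum, up to ordering)
g_vals, _ = np.linalg.eig(cov)
assert np.allclose(np.sort(e_vals), np.sort(g_vals.real))
```

Beyond the speed difference, `eigh` avoids the complex-dtype ambiguity of `eig` and removes the need to worry about spurious imaginary parts from floating-point asymmetry.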
Describe the solution you'd like
Two changes here would make a big difference:
- Computing the covariance matrix in one shot during `_conclude()`, which can use the highly optimized BLAS GEMM kernel.
- Using `eigh` instead of `eig`, since the covariance matrix is symmetric and real-valued by construction.
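A sketch of the first change, using synthetic coordinates rather than a real trajectory; the loop below mimics the current per-frame rank-1 update, and the one-shot version replaces it with a single GEMM over the centered coordinate matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_dims = 100, 30
coords = rng.standard_normal((n_frames, n_dims))
mean = coords.mean(axis=0)

# Incremental: one rank-1 outer-product update per frame (current approach)
cov_incr = np.zeros((n_dims, n_dims))
for frame in coords:
    x = frame - mean
    cov_incr += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)
cov_incr /= n_frames - 1

# One shot: a single matrix product over all centered frames (proposed)
centered = coords - mean
cov_once = centered.T @ centered / (n_frames - 1)

assert np.allclose(cov_incr, cov_once)
```

The two are mathematically identical; the one-shot version just lets BLAS do the whole accumulation in one call instead of `n_frames` tiny ones.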
Describe alternatives you've considered
The alternative is to make only one of these changes, or neither, but I think both are fairly straightforward. There will be a modest change in peak memory usage: building the covariance matrix in one shot requires holding more data than the incremental approach, but `eigh` allocates a smaller workspace than `eig`, so it roughly evens out. I'll include memory profiling in my PR.
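For reference, a quick microbenchmark sketch (timings are machine- and BLAS-dependent; `eigh` dispatches to LAPACK's symmetric eigensolver, which is typically several times faster than the general solver behind `eig` on matrices of this size):

```python
import time

import numpy as np

n = 500
rng = np.random.default_rng(2)
m = rng.standard_normal((n, n))
cov = m @ m.T  # symmetric positive semi-definite test matrix

t0 = time.perf_counter()
np.linalg.eig(cov)
t_eig = time.perf_counter() - t0

t0 = time.perf_counter()
np.linalg.eigh(cov)
t_eigh = time.perf_counter() - t0

print(f"eig: {t_eig:.3f}s  eigh: {t_eigh:.3f}s")
```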