Skip to content

Performance improvement: PCA could be much faster #5351

@jberg5

Description

@jberg5

Is your feature request related to a problem?

tl;dr PCA().run() could be much faster if we compute the covariance matrix in one shot rather than build it incrementally per-frame, and also would benefit a lot from switching from np.linalg.eig to np.linalg.eigh.

Taking the ADK equilibrium trajectory data as an example workflow, we might do something like:

import MDAnalysis as mda
from MDAnalysisData import datasets
from MDAnalysis.analysis import pca

adk = datasets.fetch_adk_equilibrium()
universe = mda.Universe(adk.topology, adk.trajectory)
pc = pca.PCA(universe, align=True)
pc.run()

This takes 881 seconds (~15 mins) on a GCloud c2d-standard-4 (only 5.2s though if you only want to look at Calcium atoms or something).
Among the largest contributors to the runtime are the incremental covariance matrix computation (one dot call per frame):

self.cov += np.dot(x[:, np.newaxis], x[:, np.newaxis].T)

And also using eig to calculate the eigenvalues and eigenvectors of that covariance matrix:

e_vals, e_vects = np.linalg.eig(self.cov)

Describe the solution you'd like

Two changes here would make a big difference:

  • Computing the covariance matrix once during _conclude (can use the highly optimized blas gemm kernel).
  • Using eigh instead of eig, since the covariance matrix is symmetric and real-valued by construction.

Describe alternatives you've considered

The alternative is to do one or the other or nothing, but I think both of these changes are fairly straightforward. There's going to be a modest change in peak memory usage; building the covariance matrix in one go will require more than before, but eigh allocates a smaller workspace, so it's a bit of a wash. I'll put some memory profiling in my PR.

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions