Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCA p_components and cumulated_variance incorrect when n_components is specified #2623

Closed
smbrougham opened this issue Mar 13, 2020 · 3 comments · Fixed by #2613
Closed

PCA p_components and cumulated_variance incorrect when n_components is specified #2623

smbrougham opened this issue Mar 13, 2020 · 3 comments · Fixed by #2613

Comments

@smbrougham
Copy link

Expected behavior

p_components is a matrix that defines the first n principal components, where n is specified upon initialization. cumulated_variance is a vector that reflects the proportion of the total variance that is explained by the corresponding principal component and the ones that precede it, no matter how many PCs are kept.

Actual behavior

When n_components is specified, p_components is a matrix with n rows. This produces an incorrect result because the components are represented as the column vectors. The values of the cumulated_variance vector change depending on n_components, and always converge to 1.

Code to reproduce the behavior

This code uses my trajectory, but it illustrates the point. There are 238 atoms, and selecting 5 components. p_components is a 5x714 matrix where it should instead be a 714x5. The cumulated_variance converges to 1 even though the first 5 components do not actually explain 100% of the variation, but rather about 62%.

import MDAnalysis.analysis.pca as pca

p = pca.PCA(u, select='name CA', n_components=5).run()
p.p_components.shape
# (5, 714)
p.cumulated_variance
# array([0.6930588 , 0.80182406, 0.88727841, 0.94546782, 1.        ])

Current version of MDAnalysis

MDAnalysis 0.21.1
Python 3.7.5
Ubuntu 18.04

Proposed solution

The _conclude function (line 272-281) of pca.py is currently:

    def _conclude(self):
        self.cov /= self.n_frames - 1
        e_vals, e_vects = np.linalg.eig(self.cov)
        sort_idx = np.argsort(e_vals)[::-1]
        self.variance = e_vals[sort_idx]
        self.variance = self.variance[:self.n_components]
        self.p_components = e_vects[:self.n_components, sort_idx]
        self.cumulated_variance = (np.cumsum(self.variance) /
                                   np.sum(self.variance))
        self._calculated = True

Fix:

    def _conclude(self):
        self.cov /= self.n_frames - 1
        e_vals, e_vects = np.linalg.eig(self.cov)
        sort_idx = np.argsort(e_vals)[::-1]
        self.variance = e_vals[sort_idx]
        self.cumulated_variance = (np.cumsum(self.variance) /
                                   np.sum(self.variance)) # calculated before variance slice
        self.variance = self.variance[:self.n_components]
        self.cumulated_variance = self.cumulated_variance[:self.n_components]
        self.p_components = e_vects[:, sort_idx[:self.n_components]] # all rows, slice of columns
        self._calculated = True

And the documentation at line 136 should read "p_components: array, (n_atoms * 3, n_components)" instead of "p_components: array, (n_components, n_atoms * 3)"

@shfrz
Copy link
Contributor

shfrz commented Mar 14, 2020

Hi @smbrougham, I have implemented the suggested fix and opened a PR :)

@Purva-Chaudhari
Copy link

Hello everyone, Is the issue still open ?

@orbeckst
Copy link
Member

cc @ianmkenney @VOD555 – this is the same issue that you mentioned recently in our internal discussions

@orbeckst orbeckst added this to the 1.0 milestone May 14, 2020
lilyminium added a commit that referenced this issue May 18, 2020
#2613)

- fixes #2623 and now correctly computes cumulated variance
- adds root mean square inner product and cumulative overlap method as ways to compare subspaces
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment