PCA p_components and cumulated_variance incorrect when n_components is specified #2623

smbrougham · 2020-03-13T16:18:01Z

Expected behavior

p_components is a matrix that defines the first n principal components, where n is specified upon initialization. cumulated_variance is a vector that reflects the proportion of the total variance that is explained by the corresponding principal component and the ones that precede it, no matter how many PCs are kept.

Actual behavior

When n_components is specified, p_components is a matrix with n rows. This is incorrect because the components are the column vectors of the eigenvector matrix, so slicing rows truncates every component instead of selecting the first n. The values of the cumulated_variance vector also change depending on n_components, and the last value is always 1.

Code to reproduce the behavior

This code uses my own trajectory, but it illustrates the point. There are 238 atoms, and I select 5 components. p_components is a 5x714 matrix when it should be 714x5. cumulated_variance reaches 1 even though the first 5 components explain only about 62% of the variance, not 100%.

import MDAnalysis.analysis.pca as pca

# u is an MDAnalysis Universe loaded from my trajectory (not shown here)
p = pca.PCA(u, select='name CA', n_components=5).run()
p.p_components.shape
# (5, 714)
p.cumulated_variance
# array([0.6930588 , 0.80182406, 0.88727841, 0.94546782, 1.        ])
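
For anyone without a trajectory at hand, the cumulated_variance symptom can be reproduced with NumPy alone. This is only a sketch and not MDAnalysis code: the synthetic data and all variable names are assumptions made for illustration, and np.linalg.eigh is swapped in for np.linalg.eig because a covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for flattened coordinates: 200 frames, 4 atoms * 3 = 12 columns.
# Columns are scaled differently so the variance is spread over many components.
x = rng.normal(size=(200, 12)) * np.arange(1, 13)
cov = np.cov(x, rowvar=False)

e_vals, e_vects = np.linalg.eigh(cov)
sort_idx = np.argsort(e_vals)[::-1]
variance = e_vals[sort_idx]
n = 3  # plays the role of n_components

# Dividing by the sum of the truncated variances forces the last entry to 1
renormalized = np.cumsum(variance[:n]) / np.sum(variance[:n])
# Dividing by the sum of all variances gives the fraction actually explained
true_fraction = np.cumsum(variance) / np.sum(variance)

print(renormalized)        # last entry is 1 by construction
print(true_fraction[:n])   # last entry stays below 1
```

The first array ends at 1 regardless of n, while the second one reports how much of the total variance the first n components really cover.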

Current version of MDAnalysis

MDAnalysis 0.21.1
Python 3.7.5
Ubuntu 18.04

Proposed solution

The _conclude function (lines 272-281) of pca.py is currently:

    def _conclude(self):
        self.cov /= self.n_frames - 1
        e_vals, e_vects = np.linalg.eig(self.cov)
        sort_idx = np.argsort(e_vals)[::-1]
        self.variance = e_vals[sort_idx]
        self.variance = self.variance[:self.n_components]
        self.p_components = e_vects[:self.n_components, sort_idx]
        self.cumulated_variance = (np.cumsum(self.variance) /
                                   np.sum(self.variance))
        self._calculated = True
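
The problem with the row slice can be checked directly, because np.linalg.eig (and eigh) return eigenvectors as columns: column i of the returned matrix belongs to eigenvalue i. The sketch below uses synthetic data, and the helper is_eigenvector is a hypothetical name introduced only for this check. np.linalg.eigh is used instead of eig since the covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated synthetic data: 100 frames, 6 coordinates
x = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
cov = np.cov(x, rowvar=False)

e_vals, e_vects = np.linalg.eigh(cov)
sort_idx = np.argsort(e_vals)[::-1]
n = 2  # plays the role of n_components

row_slice = e_vects[:n, sort_idx]      # shape (2, 6): what the current code keeps
col_slice = e_vects[:, sort_idx[:n]]   # shape (6, 2): the first n eigenvectors

def is_eigenvector(v, lam):
    # v is an eigenvector of cov with eigenvalue lam if cov @ v == lam * v
    return np.allclose(cov @ v, lam * v)

cols_ok = all(is_eigenvector(col_slice[:, i], e_vals[sort_idx[i]]) for i in range(n))
rows_ok = all(is_eigenvector(row_slice[i], e_vals[sort_idx[i]]) for i in range(n))
print(cols_ok, rows_ok)
```

Only the column slice satisfies the eigenvector equation. The rows of the row slice have the right length but mix one coordinate from every eigenvector.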

Fix:

    def _conclude(self):
        self.cov /= self.n_frames - 1
        e_vals, e_vects = np.linalg.eig(self.cov)
        sort_idx = np.argsort(e_vals)[::-1]
        self.variance = e_vals[sort_idx]
        self.cumulated_variance = (np.cumsum(self.variance) /
                                   np.sum(self.variance)) # calculated before variance slice
        self.variance = self.variance[:self.n_components]
        self.cumulated_variance = self.cumulated_variance[:self.n_components]
        self.p_components = e_vects[:, sort_idx[:self.n_components]] # all rows, slice of columns
        self._calculated = True

And the documentation at line 136 should read "p_components: array, (n_atoms * 3, n_components)" instead of "p_components: array, (n_components, n_atoms * 3)".
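
One way to sanity-check the (n_atoms * 3, n_components) layout is to project centered coordinates onto the components and compare the variance of each projection with the corresponding eigenvalue. The following is a rough sketch under stated assumptions: conclude_fixed is a hypothetical standalone function that follows the fix above, the data are synthetic, and np.linalg.eigh replaces np.linalg.eig because the covariance matrix is symmetric.

```python
import numpy as np

def conclude_fixed(cov, n_components):
    """Hypothetical standalone version of the proposed fix."""
    e_vals, e_vects = np.linalg.eigh(cov)
    sort_idx = np.argsort(e_vals)[::-1]
    variance = e_vals[sort_idx]
    # Normalize by the total variance before truncating
    cumulated = np.cumsum(variance) / np.sum(variance)
    # All rows, first n_components columns in sorted order
    p_components = e_vects[:, sort_idx[:n_components]]
    return variance[:n_components], cumulated[:n_components], p_components

rng = np.random.default_rng(2)
n_frames, n_atoms = 300, 4
dim = n_atoms * 3
x = rng.normal(size=(n_frames, dim)) @ rng.normal(size=(dim, dim))
x -= x.mean(axis=0)  # center the coordinates
cov = np.cov(x, rowvar=False)

variance, cumulated, p_components = conclude_fixed(cov, n_components=5)
projected = x @ p_components  # (n_frames, dim) @ (dim, n_components)

print(p_components.shape)  # (12, 5)
print(projected.shape)     # (300, 5)
```

With this layout the projection is a single matrix product, and the sample variance of each projected column matches the entry of variance for that component.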

shfrz · 2020-03-14T05:28:04Z

Hi @smbrougham, I have implemented the suggested fix and opened a PR :)

Purva-Chaudhari · 2020-04-03T20:03:06Z

Hello everyone, is this issue still open?

orbeckst · 2020-05-14T10:06:04Z

cc @ianmkenney @VOD555 – this is the same issue that you mentioned recently in our internal discussions

#2613) - fixes #2623 and now correctly computes cumulated variance - adds root mean square inner product and cumulative overlap method as ways to compare subspaces

lilyminium added the Component-Analysis label Mar 14, 2020

shfrz mentioned this issue Mar 14, 2020

Issue2623 #2625

Closed

lilyminium added the GSOC Starter label Mar 22, 2020

lilyminium mentioned this issue Apr 11, 2020

Added PCA subspace comparison methods and fixed n_components selection #2613

Merged

orbeckst assigned lilyminium May 14, 2020

orbeckst added the defect label May 14, 2020

orbeckst added this to the 1.0 milestone May 14, 2020

lilyminium closed this as completed in #2613 May 18, 2020
