2 PCA implementations that give same results but different from Python scikit-learn implementation ... #81

Closed
SergeStinckwich opened this issue Aug 3, 2018 · 17 comments

@SergeStinckwich commented Aug 3, 2018

We have 2 implementations of PCA:

  • one based on Jacobi Transformation of the covariance matrix
  • another one based on SVD.

They give the same results as each other, but those results differ from what you get with scikit-learn in Python:

import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
pca.components_
pca.transform(X)

pca.components_ returns:

array([[-0.83849224, -0.54491354],
       [ 0.54491354, -0.83849224]])

pca.transform(X) returns:

array([[ 1.38340578,  0.2935787 ],
       [ 2.22189802, -0.25133484],
       [ 3.6053038 ,  0.04224385],
       [-1.38340578, -0.2935787 ],
       [-2.22189802,  0.25133484],
       [-3.6053038 , -0.04224385]])

I tried to implement a sign-flipping method like this one: https://github.com/scikit-learn/scikit-learn/blob/4c65d8e615c9331d37cbb6225c5b67c445a5c959/sklearn/utils/extmath.py#L609
but have not succeeded so far.

Please have a look at the tests in PMPrincipalComponentAnalyserTest.
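
For reference, here is a minimal NumPy sketch (not PolyMath code) of the sign convention that the linked scikit-learn routine applies after the SVD, assuming the usual rule that, for each component, the entry of U with the largest absolute value is made positive:

import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
Xc = X - X.mean(axis=0)                      # centre the data, as PCA.fit does
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

max_abs_rows = np.argmax(np.abs(U), axis=0)  # row of the largest |entry| in each column of U
signs = np.sign(U[max_abs_rows, range(U.shape[1])])
U *= signs                                   # flip the columns of U ...
Vt *= signs[:, np.newaxis]                   # ... and the corresponding rows of V^T

print(Vt)     # should correspond to pca.components_ above (up to rounding)
print(U * S)  # should correspond to pca.transform(X) above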

@hemalvarambhia

Maybe this is an odd suggestion, but to debug this, could we write a simpler test, say with the identity matrix as X? I wrote this test in Pharo and the output is:

( -1/sqrt(2), 1/sqrt(2) )
( 1/sqrt(2), 1/sqrt(2) )

which, interestingly, agrees with the value of pca.components_ from numpy:

[[-0.70710678 0.70710678]
[ 0.70710678 0.70710678]]
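
For reference, a quick way to reproduce the numpy side of that check (assuming scikit-learn is installed):

import numpy as np
from sklearn.decomposition import PCA

X = np.eye(2)                   # the 2x2 identity matrix as a toy dataset
pca = PCA(n_components=2).fit(X)
print(pca.components_)          # principal axes reported by scikit-learn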

@hemalvarambhia commented Jan 16, 2019

Hi @SergeStinckwich, I looked into the failing test for the Jacobi Transformation form of PCA a little deeper and noted that transposing and negating the matrix of eigenvectors results in the test passing:

[screenshot: transposeandnegate]

The Scikit-Learn library, from what I can see, uses the SVD implementation, which involves flipping signs. For this test, then, does it make sense to directly compare the Jacobi matrix of eigenvectors with the SVD-computed one? Might we instead assert that the negated transpose of the Jacobi matrix equals the SVD one?

Please forgive my ignorance- I'm not too familiar with this domain, so I'm sure I'm missing something!
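
To make the proposal concrete, the suggested assertion could look something like this in NumPy terms (a hypothetical helper, not an existing PolyMath or scikit-learn API):

import numpy as np

def assert_negated_transpose_matches(jacobi_eigenvectors, svd_components, tol=1e-8):
    # Assert that -jacobi_eigenvectors^T equals the SVD-computed components.
    assert np.allclose(-np.asarray(jacobi_eigenvectors).T,
                       np.asarray(svd_components), atol=tol)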

@hemalvarambhia commented Jan 16, 2019

@SergeStinckwich, I noted a similar thing with PolyMath's PCA implementation using SVD, where transposing and negating the matrix returned by transformMatrix also gets the test passing:

[screenshot: pca_svd_test]

This phenomenon is interesting, I think, perhaps worth investigating.

Also I note that flipEigenvectorsSign has been commented out in the fit: message because it doesn't work. Could you elaborate on this a little more, please?

@SergeStinckwich

Thank you for spending time investigating this further. I have forgotten the details of the implementation a little; I will have to spend some time to remember what I did here.

@hemalvarambhia

Thank you for your help on this, @SergeStinckwich. The code is really well written compared to the Python library. I read this paper on the subject and your work was very easy to follow from it. So far, the difference between your work and the Python one is just the transpose-and-negate.

@hemalvarambhia commented Mar 11, 2019

Hi @SergeStinckwich,
I made some more progress on this: I extracted the algorithm from SciKit-Learn and tried to determine the outputs of each step. Here is something you can run on the command line to see this at a more granular level: scikit-learn flip algorithm

Given that in SciKit-Learn, U is

[[-0.21956688  0.53396977]
 [-0.35264795 -0.45713538]
 [-0.57221483  0.07683439]
 [ 0.21956688 -0.53396977]
 [ 0.35264795  0.45713538]
 [ 0.57221483 -0.07683439]]

(the first two columns from PolyMath match SciKit-Learn's)

The first thing to note is the max_abs_column variable, which in SciKit-Learn is a 2-element array, [2 0], whereas in PolyMath it is #(3 4 5...). For the former, this leads to a sub-matrix [-0.57221483 0.53396977], which then leads to the signs being [-1 1]. This is the key difference, because in PolyMath we get #(-1 -1 -1 1 1 -1). Our argMaxOnColumns message yields -0.53396... from row 4, column 2, rather than 0.53396977 at row 1.
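
A hedged reproduction of that step in NumPy, using the U matrix quoted above (not PolyMath code):

import numpy as np

U = np.array([[-0.21956688,  0.53396977],
              [-0.35264795, -0.45713538],
              [-0.57221483,  0.07683439],
              [ 0.21956688, -0.53396977],
              [ 0.35264795,  0.45713538],
              [ 0.57221483, -0.07683439]])

max_abs_cols = np.argmax(np.abs(U), axis=0)          # -> [2 0] (0-indexed rows)
signs = np.sign(U[max_abs_cols, range(U.shape[1])])  # -> [-1.  1.]
print(max_abs_cols, signs)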

@hemalvarambhia commented Mar 12, 2019

On further investigation: when we compute the maxAbsColumn in PolyMath, we get the element at row 4 as the highest value because it is 0.5339697737377911, whereas at row 1 it is 0.5339697737377903. With SciKit-Learn the columns aren't as precise, as you know, so perhaps we need to do the same and round off.
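
A small sketch of the rounding idea (illustrative values based on the observation above; 0-indexed, so rows 1 and 4 become indices 0 and 3):

import numpy as np

col = np.array([0.5339697737377903, -0.45713538, 0.07683439,
                -0.5339697737377911, 0.45713538, -0.07683439])
print(np.argmax(np.abs(col)))                # 3: the ~1e-15 difference decides the winner
print(np.argmax(np.round(np.abs(col), 12)))  # 0: after rounding, the first of the tied rows wins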

hemalvarambhia self-assigned this Mar 13, 2019
@SergeStinckwich

In Mathematica if we use SVD:

m = {{-1, -1}, {2, -1}, {-3, -2}, {1, 1}, {2, 1}, {3, 2}}
{{-1, -1}, {2, -1}, {-3, -2}, {1, 1}, {2, 1}, {3, 2}}
{u, w, v} = SingularValueDecomposition[N[m]]
{{{0.227413, -0.184384, -0.602751, 0.246542, 0.356209, 
   0.602751}, {-0.204296, -0.949274, 0.0746245, 
   0.109693, -0.184317, -0.0746245}, {0.598729, -0.113805, 0.705625, 
   0.128753, 0.165622, 0.294375}, {-0.227413, 0.184384, 0.106997, 
   0.942819, -0.0498161, -0.106997}, {-0.371316, -0.070579, 
   0.187377, -0.0715718, 0.884194, -0.187377}, {-0.598729, 0.113805, 
   0.294375, -0.128753, -0.165622, 0.705625}}, {{6.01037, 0.}, {0., 
   1.96863}, {0., 0.}, {0., 0.}, {0., 0.}, {0., 
   0.}}, {{-0.86491, -0.501927}, {-0.501927, 0.86491}}}
v
{{-0.86491, -0.501927}, {-0.501927, 0.86491}}

@SergeStinckwich commented Mar 27, 2019

Action points now:

  • try to find another example as an acceptance test,
  • run this example against PolyMath, Mathematica and scikit-learn in order to find some patterns,
  • how is SVD done in NumPy? What decisions does it take about signs? (see the sketch below)
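
On the last point, a starting sketch (assuming NumPy; as far as I can tell, numpy.linalg.svd applies no sign convention of its own, so the signs are whatever LAPACK returns):

import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(U)    # left singular vectors; each column only determined up to sign
print(S)    # singular values
print(Vt)   # rows are the principal axes, again only determined up to sign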

@hemalvarambhia commented Mar 27, 2019

I tried Wolfram Alpha and our V matrix matches theirs.

@hemalvarambhia

Wolfram Alpha:
[screenshot: Wolfram Alpha result, 2019-03-27]

@hemalvarambhia

PolyMath calculation: here we see it matches Wolfram Alpha exactly:
[screenshot: polymath_calculation]

@hemalvarambhia commented Mar 27, 2019

The next example we could try is in section 3.1 of this article

Here is the answer according to Wolfram Alpha. You can see that the V matrix agrees, in that the elements have the same signs and magnitudes.

@hemalvarambhia commented Apr 14, 2019

@SergeStinckwich In relation to the first of our action points, to find another example as an acceptance test, I tried the example in section 3.1 of the PCA tutorial. The PCA of scikit-learn matches the output published in the above paper almost exactly:

[[-0.82797019 -0.17511531]
 [ 1.77758033  0.14285723]
 [-0.99219749  0.38437499]
 [-0.27421042  0.13041721]
 [-1.67580142 -0.20949846]
 [-0.9129491   0.17528244]
 [ 0.09910944 -0.3498247 ]
 [ 1.14457216  0.04641726]
 [ 0.43804614  0.01776463]
 [ 1.22382056 -0.16267529]]

This is not the case for PolyMath, sadly, for both implementations. However, as we've discovered, the problem is a bit further up, in how we're computing the eigenvectors.

@hemalvarambhia commented Apr 19, 2019

@SergeStinckwich Out of curiosity, I used the mean centred data from the PCA tutorial:

    x      y
  0.69   0.49
 -1.31  -1.21
  0.39   0.99
  0.09   0.29
  1.29   1.09
  0.49   0.79
  0.19  -0.31
 -0.81  -0.81
 -0.31  -0.31
 -0.71  -1.01

and for the SVD-based implementation of PCA, PolyMath's output is correct up to negation:

[screenshot: polymath_pca_tutorial_example]

Similarly for the Jacobi implementation, again our output is correct up to negation:

[screenshot: PolyMathPCATutorialExample_Jacobi]
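
To cross-check, here is a hedged scikit-learn run on the mean-centred data above (not PolyMath code); since PCA re-centres its input, the scores should match the scikit-learn output quoted earlier, and PolyMath's columns can then be compared against them up to a per-column sign flip:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[ 0.69,  0.49], [-1.31, -1.21], [ 0.39,  0.99], [ 0.09,  0.29],
              [ 1.29,  1.09], [ 0.49,  0.79], [ 0.19, -0.31], [-0.81, -0.81],
              [-0.31, -0.31], [-0.71, -1.01]])

scores = PCA(n_components=2).fit_transform(X)
print(scores)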

@hemalvarambhia commented Apr 20, 2019

On point 3, Scikit-Learn uses scipy for the SVD part of the PCA. In turn, scipy delegates to LAPACK driver routines and, according to the documentation, uses gesdd by default (the available options are gesdd and gesvd). I could not see anything in the scipy source code (decomp_svd.py) indicating any decision about flipping signs.
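
For completeness, a small sketch showing how the driver can be selected explicitly in scipy (neither driver appears to apply a deterministic sign convention, as far as I can tell):

import numpy as np
from scipy.linalg import svd

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)

for driver in ("gesdd", "gesvd"):
    U, s, Vt = svd(X, full_matrices=False, lapack_driver=driver)
    print(driver, Vt)   # compare the raw signs returned by each driver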
