Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with residual variances for PCA/PPCA #156

Closed
mgmverburg opened this issue Jun 3, 2021 · 1 comment
Closed

Bug with residual variances for PCA/PPCA #156

mgmverburg opened this issue Jun 3, 2021 · 1 comment
Labels

Comments

@mgmverburg
Copy link

Hi, so I noticed a bug I believe with PCA and PPCA
So PCA implements tresidualvar described in the docs as "The total residual variance."
And PPCA implements var described in the docs as "The total residual variance."

However, I noticed the implementation varies. With PPCA, with the :ml method selected, when you have identical projection for both PCA and PPCA, the variance given by tresidualvar of PCA is much smaller than that of the variance given by var of PPCA. I went through the code, and indeed, with PPCA it doesn't divide the variance or the V by the total number of observations as it does in the PCA implementation in this line .
I am not sure if this is intentional, but it certainly is confusing, so at the very least the docs would need to be updated.

Additionally, when doing PPCA with :em selected, one also doesn't get the same. I haven't yet fully figured out why.

Anyway, to easily reproduce the issue I am talking about, I made a MWE:

begin
     Random.seed!(1)
     n = 1000
     nr_d = 4 # number of variables
     nr_f = 2 # number of latent factors
     loadings_matrix = rand(Normal(0,1), nr_d, nr_f)
     latent_factor = rand(MvNormal([-1, 1],LinearAlgebra.I), n)'
     y = latent_factor*loadings_matrix' + rand(Normal(0, 0.1), n, nr_d)
     M_PCA = MultivariateStats.fit(PCA, y'; method=:svd)
     println(tresidualvar(M_PCA))
     M_PPCA = MultivariateStats.fit(PPCA, y', method=:ml, maxoutdim=2, maxiter=10000000)
     println(var(M_PPCA))
     M_PPCA_em = MultivariateStats.fit(PPCA, y', method=:em, maxoutdim=2, maxiter=10000000)
     println(var(M_PPCA_em))
 end
@wildart
Copy link
Collaborator

wildart commented Jun 22, 2021

Indeed, because data for :ml method wasn't scaled before SVD, the eigenvalue must be scaled afterward for getting correct variances. And that wasn't done. This line should be following V = abs2.(λ[ord]) ./ (n-1).

After that, the explained variance is reported correctly.

julia> M_PPCA2 = MultivariateStats.fit(PPCA, y', method=:ml, maxoutdim=2, maxiter=10000000)
Probabilistic PCA(indim = 4, outdim = 2, σ² = 0.0103621926397605)

@wildart wildart added the bug label Jun 22, 2021
wildart added a commit to wildart/MultivariateStats.jl that referenced this issue Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants