Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional show info for PCA #90

Merged
merged 13 commits into from
Mar 3, 2022
Merged

Additional show info for PCA #90

merged 13 commits into from
Mar 3, 2022

Conversation

Tokazama
Copy link
Contributor

@Tokazama Tokazama commented Apr 7, 2019

This arose from a discussion on discourse https://discourse.julialang.org/t/pca-output/22687/10.
Output is inspired by psych package from R. The pattern matrix included here is features by components. It would be more clear with row and column names but I didn't want to introduce new dependencies.

It's not polished by any means but I wanted to atleast get it started and get some feedback.

This arose from https://discourse.julialang.org/t/pca-output/22687/10.
Main contribution is pattern matrix with features x components.
This should probably be refined but not sure what to do without
DataFrames dependency.
@nalimilan
Copy link
Member

Thanks. Could you do a quick survey of what other packages/languages print? I suspect there are differences between disciplinary fields so better check that we don't choose an approach/terminology which is only familiar to psychologists.

Additional points:

  • It seems to me that we should provide a function to get the pattern matrix: otherwise it's frustrating for users when they cannot retrieve the printed information in a usable form. That will also allow defining what is meant exactly by "standardized loadings".
  • The same output should probably used for several similar techniques, like factor analysis.
  • The dump method we currently provide should be deprecated in favor of show as that's an abuse of dump AFAIK.
  • A way to print row and column names is to do something similar to show at Improve printing CoefTable StatsBase.jl#481.

@Tokazama
Copy link
Contributor Author

Tokazama commented Apr 8, 2019

Here's some broad coverage

I looked for Python output but I didn't find anything consistent, but I didn't look that hard because I've always found Python's support for stats to be underwhelming.

I don't think the proposed addition is unique to the field of psychology, but I think the minitab example provides a lot of additional output that we might consider.

@nalimilan
Copy link
Member

OK, thanks. Do all of these use the same definition of loadings (notably regarding standardization)?

Regarding row and column names, we had a discussion with @andreasnoack and an interesting approach would be to use wider types in method signatures so that NamedArrays and AxisArrays would be accepted as input data. Then NamedArrays would have to be improved so that its eigen method preserves names, like many other functions already do (e.g. *). Hopefully that should be enough so that your code automatically prints an array with names when a NamedArray is passed, without MultivariateStats depending on NamedArrays. But that will take some work, so maybe better start without names in this PR (unless you want to try the more ambitious fix of course).

@Tokazama
Copy link
Contributor Author

Tokazama commented Apr 8, 2019

I haven't read through all of the docs for each of these but I know that the exact method of obtaining loadings varies in the R functions when choosing different methods of rotation. I think the most sustainable way to manage this is to have an API that supports specification of different rotation methods. We don't need to implement anything other than what we have here and then other contributors would have a clear place to introduce more methods in the future.

I might be able to make some time in the future to add to other rotation methods but I'm basing a lot of this off what I can learn from R docs.

I'll probably just go without column/row labels for now and provide some documentation for what it means, if that sounds alright to you.

@nalimilan
Copy link
Member

Yes, we should have a way to choose the definition at some point. But here we just have to decide what to print by default. To make that decision I really think we need to have a rough idea of what other software prints, and how they call it.

@wildart
Copy link
Collaborator

wildart commented Apr 15, 2019

Why not to put it all as an dump function overloading?

@ararslan
Copy link
Member

dump shouldn't really be extended; it's intended to show the internal representation of things.

@Tokazama
Copy link
Contributor Author

Perhaps overloading describe would be better for verbose output.

@nalimilan
Copy link
Member

I'd prefer show to print the full output at the REPL by default: it's supposed to be multi-line. This is what we do when printing e.g. arrays or data frames. We have the two-argument show method to request a short representation where appropriate.

FWIW, R has both a semi-compact output by default and a somewhat longer summary method, and I find that really annoying: I always end up calling summary (e.g. for regression models).

@wildart
Copy link
Collaborator

wildart commented May 25, 2019

I'm not a fan of a default full output, but it looks like the only possible option to have separate short and full output comes with having full output by default in a console. So, let's put a longer output in show(io::IO, ::MIME"text/plain", M::PCA), and leave show(io::IO, M::PCA) without changes.

@nalimilan
Copy link
Member

Any updates?

@Tokazama
Copy link
Contributor Author

Tokazama commented Sep 4, 2019

Sorry, this got lost in some other stuff I was working on. I just pushed what I had sitting around and forgot about but I can look at putting the output in a CoefTable.

src/pca.jl Outdated
ldgs_signs[ldgs_signs .== 0] .= 1
ldgs = ldgs * diagm(0 => ldgs_signs[:])
print(io, "\n\nPattern matrix\n")
display(ldgs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if this came out with columns and rows labeled but without access to the variable names that's not possible. In the current state the columns are the components in the same order as those in the following coeftable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's annoying. For now maybe print a CoefTable object without row names. If you pass an empty vector as row names, numbers will automatically be generated. That way at least the names of PCs will be consistent with the table below.

Also, how about printing the second table first? That's the most important result I think.

@Tokazama
Copy link
Contributor Author

Tokazama commented Sep 4, 2019

One note that may be out of the scope of this PR, it would be nice if the returned PCA instance also specified what algorithm was used to estimate it.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

I wonder whether there should be a way for users to extract the printed information in an object. It's frustrating to get printed results, but have to redo the computations by hand. That's a bit of an orthogonal issue, but still. Maybe coeftable could return the pattern matrix? Or could we find another name for that operation, which would return something meaningful also for other algorithms?

@@ -3,7 +3,7 @@ module MultivariateStats
using StatsBase: SimpleCovariance, CovarianceEstimator
import Statistics: mean, var, cov, covm
import Base: length, size, show, dump
import StatsBase: fit, predict, ConvergenceException
import StatsBase: fit, predict, ConvergenceException, CoefTable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import is for when you overload a function with a new method. It shouldn't be needed here.

src/pca.jl Outdated Show resolved Hide resolved
src/pca.jl Outdated Show resolved Hide resolved
src/pca.jl Outdated Show resolved Hide resolved
src/pca.jl Outdated
ldgs_signs = sign.(sum(ldgs, dims=1))
ldgs_signs[ldgs_signs .== 0] .= 1
ldgs = ldgs * diagm(0 => ldgs_signs[:])
print(io, "\n\nPattern matrix\n")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(io, "\n\nPattern matrix\n")
print(io, "\n\nPattern matrix:\n")

(for consistency with below)

Also, better be specific about what this matrix is: psych says "standardized loadings".

src/pca.jl Outdated
ldgs_signs[ldgs_signs .== 0] .= 1
ldgs = ldgs * diagm(0 => ldgs_signs[:])
print(io, "\n\nPattern matrix\n")
display(ldgs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's annoying. For now maybe print a CoefTable object without row names. If you pass an empty vector as row names, numbers will automatically be generated. That way at least the names of PCs will be consistent with the table below.

Also, how about printing the second table first? That's the most important result I think.

src/pca.jl Show resolved Hide resolved
src/pca.jl Outdated
print(io, "Importance of components:\n")
print(io, CoefTable(vcat(principalvars(M)', (principalvars(M) ./ tvar(M))', (cumsum(principalvars(M) ./tvar(M)))'),
string.("PC", 1:length(principalvars(M))), # components in order
["Loadings", "Proportion explained", "Cumulative proportion"])) # row names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Loadings" is "SS loadings" from psych, right? Shouldn't that be specified? Also "proportion" should say of what (the variance). And how about adding "proportion of variance" and "cumulative variance" which are in base R and psych?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not "Loadings", they are "Eigenvalues" or "SS loadings" as R calls them.

Loadings would be

projection(M).*sqrt.(principalvars(M))'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wildart You're definitely right. I apologize for the oversight.

Tokazama and others added 4 commits September 5, 2019 11:06
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@Tokazama
Copy link
Contributor Author

Tokazama commented Sep 5, 2019

Would these make sense?

  • proportionvar = proportion of variance explained
  • cumproportionvar = cumulative proportion variance explained

Perhaps this is an API thing that needs to be resolved in #94 first.

@nalimilan
Copy link
Member

* `proportionvar` = proportion of variance explained

* `cumproportionvar` = cumulative proportion variance explained

I'm not sure these are the most needed, as they are relatively simple to compute using principalvars and tvar. The second one in particular is easy to recompute from the first one. Though adding proportionvar wouldn't be absurd. But I'm more worried about providing a function to get the pattern matrix, which is much harder to recompute by hand.

@wildart wildart added this to the v1.0.0 milestone Feb 25, 2021
@wildart
Copy link
Collaborator

wildart commented Feb 17, 2022

Now, it looks like this:

julia> X = RDatasets.dataset("datasets", "iris")[:,1:4] |> Matrix;

julia> M = fit(PCA, X', maxoutdim=3)
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)

Pattern matrix:
────────────────────────────────────
         PC1         PC2         PC3
────────────────────────────────────
1   0.743108   0.323446   -0.16277
2  -0.173801   0.359689    0.167212
3   1.76155   -0.0854062   0.0213202
4   0.736739  -0.0371832   0.152647
────────────────────────────────────

Importance of components:
─────────────────────────────────────────────────────────
                                PC1        PC2        PC3
─────────────────────────────────────────────────────────
SS Loadings (Eigenvalues)  4.22824   0.242671   0.0782095
Variance explained         0.924619  0.0530665  0.0171026
Cumulative variance        0.924619  0.977685   0.994788
Proportion explained       0.929463  0.0533445  0.0171922
Cumulative proportion      0.929463  0.982808   1.0
─────────────────────────────────────────────────────────

which is close to R psych version:

> pc <- principal(iris[1:4],3,rotate="none",cor="cov",covar=TRUE)
> print(pc, digits=6)
Principal Components Analysis
Call: principal(r = iris[1:4], nfactors = 3, rotate = "none", covar = TRUE, 
    cor = "cov")
Unstandardized loadings (pattern matrix) based upon covariance matrix
                   PC1       PC2       PC3       h2         u2       H2
Sepal.Length  0.743108  0.323446 -0.162770 0.683321 0.00237236 0.996540
Sepal.Width  -0.173801  0.359689  0.167212 0.187543 0.00243649 0.987175
Petal.Length  1.761545 -0.085406  0.021320 3.110790 0.00548792 0.998239
Petal.Width   0.736739 -0.037183  0.152647 0.567468 0.01353832 0.976698

                           PC1      PC2      PC3
SS loadings           4.228242 0.242671 0.078210
Proportion Var        0.924619 0.053066 0.017103
Cumulative Var        0.924619 0.977685 0.994788
Proportion Explained  0.929463 0.053345 0.017192
Cumulative Proportion 0.929463 0.982808 1.000000

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

src/pca.jl Outdated Show resolved Hide resolved
src/pca.jl Outdated Show resolved Hide resolved
Tokazama and others added 2 commits February 17, 2022 08:54
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan nalimilan requested a review from wildart March 3, 2022 09:07
Copy link
Collaborator

@wildart wildart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wildart wildart merged commit 48b2b50 into JuliaStats:master Mar 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants