Additional `show` info for PCA #90

Tokazama · 2019-04-07T18:45:28Z

This arose from a discussion on discourse https://discourse.julialang.org/t/pca-output/22687/10.
Output is inspired by psych package from R. The pattern matrix included here is features by components. It would be more clear with row and column names but I didn't want to introduce new dependencies.

It's not polished by any means but I wanted to atleast get it started and get some feedback.

This arose from https://discourse.julialang.org/t/pca-output/22687/10. Main contribution is pattern matrix with features x components. This should probably be refined but not sure what to do without DataFrames dependency.

nalimilan · 2019-04-07T20:16:41Z

Thanks. Could you do a quick survey of what other packages/languages print? I suspect there are differences between disciplinary fields so better check that we don't choose an approach/terminology which is only familiar to psychologists.

Additional points:

It seems to me that we should provide a function to get the pattern matrix: otherwise it's frustrating for users when they cannot retrieve the printed information in a usable form. That will also allow defining what is meant exactly by "standardized loadings".
The same output should probably used for several similar techniques, like factor analysis.
The dump method we currently provide should be deprecated in favor of show as that's an abuse of dump AFAIK.
A way to print row and column names is to do something similar to show at Improve printing CoefTable StatsBase.jl#481.

Tokazama · 2019-04-08T09:54:28Z

Here's some broad coverage

I looked for Python output but I didn't find anything consistent, but I didn't look that hard because I've always found Python's support for stats to be underwhelming.

I don't think the proposed addition is unique to the field of psychology, but I think the minitab example provides a lot of additional output that we might consider.

nalimilan · 2019-04-08T11:29:57Z

OK, thanks. Do all of these use the same definition of loadings (notably regarding standardization)?

Regarding row and column names, we had a discussion with @andreasnoack and an interesting approach would be to use wider types in method signatures so that NamedArrays and AxisArrays would be accepted as input data. Then NamedArrays would have to be improved so that its eigen method preserves names, like many other functions already do (e.g. *). Hopefully that should be enough so that your code automatically prints an array with names when a NamedArray is passed, without MultivariateStats depending on NamedArrays. But that will take some work, so maybe better start without names in this PR (unless you want to try the more ambitious fix of course).

Tokazama · 2019-04-08T14:50:55Z

I haven't read through all of the docs for each of these but I know that the exact method of obtaining loadings varies in the R functions when choosing different methods of rotation. I think the most sustainable way to manage this is to have an API that supports specification of different rotation methods. We don't need to implement anything other than what we have here and then other contributors would have a clear place to introduce more methods in the future.

I might be able to make some time in the future to add to other rotation methods but I'm basing a lot of this off what I can learn from R docs.

I'll probably just go without column/row labels for now and provide some documentation for what it means, if that sounds alright to you.

nalimilan · 2019-04-08T15:08:15Z

Yes, we should have a way to choose the definition at some point. But here we just have to decide what to print by default. To make that decision I really think we need to have a rough idea of what other software prints, and how they call it.

wildart · 2019-04-15T20:14:46Z

Why not to put it all as an dump function overloading?

ararslan · 2019-04-15T20:20:50Z

dump shouldn't really be extended; it's intended to show the internal representation of things.

Tokazama · 2019-04-16T08:44:42Z

Perhaps overloading describe would be better for verbose output.

nalimilan · 2019-04-16T11:48:27Z

I'd prefer show to print the full output at the REPL by default: it's supposed to be multi-line. This is what we do when printing e.g. arrays or data frames. We have the two-argument show method to request a short representation where appropriate.

FWIW, R has both a semi-compact output by default and a somewhat longer summary method, and I find that really annoying: I always end up calling summary (e.g. for regression models).

wildart · 2019-05-25T05:30:24Z

I'm not a fan of a default full output, but it looks like the only possible option to have separate short and full output comes with having full output by default in a console. So, let's put a longer output in show(io::IO, ::MIME"text/plain", M::PCA), and leave show(io::IO, M::PCA) without changes.

nalimilan · 2019-09-04T09:27:45Z

Any updates?

Tokazama · 2019-09-04T10:50:59Z

Sorry, this got lost in some other stuff I was working on. I just pushed what I had sitting around and forgot about but I can look at putting the output in a CoefTable.

Tokazama · 2019-09-04T19:43:31Z

src/pca.jl

+    ldgs_signs[ldgs_signs .== 0] .= 1
+    ldgs = ldgs * diagm(0 => ldgs_signs[:])
+    print(io, "\n\nPattern matrix\n")
+    display(ldgs)


It would be nice if this came out with columns and rows labeled but without access to the variable names that's not possible. In the current state the columns are the components in the same order as those in the following coeftable.

Yes, that's annoying. For now maybe print a CoefTable object without row names. If you pass an empty vector as row names, numbers will automatically be generated. That way at least the names of PCs will be consistent with the table below.

Also, how about printing the second table first? That's the most important result I think.

Tokazama · 2019-09-04T19:45:50Z

One note that may be out of the scope of this PR, it would be nice if the returned PCA instance also specified what algorithm was used to estimate it.

nalimilan

Thanks.

I wonder whether there should be a way for users to extract the printed information in an object. It's frustrating to get printed results, but have to redo the computations by hand. That's a bit of an orthogonal issue, but still. Maybe coeftable could return the pattern matrix? Or could we find another name for that operation, which would return something meaningful also for other algorithms?

nalimilan · 2019-09-05T11:56:24Z

src/MultivariateStats.jl

@@ -3,7 +3,7 @@ module MultivariateStats
    using StatsBase: SimpleCovariance, CovarianceEstimator
    import Statistics: mean, var, cov, covm
    import Base: length, size, show, dump
-    import StatsBase: fit, predict, ConvergenceException
+    import StatsBase: fit, predict, ConvergenceException, CoefTable


import is for when you overload a function with a new method. It shouldn't be needed here.

src/pca.jl

nalimilan · 2019-09-05T12:08:37Z

src/pca.jl

+    ldgs_signs = sign.(sum(ldgs, dims=1))
+    ldgs_signs[ldgs_signs .== 0] .= 1
+    ldgs = ldgs * diagm(0 => ldgs_signs[:])
+    print(io, "\n\nPattern matrix\n")


Suggested change

print(io, "\n\nPattern matrix\n")

print(io, "\n\nPattern matrix:\n")

(for consistency with below)

Also, better be specific about what this matrix is: psych says "standardized loadings".

nalimilan · 2019-09-05T12:14:43Z

src/pca.jl

+    ldgs_signs[ldgs_signs .== 0] .= 1
+    ldgs = ldgs * diagm(0 => ldgs_signs[:])
+    print(io, "\n\nPattern matrix\n")
+    display(ldgs)


Yes, that's annoying. For now maybe print a CoefTable object without row names. If you pass an empty vector as row names, numbers will automatically be generated. That way at least the names of PCs will be consistent with the table below.

Also, how about printing the second table first? That's the most important result I think.

src/pca.jl

nalimilan · 2019-09-05T12:21:28Z

src/pca.jl

+    print(io, "Importance of components:\n")
+    print(io, CoefTable(vcat(principalvars(M)', (principalvars(M) ./ tvar(M))', (cumsum(principalvars(M) ./tvar(M)))'),
+                        string.("PC", 1:length(principalvars(M))),                      # components in order
+                        ["Loadings", "Proportion explained", "Cumulative proportion"])) # row names


"Loadings" is "SS loadings" from psych, right? Shouldn't that be specified? Also "proportion" should say of what (the variance). And how about adding "proportion of variance" and "cumulative variance" which are in base R and psych?

They are not "Loadings", they are "Eigenvalues" or "SS loadings" as R calls them.

Loadings would be

projection(M).*sqrt.(principalvars(M))'

@wildart You're definitely right. I apologize for the oversight.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Tokazama · 2019-09-05T15:55:48Z

Would these make sense?

proportionvar = proportion of variance explained
cumproportionvar = cumulative proportion variance explained

Perhaps this is an API thing that needs to be resolved in #94 first.

nalimilan · 2019-09-09T12:59:26Z

* `proportionvar` = proportion of variance explained

* `cumproportionvar` = cumulative proportion variance explained

I'm not sure these are the most needed, as they are relatively simple to compute using principalvars and tvar. The second one in particular is easy to recompute from the first one. Though adding proportionvar wouldn't be absurd. But I'm more worried about providing a function to get the pattern matrix, which is much harder to recompute by hand.

wildart · 2022-02-17T02:51:35Z

Now, it looks like this:

julia> X = RDatasets.dataset("datasets", "iris")[:,1:4] |> Matrix;

julia> M = fit(PCA, X', maxoutdim=3)
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)

Pattern matrix:
────────────────────────────────────
         PC1         PC2         PC3
────────────────────────────────────
1   0.743108   0.323446   -0.16277
2  -0.173801   0.359689    0.167212
3   1.76155   -0.0854062   0.0213202
4   0.736739  -0.0371832   0.152647
────────────────────────────────────

Importance of components:
─────────────────────────────────────────────────────────
                                PC1        PC2        PC3
─────────────────────────────────────────────────────────
SS Loadings (Eigenvalues)  4.22824   0.242671   0.0782095
Variance explained         0.924619  0.0530665  0.0171026
Cumulative variance        0.924619  0.977685   0.994788
Proportion explained       0.929463  0.0533445  0.0171922
Cumulative proportion      0.929463  0.982808   1.0
─────────────────────────────────────────────────────────

which is close to R psych version:

> pc <- principal(iris[1:4],3,rotate="none",cor="cov",covar=TRUE)
> print(pc, digits=6)
Principal Components Analysis
Call: principal(r = iris[1:4], nfactors = 3, rotate = "none", covar = TRUE, 
    cor = "cov")
Unstandardized loadings (pattern matrix) based upon covariance matrix
                   PC1       PC2       PC3       h2         u2       H2
Sepal.Length  0.743108  0.323446 -0.162770 0.683321 0.00237236 0.996540
Sepal.Width  -0.173801  0.359689  0.167212 0.187543 0.00243649 0.987175
Petal.Length  1.761545 -0.085406  0.021320 3.110790 0.00548792 0.998239
Petal.Width   0.736739 -0.037183  0.152647 0.567468 0.01353832 0.976698

                           PC1      PC2      PC3
SS loadings           4.228242 0.242671 0.078210
Proportion Var        0.924619 0.053066 0.017103
Cumulative Var        0.924619 0.977685 0.994788
Proportion Explained  0.929463 0.053345 0.017192
Cumulative Proportion 0.929463 0.982808 1.000000

nalimilan

Cool!

src/pca.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

wildart

LGTM

Additional show info for PCA

923ca2f

This arose from https://discourse.julialang.org/t/pca-output/22687/10. Main contribution is pattern matrix with features x components. This should probably be refined but not sure what to do without DataFrames dependency.

Add show method

5c9b6dd

Zachary Christensen added 2 commits September 4, 2019 14:53

Added CoefTable for pretty printing.

6540e8f

Fixed row orientation

78c24ae

Tokazama commented Sep 4, 2019

View reviewed changes

nalimilan reviewed Sep 5, 2019

View reviewed changes

Tokazama and others added 4 commits September 5, 2019 11:06

Update src/pca.jl

d313dd4

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/pca.jl

2cb0251

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/pca.jl

e31458a

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/pca.jl

07331d6

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

wildart added this to the v1.0.0 milestone Feb 25, 2021

wildart added 3 commits October 28, 2021 18:52

Merge branch 'master' into master

03bdfd4

Updated output tables

63baa5a

Merge branch 'master' into master

16c9b75

nalimilan reviewed Feb 17, 2022

View reviewed changes

src/pca.jl Outdated Show resolved Hide resolved

src/pca.jl Outdated Show resolved Hide resolved

Tokazama and others added 2 commits February 17, 2022 08:54

Update src/pca.jl

15f96d4

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/pca.jl

6807a4a

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

nalimilan approved these changes Mar 3, 2022

View reviewed changes

nalimilan requested a review from wildart March 3, 2022 09:07

wildart reviewed Mar 3, 2022

View reviewed changes

wildart merged commit 48b2b50 into JuliaStats:master Mar 3, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Additional `show` info for PCA #90

Additional `show` info for PCA #90

Tokazama commented Apr 7, 2019

nalimilan commented Apr 7, 2019

Tokazama commented Apr 8, 2019 •

edited

Loading

nalimilan commented Apr 8, 2019

Tokazama commented Apr 8, 2019

nalimilan commented Apr 8, 2019

wildart commented Apr 15, 2019

ararslan commented Apr 15, 2019

Tokazama commented Apr 16, 2019

nalimilan commented Apr 16, 2019

wildart commented May 25, 2019 •

edited

Loading

nalimilan commented Sep 4, 2019

Tokazama commented Sep 4, 2019

Tokazama Sep 4, 2019

nalimilan Sep 5, 2019

Tokazama commented Sep 4, 2019

nalimilan left a comment

nalimilan Sep 5, 2019

nalimilan Sep 5, 2019

nalimilan Sep 5, 2019

nalimilan Sep 5, 2019

wildart Sep 5, 2019

Tokazama Sep 5, 2019

Tokazama commented Sep 5, 2019

nalimilan commented Sep 9, 2019

wildart commented Feb 17, 2022

nalimilan left a comment

wildart left a comment

	print(io, "\n\nPattern matrix\n")
	print(io, "\n\nPattern matrix:\n")

Additional show info for PCA #90

Additional show info for PCA #90

Conversation

Tokazama commented Apr 7, 2019

nalimilan commented Apr 7, 2019

Tokazama commented Apr 8, 2019 • edited Loading

nalimilan commented Apr 8, 2019

Tokazama commented Apr 8, 2019

nalimilan commented Apr 8, 2019

wildart commented Apr 15, 2019

ararslan commented Apr 15, 2019

Tokazama commented Apr 16, 2019

nalimilan commented Apr 16, 2019

wildart commented May 25, 2019 • edited Loading

nalimilan commented Sep 4, 2019

Tokazama commented Sep 4, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Tokazama commented Sep 4, 2019

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Tokazama commented Sep 5, 2019

nalimilan commented Sep 9, 2019

wildart commented Feb 17, 2022

nalimilan left a comment

Choose a reason for hiding this comment

wildart left a comment

Choose a reason for hiding this comment

Additional `show` info for PCA #90

Additional `show` info for PCA #90

Tokazama commented Apr 8, 2019 •

edited

Loading

wildart commented May 25, 2019 •

edited

Loading