Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tidy.prcomp labels for loadings and scores matrices #910

Merged
merged 5 commits into from Sep 4, 2020
Merged

Fix tidy.prcomp labels for loadings and scores matrices #910

merged 5 commits into from Sep 4, 2020

Conversation

tavareshugo
Copy link
Contributor

@tavareshugo tavareshugo commented Aug 12, 2020

Hi,

I think there was a bug in tidy.prcomp() related to this commit, which moved code from using dplyr::gather() to dplyr::pivot_longer(). These two functions sort the output tibble in different ways, and so the labels being added with rep() were wrong when using tidy(x, matrix = "scores") and tidy(x, matrix = "loadings").

Here's an example (using broom v0.7.0):

library(magrittr)
set.seed(1)
m <- matrix(rnorm(8), ncol = 4)
rownames(m) <- paste0("row", 1:nrow(m))
colnames(m) <- paste0("col", 1:ncol(m))

x <- prcomp(m)

broom::tidy(x, "scores") %>% dplyr::arrange(PC)

Results in

# A tibble: 4 x 3
  row      PC     value
  <chr> <dbl>     <dbl>
1 row1      1 -1.41e+ 0
2 row1      1  1.41e+ 0
3 row2      2  0.      
4 row2      2  2.08e-17

The result we want is rather:

  row      PC     value
  <chr> <dbl>     <dbl>
1 row1      1 -1.41e+ 0
2 row2      1  1.41e+ 0
3 row1      2  0.      
4 row2      2  2.08e-17

I guess I could have just changed rep() to use each = ncomp rather than times = ncomp.
However, I think my fix is perhaps a bit simpler to read and more robust to future changes in how pivot_longer() sorts output. I took advantage of tibble::as_tibble() ability to add rownames as columns, and this works even if rownames(x) is NULL.

I hope the PR is helpful. (And thanks for the amazing package!)

R/stats-prcomp-tidiers.R Outdated Show resolved Hide resolved
Copy link
Contributor Author

@tavareshugo tavareshugo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency of returning integer when there are no rownames in the matrices, this could be added in.

R/stats-prcomp-tidiers.R Show resolved Hide resolved
R/stats-prcomp-tidiers.R Show resolved Hide resolved
@simonpcouch
Copy link
Collaborator

Thanks for pointing this out @tavareshugo!

Should I consider this PR ready to review? How should I interpret your review comments/suggestions?

@tavareshugo
Copy link
Contributor Author

Hi @simonpcouch, thanks for having a look at this. I've committed my previous suggestions (i.e. for consistency with previous versions it returns labels as integer if rownames of those matrices are missing). I've also made a new commit adding my details to the contributors and a note to NEWS.md. So, I would say that this is ready for review now, thanks.

@simonpcouch
Copy link
Collaborator

Thanks for your patience with this, @tavareshugo. This code looks great to me, so I'll go ahead and resolve conflicts and merge! Much appreciated.

@simonpcouch simonpcouch merged commit 7f29de5 into tidymodels:master Sep 4, 2020
@tavareshugo
Copy link
Contributor Author

thanks a lot @simonpcouch, I'm glad I could help a little bit with the package

@github-actions
Copy link

github-actions bot commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants