Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some values of percent_pattern cause percentages in "Total" row or column to report 100% #10

Closed
philipmgrant opened this issue Mar 8, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@philipmgrant
Copy link

Describe the bug
Some choices of percent_pattern cause the percentages in the "Total" row or column to report as 100%. (The examples below are the best way to illustrate.)

Reproducible example

  1. Default and expected behaviour. The percentages in the rightmost column show the total of each row as a percentage of the grand total; the percentages in the bottom row show the total of each column as a percentage of the grand total.
> crosstable(mtcars, gear, by=cyl, total="both")
# A tibble: 4 x 7
  .id   label variable `4`         `6`        `8`         Total       
  <chr> <chr> <chr>    <chr>       <chr>      <chr>       <chr>       
1 gear  gear  3        1 (6.67%)   2 (13.33%) 12 (80.00%) 15 (46.88%) 
2 gear  gear  4        8 (66.67%)  4 (33.33%) 0 (0%)      12 (37.50%) 
3 gear  gear  5        2 (40.00%)  1 (20.00%) 2 (40.00%)  5 (15.62%)  
4 gear  gear  Total    11 (34.38%) 7 (21.88%) 14 (43.75%) 32 (100.00%)
  1. Using percent_pattern with a value that simply reformats the percentage in the non-total cells. The cells in the total row/column don't change their format accordingly, but they still report the right numerical percentage.
> crosstable(mtcars, gear, by=cyl, total="both", percent_pattern = "{n} ~~~ {p_row}")
# A tibble: 4 x 7
  .id   label variable `4`          `6`          `8`           Total       
  <chr> <chr> <chr>    <chr>        <chr>        <chr>         <chr>       
1 gear  gear  3        1 ~~~ 6.67%  2 ~~~ 13.33% 12 ~~~ 80.00% 15 (46.88%) 
2 gear  gear  4        8 ~~~ 66.67% 4 ~~~ 33.33% 0 ~~~ 0%      12 (37.50%) 
3 gear  gear  5        2 ~~~ 40.00% 1 ~~~ 20.00% 2 ~~~ 40.00%  5 (15.62%)  
4 gear  gear  Total    11 (34.38%)  7 (21.88%)   14 (43.75%)   32 (100.00%)
  1. Using percent_pattern with a value that applies a formula to p_row. This breaks the percentage calculation in the total row and column, so that the cells in the total column all show 100% (formatted according to the pattern), while the cells in the total row show no percentage.
> crosstable(mtcars, gear, by=cyl, total="both", percent_pattern = "{n} ~~~ {str_replace(p_row, '%', 'pct')}")
# A tibble: 4 x 7
  .id   label variable `4`            `6`            `8`             Total           
  <chr> <chr> <chr>    <chr>          <chr>          <chr>           <chr>           
1 gear  gear  3        1 ~~~ 6.67pct  2 ~~~ 13.33pct 12 ~~~ 80.00pct 15 ~~~ 100.00pct
2 gear  gear  4        8 ~~~ 66.67pct 4 ~~~ 33.33pct 0 ~~~ 0pct      12 ~~~ 100.00pct
3 gear  gear  5        2 ~~~ 40.00pct 1 ~~~ 20.00pct 2 ~~~ 40.00pct  5 ~~~ 100.00pct 
4 gear  gear  Total    11             7              14              32  
  1. Wrapping that formula in paste fixes the calculation and gives the same behaviour as (1):
> crosstable(mtcars, gear, by=cyl, total="both", percent_pattern = "{n} ~~~ {paste(str_replace(p_row, '%', 'pct'))}")
# A tibble: 4 x 7
  .id   label variable `4`            `6`            `8`             Total       
  <chr> <chr> <chr>    <chr>          <chr>          <chr>           <chr>       
1 gear  gear  3        1 ~~~ 6.67pct  2 ~~~ 13.33pct 12 ~~~ 80.00pct 15 (46.88%) 
2 gear  gear  4        8 ~~~ 66.67pct 4 ~~~ 33.33pct 0 ~~~ 0pct      12 (37.50%) 
3 gear  gear  5        2 ~~~ 40.00pct 1 ~~~ 20.00pct 2 ~~~ 40.00pct  5 (15.62%)  
4 gear  gear  Total    11 (34.38%)    7 (21.88%)     14 (43.75%)     32 (100.00%)
  1. But wrapping it in identity instead gives the same issue as shown in (3):
> crosstable(mtcars, gear, by=cyl, total="both", percent_pattern = "{n} ~~~ {identity(str_replace(p_row, '%', 'pct'))}")
# A tibble: 4 x 7
  .id   label variable `4`            `6`            `8`             Total           
  <chr> <chr> <chr>    <chr>          <chr>          <chr>           <chr>           
1 gear  gear  3        1 ~~~ 6.67pct  2 ~~~ 13.33pct 12 ~~~ 80.00pct 15 ~~~ 100.00pct
2 gear  gear  4        8 ~~~ 66.67pct 4 ~~~ 33.33pct 0 ~~~ 0pct      12 ~~~ 100.00pct
3 gear  gear  5        2 ~~~ 40.00pct 1 ~~~ 20.00pct 2 ~~~ 40.00pct  5 ~~~ 100.00pct 
4 gear  gear  Total    11             7              14              32   

This seems to be due to the code around lines 126-139 of cross_categorical.R, which does this...

        any_p = pattern_vars %>% str_starts("p") %>% any()

...and then uses that to decide whether to do a margin calculation. The dependency on having a glue variable that starts with the letter "p" would explain why using "paste" (as in (4)) is an effective workaround, but using "identity" (as in (5)) is not.

Session info

``` r R version 4.1.2 (2021-11-01) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] crosstable_0.4.1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.8 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 ggplot2_3.3.5
[10] tidyverse_1.3.1

loaded via a namespace (and not attached):
[1] xfun_0.29 tidyselect_1.1.2 haven_2.4.3 colorspace_2.0-3 vctrs_0.3.8 generics_0.1.2 htmltools_0.5.2 base64enc_0.1-3 utf8_1.2.2
[10] rlang_1.0.1 pillar_1.7.0 glue_1.6.2 withr_2.4.3 DBI_1.1.2 gdtools_0.2.4 dbplyr_2.1.1 uuid_1.0-3 modelr_0.1.8
[19] readxl_1.3.1 lifecycle_1.0.1 munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 zip_2.2.0 rvest_1.0.2 evaluate_0.15 knitr_1.37
[28] fastmap_1.1.0 tzdb_0.2.0 fansi_1.0.2 broom_0.7.12 Rcpp_1.0.8 renv_0.15.2 scales_1.1.1 backports_1.4.1 checkmate_2.0.0
[37] jsonlite_1.8.0 systemfonts_1.0.4 fs_1.5.2 digest_0.6.29 hms_1.1.1 stringi_1.7.6 grid_4.1.2 cli_3.2.0 tools_4.1.2
[46] magrittr_2.0.2 crayon_1.5.0 pkgconfig_2.0.3 ellipsis_0.3.2 data.table_1.14.2 xml2_1.3.3 reprex_2.0.1 lubridate_1.8.0 rmarkdown_2.11
[55] officer_0.4.1 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.13 flextable_0.6.10 R6_2.5.1 compiler_4.1.2

</details>
@philipmgrant philipmgrant added the bug Something isn't working label Mar 8, 2022
@DanChaltiel
Copy link
Owner

This is quite a nice bug you found.
I really like what you are doing this with percent_pattern, never thought about that.

It seems that replacing the 3 occurrences in cross_categorical.R with the following will solve the bug:

any_p = pattern_vars %>% str_detect("p_row|p_col") %>% any()

Could you test that it fits all your cases? I just tested on the cases you gave but there might be more to it.
You might want to use this to debug the function if you don't want to fork the whole project.

That being said, this makes me think that p_row and p_col should not have the % inside but rather outside, so that the default percent_pattern should be "{n} ({p_row}%)".

This would make more sense, as you could write percent_pattern = "{n} ~~~ {p_row}pct"), but it would be a minor breaking change for everyone who already specify percent_pattern, with no chance of throwing a warning... Or this could also be yet another option, something like crosstable_option(percent_char="pct").

@DanChaltiel
Copy link
Owner

Hi Philip,
This bug should now be fixed in the dev version so you don't have to debug the function yourself.
You can install it using devtools::install_github("DanChaltiel/crosstable").
If it still doesn't work, please post the problem here and I will reopen the issue.

@philipmgrant
Copy link
Author

Hi Philip, This bug should now be fixed in the dev version so you don't have to debug the function yourself. You can install it using devtools::install_github("DanChaltiel/crosstable"). If it still doesn't work, please post the problem here and I will reopen the issue.

Thank you! It fixes the issue I identified - however, testing some different cases, I think there may be a regression in the dev version.

If you do this...

crosstable(mtcars, gear, total="both")

In the released version, the result is correct:

# A tibble: 4 x 4
  .id   label variable value       
  <chr> <chr> <chr>    <chr>       
1 gear  gear  3        15 (46.88%) 
2 gear  gear  4        12 (37.50%) 
3 gear  gear  5        5 (15.62%)  
4 gear  gear  Total    32 (100.00%)

But in dev, all the percentages are 100%:

# A tibble: 4 x 4
  .id   label variable value       
  <chr> <chr> <chr>    <chr>       
1 gear  gear  3        15 (100.00%)
2 gear  gear  4        12 (100.00%)
3 gear  gear  5        5 (100.00%) 
4 gear  gear  Total    32 (100.00%)

This might be due to an unrelated change in dev - since you get the issue even without specifying percent_pattern.

@DanChaltiel DanChaltiel reopened this May 12, 2022
@DanChaltiel
Copy link
Owner

I finally found some time to work on this, it should be solved now (in the dev).
I should release v0.5.0 to CRAN this summer.
Thanks for your input, Philip!

@philipmgrant
Copy link
Author

I finally found some time to work on this, it should be solved now (in the dev). I should release v0.5.0 to CRAN this summer. Thanks for your input, Philip!

The dev version looks good to me now too. And thanks for a cool package!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants