Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roworder() does not sort columns of Interval class from lubridate #418

Closed
jmrung opened this issue May 3, 2023 · 4 comments
Closed

roworder() does not sort columns of Interval class from lubridate #418

jmrung opened this issue May 3, 2023 · 4 comments

Comments

@jmrung
Copy link

jmrung commented May 3, 2023

It appears that roworder() does not sort columns of class Interval from lubridate. If roworder() is applied to a data frame that contains an Interval class column, all non-Interval columns will be sorted as specified, but the Interval class columns will remain in the order as in the originating data frame.

The same issue occurs with data.table's setorder(). A workaround is to use dplyr's arrange(), which correctly sorts all columns, regardless of class. Of course, arrange() can be much slower depending on the size of your data frame.

Another alternative is to convert the Interval class column(s) to character before using roworder(), but if the Interval class is needed for later operations, this might not be ideal.

Similar to another issue pertaining to lubridate Intervals (see #186), this may be more of a limitation than a bug. In the latter case, including a note in the documentation might be helpful and/or a warning if an Interval class column is detected in the data frame being sorted.

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.5
#> Warning: package 'ggplot2' was built under R version 4.0.5
#> Warning: package 'tibble' was built under R version 4.0.5
#> Warning: package 'tidyr' was built under R version 4.0.5
#> Warning: package 'readr' was built under R version 4.0.5
#> Warning: package 'purrr' was built under R version 4.0.5
#> Warning: package 'dplyr' was built under R version 4.0.5
#> Warning: package 'stringr' was built under R version 4.0.5
#> Warning: package 'forcats' was built under R version 4.0.5
#> Warning: package 'lubridate' was built under R version 4.0.5
library(lubridate)
library(collapse)
#> Warning: package 'collapse' was built under R version 4.0.5
#> collapse 1.9.3, see ?`collapse-package` or ?`collapse-documentation`
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:lubridate':
#> 
#>     is.Date
#> The following object is masked from 'package:stats':
#> 
#>     D

ID<-c("E", "B", "A", "C", "D")
Start<-c("3/1/2021",    "5/11/2018",    "6/12/2019",    "7/11/2018",    "3/3/2021")
End<-c("3/3/2021",  "5/13/2018",    "6/15/2019",    "7/16/2018",    "3/5/2021")

example<-
  tibble(ID, Start, End) %>%
  mutate(Start = mdy(Start),
         End = mdy(End),
         Interval = interval(Start, End)) 

#Data frame, displayed as created, out of order
example
#> # A tibble: 5 × 4
#>   ID    Start      End        Interval                      
#>   <chr> <date>     <date>     <Interval>                    
#> 1 E     2021-03-01 2021-03-03 2021-03-01 UTC--2021-03-03 UTC
#> 2 B     2018-05-11 2018-05-13 2018-05-11 UTC--2018-05-13 UTC
#> 3 A     2019-06-12 2019-06-15 2019-06-12 UTC--2019-06-15 UTC
#> 4 C     2018-07-11 2018-07-16 2018-07-11 UTC--2018-07-16 UTC
#> 5 D     2021-03-03 2021-03-05 2021-03-03 UTC--2021-03-05 UTC

#Data frame, ordered by ID, with "Interval" unaffected by sort
example %>%
  roworder(ID)
#> # A tibble: 5 × 4
#>   ID    Start      End        Interval                      
#>   <chr> <date>     <date>     <Interval>                    
#> 1 A     2019-06-12 2019-06-15 2021-03-01 UTC--2021-03-04 UTC
#> 2 B     2018-05-11 2018-05-13 2018-05-11 UTC--2018-05-13 UTC
#> 3 C     2018-07-11 2018-07-16 2019-06-12 UTC--2019-06-17 UTC
#> 4 D     2021-03-03 2021-03-05 2018-07-11 UTC--2018-07-13 UTC
#> 5 E     2021-03-01 2021-03-03 2021-03-03 UTC--2021-03-05 UTC

#Data frame, ordered by ID, with all columns correctly sorted
example %>%
  arrange(ID)
#> # A tibble: 5 × 4
#>   ID    Start      End        Interval                      
#>   <chr> <date>     <date>     <Interval>                    
#> 1 A     2019-06-12 2019-06-15 2019-06-12 UTC--2019-06-15 UTC
#> 2 B     2018-05-11 2018-05-13 2018-05-11 UTC--2018-05-13 UTC
#> 3 C     2018-07-11 2018-07-16 2018-07-11 UTC--2018-07-16 UTC
#> 4 D     2021-03-03 2021-03-05 2021-03-03 UTC--2021-03-05 UTC
#> 5 E     2021-03-01 2021-03-03 2021-03-01 UTC--2021-03-03 UTC

Created on 2023-05-03 with reprex v2.0.2

@SebKrantz
Copy link
Owner

SebKrantz commented May 3, 2023

Thanks, yeah roworder() is essentially based on a collapse-specific version of base::order(..., method = "radix"), which uses the same source as data.table::setorder(). The problem here is, if you look at unclass(example$Interval), that the object is comprised of a vector with a "starts" attribute of the same length, which is not taken into consideration in the C-based radixorder routine. In general, I don't think it makes sense to program a lot of exemptions for different classes at the C-level. Collapse supports basic R classes and data.table, tibble, xts, pseries, pdata.frame and, to some extent, sf. That for my purposes is complex enough. So for your case I think it indeed makes sense to call functions like arrange() which probably check at the R-level if there are "Interval" methods defined for functions like sort() or order().

I'll think about your suggestion of issuing a warning, but in general, I don't think it is a great idea to check for the existence of lots of different non-base-R classes at the C-level that may give different bevahior.

@NicChr
Copy link

NicChr commented May 4, 2023

Full disclaimer, I'm not a collapse package dev or anything and think it's best to go with SebKrantz's response, which I also agree with.

That being said, I've encountered similar issues and just wrote a custom GRP() method for lubridate Intervals:

GRP.Interval <- function(X, ...){
  X <- dplyr::tibble(!!"start" := lubridate::int_start(X),
                     !!"data" := lubridate::int_length(X))
  collapse::GRP(X, ...)
}

From there you can probably write your own function to do something similar for data frames.

@jmrung
Copy link
Author

jmrung commented May 5, 2023

Thanks so much for the thorough response, @SebKrantz. The explanation makes a lot of sense, and I understand the decision to stick with programming for the more common classes (and hesitancy regarding class-specific warnings).

Just noticed a typo near the suggestion in my initial post--should've said "in the former case" (not the latter), but I think that was understood! If nothing else, I'm hoping this post helps make the incompatibility a bit more visible. The issue can be easy to miss when working with wider data frames. If there's other pieces in the documentation that indicate the incompatibility, it wasn't as obvious to me, at least as someone newer to these more advanced/custom elements of objects in R.

Thank you as well, @NicChr, for sharing the code!

@SebKrantz
Copy link
Owner

Thanks @jmrung and @NicChr. I guess this all points towards creating a general vignette about how collapse handles R objects. I am planning to create a short vignette about that this year. I will close the issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants