Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data dependent attributes in lubridate's S4 Interval class break after fsubset #186

Closed
eshom opened this issue Sep 2, 2021 · 2 comments
Closed

Comments

@eshom
Copy link

eshom commented Sep 2, 2021

I don't know if this is really an issue, but I think it would be useful to give this problem some attention if someone else will encounter this, or if there's a solution.

This is related to Rdatatable/data.table#4415

The problem is that the Interval class has in its second slot, start, data which has to be the same length as the main data. Otherwise, things break. In the following example, after using fsubset() you cannot print the resulting tibble. However, when subsetting using either base::subset() or dplyr::filter(), the second start slot is correctly subsetted, and you can print the tibble normally (I don't normally expect attributes to be subsetted alongside the data, so I find this surprising).

This issue also happens when you subset using data.table syntax. However, they consider this more of a limitation than a bug. That is, data.table only supports atomic vector columns and list columns.

data("flights", package = "nycflights13")

library(collapse)
#> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
#> Note: stats::D  ->  D.expression, D.call, D.name
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
loadNamespace("tibble") # So its print method is used
#> <environment: namespace:tibble>

flights <- ftransform(flights, flight_dur = lubridate::as.interval(air_time, time_hour))

flights |> fsubset(carrier == "UA") -> sub_fsubset
flights |> subset(carrier == "UA") -> sub_subset
flights |> dplyr::filter(carrier == "UA") -> sub_filter

try(sub_fsubset)
#> Error: Internal error in `df_slice()`: Columns must match the data frame size.
sub_subset
#> # A tibble: 58,665 × 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      554            558        -4      740            728
#>  4  2013     1     1      558            600        -2      924            917
#>  5  2013     1     1      558            600        -2      923            937
#>  6  2013     1     1      559            600        -1      854            902
#>  7  2013     1     1      607            607         0      858            915
#>  8  2013     1     1      611            600        11      945            931
#>  9  2013     1     1      623            627        -4      933            932
#> 10  2013     1     1      628            630        -2     1016            947
#> # … with 58,655 more rows, and 12 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   flight_dur <Interval>
identical(sub_subset, sub_filter) # base subset and dplyr's filter give indentical reuslts
#> [1] TRUE
try(testthat::expect_identical(sub_fsubset, sub_subset)) # details the problem
#> Error : `sub_fsubset` not identical to `sub_subset`.
#> Component "flight_dur": Attributes: < Component "start": Numeric: lengths (336776, 58665) differ >

sub_fsubset |> fselect(flight_dur) |> str()
#> tibble [58,665 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ flight_dur:Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num [1:58665] 227 227 150 345 361 337 157 366 229 366 ...
#>   .. ..@ start: POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#>   .. ..@ tzone: chr "America/New_York"
sub_subset |> fselect(flight_dur) |> str()
#> tibble [58,665 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ flight_dur:Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num [1:58665] 227 227 150 345 361 337 157 366 229 366 ...
#>   .. ..@ start: POSIXct[1:58665], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#>   .. ..@ tzone: chr "America/New_York"

Created on 2021-09-02 by the reprex package (v2.0.1)

@SebKrantz
Copy link
Owner

SebKrantz commented Sep 2, 2021

Thanks for this issue! But I also think there is no straightforward solution here. I presume that there is a method [.Interval defined in lubridate, which handles the subsetting of that class. It appears that base::subset and dplyr::filter call [ and thus invoke the right method, which however also means they are limited to base R speed. Both collapse and data.table do subsetting fully in C, and thus do not call any methods from R. I could of course write C code that calls subset methods from R if an S4 object is found as a column, but I'm not sure that would be worth the effort for a couple of cases as the column-wise checking would slow down the C code a bit, and you can already subset with methods from R using functions like base::subset or dplyr::filter, so the added value to me appears little compared to the cost.

@SebKrantz
Copy link
Owner

I wont't solve this in collapse soon, so I'll close the issue for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants