Data dependent attributes in lubridate's S4 Interval class break after fsubset #186

eshom · 2021-09-02T10:49:49Z

I don't know if this is really an issue, but I think it would be useful to give this problem some attention if someone else will encounter this, or if there's a solution.

This is related to Rdatatable/data.table#4415

The problem is that the Interval class has in its second slot, start, data which has to be the same length as the main data. Otherwise, things break. In the following example, after using fsubset() you cannot print the resulting tibble. However, when subsetting using either base::subset() or dplyr::filter(), the second start slot is correctly subsetted, and you can print the tibble normally (I don't normally expect attributes to be subsetted alongside the data, so I find this surprising).

This issue also happens when you subset using data.table syntax. However, they consider this more of a limitation than a bug. That is, data.table only supports atomic vector columns and list columns.

data("flights", package = "nycflights13")

library(collapse)
#> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`
#> Note: stats::D  ->  D.expression, D.call, D.name
#> 
#> Attaching package: 'collapse'
#> The following object is masked from 'package:stats':
#> 
#>     D
loadNamespace("tibble") # So its print method is used
#> <environment: namespace:tibble>

flights <- ftransform(flights, flight_dur = lubridate::as.interval(air_time, time_hour))

flights |> fsubset(carrier == "UA") -> sub_fsubset
flights |> subset(carrier == "UA") -> sub_subset
flights |> dplyr::filter(carrier == "UA") -> sub_filter

try(sub_fsubset)
#> Error: Internal error in `df_slice()`: Columns must match the data frame size.
sub_subset
#> # A tibble: 58,665 × 20
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      554            558        -4      740            728
#>  4  2013     1     1      558            600        -2      924            917
#>  5  2013     1     1      558            600        -2      923            937
#>  6  2013     1     1      559            600        -1      854            902
#>  7  2013     1     1      607            607         0      858            915
#>  8  2013     1     1      611            600        11      945            931
#>  9  2013     1     1      623            627        -4      933            932
#> 10  2013     1     1      628            630        -2     1016            947
#> # … with 58,655 more rows, and 12 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   flight_dur <Interval>
identical(sub_subset, sub_filter) # base subset and dplyr's filter give indentical reuslts
#> [1] TRUE
try(testthat::expect_identical(sub_fsubset, sub_subset)) # details the problem
#> Error : `sub_fsubset` not identical to `sub_subset`.
#> Component "flight_dur": Attributes: < Component "start": Numeric: lengths (336776, 58665) differ >

sub_fsubset |> fselect(flight_dur) |> str()
#> tibble [58,665 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ flight_dur:Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num [1:58665] 227 227 150 345 361 337 157 366 229 366 ...
#>   .. ..@ start: POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#>   .. ..@ tzone: chr "America/New_York"
sub_subset |> fselect(flight_dur) |> str()
#> tibble [58,665 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ flight_dur:Formal class 'Interval' [package "lubridate"] with 3 slots
#>   .. ..@ .Data: num [1:58665] 227 227 150 345 361 337 157 366 229 366 ...
#>   .. ..@ start: POSIXct[1:58665], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
#>   .. ..@ tzone: chr "America/New_York"

^{Created on 2021-09-02 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

SebKrantz · 2021-09-02T17:49:27Z

Thanks for this issue! But I also think there is no straightforward solution here. I presume that there is a method [.Interval defined in lubridate, which handles the subsetting of that class. It appears that base::subset and dplyr::filter call [ and thus invoke the right method, which however also means they are limited to base R speed. Both collapse and data.table do subsetting fully in C, and thus do not call any methods from R. I could of course write C code that calls subset methods from R if an S4 object is found as a column, but I'm not sure that would be worth the effort for a couple of cases as the column-wise checking would slow down the C code a bit, and you can already subset with methods from R using functions like base::subset or dplyr::filter, so the added value to me appears little compared to the cost.

SebKrantz · 2021-12-29T21:53:05Z

I wont't solve this in collapse soon, so I'll close the issue for now.

SebKrantz closed this as completed Dec 29, 2021

jmrung mentioned this issue May 3, 2023

roworder() does not sort columns of Interval class from lubridate #418

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data dependent attributes in lubridate's S4 Interval class break after fsubset #186

Data dependent attributes in lubridate's S4 Interval class break after fsubset #186

eshom commented Sep 2, 2021

SebKrantz commented Sep 2, 2021 •

edited

Loading

SebKrantz commented Dec 29, 2021

Data dependent attributes in lubridate's S4 Interval class break after fsubset #186

Data dependent attributes in lubridate's S4 Interval class break after fsubset #186

Comments

eshom commented Sep 2, 2021

SebKrantz commented Sep 2, 2021 • edited Loading

SebKrantz commented Dec 29, 2021

SebKrantz commented Sep 2, 2021 •

edited

Loading