You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I don't know if this is really an issue, but I think it would be useful to give this problem some attention if someone else will encounter this, or if there's a solution.
The problem is that the Interval class has in its second slot, start, data which has to be the same length as the main data. Otherwise, things break. In the following example, after using fsubset() you cannot print the resulting tibble. However, when subsetting using either base::subset() or dplyr::filter(), the second start slot is correctly subsetted, and you can print the tibble normally (I don't normally expect attributes to be subsetted alongside the data, so I find this surprising).
This issue also happens when you subset using data.table syntax. However, they consider this more of a limitation than a bug. That is, data.table only supports atomic vector columns and list columns.
data("flights", package="nycflights13")
library(collapse)
#> collapse 1.6.5, see ?`collapse-package` or ?`collapse-documentation`#> Note: stats::D -> D.expression, D.call, D.name#> #> Attaching package: 'collapse'#> The following object is masked from 'package:stats':#> #> D
loadNamespace("tibble") # So its print method is used#> <environment: namespace:tibble>flights<- ftransform(flights, flight_dur=lubridate::as.interval(air_time, time_hour))
flights|> fsubset(carrier=="UA") ->sub_fsubsetflights|> subset(carrier=="UA") ->sub_subsetflights|>dplyr::filter(carrier=="UA") ->sub_filter
try(sub_fsubset)
#> Error: Internal error in `df_slice()`: Columns must match the data frame size.sub_subset#> # A tibble: 58,665 × 20#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time#> <int> <int> <int> <int> <int> <dbl> <int> <int>#> 1 2013 1 1 517 515 2 830 819#> 2 2013 1 1 533 529 4 850 830#> 3 2013 1 1 554 558 -4 740 728#> 4 2013 1 1 558 600 -2 924 917#> 5 2013 1 1 558 600 -2 923 937#> 6 2013 1 1 559 600 -1 854 902#> 7 2013 1 1 607 607 0 858 915#> 8 2013 1 1 611 600 11 945 931#> 9 2013 1 1 623 627 -4 933 932#> 10 2013 1 1 628 630 -2 1016 947#> # … with 58,655 more rows, and 12 more variables: arr_delay <dbl>,#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,#> # flight_dur <Interval>
identical(sub_subset, sub_filter) # base subset and dplyr's filter give indentical reuslts#> [1] TRUE
try(testthat::expect_identical(sub_fsubset, sub_subset)) # details the problem#> Error : `sub_fsubset` not identical to `sub_subset`.#> Component "flight_dur": Attributes: < Component "start": Numeric: lengths (336776, 58665) differ >sub_fsubset|> fselect(flight_dur) |> str()
#> tibble [58,665 × 1] (S3: tbl_df/tbl/data.frame)#> $ flight_dur:Formal class 'Interval' [package "lubridate"] with 3 slots#> .. ..@ .Data: num [1:58665] 227 227 150 345 361 337 157 366 229 366 ...#> .. ..@ start: POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...#> .. ..@ tzone: chr "America/New_York"sub_subset|> fselect(flight_dur) |> str()
#> tibble [58,665 × 1] (S3: tbl_df/tbl/data.frame)#> $ flight_dur:Formal class 'Interval' [package "lubridate"] with 3 slots#> .. ..@ .Data: num [1:58665] 227 227 150 345 361 337 157 366 229 366 ...#> .. ..@ start: POSIXct[1:58665], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...#> .. ..@ tzone: chr "America/New_York"
Thanks for this issue! But I also think there is no straightforward solution here. I presume that there is a method [.Interval defined in lubridate, which handles the subsetting of that class. It appears that base::subset and dplyr::filter call [ and thus invoke the right method, which however also means they are limited to base R speed. Both collapse and data.table do subsetting fully in C, and thus do not call any methods from R. I could of course write C code that calls subset methods from R if an S4 object is found as a column, but I'm not sure that would be worth the effort for a couple of cases as the column-wise checking would slow down the C code a bit, and you can already subset with methods from R using functions like base::subset or dplyr::filter, so the added value to me appears little compared to the cost.
I don't know if this is really an issue, but I think it would be useful to give this problem some attention if someone else will encounter this, or if there's a solution.
This is related to Rdatatable/data.table#4415
The problem is that the
Interval
class has in its second slot,start
, data which has to be the same length as the main data. Otherwise, things break. In the following example, after usingfsubset()
you cannot print the resulting tibble. However, when subsetting using eitherbase::subset()
ordplyr::filter()
, the secondstart
slot is correctly subsetted, and you can print the tibble normally (I don't normally expect attributes to be subsetted alongside the data, so I find this surprising).This issue also happens when you subset using
data.table
syntax. However, they consider this more of a limitation than a bug. That is,data.table
only supports atomic vector columns and list columns.Created on 2021-09-02 by the reprex package (v2.0.1)
The text was updated successfully, but these errors were encountered: