Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as.xts.data.table is limited to numeric data types, but this is not an xts limitation #5268

Closed
ethanbsmith opened this issue Nov 20, 2021 · 5 comments · Fixed by #5276
Closed
Milestone

Comments

@ethanbsmith
Copy link
Contributor

ethanbsmith commented Nov 20, 2021

xts allows columns to have any type, as long as the columns are all the same type (as its really a matrix under the hood)

for example, you can have character columns:
xts(x = cbind(letters, LETTERS), order.by = Sys.Date() + (1:26))

logicals:
xts(x = c(T,F), order.by = Sys.Date() + (1:2))

as.xts.data.table currently only allows conversion for numeric columns. it should follow the same rules as xts and support any type

@ethanbsmith
Copy link
Contributor Author

ethanbsmith commented Nov 20, 2021

i was going to work on a pr to essentially just do either:
xts::xts(as.matrix(x[, -1L]), order.by = if ("IDate" %chin% class(x[[1L]])) as.Date(x[[1L]]) else x[[1L]])
or
xts::xts(x[, -1L], order.by = if ("IDate" %chin% class(x[[1L]])) as.Date(x[[1L]]) else x[[1L]])

which i think is the proper way to handle the conversion.
However, this will be a breaking change, as some people may be relying on the existing numeric filtering. Any thoughts or suggestions?

@ben-schwen
Copy link
Member

Best option imho would be to establish type counts via table(vapply_1c(x[,-1L], mode)) and filter for the most occurring type. Maybe even establish some kind of tie-breaker with numeric > ...

@ethanbsmith
Copy link
Contributor Author

ethanbsmith commented Nov 21, 2021

wont mode based column filtering still be a potential breaking change? shouldn't the caller really be responsible for subsetting which columns to include

using to.matrix has a reliable, standard , expected outcome. it turns this function into handling just handling the index conversion between a data column and an attribute, which is a good piece of functionality

one option would be to add a numerics.only argument that would default to TRUE for backward compatibility, The default could be changed/removed on a later version

@ethanbsmith
Copy link
Contributor Author

ethanbsmith commented Nov 21, 2021

There is also a significant performance gain to using data.table awesome as.matrix implementation. some benchmarks on a ~20 million row OHLCV data table:

> str(z)
Classesdata.tableand 'data.frame':	19911744 obs. of  6 variables:
 $ index : POSIXct, format: "1962-01-02" "1962-01-03" "1962-01-04" "1962-01-05" ...
 $ Open  : num  0.423 0.422 0.424 0.419 0.411 ...
 $ High  : num  0.423 0.43 0.431 0.422 0.418 ...
 $ Low   : num  0.415 0.422 0.419 0.4 0.406 ...
 $ Close : num  0.415 0.424 0.419 0.411 0.412 ...
 $ Volume: num  704396 1420820 1822013 1760777 947213 ...
 - attr(*, ".internal.selfref")=<externalptr> 
> microbenchmark(as.xts(setDF(z[,-1L]), order.by = z[[1L]]),
+                xts(setDF(z[,-1L]), order.by = z[[1L]]),
+                xts(as.matrix(z[,-1L]), order.by = z[[1L]]), times = 5)
Unit: seconds
                                         expr       min        lq      mean    median        uq      max neval
  as.xts(setDF(z[, -1L]), order.by = z[[1L]]) 12.374819 16.449329 17.465670 16.517783 18.019016 23.96741     5
     xts(setDF(z[, -1L]), order.by = z[[1L]]) 11.876490 14.277099 17.339480 19.085982 19.989391 21.46844     5
 xts(as.matrix(z[, -1L]), order.by = z[[1L]])  4.237925  4.315352  6.064066  4.817026  5.817695 11.13233     5

@ethanbsmith
Copy link
Contributor Author

ethanbsmith commented Nov 21, 2021

to disambiguate the issues, this is what i see:

  1. handling mixed type tables. here i think as.matrix is the best option as it provides a standard, deterministic result
  2. handling backward compatibility. here i think a new parameter that optimizes a common use case is best
  3. where should the conversion to a matrix be done? here i think the optimized data.table::as.matrix is best vs. passing a data.frame to xts

@mattdowle mattdowle added this to the 1.14.3 milestone Dec 2, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants