
Conversation


@MichaelChirico MichaelChirico commented Jan 13, 2019

Closes #2603

(and maybe #2503?)

Somewhat rough version intended to get the ball rolling on this -- open to design changes.

The design of use_merge = 'auto' is somewhat back-of-the-envelope, based on the mini-benchmark mentioned here: #2603 (comment)

A more ideal version would be to look at uniqueN(x)/length(x) as part of 'auto', something like:

if (use_merge == 'auto') {
  DT = data.table(x)
  lookup = unique(DT)
  if (nrow(lookup)/length(x) < some_threshold) {   # some_threshold: placeholder
    # convert the unique values only, then join back to the input in its original order
    lookup[ , IDate := as.IDate(x, ..., tz = tz)]
    lookup[DT, on = 'x']$IDate
  } else as.IDate(as.Date(x, ..., tz = tz))
}

but I am assuming (i.e. no benchmarking done on this) once we run unique, we may as well go through with the merge.

R/IDateTime.R Outdated
as.IDate.default <- function(x, ..., tz = attr(x, "tzone"), use_merge = 'auto') {
if (is.null(tz)) tz = "UTC"
as.IDate(as.Date(x, tz = tz, ...))
if (isTRUE(use_merge) || (use_merge == 'auto' && length(x) >= 1000)) {
Member

We could check the class of "x" and redirect to this branch only for types on which data.table can do the merge.

Member Author

Good point. Will lead to strange unexpected errors otherwise.

Do we keep a list of known merge types? If not we should...

Member

In setops.R I defined one, so you can look it up from there; or maybe put it into a helper function and reuse it.

Member Author

Yes, I was thinking something like can_merge = function(x) class(x) %chin% known_types, or just keeping known_types = function() c(...).
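For illustration, a rough sketch (the helper names and the exact type list are not final):

# sketch only -- names and the type whitelist are illustrative, not final
merge_types = function() c("logical", "integer", "double", "character", "factor", "Date", "IDate")
can_merge   = function(x) class(x)[1L] %chin% merge_types()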

Member

can_merge would need to handle multiple columns, so maybe it's simpler to just have merge_types.

Member Author

I guess this is what you're referring to:

bad_types = c("raw", "complex", if (block_list) "list")
  found = bad_types %chin% c(vapply(x, typeof, FUN.VALUE = ""),
                             vapply(y, typeof, FUN.VALUE = ""))

R/IDateTime.R Outdated
if (isTRUE(use_merge) || (use_merge == 'auto' && length(x) >= 1000)) {
DT = data.table(x)
# shut off use_merge to prevent recursion
unique(DT)[ , 'IDate' := as.IDate(x, tz = tz, ..., use_merge = FALSE)
Member

I would use "unique" before creating data.table

Member Author

Could you elaborate? Not quite following the reasoning... my thinking was unique.data.table would be more efficient.

Member

It will be more efficient than the data.frame method, but I'm not sure compared to working on the plain vector. Base R now uses data.table's ordering code (not the parallel version, but the previous one; still super fast). The fastest way might be to use forder(retGrp=TRUE) and then subset to the unique values.
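Roughly, with the internal forderv (untested sketch):

# untested sketch: one forderv call gives both the sort order and the group starts
o = forderv(data.table(x), "x", retGrp = TRUE)
s = attr(o, "starts")
ux = if (length(o)) x[o[s]] else x[s]   # unique values; length(o) == 0L means x was already sorted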

Member
@jangorecki jangorecki Jan 13, 2019

When using unique.data.table, note we have the internal funique function in setops.R. AFAIU it will avoid an extra copy of the input if the values are already unique.

Member Author

True. Yes, I think the forder approach should be best, since it'll obviate the need to do both unique and merge, if I'm not mistaken...

Member Author

OK, it's not immediately obvious to me how to do it with just forderv... here's what I came up with, and it's horribly inefficient:

    o = forderv(DT, retGrp = TRUE)
    s = attr(o, 'starts')   # first position of each distinct value within the ordered vector
    n = length(x)
    out = integer(n)
    lookup = as.IDate(as.Date(x[o][s], tz = tz, ...))   # convert each distinct value once
    for (ii in seq_along(s)) {
      end_idx = if (ii == length(s)) n else s[ii + 1L] - 1L
      # fill the run for group ii; out[o][...] <- copies the whole of out on every iteration
      out[o][s[ii]:end_idx] = lookup[ii]
    }
    setattr(out, 'class', c('IDate', 'Date'))

Member
@jangorecki jangorecki Feb 7, 2019

@MichaelChirico which part is inefficient? The loop? src/vecseq.c might be a good solution for that.
Other ideas:

  • x is reordered according to o and then subset according to s. Wouldn't it be better to first subset and then reorder?
cc(F)
x = c(2L,4L,1L,2L,3L,1L,4L)
o = forderv(data.table(x), "x", retGrp=TRUE)
s = attr(o, 'starts')
x[o][s]
x[o[s]]
  • Instead of as.IDate(as.Date(...)), it would be better to handle that in as.IDate only, if possible.

Contributor
@ColeMiller1 ColeMiller1 Sep 17, 2020

Here's an approach with forderv:

IDate_character = function (x) {
  o__ = data.table:::forderv(list(x), 1L, retGrp=TRUE)
  if (attr(o__, "maxgrpn") == 1L) { ## lookup will not help us - let's just make a Date
    ans = as.integer(as.Date(x))
  } else {
    f__ = attr(o__, 'starts')
    len__ = data.table:::uniqlengths(f__, length(x))
    
    if (!length(o__)) {
      ans = rep.int(as.integer(as.Date(x[f__])), len__)
    } else {
      ans = integer(length(x))
      ans[o__] = rep.int(as.integer(as.Date(x[o__[f__]])), len__)
    }
  }
  class(ans) = c("IDate", "Date")
  return(ans)
}

Initial testing is promising - more performant than a lookup self-join and results seem to align with as.IDate. But I will admit it does not look as clean as the join method.

Edit: I forgot to add that this would give us uniqueN for free with length(f__).
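That is, standalone (sketch, reusing the same internal calls as above):

# sketch: the 'starts' attribute has one entry per distinct value
o__ = data.table:::forderv(list(x), 1L, retGrp = TRUE)
f__ = attr(o__, "starts")
length(f__)   # same value as uniqueN(x), with no extra pass over the data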

Member Author

Great work! For our internal code I think this sort of usage of forderv is common enough that we don't need to worry too much about "cleanliness", as there are a few other examples too.

A benchmark would be nice; in any case feel free to update/merge directly into this branch to be sure you're getting the commit credit 😃

Contributor

There are 3 options we are discussing.

  1. Make a lookup table based on unique dates
  2. Use an existing lookup table that would invisibly take up memory (but be fast)
  3. Use forderv directly

Additionally, there are two flavors, unkeyed and keyed. For the most part, option 1 is the fastest, although once we get to 1e8 rows the forderv option surprisingly starts to outpace it. Note that option 2 (the one proposed in the FR) is faster than my proposal for keyed input at 1e5 rows.

If we agree that this is limited to as.IDate.character(), I will make the changes.

Unkeyed: (benchmark plot)

Keyed: (benchmark plot)

Code used:

library(data.table)

IDate_character = function (x) {
  o__ = data.table:::forderv(list(x), 1L, retGrp=TRUE)
  if (attr(o__, "maxgrpn", exact = TRUE) == 1L) { ## lookup will not help us - let's just make a Date
    ans = as.integer(as.Date(x))
  } else {
    f__ = attr(o__, 'starts', exact = TRUE)
    len__ = data.table:::uniqlengths(f__, length(x))
    
    if (!length(o__)) {
      ans = rep.int(as.integer(as.Date(x[f__])), len__)
    } else {
      ans = integer(length(x))
      ans[o__] = rep.int(as.integer(as.Date(x[o__[f__]])), len__)
    }
  }
  class(ans) = c("IDate", "Date")
  return(ans)
}

IDate_lookup = function(x) {
  tz = "UTC"
  DT = list(input = x)
  setDT(DT)
  
  lookup = unique(DT)
  lookup[ , 'IDate' := as.IDate(as.Date(input, tz = tz))]
  lookup[DT, on = 'input']$IDate
}

IDate_existing_lookup = function(x) {
  tz = "UTC"
  DT = list(date_char = x)
  setDT(DT)
  
  date_lookup[DT, on = 'date_char']$date_int
}

dates = seq.Date(as.Date('1900-01-01'),
                 as.Date('2099-12-31'), by = 'day')
date_lookup = data.table(
  date_char = as.character(dates),
  date_int = as.IDate(dates),
  key = 'date_char'
)

set.seed(3082)
NN = 1e7
smp_dt = copy(date_lookup[sample(.N, NN, TRUE)])
setkey(smp_dt, date_char)

plot(bench::press(
  NN = c(10^(1:8)),
  {
    set.seed(3082)
    smp_dt = copy(date_lookup[sample(.N, NN, TRUE)])
    setkey(smp_dt, date_char) ## comment out to do unkeyed
    bench::mark(
      IDate_character(smp_dt$date_char),
      IDate_lookup(smp_dt$date_char),
      IDate_existing_lookup(smp_dt$date_char),
      # as.IDate(smp_dt$date_char)
      , min_iterations = 2L
    )
  }
))

@codecov

codecov bot commented Jan 13, 2019

Codecov Report

Merging #3279 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3279      +/-   ##
==========================================
+ Coverage   94.81%   94.81%   +<.01%     
==========================================
  Files          65       65              
  Lines       12094    12098       +4     
==========================================
+ Hits        11467    11471       +4     
  Misses        627      627
Impacted Files Coverage Δ
R/IDateTime.R 96.66% <100%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2acb5...99ad385. Read the comment docs.

@codecov

codecov bot commented Jan 13, 2019

Codecov Report

Merging #3279 into master will decrease coverage by 0.28%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3279      +/-   ##
==========================================
- Coverage   95.09%   94.81%   -0.29%     
==========================================
  Files          65       65              
  Lines       12122    12100      -22     
==========================================
- Hits        11528    11473      -55     
- Misses        594      627      +33
Impacted Files Coverage Δ
R/IDateTime.R 96.71% <100%> (+0.18%) ⬆️
R/data.table.R 95.2% <0%> (-2.33%) ⬇️
R/shift.R 93.75% <0%> (-0.37%) ⬇️
R/print.data.table.R 92.47% <0%> (-0.24%) ⬇️
src/assign.c 94.66% <0%> (-0.18%) ⬇️
src/reorder.c 97.43% <0%> (-0.04%) ⬇️
src/fmelt.c 84.71% <0%> (-0.04%) ⬇️
src/subset.c 100% <0%> (ø) ⬆️
src/uniqlist.c 95.55% <0%> (ø) ⬆️
src/frollR.c 100% <0%> (ø) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cd3a46...7c4622a. Read the comment docs.

@jangorecki jangorecki added this to the 1.12.2 milestone Jan 13, 2019
man/IDateTime.Rd Outdated
\code{as.IDate.default}, arguments are passed to \code{as.Date}. For
\code{as.ITime.default}, arguments are passed to \code{as.POSIXlt}.}
\item{tz}{time zone (see \code{strptime}).}
\item{use_merge}{ Should the parsing be done via a merge? See Details. }
Member

It is good to mention what data type it expects; I usually do it as the first word of the argument description. You can fix that for the tz argument as well.
I also don't like the use_merge name, maybe use.dict?
Instead of a character value we could use a numeric one, to avoid the hardcoded length of 1000 for kicking in the merge. Default to 1000; passing Inf would be equivalent to the current FALSE. It can be easier for users to program around that switch knowing the uniqueN of their input.
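Very roughly (sketch only; the argument name here is just a placeholder, and the lookup body mirrors what is already in this PR):

# sketch only -- 'merge.threshold' is a placeholder name, default value not settled
as.IDate.default = function(x, ..., tz = attr(x, "tzone"), merge.threshold = 1000) {
  if (is.null(tz)) tz = "UTC"
  if (length(x) < merge.threshold) {
    # small input, or merge.threshold = Inf (equivalent to the current use_merge = FALSE)
    return(as.IDate(as.Date(x, tz = tz, ...)))
  }
  # large input: convert only the unique values, then join back to the original order
  DT = data.table(x)
  lookup = unique(DT)
  lookup[ , 'IDate' := as.IDate(as.Date(x, tz = tz))]
  lookup[DT, on = 'x']$IDate
}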

Member Author

Feels a bit weird to me to use absolute numbers like that... in terms of knowing your data, you can always pass TRUE/FALSE (I mainly have in mind FALSE, for the case when you know you have very few repeated dates). In any case, if we move to this I'd change the argument name to dict.threshold or similar.

@jangorecki
Member

I think it makes sense to include the same feature for ITime in this PR; the logic will be generally the same.

@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 22, 2019
@MichaelChirico
Member Author

@jangorecki just added ITime logic. Still need to do the can_merge step.


Member
@jangorecki jangorecki left a comment

I would keep the default disabled, just to be safer on CRAN for now, and change it to length(x) > 1000 in 1.12.3.

as.IDate.default <- function(x, ..., tz = attr(x, "tzone"), use_lookup = 'auto') {
if (is.null(tz)) tz = "UTC"
as.IDate(as.Date(x, tz = tz, ...))
if (isTRUE(use_lookup) || (use_lookup == 'auto' && length(x) >= 1000L)) {
Member
@jangorecki jangorecki Feb 10, 2019

How will it behave when a user provides some custom class to this method? This is the default method, so it can get any kind of input. Or am I missing something?
Where is the check that we support merging on the x argument, which we discussed before?

Member Author

PR still incomplete...

@mattdowle mattdowle changed the title from "Closes #2306 -- automatically use look-up table to do IDate conversion" to "use look-up table in IDate conversion" May 23, 2019
@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Sep 17, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@mattdowle
Member

@MichaelChirico Now that the date parser is in fread(), could this PR make as.IDate use that code in fread?

@mattdowle mattdowle removed this from the 1.14.1 milestone Aug 5, 2021
@mattdowle mattdowle added the WIP label Aug 6, 2021
@MichaelChirico MichaelChirico marked this pull request as draft December 14, 2023 11:23


Development

Successfully merging this pull request may close these issues.

Use lookup (join to dictionary) for performance boost

5 participants