
Conversation


@MichaelChirico MichaelChirico commented Jan 13, 2019

Closes #2603

(and maybe #2503?)

Somewhat rough version intended to get the ball rolling on this -- open to design changes.

The design of use_merge = 'auto' is somewhat back-of-the-envelope, based on the mini-benchmark mentioned here: #2603 (comment)

A more ideal version would be to look at uniqueN(x)/length(x) as part of 'auto', something like:

if (use_merge == 'auto') {
  DT = data.table(x)
  lookup = unique(DT)
  if (nrow(lookup)/length(x) < some_threshold) {   # some_threshold: placeholder
    # convert the unique values only, then join back to the input in its original order
    lookup[ , IDate := as.IDate(x, ..., tz = tz)]
    lookup[DT, on = 'x']$IDate
  } else as.IDate(as.Date(x, ..., tz = tz))
}

but I am assuming (i.e. no benchmarking done on this) once we run unique, we may as well go through with the merge.

R/IDateTime.R Outdated
as.IDate.default <- function(x, ..., tz = attr(x, "tzone"), use_merge = 'auto') {
if (is.null(tz)) tz = "UTC"
as.IDate(as.Date(x, tz = tz, ...))
if (isTRUE(use_merge) || (use_merge == 'auto' && length(x) >= 1000)) {
Member

We could check the class of "x" and redirect to this branch only for types on which data.table can do the merge.

Member Author

Good point. Will lead to strange unexpected errors otherwise.

Do we keep a list of known merge types? If not we should...

Member

In setops.R I defined one, so you can look it up from there; or maybe put it into a helper function and reuse it.

Member Author

Yes, I was thinking something like can_merge = function(x) class(x) %chin% known_types, or just keeping known_types = function() c(...).
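For illustration, a rough sketch (the helper names and the exact type list are not final):

# sketch only -- names and the type whitelist are illustrative, not final
merge_types = function() c("logical", "integer", "double", "character", "factor", "Date", "IDate")
can_merge   = function(x) class(x)[1L] %chin% merge_types()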

Member

can_merge would need to handle multiple columns, so maybe it's simpler to just have merge_types.

Member Author

I guess this is what you're referring to:

bad_types = c("raw", "complex", if (block_list) "list")
  found = bad_types %chin% c(vapply(x, typeof, FUN.VALUE = ""),
                             vapply(y, typeof, FUN.VALUE = ""))

R/IDateTime.R Outdated
if (isTRUE(use_merge) || (use_merge == 'auto' && length(x) >= 1000)) {
DT = data.table(x)
# shut off use_merge to prevent recursion
unique(DT)[ , 'IDate' := as.IDate(x, tz = tz, ..., use_merge = FALSE)
Member

I would use "unique" before creating data.table

Member Author

Could you elaborate? Not quite following the reasoning... my thinking was unique.data.table would be more efficient.

Member

It will be more efficient than the data.frame method, but I'm not sure compared to working on the plain vector. Base R now uses data.table's ordering code (not the parallel version, but the previous one; still super fast). The fastest way might be to use forder(retGrp=TRUE) and then subset to the unique values.
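Roughly, with the internal forderv (untested sketch):

# untested sketch: one forderv call gives both the sort order and the group starts
o = forderv(data.table(x), "x", retGrp = TRUE)
s = attr(o, "starts")
ux = if (length(o)) x[o[s]] else x[s]   # unique values; length(o) == 0L means x was already sorted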

Member
@jangorecki jangorecki Jan 13, 2019

When using unique.data.table, note we have the internal funique function in setops.R. AFAIU it will avoid an extra copy of the input if the values are already unique.

Member Author

True. Yes, I think the forder approach should be best, since it'll obviate the need to do both unique and merge, if I'm not mistaken...

Member Author

OK, it's not immediately obvious to me how to do it with just forderv... here's what I came up with, and it's horribly inefficient:

    o = forderv(DT, retGrp = TRUE)
    s = attr(o, 'starts')   # first position of each distinct value within the ordered vector
    n = length(x)
    out = integer(n)
    lookup = as.IDate(as.Date(x[o][s], tz = tz, ...))   # convert each distinct value once
    for (ii in seq_along(s)) {
      end_idx = if (ii == length(s)) n else s[ii + 1L] - 1L
      # fill the run for group ii; out[o][...] <- copies the whole of out on every iteration
      out[o][s[ii]:end_idx] = lookup[ii]
    }
    setattr(out, 'class', c('IDate', 'Date'))

Member
@jangorecki jangorecki Feb 7, 2019

@MichaelChirico which part is inefficient? The loop? src/vecseq.c might be a good solution for that.
Other ideas:

  • x is reordered according to o and then subset according to s. Wouldn't it be better to first subset and then reorder?
cc(F)
x = c(2L,4L,1L,2L,3L,1L,4L)
o = forderv(data.table(x), "x", retGrp=TRUE)
s = attr(o, 'starts')
x[o][s]
x[o[s]]
  • Instead of as.IDate(as.Date(...)), it would be better to handle that in as.IDate only, if possible.

Contributor
@ColeMiller1 ColeMiller1 Sep 17, 2020

Here's an approach with forderv:

IDate_character = function (x) {
  o__ = data.table:::forderv(list(x), 1L, retGrp=TRUE)
  if (attr(o__, "maxgrpn") == 1L) { ## lookup will not help us - let's just make a Date
    ans = as.integer(as.Date(x))
  } else {
    f__ = attr(o__, 'starts')
    len__ = data.table:::uniqlengths(f__, length(x))
    
    if (!length(o__)) {
      ans = rep.int(as.integer(as.Date(x[f__])), len__)
    } else {
      ans = integer(length(x))
      ans[o__] = rep.int(as.integer(as.Date(x[o__[f__]])), len__)
    }
  }
  class(ans) = c("IDate", "Date")
  return(ans)
}

Initial testing is promising - more performant than a lookup self-join and results seem to align with as.IDate. But I will admit it does not look as clean as the join method.

Edit: I forgot to add that this would give us uniqueN for free with length(f__).
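That is, standalone (sketch, reusing the same internal calls as above):

# sketch: the 'starts' attribute has one entry per distinct value
o__ = data.table:::forderv(list(x), 1L, retGrp = TRUE)
f__ = attr(o__, "starts")
length(f__)   # same value as uniqueN(x), with no extra pass over the data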

Member Author

Great work! For our internal code I think this sort of usage of forderv is common enough that we don't need to worry too much about "cleanliness", as there are a few other examples too.

A benchmark would be nice; in any case feel free to update/merge directly into this branch to be sure you're getting the commit credit 😃

Contributor

There are 3 options we are discussing.

  1. Make a lookup table based on unique dates
  2. Use an existing lookup table that would invisibly take up memory (but be fast)
  3. Use forderv directly

Additionally, there are two flavors, unkeyed and keyed. For the most part, option 1 is the fastest, although once we get to 1e8 rows the forderv option surprisingly starts to outpace it. Note that option 2 (the one proposed in the FR) is faster than my proposal for keyed input at 1e5 rows.

If we agree that this is limited to as.IDate.character(), I will make the changes.

Unkeyed: (benchmark plot)

Keyed: (benchmark plot)

Code used:

library(data.table)

IDate_character = function (x) {
  o__ = data.table:::forderv(list(x), 1L, retGrp=TRUE)
  if (attr(o__, "maxgrpn", exact = TRUE) == 1L) { ## lookup will not help us - let's just make a Date
    ans = as.integer(as.Date(x))
  } else {
    f__ = attr(o__, 'starts', exact = TRUE)
    len__ = data.table:::uniqlengths(f__, length(x))
    
    if (!length(o__)) {
      ans = rep.int(as.integer(as.Date(x[f__])), len__)
    } else {
      ans = integer(length(x))
      ans[o__] = rep.int(as.integer(as.Date(x[o__[f__]])), len__)
    }
  }
  class(ans) = c("IDate", "Date")
  return(ans)
}

IDate_lookup = function(x) {
  tz = "UTC"
  DT = list(input = x)
  setDT(DT)
  
  lookup = unique(DT)
  lookup[ , 'IDate' := as.IDate(as.Date(input, tz = tz))]
  lookup[DT, on = 'input']$IDate
}

IDate_existing_lookup = function(x) {
  tz = "UTC"
  DT = list(date_char = x)
  setDT(DT)
  
  date_lookup[DT, on = 'date_char']$date_int
}

dates = seq.Date(as.Date('1900-01-01'),
                 as.Date('2099-12-31'), by = 'day')
date_lookup = data.table(
  date_char = as.character(dates),
  date_int = as.IDate(dates),
  key = 'date_char'
)

set.seed(3082)
NN = 1e7
smp_dt = copy(date_lookup[sample(.N, NN, TRUE)])
setkey(smp_dt, date_char)

plot(bench::press(
  NN = c(10^(1:8)),
  {
    set.seed(3082)
    smp_dt = copy(date_lookup[sample(.N, NN, TRUE)])
    setkey(smp_dt, date_char) ## comment out to do unkeyed
    bench::mark(
      IDate_character(smp_dt$date_char),
      IDate_lookup(smp_dt$date_char),
      IDate_existing_lookup(smp_dt$date_char),
      # as.IDate(smp_dt$date_char)
      , min_iterations = 2L
    )
  }
))

@codecov

codecov bot commented Jan 13, 2019

Codecov Report

Merging #3279 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3279      +/-   ##
==========================================
+ Coverage   94.81%   94.81%   +<.01%     
==========================================
  Files          65       65              
  Lines       12094    12098       +4     
==========================================
+ Hits        11467    11471       +4     
  Misses        627      627
Impacted Files Coverage Δ
R/IDateTime.R 96.66% <100%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2acb5...99ad385. Read the comment docs.

@codecov

codecov bot commented Jan 13, 2019

Codecov Report

Merging #3279 into master will decrease coverage by 0.28%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3279      +/-   ##
==========================================
- Coverage   95.09%   94.81%   -0.29%     
==========================================
  Files          65       65              
  Lines       12122    12100      -22     
==========================================
- Hits        11528    11473      -55     
- Misses        594      627      +33
Impacted Files Coverage Δ
R/IDateTime.R 96.71% <100%> (+0.18%) ⬆️
R/data.table.R 95.2% <0%> (-2.33%) ⬇️
R/shift.R 93.75% <0%> (-0.37%) ⬇️
R/print.data.table.R 92.47% <0%> (-0.24%) ⬇️
src/assign.c 94.66% <0%> (-0.18%) ⬇️
src/reorder.c 97.43% <0%> (-0.04%) ⬇️
src/fmelt.c 84.71% <0%> (-0.04%) ⬇️
src/subset.c 100% <0%> (ø) ⬆️
src/uniqlist.c 95.55% <0%> (ø) ⬆️
src/frollR.c 100% <0%> (ø) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cd3a46...7c4622a. Read the comment docs.

@jangorecki jangorecki added this to the 1.12.2 milestone Jan 13, 2019
man/IDateTime.Rd Outdated
\code{as.IDate.default}, arguments are passed to \code{as.Date}. For
\code{as.ITime.default}, arguments are passed to \code{as.POSIXlt}.}
\item{tz}{time zone (see \code{strptime}).}
\item{use_merge}{ Should the parsing be done via a merge? See Details. }
Member

It is good to mention what data type it expects; I usually do it as the first word of the argument description. You can fix that for the tz argument as well.
I also don't like the use_merge name, maybe use.dict?
Instead of a character value we could use a numeric one, to avoid the hardcoded length of 1000 for kicking in the merge. Default to 1000; passing Inf would be equivalent to the current FALSE. It can be easier for users to program around that switch knowing the uniqueN of their input.
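Very roughly (sketch only; the argument name here is just a placeholder, and the lookup body mirrors what is already in this PR):

# sketch only -- 'merge.threshold' is a placeholder name, default value not settled
as.IDate.default = function(x, ..., tz = attr(x, "tzone"), merge.threshold = 1000) {
  if (is.null(tz)) tz = "UTC"
  if (length(x) < merge.threshold) {
    # small input, or merge.threshold = Inf (equivalent to the current use_merge = FALSE)
    return(as.IDate(as.Date(x, tz = tz, ...)))
  }
  # large input: convert only the unique values, then join back to the original order
  DT = data.table(x)
  lookup = unique(DT)
  lookup[ , 'IDate' := as.IDate(as.Date(x, tz = tz))]
  lookup[DT, on = 'x']$IDate
}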

Member Author

Feels a bit weird to me to use absolute numbers like that... in terms of knowing your data, you can always pass TRUE/FALSE (I mainly have in mind FALSE, for the case when you know you have very few repeated dates). In any case, if we move to this I'd change the argument name to dict.threshold or similar.

@jangorecki
Member

I think it makes sense to include the same feature for ITime in this PR; the logic will be generally the same.

@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 22, 2019
@MichaelChirico
Member Author

@jangorecki just added ITime logic. Still need to do the can_merge step.


Member
@jangorecki jangorecki left a comment

I would keep the default disabled, just to be safer on CRAN for now, and change it to length(x) > 1000 in 1.12.3.

as.IDate.default <- function(x, ..., tz = attr(x, "tzone"), use_lookup = 'auto') {
if (is.null(tz)) tz = "UTC"
as.IDate(as.Date(x, tz = tz, ...))
if (isTRUE(use_lookup) || (use_lookup == 'auto' && length(x) >= 1000L)) {
Member
@jangorecki jangorecki Feb 10, 2019

How will it behave when a user provides some custom class to this method? This is the default method, so it can get any kind of input. Or am I missing something?
Where is the check that we support merging on the x argument, which we discussed before?

Member Author

PR still incomplete...

@mattdowle mattdowle changed the title from "Closes #2306 -- automatically use look-up table to do IDate conversion" to "use look-up table in IDate conversion" May 23, 2019
@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Sep 17, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@mattdowle
Member

@MichaelChirico Now that the date parser is in fread(), could this PR make as.IDate use that code in fread?

@mattdowle mattdowle removed this from the 1.14.1 milestone Aug 5, 2021
@mattdowle mattdowle added the WIP label Aug 6, 2021
@MichaelChirico MichaelChirico marked this pull request as draft December 14, 2023 11:23


Development

Successfully merging this pull request may close these issues.

Use lookup (join to dictionary) for performance boost

5 participants