Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

forder should detect and switch to integer type on columns that aren't really double #1738

Closed
arunsrinivasan opened this issue Jun 10, 2016 · 2 comments · Fixed by #3624
Closed
Milestone

Comments

@arunsrinivasan
Copy link
Member

@arunsrinivasan arunsrinivasan commented Jun 10, 2016

require(data.table)
set.seed(1L)
dt = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="days"), 1e7, TRUE))
typeof(dt$d)  # double

system.time(dt[, .N, by=d]) # 0.7s
system.time(dt[, .N, by=as.numeric(d)]) # 0.7s
system.time(dt[, .N, by=as.integer(d)]) # 0.2s
@jangorecki jangorecki added this to the 1.12.4 milestone Mar 5, 2019
@jangorecki
Copy link
Member

@jangorecki jangorecki commented May 31, 2019

This looks OK for Date where we don't have to verify that real is actually integer. For ordinary numeric fields we would need to call isReallyReal which is expensive as needs to make full pass over input if it is going to be converted to integer. It might be better to expect from user to use proper data type.

@jangorecki jangorecki removed this from the 1.12.4 milestone May 31, 2019
jangorecki added a commit that referenced this issue Jun 3, 2019
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Jul 19, 2019

for Date where we don't have to verify that real is actually integer

I asked R-core about this once. The reason they made Date double was because Date allows fractional days; e.g. mean(Date). So because they wanted fractional Date to be valid, they made it double up front so it wouldn't change type suddenly midway in an analysis; for type consistency. I can see the logic behind the decision. So anyway, we can't assume that Date has no fractional part, because it can. We'd still have to call isReallReal on Date.

It might be better to expect from user to use proper data type.

Agree. How about just emitting a message for now.

Here are current timings:

> set.seed(1L)
> dt = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="days"),
                           1e7, TRUE))
> typeof(dt$d)
[1] "double"
> dt[, i:=as.integer(d)]
> dt
                   d     i
              <Date> <int>
       1: 2015-04-07 16532
       2: 2015-05-16 16571
       3: 2015-07-29 16645
       4: 2015-11-28 16767
       5: 2015-03-15 16509
      ---                 
 9999996: 2015-05-04 16559
 9999997: 2015-11-19 16758
 9999998: 2015-12-31 16800
 9999999: 2015-01-24 16459
10000000: 2015-08-08 16655
> system.time(dt[, .N, by=d])
   user  system elapsed 
  0.953   0.072   0.377 
> system.time(dt[, .N, by=i])
   user  system elapsed 
  0.425   0.072   0.172 
> system.time(dt[, .N, by=as.integer(d)])
   user  system elapsed 
  0.385   0.061   0.183 

> isReallyReal(dt$d)   # full scan needed here to confirm it is not really real
[1] 0
> system.time(isReallyReal(dt$d))
   user  system elapsed 
  0.036   0.000   0.037 
> system.time(as.integer(dt$d))
   user  system elapsed 
  0.044   0.000   0.043 

with 10x the rows (1e8) [ one timing shown here to keep it simple; I confirmed timing was stable locally ]

> dt = data.table(d=sample(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="days"),
                           1e8, TRUE))
> dt[, i:=as.integer(d)]
> system.time(dt[, .N, by=d])
   user  system elapsed 
  7.929   0.778   2.801 
> system.time(dt[, .N, by=i])
   user  system elapsed 
  3.415   0.647   1.206 
> system.time(dt[, .N, by=as.integer(d)])
   user  system elapsed 
  3.745   0.715   1.630 
> system.time(isReallyReal(dt$d))
   user  system elapsed 
  0.364   0.000   0.364 
> system.time(as.integer(dt$d))
   user  system elapsed 
  0.348   0.076   0.424 

@mattdowle mattdowle added this to the 1.12.4 milestone Jul 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

3 participants