
subsetting n rows by .SD is O(e^n) #1400

Closed
geneorama opened this issue Oct 19, 2015 · 23 comments

@geneorama

Please feel free to edit the subject; I may not be saying this right...

I've noticed again and again that subsetting on .SD gets less efficient as the size of a data.table increases. This has been consistent across versions of R, across Linux and Windows, and across versions of data.table.

Here's an example based on license data from data.cityofchicago.org

[screenshot: console session timing each additional 100k rows]

Never mind that I'm debugging, the point is that each time I add 100k rows the additional processing time grows exponentially, even though I set LICENSE_NUMBER as the key.

The times visualized:
plot(diff(c(1.34, 2.95, 4.56, 6.32, 8.85, 11.28, 15.27, 23.22)))
lines(lowess(diff(c(1.34, 2.95, 4.56, 6.32, 8.85, 11.28, 15.27, 23.22))))
[plot: first differences of the elapsed times, with lowess fit]

Has anyone else noticed this?

@franknarf1
Contributor

Interesting. How many groups per 100k rows? If you had done

dat[, N := .N, by=LICENSE_NUMBER]
setorder(dat,N)

before this (as an extreme case), then your observation would be confounded, right? If you posted

dput( dat[, .GRP, by=LICENSE_NUMBER]$GRP )

somewhere (not into a post here, of course), then we could all see the exact same data and compare alternatives (using .I or cumsum and diff, etc.), and compare the timing you have vs dat800 <- dat[seq(800000)]; system.time(dat800[, .SD[1], by=LICENSE_NUMBER]).
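
(A minimal sketch of the comparison franknarf1 proposes, assuming a data.table dat with a LICENSE_NUMBER column as in the thread:)

library(data.table)

## time the .SD route against the row-index (.I) route on the same subset
dat800 <- dat[seq(800000)]
system.time(dat800[, .SD[1], by = LICENSE_NUMBER])            # builds .SD for every group
system.time(dat800[dat800[, .I[1], by = LICENSE_NUMBER]$V1])  # collects row numbers, then does one subset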

@geneorama
Author

geneorama commented Oct 20, 2015

@franknarf1 Test away!

It takes about 10 minutes to download the data the first time, unless someone else has recently done it, in which case the server has a cached copy that it serves up faster.

Please do not use this code, see updated example below

##------------------------------------------------------------------------------
## Initialize
##------------------------------------------------------------------------------
library(data.table)
devtools::install_github("chicago/RSocrata", ref="sprint7")
bus <- RSocrata::read.socrata(hostname="https://data.cityofchicago.org",
                              resourcePath="r5kz-chrr.csv", ## <- actually JSON
                              keyfield = "id")

## If you want to run this again, save the file rather than downloading it again
# saveRDS(bus, "bus.Rds")
# bus <- readRDS("bus.Rds")

bus <- as.data.table(bus)

##------------------------------------------------------------------------------
## Current function to do subsetting
##------------------------------------------------------------------------------
most_recent_bus_lic_fast <- function(dt){
    dat <- copy(dt)
    ## Keep records where LICENSE_ID is the maximum LICENSE_ID for all records 
    ## of a particular LICENSE_NUMBER (in other words, just keep the most 
    ## recent license number)
    dat <- dat[ , keep := (LICENSE.ID) == max(LICENSE.ID), LICENSE.NUMBER]
    dat <- dat[keep==TRUE]
    dat <- dat[ , keep:=NULL][]
    return(dat)
}

system.time(bus_recent1 <- most_recent_bus_lic_fast(bus))

##------------------------------------------------------------------------------
## The "living the dream" function to do subsetting
##------------------------------------------------------------------------------
most_recent_bus_lic_slow <- function(dt){
    dat <- copy(dt)
    ## Same as above, but using the preferred .SD method
    dat <- dat[ , .SD[LICENSE.ID == max(LICENSE.ID)], LICENSE.NUMBER]
    return(dat)
}

## Time tests with only 10k chunks of records, previous test was with 100k
setkey(bus, LICENSE.NUMBER, LICENSE.ID)
system.time(bus_recent2a <- most_recent_bus_lic_slow(bus[1:10000]))
system.time(bus_recent2b <- most_recent_bus_lic_slow(bus[1:20000]))
system.time(bus_recent2c <- most_recent_bus_lic_slow(bus[1:30000]))
system.time(bus_recent2d <- most_recent_bus_lic_slow(bus[1:40000]))
system.time(bus_recent2e <- most_recent_bus_lic_slow(bus[1:50000]))

Here're the results I'm seeing:
Fast version 🚀
[screenshot: timings for the dummy-variable (fast) version]

Slow version 🐢
[screenshot: timings for the .SD (slow) version]

@geneorama
Author

PS: I love working in the public sector where I can often share my data.

@franknarf1
Contributor

Hm, haven't downloaded it myself yet, but weren't you doing .SD[1] before, not finding the max? If you're looking to select the max, this SO answer shows the current idiom: http://stackoverflow.com/a/16574176/1191259

If license IDs are unique or you don't mind just taking the first max:

dat[order(LICENSE.ID, decreasing=FALSE), .SD[1], by=LICENSE.NUMBER]

should work.

One minor point that could be costing you runtime: You shouldn't need to copy or use <- here except for the end result, like

most_recent_bus_lic_fast2 <- function(dat){
    dat[ , keep := LICENSE.ID == max(LICENSE.ID), LICENSE.NUMBER]
    return(dat[(keep)][ , keep := NULL][])  # subset first, then drop the helper column
}

Copying a big table can take time and dat <- dat[..., x := y, ...] has no additional effect beyond dat[..., x := y, ...] since := modifies by reference. You might want to browse the vignettes if you're not familiar with what I mean: https://github.com/Rdatatable/data.table/wiki/Getting-started
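
(A tiny self-contained demonstration of that point, using data.table's exported address() helper; the object names here are mine:)

library(data.table)
d <- data.table(a = 1:3)
before <- address(d)      # memory address of d
d2 <- d[, b := a * 2L]    # := modifies d in place and returns it invisibly
address(d2) == before     # TRUE: no copy was made, so the `d2 <-` was redundant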

@geneorama
Author

@franknarf1 my issue / question is "why does .SD take exponentially longer as you linearly add rows?"

Originally I showed .SD[1] to provide a simple example, and because it was what I had on hand, and to avoid solutions where people try to find workarounds rather than focusing on the point of the question.

I have good reasons for using copy and for using the assignment operator, but don't want to debate those choices unless it's relevant to the question. Feel free to remove it in your tests.

Yes, it's possible to write the same procedure using the .I syntax in your SO example. This is worth remembering / good to know. However, I think it's a lot harder to read, and it doesn't answer the question about .SD.

Comparison of .I syntax to .SD
bus <- bus[bus[, .I[LICENSE.ID == max(LICENSE.ID)], by = LICENSE.NUMBER]$V1]
bus <- bus[ , .SD[LICENSE.ID == max(LICENSE.ID)], LICENSE.NUMBER]
Timing with .I Syntax
[screenshot: timings with the .I syntax]

@eantonya
Contributor

As @franknarf1 mentioned, to understand the timings you first need to understand the size of the loops. In this case: how many more unique license numbers do you get for each 100k rows you add, i.e. what are these numbers:

nrow(dat[1:1e5, 1, LICENSE_NUMBER])
nrow(dat[1:2e5, 1, LICENSE_NUMBER])
...
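
(A hedged sketch that computes all of those counts in one go; same idiom, just looped, with the column name taken from the thread:)

sizes  <- seq(1e5, 8e5, by = 1e5)
groups <- sapply(sizes, function(n) nrow(dat[1:n, 1, LICENSE_NUMBER]))
plot(sizes, groups)  # roughly linear growth here would rule out loop size as the culprit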

@franknarf1
Contributor

Sorry, still haven't looked at your data, but here's mine, for which it is linear:

DT      = data.table(a=1:1e8, b=rep(1:1e6, each=100))
DTfracs = lapply(seq(1e7,1e8,1e7), function(i) DT[seq(i)]) 

for(di in seq_along(DTfracs))
    print(system.time(DTfracs[[di]][, .SD[1], by=b]))

   user  system elapsed 
   0.09    0.01    0.11 
   user  system elapsed 
   0.24    0.03    0.27 
   user  system elapsed 
   0.31    0.02    0.33 
   user  system elapsed 
   0.35    0.03    0.37 
   user  system elapsed 
   0.52    0.02    0.53 
   user  system elapsed 
   0.58    0.04    0.62 
   user  system elapsed 
   0.64    0.05    0.68 
   user  system elapsed 
   0.70    0.06    0.77 
   user  system elapsed 
   0.81    0.03    0.85 
   user  system elapsed 
   0.87    0.10    0.97 

Note that my benchmark does not have the subsetting operation inside of it; DT[1:1e8] is not free.

@geneorama
Author

There are lots of examples that don't show this problem, which is why I provided this particular example. If it were easy to find a good example I would have posted this question a year ago when I first had the problem. I've had the same issue with many types of real-world data: the .SD approach takes a long time. However, I have not been able to contrive an example that has the same problem.

It doesn't take that long to download the data. Use the CSV download and read.csv if you don't want to install devtools and/or the RSocrata reader (and dependencies). Or manually inspect the data here (click on the "LICENSE_NUMBER" column to sort by that column):
https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr

The LICENSE_NUMBER column is about 30% unique; see the NAsummary below:

str(fbus)
# Classes ‘data.table’ and 'data.frame':    843765 obs. of  31 variables:
#  $ ID                               : chr  "1000000-20020221" "1000049-20010816" "1000049-20020516" "1000049-20020816" ...
#  $ LICENSE_ID                       : num  1000000 1162772 1233615 1265665 1342680 ...
#  $ ACCOUNT_NUMBER                   : num  200001 200068 10141 200068 10141 ...
#  $ SITE_NUMBER                      : num  1 1 2 1 2 1 2 1 1 2 ...
#  $ LEGAL_NAME                       : chr  "MARK BOSTON" "ANTONIA CASTREJON" "PEPE\"S RETAIL MEATS, INC." "ANTONIA CASTREJON" ...
#  $ DOING_BUSINESS_AS_NAME           : chr  "COLORS IN MOTION" "ILLUSIONS HAIR DESIGN" "PEREZ MEXICAN FOOD" "ILLUSIONS HAIR DESIGN" ...
#  $ ADDRESS                          : chr  "6421 N DAMEN AVE" "3800 W DIVERSEY AVE" "853-855 W RANDOLPH ST 1ST" "3800 W DIVERSEY AVE" ...
#  $ CITY                             : chr  "CHICAGO" "CHICAGO" "CHICAGO" "CHICAGO" ...
#  $ STATE                            : chr  "IL" "IL" "IL" "IL" ...
#  $ ZIP_CODE                         : chr  "60645" "60647" "60607" "60647" ...
#  $ WARD                             : num  50 30 27 30 27 30 27 30 30 27 ...
#  $ PRECINCT                         : num  28 999 1 999 1 999 1 999 999 1 ...
#  $ POLICE_DISTRICT                  : num  24 25 12 25 12 25 12 25 25 12 ...
#  $ LICENSE_CODE                     : num  1011 1010 1006 1010 1006 ...
#  $ LICENSE_DESCRIPTION              : chr  "Home Repair" "Limited Business License" "Retail Food Establishment" "Limited Business License" ...
#  $ LICENSE_NUMBER                   : num  1e+06 1e+06 1e+06 1e+06 1e+06 ...
#  $ APPLICATION_TYPE                 : chr  "ISSUE" "RENEW" "RENEW" "RENEW" ...
#  $ APPLICATION_CREATED_DATE         : IDate, format: "2000-06-19" NA NA ...
#  $ APPLICATION_REQUIREMENTS_COMPLETE: IDate, format: "2002-02-15" "2001-06-25" "2002-03-27" ...
#  $ PAYMENT_DATE                     : IDate, format: "2002-02-15" "2001-08-20" "2002-04-17" ...
#  $ CONDITIONAL_APPROVAL             : chr  "N" "N" "N" "N" ...
#  $ LICENSE_TERM_START_DATE          : IDate, format: "2002-02-21" "2001-08-16" "2002-05-16" ...
#  $ LICENSE_TERM_EXPIRATION_DATE     : IDate, format: "2002-11-15" "2002-08-15" "2003-05-15" ...
#  $ LICENSE_APPROVED_FOR_ISSUANCE    : IDate, format: "2002-02-21" "2001-08-20" "2002-04-17" ...
#  $ DATE_ISSUED                      : IDate, format: "2002-02-22" "2002-04-30" "2002-04-18" ...
#  $ LICENSE_STATUS                   : chr  "AAI" "AAI" "AAI" "AAI" ...
#  $ LICENSE_STATUS_CHANGE_DATE       : IDate, format: NA NA NA ...
#  $ SSA                              : num  NA NA NA NA NA NA NA NA NA NA ...
#  $ LATITUDE                         : num  42 41.9 41.9 41.9 41.9 ...
#  $ LONGITUDE                        : num  -87.7 -87.7 -87.6 -87.7 -87.6 ...
#  $ LOCATION                         : chr  "(41.99851437112669, -87.68001090539342)" "(41.931960332638006, -87.72215036594574)" "(41.88426142200001, -87.6495341312589)" "(41.931960332638006, -87.72215036594574)" ...
#  - attr(*, ".internal.selfref")=<externalptr> 

geneorama::NAsummary(fbus)
#                                   col  Count    nNA    rNA nUnique rUnique
# ID                                  1 843765      0 0.0000  843529  0.9997
# LICENSE_ID                          2 843765      0 0.0000  843765  1.0000
# ACCOUNT_NUMBER                      3 843765      0 0.0000  146771  0.1739
# SITE_NUMBER                         4 843765      0 0.0000     387  0.0004
# LEGAL_NAME                          5 843765      0 0.0000  144814  0.1716
# DOING_BUSINESS_AS_NAME              6 843765      0 0.0000  165890  0.1966
# ADDRESS                             7 843765      0 0.0000  156344  0.1852
# CITY                                8 843765      0 0.0000    1666  0.0019
# STATE                               9 843765      0 0.0000      55  0.0000
# ZIP_CODE                           10 843765      8 0.0000    2494  0.0029
# WARD                               11 843765  61102 0.0724      51  0.0000
# PRECINCT                           12 843765  79015 0.0936      79  0.0000
# POLICE_DISTRICT                    13 843765  74450 0.0882      32  0.0000
# LICENSE_CODE                       14 843765      0 0.0000     132  0.0001
# LICENSE_DESCRIPTION                15 843765      0 0.0000     132  0.0001
# LICENSE_NUMBER                     16 843765      1 0.0000  259203  0.3071 <- here you go
# APPLICATION_TYPE                   17 843765      0 0.0000       6  0.0000
# APPLICATION_CREATED_DATE           18 843765 651561 0.7722    3813  0.0045
# APPLICATION_REQUIREMENTS_COMPLETE  19 843765  17442 0.0206    4167  0.0049
# PAYMENT_DATE                       20 843765  20734 0.0245    4994  0.0059
# CONDITIONAL_APPROVAL               21 843765      0 0.0000       2  0.0000
# LICENSE_TERM_START_DATE            22 843765   2499 0.0029    3574  0.0042
# LICENSE_TERM_EXPIRATION_DATE       23 843765     63 0.0000    1636  0.0019
# LICENSE_APPROVED_FOR_ISSUANCE      24 843765  43718 0.0518    4776  0.0056
# DATE_ISSUED                        25 843765      0 0.0000    3478  0.0041
# LICENSE_STATUS                     26 843765      0 0.0000       5  0.0000
# LICENSE_STATUS_CHANGE_DATE         27 843765 786570 0.9322    3396  0.0040
# SSA                                28 843765 622703 0.7380      54  0.0000
# LATITUDE                           29 843765  61956 0.0734   73869  0.0875
# LONGITUDE                          30 843765  61956 0.0734   73876  0.0875
# LOCATION                           31 843765      0 0.0000   73888  0.0875

@geneorama geneorama reopened this Oct 20, 2015
@eantonya
Contributor

@geneorama my guess is your expectations are simply off - what slows down loops (aka by operations) is usually not so much the size of the data, but how large the loop is (aka how many unique groups you have).

Compute the numbers I suggested, and if those do not grow "exponentially", then we have something interesting here.

@geneorama
Author

[screenshots: unique LICENSE_NUMBER counts for each subset size, as requested above]

@geneorama
Author

@eantonya I think I just now realized your point about what the time increases would look like if the growth were exponential.

It's hard to pick up on the nuances by just reading the thread, and I think you missed the detail that the entire 800k+ row data set is processed in less than 1 second if you do the process in steps / use .I (as @franknarf1's comment suggested).

So sure, the increments could be growing if the distribution is changing, but the whole thing shouldn't take more than a second, and it definitely shouldn't take more than 10 seconds.

@arunsrinivasan
Member

.SD[1] was optimised to list(...) but:

system.time(bus[1:800000, .SD[1L], by=LICENSE.NUMBER])
#    user  system elapsed 
#  19.379   0.215  19.804 
system.time(bus[1:800000, .I[1L], by=LICENSE.NUMBER])
#    user  system elapsed 
#   0.085   0.007   0.092 
system.time(bus[bus[1:800000, .I[1L], by=LICENSE.NUMBER]$V1])
#    user  system elapsed 
#   0.157   0.018   0.176 

.SD[expr] should be even slower... Definitely needs some work.
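
(A plausible intuition for this, my gloss rather than Arun's words: j = .SD[expr] materialises a subset of .SD once per group, so the per-group overhead dominates when groups are many and small, while .I[expr] only collects row numbers. A toy sketch with deliberately tiny groups:)

library(data.table)
DT <- data.table(g = rep(1:2e4, each = 2), x = runif(4e4))  # 20k groups of 2 rows each
system.time(DT[, .SD[x == max(x)], by = g])                 # pays the .SD cost 20k times
system.time(DT[DT[, .I[x == max(x)], by = g]$V1])           # gathers indices, one subset at the end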

@geneorama
Author

@arunsrinivasan Am I using .SD as intended? I'm surprised that this isn't a known issue.

@arunsrinivasan
Member

There's an entire issue for it: #735

@geneorama
Author

For the record, I think @franknarf1 had the best suggestion in the beginning. Since I only wanted to subset the rows within a group, I don't need the overhead of .SD; I can just use .I.

In my mind, since the S in .SD stands for "subset", I think of it first for subsetting. However, my new understanding is that .I is a better choice.

This advice was very good (and I use this pattern myself quite often):

If license IDs are unique or you don't mind just taking the first max:
[method to sort by group id, and take first element]

As @franknarf1 pointed out, there could be an issue with taking the first max, which is why I prefer the longer route inspired by the SO link. I would just break it up into pieces to make it more readable:

ii <- bus[ , .I[(LICENSE.ID) == max(LICENSE.ID)], LICENSE.NUMBER]
bus_most_recent <- bus[ii$V1]
rm(ii)

I'm wondering though: whenever .SD is used with only an i expression, as in my example dat[ , .SD[LICENSE.ID == max(LICENSE.ID)], LICENSE.NUMBER], maybe it would be easiest to convert the call to use .I instead of .SD?
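
(A hedged check that the two forms agree; the .SD form puts the by column first, so columns need reordering before comparing, and this assumes bus is keyed by LICENSE.NUMBER so the row orders match too:)

res_sd <- bus[, .SD[LICENSE.ID == max(LICENSE.ID)], by = LICENSE.NUMBER]
res_i  <- bus[bus[, .I[LICENSE.ID == max(LICENSE.ID)], by = LICENSE.NUMBER]$V1]
## setcolorder reorders res_i by reference to match res_sd's column order
all.equal(res_sd, setcolorder(res_i, names(res_sd)), check.attributes = FALSE)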

@franknarf1
Contributor

Yeah, that's the spirit of the optimization for .SD in the github issue Arun linked, I think.

By the way, eantonya is the @eddi you see authoring that now-idiomatic .I answer on SO, and is a contributor to the package itself.

@geneorama
Author

@franknarf1 Thanks. I couldn't follow the linked issue yesterday, so I didn't see the link.

Also, thanks for linking eantonya to eddi. I've been using data.table in place of data.frame for a few years now, but at least once a week I'm plumbing the depths of https://stackoverflow.com/questions/tagged/data.table and relying on the answers that all of you provide there.

I didn't want to get off track yesterday, but I use copy in every function to avoid accidental "in place" modifications. It's caught me off guard before, and I never want it to happen again ("fool me once... can't get fooled again," as GW says).
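
(A minimal demo of the hazard copy() guards against; the function and column names here are made up:)

library(data.table)
f_bad  <- function(d) d[, flag := TRUE]                       # silently mutates the caller's table
f_safe <- function(d) { d <- copy(d); d[, flag := TRUE]; d }  # works on its own copy
x <- data.table(a = 1:2)
f_bad(x);  "flag" %in% names(x)   # TRUE: the caller's x was changed
y <- data.table(a = 1:2)
f_safe(y); "flag" %in% names(y)   # FALSE: y is untouched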

Also, I use <- for several reasons: to safeguard against accidental printing in .Rmd files, to help non-data.table users follow along, and for my own readability.

@MichaelChirico
Member

Re-running this in this environment:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

During startup - Warning messages:
1: Setting LC_COLLATE failed, using "C" 
2: Setting LC_TIME failed, using "C" 
3: Setting LC_MESSAGES failed, using "C" 
4: Setting LC_MONETARY failed, using "C" 
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin17.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] C/UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.1

I ran this:

library(data.table)

DT = fread('~/Downloads/Business_Licenses.csv', check.names = TRUE)

NN = 8
tt = numeric(NN)

for (nn in 1:NN) {
  DTc  = copy(DT)
  tt[nn] = system.time(DTc[1:(1e5*nn)][ , .SD[1L], LICENSE.NUMBER])['elapsed']
}

png('~/Desktop/timings.png')
plot(1:NN, tt)
dev.off()

Output plot:

[plot: elapsed time vs subset size on R 3.5.1]

So, maybe this was really a base R problem of 1:N?
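
(One plausible reading of that remark, my speculation rather than anything stated here: R 3.5.0 introduced ALTREP, which represents 1:N compactly instead of allocating N integers, so the repeated DT[1:(1e5*nn)] subsetting got cheaper. A quick way to check in a given R version:)

x <- 1:1e8
.Internal(inspect(x))  # prints "compact" on R >= 3.5.0; a fully allocated vector otherwise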

@geneorama are you able to reproduce the slow timing again?

I would close this now but our dependency is still on 3.1.0...

@MichaelChirico
Member

Here's what I got from running on R 3.1.0 via Jan's Docker image

https://hub.docker.com/r/jangorecki/r-3.1.0/~/dockerfile/

[plot: elapsed time vs subset size on R 3.1.0]

So, whatever problem was there doesn't appear to be there anymore... closing

@geneorama
Author

There is still a huge performance issue, but maybe it doesn't belong in this issue anymore because the growth of the time is no longer exponential.

Still, by my calculations it takes about 300 times longer to use .SD than to use the dummy-variable workaround.

Originally I focused on the growing time because it was growing so fast that I couldn't estimate how long it would take to do the whole data set. Now the growth is linear, so I can estimate it, which is where I got the 300x figure above (roughly 235 seconds predicted for .SD versus about 0.7 seconds for the dummy-variable version).

Since I authored my example, we've updated RSocrata, and it's now possible to download this large data set without installing a special branch of the package. It still takes a while to download, typically more than 15 minutes.

Note: the following will install the RSocrata package from CRAN if you don't already have it.

##------------------------------------------------------------------------------
## Install RSocrata, load libraries, download data
## (note, new url , and new field names)
##------------------------------------------------------------------------------
library(data.table)
if(!"RSocrata" %in% rownames(installed.packages())) install.packages("RSocrata")
bus <- as.data.table(RSocrata::read.socrata("https://data.cityofchicago.org/resource/xqx5-8hwx.csv"))

##------------------------------------------------------------------------------
## Current function to do subsetting
## This is the best I can do in terms of performance; create a dummy 
## variable called "keep" then delete it later.
##------------------------------------------------------------------------------
most_recent_bus_lic_fast <- function(dt){
    dat <- copy(dt)
    ## Keep records where LICENSE_ID is the maximum LICENSE_ID for all records 
    ## of a particular LICENSE_NUMBER (in other words, just keep the most 
    ## recent license number)
    dat <- dat[ , keep := (license_id) == max(license_id), license_number]
    dat <- dat[keep==TRUE]
    dat <- dat[ , keep:=NULL][]
    return(dat)
}

## This takes about .7 seconds on my humble laptop
system.time(bus_recent1 <- most_recent_bus_lic_fast(bus))

##------------------------------------------------------------------------------
## This is how I would want to subset; by using the beauty of .SD and 
## just one line of code!
##------------------------------------------------------------------------------
most_recent_bus_lic_slow <- function(dt){
    dat <- copy(dt)
    ## Same as above, but using the preferred .SD method
    dat <- dat[ , .SD[license_id == max(license_id)], license_number]
    return(dat)
}

##------------------------------------------------------------------------------
## Calculate system times
##------------------------------------------------------------------------------

## Set the key for .SD to work, guess it should really be in the function?
setkey(bus, license_number, license_id)

## Initialize table of system times
system_times <- data.table(size = seq(10000, 50000, by=10000),
                           elapsed = NA_real_)

## Calculate times for each chunk size
for(i in 1:nrow(system_times)){
    print(i)
    cur_size <- system_times[i, size]
    cur_elapsed <- system.time(most_recent_bus_lic_slow(bus[1:cur_size]))[["elapsed"]]
    system_times[i, elapsed := cur_elapsed]
}
system_times

##------------------------------------------------------------------------------
## Plot system times
##------------------------------------------------------------------------------
plot(system_times)

##------------------------------------------------------------------------------
## Model and plot system times with estimated system time for nrow(bus)
##------------------------------------------------------------------------------
time_model <- lm(elapsed ~ size, system_times)
pred <- unname(predict(time_model, data.frame(size=nrow(bus)) ))

system_times <- rbind(system_times, 
                      data.table(size = nrow(bus), 
                                 elapsed = pred))
plot(system_times)

The plot of times looks linear now:
[plot: elapsed time vs subset size, approximately linear]

However, it still takes less than a second to subset with dummy variables, and it would take a long time with .SD (I estimated 235 seconds for 947707 rows).
[plot: measured times plus the extrapolated estimate for the full data set]

I don't know how this came to your attention, but strangely this came up with a colleague of mine just last week and I referred him to this issue. He said that he has a similar problem. I just got him on the data.table bandwagon.

We both agree, data.table is great, but I'm trying to do my part to document this performance issue.

I remember reading other things that @jangorecki and/or @arunsrinivasan wrote about the behind-the-scenes optimization and .SD performance. I couldn't find it again, and it was way over my head anyway.

Please let me know if you want a new issue, or if one's already open, or if you don't care. I'm not opening an issue unless prompted, thank you!

My current specs:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4

loaded via a namespace (and not attached):
 [1] httr_1.3.1       compiler_3.5.1   plyr_1.8.4       R6_2.2.2         tools_3.5.1      curl_3.2        
 [7] yaml_2.2.0       Rcpp_0.12.18     RSocrata_1.7.4-7 jsonlite_1.5     mime_0.5 

@franknarf1
Contributor

@geneorama Dupe of #613? (I'm not sure what that issue is about precisely, but it seems related.)

Btw, Arun answered it on SO: https://stackoverflow.com/a/31854111. Your use of .SD[x == max(x)] led me to think that you were selecting more than one row per group, but now that I'm looking at the data...

uniqueN(bus, by=c("license_number", "license_id")) == nrow(bus)
# TRUE

so sorting and dropping dupes is another option.

All approaches are fast except for j = .SD[condition] ... so don't use that one; study GForce; and keep an eye on #735 until #613 is done ...?

#dupe drop
system.time(unique(bus[order(license_id)], by="license_number", fromLast=TRUE))
# 0.54

# vs...
# your "slow" approach
system.time(bus[ , .SD[license_id == max(license_id)], by = license_number])
# 176.54 

# your dummy approach
system.time({
  bus[ , keep := license_id == max(license_id), by = license_number]
  res = bus[keep==TRUE][ , keep:=NULL][]
  bus[, keep := NULL ]
  res
})
# 0.39

# Arun's approach
system.time({
  dt.agg = bus[, .(license_id = max(license_id)), by = license_number]
  bus[dt.agg, on = c("license_number", "license_id")] 
})
# 0.29
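
(Since GForce comes up above: a hedged way to see whether it fires is verbose mode; max() by group, as in Arun's approach, is one of the calls it optimises:)

options(datatable.verbose = TRUE)
bus[, .(license_id = max(license_id)), by = license_number]  # output should mention GForce optimizing j
options(datatable.verbose = FALSE)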

@geneorama
Author

geneorama commented Sep 14, 2018

@franknarf1 look, you're way better at data.table than me. I love dt, but this level of detail hurts my brain. I'm quite happy to use it normally, then when something takes a long time (which is very rare) I work around it. I'm not above converting to data.frame, doing apply with MARGIN = 1, then converting back to data.table. So long as I don't have to load more libraries (I'm looking at you, tidyverse), I don't really care. It's not like I'm processing the entirety of Facebook's comment history.

Edit: thank you very much for your comments and analysis. Seriously, you guys are amazing. I read your answers all the time on SO.

Edit2: Actually, Arun's approach makes a lot of sense looking at this again. It's not that complicated. The linked issues are complicated though.
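
(For readers skimming: a minimal self-contained sketch of why the aggregate-then-join approach works, with toy data:)

library(data.table)
DT  <- data.table(g = c("a", "a", "b"), id = c(1, 3, 2), v = c("x", "y", "z"))
agg <- DT[, .(id = max(id)), by = g]  # one row per group holding the max id
DT[agg, on = c("g", "id")]            # join back: keeps exactly the rows where id is the group max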

@MichaelChirico
Member

@geneorama it looks like what you're saying is:

the workaround syntax feels much less natural than your first instinct. Therefore one or both of the following could help you as a "lay"/casual (for lack of a better phrasing off the top of my head) user:

  1. speed up the calculation in your original syntax
  2. improve vignettes/introductory material so that syntax like Arun's looks more natural/becomes your first instinct

I think the first is covered in the issues Frank linked.

The second might be filed separately or included in an existing issue about vignettes.

Thanks for your feedback :)
