X[Y, on = ...] significantly slower than X[Y] for the same join fields #1825
Following recommended practice, I've begun swapping from
After making this change to some code using a 40 million row data.table, I was surprised by how much performance degraded. It appears that
Here's an example:
Neither changing to
A fivefold slowdown for semantically identical code seems a bit much. While I appreciate the flexibility that the new join system brings, I don't understand why it cannot fall back gracefully to the existing merge join when the full set of key columns is specified.
@sz-cgt That's not a good example. Come up with one where timings are at least 1 second (instead of single digit millis) and then it'll be interesting.
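One way to get timings out of the single-digit millisecond range is to repeat each join many times inside `system.time()`. The following is a minimal sketch, not the benchmark from this issue: the table sizes and the repeat count `n = 200L` are arbitrary assumptions, chosen so it runs quickly.

```r
library(data.table)
set.seed(1L)

# Smaller stand-ins for the tables discussed in this issue (sizes are assumptions).
DT  <- data.table(a = sample(10000L, 1e6, replace = TRUE), b = seq_len(1e6), key = "a")
DT2 <- data.table(a = sample(DT$a, 100L), key = "a")

# Repeat each join n times so the total elapsed time is large enough to compare.
n <- 200L
t_key <- system.time(for (i in seq_len(n)) DT[DT2])[["elapsed"]]
t_on  <- system.time(for (i in seq_len(n)) DT[DT2, on = key(DT)])[["elapsed"]]
c(keyed = t_key, on = t_on)

# Both forms should return the same rows; attributes such as the key are ignored here.
same_rows <- isTRUE(all.equal(DT[DT2], DT[DT2, on = key(DT)], check.attributes = FALSE))
same_rows
```

Dividing each total by `n` gives a per-join estimate that is less sensitive to timer resolution than a single call.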
There is (obviously) more processing that gets done with more arguments in
Nice catch. When the columns to join on are the key columns of a data.table, the order vector need not be computed. Similarly, when a secondary index already exists, it can be reused. Until now, however, this logic was implemented only for secondary indices, so on keyed joins the order vector was computed again. The first step before computing the order is to check whether the vector is already sorted. In this case that check reports that it is, and the time spent on the check is the extra time you see in your benchmark.
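The decision described here can be sketched in R. This is only an illustration and not data.table's internal code: `can_skip_order()` is a hypothetical helper that asks whether the `on` columns are the leading columns of the key, or match an existing secondary index, in which case no new order vector would be needed.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE when a join on `on_cols`
# could reuse the key or an existing secondary index of `x` instead of
# computing a fresh order vector.
can_skip_order <- function(x, on_cols) {
  k <- key(x)  # NULL when x has no key
  # The key helps only if the join columns are its leading columns, in order.
  uses_key <- length(on_cols) > 0L && length(on_cols) <= length(k) &&
    identical(on_cols, k[seq_along(on_cols)])
  # indices() names a multi-column index by joining the column names with "__".
  uses_index <- paste(on_cols, collapse = "__") %in% indices(x)
  uses_key || uses_index
}

DT <- data.table(a = c(3L, 1L, 2L), b = c("x", "y", "z"), c = 1:3)
setkey(DT, a)    # physically sorts DT by a
setindex(DT, b)  # stores an order vector for b without reordering DT

can_skip_order(DT, "a")  # join column is the key
can_skip_order(DT, "b")  # join column has a secondary index
can_skip_order(DT, "c")  # neither, so an order vector would have to be computed
```

The sortedness check mentioned above is a linear scan, which is why it shows up as measurable overhead on a 40 million row table even though the data is already in key order.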
```r
require(data.table)
set.seed(1L)
DT = data.table(a = as.integer(runif(40e6L, 1, 10000)), b = 1:1e6L, key = 'a')
DT2 = data.table(a = sample(DT$a, 100), key = 'a')
setkey(DT2, a)

system.time(DT[DT2])              # 0.005s
system.time(DT[DT2, on=key(DT)])  # 0.170s
```

Now I get:

```r
system.time(DT[DT2])
#    user  system elapsed
#   0.008   0.000   0.003
system.time(DT[DT2, on=key(DT)])
#    user  system elapsed
#   0.010   0.000   0.004
identical(DT[DT2], DT[DT2, on=key(DT)])
# [1] TRUE
```
Fix on its way.