Vignettes update, closes #2952 #2954

Merged
merged 5 commits on Sep 4, 2018
Changes from 2 commits
106 changes: 81 additions & 25 deletions vignettes/datatable-benchmarking.Rmd
@@ -11,15 +11,9 @@ vignette: >
\usepackage[utf8]{inputenc}
---

This document is meant to guide measuring the performance of `data.table`: a single place to document best practices and traps to avoid.

# fread: clear caches

Ideally, each `fread` call should be run in a fresh session with the following commands preceding R execution. This clears the OS file cache in RAM and the HD cache.

@@ -30,42 +24,104 @@

```sh
sudo lshw -class disk
sudo hdparm -t /dev/sda
```

When comparing `fread` to non-R solutions, be aware that R requires values of character columns to be added to _R's global string cache_. This adds significant overhead because the R API for this is not thread-safe, so it cannot be done in parallel.
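
A minimal sketch of this effect (made-up throwaway files; the size of the gap will vary by machine and data):

```r
library(data.table)
N = 1e7L
fwrite(data.table(a = rnorm(N)), "num.csv")                 # numeric column
fwrite(data.table(a = sample(letters, N, TRUE)), "chr.csv") # character column
system.time(fread("num.csv"))  # parsing can be done fully in parallel
system.time(fread("chr.csv"))  # values must also enter R's global string cache
```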

# subset: threshold for index optimization on compound queries

Index optimization for compound filter queries will not be used when the cross product of the elements provided to filter on exceeds 1e4 elements.

```r
DT = data.table(V1=1:10, V2=1:10, V3=1:10, V4=1:10)
setindex(DT)
v = c(1L, rep(11L, 9))
length(v)^4 # cross product of elements in filter
#[1] 10000 # <= 10000
DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#Optimized subsetting with index 'V1__V2__V3__V4'
#on= matches existing index, using index
#Starting bmerge ...done in 0.000sec
#...
v = c(1L, rep(11L, 10))
length(v)^4 # cross product of elements in filter
#[1] 14641 # > 10000
DT[V1 %in% v & V2 %in% v & V3 %in% v & V4 %in% v, verbose=TRUE]
#Subsetting optimization disabled because the cross-product of RHS values exceeds 1e4, causing memory problems.
#...
```

# subset: index aware benchmarking

For convenience `data.table` automatically builds an index on the fields you use to subset data. It will add some overhead to the first subset on particular fields but greatly reduces the time to query those columns in subsequent runs. When measuring speed, the best way is to measure index creation and query using an index separately. Having such timings, it is easy to decide what the optimal strategy is for your use case.
Member


grammar:

  • convenience
  • on fields you are doing subset data -> on fields you use to subset data

Member Author


Push to branch please :)


Some other possible small grammar improvements:

builds index => builds an index
greatly reduce time => greatly reduces time
speed best way => speed, the best way
using index separately => using an index separately

To control the usage of indices, use the following options:

```r
options(datatable.auto.index=TRUE)
options(datatable.use.index=TRUE)
```

- `use.index=FALSE` will force the query not to use indices even if they exist, but existing keys are still used for optimization.
- `auto.index=FALSE` disables building an index automatically when doing a subset on non-indexed data, but indices created before this option was set, or created explicitly by calling `setindex`, will still be used for optimization (see the sketch below).
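
As a minimal sketch (made-up table and column names) of the `auto.index` behaviour described above:

```r
library(data.table)
options(datatable.auto.index = FALSE)
DT = data.table(id = 1:1e6, grp = sample(letters, 1e6, TRUE))
DT[grp == "a", verbose = TRUE]   # vector scan: no index is built automatically now
setindex(DT, grp)                # build the index explicitly
DT[grp == "a", verbose = TRUE]   # the existing index is still used for this subset
options(datatable.auto.index = TRUE)  # restore the default
```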

Two other options control optimization globally, including use of indices:
```r
options(datatable.optimize=2L)
options(datatable.optimize=3L)
```
`options(datatable.optimize=2L)` will turn off optimization of subsets completely, while `options(datatable.optimize=3L)` will switch it back on.
These options affect many more optimizations and thus should not be used when only control of indices is needed. Read more in `?datatable.optimize`.

# _by reference_ operations

When benchmarking `set*` functions it makes sense to measure only the first run. These functions update the `data.table` by reference, thus in subsequent runs they receive an already-processed `data.table` as input.

Protecting your `data.table` from being updated by reference operations can be achieved using the `copy` or `data.table:::shallow` functions. Be aware that `copy` might be very expensive as it needs to duplicate the whole object. It is unlikely we want to include the duplication time in the timing of the actual task we are benchmarking.
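
For example, a sketch (made-up data) that keeps the duplication cost visible and separate from the measured `set*` call; only the first `setkey` run does the real work:

```r
library(data.table)
DT = data.table(x = runif(1e7))
system.time(DT2 <- copy(DT))  # duplication cost, reported separately
system.time(setkey(DT2, x))   # first run does the actual sorting
system.time(setkey(DT2, x))   # subsequent runs find the data already keyed
```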

# try to benchmark atomic processes

If your benchmark is meant to be published, it will be much more insightful if you split it to measure the time of atomic processes. This way your readers can see how much time was spent on reading the data from source, cleaning, the actual transformation, and exporting the results.
Of course, if your benchmark is meant to present a _full workflow_, then it makes perfect sense to present the total timing; still, splitting the timings might give good insight into the bottlenecks in such a workflow.
There are other cases when it might not be desirable, for example when benchmarking _reading a csv_ followed by _grouping_. R requires populating _R's global string cache_, which adds extra overhead when importing character data to an R session. On the other hand, the _global string cache_ might speed up processes like _grouping_. In such cases, when comparing R to other languages, it might be useful to include the total timing.
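
A sketch of such a split, with a hypothetical `data.csv` and made-up column names `id` and `value`:

```r
library(data.table)
t_read  = system.time(DT <- fread("data.csv"))
t_clean = system.time(DT[, value := as.numeric(value)])
t_group = system.time(res <- DT[, .(total = sum(value)), by = id])
t_write = system.time(fwrite(res, "result.csv"))
rbind(t_read, t_clean, t_group, t_write)  # report each step, not only the total
```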

# avoid class coercion

Unless this is what you truly want to measure, you should prepare input objects for every tool you are benchmarking in the expected class.
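
For example, a sketch (made-up data) contrasting a benchmark that coerces a `data.frame` inside the timed expression with one that prepares the `data.table` up front:

```r
library(data.table)
df = data.frame(x = runif(1e7), g = sample(1e3L, 1e7, TRUE))
dt = as.data.table(df)                              # prepare input in the expected class beforehand
system.time(dt[, mean(x), by = g])                  # measures only the aggregation
system.time(as.data.table(df)[, mean(x), by = g])   # also measures the coercion
```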

# avoid `microbenchmark(..., times=100)`

Repeating a benchmark many times usually does not fit well for data processing tools. Of course, it makes perfect sense for more atomic calculations, but it does not represent well the use case for common data processing tasks, which rather consist of batches of sequentially provided transformations, each run once.
Matt once said:

> I'm very wary of benchmarks measured in anything under 1 second. Much prefer 10 seconds or more for a single run, achieved by increasing data size. A repetition count of 500 is setting off alarm bells. 3-5 runs should be enough to convince on larger data. Call overhead and time to GC affect inferences at this very small scale.

## multithreaded processing
This is very valid. The smaller the time measurement is, the relatively bigger the noise is: noise generated by method dispatch, package/class initialization, etc. The main focus of a benchmark should be on real use case scenarios.
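
A sketch of that advice, with a made-up size chosen so that a single run takes seconds rather than milliseconds:

```r
library(data.table)
N = 1e8L  # hypothetical size; scale it to your machine
DT = data.table(g = sample(1e5L, N, TRUE), v = rnorm(N))
for (i in 1:3) print(system.time(DT[, sum(v), by = g]))  # 3-5 runs instead of times=100
```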

# multithreaded processing

One of the main factors likely to impact timings is the number of threads on your machine. In recent versions of `data.table` some of the functions have been parallelized.
You can control how many threads you want to use with `setDTthreads`.

```r
setDTthreads(0) # use all available cores (default)
getDTthreads() # check how many cores are currently used
```
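
If the effect of thread count is itself what you want to study, a rough sketch (made-up data) is to repeat the same measurement at a few `setDTthreads` settings:

```r
library(data.table)
DT = data.table(g = sample(1e5L, 1e8L, TRUE), v = rnorm(1e8L))  # hypothetical data
for (th in c(1L, 2L, 4L)) {
  setDTthreads(th)
  cat(th, "thread(s):\n")
  print(system.time(DT[, sum(v), by = g]))
}
setDTthreads(0)  # back to the default: all available cores
```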

# inside a loop prefer `set` instead of `:=`

Unless you are utilizing an index when doing _sub-assign by reference_, you should prefer the `set` function, which does not impose the overhead of the `[.data.table` method call.

```r
DT = data.table(a=3:1, b=letters[1:3])
setindex(DT, a)

# for (...) { # imagine loop here

DT[a==2L, b := "z"] # sub-assign by reference, uses index
DT[, d := "z"] # not sub-assign by reference, not uses index and adds overhead of `[.data.table`
set(DT, j="d", value="z") # no `[.data.table` overhead, but no index yet, till #1196

# }
```

# inside a loop prefer `setDT` instead of `data.table()`

As of now `data.table()` has an overhead, thus inside loops it is preferred to use `as.data.table()` or `setDT()` on a valid list.
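
A sketch of the pattern (hypothetical row-building loop): build plain lists and convert them with `setDT`, or collect them and combine once with `rbindlist`, instead of calling `data.table()` on every iteration.

```r
library(data.table)
out = vector("list", 1000L)
for (i in seq_along(out)) {
  row = list(id = i, val = sqrt(i))
  out[[i]] = setDT(row)          # cheaper than data.table(id = i, val = sqrt(i))
}
ans = rbindlist(out)             # combine once at the end
```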
19 changes: 14 additions & 5 deletions vignettes/datatable-keys-fast-subset.Rmd
@@ -86,7 +86,7 @@ i.e., row names are more or less *an index* to rows of a *data.frame*. However,
```{r eval = FALSE}
rownames(DF) = sample(LETTERS[1:5], 10, TRUE)
# Warning: non-unique values when setting 'row.names': 'C', 'D'
# Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
```

Now let's convert it to a *data.table*.
@@ -420,7 +420,15 @@ we could have done:

```{r eval = FALSE}
flights[origin == "JFK" & dest == "MIA"]
```

One advantage very likely is shorter syntax. But even more than that, *binary search based subsets* are **incredibly fast**.

As time goes on, `data.table` gains new optimizations, and currently the latter call is automatically optimized to use *binary search*.
To use the slow *vector scan*, the key needs to be removed.

```{r eval = FALSE}
setkey(flights, NULL)
flights[origin == "JFK" & dest == "MIA"]
```

### a) Performance of binary search approach

@@ -431,17 +439,16 @@

```{r}
set.seed(2L)
N = 2e7L
DT = data.table(x = sample(letters, N, TRUE),
y = sample(1000L, N, TRUE),
val = runif(N))
print(object.size(DT), units = "Mb")

```

`DT` is ~380MB. It is not really huge, but this will do to illustrate the point.

From what we have seen in the Introduction to data.table section, we can subset those rows where columns `x = "g"` and `y = 877` as follows:

```{r}
key(DT)
## (1) Usual way of subsetting - vector scan approach
t1 <- system.time(ans1 <- DT[x == "g" & y == 877L])
t1
```

@@ -452,6 +459,8 @@ dim(ans1)
Now let's try to subset by using keys.

```{r}
setkeyv(DT, c("x", "y"))
key(DT)
## (2) Subsetting using keys
t2 <- system.time(ans2 <- DT[.("g", 877L)])
t2
```
8 changes: 4 additions & 4 deletions vignettes/datatable-secondary-indices-and-auto-indexing.Rmd
@@ -276,9 +276,9 @@ flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", on = c("origin", "dest")]

## 3. Auto indexing

First we looked at how to subset quickly using binary search with *keys*. Then we figured out that we could improve performance even further and get cleaner syntax by using secondary indices.

That is what *auto indexing* does. At the moment, it is only implemented for binary operators `==` and `%in%`. An index is automatically created *and* saved as an attribute. That is, unlike the `on` argument which computes the index on the fly each time (unless one already exists), a secondary index is created here.

Let's start by creating a data.table big enough to highlight the advantage.

@@ -320,8 +320,8 @@ system.time(dt[x %in% 1989:2012])

#

In recent versions we have extended auto indexing to expressions involving more than one column (combined with the `&` operator). In the future, we plan to extend binary search to work with more binary operators like `<`, `<=`, `>` and `>=`.

We will discuss extending fast *subsets* using keys and secondary indices to *joins* in the next vignette, *"Joins and rolling joins"*.

***