Efficiently remove rows #2890

Demetrio92 · 2018-05-18T09:48:08Z

Hey guys, I am struggling with removing rows from a data.table in an efficient way. With SQL I have DELETE FROM table_name WHERE condition, but with data.table I only find examples of people doing dt <- dt[condition]. This seems to be 2 time slower than base R data.frame...

Example:

library(microbenchmark)
library(data.table)

dt<-data.table(id=sample(letters,1000000,T),var=rnorm(1000000))
df<-data.frame(id=sample(letters,1000000,T),var=rnorm(1000000))
   
microbenchmark(dt <- dt[ id != "a"], df <- df[ df$id != "a",])

Unit: microseconds
                     expr     min       lq     mean  median      uq      max neval
      dt <- dt[id != "a"] 222.561 243.5625 374.6974 280.055 351.798 7184.560   100
 df <- df[df$id != "a", ]  85.059 108.7100 151.5715 127.215 173.659 1339.389   100

(borrowed and adjusted from here: https://stackoverflow.com/a/22655130/3494126)

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbenchmark_1.4-4 bindrcpp_0.2.2       subprocess_0.8.2     htmlwidgets_1.2     
 [5] googlesheets_0.2.2   raster_2.6-7         exifr_0.2.1          magick_1.8          
 [9] bazar_1.0.6          glue_1.2.0           lubridate_1.7.4      matrixStats_0.53.1  
[13] rgdal_1.2-18         sp_1.2-7             leaflet_2.0.0        lwgeom_0.1-4        
[17] sf_0.6-2             pbsat_0.1.17         xml2_1.2.0           data.table_1.10.4-3 
[21] forcats_0.3.0        stringr_1.3.0        dplyr_0.7.4          purrr_0.2.4         
[25] readr_1.1.1          tidyr_0.8.0          tibble_1.4.2         ggplot2_2.2.1.9000  
[29] tidyverse_1.2.1     

loaded via a namespace (and not attached):
 [1] nlme_3.1-137       RColorBrewer_1.1-2 httr_1.3.1         tools_3.4.4        R6_2.2.2          
 [6] spData_0.2.8.3     rgeos_0.3-26       DBI_0.8            lazyeval_0.2.1     colorspace_1.3-2  
[11] withr_2.1.2        gridExtra_2.3      mnormt_1.5-5       curl_3.2           compiler_3.4.4    
[16] cli_1.0.0          rvest_0.3.2        scales_0.5.0.9000  classInt_0.2-3     psych_1.8.3.3     
[21] digest_0.6.15      foreign_0.8-70     pkgconfig_2.0.1    htmltools_0.3.6    rlang_0.2.0.9001  
[26] readxl_1.1.0       rstudioapi_0.7     shiny_1.0.5        bindr_0.1.1        jsonlite_1.5      
[31] crosstalk_1.0.0    magrittr_1.5       kimisc_0.4         Rcpp_0.12.16       munsell_0.4.3     
[36] RPostgreSQL_0.6-2  viridis_0.5.1      stringi_1.1.7      yaml_2.1.18        jqr_1.0.0         
[41] plyr_1.8.4         grid_3.4.4         maptools_0.9-2     promises_1.0.1     crayon_1.3.4      
[46] geojsonio_0.6.0    udunits2_0.13      lattice_0.20-35    haven_1.1.1        geojson_0.2.0     
[51] hms_0.4.2          knitr_1.20         pillar_1.2.1       reshape2_1.4.3     V8_1.5            
[56] modelr_0.1.1       httpuv_1.4.1       cellranger_1.1.0   gtable_0.2.0       assertthat_0.2.0  
[61] mime_0.5           xtable_1.8-2       broom_0.4.4        e1071_1.6-8        later_0.7.1       
[66] viridisLite_0.3.0  class_7.3-14       memoise_1.1.0      units_0.5-1

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2018-05-18T10:40:56Z

If you do it on bigger data, the timing is reversed, and more.

library(microbenchmark)
library(data.table)

NN = 1e7
dt<-data.table(id=sample(letters,NN,T),var=rnorm(NN))
df<-data.frame(id=sample(letters,NN,T),var=rnorm(NN))
   
microbenchmark(dt <- dt[ id != "a"], df <- df[ df$id != "a",])
# Unit: milliseconds
#                      expr      min       lq     mean   median        uq       max neval
#       dt <- dt[id != "a"] 148.0570 165.3594 210.1427 191.2038  229.1141  389.5252   100
#  df <- df[df$id != "a", ] 810.7681 877.9762 944.7228 916.4006 1025.6018 1099.8863   100

(in fact my timings show data.table faster on NN = 1e6 as well).

All goes to show that benchmarking with small-millisecond-or-smaller timings is not particularly useful.

Please update your data.table. 1.11.2 is on CRAN
See the benchmarking vignette
See Delete rows by reference #635 (your question is duplicate of this)

MichaelChirico closed this as completed May 18, 2018

jangorecki added the duplicate label May 18, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Efficiently remove rows #2890

Efficiently remove rows #2890

Demetrio92 commented May 18, 2018

MichaelChirico commented May 18, 2018 •

edited by jangorecki

Efficiently remove rows #2890

Efficiently remove rows #2890

Comments

Demetrio92 commented May 18, 2018

MichaelChirico commented May 18, 2018 • edited by jangorecki

MichaelChirico commented May 18, 2018 •

edited by jangorecki