Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficiently remove rows #2890

Closed
Demetrio92 opened this issue May 18, 2018 · 1 comment
Closed

Efficiently remove rows #2890

Demetrio92 opened this issue May 18, 2018 · 1 comment

Comments

@Demetrio92
Copy link

Hey guys, I am struggling with removing rows from a data.table in an efficient way. With SQL I have DELETE FROM table_name WHERE condition, but with data.table I only find examples of people doing dt <- dt[condition]. This seems to be 2 time slower than base R data.frame...


Example:

library(microbenchmark)
library(data.table)

dt<-data.table(id=sample(letters,1000000,T),var=rnorm(1000000))
df<-data.frame(id=sample(letters,1000000,T),var=rnorm(1000000))
   
microbenchmark(dt <- dt[ id != "a"], df <- df[ df$id != "a",])
Unit: microseconds
                     expr     min       lq     mean  median      uq      max neval
      dt <- dt[id != "a"] 222.561 243.5625 374.6974 280.055 351.798 7184.560   100
 df <- df[df$id != "a", ]  85.059 108.7100 151.5715 127.215 173.659 1339.389   100

(borrowed and adjusted from here: https://stackoverflow.com/a/22655130/3494126)


> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbenchmark_1.4-4 bindrcpp_0.2.2       subprocess_0.8.2     htmlwidgets_1.2     
 [5] googlesheets_0.2.2   raster_2.6-7         exifr_0.2.1          magick_1.8          
 [9] bazar_1.0.6          glue_1.2.0           lubridate_1.7.4      matrixStats_0.53.1  
[13] rgdal_1.2-18         sp_1.2-7             leaflet_2.0.0        lwgeom_0.1-4        
[17] sf_0.6-2             pbsat_0.1.17         xml2_1.2.0           data.table_1.10.4-3 
[21] forcats_0.3.0        stringr_1.3.0        dplyr_0.7.4          purrr_0.2.4         
[25] readr_1.1.1          tidyr_0.8.0          tibble_1.4.2         ggplot2_2.2.1.9000  
[29] tidyverse_1.2.1     

loaded via a namespace (and not attached):
 [1] nlme_3.1-137       RColorBrewer_1.1-2 httr_1.3.1         tools_3.4.4        R6_2.2.2          
 [6] spData_0.2.8.3     rgeos_0.3-26       DBI_0.8            lazyeval_0.2.1     colorspace_1.3-2  
[11] withr_2.1.2        gridExtra_2.3      mnormt_1.5-5       curl_3.2           compiler_3.4.4    
[16] cli_1.0.0          rvest_0.3.2        scales_0.5.0.9000  classInt_0.2-3     psych_1.8.3.3     
[21] digest_0.6.15      foreign_0.8-70     pkgconfig_2.0.1    htmltools_0.3.6    rlang_0.2.0.9001  
[26] readxl_1.1.0       rstudioapi_0.7     shiny_1.0.5        bindr_0.1.1        jsonlite_1.5      
[31] crosstalk_1.0.0    magrittr_1.5       kimisc_0.4         Rcpp_0.12.16       munsell_0.4.3     
[36] RPostgreSQL_0.6-2  viridis_0.5.1      stringi_1.1.7      yaml_2.1.18        jqr_1.0.0         
[41] plyr_1.8.4         grid_3.4.4         maptools_0.9-2     promises_1.0.1     crayon_1.3.4      
[46] geojsonio_0.6.0    udunits2_0.13      lattice_0.20-35    haven_1.1.1        geojson_0.2.0     
[51] hms_0.4.2          knitr_1.20         pillar_1.2.1       reshape2_1.4.3     V8_1.5            
[56] modelr_0.1.1       httpuv_1.4.1       cellranger_1.1.0   gtable_0.2.0       assertthat_0.2.0  
[61] mime_0.5           xtable_1.8-2       broom_0.4.4        e1071_1.6-8        later_0.7.1       
[66] viridisLite_0.3.0  class_7.3-14       memoise_1.1.0      units_0.5-1   
@MichaelChirico
Copy link
Member

MichaelChirico commented May 18, 2018

If you do it on bigger data, the timing is reversed, and more.

library(microbenchmark)
library(data.table)

NN = 1e7
dt<-data.table(id=sample(letters,NN,T),var=rnorm(NN))
df<-data.frame(id=sample(letters,NN,T),var=rnorm(NN))
   
microbenchmark(dt <- dt[ id != "a"], df <- df[ df$id != "a",])
# Unit: milliseconds
#                      expr      min       lq     mean   median        uq       max neval
#       dt <- dt[id != "a"] 148.0570 165.3594 210.1427 191.2038  229.1141  389.5252   100
#  df <- df[df$id != "a", ] 810.7681 877.9762 944.7228 916.4006 1025.6018 1099.8863   100

(in fact my timings show data.table faster on NN = 1e6 as well).

All goes to show that benchmarking with small-millisecond-or-smaller timings is not particularly useful.

  1. Please update your data.table. 1.11.2 is on CRAN
  2. See the benchmarking vignette
  3. See Delete rows by reference #635 (your question is duplicate of this)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants