Matt Dowle edited this page Feb 21, 2017 · 112 revisions

Help promote data.table and get this hexbin-standard sticker.

data.table is one of the 9,800 add-on packages for the programming language R which is popular in these fields. It provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.


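Queries take the general form DT[i, j, by]: subset rows using i, compute j, grouped by by. These queries can be chained together just by adding another one on the end, e.g. DT[...][...].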
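As a minimal sketch of a single query and a chained one: the table DT and its columns id and value are hypothetical, made up purely for illustration.

```r
library(data.table)

# Hypothetical example data; the column names are illustrative only.
DT <- data.table(
  id    = c("a", "a", "b", "b", "c"),
  value = c(1, 2, 3, 4, 5)
)

# One query: filter rows in i, compute in j, group with by.
totals <- DT[value > 1, .(total = sum(value)), by = id]

# Chained: a second query is appended to the first and runs on its result,
# here ordering the grouped totals from largest to smallest.
chained <- DT[value > 1, .(total = sum(value)), by = id][order(-total)]
print(chained)
```

Each additional pair of square brackets operates on the result of the previous query, so longer pipelines can be built the same way.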
See data.table compared to dplyr on Stack Overflow and Quora.

Other features include:

  • fast and friendly file reader: ?fread. It accepts system commands directly (such as grep and gunzip) and other convenience features for small data.
  • fast and parallelized file writer: ?fwrite, announced here and on CRAN in Nov 2016.
  • parallelized row subsets; see this benchmark for timings
  • fast aggregation of large data; e.g. 100GB in RAM (see benchmarks on up to two billion rows)
  • fast add/update/delete columns by reference by group using no copies at all
  • fast ordered joins; e.g. rolling forwards, backwards, nearest and limited staleness
  • fast overlapping range joins; similar to findOverlaps function from IRanges/GenomicRanges Bioconductor packages, but not limited to genomic (integer) intervals.
  • fast non-equi (or conditional) joins, i.e., joins using operators >, >=, <, <= as well, available from v1.9.8+
  • a fast primary ordered index; e.g. setkey(DT,col1,col2)
  • automatic secondary indexing; e.g. DT[col==val,] and DT[col %in% vals,]
  • fast and memory efficient combined join and group by; by=.EACHI
  • fast reshape2 methods (dcast and melt) without needing reshape2 and its dependency chain installed or loaded
  • group summary results may be many rows (e.g. first and last row by group) and each cell value may itself be a vector/object/function (e.g. unique ids by group as a list column of varying length vectors - this is pretty printed with commas)
  • special symbols built-in for convenience and raw speed by avoiding the overhead of function calls: .N, .SD, .I, .GRP and .BY
  • any R function from any R package can be used in queries, not just the subset of functions made available by a database backend
  • has no dependencies at all other than base R itself, for simpler production/maintenance
  • the R dependency is as old as possible for as long as possible and we test against that version; e.g., v1.9.8 released on 25-Nov-2016 bumped the dependency up from 4.5 year old R 2.14.0 to 3 year old R 3.0.0.
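A few of the features listed above can be sketched together in a short example. The tables trades and quotes, and all of their columns, are hypothetical; this is an illustration under those assumptions rather than a benchmark or a recommended workflow.

```r
library(data.table)

# Hypothetical data: trades and quotes per symbol, with integer times.
trades <- data.table(
  sym  = c("A", "A", "B", "B"),
  time = c(2L, 6L, 3L, 9L),
  qty  = c(10L, 20L, 30L, 40L)
)
quotes <- data.table(
  sym   = c("A", "A", "B"),
  time  = c(1L, 5L, 4L),
  price = c(100, 101, 200)
)

# Add a column by reference, by group: := modifies trades in place.
trades[, total_qty := sum(qty), by = sym]

# Special symbols: .N is the number of rows in each group,
# .SD is the subset of data for each group.
counts     <- trades[, .N, by = sym]
first_last <- trades[, .SD[c(1L, .N)], by = sym]  # first and last row by group

# Primary ordered index on quotes.
setkey(quotes, sym, time)

# Rolling join: for each trade, take the most recent quote at or before
# the trade time (roll = TRUE rolls the last observation forwards).
rolled <- quotes[trades, on = .(sym, time), roll = TRUE]
print(rolled$price)  # 100 101 NA 200 (no quote for B at or before time 3)
```

The NA in the rolled result shows that roll = TRUE only carries earlier quotes forwards; see ?data.table for the other roll options, such as rolling backwards or to the nearest value.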

Version 1.0 was released to CRAN in 2006. In June 2014 we moved from R-Forge to GitHub.

Guidelines for filing issues / pull requests: Contribution Guidelines.

As of 30 Dec 2016, data.table is the 3rd largest Stack Overflow tag about an R package, the 8th most starred R package on GitHub, has 321 CRAN and Bioconductor packages using it and is the #1 most directly downloaded R package on RStudio's CRAN mirror.