Convenience features of fread

Matt Dowle edited this page Feb 16, 2017 · 24 revisions

The fread function in the data.table R package is not just for speed on large files (although it is good for that too). Here we highlight its convenience features for small data. Full details of all arguments and abilities are on the online manual page. A brief introduction at useR!2014 is on youtube here.

1. Using command line tools directly

For example, from this answer :

fread("grep -v TRIAL sum_data.txt")
#      V1   V2 V3      V4      V5      V6      V7
#  1:   2  0.1  0 -0.0047 -0.0168 -0.9938 -0.0087
#  2:   2  0.2  0 -0.0121  0.0002 -0.9898 -0.0364
#  3:   2  0.3  0  0.0193 -0.0068 -0.9884  0.0040
   ...

The -v makes grep return all lines except lines containing the string TRIAL. Given the number of high quality engineers that have looked at the command tool grep over the years, it is most likely that it is as fast as you can get, as well as being correct, convenient, well documented online, easy to learn and search for solutions for specific tasks. If you need to do more complicated string filters (e.g. strings at the beginning or the end of the lines, etc) then grep syntax is very powerful. Learning its syntax is a transferable skill to other languages and environments.

Another example, from this answer :

fread('unzip -cq mtcars.csv.zip')
#                      V1  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#  1:           Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#  2:       Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#  ...

An example using chained commands from this answer and tweet :

fread('cat *dat.gz | gunzip | grep -v "^Day"')

The manual page ?fread contains :

input ..., a shell command that preprocesses the file (e.g. fread("grep blah filename")), ...

On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep. In March 2016, Microsoft announced they will include these tools in Windows 10 natively. On Linux and Mac they have always been included in the operating system. You can find many examples and tutorials online about command line tools. We recommend Data Science at the Command Line. Some advantages of this approach were highlighted in slides 10-12 here.

2. Automatic separator detection

Given a tab, comma, pipe, space, colon or semi-colon delimited file, you don't need to find or remember the function name or argument name to use. fread is a single command that will look at its input and observe the separator for you. All you as the user has to do is pass the file name to the function. The rest is done for you. Of course, if you do wish to override the separator, or use a very unusual one, you can pass it to the sep= parameter.

The manual page ?fread contains :

sep The separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted ("") regions, and separates the rows above autostart into a consistent number of fields too.

3. High quality automatic column type detection

Some files start off with blanks or zeros in some columns which only become populated later in the file. For this reason fread doesn't just look at the top of the file but it jumps to 10 equally spaced points in the file including the very end and reads 100 rows from each point. It is possible to jump directly to these points efficiently because fread uses the operating system's mmap which does not need to read from the beginning of the file to get to a particular point. Before v1.9.8 it used to read only the top 5, middle 5 and bottom 5 rows. The increase from 15 to 1,000 rows used for type detection is not expected to add any noticeable overhead because even 1,000 rows is not very large.

The manual page ?fread contains :

A sample of 1,000 rows is used to determine column types (100 rows from 10 points). The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a higher type in rows outside the sample. In that case, the column types are bumped mid read and the data read on previous rows is coerced. Setting verbose=TRUE reports the line and field number of each mid read type bump, and how long this type bumping took (if any).

4. Reading SQL insert scripts

There is no built-in feature for reading SQL insert scripts but they are easily handled with command line pre-processing. For example, given SQL insert script insert_script.sql:

INSERT INTO tbl VALUES (1, 'asd', 923123123, 'zx');
INSERT INTO tbl VALUES (1, NULL, 923123123, 'zxz');
INSERT INTO tbl VALUES (3, 'asd3', 923123123, NULL);

you can fread it using the following command (from this question):

fread('awk -F\' *[(),]+ *\' -v OFS=, \'{for (i=2;i<NF;i++) printf "%s%s", ($i=="NULL"?"":$i), (i<(NF-1)?OFS:ORS)}\' insert_script.sql')
#    V1     V2        V3    V4
# 1:  1  'asd' 923123123  'zx'
# 2:  1        923123123 'zxz'
# 3:  3 'asd3' 923123123      

5. integer64 support

fread automatically detects large integers (> 2^31) and reads them as type integer64 from the bit64 package. This retains full precision by storing the integers as true and accurate integers. This default behaviour can be overridden on a per-column basis using the colClasses argument, for all columns in the call to fread using its integer64 argument, or always by setting options(datatable.integer64="double") or options(datatable.integer64="character").

The manual page ?fread contains:

integer64 "integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as base::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".

To know that the default can be changed globally (e.g. in your personal .Rprofile file), you can look at the Usage section of ?fread which contains the default arguments :

integer64=getOption("datatable.integer64") # default: "integer64"

6. drop columns by name or number, as well as select by name or number

Rather than read all columns in to R (using up memory) only to use a subset of columns in your analysis, you can use these arguments to tell fread to skip over a set of columns. If you only have a few to discard, it's easier to just drop those. Easier because you don't have to type out the columns to keep or think how to write an R expression that correctly results in the columns to keep. fread will warn you if you specify a column that isn't present in the file. You use either select or drop but not both at the same time.

The manual page ?fread contains:

select Vector of column names or numbers to keep, drop the rest.

drop Vector of column names or numbers to drop, keep the rest.

7. Empty lines and header text is automatically detected and skipped by default

This is done by starting on line 30 (by default) and searching upwards for the top of the regular tabular data. Particularly useful if you are reading a set of, say, 910 files and the header information varies in size at the top: rbindlist(lapply(files,fread)) will automatically find the varying location of the header line for you and create a single data.table result stacking the 910 tables.

The manual page ?fread contains:

autostart Any line number within the region of machine readable delimited text, by default 30. If the file is shorter or this line is empty (e.g. short files with trailing blank lines) then the last non empty line (with a non empty line above that) is used. This line and the lines above it are used to auto detect sep, sep2 and the number of fields. It's extremely unlikely that autostart should ever need to be changed, we hope.

8. Footer information is automatically discarded with warning

The known common case is the "Rowcount: <n>" trailing line that SQL select statements tend to emit. Consider the following input.

A,B
1,3
2,4
Rowcount: 2

data.table::fread returns the correct result, with warning about the footer being disarded.

> fread("A,B\n1,3\n2,4\nRowcount: 2\n")
   A B
1: 1 3
2: 2 4
Warning message:
In fread("A,B\n1,3\n2,4\nRowcount: 2\n") :
  Stopped reading at line 4 but text exists afterwards (discarded): Rowcount: 2

readr::read_csv does not appear to detect the footer. It treats the first column as character to fit the footer into the first column.

> read_csv("A,B\n1,3\n2,4\nRowcount: 2\n")
Warning: 1 parsing failure.
row col  expected    actual
  3  -- 2 columns 1 columns

# A tibble: 3 × 2
            A     B
        <chr> <int>
1           1     3
2           2     4
3 Rowcount: 2    NA

9. Read any one of multiple tables present in a single file

There are two ways provided to achieve this.

  1. Set autostart= to any line number within the table you wish to extract. It searches up and down from that line to find the extent of that table. Blank lines around the table typical define its extent but any inconsistent number of fields (such as a header or footer) is also detected and considered an extent.

  2. Set skip= to a character string found anywhere within the table's header row you wish to start on. This could be a column name unique to that table, or a string anywhere on the line just before it.

The manual page ?fread contains:

skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line (inspired by read.xls in package gdata).