[R-Forge #5360] Add fill=T to fread #536

arunsrinivasan opened this Issue Jun 8, 2014 · 7 comments

Submitted by: Michele Carriero; Assigned to: Nobody; R-Forge link

Since this option is being added to rbind I wonder if it could be added to fread too, in order to reflect the read.table feature.
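For reference, the read.table behaviour being asked for can be sketched with base R alone. The three-line input below is made up for illustration, and the temporary file is only there to keep the example self-contained.

```r
# Made-up ragged input: rows have 3, 2 and 1 fields respectively.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,2,3", "4,5", "6"), tmp)

# Without fill = TRUE, read.table() raises an error on the short rows;
# the error is caught here so the script keeps running.
strict <- tryCatch(read.table(tmp, sep = ","), error = function(e) NULL)

# With fill = TRUE, short rows are padded; numeric columns get NA.
filled <- read.table(tmp, sep = ",", fill = TRUE)
print(filled)

unlink(tmp)
```

The request in this issue is for fread to offer the same padding of short rows.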


Requested here as well (on a file from CBOE):

It seems likely that Excel generates such files. They could potentially be quite large, so worth fread supporting: http://stackoverflow.com/questions/25339552/how-to-read-cboe-csv-file-using-data-table/25341502?noredirect=1#comment39526821_25341502


For what it is worth, our problematic dataset is data from Clinical Practice Research Datalink in the UK (the additional clinical details file where things like blood pressure, cholesterol, body weight, etc. are stored). Very commonly used in epidemiology and health services research. That one is not excel-based.


I often have data dumps taken from ad server data in a list format.
This list is a KEY:VALUE setup where the VALUE is itself a tuple.
This is a very good setup for storing large amounts of data (and one I see implemented a lot).

Reading the following with a fill=T flag would create a binary matrix:

USER1, [a, b, c, d]
USER2, [b,e,f]
USER3, [a,b,c]


      a   b   c   d   e   f
USER1 1   1   1   1   0   0
USER2 0   1   0   0   1   1
USER3 1   1   1   0   0   0
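As a rough sketch of the transformation shown above, the same binary matrix can be built in base R from the raw lines. This is only an illustration on the three example rows, not how fread works internally; the variable names are made up, and the splitting pattern assumes values never contain spaces, commas or square brackets.

```r
# The three example rows in the KEY, [v1, v2, ...] layout.
lines <- c("USER1, [a, b, c, d]", "USER2, [b,e,f]", "USER3, [a,b,c]")

# Split on runs of spaces, commas and square brackets:
# the first token is the key, the remaining tokens are its values.
tokens <- strsplit(lines, "[ ,\\[\\]]+", perl = TRUE)
keys   <- vapply(tokens, `[`, character(1), 1)
values <- lapply(tokens, `[`, -1)

# Long (key, value) pairs, then a 0/1 incidence matrix.
long <- data.frame(key = rep(keys, lengths(values)), value = unlist(values))
m <- 1L * (unclass(table(long$key, long$value)) > 0)
print(m)
```

For large inputs a sparse representation would be more appropriate than the dense matrix built here.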

Now with a quick awk script I can transform these:

awk -F '[ ,\\[\\]]+' '{for (i=2; i<NF; i++) print $1,$i}' $1 >> "transformed_$1"

I am then able to use fread and post-process (I personally read into a sparse data matrix).

But the use case is obviously much more about saving having to AWK data files prior to reading and then converting.

This proves to be significantly faster than something like:

ReadMaxCSVCols <- function(f, sep = ",", quote = "\"'", header = FALSE, ...) {
  # The widest row in the file determines how many columns to create
  nc <- max(count.fields(f, sep = sep, quote = quote))
  read.table(f,
             sep = sep,
             quote = quote,
             header = header,
             fill = TRUE,
             col.names = paste("V", 1:nc, sep = ""),
             ...)
}
foo <- data.table(ReadMaxCSVCols("myfile.txt"))

@mpearmain Thanks, really useful. Tuple columns like VALUE are what sep2= was intended for: the VALUE column would be read into a list column. Would that work for you? The original use case for sep2= was columns 11 and 12 of BED files in genomics (they are vectors of integers iirc, separated within a field by a different separator than the one between fields). If your VALUE field really is wrapped with [ ] like that (or similar), then that could be coded in fread as an option where sep==sep2, i.e. both comma.
We could do fill=T as well; it's just that reading VALUE into a list column might be better. It really depends on what operations you need to do on it afterwards.
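To make the list-column idea concrete, here is a hand-built sketch in base R using the three example rows from earlier in the thread. It does not call fread or sep2=; it only shows what a VALUE list column would look like once read, and the kind of operations it supports. The column names KEY and VALUE are taken from the description above.

```r
# Hand-built stand-in for the result of reading VALUE into a list column.
df <- data.frame(KEY = c("USER1", "USER2", "USER3"))
df$VALUE <- list(c("a", "b", "c", "d"), c("b", "e", "f"), c("a", "b", "c"))

# Each cell holds a whole vector, so per-row operations stay simple.
n_values <- lengths(df$VALUE)
has_e    <- vapply(df$VALUE, function(v) "e" %in% v, logical(1))

print(n_values)
print(has_e)
```

Whether this is preferable to padded columns depends, as noted, on the operations needed afterwards.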


@markdanese Great, yes very useful to know, thanks. Could you post a link to a sample file perhaps (or a made-up example of 3 or 4 lines that's close would be great). I had a look at http://www.cprd.com/ and it seems huge and varied ... and interesting. We could do fill=TRUE, but might sep2= into a list column be better and work for you? See new comments above.


The list probably won't help. It is a simple flat file and would probably be easiest as columns -- to create a complete table.

I took a small file and changed individual digits randomly so that this is not identifiable. This dropbox link should allow you to get the .txt file:

Thanks for your help, and let me know if this file doesn't work.


Hi Matt,

I think you've hit the nail on the head with what you want to do afterwards. To me the main use is to load as fast as possible and with a consistent structure, and the list mechanism would allow for this.

I'm looking to do binary matrix factorization, so a full or sparse matrix is the end point. The list isn't ideal for that, but it adds structure if I am given a list of cols.

I can of course transform this into a DT or matrix; my concern is the overhead of the transform operation, which means running a few AWK or SED scripts beforehand may still be the best option in my situation.

@arunsrinivasan arunsrinivasan added the fread label Sep 4, 2015
@arunsrinivasan arunsrinivasan added this to the v1.9.8 milestone Dec 17, 2015
@arunsrinivasan arunsrinivasan self-assigned this Dec 21, 2015