First version of the fwrite function #580 #1613

Merged: mattdowle merged 7 commits into Rdatatable:master from oseiskar:fwrite on Apr 7, 2016

Contributor

oseiskar commented Mar 27, 2016

This implementation of fwrite (#580) aims to be faster, or at least as fast as write.csv, but a few things have been left out or simplified:

  1. When quote=TRUE, all column names are quoted
  2. When quote=FALSE, nothing is quoted, even if this would break the CSV
  3. There is no option for row.names. They only make sense for data.frames with named rows. For data.tables, they would just reduce to row numbers.

The speedup compared to write.csv depends on column types and parameters but speedup factors from 2 to 4 are possible.
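To illustrate point 2 above, here is a minimal sketch of how an unquoted field containing the separator breaks a CSV. It uses only base R (`write.table` and `count.fields`) rather than the `fwrite` under review, and the sample data frame is made up for the illustration.

```r
# Hypothetical sample data: the second text value contains a comma.
df <- data.frame(id = 1:2, txt = c("plain", "has,comma"),
                 stringsAsFactors = FALSE)
tmp <- tempfile(fileext = ".csv")

# Without quoting, the embedded comma looks like an extra field separator.
write.table(df, tmp, sep = ",", quote = FALSE, row.names = FALSE)
n_unquoted <- count.fields(tmp, sep = ",")

# With quoting, every line has the same number of fields again.
write.table(df, tmp, sep = ",", quote = TRUE, row.names = FALSE)
n_quoted <- count.fields(tmp, sep = ",", quote = "\"")

unlink(tmp)
```

The third line of the unquoted file is counted as three fields instead of two, which is the breakage the description warns about.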

@oseiskar oseiskar force-pushed the oseiskar:fwrite branch from 015ad40 to d794b3b Mar 27, 2016
R/fwrite.R Outdated
@@ -0,0 +1,62 @@
fwrite <- function(dt, file.path, append = FALSE, quote = TRUE,

MichaelChirico Mar 27, 2016

Member
  1. should definitely use DT or x, given the dt function in the stats package.

  2. consistency with write.csv would have file.path named file and take default value "" (according to ?write.csv, the default is to print to console)

oseiskar Mar 28, 2016

Author Contributor
  1. ok, I'll change this

  2. This is semi-intentional. Supporting file="" (stdout) and connections would require passing file handles (instead of just file name strings) from R to C and I don't currently know how to do this. R's C interface is poorly documented.

MichaelChirico Mar 28, 2016

Member
  1. Gotcha. Not sure how important the consistency is, but maybe just follow write.table to this end:
if (file == "") 
  file <- stdout()
else if (is.character(file)) {
  file <- if (nzchar(fileEncoding)) 
    file(file, ifelse(append, "a", "w"), encoding = fileEncoding)
  else file(file, ifelse(append, "a", "w"))
  on.exit(close(file))
}
else if (!isOpen(file, "w")) {
  open(file, "w")
  on.exit(close(file))
}
if (!inherits(file, "connection")) 
  stop("'file' must be a character string or connection")

And here is the C side of the code, to the extent that it helps

R/fwrite.R Outdated
# validate arguments
stopifnot(is.data.frame(dt))
stopifnot(ncol(dt) > 0)
stopifnot(identical(unique(names(dt)), names(dt)))

MichaelChirico Mar 27, 2016

Member

maybe a warning for this, instead of an error?
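A minimal sketch of what the warning variant could look like. The helper name `check_unique_names` and its `strict` argument are hypothetical and are not part of the PR; the sketch only shows how the `stopifnot` check on duplicated names might be relaxed.

```r
# Hypothetical helper: warn (or stop, if strict = TRUE) on duplicated names.
check_unique_names <- function(x, strict = FALSE) {
  dup <- unique(names(x)[duplicated(names(x))])
  if (length(dup) == 0L) return(invisible(TRUE))
  msg <- paste0("duplicated column names: ", paste(dup, collapse = ", "))
  if (strict) stop(msg, call. = FALSE) else warning(msg, call. = FALSE)
  invisible(FALSE)
}

# Usage: a data frame with two columns both named "a".
df <- data.frame(a = 1, b = 2)
names(df) <- c("a", "a")
res <- tryCatch(check_unique_names(df), warning = function(w) "warned")
```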

R/fwrite.R Outdated
# determine from column types, which ones should be quoted
if (quote) {
column_types <- lapply(dt, class)
quoted_cols <- column_types %in% c('character', 'factor')

MichaelChirico Mar 28, 2016

Member

better %chin%

dselivanov Apr 2, 2016

IMHO %chin% is overkill here =)

MichaelChirico Apr 2, 2016

Member

@dselivanov

[image: chin]

Seems it's always 2-3x faster than %in%, what's the harm? Two-character difference...

dselivanov Apr 2, 2016

No harm! It's just a little less readable. Everyone knows %in%, but %chin% is data.table-specific. In this particular case the speedup won't be noticeable.

MichaelChirico Apr 2, 2016

Member

that's what ?"%chin%" is for ;-)
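For readers unfamiliar with `%chin%`, here is a small sketch of the two operators side by side on the column-class check discussed in this thread. It assumes data.table is installed. Because `%chin%` is documented for character vectors, the sketch first collapses the column classes into a plain character vector with `vapply`; that step is an assumption of this sketch, not necessarily what the PR does.

```r
library(data.table)

# Hypothetical table with an integer, a character and a factor column.
DT <- data.table(a = 1:2, b = c("x", "y"), c = factor(c("u", "v")))

# First class of each column, as a character vector.
column_types <- vapply(DT, function(col) class(col)[1L], character(1L))

quoted_in   <- column_types %in%   c("character", "factor")
quoted_chin <- column_types %chin% c("character", "factor")
```

Both operators flag the same columns here; with only a handful of columns the timing difference is unlikely to matter.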

R/fwrite.R Outdated
repeat {
block_end <- min(block_begin+(block.size-1), nrow(dt))

dt_block <- dt[c(block_begin:block_end),]

MichaelChirico Mar 28, 2016

Member

Not sure how easy it is to skip the last column in C, but would probably be faster to define a block column, then key on it and use binary search, a la:

DT[.(block_no)]

Would require using setDT in the case of data.frame, but that would also take care of the one-column DF problem on L22
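A rough sketch of the keyed-block idea, assuming data.table is installed. The table, the block size and the column name `b__` are illustrative only, and the sketch adds the column to a freshly created table rather than to user input.

```r
library(data.table)

DT <- data.table(v = 1:10)   # hypothetical data
block_size <- 4L

# Assign each row to a block: rows 1-4 -> 1, rows 5-8 -> 2, rows 9-10 -> 3.
DT[, b__ := (seq_len(.N) - 1L) %/% block_size + 1L]
setkey(DT, b__)

# Binary-search lookup of one block via the key.
blk <- DT[.(3L)]
```

Whether this is actually faster than slicing by row range would need to be measured.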

…f unique column names, replaced %in% -> %chin%
repeat {
block_end <- min(block_begin+(block.size-1), nrow(x))

dt_block <- x[c(block_begin:block_end),]

oseiskar Mar 28, 2016

Author Contributor

In reply to MichaelChirico's suggestion that it would be faster to create an extra column and use it like x[.(block_no)]: it could be a little faster and would not be difficult to implement in C, but I dislike the idea of modifying the input data table. Is there a convenient way to generate a name for such a column that would not conflict with existing column names?

jangorecki Mar 28, 2016

Member

@oseiskar None that I'm aware of. Once #633 is solved it will be easy; until then you can use a cryptic name (prefix it with a dot) and check that it doesn't already exist in the data.table. I'm not sure what you mean by modifying the input, but adding a column without modifying the input is as simple as x = shallow(x)[, "col" := new]: it won't copy the data, and the new column is added only to the locally processed data.

MichaelChirico Mar 28, 2016

Member

Just to reinforce Jan's point, this sort of thing (restricting acceptable column names) is already done under the hood in several places in the data.table code, see e.g. here. As a suggestion, there's b__, which would be analogous to f__, l__, o__, zo__, jn__, and jl__, all found in [.data.table. Agreed that block_no is too likely to be in users' tables (in fact I have several cases like that myself).
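One way to pick a helper-column name that cannot clash is sketched below in base R. The function `unique_col_name` and the default prefix `.block__` are hypothetical; data.table itself simply reserves names such as those listed above.

```r
# Hypothetical helper: return `base`, or `base` plus a numeric suffix,
# whichever is the first candidate not already among `existing`.
unique_col_name <- function(existing, base = ".block__") {
  candidate <- base
  i <- 0L
  while (candidate %in% existing) {
    i <- i + 1L
    candidate <- paste0(base, i)
  }
  candidate
}

# Usage
name_free  <- unique_col_name(c("a", "b"))
name_taken <- unique_col_name(c("a", ".block__", ".block__1"))
```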

R/fwrite.R Outdated

# convert data.frame row block to a list of columns
col_list <- lapply(dt_block, function(column) {
str_col <- as.character(column)

MichaelChirico Mar 28, 2016

Member

I'm not 100% sure I'm interpreting the output of profvis correctly, but I'm pretty sure this here is by far the biggest bottleneck on the code as it stands:

[image: profvis screenshot from 2016-03-28 09:56:00]

I think all the time spent on Cwritefile is in the tiny slices between lapply blocks (AFAICT profvis has no way of recording anything about C code, not even that execution is currently in C).

If that's correct, it seems there's ample room for improvement by porting the formatting step into C.


Here's the code I used for that:

setwd("~/Desktop/")
library(profvis)
library(data.table)
NN <- 1e6
set.seed(51423)
testDT <- 
  data.table(str1=sample(sprintf("%010d",1:NN)), #ID field 1
             str2=sample(sprintf("%09d",1:NN)),  #ID field 2
             #varying length string field--think names/addresses, etc.
             str3=replicate(NN,paste0(sample(LETTERS,sample(10:30,1),T),
                                      collapse="")),
             #factor-like string field with 50 "levels"
             str4=sprintf("%05d",sample(sample(1e5,50),NN,T)),
             #factor-like string field with 17 levels, varying length
             str5=sample(replicate(17,paste0(sample(LETTERS,
                                                    sample(15:25,1),T),
                                             collapse="")),NN,T),
             #lognormally distributed numeric
             num1=exp(rnorm(NN,mean=6.5,sd=1.5)),
             #3 binary strings
             str6=sample(c("Y","N"),NN,T),
             str7=sample(c("M","F"),NN,T),
             str8=sample(c("B","W"),NN,T),
             #right-skewed integer
             int1=ceiling(rexp(NN)),
             #dates by month
             dat1=sample(seq(from=as.Date("2005-12-31"),
                             to=as.Date("2015-12-31"),by="month"),
                         NN,T),
             dat2=sample(seq(from=as.Date("2005-12-31"),
                             to=as.Date("2015-12-31"),by="month"),
                         NN,T),
             num2=exp(rnorm(NN,mean=6,sd=1.5)),
             #date by day
             dat3=sample(seq(from=as.Date("2015-06-01"),
                             to=as.Date("2015-07-15"),by="day"),
                         NN,T),
             #lognormal numeric that can be positive or negative
             num3=(-1)^sample(2,NN,T)*exp(rnorm(NN,mean=6,sd=1.5)))

profvis(fwrite(testDT, "test.csv"))

oseiskar Mar 28, 2016

Author Contributor

That sounds right. My performance tests also suggested that the biggest bottleneck is formatting, which is especially slow for floating point numbers. However, I'm not sure how much room there is for optimization. Is R's native as.character (which might allocate a lot of new small string objects) much slower than using C's fprintf to write directly to a file? Should be easy to check...

oseiskar added 2 commits Mar 28, 2016
…C. Also added na option to fwrite.
…fprintf, e.g., 1e-10 (Linux) = 1e-010 (Windows)
Contributor Author

oseiskar commented Mar 28, 2016

I moved number and NA formatting to C code, which resulted in less column copying and a significant performance boost. See this gist. This version now seems to be between 2 and 4 times faster than write.csv with character and numeric columns.

@jangorecki thanks for pointing out the shortcomings of Sys.time. Any ideas how it should be replaced in benchmarking?

Member

MichaelChirico commented Mar 28, 2016

My preference is to use microbenchmark::get_nanotime.

By the way, thanks for working on this! The data.table community will be overjoyed to see a working version of this, I'm sure.

Member

MichaelChirico commented Mar 28, 2016

Two more things:

  • I get an error when trying to write outside current wd; modifying slightly the example in ?fwrite:
fwrite(data.table(first=c(1,2), second=c(NA, 'foo"bar')), "~/Desktop/table.csv")

Error in fwrite(data.table(first = c(1, 2), second = c(NA, "foo\"bar")), :
No such file or directory

  • Handling for list columns is probably incorrect. Not sure whether we should support... seems like it will be a good counterpart to sep2 in fread. As of now:
DT <- data.table(a = 1:3, l = list(list(1:6), list(3:5), list("a","b")))

fwrite(DT, "test.csv")

Produces:

"a","l"
1,list(1:6)
2,list(3:5)
3,list("a", "b")

I don't know the proper way to handle this, perhaps surround a list in angle brackets and comma-separate the elements?

"a","l"
1,<<1,2,3,4,5,6>>
2,<<3,4,5>>
3,<<"a">,<"b">>

?

Or perhaps take the cue from format.data.table:

format.item <- function(x) {
  if (is.atomic(x) || is.formula(x)) 
    paste(c(format(head(x, 6), justify = justify, ...), 
            if (length(x) > 6) ""), collapse = ",")
  else paste("<", class(x)[1L], ">", sep = "")
}

if (is.list(col)) 
  col = sapply(col, format.item)
else col = format(char.trunc(col), justify = justify, ...)

Otherwise just error on list columns

Contributor Author

oseiskar commented Mar 28, 2016

@MichaelChirico The problem is not the working directory but the ~, which is not automatically expanded in C. I'll fix this using path.expand...
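A minimal sketch of the fix described above: expand the path on the R side before handing it to C. The wrapper name `expand_target` is hypothetical; `path.expand` is the base R function mentioned in the comment.

```r
# Hypothetical wrapper: validate the argument and expand a leading "~".
expand_target <- function(path) {
  stopifnot(is.character(path), length(path) == 1L)
  path.expand(path)
}

# A path without "~" is returned unchanged; "~/..." is expanded to the
# user's home directory when one can be determined.
plain    <- expand_target("/tmp/table.csv")
expanded <- expand_target("~/Desktop/table.csv")
```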

Member

MichaelChirico commented Mar 28, 2016

Re: lists, I just noticed that write.csv throws an error in this case, so it should be fine to just error for now:

Error in .External2(C_writetable, x, file, nrow(x), p, rnames, sep, eol, :
unimplemented type list in EncodeElement

Member

MichaelChirico commented Mar 28, 2016

Re: numeric columns, for consistency with write.csv more digits should be printed; from ?write.csv:

In almost all cases the conversion of numeric quantities is governed by the option "scipen" (see options), but with the internal equivalent of digits = 15. For finer control, use format to make a character matrix/data frame, and call write.table on that.
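As a rough illustration of the 15-significant-digit behaviour quoted above, the sketch below formats numbers with a printf-style `%.15g` pattern from R. The helper `format_num` and its arguments are hypothetical, and this is not claimed to be the format string used in the PR's C code.

```r
# Hypothetical helper: format doubles with `digits` significant digits
# and replace missing values with the string `na`.
format_num <- function(x, digits = 15L, na = "") {
  out <- sprintf(paste0("%.", digits, "g"), x)
  out[is.na(x)] <- na
  out
}

formatted <- format_num(c(pi, NA, 0.5, 0.1 + 0.2))
```

Exponent formatting of very small or very large values can differ between platforms, as one of the commit messages in this thread notes, so output for such values should not be compared byte for byte across systems.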

Member

jangorecki commented Mar 28, 2016

@oseiskar

elapsed.secs <- system.time(ans <- fwrite(...))[[3L]]

is the easiest way. For precise timing you can use

pt = microbenchmark::get_nanotime()
ans = fwrite(...)
microbenchmark::get_nanotime() - pt

or, from the devel version of microbenchmarkCore, a drop-in replacement for system.time called system.nanotime.

No need to have list columns supported now, this can be easily and flexibly handled within data.table before passing it to fwrite.

I'm not sure about using as.character() for columns that are not numeric, integer, or character; toString might be better (in which case by=1:nrow(.) is probably needed), but that depends on how edge cases are detected further downstream. toString always returns a non-NA character vector of length 1L. I'm not sure how it performs.
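One possible way to handle list columns before writing, as suggested above, is to collapse each list element into a single string. The sketch below uses base R only; the helper `flatten_list_cols` and its `sep2` argument (named after the fread parameter mentioned earlier in the thread) are hypothetical.

```r
# Hypothetical helper: turn every list column into a character column by
# pasting the elements of each cell together with `sep2`.
flatten_list_cols <- function(df, sep2 = "|") {
  is_list <- vapply(df, is.list, logical(1L))
  for (nm in names(df)[is_list]) {
    df[[nm]] <- vapply(df[[nm]],
                       function(el) paste(unlist(el), collapse = sep2),
                       character(1L), USE.NAMES = FALSE)
  }
  df
}

# Usage on a small data frame with one list column.
df <- data.frame(a = 1:3)
df$l <- list(1:6, 3:5, c("a", "b"))
flat <- flatten_list_cols(df)
```

The result contains only atomic columns, so it can be written by any CSV writer that rejects list columns.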

Contributor Author

oseiskar commented Mar 30, 2016

In reply to this earlier comment about supporting a write.csv-like file argument that could also be an R connection and would default to writing to the console: this is not possible, as far as I know. Adding support for writing to stdout is possible, but it seems that writing to an existing R connection is not.

@MichaelChirico thanks for looking up the C side of write.csv. Unfortunately, it does not seem to be possible to replicate this because the code calls functions that are not part of the public C API of R. It would seem that all file/connection handling is part of the private / internal R code and thus unavailable.

In particular, one would need to call getConnection to transform a parameter SEXP, representing a connection number, to an Rconnection, and trying to replicate the implementation of this function would not help here.


dselivanov commented Mar 31, 2016

Not sure if this will help, but I'll leave it here: there is an official C-level API for connection handles in R 3.3.0.

Contributor Author

oseiskar commented Apr 2, 2016

Actually, it is possible to bypass the private API limitation and call getConnection at least on some platforms: see this branch for a proof of concept. However, this brings in a whole new bunch of problems:

  1. It is not guaranteed to work or remain supported (However, as @dselivanov pointed out, official support will be available in R version 3.3.0)
  2. The first attempted implementation absolutely kills performance. The code linked above is about 4 times slower than write.csv with character columns.
  3. For regular files, it is possible to "fix" 2. using lower-level functions like this (speedup drops from about 2x to about 1.5x) but this will not work for all R connections, such as stdout. It is possible to handle both cases separately, which would significantly increase the complexity of the code.

In my opinion, supporting R connection arguments is not worth the trouble at this point.


dselivanov commented Apr 2, 2016

Personally, I think connections are a "nice to have" feature (which can be implemented in the future), but even without connections this is a great PR. Thank you, @oseiskar.

Member

jangorecki commented Apr 2, 2016

+1 dselivanov
FYI @mattdowle :)

@oseiskar oseiskar force-pushed the oseiskar:fwrite branch from 3ac65c1 to cf83d03 Apr 3, 2016
@oseiskar oseiskar force-pushed the oseiskar:fwrite branch from cf83d03 to 6be2ed1 Apr 3, 2016
@mattdowle mattdowle merged commit 6be2ed1 into Rdatatable:master Apr 7, 2016
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
Member

mattdowle commented Apr 7, 2016

Awesome! Looks great @oseiskar :-)

Member

arunsrinivasan commented Apr 7, 2016

Great PR, @oseiskar. Using the data.table from the extensive benchmark Michael did on SO here, I profiled it using Instruments -> time profiler. Here's a snapshot.

[image: fwrite-time-profiler]

I've not looked at the code, but IIUC writefile() takes only about 50% of the time? Also perhaps there are trivial places that could be improved on that you could spot? Hope it's of some use.

Member

jangorecki commented Apr 11, 2016

@oseiskar
Could you replace R_xlen_t with something that is available in R 2.15.0?
While installing the package on R 2.15.0 (data.table's stated dependency) I get the following error:

fwrite.c: In function 'writefile':
fwrite.c:31:3: error: unknown type name 'R_xlen_t'
   R_xlen_t ncols = LENGTH(list_of_columns);
   ^

You can reproduce it with:

docker run -it docker.io/jangorecki/r-2.15.0 /bin/bash
curl -O https://Rdatatable.github.io/data.table/src/contrib/data.table_1.9.7.tar.gz
R2 CMD INSTALL data.table_1.9.7.tar.gz

Alternatively, if there is no substitute for it or a substitute would decrease performance, then @mattdowle would need to decide whether to raise the stated dependency. R 2.15.0 is from March 2012; in my opinion, 4 years isn't old enough to deprecate it without a strong reason.

@oseiskar oseiskar deleted the oseiskar:fwrite branch Apr 25, 2016
Member

mattdowle commented on inst/tests/tests.Rraw in d794b3b Oct 28, 2016

Great tests. Just a small note that data.table's test() function already has an error= argument, which tests both that an error is raised and that its message contains the given text. Will replace fwrite_expect_error with test(...,error="...").

Member

mattdowle commented Nov 7, 2016

Work continued here: #1664
