[Bug] fwrite in large environments (tables up to 100M rows) #1968
Comments
I have the same problem. As @mgahan wrote, it is reproducible when multiple threads are used and only for numeric values (integers seem to work fine). Interestingly, it appears to be isolated to cases where the data table has more than one column.

@mattdowle I just tested it out and my tests check out. Thanks for all the hard work!

@mgahan Excellent - thanks!

@mattdowle @mgahan which version of the package has this bug fix?
@mattdowle I am not sure that this issue is resolved. Seeing it in DT v 10.4.0:
Browse[1]> nrow(dt[population<1])
Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> nrow(newdt[population<1])
Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> nrow(newdt[population<1])
Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> nrow(newdt[population<1])
Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> nrow(newdt[population<1])
Browse[1]> sessionInfo()
locale:
attached base packages:
other attached packages:
loaded via a namespace (and not attached):

The fix is not on CRAN yet. Install the current development version (1.10.5).

@MichaelChirico I don't think so. Just installed dev version, same behavior:
Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> fwrite(dt, file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> newdt <- fread(file.path(worker.dir, paste0(country, '.csv')))
Browse[1]> nrow(dt[population<1])
Browse[1]> nrow(newdt[population<1])
Browse[1]> sessionInfo()
locale:
attached base packages:
other attached packages:
loaded via a namespace (and not attached):

This could potentially be specific to our cluster environment, but others at my institute are seeing the same issue. It seems to be related to columns being accidentally reordered inside a row. I would recommend that people be very cautious about using fwrite in production code at this stage: this bug seems to be pervasive and is really difficult to track down in large outputs.
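For anyone who wants a safety net while this is open, here is a minimal sketch of a write-then-verify wrapper. It is only an illustration: `fwrite_verified` is a hypothetical helper, not part of data.table, and the example table is made up (the `population` column name just echoes the session above). It reads the whole file back, so it roughly doubles the I/O cost.

```r
library(data.table)

# Hypothetical helper, not part of data.table: write `dt`, read the file back,
# and stop with an error if any column differs from what was written.
fwrite_verified <- function(dt, path) {
  fwrite(dt, path)
  back <- fread(path)
  same_shape <- identical(names(dt), names(back)) && nrow(dt) == nrow(back)
  # all.equal() compares numerics within a small tolerance, which allows for
  # the rounding that happens when doubles are written as decimal text.
  same_values <- same_shape && all(mapply(
    function(written, reread) isTRUE(all.equal(written, reread)),
    dt, back
  ))
  if (!same_values) stop("fwrite round trip changed the data: ", path)
  invisible(path)
}

# Example data; the values are made up for illustration.
dt <- data.table(id = 1:1000, population = runif(1000, 0, 10))
out <- fwrite_verified(dt, tempfile(fileext = ".csv"))
```

If the file on disk does not match the table in memory, the call stops instead of silently leaving a corrupted CSV behind.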
I have been using fwrite in a large AWS environment, detailed below:

Instance: m4.4xlarge
RAM: 64 GB
Threads: 16
Cost: $0.862 hourly

I notice that when writing out numeric values, the written output is incorrect. When the same data is coerced to the integer class, the output seems to be correct. This does not seem to be a problem with setDTthreads(1). However, errors start to creep in with setDTthreads(2), albeit fewer errors than with setDTthreads(16). I have detailed the two scenarios below.

Writing numeric output with fwrite

Writing integer output with fwrite
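A rough sketch of how the two scenarios could be compared across thread counts follows. This is not the reporter's original code: the table size, the column names `a` and `b`, and the helper `round_trip_ok` are assumptions chosen for illustration, and the tables are far smaller than the 100M rows mentioned in the title, so the problem may not show up at this scale.

```r
library(data.table)

set.seed(1)
n <- 1e5  # assumed size, much smaller than the tables in the report

# Two columns each, since the problem was reported with more than one column.
dt_num <- data.table(a = runif(n), b = runif(n))                      # numeric
dt_int <- data.table(a = sample.int(1e6, n), b = sample.int(1e6, n))  # integer

# Hypothetical helper: write with a given thread count, read back, compare.
round_trip_ok <- function(dt, threads) {
  old <- getDTthreads()
  setDTthreads(threads)
  on.exit(setDTthreads(old), add = TRUE)  # restore the previous setting
  path <- tempfile(fileext = ".csv")
  fwrite(dt, path)
  back <- fread(path)
  # Compare as plain data frames; numerics are compared within a tolerance.
  isTRUE(all.equal(as.data.frame(dt), as.data.frame(back)))
}

thread_counts <- c(1L, 2L)
results <- data.table(
  threads    = thread_counts,
  numeric_ok = vapply(thread_counts, function(t) round_trip_ok(dt_num, t), logical(1)),
  integer_ok = vapply(thread_counts, function(t) round_trip_ok(dt_int, t), logical(1))
)
print(results)
```

A FALSE in the numeric_ok column for a multi-threaded run, alongside TRUE for the single-threaded run and for the integer table, would match the behaviour described above.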