fwrite hangs if called from within mclapply #1727

Closed
rdiaz02 opened this issue Jun 3, 2016 · 7 comments

@rdiaz02

rdiaz02 commented Jun 3, 2016

When calling fwrite from inside mclapply the execution hangs.

Here are two reproducible examples. The first uses a unique temporary file name via tempfile; the second doesn't.

library(data.table)
library(parallel)

mm <- matrix(1:10, ncol = 2)

f1 <- function(data) {
    tmpn <- tempfile()
    cat(paste("\n tmpn is ", tmpn, "\n"))
    write(t(data), file = tmpn)
}

f2 <- function(data) {
    tmpn <- tempfile()
    cat(paste("\n tmpn is ", tmpn, "\n"))
    data.table::fwrite(data.table(data), file = tmpn, verbose = TRUE)
}

## OK
f1(mm)
## OK
f2(mm)

## OK
mclapply(1:3, function(x) f1(mm), mc.silent = FALSE)
## OK; of course, just using lapply
mclapply(1:3, function(x) f2(mm), mc.silent = FALSE, mc.cores = 1)
## Hangs; does not even show attempting the third
mclapply(1:3, function(x) f2(mm), mc.silent = FALSE, mc.cores = 2)

Identical to the above, but without the tempfile (we create a filename using the index of mclapply), to make sure that is not the problem:

library(data.table)
library(parallel)

mm <- matrix(1:10, ncol = 2)

f1b <- function(index, data) {
    tmpn <- paste0("/tmp/afile", index)
    cat(paste("\n tmpn is ", tmpn, "\n"))
    write(t(data), file = tmpn)
}

f2b <- function(index, data) {
    tmpn <- paste0("/tmp/afile", index)
    cat(paste("\n tmpn is ", tmpn, "\n"))
    data.table::fwrite(data.table(data), file = tmpn, verbose = TRUE)
}

## OK
f1b(1, mm)
## OK
f2b(1, mm)

## OK
mclapply(1:3, function(x) f1b(index = x, data = mm), mc.silent = FALSE)
## OK; of course, just using lapply
mclapply(1:3, function(x) f2b(index = x, data = mm), mc.silent = FALSE, mc.cores = 1)
## Hangs and we do not even get to the third
mclapply(1:3, function(x) f2b(index = x, data = mm), mc.silent = FALSE, mc.cores = 2)

Expected behavior: run finishes without hanging, much like it does with write.

What does verbose = TRUE in fwrite say? The runs show:

tmpn is  /tmp/afile1 

 tmpn is  /tmp/afile2 
Maximum line length is 26 calculated in 0.000s
Writing column names ... done in 0.000s
Writing data rows in 1 batches of 40329 rows (each buffer size 1.000MB, turbo=1) ... Maximum line length is 26 calculated in 0.000s
Writing column names ... done in 0.000s
Writing data rows in 1 batches of 40329 rows (each buffer size 1.000MB, turbo=1) ... 

What do the written files contain?

If I look at afile1 and afile2 after the run that hangs, only the header has been written, not the data rows.
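
The same check can be run programmatically. This is only a sketch: it counts the lines in each output file, on the reasoning that a file holding just the header has exactly one line. The `count_lines` helper is hypothetical, and the two temporary files stand in for afile1 and afile2.

```r
# Hypothetical helper: number of lines in each file, NA if the file is missing
count_lines <- function(paths) {
    vapply(paths, function(p) {
        if (!file.exists(p)) return(NA_integer_)
        length(readLines(p, warn = FALSE))
    }, integer(1))
}

# Self-contained demonstration with two temporary files:
# one with only a header, one with a header and data rows
header_only <- tempfile()
complete <- tempfile()
writeLines("V1,V2", header_only)
writeLines(c("V1,V2", "1,6", "2,7"), complete)

n <- count_lines(c(header_only, complete))
# TRUE marks files that contain nothing but the header
unname(n == 1L)
```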

Output from sessionInfo()

> sessionInfo()
R version 3.3.0 Patched (2016-05-26 r70677)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux stretch/sid

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] data.table_1.9.7

loaded via a namespace (and not attached):
[1] compiler_3.3.0 tools_3.3.0   
> 

Is the problem caused by every worker writing the same object?
I would not expect that to matter, and it does not: the hang also occurs when each call writes a different R object (here, each call modifies the object before writing it):

library(data.table)
library(parallel)

mm <- matrix(1:10, ncol = 2)

f1c <- function(index, data) {
    tmpn <- paste0("/tmp/afile", index)
    cat(paste("\n tmpn is ", tmpn, "\n"))
    data <- data[-index, ]
    write(t(data), file = tmpn)
}

f2c <- function(index, data) {
    tmpn <- paste0("/tmp/afile", index)
    cat(paste("\n tmpn is ", tmpn, "\n"))
    data <- data[-index, ]
    data.table::fwrite(data.table(data), file = tmpn, verbose = TRUE)
}

## OK
f1c(1, mm)
## OK
f2c(1, mm)

## OK
mclapply(1:3, function(x) f1c(index = x, data = mm), mc.silent = FALSE)
## OK; of course, just using lapply
mclapply(1:3, function(x) f2c(index = x, data = mm), mc.silent = FALSE, mc.cores = 1)
## Hangs and we do not even get to the third
mclapply(1:3, function(x) f2c(index = x, data = mm), mc.silent = FALSE, mc.cores = 2)


@arunsrinivasan
Member

Can't reproduce on OS X El Capitan, v1.9.7 latest.

@rdiaz02
Author

rdiaz02 commented Jun 23, 2016

I just ran it again (with data.table downloaded right now), with the same results, on two different Linux machines. I guess it might be an OS- or compiler-dependent issue?

@yitang

yitang commented Sep 7, 2016

@rdiaz02 I am having the same problem on my Debian machine. Apart from fwrite, I also noticed that other data.table functions make R hang when used inside mclapply. The CPU usage drops to 0 and stays there.

@yitang

yitang commented Sep 7, 2016

I don't have the same problem on another, older Linux machine, though. I guess it's probably caused by the C libraries, so I don't expect it to be resolved soon. I'll replace fwrite with saveRDS for the moment.
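
For reference, a minimal sketch of that interim workaround, assuming an R-specific binary file is acceptable in place of CSV. `saveRDS()` and `readRDS()` are base R; the `save_one` helper and the file naming are hypothetical. `mclapply` with `mc.cores = 2` forks, so this runs on Unix-alikes only.

```r
library(parallel)

mm <- matrix(1:10, ncol = 2)

# Hypothetical helper: serialise the object with base R instead of fwrite()
save_one <- function(index, data) {
    tmpn <- tempfile(pattern = paste0("afile", index, "_"), fileext = ".rds")
    saveRDS(data[-index, ], file = tmpn)
    tmpn
}

paths <- mclapply(1:3, function(i) save_one(i, mm), mc.cores = 2)

# Read one result back to confirm the round trip
restored <- readRDS(paths[[1]])
identical(restored, mm[-1, ])
```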

@ChristK

ChristK commented Sep 7, 2016

This sounds like another case of an implicit/explicit parallelisation conflict. Have you tried setthreads(1L) after loading data.table to force fwrite to use only one thread?
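
A minimal sketch of this suggestion, under one assumption: the comment names the function `setthreads`, while released versions of data.table export it as `setDTthreads()`, which is what the code below uses. The `write_one` helper and file names are hypothetical, and whether this avoids the hang on a given machine is not verified here.

```r
library(data.table)
library(parallel)

# Restrict data.table to a single thread *before* forking, so that
# fwrite() inside each child does not try to run multi-threaded.
# Assumption: setDTthreads() is the released name of the function
# referred to as setthreads in the comment above.
setDTthreads(1L)

mm <- matrix(1:10, ncol = 2)

# Hypothetical helper: write one copy of the data to a per-task temp file
write_one <- function(index, data) {
    tmpn <- tempfile(pattern = paste0("afile", index, "_"), fileext = ".csv")
    fwrite(as.data.table(data), file = tmpn)
    tmpn
}

# mc.cores = 2 forks two workers (Unix-alikes only)
paths <- mclapply(1:3, function(i) write_one(i, mm), mc.cores = 2)

# Line count per file: header plus data rows if the write completed
n_lines <- sapply(paths, function(p) length(readLines(p)))
n_lines
```

If the write completes, each file holds more than the single header line seen in the hanging runs.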

@jangorecki
Member

jangorecki commented Sep 7, 2016

@yitang Here is an example of mixing the parallel package and the parallel functions in data.table. You should be able to measure performance and choose your working/preferred parallelism layer by setting two options.
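
The linked example is not reproduced in this thread, so the following is only a sketch of what such a comparison might look like. It assumes the two settings are the number of forked workers (`mc.cores`) and data.table's thread count (set here with `setDTthreads()`, the name used in released versions). The `run_config` helper and the data are hypothetical.

```r
library(data.table)
library(parallel)

dt <- data.table(a = 1:1e5, b = rnorm(1e5))

# Hypothetical helper: write three files with a given split of parallelism
# and return the elapsed time in seconds
run_config <- function(mc_cores, dt_threads) {
    old <- getDTthreads()
    on.exit(setDTthreads(old))   # restore the previous thread setting
    setDTthreads(dt_threads)
    system.time(
        mclapply(1:3, function(i) {
            fwrite(dt, file = tempfile(fileext = ".csv"))
        }, mc.cores = mc_cores)
    )[["elapsed"]]
}

# Parallelism in the parallel package only: forked workers, single-threaded fwrite
t_fork <- run_config(mc_cores = 2, dt_threads = 1)
# Parallelism in data.table only: no forking, multi-threaded fwrite
t_threads <- run_config(mc_cores = 1, dt_threads = 2)

c(fork = t_fork, threads = t_threads)
```

Both configurations keep only one layer parallel at a time, which is the combination the comments above point to as safe; which one is faster depends on the machine and the data.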

@yitang

yitang commented Sep 8, 2016

@ChristK @jangorecki Thank you very much. I didn't know about the setthreads function before. That would solve the problem. Thanks.
