[Request] allow tmpDir to be supplied as argument: fread can run out of tmpfs space on unix during preprocessing #1139

Closed
everdark opened this issue May 6, 2015 · 5 comments

everdark commented May 6, 2015

Hi,

Recently I ran into a problem with large compressed files: fread can stop working because tmpfs runs out of space. Currently (in the master branch) fread on unix systems uses tmpfs (/dev/shm) for its temporary file whenever that directory exists, so the size of tmpfs limits fread's ability to read potentially large files before any preprocessing can be done. The problem is more severe when several files are loaded in parallel for speed, say mclapply(input_list, fread, mc.cores=4), where input_list may be something like

"zcat file1.gz | grep blabla | ..."
"zcat file2.gz | grep blabla | ..."
...

Each gz file can be several GB uncompressed. I don't need all of that data in my analysis, and preprocessing could significantly reduce the size of each file. However, the preprocessing requires each file to be uncompressed to disk first, which occupies all the space available in tmpfs. (There are, of course, several workarounds for this kind of situation, but it would be great to address it directly in one R function call, which here is fread.)

It would therefore be nice to have a user-supplied argument that forces the tempfile location to somewhere other than tmpfs on unix systems, for example dat <- fread("zcat file.gz", tmpDir="/data"). Performance may be a bit worse because of disk I/O, but the raw data would no longer be limited by the size of tmpfs, which is usually far smaller than any disk device at hand. (On my machine tmpfs has 8 GB and that's it.)
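To illustrate the idea, here is a minimal sketch of a wrapper that does the same thing outside of fread. The function name read_cmd and its arguments tmp_dir and reader are hypothetical and not part of data.table; the reader defaults to base R's read.csv only so that the sketch runs without extra packages.

```r
# Hypothetical helper (not part of data.table): run a shell pipeline, spool
# its output to a temp file in a caller-chosen directory, then read the file.
read_cmd <- function(cmd, tmp_dir = tempdir(), reader = utils::read.csv, ...) {
  stopifnot(dir.exists(tmp_dir))
  tt <- tempfile(tmpdir = tmp_dir)      # temp file lives in tmp_dir, not /dev/shm
  on.exit(unlink(tt), add = TRUE)       # always clean up the spooled file
  status <- system(paste0("(", cmd, ") > ", shQuote(tt)))
  if (status != 0L) stop("command failed with exit status ", status)
  reader(tt, ...)
}

# Possible usage, assuming data.table is installed and /data exists:
# dat <- read_cmd("zcat file.gz | grep blabla", tmp_dir = "/data",
#                 reader = data.table::fread)
```

The trade-off is the one described above: the spooled file goes through disk I/O instead of memory-backed tmpfs, but its size is bounded by the chosen disk rather than by /dev/shm.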

A possible minor change that fixes this on unix is to modify fread.R as in everdark@4aaa745.

I have only tested it on my local machine, where it works fine. There could be ramifications that this simple modification doesn't take into account, so I'm filing this request to open the discussion. :) Has anybody else run into this tmpfs out-of-space issue?

alecw commented Jun 10, 2016

I've also run into this issue. What is particularly frustrating is that if /dev/shm exists, then the value of TMPDIR is ignored and /dev/shm is used. It would be great if TMPDIR were respected.
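As a rough sketch of what "respecting TMPDIR" could look like, the selection logic might be written as follows. The function name pick_tmp_dir is hypothetical and this is not the code in fread.R; it only uses base R calls.

```r
# Hypothetical sketch: prefer the directory named by TMPDIR when it is set
# and exists, otherwise fall back to another directory.
pick_tmp_dir <- function(fallback = tempdir()) {
  tmp <- Sys.getenv("TMPDIR", unset = "")
  if (nzchar(tmp) && dir.exists(tmp)) tmp else fallback
}

# The temp file is then created in the chosen directory.
tt <- tempfile(tmpdir = pick_tmp_dir())
```

Note that R's own tempdir() is already derived from TMPDIR when the session starts, so using it as the fallback keeps the user's setting in effect either way.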

mgahan commented Jun 16, 2016

Agreed. This would really help my workflow.

borisclemencon commented Nov 25, 2016

+1 It would save my life!

shyam334 commented

It's not good that the shm is abused, and worse that there is no way to switch this behaviour off.
Please patch this soon. (happy to help if required)

jpecar commented Jun 28, 2017

Something like

diff -ur data.table/R/fread.R data.table.fix/R/fread.R
--- data.table/R/fread.R        2017-01-31 03:10:52.000000000 +0000
+++ data.table.fix/R/fread.R    2017-06-29 08:34:42.764465000 +0000
@@ -85,8 +85,11 @@
         tt = tempfile()
         on.exit(unlink(tt), add = TRUE)
         if (.Platform$OS.type == "unix") {
-            if (file.exists('/dev/shm') && file.info('/dev/shm')$isdir) {
-                tt = tempfile(tmpdir = '/dev/shm')
+            tmp = Sys.getenv("TMPDIR")
+            if (file.exists(tmp) && file.info(tmp)$isdir) {
+                tt = tempfile(tmpdir = tmp)
+            } else {
+                tt = tempfile(tmpdir = '/tmp')
             }
             system(paste('(', input, ') > ', tt, sep=""))
         } else {