Support .gz file format for fread #717

Closed
renqian opened this issue Jul 4, 2014 · 52 comments
Comments

@renqian renqian commented Jul 4, 2014

I have several thousand .gz files containing data in CSV format, about 60GB in total as .gz files. Decompressing them first and then loading pieces via fread turns out to be a huge pain. I wonder whether it is possible to extend fread so that it can read compressed file formats, just as read.table does?

Perhaps file connection issues are highly relevant, as mentioned in #341, #543, and #561.
Some other references:

http://stackoverflow.com/questions/5764499/decompress-gz-file-using-r

http://blog.revolutionanalytics.com/2009/12/r-tip-save-time-and-space-by-compressing-data-files.html

Contributor

@eantonya eantonya commented Jul 7, 2014

You can just do fread('zcat file.gz'), or some loop variation, if you have many files.
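
The "loop variation" could look roughly like the sketch below. The helper names `gz_cmd` and `read_gz_dir` are hypothetical, and the sketch assumes a `zcat` binary on the PATH plus a data.table version whose `fread` accepts the `cmd` argument (older versions take the command through `input`, as in the call above).

```r
library(data.table)

# Hypothetical helper: build the shell command for one file.
# shQuote() protects paths that contain spaces or other shell metacharacters.
gz_cmd <- function(path) paste("zcat <", shQuote(path, type = "sh"))

# Hypothetical helper: read every .gz file in `dir` and stack the results.
# Extra arguments in `...` are passed on to fread().
read_gz_dir <- function(dir, ...) {
  files <- list.files(dir, pattern = "\\.gz$", full.names = TRUE)
  rbindlist(lapply(files, function(f) fread(cmd = gz_cmd(f), ...)))
}
```

Usage would be something like `DT <- read_gz_dir("path/to/gz_files")`. This still depends on the system providing `zcat`, so it is not a Windows solution.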

Member

@arunsrinivasan arunsrinivasan commented Sep 28, 2014

:Bump: quite useful (and coming up frequently).

Here's one SO post.

@gbonamy gbonamy commented Sep 28, 2014

Yes, this would be a very useful feature to have. Using the command line is a temporary solution at best, since it relies on the underlying system having the decompression tools. For instance, 'zcat' is not available on Windows unless one installs Cygwin or similar.

Since fread is by far the best tool in R for reading files, being able to read gzipped, bzipped, etc. files directly would be a huge performance gain.

Contributor

@rsaporta rsaporta commented Sep 30, 2014

I just saw @Arun's bump in my email, and literally a few hours ago I was ingesting 200+ such files. +1 for usefulness

Member

@mattdowle mattdowle commented Oct 7, 2014

Would have been useful here as well :
http://www.magesblog.com/2014/10/visualising-seasonality-of-atlantic.html
Will take a look.

@xiaodaigh xiaodaigh commented Nov 20, 2014

I agree with @gbonamy: reading directly from zip files would be a fantastic addition!

@rmscriven rmscriven commented Mar 16, 2015

Reading from a connection with unz() would also be quite useful. I have a function that downloads a zip file, and only reads one file then throws it away. So if I could use fread(unz(zipfile, file = file)) it would be a great addition.
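
Until fread accepts connections, one workaround is to extract just the one member to a temporary directory and read it from there. The sketch below is only an illustration; `fread_zip_member` is a hypothetical name, and it relies on `utils::unzip` returning the paths of the files it extracted.

```r
library(data.table)

# Hypothetical helper: extract one member of a zip archive to a temporary
# directory, read it with fread, and delete the extracted copy afterwards.
fread_zip_member <- function(zipfile, member, ...) {
  exdir <- tempfile("unz_")
  dir.create(exdir)
  on.exit(unlink(exdir, recursive = TRUE), add = TRUE)

  # unzip() returns the path(s) of the extracted file(s).
  path <- unzip(zipfile, files = member, exdir = exdir)
  stopifnot(length(path) == 1L)

  fread(path, ...)
}
```

This writes the uncompressed member to disk once, so it trades some I/O for not needing any external tools.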

@statquant statquant commented Apr 17, 2015

+1 for reading directly from gz files. I would personally use it every day.

@mspivakov mspivakov commented Apr 26, 2015

+1 from me as well.

@fleimgruber fleimgruber commented Jun 3, 2015

+1

@zx8754 zx8754 commented Jul 1, 2015

+1

@rickdonnelly rickdonnelly commented Jul 15, 2015

+1

@qgeissmann qgeissmann commented Jul 17, 2015

+1

@jayjacobs jayjacobs commented Jul 28, 2015

+1

Contributor

@eantonya eantonya commented Jul 28, 2015

I'm curious - are people requesting this mostly working on Windows? I have trouble seeing the desire for this kind of specialization on Linux.

I personally mostly use .xz compression, but wouldn't care if fread directly supported it - I very frequently pipe the uncompressed result and do some post-processing before loading it in R (e.g. fread('xzcat file.xz | grep smth | awk blah')) and I like not depending on fread's file-format reading abilities - my shell processes are almost always going to be more advanced than whatever is implemented in fread.

@zachmayer zachmayer commented Aug 20, 2015

👍

@dselivanov dselivanov commented Sep 14, 2015

Just leaving my tip here for OS X users.
The zcat syntax on OS X is a little different from that on Linux systems. To read *.gz files, use the following call:

dt <- fread(input = 'zcat < data.gz')

@gdkrmr gdkrmr commented Dec 8, 2015

This is probably not the most efficient way, but it works for me; for .gz files you will probably have to replace unz with gzfile:

zread <- function(zf,f,...){
  require(data.table)
  res <- fread(paste(readLines(tmp <- unz(zf,f)), collapse = "\n"),...)
  close(tmp)
  res
}

@dselivanov dselivanov commented Dec 8, 2015

readLines is incredibly slow...

@gdkrmr gdkrmr commented Dec 8, 2015

@dselivanov it gets the job done on small files, never tried it on large ones though... my method probably does a lot of useless memory allocation passing the whole file around as a character vector.

@statquant statquant commented Dec 8, 2015

Just zcat the file, see previous posts


@cybaea cybaea commented Dec 21, 2015

+1 for us hapless Windows users and for portability. There may be reasons why fread cannot accept a connection (as in help("connections", package="base")) but if not that would be a great and portable solution. Would also help with some common encoding issues (eg BOMs in UTF-8 files).

@setempler setempler commented Nov 18, 2016

+1 - also for other connections (gzfile, bzfile, xzfile, unz)

@TuSKan TuSKan commented Nov 23, 2016

+1 My first wish from fread / fwrite

@borisclemencon borisclemencon commented Nov 25, 2016

@webbp, I have the same problem. I cannot use zcat, nice as it is, because /dev/shm is too small on my AWS EC2 instance. I should try to redirect /dev/shm to an EBS disk, but have not figured out how yet. Meanwhile, running "zcat file.tsv.gz > file.tsv" followed by fread('file.tsv') is a painful workaround, but at least it works.

An alternative would be to use a specific tmp directory. Any ideas?

@sznadas sznadas commented Feb 6, 2017

+1

Contributor

@mGalarnyk mGalarnyk commented Feb 10, 2017

+1

@rargelaguet rargelaguet commented Feb 15, 2017

+1

@xhdong-umd xhdong-umd commented Feb 23, 2017

Is it possible to make an R package that wraps command line tools for Windows, Mac, and Linux in the same interface? Then we could use the zcat approach with fread whenever that package is installed.

An example of this kind of package

I realized this kind of package would not be allowed on CRAN if it needs to bundle a Windows build of gzip. It would have to be hosted elsewhere, or users would have to download gzip for Windows themselves.

@xhdong-umd xhdong-umd commented Feb 23, 2017

Uncompressing the file into a temp file on disk will always work, but that could be slow because of disk access. If we read the file into a raw vector in RAM, then decompress it with memDecompress and feed the uncompressed raw vector to fread, would that work?

@xhdong-umd xhdong-umd commented Mar 2, 2017

I wrote a function that decompresses zip, gz, bzip2, and xz files into a temp file, runs a function on it, then removes the temp file. So we can use temp_unzip(file, fread, ...).

The code is pure R, so it should work on all platforms. I feel the zcat method is good enough for Linux/Mac (I do need to quote the file name sometimes), but too complex for Windows.

The code is inspired by R.utils, but I really don't like its default behavior of removing the input file. Also, I think the R.utils author just adapted the compressFile code for decompressFile. Compression does need separate calls to gzfile and bzfile, but for decompression you don't have to call gzfile, bzfile, and xzfile separately, because gzfile can read all of these formats (except zip, for which I used unzip).
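
For readers without access to that function, a minimal sketch of the same idea might look like this. It is not the author's temp_unzip: the name `with_decompressed` and the `chunk_size` parameter are hypothetical, and zip archives are deliberately left out. It relies on the documented behavior that a `gzfile()` connection opened for reading also handles bzip2- and xz-compressed files.

```r
# Hypothetical helper: decompress `path` (gz, bz2 or xz) into a temp file,
# call `fun` on that temp file, and remove the temp file afterwards.
with_decompressed <- function(path, fun, ..., chunk_size = 1e6) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)

  # gzfile() detects the compression type when reading.
  src <- gzfile(path, open = "rb")
  on.exit(close(src), add = TRUE)
  dst <- file(tmp, open = "wb")

  # Copy the decompressed bytes in chunks so the whole file never has to
  # sit in memory at once.
  repeat {
    bytes <- readBin(src, what = "raw", n = chunk_size)
    if (length(bytes) == 0L) break
    writeBin(bytes, dst)
  }
  close(dst)  # close before calling `fun` so all bytes are flushed to disk

  fun(tmp, ...)
}
```

With data.table installed, `with_decompressed("data.csv.gz", data.table::fread)` would be the equivalent of the temp_unzip call described above.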

Here are some benchmarks:

library(microbenchmark)
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")), 
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)
Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1

@frenchja frenchja commented May 12, 2017

One thing to note is that the zcat solution appears to work only if the file is in the directory from which R is launched:

Error in fread("zcat < data/directory/test.csv.gz") :
  File is empty: /var/folders/41/asdf_kj80000gn/T//RtmpwtAttt/fileebeb5e124cef

@map2085 map2085 commented May 17, 2017

I forgot about this issue and tried to fread a gz file, only to get a mysterious error that made me waste time, again, searching for the solution.

3 years later, still waiting for this elementary fix.

@frenchja frenchja commented May 17, 2017

After further exploration, my error above only occurs when there are spaces in the directory name:

fread("zcat < data/directory\ one/test.csv.gz")

But not with underscores:

fread("zcat < data/directory_two/test.csv.gz")

And can be alleviated by escaping the backslash again:

fread("zcat < data/directory\\ one/test.csv.gz")

Hope this helps. Otherwise, the zcat solution works fine.

@jaapwalhout jaapwalhout commented Feb 27, 2018

Another example on StackOverflow why this feature is needed:
data.table fread error - gzip file - set temporary directory

@webbp webbp commented Mar 1, 2018

How about:

library(readr)
DT = as.data.table(read_csv("myfile.gz"))

@mspivakov mspivakov commented Mar 1, 2018

Member

@MichaelChirico MichaelChirico commented Mar 2, 2018

@malcook malcook commented Mar 18, 2018

@frenchja - agree - though you might prefer to escape those spaces with R's shQuote

@byapparov byapparov commented Apr 17, 2018

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018

@swvanderlaan swvanderlaan commented Jun 28, 2018

@frenchja How would this work with paste0? I now have the code below, but it throws an error:

SOME_DIR = "/Users/swvanderlaan/some_dir"
data <- fread('zcat < paste0(SOME_DIR,"/somedata.txt.gz")',
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

Ah, got it. The paste0 call has to be outside the quoted string, so it should be this:

data <- fread(paste0("zcat < '", SOME_DIR, "/somedata.txt.gz", "'"),
              header = TRUE, na.strings = "NA",
              verbose = TRUE, showProgress = TRUE)

Member

@MichaelChirico MichaelChirico commented Jul 2, 2018

@swvanderlaan I tend to use sprintf for cases like this; you should also use file.path and shQuote to be platform-robust:

fread(sprintf('zcat %s', shQuote(file.path(SOME_DIR, 'somedata.txt.gz'))))
