Add .bgz file decompression to data.table::fread() (compatible with .gz) #5461

Closed
TMRHarrison opened this issue Sep 14, 2022 · 3 comments · Fixed by #5474

Comments

@TMRHarrison

Currently, .bgz files are read as plain text, which fails due to invalid characters. .bgz files are compatible with gunzip and start with the same two-byte header signature (0x1F 0x8B). Renaming *.bgz files to *.gz allows them to be decompressed normally.
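The signature check and the extension-independent decompression described above can be sketched in base R. This is only an illustration: base R cannot produce a real BGZF file, so the sketch writes ordinary gzip data under a `.bgz` name as a stand-in, and the variable names (`path`, `sig`, `is_gzip`) are made up for the example.

```r
# Stand-in for a real .bgz file: ordinary gzip data written under a .bgz name.
# (An actual BGZF file would come from bgzip; this is an assumption for the demo.)
path <- tempfile(fileext = ".bgz")
con <- gzfile(path, open = "w")
writeLines(c("some file information to compress",
             "arbitrary data 1",
             "data line 2 also"), con)
close(con)

# Read the first two raw bytes and compare them with the gzip signature.
sig <- readBin(path, what = "raw", n = 2L)
is_gzip <- identical(sig, as.raw(c(0x1F, 0x8B)))

# gzfile() decompresses based on the file contents, not on the extension.
con <- gzfile(path, open = "rt")
lines <- readLines(con)
close(con)
unlink(path)
```

If the check on the first two bytes succeeds, the file can be handed to `gzfile()` whatever its name is.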

Adding .bgz to the list of files that data.table::fread can decompress shouldn't require anything beyond R.utils. I think inserting ".bgz" after ".gz" in the vector on this line:

if ((w <- endsWithAny(file, c(".gz",".bz2"))) || (gzsig <- identical(head(file_signature, 2L), gz_signature)) || identical(head(file_signature, 3L), bz2_signature)) {

And changing the check from w==1L to w<=2L on this line:

FUN = if (w==1L || gzsig) gzfile else bzfile

Would allow fread to decompress .bgz files automatically. However, I haven't tested this.
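A minimal, standalone sketch of the proposed dispatch follows. It is not the data.table source: `endsWithAny` here is a hypothetical re-implementation of the internal helper named in the issue (assumed to return the index of the first matching suffix, or 0 if none match), and `pick_decompressor` is an illustrative name that returns the connection function's name as a string so the result is easy to inspect.

```r
# Hypothetical stand-in for data.table's internal helper: index of the first
# suffix that `x` ends with, or 0L when none match.
endsWithAny <- function(x, suffixes) {
  hit <- which(endsWith(x, suffixes))
  if (length(hit)) hit[1L] else 0L
}

gz_signature  <- as.raw(c(0x1F, 0x8B))
bz2_signature <- as.raw(c(0x42, 0x5A, 0x68))  # "BZh"

# Illustrative dispatcher: returns "gzfile", "bzfile", or NULL (not compressed).
pick_decompressor <- function(file, file_signature) {
  # ".bgz" sits in position 2, so positions 1 and 2 both mean gzip-compatible.
  w <- endsWithAny(file, c(".gz", ".bgz", ".bz2"))
  gzsig <- identical(head(file_signature, 2L), gz_signature)
  bzsig <- identical(head(file_signature, 3L), bz2_signature)
  if (w == 0L && !gzsig && !bzsig) return(NULL)
  if ((w >= 1L && w <= 2L) || gzsig) "gzfile" else "bzfile"
}

pick_decompressor("raw_data.txt.bgz", raw(0))       # extension alone selects gzfile
pick_decompressor("raw_data.txt", gz_signature)     # signature alone selects gzfile
pick_decompressor("raw_data.txt.bz2", raw(0))       # bz2 extension selects bzfile
```

Placing ".bgz" second in the vector keeps the gzip-compatible extensions contiguous, which is what lets a single `w<=2L` comparison cover both.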

@TMRHarrison changed the title from "Add .bgz file decompression (compatible with .gz)" to "Add .bgz file decompression to data.table::fread() (compatible with .gz)" on Sep 16, 2022

@MichaelChirico
Member

Do you have any .bgz files you could share for testing?

@TMRHarrison
Author

Sure, here are three files containing the same (uncompressed) contents:

raw_data.txt
raw_data.txt.gz
raw_data.txt.bgz.zip

GitHub doesn't allow .bgz file extensions, so I put the .bgz file in a .zip archive.

The MD5 sums are:

63b4ba65676e975cbe336c74ff489c3a  raw_data.txt
3298f1a8a311b732701e2cfac830860b  raw_data.txt.bgz
fed5bb8aff303ceb7ab1939b7d5e5005  raw_data.txt.gz

And the zcat output for all is:

some file information to compress
arbitrary data 1
data line 2 also

@MichaelChirico
Member

Adding a key insight from the bgzip man page (it's hinted at in the OP's post):

Bgzip compresses files in a similar manner to, and compatible with, gzip(1).

That is, any tool that can read gz files should also be able to read bgz files. Generating bgz files would be another story, but we don't do that in fread().
