Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fwrite gzip #3288

Merged
merged 65 commits into from Apr 23, 2019
Merged

Fwrite gzip #3288

merged 65 commits into from Apr 23, 2019

Conversation

philippechataignon
Copy link
Contributor

@philippechataignon philippechataignon commented Jan 16, 2019

Closes #2016
Closes #2974
Closes #2612

I reopen a PR to discuss about its contents. Its a complete rewrite of #3278 : it doesn't use gzopen/gzwrite/gzclose because only one thread compress and that's slow.

In fwrite, output file is always open in binary mode. The idea is to take data from uncompressed buffer in normal case and to compress it and take data from compressed buffer when gzip is active. The code for managing file output is quite the same in normal and gzip mode.

The main difference concern the buffers :

  • there are no realloc in this version : if the buffer is too small, fwrite stops and suggests to increase the buffMB parameter. Its too difficult to realloc the compressed and the uncompressed buffer. But, in fact, actual buffer policy doesn't work very good despite realloc. See #2974 #2612.
  • at the beginning, fwrite compute the max line length. Actually it tries to write in the buffer and add the ending position. But if the buffer is too small, its too late : buffer overflow occurs. Thats what happens in #2974. In this PR, I use strlen for string/factor and an maximum length integer by type (9 for 32 bits, 19 for int 64 for example). It's not a problem to overestimate this line.

Thanks for any advice on this PR.

Philippe Chataignon and others added 30 commits Jan 14, 2019
Use zlib and gzopen/gzwrite/gzclose function to write buffer directly in
a gzipped csv file.

zlib is thread-safe and the gzip compression use the fwrite threads.

Option compress="gzip" is added to fwrite et is automatically set when file ends
with ".gz"
In fwrite, compress has now 3 options :

* default : gzip if file ends with .gz, else csv
* none : force csv
* gzip : force gzip
Before this commit, every column is output one by one.
@mattdowle mattdowle added this to the 1.12.4 milestone Apr 16, 2019
@codecov
Copy link

@codecov codecov bot commented Apr 16, 2019

Codecov Report

Merging #3288 into master will increase coverage by 0.29%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3288      +/-   ##
==========================================
+ Coverage   96.71%   97.01%   +0.29%     
==========================================
  Files          66       66              
  Lines       12424    12475      +51     
==========================================
+ Hits        12016    12102      +86     
+ Misses        408      373      -35
Impacted Files Coverage Δ
src/fwrite.c 97.59% <100%> (+6.58%) ⬆️
R/fwrite.R 98.24% <100%> (+1.94%) ⬆️
src/fwriteR.c 98.02% <100%> (+3.11%) ⬆️
src/nafill.c 100% <0%> (ø) ⬆️
src/freadR.c 96.28% <0%> (+0.12%) ⬆️
src/uniqlist.c 98.98% <0%> (+0.5%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c30e161...7665a8d. Read the comment docs.

@mattdowle
Copy link
Member

@mattdowle mattdowle commented Apr 23, 2019

Fantastic PR -- thanks @philippechataignon.

@mattdowle mattdowle merged commit 9559d26 into Rdatatable:master Apr 23, 2019
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants