Rechunker script #686

Merged
merged 21 commits into from May 6, 2022

@JoranAngevaare JoranAngevaare commented Apr 9, 2022

What is the problem / what does the code in this PR do
Simple script to rechunk / recompress a source folder into another folder.

Example

An example of re-chunking a given folder several times with different compressors.

for COMPRESSOR in zstd bz2 lz4 blosc zstd; do echo $COMPRESSOR; \
   python /home/joran/software/strax/bin/rechunker \
   --source `pwd`/009104-raw_records-rfzvpzj4mf --write_stats_to test.csv --compressor $COMPRESSOR; done

python -c  "import pandas as pd; df=pd.read_csv('test.csv'); df['read_mbs'] = df['uncompressed_mb']/df['load_time']; df['write_mbs']=df['uncompressed_mb']/df['write_time']; print(df.to_string())"

Gives:

Rechunking /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf to /tmp/tmpu37youjh/009104-raw_records-rfzvpzj4mf
move /tmp/tmpu37youjh/009104-raw_records-rfzvpzj4mf to /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
Re-compressed /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
        load_time               6.999560356140137
        write_time              6.778709173202515
        uncompressed_mb         2098.762584
        source_compressor       zstd
        dest_compressor         zstd
        source_mb               608.937729
        dest_mb                 608.937728
bz2
Rechunking /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf to /tmp/tmp07pba1wr/009104-raw_records-rfzvpzj4mf
move /tmp/tmp07pba1wr/009104-raw_records-rfzvpzj4mf to /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
Re-compressed /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
        load_time               7.226304531097412
        write_time              204.91163182258606
        uncompressed_mb         2098.762584
        source_compressor       zstd
        dest_compressor         bz2
        source_mb               608.937728
        dest_mb                 440.363485
lz4
Rechunking /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf to /tmp/tmpl8ida2oy/009104-raw_records-rfzvpzj4mf
move /tmp/tmpl8ida2oy/009104-raw_records-rfzvpzj4mf to /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
Re-compressed /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
        load_time               94.58882331848145
        write_time              8.736302614212036
        uncompressed_mb         2098.762584
        source_compressor       bz2
        dest_compressor         lz4
        source_mb               440.363485
        dest_mb                 960.646327
blosc
Rechunking /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf to /tmp/tmprwyng6nj/009104-raw_records-rfzvpzj4mf
move /tmp/tmprwyng6nj/009104-raw_records-rfzvpzj4mf to /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
Re-compressed /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
        load_time               5.825917959213257
        write_time              5.93792986869812
        uncompressed_mb         2098.762584
        source_compressor       lz4
        dest_compressor         blosc
        source_mb               960.646327
        dest_mb                 1135.12115
zstd
Rechunking /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf to /tmp/tmpgf5x0_8f/009104-raw_records-rfzvpzj4mf
move /tmp/tmpgf5x0_8f/009104-raw_records-rfzvpzj4mf to /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
Re-compressed /mnt/d/strax_data/009104-raw_records-rfzvpzj4mf
        load_time               4.771927833557129
        write_time              10.231157779693604
        uncompressed_mb         2098.762584
        source_compressor       blosc
        dest_compressor         zstd
        source_mb               1135.12115
        dest_mb                 608.937728

   load_time  write_time  uncompressed_mb source_compressor dest_compressor    source_mb      dest_mb    read_mbs   write_mbs
0   6.999560    6.778709      2098.762584              zstd            zstd   608.937729   608.937728  299.842058  309.610950
1   7.226305  204.911632      2098.762584              zstd             bz2   608.937728   440.363485  290.433731   10.242281
2  94.588823    8.736303      2098.762584               bz2             lz4   440.363485   960.646327   22.188272  240.234648
3   5.825918    5.937930      2098.762584               lz4           blosc   960.646327  1135.121150  360.245819  353.450214
4   4.771928   10.231158      2098.762584             blosc            zstd  1135.121150   608.937728  439.814401  205.134417
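The throughput columns above are simply `uncompressed_mb` divided by the load and write times. As a minimal sketch of the same calculation without pandas, the snippet below parses the numbers copied from the table and also derives a compression ratio per destination compressor. The helper name `summarize` and the `dest_ratio` field are illustrative assumptions, not part of the rechunker script.

```python
import csv
import io

# Values copied from the table above (rounded as printed there).
STATS_CSV = """load_time,write_time,uncompressed_mb,source_compressor,dest_compressor,source_mb,dest_mb
6.999560,6.778709,2098.762584,zstd,zstd,608.937729,608.937728
7.226305,204.911632,2098.762584,zstd,bz2,608.937728,440.363485
94.588823,8.736303,2098.762584,bz2,lz4,440.363485,960.646327
5.825918,5.937930,2098.762584,lz4,blosc,960.646327,1135.121150
4.771928,10.231158,2098.762584,blosc,zstd,1135.121150,608.937728
"""


def summarize(csv_text):
    """Return read/write throughput (MB/s) and compression ratio per row.

    `summarize` and `dest_ratio` are hypothetical names for illustration.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uncompressed = float(row["uncompressed_mb"])
        rows.append({
            "source": row["source_compressor"],
            "dest": row["dest_compressor"],
            # Throughput is measured against the uncompressed size
            "read_mbs": uncompressed / float(row["load_time"]),
            "write_mbs": uncompressed / float(row["write_time"]),
            # How many times smaller the destination is on disk
            "dest_ratio": uncompressed / float(row["dest_mb"]),
        })
    return rows


for r in summarize(STATS_CSV):
    print(f"{r['source']:>5} -> {r['dest']:<5} "
          f"read {r['read_mbs']:7.1f} MB/s  "
          f"write {r['write_mbs']:7.1f} MB/s  "
          f"ratio {r['dest_ratio']:.2f}")
```

Read this way, the numbers suggest that bz2 gives the smallest output for this dataset but is by far the slowest to write and to read back, while lz4 and blosc are fast at the cost of larger files.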

coveralls commented Apr 9, 2022

Coverage Status

Coverage increased (+0.1%) to 93.348% when pulling b35ce34 on rechunker into b69f614 on master.

@JoranAngevaare JoranAngevaare marked this pull request as ready for review April 28, 2022 13:15
Comment on lines +1858 to +1869

def wrapped_loader():
    """Wrapped loader for changing the target_size_mb"""
    while True:
        try:
            # pylint: disable=cell-var-from-loop
            data = next(loader)
            # Update target chunk size for re-chunking
            data.target_size_mb = md['chunk_target_size_mb']
        except StopIteration:
            return
        yield data
Member Author

Apparently this was a bug that I had not noticed before: you could only rechunk up to the default chunk target size (in MB).
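To illustrate the fix in isolation, here is a minimal, self-contained sketch of the same pattern: a generator that wraps a chunk loader and overrides the target size on every chunk before passing it on. The `Chunk` class, `load_chunks`, `with_target_size`, and the default of 200 MB are hypothetical stand-ins chosen for this example, not the strax classes or values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a data chunk; not the strax class."""
    data: bytes
    target_size_mb: int = 200  # assumed default, for illustration only


def load_chunks(n):
    """Hypothetical loader yielding n chunks with the default target size."""
    for i in range(n):
        yield Chunk(data=bytes([i]))


def with_target_size(loader, target_size_mb):
    """Yield chunks from `loader`, overriding each chunk's target size."""
    for chunk in loader:
        # Without this override the chunks keep their default target size,
        # which would cap how large the rewritten chunks can become.
        chunk.target_size_mb = target_size_mb
        yield chunk


chunks = list(with_target_size(load_chunks(3), target_size_mb=1000))
print([c.target_size_mb for c in chunks])  # prints [1000, 1000, 1000]
```

A plain `for` loop over the loader behaves the same as the `while True` / `next` / `StopIteration` form in the diff; passing the loader and target size as arguments also avoids closing over loop variables.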

Collaborator

@WenzDaniel WenzDaniel left a comment

Looks good. I added some small comments. I have not tried it myself; I trust the test results you are reporting here.

Member Author

Thanks Daniel for the review!

@JoranAngevaare JoranAngevaare merged commit f331fb0 into master May 6, 2022
@JoranAngevaare JoranAngevaare deleted the rechunker branch May 6, 2022 12:54