Support for parquet #81

Closed
HvanderStok opened this issue Feb 3, 2023 · 1 comment · Fixed by #85

@HvanderStok

Support for the Parquet data format could be added. Parquet is a binary format and, unlike HDF, it is compatible with newer Python versions. The example and test data could also be converted to Parquet so that they work with newer Python versions as well.

HvanderStok added the enhancement (New feature or request) label on Feb 3, 2023
HvanderStok self-assigned this on Feb 3, 2023
HvanderStok linked a pull request on Feb 9, 2023 that will close this issue
@HvanderStok

I did a small comparison of several data formats to showcase the Parquet options and their performance. I measured the saving and loading time for a TimeSeriesData object with 100000 rows and 10 columns, repeating each test 1000 times to get the mean and standard deviation.
With Parquet, we can choose between additional compression codecs and two engines. The results for all currently possible options are shown below:

Time saving data [100000 rows x 10 columns] 1000 times with fastparquet:

                      mean       std        min        max     bytes
csv               2.338247  0.058849   2.274467   3.115451  21183673
hdf               0.040134  0.003576   0.036772   0.110049   8986376
parquet           0.062957  0.027976   0.055957   0.929494   8803903
parquet.gzip      0.483152  0.007048   0.469280   0.543713   8146963
parquet.brotli   12.124559  0.130672  11.988043  12.747444   7671589
parquet.snappy    0.059237  0.005930   0.053979   0.140890   8795446
parquet.lz4       0.058630  0.005725   0.054529   0.110041   8834165
parquet.lz4_raw   0.058287  0.005368   0.054177   0.099774   8834165
parquet.zstd      0.093887  0.007336   0.088107   0.161452   8186876

Time loading data [100000 rows x 10 columns] 1000 times with fastparquet:

                     mean       std       min       max     bytes
csv              0.285219  0.005293  0.281777  0.366445  21183673
hdf              0.024363  0.001851  0.023099  0.042010   8986376
parquet          0.009746  0.002608  0.008971  0.085448   8803903
parquet.gzip     0.052403  0.001092  0.051588  0.067765   8146963
parquet.brotli   0.113753  0.001802  0.112794  0.156622   7671589
parquet.snappy   0.010721  0.000969  0.010064  0.027777   8795446
parquet.lz4      0.011279  0.001013  0.010564  0.017939   8834165
parquet.lz4_raw  0.011339  0.001066  0.010561  0.018789   8834165
parquet.zstd     0.030967  0.001932  0.029730  0.057159   8186876

Time saving data [100000 rows x 10 columns] 1000 times with pyarrow:

                    mean       std       min       max     bytes
csv             2.264780  0.067000  2.234946  4.250321  21183673
hdf             0.035840  0.021567  0.033657  0.709676   8986376
parquet         0.117107  0.006740  0.113209  0.283405  11149595
parquet.gzip    0.682571  0.005693  0.675007  0.744465  10345249
parquet.brotli  0.582941  0.013460  0.563607  0.717666  10056300
parquet.snappy  0.121738  0.003831  0.118709  0.172192  11141303
parquet.lz4     0.119413  0.003114  0.115966  0.166673  11189052
parquet.zstd    0.141238  0.003858  0.138170  0.211656  10525413

Time loading data [100000 rows x 10 columns] 1000 times with pyarrow:

                    mean       std       min       max     bytes
csv             0.285593  0.004102  0.282070  0.316594  21183673
hdf             0.023983  0.001648  0.022952  0.038274   8986376
parquet         0.017194  0.002561  0.016110  0.089088  11149595
parquet.gzip    0.023788  0.001226  0.022816  0.033830  10345249
parquet.brotli  0.030600  0.001979  0.028827  0.042050  10056300
parquet.snappy  0.018700  0.001288  0.017704  0.026472  11141303
parquet.lz4     0.018704  0.001341  0.017522  0.028246  11189052
parquet.zstd    0.020070  0.001368  0.019193  0.031149  10525413
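
For anyone who wants to reproduce a comparison like the one above, here is a minimal sketch of how such a measurement could be set up. It is not the script used for the tables: it assumes a plain pandas DataFrame filled with random floats instead of a TimeSeriesData object, and the names `benchmark` and `parquet_benchmark` are hypothetical helpers introduced only for illustration. The pandas calls `DataFrame.to_parquet` and `read_parquet` with their `engine` and `compression` arguments are the only library API it relies on.

```python
import os
import statistics
import tempfile
import time


def benchmark(func, repeats=1000):
    """Call func() `repeats` times and summarise the wall-clock durations (seconds)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        # Sample standard deviation; needs at least two measurements.
        "std": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


def parquet_benchmark(rows=100_000, cols=10, repeats=1000,
                      engine="pyarrow", compression=None):
    """Time saving and loading one DataFrame as Parquet (hypothetical helper).

    Assumption: random float data stands in for the real test data.
    pandas, numpy and the chosen engine must be installed.
    """
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(rows, cols),
                      columns=[f"col_{i}" for i in range(cols)])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bench.parquet")
        save = benchmark(
            lambda: df.to_parquet(path, engine=engine, compression=compression),
            repeats,
        )
        load = benchmark(lambda: pd.read_parquet(path, engine=engine), repeats)
        # File size on disk after the last save, reported with both results.
        save["bytes"] = load["bytes"] = os.path.getsize(path)
    return save, load
```

A call such as `parquet_benchmark(engine="fastparquet", compression="zstd", repeats=10)` would return one summary dict for saving and one for loading. Note that the sketch passes `compression` explicitly; if the argument is left out of `to_parquet` itself, pandas uses snappy by default.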
