Feature request: Can we have a more compact output format than CSV, such as Parquet? #3332

Open
jachymb opened this issue Feb 17, 2025 · 1 comment

jachymb commented Feb 17, 2025

I run experiments in which the output CSV file easily exceeds 100 GiB. One example is fitting a model with a Gaussian process as a latent variable: there is essentially one parameter per data point, and every one of them is repeated on each row of the output file. Running this many times over for different inputs makes it challenging to even manage the file storage, and simply reading a file into memory becomes tricky.

It would be great to have the option to store the outputs directly in other formats. Apache Parquet and Avro in particular are popular in data science; they use a more compact data representation with compression on top, and they integrate naturally with other big-data tooling.

Personally, I would favor Parquet. It is a columnar format, which would be suitable if we want to discard columns with nuisance parameters or the sampler's runtime values (I mean values like stepsize__, etc.) from the stored Stan output without unnecessary computational overhead (i.e. without processing the entire file). It also supports structured values, which means a vector or matrix parameter could be stored in a single column, making the whole thing easier to parse than the CSV.

mitzimorris (Member) commented

This is a planned feature; see https://github.com/stan-dev/design-docs/blob/master/designs/0032-stan-output-formats.md

WardBrian added the i/o label Mar 27, 2025