I run experiments where the output CSV file easily grows beyond 100 GiB. One example is fitting a model with a Gaussian process as a latent variable: there is essentially one parameter per data point, and these are repeated on every row of the output file. Running this many times over for different inputs makes even managing the file storage challenging, and reading the file into memory becomes tricky as well.
It would be great to have the option to store the outputs directly in other formats. Apache Parquet and Avro in particular are popular in data science: they use a more compact data representation with compression on top, and they integrate naturally with other big-data tooling.
Personally, I would favor Parquet. It is a columnar format, which would be useful if we want to discard columns with nuisance parameters or runtime values (e.g. stepsize__) from the stored Stan output without unnecessary computational overhead (i.e. without processing the entire file). It also supports structured values, which means a vector/matrix parameter could be stored in a single column, making the output easier to parse than the CSV.