
Add support for posterior predictive distributions #323

Open
tilmantroester opened this issue Oct 10, 2023 · 9 comments

Comments

@tilmantroester
Contributor

This requires functionality to draw samples of data vectors from the likelihood and to pass them back to the sampling framework.

The ability to draw samples from the likelihood is useful in other contexts as well, such as generating mock data vectors.
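For a Gaussian likelihood, drawing a mock data vector is just a multivariate-normal draw around the theory vector. A minimal sketch (nothing here is existing Firecrown API; the function name and arguments are illustrative):

```python
# Minimal sketch of drawing a mock data vector from a Gaussian
# likelihood; `theory` and `cov` would come from the likelihood
# object. This is illustrative, not existing Firecrown API.
import numpy as np

def draw_mock_data(theory, cov, rng=None):
    """Draw one realization of the data vector from N(theory, cov)."""
    rng = rng or np.random.default_rng()
    return rng.multivariate_normal(mean=theory, cov=cov)
```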

@vitenti
Collaborator

vitenti commented Oct 10, 2023

This is already supported by NumCosmo's connector. Moreover, in Augur you can find code to do this; see for example srd_y1_3x2_like.py, where a data vector is generated from a theory vector and a likelihood is built that can be used by any framework.

@marcpaterno
Collaborator

@tilmantroester is there something that Augur does, or some part of what it does, that you think should be moved from Augur to Firecrown?

@tilmantroester
Contributor Author

There are two reasons why I think this should be in Firecrown:
One reason is that it's easiest to create the PPD draws while sampling, rather than trying to create them after the fact. For a Gaussian likelihood with fixed covariance, doing it in a post-processing step is relatively straightforward if the model predictions get saved during sampling, but for other likelihoods this might require re-evaluating the likelihood at a large number of points, which we want to avoid.
Drawing posterior predictive samples conditioned on parts of the data vector is probably easier to do in Firecrown as well, since the description of how the data vector is structured is readily available there (see the sketch below).
The other reason is that I might want to be able to use Firecrown to generate mock data without the Augur dependency, especially when building experimental pipelines.
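For the conditional case, with a Gaussian likelihood this reduces to the standard conditional-normal formulas. A minimal sketch, where the index arrays idx_a (entries to predict) and idx_b (entries to condition on) are hypothetical inputs that Firecrown's knowledge of the data-vector layout would supply:

```python
# Hedged sketch of a conditional posterior predictive draw for a
# Gaussian likelihood: sample the entries idx_a of the data vector
# conditioned on the observed entries idx_b. Nothing here is existing
# Firecrown API; the index arrays are illustrative.
import numpy as np

def conditional_draw(theory, cov, data, idx_a, idx_b, rng=None):
    """Draw x[idx_a] | x[idx_b] = data[idx_b] from N(theory, cov)."""
    rng = rng or np.random.default_rng()
    mu_a, mu_b = theory[idx_a], theory[idx_b]
    cov_aa = cov[np.ix_(idx_a, idx_a)]
    cov_ab = cov[np.ix_(idx_a, idx_b)]
    cov_bb = cov[np.ix_(idx_b, idx_b)]
    # Standard Gaussian conditioning: mean shift and Schur complement.
    mean = mu_a + cov_ab @ np.linalg.solve(cov_bb, data[idx_b] - mu_b)
    cond_cov = cov_aa - cov_ab @ np.linalg.solve(cov_bb, cov_ab.T)
    return rng.multivariate_normal(mean, cond_cov)
```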

@joezuntz
Collaborator

The ability to return data vectors is also useful for general debugging, and I'd recommend saving the information needed to do this.

However, in CosmoSIS I did find that the one case where this was slow compared to the likelihood evaluation itself was supernovae, so perhaps make it optional?

@vitenti
Collaborator

vitenti commented Oct 16, 2023

The CosmoSIS connector presently includes a section in the DataBlock labeled data_vector. This section contains three elements: firecrown_theory (the theory vector), firecrown_data (the data vector), and firecrown_inverse_covariance (the inverse covariance). To have these components written in the output chains, you can list them in the extra_output option of the CosmoSIS .ini file.
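As a hedged sketch, assuming extra_output lives in the [pipeline] section of the .ini file and the data vector has a known length (222 here is a placeholder; the #N suffix requests a vector output of that length):

```ini
; Hypothetical fragment of a CosmoSIS .ini file. The section and
; element names are those exposed by the Firecrown connector; the
; length 222 is a placeholder that must match your data vector.
[pipeline]
extra_output = data_vector/firecrown_theory#222
```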

This behavior is automatically enabled for GaussFamily likelihoods, but the current implementation is not considered ideal. We are working to refine this process, with the goal of achieving the same outcome using DerivedParameter. The reason for the delay in implementing this change is the inherent difficulty of handling vector-valued derived parameters without resorting to appending _n to the derived parameter name to match the vector index.

Furthermore, as pointed out by @joezuntz, including a lengthy theory vector in the output chains can have a detrimental impact on processing speed. In NumCosmo, any data added to the output chains undergoes further processing, which includes computing statistics such as the mean, variance, autocorrelation, and more. Including an extensive theory vector in the output would therefore not only significantly slow down these processing tasks but also result in exceptionally large output files.

Thus, I think we should make this behavior optional and eventually move to a more general solution using DerivedParameter, so that all frameworks can use it equally. @tilmantroester, would you prefer a more complete solution where random draws are also performed from each theory + covariance?

@tilmantroester
Contributor Author

tilmantroester commented Oct 16, 2023

At this point I'm not too concerned about how this gets piped back to the sampling frameworks. For now I imagine just implementing a sample method in the likelihood class (a sketch follows below).
This could then optionally be put into some data block of the sampling framework.

As you said, treating theory or mock data vectors as derived parameters and dumping them into the chain output is at best cumbersome and at worst breaks the IO.
Dealing with derived data that isn't just a parameter is something the sampling frameworks would have to implement, I think. I don't know whether such functionality exists in CosmoSIS yet @joezuntz
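As a hypothetical sketch of what such a sample method could look like on a GaussFamily-style likelihood (class and method names are illustrative, not Firecrown's actual API):

```python
# Hypothetical sketch of a `sample` method on a Gaussian likelihood;
# names and structure are illustrative, not Firecrown's actual API.
import numpy as np

class MockGaussianLikelihood:
    def __init__(self, data, cov):
        self.data = data
        self.cov = cov

    def compute_theory_vector(self, tools):
        # A real likelihood would evaluate the model here.
        raise NotImplementedError

    def sample(self, tools, rng=None):
        """Draw a mock data vector at the current parameters; a
        connector could optionally place the draw in the sampling
        framework's data block."""
        rng = rng or np.random.default_rng()
        mu = self.compute_theory_vector(tools)
        return rng.multivariate_normal(mu, self.cov)
```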

@joezuntz
Collaborator

Samplers in CosmoSIS (or scripts using it interactively) can fully access the data block containing all the products of a pipeline, including the data vectors, so yes, this is already there. It's used in the Fisher sampler, for example.

@tilmantroester
Contributor Author

Sorry, what I meant was: is there a way to efficiently save parts of the data block while sampling, independently of the default chain output?
For example, saving the theory vector at each chain sample to a file, taking care of the usual IO pitfalls like multiple MPI processes, and without an unwieldy extra_output option that needs an entry for each data point.

@joezuntz
Collaborator

Oh, I see. You can specify a vector output for extra_output, if you know the length in advance, by writing, e.g. for a length-222 data vector, extra_output = data_vector/2pt_theory#222. I know that's a bit annoying. I don't have another approach built in.
