
Aggregator for csv building #1087

@Jammy2211

Description


You should first read the AggregatorPng issue, as the general design and idea are the same:

#1086

Use Case:

Similar to .png splicing, it was common for me to want to view the numerical results of lens modeling in a single
.csv file, rather than navigating the output folder to find the information I needed.

Implementation:

Here is an example of the .csv file for the 2 example images on the agg_png_csv URL above:

https://github.com/Jammy2211/autolens_workspace_test/blob/main/agg_png_csv/result.csv

The headers in this .csv file come from different parts of the dataset and output folders, so finding the same
information by hand requires navigating the folders and combining data from multiple files.

Here is the example .csv-maker Python script I was using, which is a bit of a mess but works:

https://github.com/Jammy2211/autolens_workspace_test/blob/main/agg_png_csv/csv_make.py

AggregatorCSV

I am picturing an AggregatorCSV class that would take the output folder and help build the .csv file.

Using Output Folder

For the example on the workspace, all information used to build the .csv file comes from the info.json and
result.json files in the dataset folder.

So you would just need to write an AggregatorCSV that navigates the dataset folder and loads the info.json and
result.json files, and produces the .csv file.
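
The folder-walking and merging logic could look something like the minimal sketch below. This is not the proposed AggregatorCSV API, just an illustration of the mechanics: the function name `build_csv` and its parameters are hypothetical, and it assumes one sub-folder per dataset, each containing info.json and result.json with flat key/value pairs.

```python
import csv
import json
from pathlib import Path


def build_csv(output_path, csv_path, keys):
    """Hypothetical sketch: walk each dataset folder under `output_path`,
    merge its info.json and result.json, and write one CSV row per folder.
    `keys` selects which merged entries become CSV columns; missing keys
    produce a blank cell."""
    rows = []
    for folder in sorted(Path(output_path).iterdir()):
        if not folder.is_dir():
            continue
        merged = {}
        for name in ("info.json", "result.json"):
            file = folder / name
            if file.exists():
                merged.update(json.loads(file.read_text()))
        rows.append({key: merged.get(key, "") for key in keys})

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(rows)
```

The real class would presumably layer the `add_column` API (below) on top of a loop like this, rather than taking a fixed list of keys.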

My .csv building never used the output folder, but only because I wrote a rather complicated pipeline (in a hurry)
which output all results to the dataset folder.

I think the general use case would be that the output folder is used to extract all information whenever possible.

For example, the einstein_radius_max_lh is stored here:

(Screenshot: the output folder, showing the samples_summary.json file.)

Where samples_summary.json has all the information on the model and instance, with an einstein radius at this part:

```json
"median_pdf_sample": {
    "type": "instance",
    "class_path": "autofit.non_linear.samples.sample.Sample",
    "arguments": {
        "log_likelihood": 21775.911731379132,
        "log_prior": 1.4651345875102018,
        "weight": 6.53867998265804e-05,
        "kwargs": {
            "type": "dict",
            "arguments": {
                "galaxies.lens.bulge.profile_list.59.centre.centre_0": -0.07025111305343998,
                "galaxies.lens.bulge.profile_list.59.centre.centre_1": -0.02258690706633907,
                "galaxies.lens.bulge.profile_list.29.ell_comps.ell_comps_0": 0.250115244578848,
                "galaxies.lens.bulge.profile_list.29.ell_comps.ell_comps_1": -0.18539079820818452,
                "galaxies.lens.bulge.profile_list.59.ell_comps.ell_comps_0": 0.05644153124215467,
                "galaxies.lens.bulge.profile_list.59.ell_comps.ell_comps_1": -0.16025101750452186,
                "galaxies.source.bulge.profile_list.19.centre.centre_0": -0.02911164136961475,
                "galaxies.source.bulge.profile_list.19.centre.centre_1": 0.16875116292548728,
                "galaxies.source.bulge.profile_list.19.ell_comps.ell_comps_0": -0.3747737836275342,
                "galaxies.source.bulge.profile_list.19.ell_comps.ell_comps_1": -0.191211048679011,
                "galaxies.lens.mass.ell_comps.ell_comps_0": 0.12954201123863499,
                "galaxies.lens.mass.ell_comps.ell_comps_1": -0.09446519305629113,
                "galaxies.lens.mass.einstein_radius": 0.8099997493702147,
                "galaxies.lens.shear.gamma_1": -0.04293296870204821,
                "galaxies.lens.shear.gamma_2": 0.10475246785931866
            }
        }
    }
},
```
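
Because the kwargs in samples_summary.json are keyed by their full dotted model paths, resolving a `model_path` is a flat dictionary lookup rather than a recursive descent. A sketch of such a lookup helper (the function name `value_at_model_path` is hypothetical; the nesting follows the JSON fragment above):

```python
def value_at_model_path(samples_summary, model_path, sample="median_pdf_sample"):
    """Hypothetical helper: look up a dotted model_path (e.g.
    'galaxies.lens.mass.einstein_radius') in the flat kwargs dict of a
    loaded samples_summary.json, for the chosen sample entry."""
    return samples_summary[sample]["arguments"]["kwargs"]["arguments"][model_path]


# Minimal fragment mirroring the structure shown above.
summary = {
    "median_pdf_sample": {
        "type": "instance",
        "arguments": {
            "kwargs": {
                "type": "dict",
                "arguments": {
                    "galaxies.lens.mass.einstein_radius": 0.8099997493702147,
                },
            },
        },
    },
}

value_at_model_path(summary, "galaxies.lens.mass.einstein_radius")
```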

Therefore an API which lets the user choose the .csv headers via the model_path would be ideal, something like:

```python
agg = AggregatorCSV.from_directory(
    directory=path.join("output"),
)

agg.add_column(
    folder=source_lp[1],
    name="einstein_radius_max_lh",
    model_path="galaxies.lens.mass.einstein_radius",
)
```

Errors

Note that samples_summary.json also stores errors on parameters, so the API above should be extended to specify the
error on the parameter, e.g.:

```python
agg.add_column(
    folder=source_lp[1],
    name="einstein_radius_max_lh",
    model_path="galaxies.lens.mass.einstein_radius",
    error="errors_at_sigma_3",  # this string is in the `samples_summary.json` file
)
```
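
Internally, the error lookup could mirror the value lookup. The layout of the error entries is not shown above, so the sketch below assumes an `errors_at_sigma_3` entry keyed by the same dotted model paths as the median-PDF kwargs; the function name `value_and_error` and the example data are illustrative only and should be checked against a real samples_summary.json file.

```python
def value_and_error(samples_summary, model_path, error="errors_at_sigma_3"):
    """Hypothetical sketch: return (value, error) for one parameter.

    ASSUMPTION: the error entry is a flat dict keyed by the same dotted
    model paths as the median_pdf_sample kwargs."""
    args = samples_summary["median_pdf_sample"]["arguments"]["kwargs"]["arguments"]
    return args[model_path], samples_summary[error][model_path]


# Illustrative data only (not copied from a real file).
summary = {
    "median_pdf_sample": {
        "arguments": {
            "kwargs": {
                "arguments": {"galaxies.lens.mass.einstein_radius": 0.81},
            },
        },
    },
    "errors_at_sigma_3": {"galaxies.lens.mass.einstein_radius": 0.05},
}
```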

Latent Variable API

The AggregatorCSV should also support the latent variable API, so that the user can use the model_path to access the
latent variables of the model.

The example GitHub repo does not have latent variables, but the AggregatorCSV should be designed to support them,
as they will just be in a latent_summary.json file analogous to the samples_summary.json file.

Errors should also be supported for latent variables.

Manual Function API

From samples_summary.json the AggregatorCSV can create an instance of the maximum likelihood or median PDF model.

A user may want to compute a quantity from this instance and add it to the .csv file. This value may be something
you could add as a latent variable, but let's pretend the user forgot to add it before running the pipeline or doesn't
want loads of latent variables in the code.

The following API could allow the user to do this:

```python
agg = AggregatorCSV.from_directory(
    directory=path.join("output"),
)


def einstein_radius_x2_from(instance):
    einstein_radius = instance.galaxies.lens.mass.einstein_radius
    return einstein_radius * 2.0


agg.add_column(
    folder=source_lp[1],
    name="einstein_radius_x2_max_lh",
    latent_func=einstein_radius_x2_from,
    use_max_lh_instance=True,  # as opposed to the median PDF instance
)
```

Manual Function with Samples API

The example above uses samples_summary.json to create an instance of the maximum likelihood or median PDF model.
It does not use the full set of non-linear search samples and therefore cannot provide an error estimate on the
quantity computed.

The samples are fully included in samples.json and the AggregatorCSV should support the following API to compute
a quantity from the samples and add it to the .csv file, which can then include an error estimate:

```python
agg = AggregatorCSV.from_directory(
    directory=path.join("output"),
)


def einstein_radius_x2_via_samples_from(samples):

    random_draws = 50

    einstein_radius_x2_list = []

    for i in range(random_draws):

        instance = samples.draw_randomly_via_pdf()

        ell_comps = instance.galaxies.lens.mass.ell_comps

        einstein_radius_x2 = al.convert.einstein_radius_x2_from(ell_comps=ell_comps)

        einstein_radius_x2_list.append(einstein_radius_x2)

    median_einstein_radius_x2, lower_einstein_radius_x2, upper_einstein_radius_x2 = af.marginalize(
        parameter_list=einstein_radius_x2_list,
        sigma=3.0,
    )

    return median_einstein_radius_x2, lower_einstein_radius_x2, upper_einstein_radius_x2


agg.add_x3_columns_with_errors(
    folder=source_lp[1],
    name="einstein_radius_x2",
    samples_func=einstein_radius_x2_via_samples_from,
)
```

Missing .json Files

A user can disable the output of samples.json, so the AggregatorCSV should raise a warning if the user tries to use
the samples_func API when the samples.json file is not present, and leave the column in the .csv file blank.
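
The warn-and-blank behaviour could be as simple as the sketch below. The function name `samples_column_value` is hypothetical, and it stands in for whatever per-folder loading the real AggregatorCSV would do: if samples.json is absent it warns and returns an empty string so the CSV cell stays blank, otherwise it loads the file and applies the user's function.

```python
import json
import warnings
from pathlib import Path


def samples_column_value(folder, samples_func):
    """Hypothetical sketch: compute a column value from samples.json,
    or warn and return "" (a blank CSV cell) if the file is missing."""
    samples_path = Path(folder) / "samples.json"
    if not samples_path.exists():
        warnings.warn(
            f"samples.json not found in {folder}; leaving the column blank."
        )
        return ""
    samples = json.loads(samples_path.read_text())
    return samples_func(samples)
```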
