Database: Samples
=================

After fitting a large suite of data, we can use the aggregator to load the database's results. We can then
manipulate, interpret and visualize them using a Python script or Jupyter notebook.

This script uses the results generated by the script `/autolens_workspace/database/tutorial_1_introduction.py`, which
fitted 3 simulated datasets with:

 - An `Isothermal` `MassProfile` for the lens galaxy's mass.
 - A `Sersic` `LightProfile` representing a bulge for the source galaxy's light.

__Samples__

This script covers how to manipulate the `Samples` object returned from a *PyAutoLens* model-fit, which you have most
likely already encountered when analysing the results of a model-fit.

We'll learn how to use the database and `Aggregator`!

In [None]:
%matplotlib inline
from pyprojroot import here
workspace_path = str(here())
%cd $workspace_path
print(f"Working Directory has been set to `{workspace_path}`")

from os import path
import autofit as af
import autolens.plot as aplt

__Database File__

The results are not contained in the `output` folder after each search completes. Instead, they are
contained in the `database.sqlite` file, which we can load using the `Aggregator`.

In [None]:
database_file = "database.sqlite"
agg = af.Aggregator.from_database(filename=database_file)

__Generators__

The `start_here.py` database example gives an explanation of what Python generators are and why and how they are used.
Refer back to that example if you are unsure.

In [None]:
samples_gen = agg.values("samples")
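
As a reminder of how generators behave, here is a minimal sketch in plain Python. It does not use PyAutoFit: `load_result` and `results_generator` are hypothetical stand-ins for loading one fit's samples from the database. A generator yields one result at a time, so only one needs to be held in memory, and it is exhausted after a single pass.

```python
def load_result(index):
    # Hypothetical stand-in for an expensive load of one fit's samples.
    return [float(index)] * 3


def results_generator(total):
    # Yield one result at a time instead of building a full list in memory.
    for index in range(total):
        yield load_result(index)


gen = results_generator(total=3)

first_pass = [result[0] for result in gen]
second_pass = [result[0] for result in gen]  # the generator is now exhausted

print(first_pass)
print(second_pass)
```

This behaviour is consistent with the cells in this script calling `agg.values("samples")` afresh each time they loop over the results.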

__Samples__

The `Samples` class contains all the parameter samples, which is a list of lists where:
 
 - The outer list is the size of the total number of samples.
 - The inner list is the size of the number of free parameters in the fit.

In [None]:
for samples in agg.values("samples"):
    print("All parameters of the very first sample")
    print(samples.parameter_lists[0])
    print("The third parameter of the tenth sample")
    print(samples.parameter_lists[9][2])
    print()

print("Samples: \n")
print(agg.values("samples"))
print()
print("Total Samples Objects = ", len(agg), "\n")
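
The list-of-lists layout described above can be sketched with invented numbers. The values below are hypothetical (4 samples of a model with 3 free parameters) and are not taken from a real fit.

```python
# Hypothetical samples: 4 samples of a model with 3 free parameters.
parameter_lists = [
    [0.01, -0.02, 1.60],
    [0.00, 0.01, 1.58],
    [0.02, 0.00, 1.61],
    [-0.01, 0.01, 1.59],
]

# The outer list has one entry per sample.
total_samples = len(parameter_lists)

# Each inner list has one entry per free parameter.
total_free_parameters = len(parameter_lists[0])

# All sampled values of the third parameter (index 2), one per sample.
third_parameter_values = [sample[2] for sample in parameter_lists]

print(total_samples, total_free_parameters)
print(third_parameter_values)
```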

The `Samples` class contains the log likelihood, log prior, log posterior and weight of every sample, where:

 - The log likelihood is the value evaluated from the likelihood function (e.g. -0.5 * (chi_squared + 
 noise_normalization)).

 - The log prior encodes information on how the priors on the parameters map the log likelihood value to the log
 posterior value.

 - The log posterior is log_likelihood + log_prior.

 - The weight gives information on how samples should be combined to estimate the posterior. The weight values 
 depend on the sampler used; for example, for MCMC they will all be 1.

In [None]:
for samples in agg.values("samples"):
    print("log(likelihood), log(prior), log(posterior) and weight of the tenth sample.")
    print(samples.log_likelihood_list[9])
    print(samples.log_prior_list[9])
    print(samples.log_posterior_list[9])
    print(samples.weight_list[9])
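
The relationship between these lists can be illustrated with a short sketch in plain Python. All numbers are invented, and the weighted mean is just one simple example of combining samples using their weights; it is not a statement of how PyAutoFit estimates the posterior internally.

```python
# Hypothetical values for three samples of a two-parameter model.
log_likelihood_list = [-12.0, -10.5, -10.0]
log_prior_list = [-1.0, -0.5, -0.25]
weight_list = [0.1, 0.3, 0.6]
parameter_lists = [[1.0, 0.2], [1.2, 0.3], [1.3, 0.4]]

# The log posterior of each sample is log_likelihood + log_prior.
log_posterior_list = [
    log_likelihood + log_prior
    for log_likelihood, log_prior in zip(log_likelihood_list, log_prior_list)
]

# Weighted mean of each parameter, combining samples via their weights.
total_weight = sum(weight_list)
weighted_mean = [
    sum(weight * params[i] for weight, params in zip(weight_list, parameter_lists))
    / total_weight
    for i in range(len(parameter_lists[0]))
]

print(log_posterior_list)
print(weighted_mean)
```

If every weight were 1 (as for MCMC), the weighted mean would reduce to the ordinary mean of the samples.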

__Maximum Likelihood Model__

We can use the outputs to create a list of the maximum log likelihood model of each fit to our three images.

In [None]:
ml_vector = [
    samps.max_log_likelihood(as_instance=False) for samps in agg.values("samples")
]

print("Max Log Likelihood Model Parameter Lists: \n")
print(ml_vector, "\n\n")
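
Conceptually, the maximum log likelihood vector is the parameter list of the sample with the highest log likelihood. The sketch below shows this with invented numbers in plain Python; it illustrates the idea and is not PyAutoFit's internal implementation.

```python
# Hypothetical log likelihoods and parameters of four samples.
log_likelihood_list = [-12.0, -10.5, -10.0, -11.2]
parameter_lists = [[1.0, 0.2], [1.2, 0.3], [1.3, 0.4], [1.1, 0.25]]

# Index of the sample with the highest log likelihood.
best_index = max(
    range(len(log_likelihood_list)), key=log_likelihood_list.__getitem__
)

# The parameter vector of that sample.
max_log_likelihood_vector = parameter_lists[best_index]

print(best_index, max_log_likelihood_vector)
```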

__Parameter Names__

Vectors return lists of all model parameter values, but do not tell us which values correspond to which parameters.

The following quantities are available in the `Model`, where the order of their entries corresponds to the parameters 
in the `ml_vector` above:
 
 - `paths`: a list of tuples which give the path of every parameter in the `Model`.
 - `parameter_names`: a list of shorthand parameter names derived from the `paths`.
 - `parameter_labels`: a list of parameter labels used when visualizing non-linear search results (see below).

In [None]:
for samples in agg.values("samples"):
    model = samples.model
    print(model)
    print(model.paths)
    print(model.parameter_names)
    print(model.parameter_labels)
    print()
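
Because the entries of `paths` follow the same order as a parameter vector, the two can be paired up. The sketch below uses two paths that appear in this script, with invented parameter values; the dictionary `named_values` is a hypothetical convenience and not part of the PyAutoFit API.

```python
# Two parameter paths in the form used by the model (tuples of strings).
paths = [
    ("galaxies", "lens", "mass", "einstein_radius"),
    ("galaxies", "source", "bulge", "sersic_index"),
]

# Hypothetical parameter vector, in the same order as `paths`.
ml_vector = [1.6, 2.5]

# Pair every path (joined with dots) with its value.
named_values = {".".join(path): value for path, value in zip(paths, ml_vector)}

for name, value in named_values.items():
    print(name, "=", value)
```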

These lists will be used later for visualization. However, it is often more useful to create the model instance of every fit.

In [None]:
ml_instances = [samps.max_log_likelihood() for samps in agg.values("samples")]
print("Maximum Log Likelihood Model Instances: \n")
print(ml_instances, "\n")

__Instances__

A model instance contains all the model components of our fit, for example the list of galaxies we specified during 
model composition.

In [None]:
print(ml_instances[0].galaxies)
print(ml_instances[1].galaxies)
print(ml_instances[2].galaxies)

These galaxies will be named according to the search (in this case, `lens` and `source`).

In [None]:
print(ml_instances[0].galaxies.lens)
print()
print(ml_instances[1].galaxies.source)

Their `LightProfile`'s and `MassProfile`'s are also named according to the search.

In [None]:
print(ml_instances[0].galaxies.lens.mass)
print(ml_instances[1].galaxies.source.bulge)

__Median PDF__

We can access the `median pdf` model, which is the model computed by marginalizing over the samples of every parameter 
in 1D and taking the median of this PDF.

In [None]:
mp_instances = [samps.median_pdf() for samps in agg.values("samples")]

print("Median PDF Model Instances: \n")
print(mp_instances, "\n")
print(mp_instances[0].galaxies.lens.mass)
print()
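
To make "marginalizing in 1D and taking the median" concrete, here is a minimal sketch of a weighted median for a single parameter in plain Python. The helper `weighted_quantile`, the sample values and the weights are all hypothetical; PyAutoFit's own implementation may differ in detail.

```python
def weighted_quantile(values, weights, quantile):
    # Sort samples by value and return the first value whose cumulative
    # normalized weight reaches the requested quantile.
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total >= quantile:
            return value
    return pairs[-1][0]


# Hypothetical 1D samples of one parameter (e.g. an einstein radius).
values = [1.58, 1.60, 1.59, 1.61, 1.62]

# With equal weights the result is the ordinary median.
equal_weights = [1.0, 1.0, 1.0, 1.0, 1.0]
median_equal = weighted_quantile(values, equal_weights, 0.5)

# With unequal weights the median shifts towards the heavily weighted sample.
unequal_weights = [0.05, 0.05, 0.1, 0.1, 0.7]
median_unequal = weighted_quantile(values, unequal_weights, 0.5)

print(median_equal, median_unequal)
```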

__Ordering__

The default ordering of the results can be a bit random, as it depends on how the sqlite database is built. 

The `order_by` method can be used to order by a property of the database that is a string. For example, ordering 
by the `unique_tag` (which we set up in the search as the `dataset_name`) sorts the results alphabetically
according to dataset name.

In [None]:
agg = agg.order_by(agg.search.unique_tag)

We can also order by a bool, for example making it so all completed results are at the front of the aggregator.

In [None]:
agg = agg.order_by(agg.search.is_complete)
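
For intuition only, the two kinds of ordering can be mimicked on a plain Python list. The `records` below are hypothetical stand-ins for database entries; this sketch does not query the sqlite database or use the `Aggregator`.

```python
# Hypothetical stand-ins for three database entries.
records = [
    {"unique_tag": "dataset_c", "is_complete": True},
    {"unique_tag": "dataset_a", "is_complete": False},
    {"unique_tag": "dataset_b", "is_complete": True},
]

# Order alphabetically by a string property.
by_tag = sorted(records, key=lambda record: record["unique_tag"])

# Order by a bool so that completed entries come first. The sort is stable,
# so entries with the same completion status keep their original order.
complete_first = sorted(records, key=lambda record: not record["is_complete"])

print([record["unique_tag"] for record in by_tag])
print([record["unique_tag"] for record in complete_first])
```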

__Errors__

We can compute the model parameters at a given sigma value (e.g. at 3.0 sigma limits).

These parameter values do not account for covariance between the model parameters. For example, if two parameters are 
degenerate, this will find their values from the degeneracy in the `same direction` (e.g. both will be positive). We'll 
cover how to handle covariance in a later tutorial.

The `uv3` below signifies this is an upper value at 3 sigma confidence, with `lv3` indicating the lower value.

In [None]:
# Vectors: pass `as_instance=False` to get lists of parameter values.
uv3_vectors = [
    samps.values_at_upper_sigma(sigma=3.0, as_instance=False)
    for samps in agg.values("samples")
]

# Instances: the default return type is a model instance.
uv3_instances = [
    samps.values_at_upper_sigma(sigma=3.0) for samps in agg.values("samples")
]

lv3_vectors = [
    samps.values_at_lower_sigma(sigma=3.0, as_instance=False)
    for samps in agg.values("samples")
]

lv3_instances = [
    samps.values_at_lower_sigma(sigma=3.0) for samps in agg.values("samples")
]

print("Errors Lists: \n")
print(uv3_vectors, "\n")
print(lv3_vectors, "\n")
print("Errors Instances: \n")
print(uv3_instances, "\n")
print(lv3_instances, "\n")

We can compute the upper and lower errors on each parameter at a given sigma limit.

The `ue3` below signifies the upper error at 3 sigma. 

In [None]:
ue3_vectors = [
    samps.errors_at_upper_sigma(sigma=3.0, as_instance=False)
    for samps in agg.values("samples")
]

# ue3_instances = [
#     samps.errors_at_upper_sigma(sigma=3.0) for samps in agg.values("samples")
# ]

le3_vectors = [
    samps.errors_at_lower_sigma(sigma=3.0, as_instance=False)
    for samps in agg.values("samples")
]
# le3_instances = [
#     samps.errors_at_lower_sigma(sigma=3.0) for samps in agg.values("samples")
# ]

print("Errors Lists: \n")
print(ue3_vectors, "\n")
print(le3_vectors, "\n")
print("Errors Instances: \n")
# print(ue3_instances, "\n")
# print(le3_instances, "\n")
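
The idea of values and errors at a sigma limit can be sketched for one parameter in plain Python. Here a sigma value is converted to the probability enclosed by a 1D Gaussian, and the limits are the matching lower and upper quantiles of the samples. The helpers `weighted_quantile` and `limits_at_sigma` are hypothetical, the samples are invented, and defining each error as the distance from the median is an assumption made for this sketch rather than a statement of PyAutoFit's convention.

```python
import math


def weighted_quantile(values, weights, quantile):
    # Sort samples by value and return the first value whose cumulative
    # normalized weight reaches the requested quantile.
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative / total >= quantile:
            return value
    return pairs[-1][0]


def limits_at_sigma(values, weights, sigma):
    # Probability mass enclosed within +/- sigma of a 1D Gaussian.
    enclosed = math.erf(sigma / math.sqrt(2.0))
    lower_quantile = 0.5 * (1.0 - enclosed)
    upper_quantile = 1.0 - lower_quantile
    return (
        weighted_quantile(values, weights, lower_quantile),
        weighted_quantile(values, weights, upper_quantile),
    )


# Hypothetical 1D samples: 1001 equally weighted values between 0 and 1.
values = [i / 1000.0 for i in range(1001)]
weights = [1.0] * len(values)

median = weighted_quantile(values, weights, 0.5)
lower_value, upper_value = limits_at_sigma(values, weights, sigma=3.0)

# Errors taken as distances from the median (an assumption of this sketch).
upper_error = upper_value - median
lower_error = median - lower_value

print(lower_value, median, upper_value)
print(lower_error, upper_error)
```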

__Bayesian Evidence__

The maximum log likelihood of each model fit and its Bayesian log evidence (estimated via the nested sampling 
algorithm) are also available.

Given each fit is to a different image, these are not very useful. However, in a later tutorial we'll look at using 
the aggregator for images that we fit with many different models and many different pipelines, in which case comparing 
the evidences allows us to perform Bayesian model comparison!

In [None]:
print("Maximum Log Likelihoods and Log Evidences: \n")
print([max(samps.log_likelihood_list) for samps in agg.values("samples")])
print([samps.log_evidence for samps in agg.values("samples")])
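
As a preview of Bayesian model comparison, the sketch below compares two hypothetical log evidences, imagined as coming from two different models fitted to the same dataset. The numbers are invented. The difference of log evidences is the log Bayes factor, and a positive value favours the first model.

```python
import math

# Hypothetical log evidences of two models fitted to the SAME dataset.
log_evidence_model_a = -1520.3
log_evidence_model_b = -1525.9

# The log Bayes factor is the difference of the log evidences.
log_bayes_factor = log_evidence_model_a - log_evidence_model_b

# The Bayes factor itself is the ratio of the evidences.
bayes_factor = math.exp(log_bayes_factor)

print(log_bayes_factor)
print(bayes_factor)
```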

__PDFs__

The Probability Density Functions (PDFs) of every model-fit can be plotted using Dynesty's in-built visualization 
tools, which are wrapped via the `DynestyPlotter` object.

In [None]:
for samples in agg.values("samples"):
    search_plotter = aplt.DynestyPlotter(samples=samples)
    # search_plotter.cornerplot()

__Samples Filtering__

The samples object has the results for all model parameters. It can be filtered to contain the results of specific 
parameters of interest.

The basic form of filtering specifies parameters via their path, which was printed above via the model and is printed 
again below.

In [None]:
samples = list(agg.values("samples"))[0]

print("Parameter paths in the model which are used for filtering:")
print(samples.model.paths)

print("All parameters of the very first sample")
print(samples.parameter_lists[0])

samples = samples.with_paths(
    [
        ("galaxies", "lens", "mass", "einstein_radius"),
        ("galaxies", "source", "bulge", "sersic_index"),
    ]
)

print(
    "All parameters of the very first sample (containing only the lens mass's einstein radius and "
    "source bulge's sersic index)."
)
print(samples.parameter_lists[0])

print(
    "Maximum Log Likelihood Model Instances (containing only the lens mass's einstein radius and "
    "source bulge's sersic index):\n"
)
print(samples.max_log_likelihood(as_instance=False))

Above, we specified each path as a list of tuples of strings. 

This is how the PyAutoFit source code stores the path to different components of the model, but it is not in-line 
with the PyAutoLens API used to compose a model.

We can alternatively use the following API:

In [None]:
samples = list(agg.values("samples"))[0]

samples = samples.with_paths(
    ["galaxies.lens.mass.einstein_radius", "galaxies.source.bulge.sersic_index"]
)

print(
    "All parameters of the very first sample (containing only the lens mass's einstein radius and "
    "source bulge's sersic index)."
)
print(samples.parameter_lists[0])

Above, we filtered the `Samples` by asking for all parameters which included the
path ("galaxies", "lens", "mass", "einstein_radius").

We can alternatively filter the `Samples` object by removing all parameters with a certain path. Below, we remove
`centre_0` of the lens mass model, which leaves one fewer free parameter than the full model.

In [None]:
samples = list(agg.values("samples"))[0]

print("Parameter paths in the model which are used for filtering:")
print(samples.model.paths)

print("Parameters of first sample")
print(samples.parameter_lists[0])

print(samples.model.total_free_parameters)

samples = samples.without_paths(
    [
        # "galaxies.lens.mass.centre",
        "galaxies.lens.mass.centre.centre_0",
        # "galaxies.lens.mass.centre.centre_1",
    ]
)

print("Parameters of first sample without the lens mass centre.")
print(samples.parameter_lists[0])

We can keep and remove entire paths of the samples, for example keeping only the parameters of the lens or 
removing all parameters of the source's bulge.

In [None]:
samples = list(agg.values("samples"))[0]
samples = samples.with_paths(["galaxies.lens"])
print("Parameters of the first sample of the lens galaxy")
print(samples.parameter_lists[0])

samples = list(agg.values("samples"))[0]
samples = samples.without_paths(["galaxies.source.bulge"])
print("Parameters of the first sample without the source's bulge")
print(samples.parameter_lists[0])
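
The filtering behaviour described in this section can be sketched in plain Python. The helpers `matches`, `filter_with_paths` and `filter_without_paths` are hypothetical and are not the PyAutoFit implementation. The sketch assumes a selector matches a parameter when it is a prefix of that parameter's path, and it accepts selectors either as tuples of strings or as dotted strings. The parameter values are invented.

```python
# Parameter paths and two hypothetical samples, in matching column order.
paths = [
    ("galaxies", "lens", "mass", "centre", "centre_0"),
    ("galaxies", "lens", "mass", "einstein_radius"),
    ("galaxies", "source", "bulge", "sersic_index"),
]
parameter_lists = [[0.01, 1.60, 2.5], [0.02, 1.58, 2.4]]


def matches(path, selector):
    # Accept "a.b.c" strings or ("a", "b", "c") tuples; a selector matches
    # when it is a prefix of the parameter path (an assumption of this sketch).
    if isinstance(selector, str):
        selector = tuple(selector.split("."))
    return path[: len(selector)] == tuple(selector)


def filter_with_paths(paths, parameter_lists, selectors):
    # Keep only the columns whose path matches at least one selector.
    keep = [
        i
        for i, path in enumerate(paths)
        if any(matches(path, selector) for selector in selectors)
    ]
    return [paths[i] for i in keep], [[row[i] for i in keep] for row in parameter_lists]


def filter_without_paths(paths, parameter_lists, selectors):
    # Drop the columns whose path matches at least one selector.
    keep = [
        i
        for i, path in enumerate(paths)
        if not any(matches(path, selector) for selector in selectors)
    ]
    return [paths[i] for i in keep], [[row[i] for i in keep] for row in parameter_lists]


lens_paths, lens_values = filter_with_paths(paths, parameter_lists, ["galaxies.lens"])
no_bulge_paths, no_bulge_values = filter_without_paths(
    paths, parameter_lists, [("galaxies", "source", "bulge")]
)

print(lens_values)
print(no_bulge_values)
```

In this toy model the two filters return the same columns, because every parameter that is not part of the source's bulge belongs to the lens.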

__Latex__

If you are writing results up in a paper, you can use PyAutoFit's inbuilt latex tools to create latex 
table code which you can copy to your .tex document.

By combining this with the filtering tools above, specific parameters can be included or removed from the latex.

Remember that the superscripts of a parameter are loaded from the config file `notation/label.yaml`, providing high
levels of customization for how the parameter names appear in the latex table. This is especially useful if your 
model uses the same model component more than once, so that identically named parameters need to be distinguished 
via superscripts.

In [None]:
latex = af.text.Samples.latex(
    samples=list(agg.values("samples"))[0],
    median_pdf_model=True,
    sigma=3.0,
    name_to_label=True,
    include_name=True,
    include_quickmath=True,
    prefix="Example Prefix ",
    suffix=r"\\[-2pt]",
)
print(latex)
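
For illustration, the sketch below formats one parameter's median value with upper and lower errors as a latex string. The helper `latex_entry`, the label and the numbers are hypothetical. This is not the output format of `af.text.Samples.latex`; it only shows the kind of string such a tool produces.

```python
def latex_entry(label, median, upper_error, lower_error, decimals=2):
    # Build a string of the form $label = median^{+upper}_{-lower}$.
    return (
        f"${label} = {median:.{decimals}f}"
        f"^{{+{upper_error:.{decimals}f}}}_{{-{lower_error:.{decimals}f}}}$"
    )


# Hypothetical label and values for an einstein radius.
entry = latex_entry(r"\theta_{\rm E}", 1.6, 0.05, 0.04)

print(entry)
```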

Finished.