# Tutorial for HBMCMC code

The modules in acrg_hbmcmc provide a convenient way to perform 'hierarchical Bayesian MCMC'. \
That is, we want to infer emissions of a particular species (and the influence from the domain boundary) using a Markov Chain Monte Carlo algorithm. The 'hierarchy' is that this depends on 'hyperparameters', which are generally the uncertainties in the non-latent parameters involved in the estimation (i.e. not those we are trying to derive – the emissions – but others that are necessary, e.g. the measurement error). \
For a work-in-progress introduction to inverse modelling see: https://www.overleaf.com/project/5f8d6217aeca1900019a84ce

The code within acrg_hbmcmc relies heavily on other code in the ACRG repository, mainly name.py in acrg_name. This code is generally used to read footprints, a priori emissions, etc. from the ACRG directory structure. \
Currently the MCMC estimation is performed using the pymc3 library (see https://docs.pymc.io/), a well-established statistical library.

## Running the code

The easiest way to run the hbmcmc code is to copy the file hbmcmc_input_template.ini in acrg_hbmcmc/config/ to your run directory, and edit this file with your desired set-up. \
The comments in the template generally explain the various inputs, but we will go through them below:

**species** is a single string of the species you wish to do an inversion for, e.g. "CH4" or "CFC-11". As the template notes, check acrg_species_info.json for the various options currently available, or add your own if needed. \
**sites** is a list containing the sites you wish to use measurements from, defined by their 3-letter code, e.g. ["MHD", "TAC"] for the Mace Head and Tacolneston measurement sites. Again, see acrg_site_info.json for options or add your own. \
**meas_period** is a list of the averaging periods you wish to apply to the measurements at the different sites. Often not much is gained by having many measurements made at really high frequency, especially as the footprints are rarely at such high resolution. Instead we may wish to use a coarser frequency, e.g. every 6 hours, where all measurements made within a 6-hour period are averaged into a single measurement. A meas_period must be supplied in the list for each site, e.g. ["6H","6H"]. \
**start_date** is when you want the inversion to begin, as a string (see the template for the format). \
**end_date** is when you want the inversion to end, as a string (see the template for the format). Note that the end day itself is not included (e.g. "2000-01-01" means the inversion runs until 23:59 on 1999-12-31).
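
As a rough illustration of what **meas_period** and the exclusive **end_date** imply, the sketch below averages hourly values into fixed windows using only the Python standard library. The function name `average_into_periods` and its integer-hours argument are hypothetical, chosen for this example; the ACRG code does its own resampling and may align the windows differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def average_into_periods(times, values, hours, start_date, end_date):
    """Average values in consecutive windows of `hours`, keeping start_date <= t < end_date."""
    start = datetime.fromisoformat(start_date)
    end = datetime.fromisoformat(end_date)  # exclusive: this day is not included
    window = timedelta(hours=hours)
    bins = defaultdict(list)
    for t, v in zip(times, values):
        if start <= t < end:
            # Index of the window this measurement falls in, counted from the start
            index = (t - start) // window
            bins[start + index * window].append(v)
    return {t: sum(v) / len(v) for t, v in sorted(bins.items())}


# Two days of hourly toy "measurements" whose value is simply the hour count
times = [datetime(2000, 1, 1) + timedelta(hours=h) for h in range(48)]
values = [float(h) for h in range(48)]

# A 6-hour averaging period over a one-day inversion
averaged = average_into_periods(times, values, 6, "2000-01-01", "2000-01-02")
for window_start, mean_value in averaged.items():
    print(window_start, mean_value)
```

Only the first day contributes here, because the end date marks the first instant that is excluded.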



These inputs provide information about where the measurement data is read from, and allow you to use measurements from something other than the default set-up. These inputs can probably be ignored if you are just getting started. \
**inlet** is the height at which the air is sampled. If set to None, the default will be used. For many sites, there is only one inlet anyway and so it will just default to this one height (e.g. MHD at 10m). Other sites have multiple inlets, e.g. TAC has inlets at 50m, 100m and 185m (NB. these heights are above ground level). TAC will default to 185m, but we may wish to use a different height, e.g. 100m, and so can specify this here. Note that you then have to specify this in a list of the same length as the *sites* input. So for MHD and TAC, we could write inlet=[None, "100m"] to use the 100m inlet at TAC and the default for MHD. \
**instrument** specifies which instrument made the measurement. Many gases are measured at a site using multiple instruments. If this is the case, the default should reflect the best choice. Sometimes you might want to override this, e.g. if the default instrument was down for a long period. Again, if specifying this for one site you must have an entry for each site, e.g. to use the CRDS instrument at MHD and the default at TAC you could use instrument=["CRDS", None]. \
**obs_directory** specifies a directory other than the standard ACRG structure to read the measurement data from, specified as a string. The directory should have the structure /<site>/<obs_file> where the site and the obs file reflect what you would expect in the ACRG directory structure. This might be useful if, e.g., you have some experimental data that is not suitable to be stored in the main ACRG obs file structure.
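
The rule that per-site inputs need one entry per site can be sketched as a small check. The helper `pair_with_sites` below is hypothetical and not part of acrg_hbmcmc; it only illustrates how the lists line up with **sites**, using the example values from the text.

```python
def pair_with_sites(sites, **per_site_inputs):
    """Return {site: {input_name: value}}, checking each list has one entry per site."""
    settings = {site: {} for site in sites}
    for name, values in per_site_inputs.items():
        if values is None:
            # The whole input was left unset, so every site uses its default
            values = [None] * len(sites)
        if len(values) != len(sites):
            raise ValueError(
                f"{name} has {len(values)} entries but there are {len(sites)} sites"
            )
        for site, value in zip(sites, values):
            settings[site][name] = value
    return settings


sites = ["MHD", "TAC"]
settings = pair_with_sites(
    sites,
    meas_period=["6H", "6H"],
    inlet=[None, "100m"],        # default inlet at MHD, 100m inlet at TAC
    instrument=["CRDS", None],   # CRDS at MHD, default instrument at TAC
)
print(settings["MHD"])
print(settings["TAC"])
```

A value of None in a list means "use the default for this site", so the lists stay the same length as **sites** even when only one site is overridden.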

This section of inputs is named "INPUT.PRIORS". This is a little misleading as it doesn't have anything to do with the priors... \
What this section does contain is information about the region (defined by ACRG's pre-existing regions) in which the inversion will take place, and about the footprints (sensitivities) that will be used to map the emissions to the measurements. \
**domain** is the model domain in which the inference will take place. Examples of a domain are "EUROPE" or "SOUTHAFRICA". You will probably have a good idea of which domain you want to use before you start. \
**fpheight** is the dispersion-model equivalent of the inlet height. This may differ from the true inlet height because the topography in models does not always capture the true topography (e.g. a site in a valley may be underground in the model). This has to be input as a dictionary with the height in magl (metres above ground level), e.g. fpheight={"MHD":None, "TAC":"100magl"}. \
**emissions_name** specifies if you want to use a particular emissions file. An emissions file contains spatial information about the 'best guess' of what the emissions are in a particular domain (such as from an inventory). By default the inversion will use the file without a tag, or the one named total, and the most recent emissions file available. E.g., if you are running CH4 in EUROPE and the most recent emissions file is from 2010, it would default to something like ch4_EUROPE_2010.nc. But if you want to use the emissions file ch4-oil-production_EUROPE_2010.nc, then set emissions_name="oil-production". \
**fp_directory** specifies the path to the footprints you want to use, if not the defaults. The names and directory structure must mirror those in e.g. /shared/LPDM/fp_NAME/. You may need this if you are, e.g., experimenting with different footprints that are not suitable for sharing. \
**flux_directory** specifies the path to the emissions file you wish to use, if not the default ACRG path. The structure must mirror that of /shared/LPDM/emissions/.
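
To make the effect of **emissions_name** concrete, here is a minimal sketch of how the two file names quoted above relate to each other. The helper `emissions_filename` is hypothetical, and the naming pattern is inferred only from those two examples; the ACRG code finds the files itself (including choosing the most recent year) and may follow additional rules.

```python
def emissions_filename(species, domain, year, emissions_name=None):
    """Build a file name following the pattern of the two examples in the text."""
    prefix = species.lower()
    if emissions_name is not None:
        # A tag such as "oil-production" is appended to the species
        prefix = f"{prefix}-{emissions_name}"
    return f"{prefix}_{domain}_{year}.nc"


default_file = emissions_filename("CH4", "EUROPE", 2010)
tagged_file = emissions_filename("CH4", "EUROPE", 2010, emissions_name="oil-production")
print(default_file)
print(tagged_file)
```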

Basis functions are the computational representation of the emissions. Most simply, this could be thought of as the underlying grid resolution of the inversion, but could also be something such as different countries or spatially distinct sectors. For the boundaries, this is how the boundary is broken down (e.g. as 4 sides, as a gradient etc.). \
**bc_basis_case** is how the inversion interprets the boundaries. Currently (unless this is out of date now) it can only handle "NESW", which means that the influence from the boundary at each cardinal direction is inferred. \
**fp_basis_case** is the basis function representation of the emissions. For example, if using the file 16x16_EUROPE_2012.nc, fp_basis_case="16x16". Set to None (and quadtree_basis=True) to create the emissions basis functions on the fly. \
**quadtree_basis** set to True uses a quadtree algorithm to find a suitable basis function representation for the emissions. Set it to False if fp_basis_case is not None. The quadtree algorithm recursively divides the domain until the desired number of basis functions is reached, with the aim that regions with a higher contribution to the measurement (higher signal to noise) and more spatial variability get a finer spatial resolution than those that contribute little or are spatially uniform (e.g. far from the measurement site, oceans if not emitters, etc.). The division is based on the a priori ("best guess") emissions multiplied by the average footprint for the inversion period, which gives a mole fraction contribution at the dispersion-model resolution. First, the domain is split into 4 basis functions. The most variable of these is then split into 4 new basis functions, and this step is repeated on the most variable remaining basis function until the desired number is reached. Because each split replaces one basis function with four, the total number of basis functions is always of the form 3n+1. \
**nbasis** is the number of desired basis functions if using the quadtree algorithm. Note that if this is not of the form 3n+1, the closest number that works is used. There is no 'correct' number to use, but it will be harder to estimate a larger number of basis functions. However, too few may not represent reality well. This will require some experimentation on a case-by-case basis. \
**basis_directory** is the directory to use if the basis functions are not in the default ACRG location. Again, this should mirror the ACRG file structure.
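
The quadtree idea can be sketched in a few lines of plain Python. This is only an illustration of the description above, not the acrg_hbmcmc implementation: the function names are hypothetical, the field is a toy grid standing in for "a priori emissions multiplied by the mean footprint", and using the sum of squared deviations as the measure of "most variable" is an assumption.

```python
def closest_valid_nbasis(nbasis):
    """Closest count of the form 3n+1 (each split turns 1 region into 4)."""
    n = max(round((nbasis - 1) / 3), 0)
    return 3 * n + 1


def quadtree_boxes(field, nbasis):
    """Split a 2D grid (list of rows) into boxes given as (row0, row1, col0, col1)."""

    def variability(box):
        r0, r1, c0, c1 = box
        cells = [field[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        mean = sum(cells) / len(cells)
        # Sum of squared deviations: an assumed measure of "most variable"
        return sum((x - mean) ** 2 for x in cells)

    target = closest_valid_nbasis(nbasis)
    boxes = [(0, len(field), 0, len(field[0]))]
    while len(boxes) < target:
        # Only boxes at least 2 cells wide and tall can be quartered
        splittable = [b for b in boxes if b[1] - b[0] >= 2 and b[3] - b[2] >= 2]
        if not splittable:
            break
        r0, r1, c0, c1 = max(splittable, key=variability)
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        boxes.remove((r0, r1, c0, c1))
        boxes += [(r0, rm, c0, cm), (r0, rm, cm, c1),
                  (rm, r1, c0, cm), (rm, r1, cm, c1)]
    return boxes


# Toy field: a single strong contribution in one corner of an 8x8 domain
field = [[0.0] * 8 for _ in range(8)]
field[0][0] = 10.0

boxes = quadtree_boxes(field, nbasis=11)  # 11 is not of the form 3n+1
print(len(boxes))
for box in boxes:
    print(box)
```

In this toy case the boxes shrink towards the corner holding the signal, while the three uniform quadrants are never subdivided.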



Currently this bit can't be changed, but in future there may be different options (e.g. trans-dimensional MCMC) all called from the same input script.

The explanation in the template is largely self-explanatory in terms of what to input. If in doubt, see https://docs.pymc.io/api/distributions/continuous.html.

These inputs control the time period over which some variables are estimated. For example, the inversion may cover one year, but we may want to estimate the boundary conditions each month (rather than just once over the whole period). \
**bc_freq** is the frequency at which to estimate the boundary conditions. How to input this is explained in the template. Note that, as we're scaling, setting bc_freq=None does not mean that the boundary condition will have the same value for the whole inversion period; it means that the whole period will be scaled by the same factor. \
**sigma_freq** is as bc_freq, but for the model uncertainty estimated in the process. \
**sigma_per_site** set to True estimates the model-measurement uncertainty separately for each site; set to False, it makes one estimate for all sites.
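
The difference between "one value" and "one scaling factor" for the boundary conditions can be shown with a toy calculation. The function `scale_boundary`, the period indices and all numbers below are hypothetical and purely illustrative; they are not how acrg_hbmcmc stores these quantities.

```python
def scale_boundary(prior_bc, period_of_each_time, factors):
    """Scale each a priori boundary value by the factor of the period it falls in."""
    return [bc * factors[p] for bc, p in zip(prior_bc, period_of_each_time)]


# A priori boundary contribution at four times (made-up numbers)
prior_bc = [1900.0, 1905.0, 1910.0, 1915.0]

# One factor for the whole period: the values still vary in time
whole_period = scale_boundary(prior_bc, [0, 0, 0, 0], [1.01])

# Two estimation periods, each with its own factor
two_periods = scale_boundary(prior_bc, [0, 0, 1, 1], [1.01, 0.99])

print(whole_period)
print(two_periods)
```

With a single factor the scaled boundary keeps the time variation of the a priori values; a finer frequency simply allows a different factor in each period.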


The above is largely self-explanatory. If the terms don't make sense then it's probably best to do a little reading on MCMC algorithms. Some things to note: the number of iterations stored will be **nit** minus **burn**. That is, **nit** is the total number of iterations used for sampling, of which **burn** are discarded. **tune** iterations are run before sampling takes place, so in total each MCMC chain will do **tune** + **nit** iterations.
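
The iteration arithmetic above can be written out explicitly. The helper `iteration_budget` and the example numbers are hypothetical; only the two relationships (stored = nit - burn, total per chain = tune + nit) come from the text.

```python
def iteration_budget(nit, burn, tune):
    """Return how many iterations are stored and how many each chain runs in total."""
    if not 0 <= burn < nit:
        raise ValueError("burn must be non-negative and smaller than nit")
    return {"stored": nit - burn, "per_chain": tune + nit}


# Example values chosen only for illustration
budget = iteration_budget(nit=10000, burn=2000, tune=5000)
print(budget)
```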

    [MCMC.NCHAIN]
    ; Number of chains to run simultaneously. Must be >=2 to allow convergence to be checked.

    nchain = 2
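
If you want to inspect a section like this from Python, the standard-library `configparser` module can read it. This is only a sketch of one way to do it; how acrg_hbmcmc itself parses the .ini file is not shown in this tutorial, and the check on the value simply mirrors the comment in the template.

```python
import configparser

# The section exactly as it appears in the template
ini_text = """
[MCMC.NCHAIN]
; Number of chains to run simultaneously. Must be >=2 to allow convergence to be checked.

nchain = 2
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# Lines starting with ';' are treated as comments by default
nchain = parser["MCMC.NCHAIN"].getint("nchain")
if nchain < 2:
    raise ValueError("nchain must be at least 2 so that convergence can be checked")
print(nchain)
```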


This is the number of chains that will be run (i.e. how many independent MCMC estimates you will do). Only one chain will ever be stored, as one chain should contain all the information you need. If it doesn't, then you have to increase the number of iterations (and/or tune it better). The reason for running more than one chain is to try to understand whether your estimate is correct or not. In reality there is no way of knowing for certain that your estimate is correct, but we can tell if it is wrong. In short, if we run an MCMC sampler twice (2 chains) and the chains give different estimates, then we know we haven't run it for long enough, as it hasn't converged. The more chains we run, the more sure we can be that it is giving the correct answer.
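
One common way to compare chains is the Gelman-Rubin statistic, which contrasts the spread between chains with the spread within them; values close to 1 suggest the chains agree. The sketch below is a basic textbook version written with the standard library only. Whether acrg_hbmcmc uses this exact diagnostic is not stated in this tutorial, so treat it as an illustration of the idea rather than a description of the code.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Basic Gelman-Rubin statistic for equal-length chains of one parameter."""
    n = len(chains[0])
    within = mean([variance(chain) for chain in chains])
    between = n * variance([mean(chain) for chain in chains])
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Two toy chains sampling the same values in a different order
agreeing = gelman_rubin([[0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 0.0, 2.0]])

# Two toy chains with the same spread but very different means
disagreeing = gelman_rubin([[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]])

print(agreeing, disagreeing)
```

The second pair gives a value far above 1, which is the "we can tell if it is wrong" case: the chains have not settled on the same distribution.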

Setting **averagingerror** to True adds an additional error to the measurement error. If, for example, our measurement instrument makes a measurement every hour, and we average these into 24H measurements, then averagingerror=True will add the variability of the measurements within this 24H period to the instrumental measurement error. The rationale behind this is that if the measurements are fairly similar, we can probably smooth them over 24H and still represent them well in our emissions models. If they are very variable, then the smoothed value is unlikely to represent them well, which leads to a larger error.
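
A minimal sketch of this idea is given below. Two choices in it are assumptions made for the example and are not confirmed by this tutorial: the variability is taken as the standard deviation of the values in the averaging window, and the two error terms are combined in quadrature. The function name `total_measurement_error` is also hypothetical.

```python
from math import sqrt
from statistics import pstdev


def total_measurement_error(instrument_error, values_in_window):
    """Combine the instrument error with the variability inside one averaging window."""
    # Assumption: variability = population standard deviation of the window
    variability = pstdev(values_in_window)
    # Assumption: the two terms are combined in quadrature
    return sqrt(instrument_error ** 2 + variability ** 2)


# Similar measurements: the averaging adds nothing to the error
steady = total_measurement_error(1.0, [5.0, 5.0, 5.0, 5.0])

# Variable measurements: the error grows
variable = total_measurement_error(1.0, [0.0, 2.0])

print(steady, variable)
```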

**outputpath** is simply the path to where you want your output to be stored.\
**outputname** is the tag you want to use to identify your output. E.g., if outputname="v1", for the domain "EUROPE" and species "CH4", with the run starting 2010-01-01, your output would be called "CH4_EUROPE_v1_2010-01-01.nc".