July 12, 2015 (Updated on 08/08/16)
Install package from Github
Only needs to be done the first time :
library(devtools) install_github("PecanProject/pecan", subdir="all")
Parameter Data Assimilation in PEcAn
Currently, there are four ways of doing Parameter Data Assimilation (PDA) in PEcAn :
Which one to use?
bruteforce : You can choose bruteforce as a method to use PEcAn's natively implemented Metropolis-Hastings Markov Chain Monte Carlo (MCMC) algorithm. This would perform a Bayesian MCMC on model parameters by proposing one parameter value at a time, and accepting or rejecting it according to the calculated likelihood value. This algorithm also has an adaptation functionality that can be turned on (recommended) and off (see below). As each (i-th) parameter is dependent of the previous (i-1-th) one, this algorithm can only be run sequentially (but different chains can be run in parallel). Therefore, if you have just one parameter to calibrate and a relatively fast model that can be run couple of hundreds (couple of thousands would even be better) of times within an hour, it is possible to use this algorithm.
bruteforce.bs : This algorithm is basically identical to the bruteforce, but rather than proposing parameters one at a time, it proposes new values for all parameters at once ("bs" stands for "block sampling"). If you have more than one parameter to calibrate and a relatively fast model, you can use this algorithm, preferably with the adaptation functionality turned on.
emulator : When a model is slow, it is practically not possible to run it many times in order to explore the parameter space and draw enough samples to converge the target distribution with bruteforce algorithms. Instead, we can run the model for a relatively smaller number of times with parameter values that have been carefully chosen to give good coverage of parameter space. Then we can interpolate the likelihood calculated for each of those runs to get a surface that "emulates" the true likelihood and perform regular MCMC (just like the "bruteforce" approach), except instead of actually running the model on every iteration to get a likelihood, this time we will just get an approximation from the likelihood emulator.
bayesian.tools : There are other MCMC algorithms with different proposal schemes and acceptance criterion than Metropolis-Hastings. BayesianTools is an R-package that includes MCMC and SMC samplers and other tools for Bayesian parameter calibration. If you choose bayesian.tools option, PEcAn framework will hand the PDA calculations over to the BayesianTools package. Although this package includes algorithms that are designed to explore the parameter space more efficiently than the regular MH-MCMC, you would still need a relatively faster model to use these algorithms. The BayesianTools R-package itself is currently under development, once it is fully integrated with PEcAn it will be possible to run some of those algorithms in parallel.
Adding PDA tags to pecan.xml
The easiest way to use PEcAn's parameter data assimilation module is to add an
<assim.batch> block to pecan.xml, load the file with
read.settings, and pass the resulting settings object to
pda.mcmc(). There are some differences in the settings for using different PDA methods (see below), but here is an example
<assim.batch> block :
<assim.batch> <iter>10000</iter> <method>bruteforce</method> <chain>3</chain> <param.names> <temperate.coniferous> <--- YOUR PFT <param>veg_respiration_Q10</param> <param>stem_respiration_rate</param> <param>Vm_low_temp</param> </temperate.coniferous> </param.names> <inputs> <file> <input.id>1000000384</input.id> <path> <path>/fs/data1/pecan.data/input/Ameriflux_site_0-796/US-Bar.2005.nc</path> <path>/fs/data1/pecan.data/input/Ameriflux_site_0-796/US-Bar.2006.nc</path> </path> <likelihood>Laplace</likelihood> <variable.id>298</variable.id> <variable.name> <variable.name>LE</variable.name> <variable.name>UST</variable.name> </variable.name> </file> </inputs> <inputs> <file> <input.id>...</input.id> <path> <path>...</path> </path> <likelihood>multipGauss</likelihood> <hyper.pars> <parama>0.001</parama> <paramb>0.001</paramb> </hyper.pars> <ss.positive>TRUE</ss.positive> <variable.name> <variable.name>...</variable.name> </variable.name> </file> </inputs> <jump> <ar.target>0.3</ar.target> <adapt>200</adapt> <adj.min>0.1</adj.min> </jump> <diag.plot.iter>500</diag.plot.iter> </assim.batch>
Here are details about the settings:
<iter>Specifies the number of MCMC iterations to run. If continuing a previous MCMC, this is the number of additional iterations, which will be added to the previous total. Defaults to 100 if missing. Ignored by pda.emulator().
<chain>Specifies the number of MCMC chains to be run.
<prior>Identifies the prior to be used for PDA. Can be one of either:
+ `<posterior.id>` A posterior ID in BETY specifying the posterior from a previous PEcAn analysis (e.g., meta-analysis or previous PDA) to be used as the prior for PDA. Defaults to the most recent relevant posterior in the database if omitted (and no `<path>` specified instead; see below). + `<path>` As an alternative to using a posterior ID, can specify a file path to either a `prior.distns.Rdata` or `post.distns.Rdata` file generated from an earlier analysis. Conceptually, using a posterior distribution as the prior for PDA is preferred, as this allows the multiple analyses to work together to iteratively constrain parameters. In practice, previous analyses may have over-constrained parameters to ranges that do not actually optimize model outputs, so using a less informative prior for PDA might yield better results. NOTE: It is recommended to leave the `<prior>` tag empty if you are doing this for the first time. PEcAn workflow will handle it for you.
<param.names>The names of parameters to be constrained by assimilation, listed in individual
<param>tags. These must be the standard names given in the
idcolumn of the
trait.dictionary, i.e. :
data(trait.dictionary, package = "PEcAn.utils") head(trait.dictionary[,c("id", "figid")])
NOTE : `<param.names>` chunk should always be on your xml file regardless of the method you use in the PDA.
<inputs>Observation data to be compared to the model. In principle, can be one or more datasets, specified in a variety of ways. In practice, the code is tested for assimilating Ameriflux dataset currently, and assumes the input is Ameriflux NEE/FC or LE.
<file>Denotes a set of tags for a single input. Would be repeated for multiple datasets/variables, e.g. in this case note the differences in
... <inputs> <file> <input.id>1000000384</input.id> <path> <path>/fs/data1/pecan.data/input/Ameriflux_site_0-796/US-Bar.2005.nc</path> <path>/fs/data1/pecan.data/input/Ameriflux_site_0-796/US-Bar.2006.nc</path> </path> <likelihood>Laplace</likelihood> <variable.id>298</variable.id> <variable.name> <variable.name>LE</variable.name> <variable.name>UST</variable.name> </variable.name> </file> <file> <input.id>1000000384</input.id> <path> <path>/fs/data1/pecan.data/input/Ameriflux_site_0-796/US-Bar.2005.nc</path> <path>/fs/data1/pecan.data/input/Ameriflux_site_0-796/US-Bar.2006.nc</path> </path> <likelihood>Laplace</likelihood> <variable.id>1000000042</variable.id> <variable.name> <variable.name>FC</variable.name> <variable.name>UST</variable.name> </variable.name> </file> </inputs> ...
<input.id>BETY input ID for looking up the input.
<path>File path to the input. Both
<path>of the observation data should be supplied for the PDA.
<source>A standardized source of input data (e.g., Ameriflux). Not implemented yet, but the idea would be similar to the met workflow, PEcAn would be able to use standard data sources automatically where available. Only used if no
<likelihood>Identifier for the likelihood to use. E.g., the Ameriflux NEE/FC and LE data use a Laplacian likelihood.
<hyper.pars>Optional. Hyperparameters for your likelihood. E.g. Prior parameters of the precision for Gaussian likelihood. Defaults to scaled values if not provided.
<variable.id>The BETY variable ID associated with this dataset. The idea is that specific preprocessing steps (e.g., estimating heteroskedastic error for tower NEE) would be associated with particular IDs. Could automate further by assigning default
<likelihood>to variable.id values (allowing
<likelihood>to be omitted from pecan.xml).
<inputs>chunk should always be on your xml file regardless of the method you use in the PDA.
<jump>Settings for the specifics of the proposal schema and adaptation functionality.
<ar.target>Target acceptance rate for the adaptive jump algorithm. Defaults to 0.5 if missing.
<adapt>Number of iterations between jump variance adaptations. Defaults to
floor(iter/10)if missing. If set equal to the number of
<iter>, basically turns off the adaptation functionality, it is not recommended to turn-off this functionality.
<adj.min>Minimum factor by which to reduce jump variance when adapting. Prevents jump variances from degenerating to 0. Defaults to 0.1 if missing.
<diag.plot.iter>Interval between saving diagnostic plots. Omit or set to NULL to skip them.
<params.id>(Not shown.) A BETY dbfile ID for an MCMC output from previous PDA. If specified, that file is loaded, the new MCMC starts from the last parameter values of the previous, and when finished the extended chain is saved as a new output. If missing, then MCMC starts fresh from prior median parameter values. Regardless, the MCMC parameter values of the PDA are saved to file and inserted in BETY, and the new dbfile ID is inserted into
pda.mcmc()funtion returns the
<assim.batch>settings, which can then be saved. Then, calling a new round of PDA using these returned settings will automatically continue the previous MCMC.
Method specific settings
If you want to extend your previous bruteforce and bruteforce.bs MCMC-chains, you can add an
<extension> tag within your
<assim.batch> chunk after reading your post-PDA settings (which would be saved as "pecan.pda[UNIQUE_ID].xml" in your working directory) and set it to "longer".
... <method>bruteforce.bs</method> <iter>10000</iter> <method>bruteforce</method> <chain>3</chain> <extension>longer</extension> ...
If you are using methods other than bruteforce and bruteforce.bs, some additional tags and different settings may apply.
emulator would use additional tags:
... <method>emulator</method> <n.knot>20</n.knot> <GPpckg>GPfit</GPpckg> <chain>3</chain> <extension>round</extension> <knot.par>0.75</knot.par> ...
<n.knot>Specifies the number of locations in parameter space to be sampled by the Latin Hypercube design. These locations are where the model will actually be run. In other words the model will be run for
<GPpckg>Specifies which R package to use for fitting a Gaussian process to interpolate the likelihood surface in between the calculated values that are obtained from actual model runs. Current options are
<chain>Specifies the number of MCMC chains to be run.
<extension>Specifies the extension type of additional emulator runs, should be skipped if this a first emulator run of its own. Otherwise it is possible to extend emulator PDA runs in two ways:
<extension>longer</extension>using the same emulator, this extension run takes the MCMC sampling from where it was left in the previous emulator run and runs a longer MCMC chain.
<extension>round</extension>this extension run proposes new points in the parameter space in addition to the previous ones, and builds a new emulator including these additional points for a new MCMC sampling. These new points can come from both your inital PDA prior and the posterior of your first round of emulator run. You can determine the percentage of new knots coming from the posterior of your previous run in the
<knot.par>tag. If you leave it empty, 75% of your new points will be drawn from the posterior of your previous run by default.
<file> <input.id>...</input.id> <path> <path>...</path> </path> <likelihood>multipGauss</likelihood> <ss.positive>TRUE</ss.positive> <variable.id>...</variable.id> <variable.name> <variable.name>...</variable.name> </variable.name> </file> </inputs>
<ss.positive>When using the emulator, it is important to let the algorithm know that you have a likelihood whose sufficient statistics is zero bound.
bayesian.tools would look for sampler specific settings that can be passed through the pecan.xml as a block under the
<bt.settings> tag. Currently, the available samplers in the BayesianTools package are:
- Standard MH-MCMC
- With pre-optimization
- Adaptive MCMC
- Delayed rejection
- Gibbs updating
- M : Another implementation of standard MH-MCMC.
- AM : Adaptive Metropolis
- DR : Delayed Rejection
- DRAM : Delayed Rejection Adaptive Metropolis
- DE : Differential Evolution
- DEzs : Differential Evolution with a snooker updater
- DREAM : Differential Evolution Adaptive Metropolis
- DREAMzs : Differential Evolution Adaptive Metropolis with a snooker updater
- Twalk : "traverse" or "thoughtful" walk, a general purpose sampling algorithm
- SMC : Sequential Monte Carlo
The name of the chosen sampler would be passed under
<sampler> tag within the
<bt.settings> block :
It is not possible to use some of the samplers in this package for univariate cases. The ones that you can use for univariate cases are: "Metropolis", "DE", "DEzs" and "Twalk".
For Metropolis variations:
... <method>bayesian.tools</method> <bt.settings> <iter>10000</iter> <sampler>Metropolis</sampler> <DRlevels>1</DRlevels> <-- set 2 for Delayed Rejection <optimize>FALSE</optimize> <-- set TRUE for pre-optimization <adapt>FALSE</adapt> <-- set TRUE for Adaptive Metropolis <adaptationNotBefore>500</adaptationNotBefore> <-- set for Adaptive Metropolis </bt.settings> ...
Some of the samplers are also restartable : "Metropolis", "DE", "DEzs", "DREAM", "DREAMzs", "Twalk"