Skip to content

Commit

Permalink
Minor fixes for the tutorial
Browse files Browse the repository at this point in the history
  • Loading branch information
Jūlija Pečerska committed Jun 18, 2018
1 parent 5f278ec commit 55382af
Show file tree
Hide file tree
Showing 4 changed files with 54 additions and 49 deletions.
41 changes: 21 additions & 20 deletions README.md
@@ -1,6 +1,6 @@
--- ---
author: David A. Rasmussen author: David A. Rasmussen
beastversion: 2.4.1 beastversion: 2.4.1+
tracerversion: 1.6.0 tracerversion: 1.6.0
level: Beginner level: Beginner
title: Troubleshooting title: Troubleshooting
Expand All @@ -11,7 +11,7 @@ title: Troubleshooting


The primary goal of most phylogenetic analyses in BEAST is to infer the posterior distribution of trees and associated model parameters using Markov chain Monte Carlo (MCMC) sampling. In this tutorial, we will learn how to analyze the output of a MCMC analysis in BEAST using Tracer. This program allows us to easily visualize BEAST's output and summarize results. As we will see, we can also use Tracer to troubleshoot some of the most common MCMC problems encountered in BEAST. The primary goal of most phylogenetic analyses in BEAST is to infer the posterior distribution of trees and associated model parameters using Markov chain Monte Carlo (MCMC) sampling. In this tutorial, we will learn how to analyze the output of a MCMC analysis in BEAST using Tracer. This program allows us to easily visualize BEAST's output and summarize results. As we will see, we can also use Tracer to troubleshoot some of the most common MCMC problems encountered in BEAST.


While BEAST's MCMC algorithm is fairly well optimized for phylogenetic inference, problems can arise, especially as the complexity of our data and models increase. A MCMC run may not converge on a stationary target distribution. More commonly, a run might converge but mix poorly - meaning our samples from the posterior are highly autocorrelated and therefore not independent. In these cases, it is often necessary to tune the performance of the MCMC algorithm. While BEAST's MCMC algorithm is fairly well optimised for phylogenetic inference, problems can arise, especially as the complexity of our data and models increase. A MCMC run may not converge on a stationary target distribution. More commonly, a run might converge but mix poorly - meaning our samples from the posterior are highly autocorrelated and therefore not independent. In these cases, it is often necessary to tune the performance of the MCMC algorithm.


In this tutorial, we will consider a relatively simple example where we would like to infer a phylogeny and evolutionary parameters from a small alignment of sequences. Our job will be to work together to increase the efficiency of the MCMC so we can make BEAST purr... In this tutorial, we will consider a relatively simple example where we would like to infer a phylogeny and evolutionary parameters from a small alignment of sequences. Our job will be to work together to increase the efficiency of the MCMC so we can make BEAST purr...


Expand Down Expand Up @@ -84,12 +84,11 @@ We are now ready to run our first analysis in BEAST.
<figure> <figure>
<a id="fig:beast_run1"></a> <a id="fig:beast_run1"></a>
<img style="width:50.0%;" src="figures/beast_input_run1.png" alt=""> <img style="width:50.0%;" src="figures/beast_input_run1.png" alt="">
<figcaption>Figure 2: Running BEAST with a user-specified random number seed.</figcaption> <figcaption>Figure 2: Running BEAST with the specified XML configuration file.</figcaption>
</figure> </figure>
<br> <br>



Since we are only running 200,000 iterations, the MCMC should finish running in under 30 seconds.
If you set the random number seed to 1985, you should be able to reproduce my results exactly ([Figure 2](#fig:beast_run1)). Feel free to use another seed if you would like to experiment. Since we are only running 200,000 iterations, the MCMC should finish running in under 30 seconds.




### Visualizing BEAST's output in Tracer ### Visualizing BEAST's output in Tracer
Expand All @@ -98,24 +97,26 @@ If you set the random number seed to 1985, you should be able to reproduce my re
> Open Tracer and navigate to **File > Import Trace File**, then open `tutorial_run1.log` > Open Tracer and navigate to **File > Import Trace File**, then open `tutorial_run1.log`
> >
The results of your run will look similar, but not necessarily identical, to my results shown in [Figure 2](#fig:beast_run1).

Tracer allows us to quickly visualize BEAST's MCMC output in order to check performance and see our parameter estimates. On the left there is a panel where all model parameters are listed along with their mean posterior estimates and Effective Sample Size (ESS). Recall that the ESS tells us how many pseudo-independent samples we have from the posterior, so the higher the better. Here, we can see that the ESS is low for all parameters, indicating that we do not yet have a good estimate of the posterior distribution. Tracer allows us to quickly visualize BEAST's MCMC output in order to check performance and see our parameter estimates. On the left there is a panel where all model parameters are listed along with their mean posterior estimates and Effective Sample Size (ESS). Recall that the ESS tells us how many pseudo-independent samples we have from the posterior, so the higher the better. Here, we can see that the ESS is low for all parameters, indicating that we do not yet have a good estimate of the posterior distribution.


By selecting a parameter in the left panel and then clicking on the **Trace** tab, we can see how the MCMC explored parameter space ([Figure 3](#fig:tracer_run1)). For the clockRate parameter for instance, we see that the chain quickly converged to a value of about 0.01 (the true value used to simulate the sequence data), but mixing was poor, hence the low ESS. By selecting a parameter in the left panel and then clicking on the **Trace** tab, we can see how the MCMC explored parameter space ([Figure 3](#fig:tracer_run1)). For the **clockRate** parameter for instance, we see that the chain quickly converged to a value of about 0.01 (the true value used to simulate the sequence data), but mixing was poor, hence the low ESS.


<figure> <figure>
<a id="fig:tracer_run1"></a> <a id="fig:tracer_run1"></a>
<img style="width:80.0%;" src="figures/tracer_run1.png" alt=""> <img style="width:80.0%;" src="figures/tracer_run1.png" alt="">
<figcaption>Figure 3: A trace plot for the clockRate parameter</figcaption> <figcaption>Figure 3: A trace plot for the **clockRate** parameter</figcaption>
</figure> </figure>
<br> <br>




We can also see our posterior estimates for each parameter by clicking on the **Estimates** tab while highlighting the desired parameter in the left panel. This provides us with various summary statistics and a frequency histogram representing our estimate of the posterior distribution constructed from our MCMC samples. For the clockRate parameter, we can see that our estimate of the posterior is extremely rough, again because we have so few uncorrelated samples from the posterior ([Figure 4](#fig:tracer_run1_ests)). We can also see our posterior estimates for each parameter by clicking on the **Estimates** tab while highlighting the desired parameter in the left panel. This provides us with various summary statistics and a frequency histogram representing our estimate of the posterior distribution constructed from our MCMC samples. For the **clockRate** parameter, we can see that our estimate of the posterior is extremely rough, again because we have so few uncorrelated samples from the posterior ([Figure 4](#fig:tracer_run1_ests)).


<figure> <figure>
<a id="fig:tracer_run1_ests"></a> <a id="fig:tracer_run1_ests"></a>
<img style="width:80.0%;" src="figures/tracer_run1_ests.png" alt=""> <img style="width:80.0%;" src="figures/tracer_run1_ests.png" alt="">
<figcaption>Figure 4: Posterior estimates of the clockRate in Tracer.</figcaption> <figcaption>Figure 4: Posterior estimates of the **clockRate** in Tracer.</figcaption>
</figure> </figure>
<br> <br>


Expand All @@ -126,7 +127,7 @@ We can also see our posterior estimates for each parameter by clicking on the **
By checking the ESS, trace plots and parameter estimates, we got the picture that none of our parameters in Run 1 mixed well. In this case, the simplest thing to try is to rerun the MCMC for more iterations. By checking the ESS, trace plots and parameter estimates, we got the picture that none of our parameters in Run 1 mixed well. In this case, the simplest thing to try is to rerun the MCMC for more iterations.




> Load the `tutorial_run1.xml` back into BEAUti using **File > Load**. Navigate to the MCMC panel and increase the chain length to 1 million . You may also want to change the file name for the log file to `tutorial_run2.log` so we do not overwrite our previous results. When done, navigate **File > Save As** and save as `tutorial_run2.xml`. > Load the `tutorial_run1.xml` back into BEAUti using **File > Load**. Navigate to the MCMC panel and increase the chain length to 1 million. You may also want to change the file name for the log file to `tutorial_run2.log` and to `tutorial_run2.trees` for the tree file so we do not overwrite our previous results. When done, navigate **File > Save As** and save as `tutorial_run2.xml`.
> >


Expand All @@ -141,23 +142,23 @@ By checking the ESS, trace plots and parameter estimates, we got the picture tha
<br> <br>




Looking at the MCMC output in Tracer, we see that that increasing the chain length did help some ([Figure 6](#fig:tracer_run2)). The ESS values are higher and the traces look better, but still not great. In the next section, we will continue to focus on the clockRate parameter because it still has a low ESS and appears to mix especially poorly. Looking at the MCMC output in Tracer, we see that that increasing the chain length did help some ([Figure 6](#fig:tracer_run2)). The ESS values are higher and the traces look better, but still not great. In the next section, we will continue to focus on the **clockRate** parameter because it still has a low ESS and appears to mix especially poorly.


<figure> <figure>
<a id="fig:tracer_run2"></a> <a id="fig:tracer_run2"></a>
<img style="width:80.0%;" src="figures/tracer_run2.png" alt=""> <img style="width:80.0%;" src="figures/tracer_run2.png" alt="">
<figcaption>Figure 6: A trace plot for the clockRate parameter</figcaption> <figcaption>Figure 6: A trace plot for the **clockRate** parameter</figcaption>
</figure> </figure>
<br> <br>






### Run 3: Changing the clockRate operators ### Run 3: Changing the **clockRate** operators


If one parameter in particular is not converging or mixing well, we can try to tweak that parameter's operator(s). Remember that BEAST's operators control what new parameter values are proposed at each MCMC iteration and how these proposals are made (i.e. the proposal distribution). Since the **clockRate** parameter was not mixing well in Run 2, we will try increasing the frequency at which new **clockRates** are proposed. If one parameter in particular is not converging or mixing well, we can try to tweak that parameter's operator(s). Remember that BEAST's operators control what new parameter values are proposed at each MCMC iteration and how these proposals are made (i.e. the proposal distribution). Since the **clockRate** parameter was not mixing well in Run 2, we will try increasing the frequency at which new **clockRates** are proposed.




> Load the `tutorial_run2.xml` back into BEAUti and select `View > Show Operators` panel. This will bring up a new panel showing all the operators in use ([Figure 7](#fig:beauti_run3)). Click on the black triangle next to `Scale:clockRate`. In the menu that drops down, check **Optimize**. In the box to the right of `Scale.clockRate`, change the value from **0.1** to **3.0**. Now navigate to the MCMC panel and change the file name for the log file to `tutorial_run3.log`. When done, save as `tutorial_run3.xml`. > Load the `tutorial_run2.xml` back into BEAUti and select `View > Show Operators` panel. This will bring up a new panel showing all the operators in use ([Figure 7](#fig:beauti_run3)). Click on the black triangle next to `Scale:clockRate`. In the menu that drops down, check **Optimise**. In the box to the right of `Scale.clockRate`, change the value from **0.1** to **3.0**. Now navigate to the MCMC panel and change the file name for the log file to `tutorial_run3.log` and the tree file to `tutorial_run3.trees`. When done, save as `tutorial_run3.xml`.
> >
<figure> <figure>
Expand All @@ -168,19 +169,19 @@ If one parameter in particular is not converging or mixing well, we can try to t
<br> <br>




So, what just happened? We told BEAST to try to automatically optimize the scale operator on the clockRate, which moves the parameter value up or down. Note that by default, operators are automatically optimized in BEAST, but for the purposes of this tutorial I turned off auto-optimization to get especially bad mixing. We also increased the weight on this scale operator so that new **clockRates** will be proposed more often in the MCMC. Going from a weight of 0.1 to 3.0 means that new proposals for that parameter will be made thirty times as often, but the frequency at which a given operator is called depends on the weights given to other operators. So if there are parameters with very high ESS values, we may want to reallocate weight on their operators to operators on less well mixing parameters. For fun, you may want to guess what each operator in the **Operators** panel is doing. So, what just happened? We told BEAST to try to automatically optimise the scale operator on the **clockRate**, which moves the parameter value up or down. Note that by default, operators are automatically optimised in BEAST, but for the purposes of this tutorial I turned off auto-optimization to get especially bad mixing. We also increased the weight on this scale operator so that new **clockRates** will be proposed more often in the MCMC. Going from a weight of 0.1 to 3.0 means that new proposals for that parameter will be made thirty times as often, but the frequency at which a given operator is called depends on the weights given to other operators. So if there are parameters with very high ESS values, we may want to reallocate weight on their operators to operators on less well mixing parameters. For fun, you may want to guess what each operator in the **Operators** panel is doing.






> Run the `tutorial_run3.xml` in BEAST and then open the `tutorial_run3.log` in Tracer. > Run the `tutorial_run3.xml` in BEAST and then open the `tutorial_run3.log` in Tracer.
> >
We can see that optimizing the operator dramatically improves mixing for the clockRate ([Figure 8](#fig:tracer_run3)). But there is still room for improvement. We can see that optimizing the operator dramatically improves mixing for the **clockRate** ([Figure 8](#fig:tracer_run3)). But there is still room for improvement.


<figure> <figure>
<a id="fig:tracer_run3"></a> <a id="fig:tracer_run3"></a>
<img style="width:80.0%;" src="figures/tracer_run3.png" alt=""> <img style="width:80.0%;" src="figures/tracer_run3.png" alt="">
<figcaption>Figure 8: A trace plot for the clockRate parameter</figcaption> <figcaption>Figure 8: A trace plot for the **clockRate** parameter</figcaption>
</figure> </figure>
<br> <br>


Expand All @@ -190,7 +191,7 @@ One thing to keep in mind is that BEAST is using MCMC to explore a multidimensio
<figure> <figure>
<a id="fig:tracer_run3Joint"></a> <a id="fig:tracer_run3Joint"></a>
<img style="width:80.0%;" src="figures/tracer_run3Joint.png" alt=""> <img style="width:80.0%;" src="figures/tracer_run3Joint.png" alt="">
<figcaption>Figure 9: The joint posterior distribution of TreeHeight and clockRate</figcaption> <figcaption>Figure 9: The joint posterior distribution of **TreeHeight** and **clockRate**</figcaption>
</figure> </figure>
<br> <br>


Expand All @@ -201,7 +202,7 @@ One thing to keep in mind is that BEAST is using MCMC to explore a multidimensio
The easiest way to improve MCMC performance when two parameters are highly negatively correlated is to add an **UpDown** operator. This operator scales one parameter up while scaling the other parameter down. If two parameters are highly positively correlated we can also use this operator to scale both parameters in the same direction, up or down. The easiest way to improve MCMC performance when two parameters are highly negatively correlated is to add an **UpDown** operator. This operator scales one parameter up while scaling the other parameter down. If two parameters are highly positively correlated we can also use this operator to scale both parameters in the same direction, up or down.




> Load the `tutorial_run3.xml` back into BEAUti and select `View > Show Operators` panel. Click on the black triangle next to **UpDown clockRate**. In the menu that drops down, check **Optimize**. In the box to the right, change the weight on the UpDown operator from **0.0** to **3.0**. In the MCMC panel, change the file name for the log file to `tutorial_run4.log`. When done, save as tutorial_run4.xml`. > Load the `tutorial_run3.xml` back into BEAUti and select `View > Show Operators` panel. Click on the black triangle next to **UpDown clockRate**. In the menu that drops down, check **Optimise**. In the box to the right, change the weight on the UpDown operator from **0.0** to **3.0**. In the MCMC panel, change the file name for the log file to `tutorial_run4.log` and the tree file to `tutorial_run4.trees`. When done, save as tutorial_run4.xml`.
> >
We just added an **UpDown** operator on the **clockRate** and the **TreeHeight**. The fact that these two parameters are highly negatively correlated makes perfect sense. An increase in the **clockRate** means that less time is needed for substitutions to accumulate along branches; meaning branches can be shorter and yet still explain the same amount of accumulated evolutionary change in the sequence data. If all branches in the tree become shorter, then the total **TreeHeight** will also decrease. Thus it makes sense to include an **UpDown** operator on **clockRate** and **TreeHeight**. In fact, by default BEAUTi includes this operator. However, I disabled it in the original XML file by setting the weight on this operator to zero for the purpose of illustration. We just added an **UpDown** operator on the **clockRate** and the **TreeHeight**. The fact that these two parameters are highly negatively correlated makes perfect sense. An increase in the **clockRate** means that less time is needed for substitutions to accumulate along branches; meaning branches can be shorter and yet still explain the same amount of accumulated evolutionary change in the sequence data. If all branches in the tree become shorter, then the total **TreeHeight** will also decrease. Thus it makes sense to include an **UpDown** operator on **clockRate** and **TreeHeight**. In fact, by default BEAUTi includes this operator. However, I disabled it in the original XML file by setting the weight on this operator to zero for the purpose of illustration.
Expand All @@ -215,7 +216,7 @@ Looking at the MCMC output in Tracer, we see that all parameters are starting to
<figure> <figure>
<a id="fig:tracer_run4"></a> <a id="fig:tracer_run4"></a>
<img style="width:80.0%;" src="figures/tracer_run4.png" alt=""> <img style="width:80.0%;" src="figures/tracer_run4.png" alt="">
<figcaption>Figure 10: A trace plot for the clockRate parameter</figcaption> <figcaption>Figure 10: A trace plot for the **clockRate** parameter</figcaption>
</figure> </figure>
<br> <br>


Expand Down
Binary file removed figures/beast_input_run1.pdf
Binary file not shown.
Binary file modified figures/beast_input_run1.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 55382af

Please sign in to comment.