Substitution Models

Alexey Kozlov edited this page Jan 8, 2019 · 10 revisions

[DRAFT]

Rate heterogeneity among sites

Proportion of invariant sites

When a proportion of invariant sites is taken into consideration, branch lengths are properly scaled during the P-matrix computation. However, the proportion also affects the log-likelihood:

$$\log LK = \sum_i \log\left[(1 - p)\,LK_i(t) + p\,LK^*_i(t)\right]$$

where p is the proportion of invariant sites, LK_i(t) is the likelihood of site i assuming it is variable, and LK*_i(t) is the likelihood of site i assuming it is invariable.

To use models that consider a proportion of invariant sites (commonly denoted as +I models), pll provides the following functions:

unsigned int pll_count_invariant_sites(pll_partition_t * partition,
                                       unsigned int * state_inv_count);
int pll_update_invariant_sites(pll_partition_t * partition);
int pll_update_invariant_sites_proportion(pll_partition_t * partition,
                                          unsigned int params_index,
                                          double prop_invar);

Note the following important fields in the pll_partition_t structure:

typedef struct pll_partition
{
  ...
  double * prop_invar;
  int * invariant;
  ...
} pll_partition_t;

prop_invar is the proportion of invariant sites parameter, and invariant is a fixed vector with the same length as the sequences that contains -1 if the site is not invariant, or the encoded site state otherwise.

For example:

MSA:        A A C T G G T
            C A T T A G T
            C A C T A G T
            A A C T A G T
            -------------
invariant:  x A x T x G T

where 'x' stands for -1, and A, G and T stand for the corresponding encoded states.
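The construction of the invariant vector illustrated above can be sketched as follows. This is a standalone illustration, not the pll implementation: the function names are hypothetical, and the state encoding (A=0, C=1, G=2, T=3) is an assumption made for the example. Characters outside that alphabet, such as gaps, are simply treated as making the site non-invariant here.

```c
#include <stddef.h>

/* Hypothetical encoding for this sketch: A=0, C=1, G=2, T=3, others -1. */
static int encode_state(char c)
{
  switch (c)
  {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default:  return -1;
  }
}

/* Set invariant[s] to the encoded state if every sequence carries the same
 * state at site s, or to -1 otherwise. Returns the number of invariant sites. */
unsigned int find_invariant_sites(const char *const *msa, size_t n_seqs,
                                  size_t n_sites, int *invariant)
{
  unsigned int count = 0;
  for (size_t s = 0; s < n_sites; ++s)
  {
    int state = (n_seqs > 0) ? encode_state(msa[0][s]) : -1;
    for (size_t k = 1; k < n_seqs && state != -1; ++k)
      if (encode_state(msa[k][s]) != state)
        state = -1;
    invariant[s] = state;
    if (state != -1)
      ++count;
  }
  return count;
}
```

Applied to the four sequences of the example, the sketch marks sites 2, 4, 6 and 7 as invariant and the remaining three as -1.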

The invariant vector needs to be computed once, after setting the tip states, and does not need to be updated afterwards. This is done automatically when calling pll_update_invariant_sites_proportion() with a proportion greater than 0. The tip states must be set before calling this function; otherwise it returns an error.

Each time the proportion of invariant sites p is updated, the effective branch lengths used for computing the P-matrices are rescaled by a factor of 1/(1-p), so that the variable sites alone account for the expected number of substitutions per site. Therefore, it is also necessary to update all the probability matrices by calling pll_update_prob_matrices().

If for some reason the sequences are changed during the execution, the invariant vector could become invalid. In such a case, the client code should explicitly update the vector by calling pll_update_invariant_sites().

Models of amino acid replacement

Although mechanistic models exist (Thorne and Goldman, 2003), the models of protein evolution currently in use are mostly based on empirical matrices, for reasons of computational cost and data complexity. These matrices were estimated from large datasets consisting of many diverse protein families.

In PLL, the empirical parameters are stored in two arrays, containing the substitution rates and the amino acid frequencies respectively:

double pll_aa_rates_MODEL[190];

double pll_aa_freqs_MODEL[20];

where MODEL is one of the following (in lower case):

  1. Dayhoff (Dayhoff et al., 1978)
  2. LG (Le and Gascuel, 2008)
  3. DCMut (Kosiol and Goldman, 2005)
  4. JTT (Jones et al., 1992)
  5. mtREV (Adachi and Hasegawa, 1996)
  6. WAG (Whelan and Goldman, 2001)
  7. RtREV (Dimmic et al., 2002)
  8. CpREV (Adachi et al., 2000)
  9. VT (Muller and Vingron, 2000)
  10. Blosum62 (Henikoff and Henikoff, 1992)
  11. MtMam (Cao et al., 1998)
  12. MtArt (Abascal et al., 2007)
  13. MtZoa (Rota-Stabelli et al., 2009)
  14. PMB (Veerassamy, Smith and Tillier, 2003)
  15. HIVb (Nickle et al., 2007)
  16. HIVw (Nickle et al., 2007)
  17. JTT-DCMut (Kosiol and Goldman, 2005)
  18. FLU (Dang et al., 2010)
  19. StmtREV (Liu et al., 2014)
  20. DEN (Kim et al., 2018)
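The rate arrays hold 190 values because a symmetric 20 x 20 exchangeability matrix has 20 * 19 / 2 = 190 distinct off-diagonal entries. The sketch below shows one way to expand such an array into a full matrix. It assumes the values are stored as the upper triangle in row-major order without the diagonal; that layout, as well as both function names, are assumptions of this example and should be checked against the library before use.

```c
#include <stddef.h>

#define AA_STATES 20

/* Position of the exchangeability for the unordered pair (i, j), i != j,
 * in a 190-entry array. Assumed layout: row-major upper triangle,
 * diagonal excluded. */
size_t aa_rate_index(unsigned int i, unsigned int j)
{
  if (i > j) { unsigned int tmp = i; i = j; j = tmp; }
  return (size_t)i * (2 * AA_STATES - i - 1) / 2 + (j - i - 1);
}

/* Expand the 190 rates into a symmetric 20 x 20 matrix with a zero diagonal. */
void aa_expand_rates(const double *rates, double full[AA_STATES][AA_STATES])
{
  for (unsigned int i = 0; i < AA_STATES; ++i)
    for (unsigned int j = 0; j < AA_STATES; ++j)
      full[i][j] = (i == j) ? 0.0 : rates[aa_rate_index(i, j)];
}
```

Under this layout the first 19 entries correspond to the pairs of state 0 with states 1 to 19, and the last entry to the pair (18, 19).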

References

  • Abascal, F., Posada, D., and Zardoya, R. 2007. MtArt: a new model of amino acid replacement for Arthropoda. Mol Biol Evol 24: 1-5.
  • Adachi, J., and Hasegawa, M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42: 459-468.
  • Adachi, J., Waddell, P.J., Martin, W., and Hasegawa, M. 2000. Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol 50: 348-358.
  • Cao, Y., Janke, A., Waddell, P.J., Westerman, M., Takenaka, O., Murata, S., Okada, N., Paabo, S., and Hasegawa, M. 1998. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J Mol Evol 47: 307-322.
  • Dang, C.C., Le, Q.S., Gascuel, O., and Le, V.S. 2010. FLU, an amino acid substitution model for influenza proteins. BMC Evolutionary Biology 10: 99.
  • Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. 1978. A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure. (ed. M.O. Dayhoff), pp. 345-352. National Biomedical Research Foundation, Washington, DC.
  • Dimmic, M.W., Rest, J.S., Mindell, D.P., and Goldstein, R.A. 2002. rtREV: an amino acid substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J Mol Evol 55: 65-73.
  • Henikoff, S., and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89: 10915-10919.
  • Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8: 275-282.
  • Kim, T. Le, Cao, C. D. and Le, V. Sy, "Building a Specific Amino Acid Substitution Model for Dengue Viruses," 2018 10th International Conference on Knowledge and Systems Engineering (KSE), Ho Chi Minh City, Vietnam, 2018, pp. 242-246.
  • Kosiol, C., and Goldman, N. 2005. Different versions of the Dayhoff rate matrix. Mol Biol Evol 22: 193-199.
  • Le, S.Q., and Gascuel, O. 2008. An improved general amino acid replacement matrix. Mol Biol Evol 25: 1307-1320.
  • Liu, Y., Cox, C. J., Wang, W., & Goffinet, B. 2014. Mitochondrial phylogenomics of early land plants: Mitigating the effects of saturation, compositional heterogeneity, and codon-usage bias. Systematic biology, 63(6), 862-878.
  • Muller, T., and Vingron, M. 2000. Modeling amino acid replacement. J Comput Biol 7: 761-776.
  • Nickle, D.C., Heath, L., Jensen, M.A., Gilbert, P.B., Mullins, J.I., and Kosakovsky Pond, S.L. 2007. HIV-specific probabilistic models of protein evolution. PLoS ONE 2: e503.
  • Rota-Stabelli, O., Yang, Z., & Telford, M. J. 2009. MtZoa: a general mitochondrial amino acid substitutions model for animal evolutionary studies. Molecular phylogenetics and evolution, 52(1), 268-272.
  • Thorne, J.L., and Goldman, N. 2003. Probabilistic models for the study of protein evolution. In Handbook of Statistical Genetics. (eds. D.J. Balding, M. Bishop, and C. Cannings), pp. 209-226. John Wiley & Sons, Ltd., Chichester, England.
  • Veerassamy, S., Smith, A., & Tillier, E. R. 2003. A transition probability model for amino acid substitutions from blocks. Journal of Computational Biology, 10(6), 997-1010.
  • Whelan, S., & Goldman, N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular biology and evolution, 18(5), 691-699.