# Tutorial: Maximum Likelihood phylogeny inference

<p> Dr. Mozes Blom (2025)<br>
Museum für Naturkunde Berlin<br>
Leibniz Institute for Evolutionary and Biodiversity Science</p>

## 1. Introduction

### 1.1 Course note

This tutorial was developed as a practical exercise that complements Lecture Day 4 of the MSc course Phylogenetics in Ecology & Evolution, Potsdam University. For further information see the [course website](https://amniota.org/phylogenetics/).

### 1.2 Background

Phylogenetic inference with genetic data, the reconstruction of phylogenetic trees from sequences, has advanced tremendously over the past decades. Model-based methods (Maximum Likelihood and Bayesian inference) combine advances in probabilistic modelling and computational science to make likelihood calculations tractable, and can now accommodate a wide range of complex models that reflect the molecular mechanisms of sequence evolution.

Identifying an appropriate substitution model is a key step in model-based phylogenetic inference. The various models differ in how much rate variation they capture (transition/transversion rate, among-site rate variation, etc.) across sites, loci or genes. Partitioned models can be used to group sites that are likely to share similar rates of sequence evolution (by gene or codon position, for example) and to model each group separately.

In this tutorial, we will use IQ-Tree3, a software package for phylogenetic inference using Maximum Likelihood that is very versatile, fast and able to conduct several distinct tasks for which we would otherwise need multiple independent software packages. The aim of this tutorial is to demonstrate the several steps typically involved when inferring a Maximum-Likelihood tree and to illustrate a scenario where choice of substitution model/data partitioning can lead to distinct biological interpretations.

### 1.3 Practical overview

This computer practical is written in Jupyter Notebook (NB) format (see Github [README](https://github.com/MozesBlom/tutorials/tree/main) for more details). In short, Jupyter NBs contain either text cells (such as the current cell) in Markdown format or code cells (frequently) in Python3 format. The aim of this practical is not to provide an exhaustive introduction to Python/Markdown/Jupyter, but to provide an introduction to phylogenomic inference using Maximum Likelihood. Therefore, all code to run this practical is already in place. In addition, there are several questions to be answered and you can double-click on the corresponding cells to enter your answer. Alternatively, you can just write them down for yourself.

This tutorial is loosely based on a workshop organised by the developers of IQ-Tree (Minh Bui et al.) and includes a subset of data from [Chiara et al. (2012)](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65). It contains all the commands to infer a ML phylogeny for this dataset, but I would strongly encourage you to explore the rich functionality of the programme and the fantastic [documentation](https://iqtree.github.io/doc/) with examples, explanations and walk-throughs.

### 1.4 Requirements

To run this practical, the following software and Python modules are needed:
- Python 3+
  - [biopython](https://biopython.org/)
  - [condacolab](https://github.com/conda-incubator/condacolab)
- [iqtree3](https://iqtree.github.io/) (installed using conda and the [bioconda repository](https://anaconda.org/bioconda/iqtree))

## 2. Getting Started

### 2.1 Install the Conda package manager and IQtree3 on your Google Colab instance

The software and Python modules mentioned under *1.4 Requirements* first need to be installed and imported before proceeding with the rest of the NB. If running on a Google Colab instance (see [Github](https://github.com/MozesBlom/tutorials/tree/main/2023_Phy_Eco_Evol) for further details), software and modules will always need to be installed. Each time you start a new Google Colab instance, you are basically starting a new computer for the very first time! For our installation of **iqtree3**, we will make use of a well-known package manager called Conda. [Conda](https://anaconda.org/anaconda/conda) has been developed to enable the use of distinct software environments on the same computer and it also makes it very easy to install and deploy software packages. Here we are particularly interested in the latter functionality and use it to install **iqtree3** on our Google Colab instance.

If running on your personal desktop, in JupyterLab for example, then this may not be needed if a Conda environment with the correct software and modules is already loaded and selected. NOTE: if you go this route, make sure that JupyterLab itself and all dependencies are installed in the relevant Conda environment. Otherwise, the environment will not show up in the JupyterLab Desktop environment list and cannot be selected. In all other use cases, first proceed with installing the following software tools:

In [None]:
#Install conda using the conda-colab library
!pip install -q condacolab
import condacolab
condacolab.install()

#Install iqtree3 from bioconda repository
!conda install bioconda::iqtree

What have we done? First, note the difference between lines that start with an exclamation mark (!) and lines that do not. Lines starting with '!' are *command line statements* that are passed to the system shell, just like the commands you would type in a terminal to install software on your own computer. In this case we used pip to download and install condacolab.

Lines without an exclamation mark are in Python syntax and are executed within the Python environment. In the second line, we imported condacolab into the Python environment and then used the function *'condacolab.install()'* to install conda on our Colab instance.

Once that is done, we have successfully installed conda on our Google colab instance and we can use conda to install the latest version of iqtree from the [bioconda repository](https://anaconda.org/bioconda/iqtree)! Have a look for yourself:

In [None]:
!iqtree3 --help

**If everything was installed as expected you should now be able to see a list of all the iqtree3 options, as if the program was installed on your own computer!**

![iqtree3_installed.png](img/iqtree3_installed.png)
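
If you prefer a programmatic check over reading the help text, the following minimal sketch asks Python whether an executable can be found on the system `PATH`. It only uses the standard library function `shutil.which`; the helper name `tool_available` is made up for this example, and the sketch assumes the executable is called `iqtree3`, as in the command above.

```python
import shutil


def tool_available(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


# Report the status of the tools used in this practical.
# The executable names are assumptions based on the commands in this NB.
for tool in ("conda", "iqtree3"):
    status = "found" if tool_available(tool) else "NOT found"
    print(f"{tool}: {status}")
```

If a tool is reported as not found, re-run the installation cell above before continuing.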

### 2.2 Import python modules needed

Finally, we will use conda again to download and install biopython.

In [None]:
# Install biopython from anaconda repository
!conda install anaconda::biopython

In [None]:
# Import biopython into python environment
import Bio

## 3. Input Data

For this tutorial, we will use a subset of a genomic dataset that was originally devised to resolve the contested relationship between turtles and other amniotes:

**The phylogenetic placement of turtles**

![turtle_phylo.png](img/turtle_phylo.png)

**Question:** Tree thinking! Can you formulate the difference between the three phylogenies into three competing hypotheses?

**Answer:**

To differentiate between these competing hypotheses, [Chiara et al. (2012)](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65) used high-throughput sequencing to obtain seven new transcriptomes from the blood, liver, or jaws of four turtles, a caiman, a lizard, and a lungfish. They complemented this with existing data that was available at the time. For the IQtree3 tutorial, [Minh Bui et al.](https://iqtree.github.io/workshop/molevol_tutorial2025#1-input-data) took this dataset and created a subset that includes 29 genes. The alignment file itself is stored in FASTA format (*'.fa'*) and the *'.nex'* file is a partition file in Nexus format that we will use later on.

**Exercise:** Download the alignment and partition file to your own computer first. Drag and drop, or open it in your favourite alignment viewer. For example, you can use the Geneious program that we used for the Tutorial of Lecture 3, or the Free-to-Use viewer [Jalview](https://www.jalview.org/). The partition file (*'.nex'*) can be opened with any text editor of your choice (e.g. [vscode](https://code.visualstudio.com/)).

**Question a):** Zoom out completely. The alignment file contains 29 concatenated genes, with concatenation meaning that every gene alignment is pasted directly adjacent to the previous gene. If you observe the zoomed-out alignment file, can you identify the gene boundaries? Does it roughly match with the patterns of missing data?

**Question b):** The missing data is non-randomly distributed between the different species. Given the information provided above, why do you think this may be? Hint: This topic was also mentioned during the last lecture of Day 3.

**Answer:**

### 3.1 Upload data to Google Colab instance

In [None]:
# Now download both files to the Google Colab instance
!wget -cq https://iqtree.github.io/workshop/data/turtle.fa
!wget -cq https://iqtree.github.io/workshop/data/turtle.nex

For bioinformatics (and phylogenomics) we often cannot use GUIs such as the Geneious package introduced yesterday. Instead, we rely on *executables and command line tools* and interact with our computer using shell commands in bash. You've already used some shell commands above to install software (i.e. each line of code starting with '!'). The two *'!wget'* commands above were instructions to the Google Colab instance to download data from the links that follow. Let's have a look at the 'download' or 'desktop' level of your Google Colab instance.

In [None]:
# Bash command to list all items in the current folder/location.
!ls

Both files have been downloaded to your Google colab instance. You now have everything that you need: i) IQtree3, for ML inference, ii) the alignment file and iii) a data partition.
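
Before starting an analysis it can be useful to quantify how much of each sequence is missing. The following is a minimal, pure-Python sketch under a few assumptions: the alignment is in FASTA format, and the characters `-`, `?` and `N` are treated as missing data. The function names and the toy alignment are invented for illustration and are not part of the turtle dataset.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is used as the sequence name
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def missing_fraction(seq, missing_chars="-?Nn"):
    """Fraction of positions that are gaps or unknown characters (assumed set)."""
    if not seq:
        return 0.0
    return sum(seq.count(c) for c in missing_chars) / len(seq)


# Toy alignment, made up for this example
toy_fasta = """>taxon_A
ACGTACGTAC
>taxon_B
ACGT------
>taxon_C
AC??ACGTNN
"""

alignment = read_fasta(toy_fasta)
fractions = {name: missing_fraction(seq) for name, seq in alignment.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} missing")

# To try this on the tutorial data (assuming turtle.fa is in the working directory):
# with open("turtle.fa") as handle:
#     alignment = read_fasta(handle.read())
```

IQtree3 reports a comparable per-sequence summary in its own log, so this sketch is mainly meant to show what that number represents.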

## 4. Identify the optimal substitution model

A major reason why ML inference is frequently used for phylogenomic studies is that IQtree3 (and similar software packages, e.g. [RAxML](https://cme.h-its.org/exelixis/web/software/raxml/)) can do a complete ML analysis from start to finish with a single command. In contrast to Bayesian analyses, there is no need to set and check priors, to assess model convergence, etc. IQtree3 can identify the optimal substitution model, create a data partition and infer the ML phylogeny in one run. A ML analysis can therefore be easily scaled up and parallelized when using a High-Performance Computing (HPC) cluster, and is often used for phylogenomic-scale data. For the purpose of the present tutorial, we will divide the analysis into individual components and discuss what happens at each step. **Please note:** From here onwards, the commands can take a while (a few minutes), so please be patient and have a look at the IQtree3 output while it's running. There are a lot of calculations going on under the hood, but you will undoubtedly recognize some terms that we discussed during our lecture.

In [None]:
# Let's first assess the optimal substitution model for the entire dataset (rather than a partitioned analysis). Here's the command to do so:
!iqtree3 -s turtle.fa -m TESTONLY
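
While the model test is running, it may help to recall how candidate models are compared. IQtree3 reports information criteria such as AIC and BIC, which reward a high log-likelihood but penalise the number of free parameters (which, in IQtree3's accounting, includes branch lengths). The sketch below implements the standard formulas; the two model names, their log-likelihoods, their parameter counts and the alignment length are all invented for illustration and are not results from the turtle alignment.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical candidates: lnL = log-likelihood, k = number of free parameters
n_sites = 1000
candidates = {
    "simple_model": {"lnL": -5200.0, "k": 30},
    "complex_model": {"lnL": -5150.0, "k": 38},
}

scores = {
    name: bic(model["lnL"], model["k"], n_sites)
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: BIC = {score:.2f}")
print(f"Preferred model under BIC: {best}")
```

In this made-up case the extra eight parameters are justified by the gain in likelihood; with a smaller gain, the penalty term would favour the simpler model.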

**Question:** Which substitution model was identified as the optimal model for this alignment? Does this surprise you given the size of the alignment?

**Answer:**

**Exercise:** You may have noticed that IQtree3 also considers a number of extra models that we haven't discussed yet. Please have a look at the IQtree3 [documentation](https://iqtree.github.io/doc/Substitution-Models) and check out all the possible models. Can you write out what the entire 'code' stands for? E.g. F81+F = Equal substitution rates and unequal (empirical) base frequencies.

**Answer:**

Hint: If for some reason you lost the analysis output, you can always double-check the log using the following command:

In [None]:
!cat turtle.fa.log

## 5. Inferring a ML phylogeny with IQtree3

### 5.1 Unpartitioned

**Exercise:** Great job, you have now identified the optimal substitution model for this specific alignment. Below you see the IQtree3 command to run a ML search using the turtle alignment and to do 1000 ultra-fast bootstraps. You can read more about the subtle difference between 'regular' bootstraps, as discussed during Lecture 2, and 'ultra-fast' bootstraps [here](https://academic.oup.com/mbe/article/35/2/518/4565479). The output files have the user-defined prefix *'turtle_ML'*. Replace the placeholder, including the brackets, with the exact code for the optimal substitution model that you found above and run the code. E.g. if the substitution model was 'F81+F', then the entire call should read: *'!iqtree3 -s turtle.fa -B 1000 -m F81+F --prefix turtle_ML'*

In [None]:
# Run IQtree3 with a user defined substitution model
!iqtree3 -s turtle.fa -B 1000 -m [place_holder] --prefix turtle_ML

This took a while, but don't forget that the likelihoods were calculated for a large number of possible topologies and branch lengths! Before looking at the ML phylogeny, let's dive a bit deeper into the analysis itself and the analysis output:

**Question a):** Have a look at the first 20 or so lines of the output. IQtree3 reports that it has read in the alignment, that it interprets the alignment as DNA/RNA sequences, and it then provides some details about the alignment size, the number of sequences and how many parsimony-informative sites it has detected. How many parsimony-informative sites did it find and what does that mean?

**Question b):** IQtree3 also records the amount of missing data. Which species have more than 50% missing data?

**Answer:**

Once the alignment is read, IQtree3 is then ready to start a ML search. However, as you may recall from today's lecture, to do so it will need a starting tree since the possible tree search space with 16 sequences is huge. In the output section above *'INITIALIZING CANDIDATE TREE SET'* you can see that a few reasonable first steps are taken to start searching in the right direction.
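
To get a feeling for how large this search space is: the number of distinct unrooted, fully bifurcating topologies for n taxa is (2n-5)!!, i.e. the product 3 x 5 x ... x (2n-5). The short sketch below evaluates this formula; the function name is made up for this example.

```python
def n_unrooted_trees(n_taxa):
    """Number of unrooted, fully bifurcating topologies: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("At least three taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 16):
    print(f"{n:>2} taxa: {n_unrooted_trees(n):,} possible unrooted trees")
```

For the 16 sequences in this alignment, that is roughly 2 x 10^14 topologies, which is why an exhaustive search is not feasible and heuristics are used instead.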

**Question c):** We have primarily discussed 'character' based methods to infer a phylogeny (e.g. Parsimony, ML & Bayesian). However, 'distance' based approaches are another class of phylogeny building methods. These are extremely rapid, but have several weaknesses that we will not discuss further today (I encourage you to dive deeper into this class of methods though). Either way, IQtree3 uses a distance-based method to infer its first starting tree. Can you spot which method was used? Hint: it's the first log-likelihood calculated for a tree.

**Question d):** After the *'INITIALIZING CANDIDATE TREE SET'* header, it then generates 98 parsimony trees and does a tree search on the 20 best initial trees. Which local tree rearrangement operation approach is used for proposing new phylogenies? Hint, these approaches were briefly discussed today in the slide on *'heuristic searches'* and in Lecture 2 as well.

**Answer:**

Now move on to the final section of the analysis log, after the header *'FINALIZING TREE SEARCH'*

**Question e):** After the *'FINALIZING TREE SEARCH'* header, IQtree3 provides a summary of the ML tree search. It lists the difference in the log-likelihood score between the first tree and the tree with the best log-likelihood score. Are these values very different, does this surprise you? 

**Question f):** Have a look at the difference in substitution rate parameters that IQtree3 found. Which substitutions were more common than others? Does this match theoretical expectations?

**Answer:**

**OK, we have looked at enough details of the analysis!! Please show me the tree, I want to know the phylogenetic placement of turtles!!**

Fair enough! Normally, when you run these analyses on your local HPC cluster or desktop computer, you download the output file to your own computer and use a tree visualization tool (e.g. [Figtree](https://tree.bio.ed.ac.uk/software/figtree/)) to have a look at the phylogeny. However, we are currently working on one of Google's computers, so instead we use a Python package (Biopython) to read the Newick output of IQtree3 and the matplotlib module to draw the tree.

In [None]:
## Import the Phylo module of Biopython and visualise the phylogeny
from Bio import Phylo
import matplotlib.pyplot as plt
tree = Phylo.read("turtle_ML.treefile", "newick")
fig, ax = plt.subplots(figsize=(10, 6))  # Adjust the size as needed
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
Phylo.draw(tree, axes=ax)
plt.show()
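
The support values shown in the plot are stored in the Newick file as labels of the internal nodes, directly after a closing bracket. As a rough sketch of how they can be extracted without any extra packages, the example below applies a regular expression to a toy Newick string. The tree, its taxon names and its support values are made up, and the pattern only handles plain numeric labels followed by a branch length, so treat it as an illustration rather than a general Newick parser.

```python
import re

# Toy tree in Newick format (invented for this example)
toy_newick = "((A:0.1,B:0.2)100:0.05,(C:0.1,D:0.3)72:0.04,E:0.5);"

# An internal node label follows ')' and precedes the ':' of its branch length
SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?):")


def support_values(newick):
    """Return the numeric internal node labels of a Newick string as floats."""
    return [float(value) for value in SUPPORT_PATTERN.findall(newick)]


supports = support_values(toy_newick)
print("Support values:", supports)
print("Lowest support:", min(supports))

# To try this on the IQtree3 output (assuming the file exists):
# with open("turtle_ML.treefile") as handle:
#     supports = support_values(handle.read())
```

With Biopython, the same information is usually available through the `confidence` attribute of the internal clades of the tree object.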

**Question g):** Which of our initial hypotheses does the ML phylogeny agree with? What is the bootstrap support for this relationship? Does it match the findings in the [Chiara et al. 2012](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65) paper?

**Answer:**

### 5.2 Partitioned

During today's lecture (Lecture 4), you have learnt that substitution rates may vary between sites (e.g. by codon position) and genes. In the above ML search we used a substitution model where the rate of substitution can vary depending on substitution type (e.g. A to T) but we have assumed that the same type of substitution (A to T) has an equal rate across all 29 genes. One can argue that this is an unlikely assumption and that we need to account for variation in substitution rates between genes. We can do so by conducting a 'partitioned' analysis instead and you have already downloaded a file where the gene boundaries in the 'turtle.fa' alignment have been delineated. Please note that this file only describes the gene boundaries and not the codon positions.
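
To see what such a partition file encodes, the sketch below reads gene boundaries from Nexus `charset` statements and prints the length of each gene. It assumes simple lines of the form `charset NAME = START-END;` and does not handle codon-position notation or discontinuous ranges. The toy file content and gene names are invented and are not the actual content of 'turtle.nex'.

```python
import re

# Matches e.g. "charset gene1 = 1-300;" (simple contiguous ranges only)
CHARSET_PATTERN = re.compile(
    r"charset\s+(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE
)


def parse_charsets(nexus_text):
    """Return a list of (name, start, end) tuples with 1-based, inclusive sites."""
    return [
        (name, int(start), int(end))
        for name, start, end in CHARSET_PATTERN.findall(nexus_text)
    ]


# Toy partition file, made up for this example
toy_nexus = """#nexus
begin sets;
    charset gene1 = 1-300;
    charset gene2 = 301-750;
    charset gene3 = 751-1000;
end;
"""

charsets = parse_charsets(toy_nexus)
for name, start, end in charsets:
    print(f"{name}: sites {start}-{end} ({end - start + 1} bp)")

# To try this on the tutorial data (assuming turtle.nex is in the working directory):
# with open("turtle.nex") as handle:
#     charsets = parse_charsets(handle.read())
```

Comparing the printed boundaries with what you saw in the alignment viewer is a quick sanity check that the partition file and the alignment belong together.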

**PartitionFinder2** is a popular tool to partition large alignments and to simultaneously identify the best-fit substitution model for each data partition. However, it has also been shown that partitioning can easily lead to overparameterization; e.g. partitioning an alignment of four genes by gene and codon position already yields 12 different partitions. One helpful task that PartitionFinder takes care of is to identify partitions that can be clustered together and treated as a single data partition. For example, two genes may share similar rates of sequence evolution at some codon positions; PartitionFinder can identify such patterns and merge, say, codon positions 1 and 2 of both genes into a single data partition, for which the best-fit substitution model is then identified. You can learn more about the method in the original paper via the following [link](https://academic.oup.com/mbe/article/29/6/1695/1000514?login=false). IQtree3 has [a built-in implementation](https://iqtree.github.io/doc/Advanced-Tutorial#partitioned-analysis-for-multi-gene-alignments) of the PartitionFinder merging approach and includes it in a fully automated workflow which: i) identifies an optimal number of data partitions, ii) identifies the best substitution model for each partition and iii) conducts a ML search for the tree with the highest likelihood score. This is done using the following command:

In [None]:
## Run a partitioned ML search with IQtree3 by providing both the sequence alignment and a nexus file identifying the corresponding boundaries for each gene in the alignment.
## Note that we also do 1000 ultra-fast bootstraps and use a prefix for all output files.
!iqtree3 -s turtle.fa -p turtle.nex -B 1000 -T AUTO -m MFP+MERGE -rcluster 10 --prefix turtle.merge
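
While this analysis is running, the overparameterization concern mentioned above can be made concrete with some simple arithmetic. The sketch below counts model parameters under the assumption that every partition receives its own GTR+G model (5 free exchange rates, 3 free base frequencies and 1 gamma shape parameter, so 9 in total). Branch lengths are ignored, and the function names are made up for this example.

```python
# Free parameters of a GTR+G model for one partition (assumed model choice)
GTR_RATES = 5      # six exchange rates, one is fixed as the reference
BASE_FREQS = 3     # four base frequencies that must sum to one
GAMMA_SHAPE = 1    # shape parameter of the gamma rate distribution
PARAMS_PER_PARTITION = GTR_RATES + BASE_FREQS + GAMMA_SHAPE


def n_partitions(n_genes, by_codon_position=False):
    """Number of partitions when splitting by gene, optionally also by codon position."""
    return n_genes * (3 if by_codon_position else 1)


def n_model_params(n_parts, per_partition=PARAMS_PER_PARTITION):
    """Total substitution model parameters if each partition has its own model."""
    return n_parts * per_partition


schemes = {
    "unpartitioned": 1,
    "4 genes x 3 codon positions": n_partitions(4, by_codon_position=True),
    "29 genes": n_partitions(29),
    "29 genes x 3 codon positions": n_partitions(29, by_codon_position=True),
}
for label, parts in schemes.items():
    print(f"{label}: {parts} partitions, {n_model_params(parts)} model parameters")
```

Merging partitions that evolve similarly reduces this parameter count, which is exactly what the `MFP+MERGE` option in the command above is meant to achieve.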

You may have noticed that this analysis took quite a bit longer than the unpartitioned analysis. If you paid close attention to the output, you will also have seen that it conducted quite a few extra steps. Let's dissect what IQtree3 has done.

Have a look at the output of IQtree3. The first operations are the same: it starts the program and reads in the alignment. However, it then continues with a step labelled *'Identifying sites to remove'*.

**Question a):** When looking at the summary table of the filtered alignments, what do you notice? Hint: Also have a look at the alignment in your alignment viewer, does it make sense why IQtree3 has started with this step?

**Answer:**

After alignment filtering, the PartitionFinder merging algorithm compares the rate variation between genes and clusters together the genes that match.

In [None]:
## Let's have a look at the number of partitions that IQtree3 ended up with:
!cat turtle.merge.iqtree

**Question b):** We started with 29 partitions, one for each gene, how many partitions do we have now?

**Answer:**

Finally, let's have a look at the effect of accounting for possible rate variation between dataset partitions.

In [None]:
tree = Phylo.read("turtle.merge.treefile", "newick")
fig, ax = plt.subplots(figsize=(10, 6))  # Adjust the size as needed
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
Phylo.draw(tree, axes=ax)
plt.show()

**Question c):** Which of our initial hypotheses does the ML phylogeny, based on a partitioned dataset, agree with? What is the bootstrap support for this relationship? Does it match the findings in the [Chiara et al. 2012](https://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-65) paper?

**Answer:**