# Tutorial: Gene Trees - Species Trees

<p> Dr. Mozes Blom (2025)<br>
Museum für Naturkunde Berlin<br>
Leibniz Institute for Evolutionary and Biodiversity Science</p>

## 1. Introduction

### 1.1 Course note

This tutorial was developed as a practical exercise that complements a Lecture on **Gene trees and Species tree inference** for the MSc course Phylogenetics in Ecology & Evolution, Potsdam University. For further information see the [course website](https://amniota.org/phylogenetics/).

### 1.2 Background

In a previous [tutorial on ML](https://github.com/MozesBlom/tutorials/tree/main/2025_PU_Phylo_Eco_Evol/ML), we learnt how IQtree3 can be used to infer a Maximum-Likelihood phylogeny for a single alignment. However, we have also learnt that novel genomic approaches enable us to generate genome-scale datasets and that there is often incongruence between gene trees and the species tree (see this [review](https://www.nature.com/articles/s41576-023-00620-x) for further details). Incomplete Lineage Sorting (ILS) and interspecific hybridization are two evolutionary processes that are frequently investigated as possible explanations for widespread discordance. In the [tutorial on ILS](https://github.com/MozesBlom/tutorials/tree/main/2025_PU_Phylo_Eco_Evol/ILS), we used msprime to simulate different evolutionary scenarios and demonstrated how the interaction between divergence time and effective population size determines the extent of ILS, while interspecific gene flow can lead to additional incongruence between gene trees and the species tree.

The aim of the present tutorial is to bring this all together and demonstrate how we can use large-scale datasets to infer phylogenies for genomic windows spread across a chromosome. These individual window trees can then be used to infer a summary-coalescent species tree, and we will use gene and site concordance factors to investigate how well specific branches are supported across all trees. Finally, by doing so, I also want to demonstrate how you can upload your own datasets to Google Colab and how we can use the command line for phylogenetic inference of larger datasets.

### 1.3 Practical overview

This computer practical is styled in a Jupyter Notebook (NB) format (see Github [README](https://github.com/MozesBlom/tutorials/tree/main) for more details). In short, Jupyter NBs contain either text cells (such as the current cell) in Markdown format or code cells (frequently) in Python3 format. The aim of this practical is not to provide an exhaustive introduction into Python/Markdown/Jupyter, but to provide an introduction into phylogenomic inference using Maximum Likelihood. Therefore, all code to run this practical is already in place. In addition, there are several questions to be answered and you can double-click on the corresponding cells to enter your answer. Alternatively, you can just write them down for yourself.

This tutorial includes a subset of data from [Blom et al. (2025)](https://royalsocietypublishing.org/doi/10.1098/rsbl.2024.0611). It contains all the commands to successfully complete this tutorial and some additional examples that may be of use when analyzing your own datasets for the project work of this course.

### 1.4 Requirements

To run this practical, the following Python modules or software packages are needed:
- Python 3+
  - [biopython](https://biopython.org/)
- [condacolab](https://github.com/conda-incubator/condacolab)
- [iqtree3](https://iqtree.github.io/) (installed using conda and the [bioconda repository](https://anaconda.org/bioconda/iqtree), v3.0.1)
- [ASTRAL](https://github.com/smirarab/ASTRAL) (installed using conda and the [bioconda repository](https://anaconda.org/bioconda/astral-tree), v5.7.8)

## 2. Getting Started

### 2.1 Install the Conda package manager, IQtree3 and ASTRAL on your Google Colab instance

The software and Python modules mentioned under *1.4 Requirements* first need to be installed and imported before proceeding with the rest of the NB. If running on a Google Colab instance (see [Github](https://github.com/MozesBlom/tutorials/tree/main/2023_Phy_Eco_Evol) for further details), software and modules always need to be installed: each time you start a new Google Colab instance, you are basically starting a new computer for the very first time! For our installation of **iqtree3** and **ASTRAL**, we will make use of a well-known package manager called Conda. [Conda](https://anaconda.org/anaconda/conda) has been developed to enable the use of distinct software environments on the same computer and it also makes it very easy to install and deploy software packages. Here we are particularly interested in the latter functionality and use it to install **iqtree3** and **ASTRAL** on our Google Colab instance.

If running on your personal desktop, in JupyterLab for example, then this may not be needed if a Conda environment with the correct software and modules is already installed and selected. NOTE, if you go the latter route then make sure that JupyterLab itself and all dependencies are installed in the relevant Conda environment. Otherwise, the environment will not show up in the JupyterLab Desktop environment list and cannot be selected. In all other cases, first proceed with installing the following software tools:

In [None]:
# Install the biopython module, which we will later use for plotting the phylogeny
!pip install biopython

In [None]:
# Double check that it downloaded and was imported as expected
import Bio

In [None]:
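# Install conda using the condacolab library
# NOTE: condacolab.install() restarts the Colab kernel once conda is installed. If the conda commands below did not run, execute them again in a new cell.
!pip install -q condacolab
import condacolab
condacolab.install()

# Install iqtree3 from the bioconda repository (-y answers 'yes' to the confirmation prompt)
!conda install -y bioconda::iqtree

# Install ASTRAL from the bioconda repository
!conda install -y bioconda::astral-tree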

What have we done? First, note the difference between lines that start with an exclamation mark (!) and lines that do not. Lines starting with an exclamation mark are *command line statements*: they are passed to the shell of the Colab instance, much like typing a command in a terminal to install software on your own computer. In this case we used pip to download and install biopython and condacolab.

Lines without an exclamation mark are in Python syntax and are executed within the Python environment. In the third code cell, we imported condacolab into the Python environment and then used the condacolab function *'condacolab.install()'* to install conda on our Colab instance.

Once that is done, we have successfully installed conda on our Google Colab instance and we can use conda to install the latest version of iqtree3 and ASTRAL from the bioconda repository! Have a look for yourself:

In [None]:
!iqtree3 --help

**If everything was installed as expected you should now be able to see a list of all the iqtree3 options, as if the program was installed on your own computer!**

![iqtree3_installed.png](img/iqtree3_installed.png)

In [None]:
!astral --help

**If everything was installed as expected you should now be able to see a list of all the ASTRAL options, as if the program was installed on your own computer!**

![astral_installed.png](img/astral_installed.png)
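
The same check can also be scripted. The following is a minimal sketch, assuming that the conda installation placed executables named `iqtree3` and `astral` on the PATH (the names used in the cells above); the helper `check_tools` is a hypothetical name introduced here for illustration.

```python
import shutil

def check_tools(tools=("iqtree3", "astral")):
    """Return a dict mapping each executable name to True if it is found on the PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

# Report for each tool whether the shell would be able to find it
for tool, found in check_tools().items():
    print(f"{tool}: {'found' if found else 'NOT found'}")
```

If a tool is reported as not found, re-run the corresponding conda install command before continuing.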

## 3. Input Data

For this tutorial, we will use a subset of a genomic dataset that was originally devised to assess species diversity and relationships among New Guinean Jewel babblers.

**Chestnut-backed Jewel-babbler *(Ptilorrhoa castanonota)***

![chestnut_jewel_babbler.png](img/chestnut_jewel_babbler.png)

The endemic jewel-babblers of New Guinea (genus: Ptilorrhoa) comprise 17 named taxa divided into four recognized species. Together with quail-thrushes (genus: Cinclosoma) they represent one family (Cinclosomatidae) out of 31 Corvides families within passerine birds. The taxonomic division is based on plumage patterns and geographic distributions, which span the lowlands (Ptilorrhoa caerulescens (0−300 m.a.s.l.) and P. geislerorum (0−1200 m.a.s.l.)), lower montane elevations (P. castanonota (900–1450 m.a.s.l.)) and the highlands (P. leucosticta (1750–2400 m.a.s.l.)). Thus, Ptilorrhoa species represent various examples of geographical and elevational displacement across New Guinea. However, much remained unclear regarding the true species diversity and the phylogenetic relationships between the various (sub-)species. To address this issue, [Blom et al. 2025](https://royalsocietypublishing.org/doi/10.1098/rsbl.2024.0611) assembled a draft reference genome for a closely related outgroup and generated whole-genome resequencing data for a large number of representatives across all named species and sub-species, with a geographic spread that matches the known distribution of each lineage.

The average avian genome size is roughly 1.3 Gb and it would be computationally too challenging to replicate the phylogenetic study in its entirety. To reduce computational complexity, [Blom et al. 2025](https://royalsocietypublishing.org/doi/10.1098/rsbl.2024.0611) did not infer phylogenies based on the entire genome but instead sampled and filtered genomic 'windows' that were each 10 kb in size and at least 100 kb apart. This resulted in **8097** autosomal and **614** Z-linked genomic 'windows'. Every genomic window is effectively a chunk of the genome and can include a mix of exons, introns, regulatory elements etc. Nonetheless, across large numbers of loci and relatively large window sizes this should not affect our phylogenetic estimate (topology in particular) too much. Here, we analyse a small subset of this dataset, namely 50 autosomal window alignments randomly sampled from Chromosome 1. For each of these genomic windows, we will infer a ML phylogeny and we will then aggregate these window trees to infer a summary-coalescent species tree. Finally, we will introduce a new proxy for assessing confidence in each branch.
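
To make the window layout concrete, here is a minimal sketch of how evenly spaced window coordinates could be generated. It is an illustration only and not the pipeline used in the paper: the function name `window_coordinates`, the regular grid and the interpretation of "apart" as the gap between the end of one window and the start of the next are all assumptions, and the filtering step is not represented.

```python
def window_coordinates(chrom_length, window_size=10_000, spacing=100_000):
    """Return (start, end) tuples of windows of `window_size` bp,
    with `spacing` bp between the end of one window and the start of the next."""
    windows = []
    start = 0
    while start + window_size <= chrom_length:
        windows.append((start, start + window_size))
        start += window_size + spacing
    return windows

# Toy example on a hypothetical 250 kb chromosome
for start, end in window_coordinates(250_000):
    print(start, end)
```

The alignment filenames used later in this tutorial (e.g. `chr1_RagTag_92010000_92020000_...`) carry such start and end coordinates in their names.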

### 3.1 Upload data to Google Colab instance

**Data can be uploaded to Google Colab in various ways.** For the sake of the present tutorial, just choose one of the options below. Knowing both will be useful if you want to analyse your own data, for your course project for example.

#### 1. Direct download from a website link. I have uploaded a compressed folder with all alignments to this Github repository:

In [None]:
# Direct download from the GitHub repository. Note the /raw/ in the URL, which returns the file itself rather than the GitHub web page
!wget -cq https://github.com/MozesBlom/tutorials/raw/main/2025_PU_Phylo_Eco_Evol/GTST/data/chr1_filt_random.tar.gz

#### 2. Download the data to your Google Drive and link your Drive to Google Colab. You can download the data from the same [link](https://github.com/MozesBlom/tutorials/tree/main/2025_PU_Phylo_Eco_Evol/GTST/data/chr1_filt_random.tar.gz) as used above (the download option is under the three dots at the top-right of the webpage) and then store it in your Google Drive folder.

In [None]:
# Once you've uploaded the data to Google Drive, link your Google Drive to the Google Colab instance. This command mounts your Drive at /content/gdrive, so that the top-level directory of your Drive becomes accessible as a folder on the Colab instance.
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
# To highlight that you can now access your Google Drive folder
!ls "/content/gdrive/"

In [None]:
# Now copy the data file to the main directory on your Google Colab instance. Your Drive contents are found under /content/gdrive/MyDrive, so if you stored the file in the Drive folder work/GTST_data, then use:
!cp /content/gdrive/MyDrive/work/GTST_data/chr1_filt_random.tar.gz "/content/"
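
If you are unsure where the file ended up in your Drive, you can search for it from Python. This is a minimal sketch, assuming the Drive was mounted at `/content/gdrive` as in the cell above; `find_file` is a hypothetical helper introduced here, and searching a large Drive this way can be slow.

```python
from pathlib import Path

def find_file(root, filename):
    """Return a sorted list of all paths below `root` whose name equals `filename`."""
    root = Path(root)
    if not root.is_dir():
        # The folder does not exist, e.g. because the Drive has not been mounted
        return []
    return sorted(root.rglob(filename))

# Print every location of the data file on the mounted Drive (prints nothing if none is found)
for hit in find_file("/content/gdrive", "chr1_filt_random.tar.gz"):
    print(hit)
```

The printed path can then be used as the source in the `cp` command.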

Finally, regardless of how you uploaded the data to your Google Colab instance, you will now need to unpack the .tar.gz file:

In [None]:
!tar -zxvf /content/chr1_filt_random.tar.gz

In [None]:
# Let's have a look at the contents of the unzipped folder:
!ls /content/chr1_filt_random
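
Before starting a long analysis it can be useful to check what an alignment looks like. The sketch below uses only the Python standard library and assumes the alignments are plain FASTA files; `fasta_summary` is a hypothetical helper name, and the example path is derived from the filenames used later in this tutorial.

```python
import os

def fasta_summary(path):
    """Return (number of sequences, set of sequence lengths) for a FASTA file."""
    lengths = []
    current = None  # length of the sequence currently being read
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header: store the previous sequence length, if any
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return len(lengths), set(lengths)

example = "/content/chr1_filt_random/chr1_RagTag_92010000_92020000_filt_indivs_aln.fa"
if os.path.exists(example):
    n_seqs, seq_lengths = fasta_summary(example)
    print("Sequences:", n_seqs, "Lengths:", seq_lengths)
```

In a proper alignment all sequences have the same length, so the set of lengths should contain a single value.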

## 4. ML Window Trees

**Before you do anything else, go to the next two code blocks and run both of them!** While they are running, you can go to the bathroom, grab a coffee and read the subsequent paragraphs. Together, they will probably run for at least 10 minutes.

In [None]:
# Import the os and subprocess python modules
import subprocess
import os

In [3]:
# Use Python to loop over all 50 alignments in the folder and infer an ML window tree for each of them with IQtree3.
# NOTE, this is likely to run for at least 10 minutes, which is why it was recommended above to get started on this right away.
# One line is printed for every finished alignment, so you can keep track of the number of ML window trees that have been inferred.
aln_dir = "/content/chr1_filt_random"
for fn in sorted(os.listdir(aln_dir)):
    # Skip anything in the folder that is not a FASTA alignment
    if not fn.endswith(".fa"):
        continue
    fn_path = os.path.join(aln_dir, fn)
    # The full filename is used as output prefix, so the results are written to e.g. <fn>.treefile
    result = subprocess.run(["iqtree3", "-s", fn_path, "-nt", "AUTO", "-m", "JC", "--prefix", fn], capture_output=True, text=True, check=False)
    # A return code of 0 means that IQtree3 finished without errors
    print(fn, "finished with return code", result.returncode)

So what did we do here? The chr1_filt_random folder contains 50 filtered alignments, each roughly 10 kb in length. We created a *for* loop that iterates over all alignments in the folder, builds the exact location of each file (fn_path) and then uses *subprocess.run* to execute iqtree3 from within Python, with the alignment filename as output prefix (so that output files are named e.g. `<alignment>.fa.treefile`). Note that, to limit our waiting time here, we don't do any model selection and only use the simplest substitution model (Jukes-Cantor). This is most likely incorrect for these alignments, but it would otherwise take around an hour or so to infer the ML tree for each of the alignments.
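
To give an impression of what the Jukes-Cantor model assumes: all four bases are equally frequent and every substitution is equally likely. Under these assumptions, the observed proportion of differing sites *p* between two sequences can be corrected for multiple substitutions at the same site with d = -3/4 ln(1 - 4p/3). The sketch below only illustrates this formula; it is not what IQtree3 runs internally, and `jc69_distance` is a name introduced here.

```python
import math

def jc69_distance(p):
    """Jukes-Cantor corrected distance (substitutions per site)
    for an observed proportion of differing sites `p`."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the interval [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)

# The corrected distance is always at least as large as the observed proportion
for p in (0.01, 0.1, 0.3):
    print(p, round(jc69_distance(p), 4))
```

The correction grows as sequences become more divergent, because more substitutions are hidden by repeated changes at the same site.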

**Once ALL 50 window trees have been inferred**, you can proceed with the following code blocks

In [None]:
# Let's have a look at the folder
!ls

The folder now contains a huge number of files: 300 IQtree3 output files to be exact, because IQtree3 produced 6 files for every alignment. The ones that we are most interested in are:
1. .log: Run output
2. .iqtree: IQ-TREE report
3. .treefile: **Maximum-likelihood tree**

In [None]:
!cat chr1_RagTag_92010000_92020000_filt_indivs_aln.fa.log

In [None]:
!cat chr1_RagTag_92010000_92020000_filt_indivs_aln.fa.iqtree

In [None]:
!cat chr1_RagTag_92010000_92020000_filt_indivs_aln.fa.treefile

**We now need to store all window trees in a single .trees file. It is worth double checking that 50 phylogenies were inferred, for example by counting the .treefile files.**

In [None]:
!cat *.treefile > /content/chr1_filt_random_autosomal_windows.trees
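
The count can be verified with a few lines of Python. This is a minimal sketch, assuming that every tree in the combined file sits on its own line and ends with a semicolon (as is the case for Newick trees written by IQtree3); `count_trees` is a hypothetical helper name.

```python
import glob
import os

def count_trees(trees_path):
    """Count Newick trees in a file, assuming one tree per line terminated by ';'."""
    with open(trees_path) as handle:
        return sum(1 for line in handle if line.strip().endswith(";"))

trees_file = "/content/chr1_filt_random_autosomal_windows.trees"
if os.path.exists(trees_file):
    # Both numbers should be 50 if all window trees were inferred
    print("Number of .treefile files:", len(glob.glob("*.treefile")))
    print("Trees in combined file:", count_trees(trees_file))
```

If either number is lower than 50, check the corresponding .log files for alignments where IQtree3 did not finish.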

## 5. Inferring a Summary-Coalescent Species Tree with ASTRAL and concordance factor annotation

### 5.1 ASTRAL

Running a full multi-species coalescent analysis is only feasible in a Bayesian framework. In such an analysis, the species tree and the gene trees are co-estimated, and each comes with its own set of parameters that need to be estimated. This is frequently computationally intractable and for large-scale datasets we therefore often resort to a two-step approach where we first estimate gene trees and then infer the species tree based on the frequency distribution of gene trees. This means of course that during this second step we assume that the inferred gene trees are correct (and you now know that inferring phylogenies is not a trivial exercise!). Here we will use such a two-step approach and use the 50 gene trees that we have inferred above to estimate an ASTRAL summary-coalescent phylogeny. I will not dive much deeper into the theory behind summary-coalescent approaches, but see this [review](https://www.nature.com/articles/s41576-020-0233-0) if you would like to know more! The main aim here is to demonstrate how to infer a summary-coalescent phylogeny in practice and how we can annotate uncertainty via gene and site concordance factors.

In [None]:
# Above we have gathered all window trees in a single file
# Run Astral on the aggregate of window trees to infer the summary coalescent species tree
!astral -i /content/chr1_filt_random_autosomal_windows.trees -o /content/chr1_filt_random_autosomal_windows_ASTRAL.tree

In [None]:
## Import the Phylo class of biopython and visualise the phylogeny
from Bio import Phylo
import matplotlib.pyplot as plt
tree = Phylo.read("/content/chr1_filt_random_autosomal_windows_ASTRAL.tree", "newick")
fig, ax = plt.subplots(figsize=(10, 6))  # Adjust the size as needed
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
Phylo.draw(tree, axes=ax)
plt.show()

ASTRAL has taken the underlying distribution of gene trees and summarised it into a species tree, using an approach that is statistically consistent under the multi-species coalescent model (the same model that underlies full-coalescent Bayesian species tree approaches). You may have noticed that ASTRAL ran incredibly fast. This is because it only summarises the gene trees that it is given and does not do any phylogenetic inference of the gene trees themselves (e.g. likelihood calculations). However, the support annotation is a bit confusing and what we really would like to know is how many of the underlying window trees support any given relationship in the species tree. We can do this with gene concordance factors (gCF) in IQtree3. The gCF of a branch is the percentage of decisive gene/window trees that contain that branch. Here is some further information with regards to gCF: [link](https://iqtree.github.io/doc/Concordance-Factor#putting-it-all-together)
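
The idea behind a gene concordance factor can be illustrated with a toy example. The sketch below is a deliberate simplification and not the IQtree3 implementation: every gene tree is represented as a set of rooted clades (each clade a frozenset of tip names), all trees are assumed to be decisive for the branch, and the names `gene_concordance_factor` and the tips A to D are hypothetical.

```python
def gene_concordance_factor(target_clade, gene_trees):
    """Percentage of gene trees that contain `target_clade`.

    Each gene tree is given as a set of clades (frozensets of tip names).
    """
    if not gene_trees:
        raise ValueError("at least one gene tree is required")
    target = frozenset(target_clade)
    n_support = sum(1 for clades in gene_trees if target in clades)
    return 100.0 * n_support / len(gene_trees)

# Four toy gene trees; three of them group A and B together
gene_trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"B", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "D"})},
]
print(gene_concordance_factor({"A", "B"}, gene_trees))  # 3 of 4 trees: 75.0
```

IQtree3 performs the equivalent bookkeeping on the branches of the species tree and additionally reports how often the alternative resolutions of each branch are found.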

In [None]:
## Use IQtree3 to calculate the gene concordance factors on the Summary-Coalescent phylogeny
!iqtree3 -te /content/chr1_filt_random_autosomal_windows_ASTRAL.tree --gcf /content/chr1_filt_random_autosomal_windows.trees --prefix chr1_filt_random_autosomal_windows_ASTRAL_gCF.tree

In [None]:
## Let's see what IQtree3 generated:
!ls chr1_filt_random_autosomal_windows_ASTRAL_gCF.tree.cf*

In [None]:
!cat chr1_filt_random_autosomal_windows_ASTRAL_gCF.tree.cf.stat

In [None]:
## Let's download the annotated tree (.cf.tree) from the Google colab instance
from google.colab import files
files.download('/content/chr1_filt_random_autosomal_windows_ASTRAL_gCF.tree.cf.tree')

What did we do here? We took the file with all the 50 window trees that we inferred and asked IQtree3 to summarise how frequently we see the same relationships supported as in the ASTRAL tree.
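
The .cf.stat table can also be read into Python for further exploration. This is a minimal sketch using only the standard library. It assumes that the file is tab-separated, that lines starting with `#` are comments and that the first remaining line is the header; the column names `ID`, `gCF` and `Length` used in the example are assumptions about the IQtree3 output and should be checked against your own file. `read_cf_stat` is a hypothetical helper name.

```python
import csv
import os

def read_cf_stat(path):
    """Read a tab-separated table with '#' comment lines into a list of dicts."""
    with open(path, newline="") as handle:
        # Drop comment lines and empty lines before handing the rest to the csv reader
        rows = (line for line in handle if line.strip() and not line.startswith("#"))
        return list(csv.DictReader(rows, delimiter="\t"))

stat_file = "/content/chr1_filt_random_autosomal_windows_ASTRAL_gCF.tree.cf.stat"
if os.path.exists(stat_file):
    for row in read_cf_stat(stat_file):
        # .get() returns None if a column name differs from the assumed one
        print(row.get("ID"), row.get("gCF"), row.get("Length"))
```

Having the branch identifiers, concordance factors and branch lengths side by side is a convenient starting point for Question a) below.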

**Question a):** How do the gene-concordance factors correlate with the internal branch lengths of the summary-coalescent species tree?

**Question b):** How does your tree compare to the phylogeny in the [paper](https://royalsocietypublishing.org/doi/10.1098/rsbl.2024.0611#d1e705)?