# Phylogenetic Inference

<a id='setup'></a>

## 0. Setup

In [2]:
from qiime2 import Visualization

# location of this week's data and all the results produced by this notebook 
# - this should be a path relative to your working directory
data_dir = '../data'

#### 1.1.1 Sequence alignment
Required input: `rep-seqs-filtered.qza` (located in `data_dir`). We start by aligning the representative sequences with MAFFT.

In [3]:
! qiime alignment mafft \
    --i-sequences $data_dir/rep-seqs-filtered.qza \
    --o-alignment $data_dir/aligned-rep-seqs.qza

Saved FeatureData[AlignedSequence] to: ../data/aligned-rep-seqs.qza

#### 1.1.2 Alignment masking

Some authors have suggested that masking (removing) ambiguously aligned regions of the alignment, i.e. regions that are phylogenetically uninformative, for example because of alignment errors, can improve the quality of the reconstructed phylogeny.
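To make the idea of masking concrete, here is a minimal sketch in plain Python. It is not how `qiime alignment mask` is implemented: the function `mask_columns` and its thresholds `max_gap_fraction` and `min_majority_fraction` are hypothetical names chosen for illustration, and the conservation measure (share of the most common residue in a column) is a simple stand-in. The sketch drops columns that are mostly gaps or that show little agreement between sequences.

```python
from collections import Counter


def mask_columns(alignment, max_gap_fraction=0.5, min_majority_fraction=0.4):
    """Remove noisy columns from a toy alignment.

    alignment: dict mapping sequence id -> aligned sequence (equal lengths).
    Both thresholds are illustrative assumptions, not QIIME 2 parameters.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for i in range(length):
        column = [s[i] for s in seqs]
        residues = [c for c in column if c != "-"]
        gap_fraction = 1 - len(residues) / len(column)
        # Drop columns that consist mostly (or entirely) of gaps.
        if not residues or gap_fraction > max_gap_fraction:
            continue
        # Drop columns where no single residue is reasonably common.
        majority = Counter(residues).most_common(1)[0][1] / len(column)
        if majority >= min_majority_fraction:
            keep.append(i)

    return {name: "".join(s[i] for i in keep) for name, s in alignment.items()}


# Toy alignment: column 4 is mostly gaps, column 6 has no agreement at all.
toy = {
    "seq1": "ACG-TA",
    "seq2": "ACG-TC",
    "seq3": "ACGATG",
    "seq4": "AC--TT",
}
masked = mask_columns(toy)
print(masked)
```

In this toy case the fourth and sixth columns are removed and the remaining four columns are kept for every sequence.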

In [4]:
! qiime alignment mask \
    --i-alignment $data_dir/aligned-rep-seqs.qza \
    --o-masked-alignment $data_dir/masked-aligned-rep-seqs.qza

Saved FeatureData[AlignedSequence] to: ../data/masked-aligned-rep-seqs.qza

#### 1.1.3 Tree construction

Finally, we can use the masked alignment to construct our phylogenetic tree. There are many methods for this, e.g. FastTree, RAxML or IQ-TREE (all of which are supported in QIIME 2). Here, we will use FastTree, mainly because of its speed. FastTree produces an unrooted tree, so in a second step we place the root at the midpoint of the longest tip-to-tip distance in the unrooted tree.
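The midpoint rule can be illustrated on a tiny hand-made tree. The sketch below is only an illustration under stated assumptions, not the code behind `qiime phylogeny midpoint-root`: the tree is given as a list of `(node, node, branch_length)` edges, and `tip_paths` and `find_midpoint` are hypothetical helper names. It finds the two tips that are furthest apart and reports the branch on which the root would be placed.

```python
def tip_paths(adj, start):
    """Return {node: (distance, path)} from `start` to every node of the tree."""
    seen = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = seen[node]
        for nbr, length in adj[node].items():
            if nbr not in seen:
                seen[nbr] = (dist + length, path + [nbr])
                stack.append(nbr)
    return seen


def find_midpoint(edges):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (node_a, node_b, offset_from_a, longest_distance): the root
    would sit on branch a-b, `offset_from_a` away from node a.
    """
    adj = {}
    for a, b, length in edges:
        adj.setdefault(a, {})[b] = length
        adj.setdefault(b, {})[a] = length
    tips = [n for n, nbrs in adj.items() if len(nbrs) == 1]

    best_dist, best_path = -1.0, None
    for tip in tips:
        for other, (dist, path) in tip_paths(adj, tip).items():
            if other in tips and dist > best_dist:
                best_dist, best_path = dist, path

    half = best_dist / 2
    walked = 0.0
    for a, b in zip(best_path, best_path[1:]):
        length = adj[a][b]
        if walked + length >= half:
            return a, b, half - walked, best_dist
        walked += length


# Toy unrooted tree with four tips (A-D) and two internal nodes (n1, n2).
toy_edges = [
    ("A", "n1", 1.0),
    ("B", "n1", 2.0),
    ("n1", "n2", 1.0),
    ("C", "n2", 5.0),
    ("D", "n2", 1.0),
]
result = find_midpoint(toy_edges)
print(result)
```

Here the longest path runs between tips B and C (total length 8.0), so the root would be placed on the branch between n2 and C, 1.0 away from n2, which leaves both B and C at distance 4.0 from the root.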

In [5]:
! qiime phylogeny fasttree \
    --i-alignment $data_dir/masked-aligned-rep-seqs.qza \
    --o-tree $data_dir/fasttree-tree.qza

! qiime phylogeny midpoint-root \
    --i-tree $data_dir/fasttree-tree.qza \
    --o-rooted-tree $data_dir/fasttree-tree-rooted.qza

Saved Phylogeny[Unrooted] to: ../data/fasttree-tree.qza
Saved Phylogeny[Rooted] to: ../data/fasttree-tree-rooted.qza

#### 1.1.4 Tree visualization

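**Note:** This step has to be run on Euler, where the `empress` plugin is installed. In an environment without the plugin, the command fails with the error shown in the output of the next cell.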

In [11]:
! qiime empress tree-plot \
    --i-tree $data_dir/fasttree-tree-rooted.qza \
    --m-feature-metadata-file $data_dir/taxonomy.qza \
    --o-visualization $data_dir/fasttree-tree-rooted.qzv

[31m[1mError: QIIME 2 has no plugin/command named 'empress'.[0m


Open the qzv files on [view.qiime2.org](https://view.qiime2.org).

Now, for comparison, you can try [iTOL](https://itol.embl.de/upload.cgi).

After opening the web page, click _Choose File_ and select the tree artifact we generated above. Click _Upload_: after a few seconds you should see the tree. In order to label all the nodes with corresponding taxonomies, find the _taxonomy.qza_ artifact and drag-and-drop it onto the tree: this will add the labels (don't worry if a warning about a couple of missing features appears: these are the taxa we filtered out earlier). If you want, you can also add the alignment itself to the tree! Just drag-and-drop it onto the tree again.

You may find it easier to navigate the tree in its "rectangular" representation: to change the view, select the _Rectangular_ option in the _Mode_ section of the _Basic_ tab.

#### 1.1.5 Bootstrapping

Bootstrapping is a statistical approach to assessing the robustness of the branch splits in a tree. In simple terms, the tree is reconstructed _n_ times from resampled versions of the alignment (columns drawn with replacement), and we count how often a given split appears among the replicate trees. Bootstrapping is a lengthy process, but if you are interested you can see below how it can be done in QIIME 2. The tree generated with this method carries an additional set of _bootstrap values_ that you can then display on the tree (for example in the iTOL browser).

**Note:** This step takes >30 min to run.

In [None]:
! qiime phylogeny raxml-rapid-bootstrap \
    --i-alignment $data_dir/masked-aligned-rep-seqs.qza \
    --p-seed 1723 \
    --p-rapid-bootstrap-seed 9384 \
    --p-bootstrap-replicates 100 \
    --p-substitution-model GTRCAT \
    --p-n-threads 3 \
    --o-tree $data_dir/raxml-cat-bootstrap-tree.qza
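The resampling idea behind bootstrapping can be sketched in a few lines of plain Python. This is an illustration only and makes no claim about RAxML internals: `resample_columns` and `support_percent` are hypothetical helpers, the alignment is a toy example, and the list of replicate splits is made up by hand to show how a support value is counted (in a real analysis each replicate alignment would be passed to a tree-building program).

```python
import random


def resample_columns(alignment, rng):
    """Build one bootstrap replicate by drawing columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    replicate = {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }
    return replicate, columns


def support_percent(replicate_splits, split):
    """Percentage of replicate trees that contain the given split."""
    hits = sum(1 for splits in replicate_splits if split in splits)
    return 100.0 * hits / len(replicate_splits)


# Toy alignment (made up for illustration).
toy_alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "ATGTAG", "D": "TTGCAG"}
rng = random.Random(1723)
replicate, columns = resample_columns(toy_alignment, rng)
print(columns)
print(replicate)

# Hand-written stand-in for the splits found in four replicate trees.
toy_replicate_splits = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"A", "C"})},
    {frozenset({"A", "B"})},
]
ab_support = support_percent(toy_replicate_splits, frozenset({"A", "B"}))
print(ab_support)
```

The replicate has the same sequences and the same length as the original alignment, but some columns appear several times and others not at all. In the hand-written example, the split grouping A with B occurs in three of four replicates, giving a support of 75%.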

Now visualize the new tree using your method of choice. Remember to root the tree first, as the `raxml-rapid-bootstrap` action produces an unrooted tree.

In [None]:
! qiime phylogeny midpoint-root \
    --i-tree $data_dir/raxml-cat-bootstrap-tree.qza \
    --o-rooted-tree $data_dir/raxml-cat-bootstrap-tree-rooted.qza

! qiime empress tree-plot \
    --i-tree $data_dir/raxml-cat-bootstrap-tree-rooted.qza \
    --m-feature-metadata-file $data_dir/taxonomy.qza \
    --o-visualization $data_dir/raxml-cat-bootstrap-tree-rooted.qzv

Open the qzv files on [view.qiime2.org](https://view.qiime2.org).

<a id='fragm_insert'></a>

### 1.2 Fragment insertion

An alternative to _de novo_ tree reconstruction is **fragment insertion**. Instead of constructing the entire tree from scratch, this method takes an existing reference tree and inserts our sequences into it.

As our reference, we will use a tree that was built from the Greengenes 13_8 database at 99% identity.

In [None]:
! wget -nv -O $data_dir/sepp-refs-gg-13-8.qza https://data.qiime2.org/2021.4/common/sepp-refs-gg-13-8.qza

**Note:** This is a resource-intensive command that again requires a large amount of memory and may take a long time to run (>30 min). Do not set the number of threads below to more than 2, as more threads also increase memory demand and may cause your workspace to crash.

In [None]:
! qiime fragment-insertion sepp \
    --i-representative-sequences $data_dir/rep-seqs-filtered.qza \
    --i-reference-database $data_dir/sepp-refs-gg-13-8.qza \
    --p-threads 2 \
    --o-tree $data_dir/sepp-tree.qza \
    --o-placements $data_dir/sepp-tree-placements.qza

Finally, you can proceed to tree visualization with your method of choice. Keep in mind that this tree is already rooted, so there is no need to run the `phylogeny midpoint-root` action.

### 1.3 Checkpoint

Look at the trees obtained using the _de novo_ and fragment insertion approaches. What is the main difference between them?

- The main difference is the size (expressed as the number of branches) of each tree. The tree obtained _via_ fragment insertion is considerably larger, as it comprises all the reference taxa present in the Greengenes database (our sequences are only _inserted_ into that tree), while the _de novo_ tree consists only of the taxa that we identified in our samples.
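One quick way to compare tree sizes is to count the tips in each tree's Newick string. The sketch below assumes the trees are available as Newick text (for example after exporting the artifacts with `qiime tools export`); `count_tips` is a hypothetical helper, and both Newick strings are toy examples rather than the real trees. It also assumes that tip labels contain no quoted parentheses or commas.

```python
def count_tips(newick):
    """Count tips in a Newick string.

    A tip starts right after '(' or ',' whenever the next character
    does not open a new clade. Assumes labels are unquoted.
    """
    text = "".join(newick.split())  # remove all whitespace
    tips = 0
    for prev, cur in zip(text, text[1:]):
        if prev in "(," and cur != "(":
            tips += 1
    return tips


# Toy stand-in for a de novo tree: only our own sequences (A-D).
de_novo_like = "((A:1,B:2):1,(C:5,D:1):1);"
# Toy stand-in for an insertion tree: our sequences plus reference tips R1-R5.
insertion_like = (
    "((A:1,(R1:1,R2:1):1):1,((C:5,R3:2):1,(D:1,(R4:1,R5:1):1):1):1);"
)

n_de_novo = count_tips(de_novo_like)
n_insertion = count_tips(insertion_like)
print(n_de_novo, n_insertion)
```

In the toy example the insertion-style tree has twice as many tips as the de novo-style tree; with the real reference tree the difference is far larger.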