# Molecular Omics Workshop - Phylogenetic Placement

This notebook walks through a phylogenetic placement workflow: (1) aligning the query sequences (QS) to the reference sequences (RS) using PaPaRa, (2) placing the QS on the reference tree (RT) using RAxML-EPA, and (3) visualizing the tree and computing some metrics to assess the placed QS.

## <font color="blue">How to Use This Notebook</font>

1. Open Jupyter Notebook with the command below and select this notebook.
>`jupyter-notebook`
2. To run the cells in this notebook, press Shift+Enter.
3. At any point in the workshop when running a command fails or takes too long, you can copy the necessary files from the finished folder to your data folder to be able to proceed to the next step.

## <font color="blue">References</font>

1. The data used in this demonstration were obtained from the study below. A description of the authors' methods is available in this GitHub repository: [https://github.com/lczech/dinoflagellate-paper](https://github.com/lczech/dinoflagellate-paper).

   <i>Gottschling, M., Czech, L., Mahé, F., Adl, S., & Dunthorn, M. (2021). The windblown: possible explanations for dinophyte DNA in forest soils. Journal of Eukaryotic Microbiology, 68(1), e12833.</i>


2. PaPaRa

   <i>Berger, S. A., & Stamatakis, A. (2011). Aligning short reads to reference alignments and trees. Bioinformatics, 27(15), 2068-2075.</i>

   <i>Berger, S. A., & Stamatakis, A. (2012). PaPaRa 2.0: a vectorized algorithm for probabilistic phylogeny-aware alignment extension. Heidelberg Institute for Theoretical Studies, 12.</i>


3. RAxML-EPA

   <i>Berger, S. A., Krompass, D., & Stamatakis, A. (2011). Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology, 60(3), 291-302.</i>


4. Gappa

   <i>Czech, L., Barbera, P., & Stamatakis, A. (2020). Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data. Bioinformatics, 36(10), 3263-3265.</i>

---
## <font color="blue">Table of Contents</font>
 * [**Step 1: Alignment of QS and RS using PaPaRa**](#Step-1:-Alignment-of-QS-and-RS-using-PaPaRa)  
 * [**Step 2: Phylogenetic placement using RAxML-EPA**](#Step-2:-Phylogenetic-placement-using-RAxML-EPA)  
 * [**Step 3: Check some placement statistics**](#Step-3:-Check-some-placement-statistics)
     * [Calculate EDPL](#Calculate-EDPL)
     * [Tally LWR](#Tally-LWR)
 * [**Step 4: Visualize tree using Gappa - without multiplicities**](#Step-4:-Visualize-tree-using-Gappa---without-multiplicities)
 * [**Step 5: Revisualize - with multiplicities**](#Step-5:-Revisualize---with-multiplicities)
     * [Add multiplicities to tree](#Add-multiplicities-to-tree)
     * [Visualize tree with multiplicities](#Visualize-tree-with-multiplicities)
 * [**Step 6: Revisualize - max LWRs only**](#Step-6:-Revisualize---max-LWRs-only)
 
---

# <font color = 'gray'>Step 1: Alignment of QS and RS using PaPaRa</font>

The first step is to align the query sequences (QS) (`OTU_representatives_dinoflagellates.upper`) to the existing alignment of the reference sequences (RS) (`2_ref_aln.fas`). Plenty of tools can accomplish this task. However, `papara` is distinctive in that it also uses the information from the reference tree (RT) (`3_ref_tree.tre`) when aligning.

In [None]:
#Align QS to existing reference alignment using PaPaRa
!papara \
   -t "1_init_data/3_ref_tree.tre" \
   -s "1_init_data/2_ref_aln.fas" \
   -q "1_init_data/OTU_representatives_dinoflagellates.upper" \
   -n dino_aln

#Move output files to 2_alignment
!mv papara* 2_alignment

In [None]:
#Copy the finished output if run is taking too long
!cp ../phylo_placement_finished/2_alignment/* 2_alignment/

# <font color = 'gray'>Step 2: Phylogenetic placement using RAxML-EPA</font>

After aligning the QS and the RS, we are now ready to place the QS on the RT.

Briefly, `raxmlHPC` attempts to place each QS on every branch/edge of the RT and calculates a likelihood score for each placement. It then normalizes these scores into likelihood weight ratios (LWR; LWR = [likelihood of the placement on a branch] / [sum of the likelihoods over all branches]). The LWR can be interpreted as the probability of the QS being placed on a certain branch.
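
To make the normalization concrete, here is a minimal sketch of how LWRs could be derived from per-branch log-likelihoods. It is not RAxML's actual implementation; the function name `likelihood_weight_ratios` and the log-likelihood values are made up for illustration.

```python
import math

def likelihood_weight_ratios(log_likelihoods):
    """Convert per-branch log-likelihoods into LWRs that sum to 1."""
    # Subtract the maximum before exponentiating to avoid numerical underflow.
    max_ll = max(log_likelihoods.values())
    weights = {branch: math.exp(ll - max_ll)
               for branch, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {branch: w / total for branch, w in weights.items()}

# Hypothetical log-likelihoods of one QS placed on three branches
example = {"branch_1": -1000.0, "branch_2": -1001.0, "branch_3": -1005.0}
lwr = likelihood_weight_ratios(example)
for branch, value in lwr.items():
    print(f"{branch}: {value:.3f}")
```

The branch with the highest log-likelihood receives the largest LWR, and small differences in log-likelihood translate into large differences in LWR because of the exponentiation.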

---
<div>
<img src="jupyter_figures/epa_schematic.jpg" width="700"/>
</div>

---

*This may take a long time to finish. Copy the finished output instead.*

In [None]:
#Perform phylogenetic placement using RAxML
!raxmlHPC \
   -f v \
   -s "2_alignment/papara_alignment.dino_aln" \
   -t "1_init_data/3_ref_tree.tre" \
   -m GTRGAMMAX \
   -n dino_epa \
   --epa-accumulated-threshold=0.99 

#Move output files to 3_epa
!mv RAxML* 3_epa

In [None]:
#Copy the finished output instead
!cp ../phylo_placement_finished/3_epa/* 3_epa/
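
The placements are stored in a jplace file, which is plain JSON. As a rough sketch of how such a file could be inspected by hand, the code below extracts the highest-LWR placement of each query. The helper `top_placements` and the toy content are made up for illustration; the field names follow the jplace standard, but the exact field order in your own file should be read from its `fields` entry, as done here.

```python
import json

def top_placements(jplace_text):
    """Return {query name: (edge_num, LWR)} for the best placement of each query."""
    data = json.loads(jplace_text)
    fields = data["fields"]
    edge_idx = fields.index("edge_num")
    lwr_idx = fields.index("like_weight_ratio")
    best = {}
    for pquery in data["placements"]:
        top = max(pquery["p"], key=lambda row: row[lwr_idx])
        # Names are stored under "n" (plain names) or "nm" (name, multiplicity pairs).
        names = pquery.get("n") or [name for name, _ in pquery.get("nm", [])]
        for name in names:
            best[name] = (top[edge_idx], top[lwr_idx])
    return best

# Made-up jplace content with one query placed on two edges
toy_jplace = json.dumps({
    "tree": "((A:0.1{0},B:0.2{1}):0.05{2},C:0.3{3}){4};",
    "placements": [
        {"p": [[1, -1000.0, 0.8, 0.01, 0.02], [2, -1001.4, 0.2, 0.02, 0.03]],
         "n": ["query_1"]}
    ],
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "version": 3,
    "metadata": {"invocation": "toy example"},
})
print(top_placements(toy_jplace))
```

To try this on the real output, read the `.jplace` file in `3_epa` into a string and pass it to the function instead of `toy_jplace`.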

# <font color = 'gray'>Step 3: Check some placement statistics</font>

As a sanity check, we can look at some statistics on the placement of the QS onto the RT. In this demo, we will look at just two measures.

### Calculate EDPL

The expected distance between placement locations (EDPL) is a measure of the uncertainty of the QS placements. Alternatively, we can view it as how dispersed the QS placements are, weighted by the LWR of each placement. 

Consider the diagram below (source: [Matsen, 2010](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-538)). Although the hollow stars have plenty of possible placements, these are concentrated on a small section of the RT, so the EDPL value would be low. The opposite is true for the filled stars: their placements are scattered across the RT, so the EDPL value would be high.

---
<div>
<img src="jupyter_figures/edpl_diagram.png" width="300"/>
</div>

---
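
As an illustration of the idea, the sketch below computes an EDPL-like value for a single QS as the LWR-weighted sum of distances between all pairs of its placements. It is a simplified stand-in, not Gappa's implementation: the function `edpl`, the LWRs, and the distance matrices are made up, and tools may differ on details such as whether each pair is counted once or twice.

```python
def edpl(lwrs, distances):
    """Expected distance between placement locations for one QS.

    lwrs: list of LWR values, one per placement.
    distances: square matrix; distances[i][j] is the path length along the
               tree between placement i and placement j.
    """
    n = len(lwrs)
    # Sum over all ordered pairs; pairs with i == j contribute zero distance.
    return sum(lwrs[i] * lwrs[j] * distances[i][j]
               for i in range(n) for j in range(n))

# Made-up example: same LWRs, but different spread of placements on the tree
lwrs = [0.5, 0.3, 0.2]
close_together = [[0.0, 0.01, 0.02],
                  [0.01, 0.0, 0.01],
                  [0.02, 0.01, 0.0]]
far_apart = [[0.0, 0.8, 1.0],
             [0.8, 0.0, 0.6],
             [1.0, 0.6, 0.0]]
print("concentrated placements:", edpl(lwrs, close_together))
print("scattered placements:   ", edpl(lwrs, far_apart))
```

With identical LWRs, the scattered placements give a much larger value than the concentrated ones, which mirrors the difference between the filled and hollow stars.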

In [None]:
#Calculate EDPL value of QS
!gappa examine edpl \
   --jplace-path 3_epa/*.jplace \
   --out-dir "4_gappa_no_multips" \
   --file-prefix dino_ \
   --allow-file-overwriting

In [None]:
#Draw histogram of the EDPL values of the QS
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd

edpl_list=pd.read_csv("4_gappa_no_multips/dino_edpl_list.csv")

plt.hist(edpl_list["EDPL"], 20, edgecolor='black')
plt.title("EDPL Histogram")
plt.show()

### Tally LWR

We can also check the distribution of the LWRs of the QS placements. In the commands below, we plot a histogram of only the highest LWR of each QS.

In [None]:
#List LWR per QS; only top 5 highest LWRs are listed by default
!gappa examine lwr-list \
   --jplace-path 3_epa/*.jplace  \
   --out-dir "4_gappa_no_multips" \
   --file-prefix dino_ \
   --allow-file-overwriting 

In [None]:
#Draw histogram of the top LWR value of the QS
lwr_list=pd.read_csv("4_gappa_no_multips/dino_lwr-list.csv")

plt.hist(lwr_list["LWR.1"], 20, edgecolor='black')
plt.title("LWR.1 Histogram")
plt.show()
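
Besides the histogram, it can help to put a number on how many QS are placed confidently. The sketch below is one simple way to do that; the helper `fraction_at_least`, the cutoffs, and the toy values are assumptions chosen for illustration, not recommended thresholds.

```python
def fraction_at_least(values, threshold):
    """Return the fraction of values that are >= threshold."""
    values = list(values)
    if not values:
        raise ValueError("no values given")
    return sum(v >= threshold for v in values) / len(values)

# Made-up top-LWR values for five QS
toy_top_lwr = [0.99, 0.95, 0.60, 0.35, 0.91]
for cutoff in (0.5, 0.9):
    share = fraction_at_least(toy_top_lwr, cutoff)
    print(f"LWR.1 >= {cutoff}: {share:.0%} of QS")
```

With the table loaded above, the same function could be called as `fraction_at_least(lwr_list["LWR.1"], 0.9)`.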

# <font color = 'gray'>Step 4: Visualize tree using Gappa - without multiplicities</font>

Now, we can draw a heat tree to see the distribution of the QS placements across the RT. Note that this heat tree accounts for all the possible placements of each QS, not only the placement with the highest LWR. Also, we are only considering branches with a cumulative LWR of at least 0.5 (`--min-value 0.5`).

In [None]:
#Generate a heat tree from the results of RAxML EPA
!gappa examine heat-tree \
   --jplace-path 3_epa/*.jplace \
   --log-scaling \
   --write-svg-tree \
   --out-dir "4_gappa_no_multips" \
   --file-prefix no_multips_dino_heat_ \
   --allow-file-overwriting \
   --under-color "#cccccc" \
   --min-value 0.5 

In [None]:
import webbrowser

webbrowser.open("4_gappa_no_multips/no_multips_dino_heat_tree.svg", new=2)

This heat tree does not take into account the frequency/multiplicity of each QS in the sample. Hence, this visual focuses more on the presence/absence of QS and answers the question: <i>**who is present in the sample?**</i>

# <font color = 'gray'>Step 5: Revisualize - with multiplicities</font>

As mentioned, the visualization above assumes that each QS has a frequency/multiplicity of 1 in the sample. However, in reality, the QS in the sample occur at different multiplicities. To account for this, we can do the following steps.

### Add multiplicities to tree

In [None]:
#Create new jplace file which includes the multiplicities/abundances of the QS
!gappa edit multiplicity \
   --jplace-path 3_epa/*jplace \
   --multiplicity-file "5_gappa_w_multips/QS_multips.txt" \
   --file-prefix multips_ \
   --out-dir "5_gappa_w_multips" \
   --allow-file-overwriting \
   --verbose

### Visualize tree with multiplicities

In [None]:
#Create heat tree of new jplace file with multiplicities
!gappa examine heat-tree \
   --jplace-path 5_gappa_w_multips/*jplace  \
   --log-scaling \
   --write-svg-tree \
   --out-dir "5_gappa_w_multips" \
   --file-prefix multips_dino_heat_ \
   --allow-file-overwriting \
   --under-color "#cccccc" \
   --min-value 0.5

In [None]:
webbrowser.open("5_gappa_w_multips/multips_dino_heat_tree.svg", new=2)

This time, since we have added the multiplicity information to the jplace file (`multips_RAxML_portableTree.dino_epa.jplace`), the resulting visual now focuses more on: <i> **what is the distribution of the observed taxa?** </i>
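
The effect of multiplicities can be sketched as follows: each QS contributes its LWRs to the branches it is placed on, scaled by how often it occurs in the sample. This is a conceptual sketch, not Gappa's code; the function `branch_masses`, the OTU names, and the abundances are made up for illustration.

```python
from collections import defaultdict

def branch_masses(placements, multiplicities=None):
    """Sum LWR per branch, optionally weighting each QS by its multiplicity.

    placements: {QS name: [(branch, LWR), ...]}
    multiplicities: {QS name: count}; a QS not listed counts as 1.
    """
    multiplicities = multiplicities or {}
    masses = defaultdict(float)
    for name, rows in placements.items():
        weight = multiplicities.get(name, 1)
        for branch, lwr in rows:
            masses[branch] += weight * lwr
    return dict(masses)

# Made-up placements and abundances for two QS
placements = {
    "otu_a": [("branch_1", 0.75), ("branch_2", 0.25)],
    "otu_b": [("branch_3", 1.0)],
}
abundances = {"otu_a": 100, "otu_b": 2}
print("without multiplicities:", branch_masses(placements))
print("with multiplicities:   ", branch_masses(placements, abundances))
```

Without multiplicities, `branch_3` carries the most mass because `otu_b` is placed there with certainty. With multiplicities, the abundant `otu_a` dominates, so `branch_1` becomes the heaviest branch.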

# <font color = 'gray'>Step 6: Revisualize - max LWRs only</font>

Finally, we can also draw the heat tree using only the highest-LWR (most probable) placement of each QS.
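
Conceptually, this treats each QS as a single point on its best branch instead of spreading it over several branches. The sketch below illustrates that idea under the assumption that each QS then counts fully (mass 1) toward that branch; the function `point_mass_per_branch` and the toy placements are made up, and Gappa's own handling may differ in detail.

```python
def point_mass_per_branch(placements):
    """Assign each QS entirely (mass 1) to its highest-LWR branch."""
    masses = {}
    for name, rows in placements.items():
        best_branch, _ = max(rows, key=lambda row: row[1])
        masses[best_branch] = masses.get(best_branch, 0) + 1
    return masses

# Made-up placements: each QS has two candidate branches
placements = {
    "otu_a": [("branch_1", 0.6), ("branch_2", 0.4)],
    "otu_b": [("branch_2", 0.55), ("branch_3", 0.45)],
}
print(point_mass_per_branch(placements))
```

Here `branch_3` receives no mass at all, even though `otu_b` has a sizeable LWR there, which is the trade-off of looking at the best placements only.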

### Visualize tree: max LWRs, without multiplicities

In [None]:
#Create heat tree considering only the max LWR placement of each QS, using the jplace file produced by RAxML EPA
!gappa examine heat-tree \
   --jplace-path 3_epa/*jplace \
   --log-scaling \
   --write-svg-tree \
   --out-dir "6_gappa_max_lwr_only" \
   --file-prefix no_multips_dino_max_heat_ \
   --allow-file-overwriting \
   --under-color "#cccccc" \
   --min-value 0.5 \
   --point-mass

In [None]:
webbrowser.open("6_gappa_max_lwr_only/no_multips_dino_max_heat_tree.svg", new=2)

### Visualize tree: max LWRs, with multiplicities

In [None]:
#Create heat tree considering only the max LWR placement of each QS, using the jplace file with multiplicities
!gappa examine heat-tree \
   --jplace-path 5_gappa_w_multips/*jplace \
   --log-scaling \
   --write-svg-tree \
   --out-dir "6_gappa_max_lwr_only" \
   --file-prefix multips_dino_max_heat_ \
   --allow-file-overwriting \
   --under-color "#cccccc" \
   --min-value 0.5 \
   --point-mass

In [None]:
webbrowser.open("6_gappa_max_lwr_only/multips_dino_max_heat_tree.svg", new=2)