# Introduction to Bioinformatics

## 04 - Phylogenetic trees

In the third practice session, we learned how to write a Python function that generated multiple sequence alignment (MSA) objects in BioPython's format. These MSA objects are the entry point for most of the methods of evolutionary sequence comparison.

This practical session will explore how to generate phylogenetic trees for comparing our sequences. 

This notebook assumes that you have all the required programs installed. If you have not installed them yet, please refer to the preparation instructions in the README.md file on the GitHub page for practical session 04.

### Introduction

Phylogenetic trees aim to capture how genomic sequences evolve. In molecular biology, they depict the evolutionary relationships among a set of homologous sequences, based on similarities and differences in their physical or genetic characteristics. As a data structure, a tree is a hierarchical set of nodes connected by edges.

Some definitions are essential to work with trees (extracted from [http://etetoolkit.org](http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#treess)):

- A node with a child is called the child's parent node (or ancestor node, or superior). 
- A node has at most one parent.


- The topmost node in a tree is called the root node.
- The root node has no parents.
- All other nodes can be reached from it by following edges or links.


- The height of a node is the length of the longest downward path to a leaf from that node. 
- The height of the root is the height of the tree.
- The depth of a node is the length of the path to its root (i.e., its root path).


- Nodes at the bottommost level of the tree are called leaf nodes.
- Leaf nodes do not have any children.
- An internal node or inner node is any node of a tree with child nodes and is thus not a leaf node.


- A subtree is a portion of a tree data structure that can be viewed as a complete tree in itself.
- Any node in a tree T, together with all the nodes below it, comprise a subtree of T. 
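
The definitions above can be sketched in a few lines of plain Python. This is only an illustrative toy, not how ETE implements its trees; the `Node` class and the node names are made up for the example.

```python
class Node:
    """Minimal tree node: a name, an optional parent, and a list of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_leaf(self):
        return not self.children

    def is_root(self):
        return self.parent is None

    def depth(self):
        """Number of edges on the path from this node up to the root."""
        return 0 if self.is_root() else 1 + self.parent.depth()

    def height(self):
        """Number of edges on the longest downward path to a leaf."""
        if self.is_leaf():
            return 0
        return 1 + max(child.height() for child in self.children)

    def leaves(self):
        """All leaf nodes in the subtree rooted at this node."""
        if self.is_leaf():
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Toy tree: root -> (inner -> (a, b), c)
root = Node("root")
inner = Node("inner", parent=root)
a = Node("a", parent=inner)
b = Node("b", parent=inner)
c = Node("c", parent=root)

print(root.height())                     # 2
print(a.depth())                         # 2
print([n.name for n in root.leaves()])   # ['a', 'b', 'c']
print([n.name for n in inner.leaves()])  # ['a', 'b']
```

Calling `leaves()` on `inner` shows the subtree idea: any node, together with everything below it, behaves like a complete tree.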

In bioinformatics, trees are the result of many analyses, such as phylogenetics or clustering. ETE is a Python toolkit that assists in the automated manipulation, analysis, and visualization of any kind of hierarchical tree. It provides general methods to handle and visualize tree topologies and specific modules to deal with phylogenetic and clustering trees.

### Building a Phylogenetic Tree from a multiple sequence fasta file

We will use the ETE library's command-line interface to build trees and its Python Application Programming Interface (API) to visualize them.

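We will use as an example our previous 20 cluster-center sequences for the 2-hydroxymuconate tautomerase enzyme. The ETE command-line program provides workflows covering every step in constructing a phylogenetic tree from a FASTA file. The standard_raxml workflow used here aligns the sequences with Clustal Omega and then infers the tree with RAxML, which is why the output path used later contains clustalo_default and raxml_default. We start from an unaligned FASTA file containing our sequences (in the input directory) and execute: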

```ete3 build -w standard_raxml -a input/2HDXMT_clusters.fasta -o output_tree```

### Accessing Tree information with the Python library ete3

The previous command created a directory called output_tree, in which it saved all the output produced by the tree construction. We can now import particular objects from the ete3 library to start accessing the tree data: 

In [None]:
from ete3 import Tree, TreeStyle, TextFace, NodeStyle

We will parse the calculated phylogenetic tree data from a [Newick-format](https://en.wikipedia.org/wiki/Newick_format#:~:text=In%20mathematics%2C%20Newick%20tree%20format,Maddison%2C%20Christopher%20Meacham%2C%20F.) file. First, we print the content of this file:

In [None]:
tree_file = 'output_tree/clustalo_default-none-none-raxml_default/2HDXMT_clusters.fasta.final_tree.nw'

#  The output is stored in one line
with open(tree_file) as tf:
    for l in tf:
        print(l)
        print()

We observe that the tree nodes are positioned inside different parentheses with the node names and the edge distances. It is better to load this information into the ETE library's Python API to help us build more valuable visualizations from this data. 
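
To make the parenthesis notation more concrete, here is a minimal sketch that writes a small nested structure as a Newick string. The tuple layout `(name, branch_length, children)`, the function `to_newick`, and the names and branch lengths are all made up for illustration; ETE has its own parser and writer.

```python
def to_newick(node):
    """Serialize a nested (name, branch_length, children) tuple to Newick text."""
    name, length, children = node
    text = ""
    if children:
        # Children are written inside parentheses, separated by commas
        text += "(" + ",".join(to_newick(child) for child in children) + ")"
    text += name
    if length is not None:
        # The distance to the parent follows the name after a colon
        text += ":" + str(length)
    return text


# Toy tree with made-up names and branch lengths
toy_tree = ("", None, [
    ("", 0.5, [
        ("A", 0.1, []),
        ("B", 0.2, []),
    ]),
    ("C", 0.7, []),
])

# A Newick string ends with a semicolon
newick = to_newick(toy_tree) + ";"
print(newick)  # ((A:0.1,B:0.2):0.5,C:0.7);
```

Each pair of parentheses groups the children of one internal node, so the nesting depth of a leaf name reflects its depth in the tree.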

We start by loading the same file as a Tree object within the ete3 Python library:

In [None]:
# Load a tree structure from a newick file.
t = Tree(tree_file)
print(t)

When we call print on the Tree object, it displays an ASCII sketch of the tree topology. However, this ASCII view does not draw the branches to scale, so the edge distances are not represented.

We can inspect the type and documentation of the object returned by Tree() to learn more about its methods and attributes:

In [None]:
print(type(t))
help(t)

The Tree() call that parses the Newick file returns a TreeNode object. This object is how the ETE library represents trees and their information, and all the tree information is accessible through it. For example, we can select a specific leaf node in the tree:

In [None]:
# We can select specific nodes by name
A = t.search_nodes(name="tr|A0A135L6Q7|A0A135L6Q7_9BACI")[0]
print(type(A))
print(A)

Note that each node in the tree (even a leaf) is itself a TreeNode instance. This means that each node can be treated as the root of a subtree of our tree. A TreeNode object can be iterated to access all the leaves below it:

In [None]:
# Iterate over the leaves of the tree
for l in t:
    print(l)
    print(l.name)
    print()

The names of the leaves are the first part of the descriptions in our FASTA file. We can change them directly in the TreeNode objects representing each leaf. Let's rename each leaf to its UniProt ID only:

In [None]:
# Iterating over a Tree object visits its leaves by default.
t = Tree(tree_file)
for x in t:
    x.name = x.name.split('|')[1] # Change the name attribute value to the UniProt ID
print(t)

We see now that the displayed names correspond correctly to the UniProt IDs of our sequences.
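
The renaming step relies on the identifiers having three fields separated by `|`: a database code (`sp` for Swiss-Prot, `tr` for TrEMBL), the accession, and the entry name. The sketch below makes that assumption explicit and fails loudly on anything else; `parse_uniprot_id` is a hypothetical helper, not part of ETE.

```python
def parse_uniprot_id(identifier):
    """Split a 'db|accession|entry_name' identifier into its three fields."""
    parts = identifier.split("|")
    if len(parts) != 3:
        # Guard against names that do not follow the expected pattern
        raise ValueError(f"Unexpected identifier format: {identifier!r}")
    database, accession, entry_name = parts
    return database, accession, entry_name


db, acc, entry = parse_uniprot_id("tr|A0A135L6Q7|A0A135L6Q7_9BACI")
print(db, acc, entry)  # tr A0A135L6Q7 A0A135L6Q7_9BACI
```

Using `split('|')[1]` directly, as in the cell above, is equivalent when every name follows this pattern; the explicit check only helps when the input is less uniform.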

### Visualizing phylogenetic trees with Python

We can create alternative visualizations for our tree depending on our needs. The ETE library offers several ways of displaying trees. Let us try a few of them. 

Most visualizations will be rendered as image files (PNG, PDF, and SVG formats are supported), so we import the Image class from IPython.display to show them directly in our Jupyter Notebook.

In [None]:
from IPython.display import Image

Before generating the images of our alternative tree visualizations, we create a folder called "images" to store them all. For that, we import the os library, which has methods for creating system folders:

In [None]:
import os

In [None]:
# Create a folder called "images" if it does not exist
if not os.path.exists('images'):
    os.mkdir('images')

We start with a circular display of our tree. The code below has some commented options that affect how the tree visualization changes. Feel free to uncomment and change them to see how they affect the tree display.

In [None]:
# Display the Tree in a circular style
circular_style = TreeStyle()
circular_style.mode = "c" # draw tree in circular mode

# Options | Uncomment to see their effect

# circular_style.scale = 20
# circular_style.arc_start = 180 # 0 degrees = 3 o'clock
# circular_style.arc_span = 180
# circular_style.title.add_face(TextFace("2-hydroxymuconate tautomerase enzyme Tree", fsize=10), column=0)

td = t.render("images/mytree_cs.png", w=120, units="mm", tree_style=circular_style)
Image('images/mytree_cs.png')

Now we try the standard tree layout style. 

In [None]:
# Display the Tree in a standard tree style
ts = TreeStyle()

# Options | Uncomment to see their effect

# ts.show_leaf_name = True # Names of the leafs
# ts.show_branch_length = True
# ts.show_branch_support = True
# ts.scale = 100 # Pixels per branch-length unit (horizontal zoom)
# ts.rotation = 90
# ts.title.add_face(TextFace("2-hydroxymuconate tautomerase enzyme Tree", fsize=10), column=0)

td = t.render("images/mytree_ts.png", w=100, units="mm", tree_style=ts)
Image('images/mytree_ts.png')

We can also change the way the objects representing the tree appear in the tree visualization:

In [None]:
# Load the tree and rename its leaves.
t = Tree(tree_file)
for x in t:
    x.name = x.name.split('|')[1] # Change the name attribute value
    
# Draws nodes as small red spheres of diameter equal to 7 pixels
nstyle = NodeStyle()
nstyle["shape"] = "sphere"
nstyle["size"] = 7
nstyle["fgcolor"] = "darkred"

# Gray dashed branch lines
# nstyle["hz_line_type"] = 1
# nstyle["hz_line_color"] = "#cccccc"

# Applies the same static style to all nodes in the tree. Note that,
# if "nstyle" is modified, the changes will affect all nodes
for n in t.traverse():
    n.set_style(nstyle)
    
# for n in t:
#     n.set_style(nstyle)
    
# Display the Tree in a standard tree style
td = t.render("images/mytree_ts_rn.png", w=100, units="mm", tree_style=ts)
Image('images/mytree_ts_rn.png')

Finally, we explore how to change the background of selected nodes.

In [None]:
# Load the tree and rename its leaves.
t = Tree(tree_file)
for x in t:
    x.name = x.name.split('|')[1] # Change the name attribute value
    
# Define specific nodes
nst1 = NodeStyle()
nst1["bgcolor"] = "LightSteelBlue"
nst2 = NodeStyle()
nst2["bgcolor"] = "Moccasin"
nst3 = NodeStyle()
nst3["bgcolor"] = "DarkSeaGreen"
nst4 = NodeStyle()
nst4["bgcolor"] = "Khaki"

# Change background color to specific node
n1 = t.get_common_ancestor("P70994", "A0A553ZU89")
n1.set_style(nst1)

n2 = t.get_common_ancestor("J7J0V0", "A0A5R8QHJ7")
n2.set_style(nst2)

n3 = t.get_leaves_by_name('K6BVQ0')[0]
n3.set_style(nst3)

n4 = t.get_common_ancestor("A0A135L6Q7", "A0A430B1P6")
n4.set_style(nst4)

td = t.render("images/mytree_ts_bgc.png", w=100, units="mm", tree_style=ts)
Image('images/mytree_ts_bgc.png')

### Displaying MSA information beside the phylogenetic tree


We can visualize MSA information together with our tree display. The MSA provides more detailed information to depict domain composition, sequence alignment, or any other information relevant to contrast with the evolutionary relationship among our homologous sequences. 

Here we load the phylogenetic tree file with a different class, PhyloTree, which can link a multiple sequence alignment to the tree so that both are drawn together.

In [None]:
from ete3 import PhyloTree

In [None]:
# Define the path to the MSA file
alignment_file = 'output_tree/clustalo_default-none-none-raxml_default/2HDXMT_clusters.fasta.final_tree.used_alg.fa'

# Load the newick format tree file
t = PhyloTree(tree_file)

# Link the msa file to the PhyloNode object
t.link_to_alignment(alignment=alignment_file, alg_format="fasta")

# We change the name of the nodes to the UniProt IDs.
for x in t:
    x.name = x.name.split('|')[1] # Change the name attribute value

# Then we render and display the tree image.
td = t.render("images/mytree_ts_fasta.png", w=1000, units="mm")
Image('images/mytree_ts_fasta.png')

### Working with phylogenetic lineages

Often, we need to relate the distances and groupings in the evolutionary tree to taxonomic lineages. NCBI provides this information in its taxonomy database, which the ETE library can access directly. First, however, we need the species name for each protein sequence. For this session, we have created a small script (uniprot.py) containing a function (getOrganism) that retrieves the organism name from the UniProt webpage (so you will need an internet connection). The function depends on the scrapy library, so install it before running this function.

We import this function and create a dictionary relating the UniProt IDs to the species names:

In [None]:
from uniprot import getOrganism

Now we can annotate the tree using the above function:

In [None]:
# Collect the species name of every leaf in a dictionary for later use
species = {}

# Iterate over the leaves and look up the organism of each UniProt ID
for n in t.get_leaves():

    # Store the species name under the leaf's UniProt ID
    species[n.name] = getOrganism(n.name)
    
print(species)

We notice that the species' names are not in the standard binomial nomenclature (genus + specific epithet). Therefore we need to change them by grabbing only the first two words for each entry.

In [None]:
# Print the species of the last leaf visited in the loop above
print(species[n.name])
# Split that species name into words
print(species[n.name].split())
# Keep only the first two words
print(species[n.name].split()[:2])
# Join the first two words with a space character
print(' '.join(species[n.name].split()[:2]))

In [None]:
# Replace each value with only its first two words
for s in species:
    species[s] = ' '.join(species[s].split()[:2])
print(species)

Now that we have the species names for each of our proteins, we can query the NCBI taxonomy database. ete3 has a dedicated class, NCBITaxa, for this. The first time the class is instantiated, it takes a while because it must first download the data from NCBI and build a local copy of the database.

We import this class here; the download, if needed, happens when we create the NCBITaxa object in the next step:

In [None]:
from ete3 import NCBITaxa

The first thing we need to do is to create a dictionary mapping our species names to the taxonomic IDs of the NCBI taxonomic database:

In [None]:
ncbi = NCBITaxa()
name2taxid = ncbi.get_name_translator([*species.values()])

In [None]:
for k,v in name2taxid.items():
    print(k,v)

We can use this dictionary to name each node with its corresponding taxonomic ID. Let us define a function to do that:

In [None]:
# This function is employed to assign a name to each tree leaf
def get_species_taxid(node_name):
    
    # Get the name of the species using the global species dictionary
    sp_name = species[node_name]
    
    # Get the taxid of the species using the global name2taxid dictionary
    taxid = name2taxid[sp_name][0]
    
    return taxid
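
The function above raises a KeyError if a species name has no entry in name2taxid, which can happen when a name taken from UniProt does not match any NCBI name. A more defensive variant could look like the sketch below. It assumes, as in the cells above, that name2taxid maps each name to a list of taxonomic IDs; `lookup_taxid` and the `ID_1`/`ID_2` keys are hypothetical, while 1423 is the NCBI taxonomic ID of Bacillus subtilis.

```python
def lookup_taxid(node_name, species, name2taxid):
    """Return the first taxid for a leaf, or None when no match exists."""
    sp_name = species.get(node_name)
    if sp_name is None:
        # The leaf has no species name recorded
        return None
    taxids = name2taxid.get(sp_name, [])
    # Take the first taxid if the name was translated, otherwise None
    return taxids[0] if taxids else None


# Hypothetical inputs shaped like the dictionaries built above
example_species = {"ID_1": "Bacillus subtilis", "ID_2": "Unknown organism"}
example_name2taxid = {"Bacillus subtilis": [1423]}

print(lookup_taxid("ID_1", example_species, example_name2taxid))  # 1423
print(lookup_taxid("ID_2", example_species, example_name2taxid))  # None
print(lookup_taxid("ID_3", example_species, example_name2taxid))  # None
```

Returning None instead of raising lets the caller decide whether to skip the leaf or report the unmatched name.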

We create a new tree with the PhyloTree class into which we also put the MSA information. We then change the node names to the taxid using our previous function.

In [None]:
# Create a new tree from the newick file
t2 = PhyloTree(tree_file)

# Give the tree the multiple alignment information
t2.link_to_alignment(alignment=alignment_file, alg_format="fasta")

# Iterate over each leaf and change the node name to its NCBI taxonomic ID
for n in t2.get_leaves():
    n.name = get_species_taxid(n.name.split('|')[1])
    
# Print the tree to observe the node names
print(t2)

The NCBITaxa object has a method, annotate_tree, that adds complete taxonomic information to the tree. It takes the tree as an argument, by default reads the taxonomic ID from each leaf name, and returns three dictionaries giving access to the names, lineages, and ranks of the taxa involved.

In [None]:
tax2names, tax2lineages, tax2rank = ncbi.annotate_tree(t2)

In [None]:
print(tax2names)

In [None]:
print(tax2lineages)

In [None]:
print(tax2rank)

In [None]:
for n in t2.get_leaves():
    print(n.sci_name)
    print(n.taxid)
    print(n.named_lineage) 
    print(n.lineage)
    print(n.rank)
    print()
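
The lineage of each leaf is a list of taxonomic IDs ordered from the root of the taxonomy down to the taxon itself. Two lineages therefore share a common prefix, and the last element of that prefix is their most specific shared taxon. The sketch below illustrates the idea with made-up lineages; `last_common_taxid` is a hypothetical helper and the numbers are not real NCBI data.

```python
def last_common_taxid(lineage_a, lineage_b):
    """Return the deepest taxid shared by two root-first lineages, or None."""
    common = None
    for taxid_a, taxid_b in zip(lineage_a, lineage_b):
        if taxid_a != taxid_b:
            # The lineages diverge here, so stop at the previous taxid
            break
        common = taxid_a
    return common


# Toy lineages (made-up IDs), ordered from the root to the taxon
lineage_x = [1, 2, 30, 40, 50]
lineage_y = [1, 2, 30, 41]

print(last_common_taxid(lineage_x, lineage_y))  # 30
```

With real data, the returned ID could be looked up in the tax2names and tax2rank dictionaries to see the name and rank at which two sequences' organisms diverge.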

We can now display the tree with the scientific name that the annotation assigned to each node:

In [None]:
print(t2.get_ascii(attributes=["sci_name"]))

We can also change back the names of the nodes to their species scientific names and redisplay the tree:

In [None]:
for n in t2.get_leaves():
    n.name = ' '.join(n.sci_name.split()[:2])

td = t2.render("images/mytree_ts_fasta.png", w=1000, units="mm")
Image('images/mytree_ts_fasta.png')

### Calculating evolutionary distances between nodes

Now that we have some control over how to visualize a phylogenetic tree, we move on to calculating distances between nodes, a quantitative measure of how far apart two nodes (i.e., sequences) are in our tree. Let's use three nodes to see how these calculations are carried out:

In [None]:
# Locate some nodes
A = t.search_nodes(name="A0A135L6Q7")[0]
B = t.search_nodes(name="A0A433XHH6")[0]
C = t.search_nodes(name="A0A562QMX2")[0]

# Calculate the branch-length distance from one node to another
print("The distance between B and C is",  B.get_distance(C))
print("The distance between C and B is",  C.get_distance(B))
print()

# The same distance can be requested from any node by passing both targets
print("The distance between A and B is",  A.get_distance(B))
print("The distance between B and A is",  t.get_distance(A,B))
print()

# Calculate the topological distance (number of nodes in between)
print("The number of nodes between A and C is ",
    t.get_distance(A,C, topology_only=True))
print("The number of nodes between C and A is ",
    t.get_distance(C,A, topology_only=True))
print(t)
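
Conceptually, the branch-length distance between two nodes is the sum of the edge lengths along the path that joins them through their common ancestor. The sketch below reproduces that idea on a toy tree in plain Python. It is not ETE's implementation, and the node names, branch lengths, and helper functions are made up for illustration.

```python
# Toy tree (made-up names and branch lengths), stored as child -> (parent, length)
#
#          root
#         /    \
#   n1 (0.5)    C (0.7)
#    /     \
# A (0.1)  B (0.2)
parents = {
    "n1": ("root", 0.5),
    "A": ("n1", 0.1),
    "B": ("n1", 0.2),
    "C": ("root", 0.7),
}


def path_to_root(node):
    """Return [(ancestor, distance_from_start), ...] walking up to the root."""
    path = [(node, 0.0)]
    total = 0.0
    while node in parents:
        parent, length = parents[node]
        total += length
        path.append((parent, total))
        node = parent
    return path


def tree_distance(a, b):
    """Sum of branch lengths on the path between two nodes."""
    dist_from_a = dict(path_to_root(a))
    for ancestor, dist_from_b in path_to_root(b):
        if ancestor in dist_from_a:
            # The first shared node on the way up is the common ancestor
            return dist_from_a[ancestor] + dist_from_b
    raise ValueError("Nodes are not in the same tree")


print(round(tree_distance("A", "B"), 3))  # 0.3
print(round(tree_distance("A", "C"), 3))  # 1.3
print(round(tree_distance("C", "A"), 3))  # 1.3
```

The last two lines show the same symmetry checked in the cell above: the distance does not depend on which node is taken as the starting point.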

We can also compute the farthest node of a specific node (and its distance):

In [None]:
# Calculate the farthest node
for x in (A,B,C):
    farthest, distance = x.get_farthest_node()
    print("The farthest node from node "+x.name+" is",  farthest.name)
    print("They are at a distance of",  distance)
    print()

Finally, we use these metrics to create a function that derives a distance matrix from the phylogenetic distances between leaves.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def getDistanceMatrixFromTree(tree, topological=False):

    # Collect the leaves once (iterating over a tree visits its leaves)
    leaves = list(tree)

    # Create an array of shape NxN filled with zeros, so the diagonal
    # (the distance of a leaf to itself) is already zero
    M = np.zeros((len(leaves), len(leaves)))

    # Double iteration to compare each leaf node to all the others
    for i, node_i in enumerate(leaves):
        for j, node_j in enumerate(leaves):

            # Calculate each pair once and mirror it, since the matrix is symmetric
            if j > i:
                M[i][j] = node_i.get_distance(node_j, topology_only=topological)
                M[j][i] = M[i][j]

    return M
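
The "compute each pair once and mirror it" pattern used above is independent of ETE. As a minimal sketch, the same logic can be written for any pairwise distance function with itertools.combinations. Here `pairwise_matrix` is a hypothetical helper and the toy distance is just the absolute difference between numbers.

```python
from itertools import combinations


def pairwise_matrix(items, distance):
    """Build a symmetric matrix (list of lists) from a pairwise distance function."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    # combinations yields each unordered pair (i < j) exactly once
    for (i, item_i), (j, item_j) in combinations(enumerate(items), 2):
        matrix[i][j] = matrix[j][i] = distance(item_i, item_j)
    return matrix


# Toy example: the distance is the absolute difference between two numbers
values = [1.0, 4.0, 6.0]
M_toy = pairwise_matrix(values, lambda a, b: abs(a - b))
for row in M_toy:
    print(row)
# [0.0, 3.0, 5.0]
# [3.0, 0.0, 2.0]
# [5.0, 2.0, 0.0]
```

For N leaves this evaluates the distance N(N-1)/2 times instead of N squared, which matters because each tree distance requires walking through the tree.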

We get the tree distance matrix and plot it with matplotlib:

In [None]:
ids = [n.name for n in t]

In [None]:
M = getDistanceMatrixFromTree(t)
plt.matshow(M)
cbar = plt.colorbar()
cbar.set_label('Distance between connected nodes')
plt.title('Tree distance matrix')
plt.xlabel('Sequence index i')
plt.ylabel('Sequence index j')

We also show the topological distance matrix:

In [None]:
M = getDistanceMatrixFromTree(t, topological=True)
plt.matshow(M)
cbar = plt.colorbar()
cbar.set_label('Topological distance between connected nodes')
plt.title('Tree topological distance matrix')
plt.xlabel('Sequence index i')
plt.ylabel('Sequence index j')

### Wrapping up

In this fourth practice session, we learned:

- How to generate a phylogenetic tree from a fasta file of sequences
- How to load and access the tree information in Python using the ETE library
- How to display the tree with different visualization options
- How to incorporate the MSA display together with our tree
- How to access the complete taxonomic information of our sequences using the NCBI taxonomic database
- How we can calculate evolutionary distances between the nodes of our tree