## Using Biopython to produce Phylogenetic trees from Multiple Sequence Alignments

This notebook shows the steps of creating a 'distance matrix' from a multiple sequence alignment and producing a phylogenetic tree from that. 

The example sequences used are tRNA sequences from the human reference genome. There is an introduction to the sequences further down this notebook. 

In [None]:
# run this cell to check your Python version is OK for this notebook!
import sys
def check_python_version_above_3_6():
    major = sys.version_info.major
    minor = sys.version_info.minor
    if (major, minor) < (3, 6):  # tuple comparison, so e.g. Python 4.0 is accepted
        print('ERROR you need to run this notebook with Python 3.6 or above (as f-strings used)')
        print('ERROR current Python version is {}.{}'.format(major, minor))        
        print('ERROR Please see:\n',
              '      https://canvas.anglia.ac.uk/courses/15139/pages/azure-notebooks-switching-kernel\n'
              '      for information on switching kernel on Azure Notebooks')
    else:
        print('Python version {}.{} you are good to go'.format(major, minor))
check_python_version_above_3_6()

Run this set-up cell.

In [None]:
import copy
from io import StringIO
%matplotlib inline

Unless you have already installed it, you will need to install Biopython before you can use it. With conda this is easy:

conda install biopython

In [None]:
# this should install biopython on Azure notebooks
# https://notebooks.azure.com/help/jupyter-notebooks/package-installation
!conda install biopython -y

In [None]:
# import the Phylo module from Biopython, checking that it is installed
try:
    from Bio import Phylo
except ModuleNotFoundError:
    print('ERROR BioPython not available you will need to install it')

In [None]:
# import only the two classes this notebook uses, rather than a wildcard import
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio import AlignIO

## Thr tRNA example sequence alignment

Transfer RNA is a very important class of macromolecules - tRNAs transfer amino acids to the growing polypeptide chain at the active site of the ribosome. By recognizing the codon (three-nucleotide 'word') in the messenger RNA, they ensure that the amino acid coded by that codon is added.

The correct amino acid is attached to the transfer-RNA by a specialized enzyme called an amino-acyl tRNA synthetase.

You know that there is more than one codon for some amino acids. For example, threonine (Thr) is specified by four codons in the standard genetic code: ACU, ACC, ACA, and ACG. The final position is not important here, as any nucleotide in that position gives Thr. This is called the 'wobble' base.

There are only three classes of Thr tRNAs in humans, for ACU (AGT), ACG (CGT), and ACA (TGT) (where the second sequence is the anticodon - the reverse complement of the codon, written in DNA letters). The final expected tRNA for ACC (GGT) does not occur and this codon is read by one of the others using non-standard base pairing at the 'wobble' position.
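The codon-to-anticodon relationship described above can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `anticodon_for` is made up for illustration, and it writes the anticodon with DNA letters (T rather than U) only to match the labels used in this notebook.

```python
# Complement of each RNA base, written with DNA letters (T rather than U)
# to match the anticodon labels used in this notebook.
RNA_TO_DNA_COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'U': 'A'}

def anticodon_for(codon):
    """Return the anticodon (DNA letters) for an RNA codon: its reverse complement."""
    return ''.join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(codon))

# The four threonine codons of the standard genetic code
for codon in ['ACU', 'ACC', 'ACA', 'ACG']:
    print(codon, '->', anticodon_for(codon))
```

Each codon is read backwards and every base is swapped for its pairing partner, which is all that 'reverse complement' means.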

Because tRNA is needed in large amounts there are multiple copies of tRNA genes in most organisms. In humans there is a total of 400 or so and they are scattered around the genome. 

There are 20 tRNA Thr genes in the human reference genome. The alignment here is of just 6 representative ones covering the three classes (prepared using the MUSCLE multiple alignment program).

The sequences here are the mature sequences of the RNAs. Remember that the Thymidine nucleotide (base Thymine) in the gene sequence is replaced with Uridine (base Uracil) in the final molecule. There are also many crucial modified bases in mature tRNAs but we are ignoring these complications.

Read in the example file human-Thr-tRNA-mature-examples.afa

In [None]:
aln = AlignIO.read('human-Thr-tRNA-mature-examples.afa', 'fasta')

In [None]:
print(aln)
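Besides printing it, an alignment object can be inspected record by record and column by column. The sketch below uses a hypothetical toy alignment (made-up sequences, not the real tRNA file) held in a string, so it runs without the example file. The variable names `toy_fasta` and `toy_aln` are illustrative only.

```python
from io import StringIO
from Bio import AlignIO

# Hypothetical toy alignment (made-up sequences, not the real tRNA data)
toy_fasta = """>seq1
GGCUCCAUAG
>seq2
GGCUCUAUAG
>seq3
GGC-CCAUGG
"""
toy_aln = AlignIO.read(StringIO(toy_fasta), 'fasta')

print('number of sequences:', len(toy_aln))
print('alignment length:', toy_aln.get_alignment_length())

# Each row of the alignment is a sequence record with an id and a sequence
for record in toy_aln:
    print(record.id, record.seq)

# Slicing with [:, i] returns one column of the alignment as a string
print('column 5:', toy_aln[:, 5])
```

The same calls should work on `aln` once the real file has been read in.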

In [None]:
calculator = DistanceCalculator('identity')

Here the distance calculator is set up to use a simple identity comparison among the sequences. Looking at the alignment, you may be able to see that there are not very many differences among the sequences. This is because they are all from humans and need to function with the same enzymes to accept the amino acid.

Unlike the previous similarity calculations using identity, here the identity is being used to highlight differences between the sequences as these can be used directly as distances. The identity calculation is expressed as a fractional difference over all the columns in common to the sequences. 
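To make the idea of a fractional difference concrete, here is a minimal pure-Python sketch of an identity-based distance between two aligned sequences. It is an illustration, not Biopython's own code. The function name `identity_distance` is made up, and the treatment of gaps (a column containing a gap never counts as a match) is an assumption of this sketch.

```python
def identity_distance(seq1, seq2, skip_letters='-'):
    """Fraction of alignment columns in which two aligned sequences differ.

    Assumption for this sketch: a column containing a gap never counts
    as a match, so gaps increase the distance.
    """
    if len(seq1) != len(seq2):
        raise ValueError('aligned sequences must have the same length')
    matches = sum(
        a == b
        for a, b in zip(seq1, seq2)
        if a not in skip_letters and b not in skip_letters
    )
    return 1 - matches / len(seq1)

# Made-up example sequences (not the real tRNA data)
print(identity_distance('GGCUCCAUAG', 'GGCUCUAUAG'))  # one mismatch in ten columns
print(identity_distance('GGCUCCAUAG', 'GGC-CCAUGG'))  # one gap and one mismatch
```

Identical sequences give a distance of 0, and the distance grows towards 1 as more columns differ.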

After the calculator is created with the model, use its get_distance() method on an alignment object. This returns a DistanceMatrix object.

In [None]:
dm = calculator.get_distance(aln)

In [None]:
dm

In [None]:
print(dm)
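Individual entries of a DistanceMatrix can be looked up by sequence name or by position. The sketch below again uses a hypothetical toy alignment (made-up sequences), and the names `toy_dm` and `closest` are illustrative only.

```python
from io import StringIO
from itertools import combinations
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator

# Hypothetical toy alignment (made-up sequences, not the real tRNA data)
toy_fasta = """>seq1
GGCUCCAUAG
>seq2
GGCUCUAUAG
>seq3
GGC-CCAUGG
"""
toy_aln = AlignIO.read(StringIO(toy_fasta), 'fasta')
toy_dm = DistanceCalculator('identity').get_distance(toy_aln)

# The row and column labels are the sequence names
print(toy_dm.names)

# Look up one distance by a pair of names, or by a pair of indices
print(toy_dm['seq1', 'seq2'])
print(toy_dm[0, 1])

# Find the pair of sequences with the smallest distance
closest = min(combinations(toy_dm.names, 2),
              key=lambda pair: toy_dm[pair[0], pair[1]])
print('closest pair:', closest)
```

The closest pair found this way is the pair that a distance-based tree method would be expected to join first.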

Remember that distance-based methods were just one of the approaches to constructing a phylogenetic tree from a set of aligned sequences.

Check back to see what the other two approaches were called. 

Within the group of distance-based methods there were a number of different algorithms for creating a tree.

The Biopython Phylo module has a DistanceTreeConstructor. This can use either the neighbour-joining ('nj') or the unweighted pair group method with arithmetic mean ('upgma') algorithm. The method to be applied is given as a string parameter.
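To give a feel for what UPGMA does with a distance matrix, here is a small pure-Python sketch of the clustering step. It is a teaching illustration under simplifying assumptions, not Biopython's implementation: it returns only the tree topology as nested tuples and does not compute branch lengths. The function name `upgma` and the toy distances are made up for this example.

```python
def upgma(names, dist):
    """Cluster `names` by UPGMA; `dist[(a, b)]` gives the distance for each pair.

    Returns a nested tuple such as (('a', 'b'), 'c') describing the topology.
    """
    # Each cluster is labelled by a name or a nested tuple, with its leaf count
    sizes = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # 1. Find the pair of clusters with the smallest distance
        pair = min(d, key=d.get)
        a, b = sorted(pair, key=str)
        merged = (a, b)
        # 2. The distance from the merged cluster to every other cluster is
        #    the average of the old distances, weighted by cluster size
        for other in sizes:
            if other in pair:
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # 3. Drop the two old clusters and register the merged one
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))

# Made-up distances between three hypothetical sequences
toy_names = ['seq1', 'seq2', 'seq3']
toy_dist = {('seq1', 'seq2'): 0.1, ('seq1', 'seq3'): 0.2, ('seq2', 'seq3'): 0.3}
toy_topology = upgma(toy_names, toy_dist)
print(toy_topology)
```

The two closest sequences are joined first, and the process repeats on the shrinking matrix until a single cluster remains.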

In [None]:
constructor = DistanceTreeConstructor(calculator, 'upgma')

In [None]:
tree = constructor.build_tree(aln)

The UPGMA algorithm should give a rooted tree. The NJ algorithm would give an unrooted tree. 
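The two methods can be compared by building a tree with each from the same alignment and looking at the `rooted` attribute of the resulting Tree object. The sketch below uses a hypothetical toy alignment of four made-up sequences. The names `toy_calculator` and `trees` are illustrative only.

```python
from io import StringIO
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Hypothetical toy alignment (made-up sequences, not the real tRNA data)
toy_fasta = """>seq1
GGCUCCAUAG
>seq2
GGCUCUAUAG
>seq3
GGCGCCAUGG
>seq4
GGCGCCGUGG
"""
toy_aln = AlignIO.read(StringIO(toy_fasta), 'fasta')
toy_calculator = DistanceCalculator('identity')

# Build one tree per method from the same alignment
trees = {}
for method in ('upgma', 'nj'):
    constructor = DistanceTreeConstructor(toy_calculator, method)
    trees[method] = constructor.build_tree(toy_aln)
    leaf_names = sorted(leaf.name for leaf in trees[method].get_terminals())
    print(method, 'rooted:', trees[method].rooted, 'leaves:', leaf_names)
```

Both trees contain the same leaves. Only the way they are joined, and whether the tree is treated as rooted, may differ.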

The Phylo module has a hierarchical Tree object that uses the phylogenetic term 'Clade' for groups.

It takes the sequence names as the labels of the 'leaf' nodes of the tree.

The tree also has branching nodes - inner nodes - which are given default names by the constructor.

In [None]:
print(tree)
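The leaf and inner nodes can also be listed programmatically rather than read off the printed tree. This sketch builds a UPGMA tree from a hypothetical toy alignment (made-up sequences) and walks over its clades. The variable names are illustrative only.

```python
from io import StringIO
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Hypothetical toy alignment (made-up sequences, not the real tRNA data)
toy_fasta = """>seq1
GGCUCCAUAG
>seq2
GGCUCUAUAG
>seq3
GGCGCCAUGG
>seq4
GGCGCCGUGG
"""
toy_aln = AlignIO.read(StringIO(toy_fasta), 'fasta')
toy_tree = DistanceTreeConstructor(DistanceCalculator('identity'), 'upgma').build_tree(toy_aln)

# Leaf nodes carry the sequence names
toy_leaves = [leaf.name for leaf in toy_tree.get_terminals()]
print('leaves:', toy_leaves)

# Inner nodes get default names; list the leaves grouped under each one
for clade in toy_tree.get_nonterminals():
    members = sorted(leaf.name for leaf in clade.get_terminals())
    print(clade.name, 'branch length:', clade.branch_length, 'contains:', members)
```

Listing the leaves under each inner node is a quick way to see which sequences have been grouped together.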

The module has a simple ASCII method for drawing the tree in the output window.

In [None]:
Phylo.draw_ascii(tree)

There is also a nicer graphic view available. 

In [None]:
Phylo.draw(tree)

Looking at the anticodon labels for these tRNA groupings, do you notice anything interesting? 