Contact: malalarhm93@gmail.com

In [1]:
import numpy as np  # np.array is used for the cost matrices below

from malala import *
from canonical import *
from unrooted1 import *

### Generating trees

To generate all possible rooted trees for a given list of species, use the function enumerate_labelled_trees().


Each species should be represented as an integer, as in the examples below.

An internal node of the tree is represented as [left_child, right_child]; a leaf is just the species number.

In [2]:
#trees for 2 species
enumerate_labelled_trees([1,2])

[[1, 2]]

In [3]:
#trees for 3 species
enumerate_labelled_trees([1,2,3])

[[[1, 3], 2], [1, [2, 3]], [[1, 2], 3]]

In [4]:
#trees for 4 species
enumerate_labelled_trees([1,2,3,4])

[[[1, 3], [2, 4]],
 [[[1, 3], 2], 4],
 [[[1, 4], 3], 2],
 [[1, [3, 4]], 2],
 [[[1, 3], 4], 2],
 [[1, 4], [2, 3]],
 [[1, [2, 3]], 4],
 [1, [[2, 4], 3]],
 [1, [2, [3, 4]]],
 [1, [[2, 3], 4]],
 [[1, 2], [3, 4]],
 [[[1, 2], 3], 4],
 [[[1, 4], 2], 3],
 [[1, [2, 4]], 3],
 [[[1, 2], 4], 3]]
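
The enumeration can be reproduced in a few lines. The sketch below is not the code from malala: insert_leaf and sketch_enumerate_trees are hypothetical names, and the order of the returned trees may differ from enumerate_labelled_trees(). It grows the trees one species at a time by attaching the new leaf to every edge of every smaller tree, including the position above the root.

```python
def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree`."""
    # Attach the new leaf above the current (sub)tree.
    yield [tree, leaf]
    # If the (sub)tree is an internal node, also attach below it on either side.
    if isinstance(tree, list):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield [new_left, right]
        for new_right in insert_leaf(right, leaf):
            yield [left, new_right]


def sketch_enumerate_trees(species):
    """Return all rooted binary trees on `species` as nested lists."""
    trees = [species[0]]
    for leaf in species[1:]:
        trees = [grown for tree in trees for grown in insert_leaf(tree, leaf)]
    return trees


for n in (2, 3, 4):
    print(n, "species:", len(sketch_enumerate_trees(list(range(1, n + 1)))), "trees")
```

For 2, 3 and 4 species this gives 1, 3 and 15 trees, the same lengths as the lists above.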

### Counting the number of changes using the Sankoff algorithm

The function Sankoff() computes the minimum total cost of changes (the parsimony score) on a given tree. Its inputs are the tree's topology, the alphabet, the observed character at each leaf, and the cost matrix. It returns the score together with the cost vector at the root.


An illustration of how to use this function is shown below. 

In [5]:
cost_matrix=np.array([[0,2.5,1,2.5],[2.5,0,2.5,1],[1,2.5,0,2.5],[2.5,1,2.5,0]])
print(cost_matrix)

[[0.  2.5 1.  2.5]
 [2.5 0.  2.5 1. ]
 [1.  2.5 0.  2.5]
 [2.5 1.  2.5 0. ]]


In [6]:
tree=[[1,2],[3,[4,5]]]
alphabet=['A','C','G','T']
observedCharacters=['C','A','C','A','G']
#compute the parsimony score for the tree above
s,v=Sankoff(tree,alphabet,observedCharacters,cost_matrix)
print('The parsimony score of this phylogeny is',s,'. \nThe vector cost in the root of this is',v,".\n")

The parsimony score of this phylogeny is 6.0 . 
The vector cost in the root of this is [6. 6. 7. 8.] .
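
One way to see what Sankoff() computes is the following minimal sketch of the Sankoff dynamic programme. It is an illustration under stated assumptions, not the implementation in malala: sankoff_sketch is a hypothetical name, leaf k is assumed to carry observedCharacters[k-1], and cost_matrix[i][j] is assumed to be the cost of a change between alphabet[i] and alphabet[j].

```python
import numpy as np


def sankoff_sketch(tree, alphabet, observed, cost):
    """Return (parsimony score, cost vector at the root) for one character."""

    def cost_vector(node):
        # Leaf: cost 0 for the observed state, infinite for every other state.
        if not isinstance(node, list):
            vec = np.full(len(alphabet), np.inf)
            vec[alphabet.index(observed[node - 1])] = 0.0
            return vec
        left, right = cost_vector(node[0]), cost_vector(node[1])
        # For each parent state s, take the cheapest child state t on each side:
        # min over t of cost[s, t] + child[t].
        return (cost + left).min(axis=1) + (cost + right).min(axis=1)

    root = cost_vector(tree)
    return root.min(), root


cost_matrix = np.array([[0, 2.5, 1, 2.5],
                        [2.5, 0, 2.5, 1],
                        [1, 2.5, 0, 2.5],
                        [2.5, 1, 2.5, 0]])
score, root_vector = sankoff_sketch([[1, 2], [3, [4, 5]]],
                                    ['A', 'C', 'G', 'T'],
                                    ['C', 'A', 'C', 'A', 'G'],
                                    cost_matrix)
print(score, root_vector)
```

With the inputs of In [6], the sketch gives the same score (6.0) and the same root vector ([6. 6. 7. 8.]) as the output above.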



### Genome generator

generateDriver() is a function that simulates genome data using the Jukes-Cantor model.


It takes an initial genome and the template of the desired tree as input.


In the template tree, each internal node is represented as [time, left_subtree, right_subtree] and each leaf as [time], where time is the molecular time on the edge leading to that node.

In [7]:
initialGenome=5*['a']
templateTree=[1.0, [5.0,[1.0],[5.0]]  , [0.05,[0.1],[0.1]]]
simulatedTree=generateDriver(initialGenome,templateTree)
print("Tree with edges weighted by molecular time.")
print(templateTree)
print("Simulated tree")
print(simulatedTree)

Tree with edges weighted by molecular time.
[1.0, [5.0, [1.0], [5.0]], [0.05, [0.1], [0.1]]]
Simulated tree
[['c', 'a', 't', 'a', 'a'], [['g', 'g', 'g', 'a', 'c'], [['g', 'g', 'g', 'a', 'g']], [['c', 'g', 'c', 't', 't']]], [['c', 'a', 't', 'a', 'a'], [['c', 'a', 't', 'a', 'c']], [['c', 'a', 'g', 'a', 'a']]]]
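
A minimal sketch of such a simulation is given below. It assumes that each time in the template is a branch length in expected substitutions per site, which may not be how generateDriver() scales time, and evolve_branch and simulate_template are hypothetical names. Under the Jukes-Cantor model a site then differs from its ancestor with probability 3/4 (1 - exp(-4t/3)), and the new base is drawn uniformly from the other three.

```python
import math
import random

BASES = ['a', 'c', 'g', 't']


def evolve_branch(genome, t, rng):
    """Evolve `genome` along one branch of length t (assumed: substitutions per site)."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    evolved = []
    for base in genome:
        if rng.random() < p_change:
            # Jukes-Cantor: the three other bases are equally likely.
            evolved.append(rng.choice([b for b in BASES if b != base]))
        else:
            evolved.append(base)
    return evolved


def simulate_template(genome, template, rng):
    """Return [genome, left, right] for internal nodes and [genome] for leaves."""
    evolved = evolve_branch(genome, template[0], rng)
    if len(template) == 1:
        return [evolved]
    return [evolved,
            simulate_template(evolved, template[1], rng),
            simulate_template(evolved, template[2], rng)]


rng = random.Random(0)  # fixed seed so the run is repeatable
sketch_tree = simulate_template(5 * ['a'],
                                [1.0, [5.0, [1.0], [5.0]], [0.05, [0.1], [0.1]]],
                                rng)
print(sketch_tree)
```

The result has the same nested shape as the simulated tree above; the genomes themselves depend on the random draws.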


The output of generateDriver() is converted to our tree representation by convertTree(). 

In [8]:
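def convertTree(tree):
    # A leaf is [genome]; an internal node is [genome, left, right].
    if len(tree) == 1:
        return tree[0]
    # Drop the internal node's own genome and keep only its two subtrees.
    return [convertTree(tree[1]), convertTree(tree[2])]

convertTree(simulatedTree)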

[[['g', 'g', 'g', 'a', 'g'], ['c', 'g', 'c', 't', 't']],
 [['c', 'a', 't', 'a', 'c'], ['c', 'a', 'g', 'a', 'a']]]

A template tree is defined recursively by two rules:

    - [time] is a template tree
    
    - [time, tree1, tree2] is a template tree if tree1 and tree2 are template trees
    
Every template tree can be generated by applying these two rules a finite number of times.
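
The two rules translate directly into a recursive check. The sketch below uses the list notation of templateTree above; is_template_tree is a hypothetical helper, and the requirement that times are non-negative numbers is an assumption.

```python
def is_template_tree(tree):
    """Check the two recursive rules for a template tree written as nested lists."""
    # A template tree is a list of length 1 (leaf) or 3 (time plus two subtrees).
    if not isinstance(tree, list) or len(tree) not in (1, 3):
        return False
    time = tree[0]
    # Assumption: a time is a non-negative number (bool is excluded explicitly).
    if isinstance(time, bool) or not isinstance(time, (int, float)) or time < 0:
        return False
    # Rule 1 has no subtrees to check; rule 2 requires both subtrees to be valid.
    return all(is_template_tree(subtree) for subtree in tree[1:])


print(is_template_tree([1.0, [5.0, [1.0], [5.0]], [0.05, [0.1], [0.1]]]))
print(is_template_tree([1.0, [2.0]]))
```

The first call checks the template used in In [7]; the second has only one subtree, so it does not satisfy either rule.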

The extract_genomes() function extracts the leaf genomes from the output of generateDriver() so that they can be used as a dataset.

In [9]:
def extract_genomes(tree):
    if len(tree)==1 :
        return(tree)
    else :
        return(extract_genomes(tree[1])+extract_genomes(tree[2]))

data=extract_genomes(simulatedTree)
data

[['g', 'g', 'g', 'a', 'g'],
 ['c', 'g', 'c', 't', 't'],
 ['c', 'a', 't', 'a', 'c'],
 ['c', 'a', 'g', 'a', 'a']]

### Example finding the most parsimonious tree

We use the dataset extracted above to find the most parsimonious tree.


Using the enumerate_labelled_trees() function, we first generate all possible trees for the list of species.


Then, to obtain the most parsimonious tree, we use the function parsimonious_Sank().


This function takes all possible trees for the species, the genome for each species, the alphabet, and the cost matrix as input. 

In [10]:
cost=np.array([[0,1,1,1],[1,0,1,1],[1,1,0,1],[1,1,1,0]])
tree_list=enumerate_labelled_trees([1,2,3,4])
alphabet_lc=['a','c','g','t']
parsimonious, number_changes=parsimonious_Sank(tree_list,data,alphabet_lc,cost)
print("The number of changes are ",number_changes)
print('The most parsimonious trees are', parsimonious)

The number of changes are  8.0
The most parsimonious trees are [[[1, [3, 4]], 2], [1, [2, [3, 4]]], [[1, 2], [3, 4]], [[[1, 2], 3], 4], [[[1, 2], 4], 3]]
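
The count of 8 changes can be cross-checked by hand. With the unit cost matrix used in In [10], the simpler Fitch algorithm gives the same number of changes as Sankoff, so the sketch below swaps it in. It is not the implementation of parsimonious_Sank(): fitch_changes, total_changes and most_parsimonious are hypothetical names, and only one rooted tree per unrooted topology is compared.

```python
def fitch_changes(tree, column):
    """Number of changes at one site; leaf k carries column[k - 1]."""

    def visit(node):
        if not isinstance(node, list):
            return {column[node - 1]}, 0
        (left_set, left_cost), (right_set, right_cost) = visit(node[0]), visit(node[1])
        shared = left_set & right_set
        if shared:
            # The children agree on at least one state: no extra change.
            return shared, left_cost + right_cost
        # The children disagree: one extra change is needed.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def total_changes(tree, genomes):
    # Sites are treated independently, so the per-site counts are summed.
    return sum(fitch_changes(tree, column) for column in zip(*genomes))


def most_parsimonious(trees, genomes):
    scores = [total_changes(tree, genomes) for tree in trees]
    best = min(scores)
    return [tree for tree, s in zip(trees, scores) if s == best], best


genomes = [['g', 'g', 'g', 'a', 'g'],
           ['c', 'g', 'c', 't', 't'],
           ['c', 'a', 't', 'a', 'c'],
           ['c', 'a', 'g', 'a', 'a']]
candidates = [[[1, 2], [3, 4]], [[1, 3], [2, 4]], [[1, 4], [2, 3]]]
print([total_changes(tree, genomes) for tree in candidates])
print(most_parsimonious(candidates, genomes))
```

The three candidates need 8, 9 and 9 changes, so [[1, 2], [3, 4]] is the best of them, in agreement with the output above.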


The output contains several rooted trees, but unrooting them gives a single unrooted tree.


The function canonical() maps each rooted tree to a canonical representation of its unrooted tree, so all of these trees reduce to the same result.


The function canonical_rooted_list() applies this to a list of trees and removes the duplicates, leaving one representative per unrooted tree.

In [11]:
for tree in parsimonious:
    print(canonical(tree))

[1, [2, [3, 4]]]
[1, [2, [3, 4]]]
[1, [2, [3, 4]]]
[1, [2, [3, 4]]]
[1, [2, [3, 4]]]


In [12]:
c_tree=canonical_rooted_list(parsimonious)
c_tree

[[1, [2, [3, 4]]]]
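
One convention that reproduces these outputs is to unroot the tree, re-root it on the edge next to the smallest leaf, and order the children of every node by their smallest leaf. The sketch below implements that convention; it is a guess at the behaviour, not the code of canonical(), and canonical_sketch and canonical_list_sketch are hypothetical names. It assumes at least two leaves and leaf labels that are not tuples.

```python
from itertools import count


def canonical_sketch(tree):
    adjacency = {}   # node -> neighbours in the unrooted tree
    fresh = count()  # ids for internal nodes

    def link(a, b):
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    def build(node):
        if not isinstance(node, list):
            return node  # a leaf is its own name
        name = ("internal", next(fresh))
        for child in node:
            link(name, build(child))
        return name

    # Unroot: join the two children of the root directly.
    link(build(tree[0]), build(tree[1]))

    def rebuild(node, parent):
        """Return (subtree, smallest leaf in it) as seen from `parent`."""
        parts = [rebuild(n, node) for n in adjacency[node] if n != parent]
        if not parts:
            return node, node
        parts.sort(key=lambda part: part[1])
        return [subtree for subtree, _ in parts], parts[0][1]

    start = min(n for n in adjacency if not isinstance(n, tuple))
    (neighbour,) = adjacency[start]  # a leaf has exactly one neighbour
    return [start, rebuild(neighbour, start)[0]]


def canonical_list_sketch(trees):
    unique = []
    for tree in trees:
        form = canonical_sketch(tree)
        if form not in unique:
            unique.append(form)
    return unique


rooted = [[[1, [3, 4]], 2], [1, [2, [3, 4]]], [[1, 2], [3, 4]],
          [[[1, 2], 3], 4], [[[1, 2], 4], 3]]
print(canonical_list_sketch(rooted))
```

Applied to the five most parsimonious trees, the sketch returns the single tree [1, [2, [3, 4]]].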

### Generating all trees with distinct unrooted representations

For a given number of species, the function generate_unrooted_trees() generates one rooted representative for each distinct unrooted tree.

In [13]:
generate_unrooted_trees(4)

[[1, [[2, 4], 3]], [1, [[2, 3], 4]], [1, [2, [3, 4]]]]
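
The sizes of both lists follow standard counting formulas: there are (2n-3)!! rooted and (2n-5)!! unrooted binary trees on n labelled species. The helper names in this sketch are hypothetical; it only counts trees and does not generate them.

```python
def odd_double_factorial(k):
    """Product of the odd numbers 1, 3, ..., k (taken as 1 when k < 3)."""
    product = 1
    for factor in range(3, k + 1, 2):
        product *= factor
    return product


def count_rooted_trees(n):
    return odd_double_factorial(2 * n - 3)


def count_unrooted_trees(n):
    return odd_double_factorial(2 * n - 5)


for n in (4, 7):
    print(n, "species:", count_rooted_trees(n), "rooted,",
          count_unrooted_trees(n), "unrooted")
```

For 4 species this gives 15 rooted and 3 unrooted trees, matching the outputs above; for 7 species it gives 10395 and 945.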

### Replace tree labels

The replace_tree_labels_in_tree_list() function replaces the integer labels in every tree of a list: label i is replaced by the i-th entry of the given label list.

In [14]:
label_list=[ 'a', 'b', 'c', 'd']
replace_tree_labels_in_tree_list(c_tree,label_list)

[['a', ['b', ['c', 'd']]]]
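
The relabelling is a short recursion over the nested lists. This sketch assumes that label i maps to label_list[i-1], as in the example above; relabel_tree and relabel_tree_list are hypothetical names, not the functions from the modules imported at the top.

```python
def relabel_tree(tree, labels):
    """Replace every integer leaf i in `tree` by labels[i - 1]."""
    if isinstance(tree, list):
        return [relabel_tree(child, labels) for child in tree]
    return labels[tree - 1]


def relabel_tree_list(trees, labels):
    return [relabel_tree(tree, labels) for tree in trees]


print(relabel_tree_list([[1, [2, [3, 4]]]], ['a', 'b', 'c', 'd']))
```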

### Exercise

Try to find the most parsimonious trees for the following dataset.

In [15]:
mouse='ACCAAAAAAACATCCAAACACCAACCCCAGCCCTTACGCAATAGCCATACAAAGAATATTATACTACTAAAAACTCAAATTAACTCTTTAATCTTTATACAACATTCCACCAACCTATCCACACAAAAAAACTCATATTTATCTAAATACGAACTTCACACAACCTTAACACATAAACATACCCCAGCCCAACACCCTTCCACAAATCCTTAATATACGCACCATAAATAAC'
m=[i for i in mouse]
bovine='ACCAAACCTGTCCCCACCATCTAACACCAACCCACATATACAAGCTAAACCAAAAATACCATACAACCATAAATAAGACTAATCTATTAAAATAACCCATTACGATACAAAATCCCTTTCGTCTAGATACAAACCACAACACACAATTAATACACACCACAATTACAATACTAAACTCCCATCCCACCAAATCACCCTCCATCAAATCCACAAATTACACAACCATTAACCC'
b=[i for i in bovine]
gibbon='ACTATACCCACCCAACTCGACCTACACCAATCCCCACATAGCACACAGACCAACAACCTCCCACCTTCCATACCAAGCCCCGACTTTACCGCCAACGCACCTCATCAAAACATACCTACAACACAAACAAATGCCCCCCCACCCTCCTTCTTCAAGCCCACTAGACCATCCTACCTTCCTAGCACGCCAAGCTCTCTACCATCAAACGCACAACTTACACATACAGAACCAC'
g=[i for i in gibbon]
orang='ACCCCACCCGTCTACACCAGCCAACACCAACCCCCACCTACTATACCAACCAATAACCTCTCAACCCCTAAACCAAACACTATCCCCAAAACCAACACACTCTACCAAAATACACCCCCAATTCACATCCGCACACCCCCACCCCCCCTGCCCACGTCCATCCCATCACCCTCTCCTCCCAACACCCTAAGCCACCTTCCTCAAAATCCAAAACCCACACAACCGAAACAAC'
o=[i for i in orang]
gorilla='ACCCCATTTATCCATAAAAACCAACACCAACCCCCATCTAACACACAAACTAATGACCCCCCACCCTCAAAGCCAAACACCAACCCTATAATCAATACGCCTTATCAAAACACACCCCCAACATAAACCCACGCACCCCCACCCCTTCCGCCCATGCTCACCACATCATCTCTCCCCTTCAACACCTCAATCCACCTCCCCCCAAATACACAATTCACACAAACAATACCAC'
go=[i for i in gorilla]
chimp='ACCCCATCCACCCATACAAACCAACATTACCCTCCATCCAATATACAAACTAACAACCTCCCACTCTTCAGACCGAACACCAATCTCACAACCAACACGCCCCGTCAAAACACCCCTTCAGCACAAATTCATACACCCCTACCTTTCCTACCCACGTTCACCACATCATCCCCCCCTCTCAACATCTTGACTCGCCTCTCTCCAAACACACAATTCACGCAAACAACGCCAC'
ch=[i for i in chimp]
human='ACCCCACTCACCCATACAAACCAACACCACTCTCCACCTAATATACAAATTAATAACCTCCCACCTTCAGAACTGAACGCCAATCTCATAACCAACACACCCCATCAAAGCACCCCTCCAACACAAACCCGCACACCTCCACCCCCCTCGTCTACGCTTACCACGTCATCCCTCCCTCTCAACACCTTAACTCACCTTCTCCCAAACGCACAATTCGCACACACAACGCCAC'
h=[i for i in human]

In [16]:
primates_and_friends=[b,m,o,h,ch,go,g]

Step 1- Generate all possible trees for these 7 species using the enumerate_labelled_trees() function.

The labels follow the order of primates_and_friends: 1: bovine, 2: mouse, 3: orang, 4: human, 5: chimp, 6: gorilla, 7: gibbon

Step 2- Find the most parsimonious trees using the parsimonious_Sank() function.

You can use replace_tree_labels_in_tree_list() to rename the labels of the most parsimonious trees.