This is extended documentation for our file `diagnostics_and_testing.md`,
where we test and explore the divisive and agglomerative algorithms that
build trees from aligned nucleotide sequences. Obviously, to test the
accuracy of either algorithm we must compare its result to the "correct"
tree: the tree that actually corresponds to the input sequences. For cases
such as the `coiii.nex` dataset or the dataset from Press et al., the
correct trees are already known. However, we would ideally like to test
our algorithms against several datasets of varying parameters (such as
sequence length and number of taxa) to get more detailed information
about how they perform. We work as follows: we make a random tree and
generate a sequence alignment that corresponds to it using
`evolverRandomTree` from PAML. Then, given the sequence alignment, we
compute the corresponding divisive and agglomerative trees using
`infotree` and `agg_clustering`. We then find the Robinson-Foulds
distances between these computed trees and the tree we used to generate
the sequence alignment. We also compile data on the ultrametricity and
additivity of these computed trees and on the relationship between the
branch lengths of the original tree and those of the computed ones. We
repeat this procedure a large number of times and compute summary
statistics and plots of the resulting data.
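For concreteness, the Robinson-Foulds comparison boils down to counting the bipartitions ("splits") on which the two trees disagree. Here is a minimal Python sketch, independent of our R workflow, with the nontrivial splits of two five-taxon trees written out by hand rather than parsed from Newick:

```python
# Each unrooted tree on the taxa {A,...,E} is summarized by its set of
# nontrivial bipartitions; we record the smaller side of each split,
# which is unambiguous for five taxa.
tree1 = {frozenset("AB"), frozenset("DE")}   # ((A,B),C,(D,E))
tree2 = {frozenset("AB"), frozenset("CE")}   # ((A,B),D,(C,E))

def robinson_foulds(splits1, splits2):
    # The RF distance counts splits present in exactly one of the two trees.
    return len(splits1 ^ splits2)

print(robinson_foulds(tree1, tree2))  # → 2: the trees share only the AB split
```

In practice we compute RF distances with existing library routines; the sketch just makes explicit what is being counted.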
There are obviously two "generative" steps in the process: (1) generating the random trees and their sequence alignments and (2) computing the divisive and agglomerative trees from those alignments. We accomplish the latter on the CBCB cluster; but first we document how to do (1) and the pitfalls we should be wary of.
We must be very careful when using simulated alignments for evaluating correctness because, as mentioned on p. 186 of Yang 2006, a tree reconstruction algorithm can go wrong because of sampling error in the input sequence alignment. In other words, since the sequence alignment is generated from a stochastic model (such as JC69) on the original tree, the alignment only corresponds perfectly to the tree as the number of sites goes to infinity. However, since we always use a finite number of sites (modern computers don't yet work with an infinite amount of data), there is a chance that the simulated alignment is consistent with more than just one tree. Consequently the reconstructed tree may fail to match the original not because of a flaw in the algorithm but simply because the sequence alignment has too few sites to actually reflect its phylogeny. So before we even start testing our algorithms we need to make sure that the data we give them has low enough (though never zero) sampling error or noise.
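This sampling error is easy to see in a toy simulation. The Python sketch below (an illustration outside our R workflow, with a made-up true distance of 0.3) draws the number of differing sites between two sequences under JC69 and measures how the error of the usual JC69 distance estimate shrinks as the number of sites grows:

```python
import math
import random

random.seed(0)

def jc69_rmse(true_d, n_sites, n_reps=300):
    """Root-mean-square error of the JC69 distance estimate at n_sites sites."""
    # Expected fraction of differing sites at distance true_d under JC69.
    p = 0.75 * (1.0 - math.exp(-4.0 * true_d / 3.0))
    total = 0.0
    for _ in range(n_reps):
        # Each site differs independently with probability p.
        diffs = sum(random.random() < p for _ in range(n_sites))
        p_hat = min(diffs / n_sites, 0.7499)  # keep the log argument positive
        d_hat = -0.75 * math.log(1.0 - 4.0 * p_hat / 3.0)
        total += (d_hat - true_d) ** 2
    return math.sqrt(total / n_reps)

# The error shrinks roughly like 1/sqrt(n_sites): a short alignment can
# disagree with its generating tree through sampling noise alone.
print(jc69_rmse(0.3, 200), jc69_rmse(0.3, 10000))
```

The same qualitative behaviour holds for the full tree-reconstruction problem: too few sites and even a correct algorithm will miss the true topology.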
There are also parameters other than site number that can affect sampling error. For example, in the evolver method a large mutation rate (with all the other parameters fixed) could cause too many sites to get substituted, resulting in a sequence alignment that looks almost random. Conversely, the mutation rate could be too low, so that the bifurcations between different groups of taxa would not be accurately reflected in the simulated sequence alignment.
We explored essentially three methods to simulate sequences from randomly generated trees:

- (a) The `ms()` + `seqgen()` approach outlined on p. 5 of Wei-Chen Chen's `phyclust` manual.
- (b) Using `simSeq()` in `phangorn` in R.
- (c) Using `evolver` in PAML, outlined in Ziheng Yang's manual.
We decided to go with option (c) because of the ease with which random
tree generation and sequence generation are integrated in the
`evolverRandomTree` option. We also made some changes to the original
PAML code to write both the original tree and the generated sequence to
file. To get this going from scratch you need the following:
- Install and set up PAML from the manual. The instructions are in Chapter 2.
- Open `evolver.c` in `/src` and make the following changes to also store the random tree generated in `evolverRandomTree`:
  - On line 890, change `fixtree = 1` to `fixtree = 0` in `void Simulate()`.
  - On line 887, add `*fseqtree` between `FILE` and `*fin` in `void Simulate()`.
  - After line 1121, add `fseqtree = gfopen("mctrees.nex", "w");`.
  - On lines 1198-1200, change

    ```c
    fprintf(fseq,"\nbegin tree;\n tree true_tree = [&U] ");
    OutTreeN(fseq,1,1); fputs(";\n",fseq);
    fprintf(fseq,"end;\n\n");
    ```

    to

    ```c
    fprintf(fseqtree,"\nbegin trees;\n tree true_tree = [&R] ");
    OutTreeN(fseqtree,1,1); fputs(";\n",fseqtree);
    fprintf(fseqtree,"end;\n\n");
    ```

  - On line 1257, add `fclose(fseqtree);`.
- Open Terminal and change the directory to `src`. Then run `gcc -O3 -o evolverRandomTree evolver.c tools.c` if you're using `gcc` (or the corresponding command in `Readme.txt`).
Now `evolverRandomTree` is ready. To make a random tree and generate
a sequence from it, run `...src/evolverRandomTree 5 #PATHTODATFILE`. A
sample `dat` file is given in `MCbaseRtree.dat`. In this case
`#PATHTODATFILE` will be `paml4.8/MCbaseRtree.dat`. We have also written
an R wrapper, `who_dat()`, which writes a `.dat` file based on the
model parameters you give it. Here is a sample, where we generate 100
blocks of sequence data for 16 species, where each block corresponds to a
tree randomly generated under the TN93 model.
```r
library(ape)        # read.nexus(), write.tree()
library(TreeTools)  # ReadCharacters()

setwd("/Users/shashanksule/Documents/info_theoretic_phylo/")
n <- 100
trees <- vector("list", n)
sequences <- vector("list", n)
seeds <- vector("list", n)
for(i in 1:n){
  # Initialize the .dat file and store the seed
  seeds[[i]] <- who_dat(seqs = 16,
                        sites = 1000,
                        model = 6,
                        parameters = "5 5",
                        gamma = "0.5 4",
                        mutation = 3.5,
                        equilibrium = "0.25 0.28 0.34 0.13",
                        spit_seed = TRUE)
  # Simulate the trees and the sequences
  system("paml4.8/src/evolverRandomTree 5 paml4.8/MCbaseRTree.dat")
  trees[[i]] <- write.tree(read.nexus("mctrees.nex"))
  sequences[[i]] <- ReadCharacters("mc.nex")
}
```
Unfortunately, the cost of the divisive algorithm `infotree` grows
exponentially with the number of OTUs (it searches over all possible
partitions), so the problem quickly becomes too much for our little
laptops to handle. But worry not, since we have been conveniently
allocated some space on the CBCB cluster! If you `ssh` into
`#YOURUSERNAME@cbcbsub00.umiacs.umd.edu` with your password, you'll have
entered the headnode of the cluster, ready to submit jobs to the most
powerful computing henchmen in the world (for those of you who take this
stuff too literally: we're just kidding).
There are some excellent resources online on how to use the cluster and submit jobs to it (including UMD-specific resources). Our favourites (and the ones sufficient for this project) are:

- Junaid Merchant's introductory workshop on HPCs at UMD.
- Using SLURM with R: this one is especially helpful as it has a simple end-to-end example of a computing procedure that uses parallel computing.
- Passing variables between different scripts: bash to R and referencing a bash variable.
The idea is to compute, say, 100 divisive trees using the fastest possible parallel procedure. There are a few things to keep in mind:

- One may think parallelizing using `parallel` in R for every command is the way to go. But that isn't necessarily true; for example, we may want to compute 10 highly intensive parallelizable tasks independently and have 10 cores available. We shouldn't necessarily submit each task to one core. It may be more efficient to do the tasks in sequence but dedicate all 10 cores to each task at a time (or do two tasks at once with 5 cores per task, and so on). We face this sort of problem in `infotree` because it computes `max_info` over all possible partitions. Now we have a choice: either compute over the partitions simultaneously while dedicating fewer cores to `max_info`, or compute over the partitions sequentially while dedicating all our cores to `max_info`. We find that for a low number of taxa (<10) it is actually beneficial to go for the latter strategy.
- We can also choose to split a task over multiple nodes (like computing the information gain over batches of partitions), but this has to be done explicitly in a bash script or via `slurmR`, which also communicates with the nodes available in the cluster.
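The scheduling trade-off in the first point can be sketched with a toy example (in Python, with threads standing in for cores, and with `max_info` replaced by a hypothetical scoring function rather than our actual implementation). Both schedules visit the same partitions and must pick the same winner, so the choice between them is purely about efficiency:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

taxa = list(range(8))

# Hypothetical stand-in for the max_info score of one side of a bipartition.
def max_info(block):
    return sum(x * x for x in block) % 101

# Every bipartition of the taxa, represented by the block containing taxon 0.
splits = [tuple([0] + list(rest)) for r in range(len(taxa))
          for rest in combinations(taxa[1:], r)]

# Strategy 1: score the partitions simultaneously, spread over the workers.
with ThreadPoolExecutor(max_workers=4) as ex:
    scores = list(ex.map(max_info, splits))
best_simultaneous = max(range(len(splits)), key=lambda i: scores[i])

# Strategy 2: walk the partitions sequentially; in the real code all cores
# would instead be devoted to parallelizing the inside of max_info.
best_sequential = max(range(len(splits)), key=lambda i: max_info(splits[i]))

# Both schedules agree on the winning partition.
assert splits[best_simultaneous] == splits[best_sequential]
```

Which schedule is faster depends on how well the inner computation parallelizes relative to the number of partitions, which is why we benchmark both.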
The workflow is as follows:

- Upload the data to the cluster in a directory `sequence_data`, along with an R script `treecomputer.R` that computes a divisive tree from a sequence alignment, and a directory `tree_data` that stores the trees. The R script should be able to take the file name of the sequence alignment and write its output (a divisive tree) to file.
- Write a bash script `process.sh` to execute the R script, along with instructions in `SLURM` over exactly how to compute over the data. Note that this is a quick and dirty way to do it; there are more elegant solutions possible, since R scripts don't actually need to be wrapped in bash, unlike MATLAB. This script is where we pass the file name to `treecomputer.R`.
- Execute `sbatch process.sh` and wait for everything to finish!
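For reference, a minimal `process.sh` along these lines might look like the sketch below. The job-array bounds, resource requests, and the `module load R` line are assumptions that must be adapted to the CBCB cluster's actual configuration, and we assume `treecomputer.R` reads its input and output paths from the command line:

```shell
#!/bin/bash
#SBATCH --job-name=divtrees
#SBATCH --array=1-100            # one array task per sequence alignment
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # cores handed to each infotree run
#SBATCH --time=02:00:00
#SBATCH --mem=4gb

# The module name is a guess; check `module avail` on the headnode.
module load R

INFILE="sequence_data/sequence${SLURM_ARRAY_TASK_ID}.nex"
OUTFILE="tree_data/tree${SLURM_ARRAY_TASK_ID}.txt"

# treecomputer.R is assumed to take the input and output paths as arguments.
Rscript treecomputer.R "$INFILE" "$OUTFILE"
```

A job array like this lets SLURM schedule the 100 alignments independently, so each `infotree` run gets its own dedicated cores.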
So first we write the data produced via the evolver to the disk:
```r
getwd()
tree_names <- paste(getwd(), "/TN93_Jul21/Trees/tree", 1:100, ".txt", sep = "")
seed_names <- paste(getwd(), "/TN93_Jul21/Seeds/seed", 1:100, ".txt", sep = "")
seqs_names <- paste(getwd(), "/TN93_Jul21/Sequences/sequence", 1:100, ".nex", sep = "")
mapply(function(x,y) write(x, file = y), trees, tree_names)
mapply(function(x,y) write(x, file = y), seeds, seed_names)
mapply(function(x,y) write.nexus.data(x, file = y), sequences, seqs_names)
```
Navigate to the `info_theoretic_phylo` directory in your terminal and
execute the following:

- SSH into your directory on the headnode by executing `ssh #username@cbcbsub00.umiacs.umd.edu` and create two directories: `sequence_data` and `tree_data`. You can do this with the command `mkdir #directoryname`.
- Enter `exit` and go back to the `info_theoretic_phylo` directory on the local machine. Secure-copy `utilities.R`, `treecomputer.R`, and `process.sh` to your personal directory on the head node, and the `/#trialname/Sequences` directory to the `.../sequence_data/` directory. You can do the former by executing `scp #filepath #username@cbcbsub00.umiacs.umd.edu:/cbcbhomes/#username` and the latter by executing `scp -r #directorypath #targetpath`.
- SSH back into the headnode (step 1). Then execute `sbatch process.sh`.
Done! You can check the status of the computation by executing `squeue`
on the head node.