-
Notifications
You must be signed in to change notification settings - Fork 16
Home
(For full explanation of the pipeline, see the PCOC paper).
PCOC is a convergent substitution detection tool with two main facets:
-
- Conditions to use the PCOC detection pipeline:
- you already have an empirical dataset, i.e. (at least) an amino-acid alignment and a corresponding gene tree
- Conditions to use the PCOC detection pipeline:
-
-
Conditions to use the PCOC simulation pipeline:
-
you want to start a project on convergence and you wonder how many samples are necessary to have sufficient power to detect convergent sites. With the PCOC simulation pipeline, you can simulate if detection of convergence is possible with a dataset of a given size.
-
you want to test the underlying model of PCOC and compare it to other algorithms used to detect convergence
-
-
Table of contents:
- I. Using PCOC to detect convergence in an empirical dataset
- II. Using PCOC to simulate a dataset and compare methods for detecting convergent evolution on it
The user needs:
- an amino-acid alignment
- a gene tree with values for branch lengths (these values are used by the program)
Once you have these data, you can install PCOC.
If you don't have your data yet, you can use the dataset from (Besnard et al., 2009) used in the PCOC paper and courteously provided by the authors:
Besnard, G., Muasya, A. M., Russier, F., Roalson, E. H., Salamin, N., and Christin, P.-A. 2009. Phylogenomics of C4 Photosynthesis in Sedges (Cyperaceae): Multiple Appearances and Genetic Convergence. Molecular Biology and Evolution, 26(8): 1909–1919, https://doi.org/10.1093/molbev/msp103
#get the PEPC protein alignment for sedges (plant species at C3/C4 transition), and rename this alignment with shorter name
wget https://raw.githubusercontent.com/CarineRey/pcoc/master/data/det/cyp_coding.aa.coor_mays.fa
mv cyp_coding.aa.coor_mays.fa ali.fa
#get the corresponding tree, rename it and move it to a new folder
mkdir -p tree_dir
wget https://raw.githubusercontent.com/CarineRey/pcoc/master/data/det/cyp_coding.phy_phyml_tree.txt -O tree_dir/tree.nw
Of note: the gene tree needs to have the exact same labels (same number of labels, same names) as the alignment.
PCOC relies on the Bpp
suite (https://github.com/BioPP/bppsuite) which is in constant development. In order to avoid installation and compilation of the latest version on your machine (this can take a long time), we suggest you to use Docker
. Of course, you can also use PCOC without Docker but Docker
is the easiest way to use the PCOC toolkit locally. Docker
will create a local environment on your computer that will contain PCOC dependencies (Bpp
and python
with some modules (ete3
and Biopython
)).They will all be packaged within the PCOC Docker
image.
If you don't have Docker on your machine, get it here first. (Be aware that installation might differ if you're a Linux, a Mac or a Windows user.)
To download or update the docker image you have to type the following command line:
docker pull carinerey/pcoc
Note: Mac users may first need to start Docker using the Docker application (if it's not running in background). If not, you could get the error message Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
.
Then, to use PCOC through a docker container (containing the PCOC image), you have to type:
# run as user (to have user permission on output files) ##NOT WORKING
docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc [SOME PCOC TOOL] [SOME PCOC OPTIONS]
What does this command line mean?
-
docker run [container options] [image] [cmd]
is the command to launch a[cmd]
in a docker container with the specified[container options]
containing a given[image]
. -
-e LOCAL_USER_ID=`id -u $USER`
: allows you to get read and write permissions on the ouput files (if you remove this part of the command line, files will belong to the root user). -
-v $PWD:$PWD
: allows sharing the current directory between your computer and the container. -
--rm
: allows automaticlly removing the container when the job is finished (this limits disk space usage) -
-e CWD=$PWD
: allows specifying the current working directory and use relative path from this directory (here$PWD
) -
carinerey/pcoc
is the name of the docker image hosted on DockerHub -
[SOME PCOC TOOL] [SOME PCOC OPTIONS]
is the command run in the container (see below)
Below, we will shorten this command line by passing all the Docker options into an environment variable.
CMD_PCOC_DOCKER="docker run -e LOCAL_USER_ID=`id -u $USER` --rm -v $PWD:$PWD -e CWD=$PWD carinerey/pcoc"
Thus, to run PCOC, simply type:
$CMD_PCOC_DOCKER [SOME PCOC TOOL] [SOME PCOC OPTIONS]
Based on biological relevance of the studied taxon, each user determines manually where the convergent transitions occurred, and which nodes are assumed to be in the convergent phenotypic state. This is done with the help of the pcoc_num_tree.py
script.
Note that PCOC aims to detect sites with convergent substitutions in the case of a given set of transitions but not to identify the set of convergent transitions.
$CMD_PCOC_DOCKER pcoc_num_tree.py -t tree_dir/tree.nw -o num_tree.pdf ##takes a few seconds; replace file name of tree.nw with your input tree
ii. In the 2nd step, you use the numbered tree (output of the 1st step) to write a string corresponding to the convergent scenario you are interested in.
The string looks like 1,2,3/55,56/67
. The syntax is defined as such:
-
a clade of nodes corresponding to a same convergent transition is written as a list of node IDs separated by a
,
. The node ID corresponding to the initial convergent transition must be the first number (here,1
,55
and67
). -
each independent set of nodes corresponding to independent convergent clades are separated by a
/
.
In the example dataset used in the PCOC paper, species with the convergent phenotypes are (see the figure in num_tree.pdf for the node IDs):
1st convergent clade | 2nd convergent clade | 3rd convergent clade | 4th convergent clade | 5th convergent clade |
---|---|---|---|---|
Ele.bald (node 0) | Bulbostyl (38) | Fimb.li2 (45) | Cyp.capi (78) | Rhy.rubr (133) |
Ele.bal2 (node 1) | all internal nodes in this group | Fimb.di2 (46) | Volkiell (79) | Rhy.glob (134) |
Ele.bal4 (node 3) | Fimb.fer (47) | Cyp.ust2 (80) | Rhy.glo2 (135) | |
Ele.vivi (node 5) | all internal nodes in this group | Remirea (81) | all internal nodes in this group | |
Ele.vivA (node 6) | Cyp.iria (82) | |||
all internal nodes in this group (nodes 2, 4, 7 and 8) | Killinga (83) | |||
Pycreus (84) | ||||
Cyp.long (86) | ||||
Cyp.rotu (87) | ||||
Cyp.papy (89) | ||||
Cyp.ustu (90) | ||||
all internal nodes in this group |
so the string corresponding to this scenario is:
8,0,1,2,3,4,5,6,7/49,45,46,47,48/38/98,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97/137,133,134,135,136
.
(Remember that the common ancestor of each clade with a convergent transition has to be the first in the list.)
iii. (Optional) The 3rd step is to visualize the final tree with colored labeled branches (to ascertain that no branch has been forgotten during 2nd step)
# this takes a few seconds
scenario="8,0,1,2,3,4,5,6,7/49,45,46,47,48/38/98,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97/137,133,134,135,136"
$CMD_PCOC_DOCKER pcoc_num_tree.py -t tree_dir/tree.nw -o num_tree.pdf -m $scenario
This is done by the pcoc_det.py
script. Please see the man page for the full list of options: $CMD_PCOC_DOCKER pcoc_det.py -h
To run the detection tool of the PCOC toolkit, you need the amino-acid alignment of the studied gene, the "species" tree with values for branch lengths, and the string corresponding to the convergent scenario (obtained from the previous section).
With the option -f
, you can fix the threshold at which a position is retained if its posterior probability is above this threshold for one of the 3 implemented models (PCOC, PC or OC).
# here we set the threshold to 0.8
# this takes one minute maximum
$CMD_PCOC_DOCKER pcoc_det.py -t tree_dir/tree.nw -aa ali.fa -o output_pcoc_det -m $scenario -f 0.8
Results are in the folder output_pcoc_det/RUN_yyyymmdd_hhmmss/
where yyyymmdd_hhmmss
corresponds to the date and time of the analysis.
You will find:
-
ali.filtered_results.tsv
: a tabular file with posterior probabilities for each position with a posterior probability above the threshold of one of the PCOC models (PCOC, PC and OC; see -f/-f_pcoc/-f_pc/-f_oc) -
ali.results.tsv
: a tabular file with posterior probabilities for all positions according the PCOC models (PCOC, PC and OC) -
Run_metadata.tsv
: a tabular file with metadata about this run -
pcoc_det.log
: a log file
Of note: if you want to get the alignment of filtered positions only, add the --no_cleanup_fasta
option to your command. This will create a directory named fasta
with these particular positions for each model (PCOC, PC and OC) and the union of them. You will most probably only use the output from the PCOC model, but the PC and OC outputs are also provided.
-
You can get a graphical output for all positions using
--plot_complete_ali
(default is pdf output, add--svg
for svg output)$CMD_PCOC_DOCKER pcoc_det.py -t tree_dir/tree.nw -aa ali.fa -o output_pcoc_det -m $scenario --plot_complete_ali
You will additionally find in the output directory
ali_plot_complete.pdf
. This file contains a plot of the tree, the complete alignment and the posterior probability according to PCOC models (only if--plot_complete_ali
has been used) -
You can get a graphical output for the filtered positions only using
--plot
and reorder them in function of their posterior probability with--reorder
$CMD_PCOC_DOCKER pcoc_det.py -t tree_dir/tree.nw -aa ali.fa -o output_pcoc_det -m $scenario --plot --reorder
You will additionnaly find in the output directory:
-
ali_plot_filtered_PCOC.pdf
: the same plot asali_plot_complete.pdf
but filtered for the positions with a posterior probability according to the PCOC model above-f
threshold (only if--plot
) -
ali_plot_filtered_PC.pdf
: the same plot as ali_plot_filtered_PCOC.pdf but for the PC model -
ali_plot_filtered_OC.pdf
: the same plot as ali_plot_filtered_ali_PCOC.pdf but for the OC model -
ali_plot_filtered_union.pdf
: the same plot with the union of all positions wit the different models
-
-
You can also add user defined sites in the output and highlight them
Based on previous work, you can have a list of candidate sites that are susceptible to be under convergent evolution. If you want to visualize how these sites go through the PCOC detection pipeline, you can choose to specifically display them in the output with the option
-ph
.In the example dataset, this is the case for 16 sites (based on Besnard et al. (2009)).
candidates="505,540,572,573,623,665,731,733,749,751,761,770,780,810,839,906"
You can visualize them by typing:
$CMD_PCOC_DOCKER pcoc_det.py -t tree_dir/tree.nw -aa ali.fa -o output_pcoc_det -m $scenario -f 0.8 --plot --reorder -ph $candidates
The sites with a
*
in the image output correspond to the candidate sites (set with-ph
).
If you don't have PCOC yet, see the installation procedure in the first section
PCOC allows users to simulate a dataset and compare PCOC with other convergence detection tools (using the simulated dataset). This PCOC feature can be used in 2 main situations:
-
you want to start a project on convergence and you wonder how many samples are necessary to have enough power to detect convergent sites. With the PCOC simulation pipeline, you can simulate if detection of convergence is possible with a dataset of a given size
-
you want to test the underlying model of PCOC and compare it to other algorithms used to detect convergence
In both cases, the options will be exactly the same. The only thing that differs between these approaches is the definition of the convergence scenario.
(i) In the first situation, you provide a candidate convergence scenario (relevant to the biology of the studied species). In practice, you have a species tree with given convergent events and you want to test the power of convergent detection tools on this particular topology.
(ii) In the second situation, the convergence scenario will be randomly generated. You have one (or more) species tree(s) and you want to compare the power of PCOC with other convergence detection tools under random convergent scenarios on this (these) topologies.
The PCOC simulation and benchmark part is done by the pcoc_sim.py
script.
(see man page for full list of options: $CMD_PCOC_DOCKER pcoc_sim.py -h
)
If you don't have a species tree yet, you can use the dataset from Besnard et al., 2009 used in the PCOC paper and courteously provided by the authors:
#get the corresponding tree, rename it and move it to a new folder mkdir -p tree_dir wget https://raw.githubusercontent.com/CarineRey/pcoc/master/data/det/cyp_coding.phy_phyml_tree.txt -P tree_dir mv tree_dir/cyp_coding.phy_phyml_tree.txt tree_dir/tree.nw
The other species trees used in the PCOC paper are available here: https://github.com/CarineRey/pcoc/tree/master/data/sim/tree_dir
A typical PCOC simulation and detection pipeline is composed of three main steps:
- A. Define the convergent scenario(s)
- B. Simulate the data accordingly (this will be different under different scenarios)
- C. Test for convergence in the simulated data using different methods (including PCOC)
Thus, PCOC command-lines will be divided into "blocks" that correspond to the different steps in this pipeline. Prior to each analysis, you will have to define the directory containing your tree(s) (-td
option) and the output directory (-o
option).
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim [REST OF OPTIONS]
Below, we'll go through all the three "steps"... step by step.
In both situations (user-defined or random scenario), PCOC will randomly choose ONE number of convergent events between a minimum (c_min
) and a maximum number (c_max
). If you use -c
, the number of convergent events will be fixed to -c
. (c
can still be used in combination with c_max
to pick -c
convergent events among -c_max
events.)
To run PCOC multiple times and test different scenarios, use the -n_sc
option. For instance, to repeat the test 10 times:
pcoc_sim.py [...] -n_sc 10 [...]
-
If you have a given convergent scenario, you have to set the scenario (the so-called
$scenario
variable used above) using the same syntax defined in the previous section.The
c_max
will be fixed as the total number of convergent events in the given scenario and the defaultc_min
is 2 (except if you set--c_min
to other value > 2). Note thatc_min
has to be above 2 because convergence has to happen in at least 2 clades. Likewise,c_max
has to be below the species number.For instance, in the Besnard et al. (2009) dataset,
c_max
= 5. If you want to force PCOC to test these 5 events, use-c 5
(If you setc_max 5
, PCOC will randomly choose a number of events between 2 and 5 among the 5 possible).$scenario="8,0,1,2,3,4,5,6,7/49,45,46,47,48/38/98,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97/137,133,134,135,136" pcoc_sim.py [...] -m $scenario [...]
-
If you want to test random convergent scenarios:
You have to fix the
c_max
and thec_min
of convergent event and PCOC will randomly place them in the tree. ( You can use the--plot_event_repartition
option to see the distribution in the tree).Note: To be fair when you compare 2 and 7 events in a topology, PCOC will first draw up the maximum number of convergent events and then randomly choose a number between those two (between the
c_min
and thec_max
– except if you use-c
) . In fact, it is more simple to place 2 events closest to the root of the tree than 7.pcoc_sim.py [...] -c_min 2 -c_max 7 [...] #this is overwritten by the -m option
PCOC can not modify the species tree topology but can modify the branch lengths. You can:
- multiply all the branch lengths by a factor X using
--flg X
- change all the branch length by a same value X using
--bl_new X
Once PCOC has placed the convergent events on the species tree, it will use the tree to generate:
-
an alignment corresponding to this convergent scenario
-
another alignment corresponding to a non-convergent scenario (the same tree but without convergent evolution). This allows for measuring the rate of false-positives.
Of note: the length of the alignments can be set with -n_sites
.
Sequence generation is based on the evolution of sequences from the root to the leaves of the tree according to amino-acid profiles (i. e. sets of amino-acid frequencies). PCOC uses the C60 profiles (containing 60 amino-acid profiles) defined in (Si Quang et al, 2009). Among those 60 profiles, 1 couple is picked and the first profile is qualified as "ancestor" and the second as "convergent".
In the convergent scenario, PCOC "traverses" the tree starting from the root with the ancestor profile. Once it reaches a node with a convergent transition, PCOC switches from the ancestor to the convergent profile. The other nodes keep the ancestral profile. In the non-convergent scenario, the ancestor profile is applied to all nodes.
Alternatively, you can use the C10 profiles (containing 10 amino-acid profiles) instead of C60 using -CATX_sim 10
.
By default, only one couple of profiles is used for a scenario but you can use the -nb_sampled_couple
to increase this number. This will generate multiple sets of independent alignments corresponding to the same convergent scenario.
To increase the realism of the simulation, you can add some noise in the simulated data. You can either:
-
add noise in the alignment by using different profiles in the non-convergent part of the tree (see PCOC paper) using
--ali_noise
-
add noise in the scenario by modifying the branch lengths after simulating the data using
--bl_noise
. This will allow using a different tree for the simulation and the detection part by mimicking mis-estimated branch lengths.
Finally, pcoc_sim.py
can use and compare 3 methods to detect convergence in the simulated dataset (see PCOC paper for more details):
- the PCOC models using
-pcoc
- a method detecting identical substitutions in convergent clades using
-ident
(first defined in Zhang and Kumar (1997)) - a topological method using
-topo
(defined in Parker et al. (2013))
For instance, to compare the PCOC method and the topological method only, use
pcoc_sim.py [...] -pcoc -topo #do not include -ident
By default, C10 amino-acid profiles are used in the estimation part instead of the C60 amino-acid profiles in the simulation part. Alternatively, you can use the C60 profiles instead of C10 using -CATX_est 60
. Note that -CATX_est
is different from -CATX_sim
used in the simulation part.
To have an example output, you can use the following command line adapted from the example (see below).
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 1 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -c 5 --pcoc --ident --topo
Results are in the folder output_pcoc_sim/RUN_yyyymmdd_hhmmss/
where yyyymmdd_hhmmss
corresponds to the date and time of the analysis.
You will find:
-
-
run_metadata.tsv
: a tabular file containing a summary of the used options:
RunID 20180111_174413 Input trees tree.nw Run PCOC method Yes Run identical method Yes Run topological method Yes Number of simulated sites 50 [...] [...] -
-
-
tree_metadata.tsv
: a correspondence table between the result directory name(s)Tree_i
and the tree file name(s). (This contain also the execution time). In this tutorial we have only used 1 tree but this table is useful if use more input trees (you would then haveTree_1
,Tree_2
, etc.).
Tree_# Tree_filename Execution time Tree_1 tree.nw 32.9199180603 [...] -
-
-
pcoc_sim.log
: a log file
-
-
- a result directory
Tree_i
for each input tree containing:
-
a.
BenchmarkResults.tsv
: a tabular file containing metadata (columns "RunID", "InputTree", ..., "DistanceSimuCouple") and performance statistics ("FN", ... , "MCC") for each method ("Method") and for each threshold ("Threshold") used, for each simulated dataset. "NumberOfConvergentEvents" and "NumberOfSites" are other metadata related to the scenario and to the alignment.RunID InputTree ScenarioID SimuCoupleID C1 C2 DistanceSimuCouple Method Threshold FN FP TN TP Sensitivity Specificity MCC NumberOfConvergentEvents NumberOfSites 20180111_174413 tree.nw Scenario_1 A53_C44 53 44 0.424326507329 PC 0.7 26.0 2.0 98.0 74.0 0.74 0.98 0.7416 5 100 20180111_174413 tree.nw Scenario_1 A53_C44 53 44 0.424326507329 PCOC 0.7 0.0 1.0 99.0 100.0 1.0 0.99 0.9900 5 100 20180111_174413 tree.nw Scenario_1 A53_C44 53 44 0.424326507329 OC 0.7 0.0 2.0 98.0 100.0 1.0 0.98 0.98019 5 100 20180111_174413 tree.nw Scenario_1 A53_C44 53 44 0.424326507329 Topological 0.7 16.0 3.0 97.0 84.0 0.84 0.97 0.81693 5 100 [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] [...] 20180111_174413 tree.nw Scenario_1 A53_C44 53 44 0.424326507329 Identical NA 100.0 0.0 100.0 0.0 0.0 1.0 0.0 5 100 -
b.
Metadata_Scenarios.tsv
: a tabular file containing (a lot of) metadata for each scenario (Number of convergent transitions, Number of leaves, ... ). Column names should be explicit enough.ScenarioID numberOfLeaves Execution_time InputTree [...] Scenario_1 41 32.8589060307 tree.nw [...] [...] -
c. a directory
nw_trees
containing NHX formatted trees annotated with convergent transitions for each scenario. You will find 2 prefix:-
tree
corresponding to the input topology tree without modification -
tree_conv
corresponding to the topology tree with modifications link to the topological method
Each node is labeled by a string starting with
&&NHX:ND=[NODE NUMBER]:T=True/False:C=True/False
.T
refers to a node with convergent transition andC
to nodes with a convergent profile. In case you used--ali_noise
option, a new field in node labels is added:
Cz=False/[NUMBER OF NOISY PROFILE]
(in the tree with prefixannotated_
). -
- a result directory
In this section, some command-lines used in the PCOC paper will be used as examples but option values have been changed to reduce the computational time.
The 2nd figure of the paper addresses the impact of:
- the number of convergent events on the ability to detect convergence.
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 180 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60
- the impact of branch lengths on the ability to detect convergence.
bl_l="0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1"
for bl in $bl_l
do
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 32 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c 5 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60 -bl_new $bl
done
Some figures of the supplementary data address the robustness of the PCOC models towards some noisy simulated dataset.
- noise in the alignment (for example Supplementary figure S8)
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 180 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60 --ali_noise
- noise add in the branch length after simulate the alignment (for example Supplementary figure S9)
$CMD_PCOC_DOCKER pcoc_sim.py -td tree_dir -o output_pcoc_sim -n_sc 180 -nb_sampled_couple 1 -n_sites 50 -c_min 2 -c_max 7 -cpu 8 --pcoc --ident --topo -CATX_est 10 -CATX_sim 60 --bl_noise