Home
ProtTrace is a simulation-based approach to assess, for a given protein (the seed), over what evolutionary distances its orthologs can be found by means of significant sequence similarity. It thereby helps to differentiate between the true absence of an ortholog in a given species and its non-detection due to limited search sensitivity. Once the user has specified a seed protein whose traceability should be assessed, a standard ProtTrace analysis includes the following steps:
- Compilation of an orthologous group for the seed protein
- Standard: Querying existing ortholog collections (we use OMA groups as a default)
- Expert: Compiling a custom ortholog collection, e.g. by running a targeted ortholog search using HaMStR OneSeq
- Search for Pfam domains in the seed protein's sequence. This analysis requires a local installation of the Pfam database and of the HMMER package. //Note, this information is later used for inferring position-specific constraints on the evolutionary process.//
- Inference of the seed protein's evolutionary parameters. Here, ProtTrace first uses IQ-TREE to compute both the pairwise maximum likelihood distances between the orthologs in the training data and a maximum likelihood sequence tree. Using this information, ProtTrace then determines the following parameters:
- Rate of insertions and deletions
- Length distribution of insertions and deletions
- Substitution rate scaling factor κ
- Simulation of the seed protein with REvolver and determining the traceability curve
- Evolutionary sequence change is simulated in steps of 0.1 substitutions per site up to a total distance of 7.5 substitutions per site.
- Subsequent to each simulation step, the simulated sequence is used as a query in a BlastP search against the entire gene set of the seed species. This step serves to assess whether there is still a significant local sequence similarity between the seed sequence and its evolved instance. The search is a success if the seed protein is among the top five hits.
- Repeat the two preceding steps (simulation and BlastP search) 100 times to obtain, for each distance, the fraction of successes.
- Compute the traceability curve
- In an optional step, the traceability results can be depicted on a phylogenetic tree.
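The simulation loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ProtTrace's implementation: `toy_search` is a hypothetical stand-in for the REvolver simulation and the BlastP search, and it treats every distance independently, whereas ProtTrace evolves a sequence step by step. Only the step size, the maximum distance, the number of repetitions, and the top-five criterion are taken from the text.

```python
import math
import random

STEP = 0.1          # substitutions per site per simulation step (from the text)
MAX_DISTANCE = 7.5  # total simulated distance (from the text)
REPETITIONS = 100   # number of simulation runs (from the text)
TOP_HITS = 5        # the seed must be among the top five hits (from the text)


def seed_in_top_hits(hit_ids, seed_id, top_n=TOP_HITS):
    """Return True if seed_id is among the first top_n entries of a ranked hit list."""
    return seed_id in hit_ids[:top_n]


def toy_search(distance, seed_id, rng):
    """Hypothetical stand-in for REvolver + BlastP; returns a ranked list of hit ids.

    The seed is reported as the best hit with a probability that decays with
    distance. This decay model is an assumption made purely for illustration.
    """
    decoys = ["decoy%d" % i for i in range(1, 6)]
    if rng.random() < math.exp(-distance):
        return [seed_id] + decoys
    return decoys


def traceability_curve(seed_id, search=toy_search, rng=None):
    """Return a list of (distance, fraction of successful searches) pairs."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    n_steps = int(round(MAX_DISTANCE / STEP))
    curve = []
    for i in range(1, n_steps + 1):
        distance = round(i * STEP, 1)
        successes = sum(
            seed_in_top_hits(search(distance, seed_id, rng), seed_id)
            for _ in range(REPETITIONS)
        )
        curve.append((distance, successes / float(REPETITIONS)))
    return curve
```

Calling `traceability_curve("YEAST05874")` returns one pair per simulated distance; replacing `toy_search` with a function that wraps a real simulation and search would keep the rest of the loop unchanged.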
ProtTrace requires some accessory software that needs to be installed along with the ProtTrace package. In almost all cases this is standard software for evolutionary sequence analysis, and we expect that most of it is already installed and executable on your system. If not, please find below detailed instructions on what software is needed and how to install it. Most of the software is available via the BIOCONDA channel of the Conda package manager, so installation on Linux and on MacOS is straightforward.
ProtTrace is written in Java and runs platform-independently. However, some accessory software is limited to Linux / MacOS, so we recommend installing ProtTrace on either Linux or MacOS.
The ProtTrace package contains scripts written in different languages. In order to run ProtTrace you need the following resources:
- Python v2.7.13 or higher. //Note, ProtTrace will not run under Python 3//
- Also install the DendroPy module (https://www.dendropy.org/); this can be done via conda.
- Perl v5 or higher including the following modules
- Getopt::Long
- List::Util
- LWP::Simple
- Java v1.7 or higher
- R v3 or higher
| Program name | Version | Description | Mandatory | BioConda |
|---|---|---|---|---|
| MAFFT | v6 or higher | Multiple sequence alignment | yes | |
| Blast | v2.7 or higher | Sequence similarity based search | yes | yes |
| HMMER | 3.2 or higher | Sequence similarity based search using Hidden Markov Models | yes | yes |
| IQ-TREE | 1.6.7.1 or higher | Phylogenetic tree reconstruction | yes | yes |
| HaMStR-OneSeq | v1 or higher | Targeted ortholog search | no | no |
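Before running the configuration script, it can help to check which of the programs listed above are already on your `PATH`. The following is a minimal sketch, independent of ProtTrace's own dependency check. The executable names (`mafft`, `blastp`, `hmmscan`, `iqtree`, `Rscript`) are assumptions and may differ on your system, and the helper itself needs Python 3 because it uses `shutil.which`.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
EXECUTABLES = ["perl", "java", "Rscript", "mafft", "blastp", "hmmscan", "iqtree"]


def find_missing(executables):
    """Return the names of all executables that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


missing = find_missing(EXECUTABLES)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All listed executables were found on PATH.")
```

A program reported as missing may still be installed in a non-standard location; in that case its path can be entered during configuration.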
Before installing protTrace, prepare the environment by installing the necessary software dependencies. Detailed instructions for setting up the environment with the Conda package management system are provided on a separate page.
Use the following steps to create a standard instance of //protTrace// on your computer. Note, the standard installation works only with pre-existing ortholog assignments and does not use the HaMStR package.
- Change to a directory where you want the ProtTrace package to be installed
- Clone the git repository by typing
git clone https://github.com/BIONF/protTrace.git
This places all programs in a //protTrace// subdirectory of the directory from which you issued the command.
Change to the //protTrace// directory and run the script create_conf.pl. This will test for the existence of all software dependencies, and can optionally download the necessary data from the OMA web pages and from Pfam.
To see all options for the create_config.pl script, run
perl ./bin/create_config.pl -h
If you are running protTrace for the first time, we suggest running the full setup script by issuing
perl ./bin/create_config.pl -name=prog.config -getPfam -getOma
The script will perform the following steps:
- Check for the software dependencies. If you installed a program in a non-standard path, you can enter the corresponding path interactively.
- Update all paths in the config file prog.config.
- Offer you the option to set run parameters interactively.
- Print out a config file that controls the protTrace run.
- Download the orthologous group assignments from the OMA database together with the corresponding sequences (option -getOma). In addition, the Pfam-A database will be downloaded (option -getPfam). The data will be placed into the directory //protTrace/used_files//, and the corresponding paths will be set in the config file.
If all tests succeeded, protTrace will be ready to run.
By default, protTrace requires orthologous groups assigned by OMA together with the protein sequences in FASTA format, and the Pfam database (https://pfam.xfam.org).
- If you already have this information available on your computer, specify the corresponding paths in the protTrace configuration file, either by manually editing the config file or by running the configuration script //create_conf.pl//. Make sure that the protein sequences are formatted such that each sequence extends over only one line. See [[:projects:prottrace:oma|HERE]] for details about downloading and preparing the OMA ortholog assignments.
- Otherwise, use the configuration script //create_conf.pl// provided in the //bin// directory for downloading and reformatting the files. Use the options
-getPfam -getOma
for this purpose. **Note:** The script will attempt to download about 6Gb of data, so this may take a while. By default, the files will be placed in the //used_files// directory.
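If you provide your own sequence files, the one-line-per-sequence requirement mentioned above can be met with a small reformatting step. The following is a minimal sketch that is not part of the protTrace distribution; it assumes FASTA input in which header lines start with `>`.

```python
def linearize_fasta(lines):
    """Yield FASTA lines with each sequence joined onto a single line."""
    sequence_parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if sequence_parts:
                yield "".join(sequence_parts)
                sequence_parts = []
            yield line
        else:
            sequence_parts.append(line)
    # Emit the sequence of the last record.
    if sequence_parts:
        yield "".join(sequence_parts)


example = [">seq1", "MKVLA", "GTPLL", "", ">seq2", "MSTNP"]
for out_line in linearize_fasta(example):
    print(out_line)
```

To convert a file, pass an open file handle instead of the example list and write the yielded lines to a new file.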
For running protTrace, you will have to configure the individual run parameters that are then passed on to the program via the config file. For creating an initial config file, run the script
create_config.pl
which is provided in the //bin// directory of the protTrace distribution. For modifying an existing config file, simply provide the name of the existing file and add the option //-update//.
create_conf.pl -name=YourConfigFile -update
The script will then guide you through the update procedure.
- In brief, the following main parameter classes control different steps of the analysis:
- [0] - [[:projects:prottrace:options:general|General Options]]
- [1] - [[:projects:prottrace:options:general#preprocessing|Preprocessing Settings]]
- [2] - [[:projects:prottrace:options:general#advanced_preprocessing|Advanced Preprocessing Settings]]
- [3] - [[:projects:prottrace:options:general#scaling_factor|Scaling factors]]
- [4] - [[:projects:prottrace:options:general#indel_parameter|Indel parameter]]
- [5] - [[:projects:prottrace:options:general#traceability_calculation|Traceability calculation]]
- [6] - [[:projects:prottrace:options:general#program_paths|Program paths]]
- [7] - [[:projects:prottrace:options:general#path_to_files|Paths to files]]
- You can select one to several of the main classes, by
* providing the corresponding numbers, each separated by a comma
* providing a range, e.g. 1-7 will select all classes
- Once the main classes have been selected, the script will then ask you to select the parameter(s) you want to update. **Note:** If you selected more than one main parameter class, the script will ask you first **for each** class which parameters you want to set.
- As a last step, the script will ask you to enter your values for the selected parameters. For each parameter it provides you with the current setting, and the default value (if existent), or a brief description of what to enter.
- Once all parameters have been set, the config file will be saved and is ready to use.
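The selection syntax described above (comma-separated numbers and ranges) can be illustrated with a short parser. This is a hypothetical sketch and not the parser used by the configuration script; the set of valid class numbers, 0 to 7, is assumed from the list above.

```python
def parse_selection(text, valid=range(0, 8)):
    """Turn a selection such as '1,3' or '1-7' into a sorted list of class numbers."""
    selected = set()
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # A range such as '2-5' selects both end points and everything between.
            start, end = (int(part) for part in token.split("-", 1))
            if start > end:
                raise ValueError("Invalid range: %s" % token)
            selected.update(range(start, end + 1))
        else:
            selected.add(int(token))
    invalid = sorted(number for number in selected if number not in valid)
    if invalid:
        raise ValueError("Unknown parameter class(es): %s" % invalid)
    return sorted(selected)


print(parse_selection("1, 3"))
print(parse_selection("1-7"))
```

Numbers and ranges can be mixed, so `parse_selection("0,2-4")` selects classes 0, 2, 3 and 4 in this sketch.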
Once you have completed all installation steps and have run the configuration script //create_conf.pl//, you should be ready to go. We have provided two example files in the directory toy_example with which you can test your protTrace installation.
The most convenient way of starting a protTrace analysis is to provide the program with an OMA sequence id. The file test.id contains the OMA id YEAST05874. To start a traceability analysis with this sequence, run protTrace as follows:
- check in your protTrace config file that the parameter **//species//** is set to **//YEAST//** . For the next step, we assume that a config file **//prot.config//** is located in the directory **//toy_example//** . Click {{:projects:prottrace:prot.config.pdf|HERE}} to access the config file we have used for the run.
- change into the directory **//toy_example//** and run protTrace by issuing the following command
../bin/protTrace.py -i test.id -c prot.config
A summary of the command line output during the protTrace run is available separately. The table of main output files below summarizes the main information that you should find in your output directory upon successful completion of the protTrace run.
Alternatively, you can start protTrace using a protein sequence in FASTA format as the seed. The file test.fa contains the protein sequence of the human protein ZNT3. protTrace will then, as its first step, use a BLAST search to identify the corresponding OMA identifier for this sequence((This assumes that your sequence is indeed represented in the OMA database. So far, no check is implemented)). To start a traceability analysis with this sequence, run protTrace as follows:
- check in your protTrace config file that the parameter **//species//** is set to **//HUMAN//** . For the next step, we assume that a config file **//prot2.config//** is located in the directory **//toy_example//** . Click {{:projects:prottrace:prot2.config.pdf|HERE}} to access the config file we have used for the run.
- change into the directory **//toy_example//** and run protTrace by issuing the following command
../bin/protTrace.py -f test.fasta -c prot2.config
A summary of the command line output during the protTrace run is available separately. The output produced is analogous to the one shown in the table below.

| Task | Filename | Description |
|---|---|---|
| Pfam domain annotation | | File containing the hmmscan output for the seed sequence |
| Ortholog identification | | Members of the OMA ortholog group the seed protein is part of |
| | | Amino acid sequences for the OMA ortholog group |
| MSA of orthologous sequences | | MAFFT-linsi alignment of the orthologous sequences |
| Phylogeny reconstruction | | ML tree reconstruction of the orthologous sequences |
| Pairwise distance computation | | Computation of the pairwise ML distances between the orthologs |
| Scaling factor | | Compare pairwise distances between sequences to pairwise distances of species |
| Indel rate | | Rate and shape parameter of the geometric indel distribution |
| Decay analysis | | Results of the simulation procedure |
| | | Summary of decay analysis over 100 repetitions |
| | | Graphical display of the traceability curve |
| | | Traceabilities for every reference taxon listed in your Xref_mapping_file (species_tree_maping.txt). The third column in the file gives the traceability values |
| Visualization | | Display of the traceability results on the taxonomy tree. The traceability index (TI) of the seed protein in the respective species is color coded from green, representing a high TI, to yellow, representing an intermediate TI, and finally to red, representing a low TI |
| | | Tabular output of the traceability analysis for upload and visualization in PhyloProfile |

Main output files of the protTrace run using YEAST05874 as the seed protein. Meta-results, such as Blast libraries and protein set collections for YEAST, which are generated in the course of the analysis, are not listed.