Input Data
- What should be included in my main input?
- What kinds of input format does PhyloProfile support?
- How can I generate input files?
- Additional input files
- Use with config file
- How PhyloProfile retrieves taxonomy information
- Examples
PhyloProfile requires three basic pieces of information for each seed-ortholog protein pair:
- ID of seed protein or ortholog group (geneID)
- ID of orthologous proteins (orthoID)
- NCBI ID of taxon that contains ortholog (ncbiID)
and up to two other values (var1 and var2) for two additional information layers (optional).
PhyloProfile accepts 5 kinds of input format: a list of OMA IDs, multi-FASTA, OrthoXML, long format and matrix (wide) format.
PhyloProfile accepts a list of OMA protein IDs (e.g. RATNO03710) as input. The corresponding data (orthologs, FASTA sequences and protein domain architectures) will be automatically downloaded from the OMA Browser.
For example:
RATNO03709
RATNO03710
RATNO03711
Ortholog groups can be stored as a multiple FASTA file. The sequence header has to be formatted as follows:
>geneID|ncbiID|orthoID|var1|var2
where
- geneID is a unique ID for each ortholog group
- ncbiID is the taxonomy ID of the taxon that contains the ortholog (ncbi+taxonID, e.g. ncbi7029, ncbi3702)
- orthoID is the ID of the ortholog
- var1 and var2 are values for additional layers of information (required*)
For example: inst/extdata/test.main.fasta
>DIM1|ncbi83332|P9WH07|0.996778465|0.811763277
MCCTSGCALTIRLLGRTEIRRLAKELDFRPRKSLGQNFVHDANTVRRVVAASGVSRSDLVLEVGPGLGSLTLALLDRGATVTAVEIDPLLASRLQQTVAEHSHSEVHRLTVVNRDVLALRREDLAAAPTAVVANLPYNVAVPALLHLLVEFPSIRVVTVMVQAEVAERLAAEPGSKEYGVPSVKLRFFGRVRRCGMVSPTVFWPIPRVYSGLVRIDRYETSPWPTDDAFRRRVFELVDIAFAQRRKTSRNAFVQWAGSGSESANRLLAASIDPARRGETLSIDDFVRLLRRSGGSDEATSTGRDARAPDISGHASAS
>DIM1|ncbi284812|Q9USU2|0.99994315|NA
MGKIRVRNNNAASDAEVRNTVFKFNKDFGQHILKNPLVAQGIVDKADLKQSDTVLEVGPGTGNLTVRMLEKARKVIAVEMDPRMAAEITKRVQGTPKEKKLQVVLGDVIKTDLPYFDVCVSNTPYQISSPLVFKLLQQRPAPRAAILMFQREFALRLVARPGDPLYCRLSANVQMWAHVKHIMKVGKNNFRPPPLVESSVVRIEPKNPPPPLAFEEWDGLLRIVFLRKNKTIGACFKTSSIIEMVENNYRTWCSQNERMVEEDFDVKSLIDGVLQQCNLQDARASKCGQTEFLSLLHAFHQVGVHFA
>DIM2|ncbi237631|A0A0D1C927|NA|NA
MPRAVSAKLSRQHEPSAGLRSGSARSAASSSSSVHASNQNSSATTKNPIFNTDKFGQHILKNPLVAQGIVDKANLKPTDMVLEVGPGTGNLTVRILEKAKKTTVVEMDPRMAAELSKRVQGKPEQRKLDIMLGDFCKTDLPYFDVCISNTPYQISSPLVFKLLSHRPLFRCAILMFQREFALRLIARPGDNLWCRLSANVQLYSKVDHIMKVSRNSFRPPPQVESSVVRITPLNPPPAIPFEEFDGLTRIVFSRRNKTVRASFFDARGVIDMLESNYKTYCAVKEIMPEQGSFADMVKQVLVETGSAENRAAKMDIDDLLTLLAAFHEKGIHFS
(*) Write NA for any missing / not available value
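The header layout described above can be checked before uploading a file. The following is a minimal sketch, not part of PhyloProfile: the function name `parse_header` is hypothetical, and it assumes that none of the five fields contains a `|` character itself.

```python
def parse_header(header):
    """Split a '>geneID|ncbiID|orthoID|var1|var2' header into a dict.

    Hypothetical helper for illustration; assumes no field contains '|'.
    """
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 '|'-separated fields, got {len(fields)}: {header!r}")
    gene_id, ncbi_id, ortho_id, var1, var2 = fields
    # ncbiID must be 'ncbi' followed by the numeric NCBI taxonomy ID
    if not (ncbi_id.startswith("ncbi") and ncbi_id[4:].isdigit()):
        raise ValueError(f"ncbiID must look like ncbi3702, got {ncbi_id!r}")

    def to_value(text):
        # 'NA' marks a missing value; everything else must be numeric
        return None if text == "NA" else float(text)

    return {
        "geneID": gene_id,
        "ncbiID": ncbi_id,
        "orthoID": ortho_id,
        "var1": to_value(var1),
        "var2": to_value(var2),
    }


record = parse_header(">DIM1|ncbi284812|Q9USU2|0.99994315|NA")
print(record["orthoID"], record["var1"], record["var2"])
```

Running such a check over every header line of a FASTA file catches missing fields or taxonomy IDs without the ncbi prefix early.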
The OrthoXML format is a standardized format that is used by many popular orthology prediction tools and databases like OMA, InParanoid, Hieranoid, OrthoMCL, Panther or Roundup.
PhyloProfile expects the NCBI taxonomy IDs to be present in the species tag in the XML as NCBITaxId, like this:
<species name="Dipodomys ordii" NCBITaxId="10020">
Example: inst/extdata/test.main.xml
NOTE: Please make sure that the second line contains only the tag <orthoXML> without any further words/characters.
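To see whether an OrthoXML file already carries usable taxonomy IDs, the species tags can be inspected with Python's standard library. This is only a sketch under assumptions: the function name `species_without_taxid` is hypothetical, and an ID is treated as usable when it is a positive integer.

```python
import xml.etree.ElementTree as ET


def species_without_taxid(orthoxml_text):
    """Return the names of <species> elements lacking a positive NCBITaxId.

    Hypothetical helper for illustration; works with or without an XML namespace.
    """
    root = ET.fromstring(orthoxml_text)
    missing = []
    for elem in root.iter():
        # Strip a possible '{namespace}' prefix from the tag name
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != "species":
            continue
        taxid = elem.get("NCBITaxId", "")
        if not taxid.isdigit() or int(taxid) <= 0:
            missing.append(elem.get("name"))
    return missing


example = (
    '<orthoXML xmlns="http://orthoXML.org/2011/">'
    '<species name="Dipodomys ordii" NCBITaxId="10020"/>'
    '<species name="Mus_musculus" NCBITaxId="-1"/>'
    "</orthoXML>"
)
print(species_without_taxid(example))
```

An empty result means every species tag has a taxonomy ID that PhyloProfile can look up.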
The long format is a tab delimited file containing up to 5 columns:
1. geneID
2. ncbiID (ncbi+taxonID. e.g. ncbi7029, ncbi3702)
3. orthoID
4. var1 (optional)
5. var2 (optional)
Example: inst/extdata/test.main.long
geneID ncbiID orthoID FAS traceability
OG_1017 ncbi272557 NA NA 0.963660510229
OG_1017 ncbi176299 A.fabrum@176299@1582 0.99904467475 0.962110129973
OG_1017 ncbi3702 A.thaliana@3702@252561 0.99420505647 0.970055633243
OG_1017 ncbi876142 E.intestinalis@876142@Eint_050240 1.0 0.217721321958503
OG_1017 ncbi9606 H.sapiens@9606@149340 0.9991125988 0.97056087274
OG_1017 ncbi10090 M.musculus@10090@112934 0.9992383882 0.970475008934
OG_1017 ncbi586133 N.parisii@586133@NEPG_02124 0.999537811 0.865887168304603
*Use NA for any missing value! Rename the headers of columns 4 and 5 according to your data.
**Depending on your needs, you can swap the last 2 columns to show them differently in the profile plot. The first column (var1) is represented by the dot colors, while the second column (var2) is represented by the background colors. However, do not use NA for the whole var1 column (column 4). If you just want to show your values using the background colors, you can fill the whole var1 column with zeros (0).
For example:
This is a wrong input:
geneID ncbiID orthoID score_1 score_2
OG_1017 ncbi272557 NA NA 0.963660510229
OG_1017 ncbi176299 NA NA 0.962110129973
OG_1017 ncbi3702 NA NA 0.970055633243
It should be written like this:
geneID ncbiID orthoID score_2
OG_1017 ncbi272557 pseudo_1 0.963660510229
OG_1017 ncbi176299 pseudo_1 0.962110129973
OG_1017 ncbi3702 pseudo_1 0.970055633243
In this case, you have to add names for the dots (the pseudo orthologs), otherwise the dots will not be shown. The values of score_2 will then be represented by the dot colors.
If you want to plot score_2 using the background colors, you have to rewrite the above input file like this:
geneID ncbiID orthoID score_1 score_2
OG_1017 ncbi272557 NA 0 0.963660510229
OG_1017 ncbi176299 NA 0 0.962110129973
OG_1017 ncbi3702 NA 0 0.970055633243
Here you don't need to add the pseudo orthologs to the orthoID column, since the background colors are independent of the dots.
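The rule about an all-NA var1 column can be applied automatically when preparing a long-format file. The sketch below is illustrative only: `fill_empty_var1` is a hypothetical name, and it assumes a 5-column, tab-delimited input with a header line.

```python
import csv
import io


def fill_empty_var1(long_text):
    """Replace an all-NA var1 column (column 4) with zeros.

    Hypothetical helper; expects tab-delimited long-format text with a header.
    """
    reader = csv.reader(io.StringIO(long_text), delimiter="\t")
    header, *rows = [row for row in reader if row]
    if len(header) != 5:
        raise ValueError("expected 5 columns: geneID, ncbiID, orthoID, var1, var2")
    # Only rewrite the column when every single var1 value is missing
    if rows and all(row[3] == "NA" for row in rows):
        rows = [row[:3] + ["0"] + row[4:] for row in rows]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


wrong = (
    "geneID\tncbiID\torthoID\tscore_1\tscore_2\n"
    "OG_1017\tncbi272557\tNA\tNA\t0.963660510229\n"
    "OG_1017\tncbi176299\tNA\tNA\t0.962110129973\n"
)
print(fill_empty_var1(wrong))
```

A file in which at least one var1 value is present is returned unchanged, since individual NA values are allowed.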
The matrix (wide) format is a matrix where rows represent genes and columns represent taxa. Each cell in the matrix contains <orthoID>#<var1>#<var2>. Absent values are written as NA, e.g. arath_2339_31:248814#NA#0.2 or homsa_8_41:119370#NA#NA, or only NA (which leads to the same result as NA#NA#NA).
The header of the first column has to be geneID. The header of each taxon column must have the format ncbi12345, in which 12345 is its NCBI taxon ID.
Example: inst/extdata/test.main.wide
geneID ncbi272557 ncbi176299 ncbi3702 ncbi876142 ncbi9606 ncbi10090 ncbi586133 ncbi4837 ncbi4081 ncbi7668
OG_1017 NA#NA#0.963660510229 A.fabrum@176299@1582#0.99904467475#0.962110129973 A.thaliana@3702@252561#0.99420505647#0.970055633243 E.intestinalis@876142@Eint_050240#1.0#0.217721321958503 H.sapiens@9606@149340#0.9991125988#0.97056087274 M.musculus@10090@112934#0.9992383882#0.970475008934 N.parisii@586133@NEPG_02124#0.999537811#0.865887168304603 NA#NA#0.969937583896 NA#NA#0.970897505007 NA#NA#0.971396609443
OG_1019 A.pernix@272557@1942#0.99968773663#1 NA#NA#1 A.thaliana@3702@247509#0.99942067306#1 E.intestinalis@876142@Eint_111470#0.99984830527#0.351918260358449 H.sapiens@9606@131231#0.99941631235#1 M.musculus@10090@72026#0.99942634894#1 N.parisii@586133@NEPG_02025#0.99929028115#0.576080895539263 P.blakesleeanus@4837@2023#0.99942142624#1 S.lycopersicum@4081@15365#0.9997286185#1 S.purpuratus@7668@161#0.999799012#1
OG_1020 NA#NA#1 NA#NA#1 A.thaliana@3702@243417#0.999920242#1 E.intestinalis@876142@Eint_051570#1.0#0.209834190564059 H.sapiens@9606@106204#0.999839662#1 M.musculus@10090@97389#0.999314626#1 N.parisii@586133@NEPG_00242#0.999879679#0.574228258376227 P.blakesleeanus@4837@12159#0.999973516#1 S.lycopersicum@4081@24239#0.999776104#1 NA#NA#1
OG_1023 NA#NA#0.871939716429 NA#NA#0.86683810271 NA#NA#0.893358887125 E.intestinalis@876142@Eint_061440#1.0#0.271616002558503 H.sapiens@9606@109429#0.995954221#0.895076764175 M.musculus@10090@68883#0.997224544#0.894784552862 N.parisii@586133@NEPG_02149#0.999997996#0.83226924426614 P.blakesleeanus@4837@1117#0.99875561095#0.892958043873 NA#NA#0.896223430018 S.purpuratus@7668@8192#0.99922606933#0.897926570173
OG_1024 A.pernix@272557@1694#0.98373085665#0.957890571002 NA#NA#0.956189828978 A.thaliana@3702@253525#0.98362698628#0.964951063327 E.intestinalis@876142@Eint_061350#0.99968588504#0.923773511804587 H.sapiens@9606@149706#0.98325730005#0.965512227736 M.musculus@10090@60168#0.9839481254#0.965416822528 N.parisii@586133@NEPG_02355#0.99625187461#0.222323679187767 P.blakesleeanus@4837@63#0.99872664087#0.964820022573 S.lycopersicum@4081@33226#0.974997601#0.965886415483
*NOTE: the wide format is not suitable for profiles containing paralogs (co-orthologs). We recommend using the long format as input because it is easy to prepare.
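Since the long format is the recommended one, an existing wide-format matrix can be unfolded into long-format rows. The following sketch is an assumption-laden illustration rather than PhyloProfile's own conversion: `wide_to_long` is a hypothetical name, and the input is assumed to be tab-delimited.

```python
def wide_to_long(wide_text):
    """Turn a tab-delimited wide matrix into (geneID, ncbiID, orthoID, var1, var2) rows.

    Hypothetical helper for illustration only.
    """
    lines = [line for line in wide_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    if header[0] != "geneID":
        raise ValueError("the header of the first column has to be geneID")
    taxa = header[1:]
    rows = []
    for line in lines[1:]:
        cells = line.split("\t")
        gene_id = cells[0]
        for taxon, cell in zip(taxa, cells[1:]):
            parts = cell.split("#")
            if parts == ["NA"]:
                # A bare NA is equivalent to NA#NA#NA
                parts = ["NA", "NA", "NA"]
            if len(parts) != 3:
                raise ValueError(f"cell must be orthoID#var1#var2, got {cell!r}")
            rows.append((gene_id, taxon, parts[0], parts[1], parts[2]))
    return rows


wide = (
    "geneID\tncbi272557\tncbi3702\n"
    "OG_1017\tNA#NA#0.963660510229\tA.thaliana@3702@252561#0.99420505647#0.970055633243\n"
)
for row in wide_to_long(wide):
    print("\t".join(row))
```

Each printed line is one long-format record; a header line with your own column names for var1 and var2 still has to be added on top.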
When working with large phylogenetic profiles, it is advantageous to skip the preprocessing step. PhyloProfile provides an option to export the preprocessed data (menu Export data -> Processed data) for future use. The preprocessed data consists of 4 files, longDf.rds, preData.rds, sortedtaxaList.rds and fullData.rds, which must be stored together in a single folder.
Certain parameters, such as the working taxonomy rank, the reference taxon, and the order of taxa, are automatically defined from the data and cannot be modified.
Orthologous Matrix (OMA) is one of the most commonly used approaches for ortholog search. OMA provides ortholog groups in OrthoXML format for both the online database and the standalone tool.
PhyloProfile provides basic Python scripts to make use of the ortholog output provided by OMA. If you calculate your own orthologous groups with OMA Standalone, the conversion script adds the NCBI taxonomy IDs to the generated OrthoXML file. If you want to use the precalculated orthologous groups from the OMA Browser, you can easily download the corresponding files from the command line.
By default, the output of OMA Standalone does not include the correct NCBITaxId but rather gives it as <species name="Dipodomys_ordii" NCBITaxId="-1">. With scripts/convert_oma_standalone_orthoxml.py we provide a basic Python script to enable the use of OMA Standalone output.
Besides the OrthoXML of OMA Standalone, it only requires a simple, tab-separated mapping file that maps the species names as generated by OMA Standalone to NCBI taxonomy IDs. OMA Standalone uses the filenames of the protein sets you put into the DB folder as the species names, so Dipodomys_ordii.fa becomes the species name Dipodomys_ordii. An example mapping file (named e.g. "taxon_mapping_oma_orthoxml.csv") should look like this:
Dipodomys_ordii 10020
Mus_musculus 10090
Rattus_norvegicus 10116
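Because the species names are derived from the file names in the DB folder, the mapping file can be drafted from those file names. This is a rough sketch: `mapping_lines` is a hypothetical helper, and the taxonomy IDs still have to be supplied by you.

```python
import os


def mapping_lines(fasta_filenames, taxon_ids):
    """Build 'species<TAB>taxonID' lines from OMA Standalone DB file names.

    Hypothetical helper; taxon_ids maps species names to NCBI taxonomy IDs.
    """
    lines = []
    for filename in sorted(fasta_filenames):
        # Dipodomys_ordii.fa -> Dipodomys_ordii
        species = os.path.splitext(os.path.basename(filename))[0]
        if species not in taxon_ids:
            raise KeyError(f"no NCBI taxonomy ID given for {species}")
        lines.append(f"{species}\t{taxon_ids[species]}")
    return lines


files = ["DB/Mus_musculus.fa", "DB/Dipodomys_ordii.fa"]
ids = {"Dipodomys_ordii": 10020, "Mus_musculus": 10090}
print("\n".join(mapping_lines(files, ids)))
```

The printed lines can be saved as the mapping file; a missing species raises an error instead of silently producing an incomplete mapping.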
To convert the OrthoXML of OMA Standalone to a PhyloProfile compatible OrthoXML you can use the script in /inst/PhyloProfile/scripts/convert_oma_standalone_orthoxml.py (https://raw.githubusercontent.com/BIONF/PhyloProfile/master/inst/PhyloProfile/scripts/convert_oma_standalone_orthoxml.py):
python convert_oma_standalone_orthoxml.py -x oma_output.orthoxml -m taxon_mapping_oma_orthoxml.csv > oma_example_phyloprofile_compatible.orthoxml
where oma_output.orthoxml is your local OMA output file.
From v0.3.0 you can directly upload a list of OMA IDs or UniProt IDs into PhyloProfile as a supported input format. The retrieved OMA data can optionally be downloaded for later use.
PhyloProfile was tested with an example OrthoXML from their website and converted OrthoXML as given by OMA. If it does not work with your XML file, please let us know!
Our new ortholog prediction tool fDOG can generate phylogenetic profiles together with domain architecture files for directly uploading into PhyloProfile.
in Multi-FASTA, long or wide format :)
Your tool is missing? Please get in touch! We are trying to support more orthology prediction tools right out of the box.
An additional sequence annotation file/folder can be provided to further enrich the phylogenetic profiles. Since the tool was initially designed to work with protein architecture annotations, the annotation file can have the following information (columns separated by tab):
1. pairID - formatted as geneID#orthologID (*)
2. orthologID/seedID (*)
3. sequence length
4. feature name - pfam domain, smart domain,etc. (*)
5. start position (*)
6. end position (*)
7. weight value for the feature - set as NA or leave blank if unavailable
8. highlighted - write Y for highlighting a domain, otherwise write N
(*) required columns, the others are optional!
Example: (or see domain files in data/demo/domain_files)
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|acama_4692@329726@1|2416|0 576 pfam_RVT_N 12 94 NA Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|acama_4692@329726@1|2416|0 576 pfam_GIIM 367 445 NA Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|acama_4692@329726@1|2416|0 576 pfam_HNH 520 566 NA N
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|acama_4692@329726@1|2416|0 576 pfam_HNH_4 522 569 NA N
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|acama_4692@329726@1|2416|0 576 smart_HNHc 510 561 NA Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|acama_4692@329726@1|2416|0 576 pfam_RVT_1 108 336 NA Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|YEAST@559292@1|P03875 834 seg_low complexity regions 242 253 0.06521739 Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|YEAST@559292@1|P03875 834 smart_HNHc 761 820 0.32608696 Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|YEAST@559292@1|P03875 834 pfam_Intron_maturas2 602 759 0.32608696 Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|YEAST@559292@1|P03875 834 tmhmm_transmembrane 13 35 0.06521739 Y
Q0050#Q0050|acama_4692@329726@1|2416|0 Q0050|YEAST@559292@1|P03875 834 pfam_RVT_1 315 577 0.2173913 Y
We provide 2 Python scripts for converting the output of hmmscan or pfamscan into PhyloProfile-compatible domain files.
- Run the Pfam annotation using hmmscan or pfamscan:
hmmscan -E 0.001 --noali --domtblout hmmscanOut.txt path_to_Pfam_A_files/Pfam-A.hmm inst/extdata/test.main.fasta
perl pfamscan.pl -fasta data/demo/test.input.fasta -dir path_to_Pfam_A_files/ > pfamscanOut.txt
- Parse the output files into compatible domain files using inst/PhyloProfile/scripts/hmmscanParser.py or inst/PhyloProfile/scripts/pfamscanParser.py:
python hmmscanParser.py -i hmmscanOut.txt > hmmscanOut.domains
python pfamscanParser.py -i pfamscanOut.txt > pfamscanOut.domains
*NOTE: To keep the IDs consistent with the sequences in the main input file, please format the sequence headers of the FASTA file before using it as input for hmmscan or pfamscan.
Aside from that, PhyloProfile also accepts the domain files generated directly from HaMStR.
Sequences in FASTA format can optionally be given in the FASTA config menu on the Input and settings page of PhyloProfile. There are two options for submitting the FASTA files:
- You can input a concatenated file containing all sequences of genes that are present in the phylogenetic profile.
- If each taxon has its own multiple FASTA file, you can give the path to the folder containing those FASTA files. The extension of the FASTA files must be .fa, .fasta, .fas or .txt. The description (header line) of each sequence must have the format taxonID:sequenceID, taxonID@sequenceID, or taxonID|sequenceID (for example: homsa_8_41:149340, H.sapiens@9606@46172, H.sapiens|HUMAN_124; where homsa_8_41, H.sapiens@9606 and H.sapiens are taxonIDs and 149340, 46172 and HUMAN_124 are sequenceIDs).
In both cases, the FASTA header of each sequence has to be the same as the sequence name shown in the input profile. Demo FASTA files can be found in data/demo/fasta_files.
If the input file is already in FASTA format, the sequences will be parsed directly from the main input file.
PhyloProfile allows users to upload their own species tree in order to (1) sort taxa based on that tree instead of the NCBI taxonomy common tree, and (2) filter the input taxa by the taxa present in the species tree.
The input species tree has to be in Newick format and should not contain singletons(*). As with the other input files, tree nodes must have the ncbi+taxonID format.
Example:
(ncbi3702,(ncbi2234,ncbi329726,(ncbi3711,ncbi7029)));
(*) A singleton node is a node with only one descendant. Singletons are created by extra left and right parentheses in the Newick string. For example:
(((A,B),(C,D)),E);
has no singletons; whereas:
(((((A),B),(C,D))),E);
has two singletons: one on the edge leading from the common ancestor of A and B to tip A, and another below the clade containing A, B, C, and D.
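A tree can be screened for singletons before uploading by counting the parenthesized groups that contain no comma at their own level. The function below is a minimal sketch with a hypothetical name, `count_singletons`; it assumes plain Newick without quoted labels or comments that contain parentheses or commas.

```python
def count_singletons(newick):
    """Count parenthesized groups that have exactly one child.

    Hypothetical helper; assumes labels contain no parentheses or commas.
    """
    comma_counts = []  # one counter per currently open parenthesis
    singletons = 0
    for char in newick:
        if char == "(":
            comma_counts.append(0)
        elif char == ",":
            if not comma_counts:
                raise ValueError("comma outside of any parentheses")
            comma_counts[-1] += 1
        elif char == ")":
            if not comma_counts:
                raise ValueError("unbalanced parentheses")
            # A group closed without any comma has a single descendant
            if comma_counts.pop() == 0:
                singletons += 1
    if comma_counts:
        raise ValueError("unbalanced parentheses")
    return singletons


print(count_singletons("(ncbi3702,(ncbi2234,ncbi329726,(ncbi3711,ncbi7029)));"))
print(count_singletons("((ncbi3702),ncbi7029);"))
```

The first tree has no singletons, while the second one wraps ncbi3702 in an extra pair of parentheses and is therefore reported with one singleton. Unbalanced parentheses raise an error, which also catches typos in hand-edited trees.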
Another option to have the taxa sorted is to provide an ordered list of taxon IDs. For example:
ncbi3702
ncbi2234
ncbi329726
ncbi3711
ncbi7029
where the first line is the reference species.
NOTE: if an ordered taxon list is given, the option for selecting reference taxon and changing the working taxonomy rank will be disabled.
The phylogenetic profiles can be highlighted based on their gene categories. These categories can be protein functions, metabolic pathways, or physiological characters of proteins, etc. Genes within the same category will be highlighted with the same color.
The gene categories input file is a tab delimited file without the header, for example:
OG_1009 cat1
OG_1010 cat1
OG_1016 cat2
OG_1017 cat5
OG_1018 cat2
You can find an example for the LCA Microsporidia demo data set here.
Instead of the default gene IDs, you can also display their gene names, which you can provide by a tab delimited file without the header, for example:
OG_1009 Gene A
OG_1010 Gene B
OG_1016 Gene C
OG_1017 Gene D
OG_1018 Gene E
NOTE: Duplicate or empty gene names are not allowed. If some genes don't have names, you can exclude them from the mapping file. Any genes without a name will be displayed by their IDs. Genes whose IDs cannot be found in the main phylogenetic profile input will be ignored.
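The rules in the note above can be checked on a mapping file before uploading it. This is a sketch only: `read_gene_names` and `display_label` are hypothetical helpers that mimic the described behaviour, not PhyloProfile functions.

```python
def read_gene_names(text):
    """Parse 'geneID<TAB>gene name' lines, rejecting empty or duplicate names.

    Hypothetical helper for illustration only.
    """
    names = {}
    seen = set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        gene_id, name = parts[0].strip(), parts[1].strip()
        if not name:
            raise ValueError(f"line {line_no}: empty gene name for {gene_id}")
        if name in seen:
            raise ValueError(f"line {line_no}: duplicate gene name {name!r}")
        seen.add(name)
        names[gene_id] = name
    return names


def display_label(gene_id, names):
    # Genes without a name fall back to their ID
    return names.get(gene_id, gene_id)


names = read_gene_names("OG_1009\tGene A\nOG_1010\tGene B\n")
print(display_label("OG_1009", names), display_label("OG_1016", names))
```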
If you don't want to manually upload the input files via the GUI and change some settings every time, you can create a config file and run PhyloProfile directly with your predefined settings. The config file is a text file written in YAML format. For example, this is the content of a test_config.yml:
---
# input files
mainInput: /Users/vinh/test.main.long
domainInput: /Users/vinh/Desktop/101621at6656.domains
fastaInput: /Users/vinh/Desktop/bionf/fastaFiles/concatenatedFile.fa
treeInput: NULL
# define taxonomy rank / reference taxon (optional)
rank: genus
refspec: Homo
# others (optional)
# wrong value will be replaced by default value
clusterProfile: FALSE
xAxis: taxa
profileTypeClustering: binary
distMethodClustering: maximum
In a config file, only mainInput is required; it must be set to the absolute path of the main phylogenetic profile input file. The other input files (domainInput, fastaInput and treeInput) can be set to NULL if they do not exist (see treeInput in the example). Other settings are also optional. If they are not defined, or are set to an unacceptable value, the default values will be used.
List of currently available parameters:
Variable (* required) | Value type | Description | Possible values (default in bold) |
---|---|---|---|
mainInput (*) | text | Path to main input phylogenetic profiles | Absolute path to main input file |
domainInput | text | Path to domain input | Absolute path to domain file, or NULL |
fastaInput | text | Path to fasta file | Absolute path to fasta file, or NULL |
treeInput | text | Path to tree file for sorting taxa | Absolute path to tree file, or NULL |
rank | text | Working taxonomy rank | strain, species, genus, family, order, class, phylum, kingdom, superkingdom |
refspec | text | Selected reference taxon (e.g. homo) | |
clusterProfile | boolean | Flag to cluster profiles | TRUE, FALSE |
profileTypeClustering | text | Type of profiles used for clustering | binary, var1, var2 |
distMethodClustering | text | Distance method for clustering | for binary profiles only: euclidean, maximum, manhattan, canberra, binary, pearson; for both binary and non-binary profiles: mutualInformation, distanceCorrelation |
clusterMethod | text | Clustering method | single, complete, average, mcquitty, median, centroid |
xAxis | text | Type of x-Axis | taxa, gene |
(will continue being expanded...)
The smallest config file, tiny_config.yml, will look like this:
---
mainInput: /Users/vinh/test.main.long
To run PhyloProfile with a config file, simply use the command:
library(PhyloProfile)
runPhyloProfile("/absolute/path/to/my/test_config.yml")
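If config files are generated by a pipeline, the text can be rendered from a dictionary of settings. The sketch below is an illustration under assumptions: `render_config` is a hypothetical helper, it only knows the parameter names listed in the table above, and it checks absolute paths with the rules of the operating system it runs on.

```python
import os

# Parameter names taken from the table of currently available parameters
ALLOWED_KEYS = {
    "mainInput", "domainInput", "fastaInput", "treeInput", "rank", "refspec",
    "clusterProfile", "profileTypeClustering", "distMethodClustering",
    "clusterMethod", "xAxis",
}


def render_config(settings):
    """Render a dict of settings as the text of a PhyloProfile config file.

    Hypothetical helper for illustration only.
    """
    if not settings.get("mainInput"):
        raise ValueError("mainInput is required")
    lines = ["---"]
    for key, value in settings.items():
        if key not in ALLOWED_KEYS:
            raise ValueError(f"unknown parameter: {key}")
        if key.endswith("Input"):
            if value is None:
                text = "NULL"  # optional input file that does not exist
            elif not os.path.isabs(value):
                raise ValueError(f"{key} must be an absolute path")
            else:
                text = value
        elif isinstance(value, bool):
            text = "TRUE" if value else "FALSE"
        else:
            text = str(value)
        lines.append(f"{key}: {text}")
    return "\n".join(lines) + "\n"


print(render_config({
    "mainInput": "/Users/vinh/test.main.long",
    "treeInput": None,
    "rank": "genus",
    "clusterProfile": False,
}))
```

The returned text can be written to a .yml file and passed to runPhyloProfile as shown above.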
PhyloProfile requires NCBI taxonomy IDs to work. In all kinds of input (except OrthoXML), taxonomy IDs must have the format ncbi+taxonID (e.g. ncbi7029, ncbi3702). We use those IDs to fetch the complete taxonomy ranks and the corresponding scientific names for the input taxa.
To convert species names into ncbi+taxonID format, we provide a function called Search for NCBI taxonomy IDs, which can be found in the Function menu of the tool.
The first time PhyloProfile is run, the complete NCBI taxonomy database will be downloaded and preprocessed into your_R_library_folder/PhyloProfile/PhyloProfile/data/preProcessedTaxonomy.txt.
If you want to update this file later, you can run these commands:
# Download and preprocess NCBI taxonomy information
library(PhyloProfile)
preProcessedTaxonomy <- processNcbiTaxonomy()
# Find the library path, where PhyloProfile has been installed and identify an output file there
path <- find.package("PhyloProfile")
outFile <- paste0(path, "/PhyloProfile/data/preProcessedTaxonomy.txt")
# Write the preprocessed taxonomy file into PhyloProfile data folder
write.table(
    preProcessedTaxonomy,
    file = outFile,
    col.names = TRUE,
    row.names = FALSE,
    quote = FALSE,
    sep = "\t"
)
or download this script and run it with Rscript in a UNIX/Mac terminal:
Rscript updateNCBITax4PhyloProfile.R
If your input taxa are not in the taxonomy file taxonNamesFull.txt, PhyloProfile will inform you about the missing taxa. There are two possible reasons for this. First, either the missing IDs or the taxonomy file is out of date. Second, your taxa are not present in the NCBI taxonomy database. In the second case, you have two options to add new taxa to the taxonomy file.
- Use the UI to manually input the missing taxa.
- Put the missing taxa into the your_R_library_folder/PhyloProfile/PhyloProfile/data/newTaxa.txt file. PhyloProfile will then automatically append the new information to the taxonomy file. For example:
ncbiID fullName rank parentID
999999901 abcdef species 3702
where 999999901 is the taxon ID defined by you (make sure that 999999901 does not exist in the NCBI taxonomy database), abcdef is the taxon name, species is the rank of this taxon, and 3702 is the already known ID of the next higher rank.
Please take a look at the example files inst/extdata/test.main.fasta, inst/extdata/test.main.wide, inst/extdata/test.main.long and inst/extdata/test.main.xml to see small examples of correctly formatted orthology data. In the folders inst/extdata/domainFiles and inst/extdata/fastaFiles you can find examples of the additional ortholog annotations with domains and sequence files in FASTA format.