Input Data

Vinh Tran edited this page Oct 30, 2024 · 53 revisions

What should be included in my main input?

PhyloProfile requires three basic pieces of information for each seed-ortholog protein pair:

  1. ID of seed protein or ortholog group (geneID)
  2. ID of orthologous proteins (orthoID)
  3. NCBI ID of taxon that contains ortholog (ncbiID)

and up to two other values (var1 and var2) for two additional information layers (optional).

What kinds of input format does PhyloProfile support?

PhyloProfile accepts five input formats: a list of OMA IDs, multi-FASTA, OrthoXML, a long format and a wide (matrix) format.

OMA IDs list

PhyloProfile accepts a list of OMA protein IDs (e.g. RATNO03710) as input. The corresponding data (orthologs, FASTA sequences and protein domain architectures) will be downloaded automatically from the OMA Browser.

For example:

RATNO03709
RATNO03710
RATNO03711

Multi-FASTA format

Ortholog groups can be stored as a multi-FASTA file. The sequence header has to be formatted as follows:

>geneID|ncbiID|orthoID|var1|var2

where

  • geneID is a unique ID for each ortholog group
  • ncbiID is the taxonomy ID of the taxon that contains the ortholog (ncbi+taxonID, e.g. ncbi7029, ncbi3702)
  • orthoID is the ID of the ortholog
  • var1 and var2 are values for additional layers of information (required*)

For example: inst/extdata/test.main.fasta

>DIM1|ncbi83332|P9WH07|0.996778465|0.811763277
MCCTSGCALTIRLLGRTEIRRLAKELDFRPRKSLGQNFVHDANTVRRVVAASGVSRSDLVLEVGPGLGSLTLALLDRGATVTAVEIDPLLASRLQQTVAEHSHSEVHRLTVVNRDVLALRREDLAAAPTAVVANLPYNVAVPALLHLLVEFPSIRVVTVMVQAEVAERLAAEPGSKEYGVPSVKLRFFGRVRRCGMVSPTVFWPIPRVYSGLVRIDRYETSPWPTDDAFRRRVFELVDIAFAQRRKTSRNAFVQWAGSGSESANRLLAASIDPARRGETLSIDDFVRLLRRSGGSDEATSTGRDARAPDISGHASAS
>DIM1|ncbi284812|Q9USU2|0.99994315|NA
MGKIRVRNNNAASDAEVRNTVFKFNKDFGQHILKNPLVAQGIVDKADLKQSDTVLEVGPGTGNLTVRMLEKARKVIAVEMDPRMAAEITKRVQGTPKEKKLQVVLGDVIKTDLPYFDVCVSNTPYQISSPLVFKLLQQRPAPRAAILMFQREFALRLVARPGDPLYCRLSANVQMWAHVKHIMKVGKNNFRPPPLVESSVVRIEPKNPPPPLAFEEWDGLLRIVFLRKNKTIGACFKTSSIIEMVENNYRTWCSQNERMVEEDFDVKSLIDGVLQQCNLQDARASKCGQTEFLSLLHAFHQVGVHFA
>DIM2|ncbi237631|A0A0D1C927|NA|NA
MPRAVSAKLSRQHEPSAGLRSGSARSAASSSSSVHASNQNSSATTKNPIFNTDKFGQHILKNPLVAQGIVDKANLKPTDMVLEVGPGTGNLTVRILEKAKKTTVVEMDPRMAAELSKRVQGKPEQRKLDIMLGDFCKTDLPYFDVCISNTPYQISSPLVFKLLSHRPLFRCAILMFQREFALRLIARPGDNLWCRLSANVQLYSKVDHIMKVSRNSFRPPPQVESSVVRITPLNPPPAIPFEEFDGLTRIVFSRRNKTVRASFFDARGVIDMLESNYKTYCAVKEIMPEQGSFADMVKQVLVETGSAENRAAKMDIDDLLTLLAAFHEKGIHFS

(*) Write NA for any missing / not available value
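
The header layout above can be checked before uploading a file. The following is a minimal sketch, not part of PhyloProfile: the names FastaHeader and parse_header are hypothetical, and it assumes that no field contains a `|` character and that var1 and var2 are either numeric or NA.

```python
from typing import NamedTuple, Optional


class FastaHeader(NamedTuple):
    gene_id: str
    ncbi_id: str
    ortho_id: str
    var1: Optional[float]
    var2: Optional[float]


def _to_value(field: str) -> Optional[float]:
    """Convert a var1/var2 field to float; the placeholder NA becomes None."""
    return None if field == "NA" else float(field)


def parse_header(line: str) -> FastaHeader:
    """Parse one '>geneID|ncbiID|orthoID|var1|var2' header line."""
    if not line.startswith(">"):
        raise ValueError(f"not a FASTA header: {line!r}")
    fields = line[1:].strip().split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 '|'-separated fields, got {len(fields)}")
    gene_id, ncbi_id, ortho_id, var1, var2 = fields
    # ncbiID must be the literal prefix 'ncbi' followed by the numeric taxon ID.
    if not (ncbi_id.startswith("ncbi") and ncbi_id[4:].isdigit()):
        raise ValueError(f"ncbiID must look like ncbi<taxonID>: {ncbi_id!r}")
    return FastaHeader(gene_id, ncbi_id, ortho_id, _to_value(var1), _to_value(var2))


print(parse_header(">DIM1|ncbi284812|Q9USU2|0.99994315|NA"))
```

Headers that fail the check raise a ValueError naming the problem, which is usually easier to act on than a silent upload failure.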

OrthoXML

The OrthoXML format is a standardized format that is used by many popular orthology prediction tools and databases like OMA, InParanoid, Hieranoid, OrthoMCL, Panther or Roundup.

PhyloProfile expects the NCBI taxonomy IDs to be present in the species tag in the XML as NCBITaxId like this:

<species name="Dipodomys ordii" NCBITaxId="10020">

Example: inst/extdata/test.main.xml

NOTE: Please make sure that the second line contains only the tag <orthoXML> without any further words/characters.
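
To see whether an OrthoXML file carries usable taxonomy IDs, the species tags can be inspected with Python's standard XML parser. This is a sketch under assumptions: species_missing_taxid is a hypothetical helper, the inline XML is a reduced illustration rather than a complete OrthoXML document, and any XML namespace prefix on the tags is simply ignored.

```python
import xml.etree.ElementTree as ET


def species_missing_taxid(orthoxml_text):
    """Return the names of <species> elements without a positive NCBITaxId."""
    root = ET.fromstring(orthoxml_text)
    missing = []
    for elem in root.iter():
        # Drop an optional '{namespace}' prefix from the tag name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != "species":
            continue
        taxid = elem.get("NCBITaxId", "")
        if not taxid.isdigit() or int(taxid) <= 0:
            missing.append(elem.get("name", "<unnamed>"))
    return missing


# Reduced illustration: one valid species tag and one with a placeholder ID.
example = """<orthoXML>
  <species name="Dipodomys ordii" NCBITaxId="10020"/>
  <species name="Mus_musculus" NCBITaxId="-1"/>
</orthoXML>"""
print(species_missing_taxid(example))
```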

Long format

The long format is a tab-delimited file containing up to 5 columns:

1. geneID
2. ncbiID (ncbi+taxonID. e.g. ncbi7029, ncbi3702)
3. orthoID
4. var1 (optional)
5. var2 (optional)

Example: inst/extdata/test.main.long

geneID	ncbiID	orthoID	FAS	traceability
OG_1017	ncbi272557	NA	NA	0.963660510229
OG_1017	ncbi176299	A.fabrum@176299@1582	0.99904467475	0.962110129973
OG_1017	ncbi3702	A.thaliana@3702@252561	0.99420505647	0.970055633243
OG_1017	ncbi876142	E.intestinalis@876142@Eint_050240	1.0	0.217721321958503
OG_1017	ncbi9606	H.sapiens@9606@149340	0.9991125988	0.97056087274
OG_1017	ncbi10090	M.musculus@10090@112934	0.9992383882	0.970475008934
OG_1017	ncbi586133	N.parisii@586133@NEPG_02124	0.999537811	0.865887168304603

*Use NA for any missing value, and rename the headers of columns 4 and 5 according to your data.

**Depending on your needs, you can swap the last two columns to show them differently in the profile plot. The first column (var1) is represented by the dot colors, while the second column (var2) is represented by the background colors. However, do not use NA for the whole var1 column (column 4). If you only want to show your values using the background colors, fill the whole var1 column with 0.

For example:

This is a wrong input:

geneID	ncbiID	orthoID	score_1	score_2
OG_1017	ncbi272557	NA	NA	0.963660510229
OG_1017	ncbi176299	NA	NA	0.962110129973
OG_1017	ncbi3702	NA	NA	0.970055633243

It should be written like this:

geneID	ncbiID	orthoID	score_2
OG_1017	ncbi272557	pseudo_1	0.963660510229
OG_1017	ncbi176299	pseudo_1	0.962110129973
OG_1017	ncbi3702	pseudo_1	0.970055633243

In this case, you have to add names for the dots (the pseudo orthologs); otherwise the dots will not be drawn. The values of score_2 will then be represented by the dot colors.

If you want to plot score_2 using the background colors, you have to rewrite the above input file like this:

geneID	ncbiID	orthoID	score_1	score_2
OG_1017	ncbi272557	NA	0	0.963660510229
OG_1017	ncbi176299	NA	0	0.962110129973
OG_1017	ncbi3702	NA	0	0.970055633243

Here you don't need to add the pseudo orthologs to the orthoID column, since the background colors are independent of the dots.
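
The rules for the long format can be turned into a small pre-upload check. The sketch below is an illustration only (check_long_format is a hypothetical name, not a PhyloProfile function); it checks the column count, the ncbi+taxonID form of column 2, and the all-NA var1 column that the text above warns about.

```python
import csv
import io


def check_long_format(text):
    """Return a list of problems found in tab-delimited long-format text."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        return ["input is empty"]
    header, body = rows[0], rows[1:]
    if not 3 <= len(header) <= 5:
        return [f"expected 3 to 5 columns, found {len(header)}"]
    problems = []
    for number, row in enumerate(body, start=1):
        if len(row) != len(header):
            problems.append(f"data row {number}: expected {len(header)} columns")
        elif not (row[1].startswith("ncbi") and row[1][4:].isdigit()):
            problems.append(f"data row {number}: ncbiID should look like ncbi<taxonID>")
    # A var1 column (column 4) that is NA in every row is the "wrong input" case.
    if len(header) >= 4 and body and all(len(r) > 3 and r[3] == "NA" for r in body):
        problems.append("column 4 (var1) is NA in every row")
    return problems


wrong = (
    "geneID\tncbiID\torthoID\tscore_1\tscore_2\n"
    "OG_1017\tncbi272557\tNA\tNA\t0.963660510229\n"
    "OG_1017\tncbi176299\tNA\tNA\t0.962110129973\n"
)
# Replace the all-NA var1 column with zeros, as suggested for background colors.
fixed = wrong.replace("\tNA\tNA\t", "\tNA\t0\t")
print(check_long_format(wrong))
print(check_long_format(fixed))
```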

Wide (matrix) format

A matrix where rows represent genes and columns represent taxa. Each cell in the matrix contains <orthoID>#<var1>#<var2>. Absent values are written as NA, e.g. arath_2339_31:248814#NA#0.2 or homsa_8_41:119370#NA#NA. A cell containing only NA leads to the same result as NA#NA#NA.

The header of the first column has to be geneID. The header of each taxon column must have the format ncbi12345, in which 12345 is its NCBI taxon ID.

Example: inst/extdata/test.main.wide

geneID	ncbi272557	ncbi176299	ncbi3702	ncbi876142	ncbi9606	ncbi10090	ncbi586133	ncbi4837	ncbi4081	ncbi7668
OG_1017	NA#NA#0.963660510229	A.fabrum@176299@1582#0.99904467475#0.962110129973	A.thaliana@3702@252561#0.99420505647#0.970055633243	E.intestinalis@876142@Eint_050240#1.0#0.217721321958503	H.sapiens@9606@149340#0.9991125988#0.97056087274	M.musculus@10090@112934#0.9992383882#0.970475008934	N.parisii@586133@NEPG_02124#0.999537811#0.865887168304603	NA#NA#0.969937583896	NA#NA#0.970897505007	NA#NA#0.971396609443
OG_1019	A.pernix@272557@1942#0.99968773663#1	NA#NA#1	A.thaliana@3702@247509#0.99942067306#1	E.intestinalis@876142@Eint_111470#0.99984830527#0.351918260358449	H.sapiens@9606@131231#0.99941631235#1	M.musculus@10090@72026#0.99942634894#1	N.parisii@586133@NEPG_02025#0.99929028115#0.576080895539263	P.blakesleeanus@4837@2023#0.99942142624#1	S.lycopersicum@4081@15365#0.9997286185#1	S.purpuratus@7668@161#0.999799012#1
OG_1020	NA#NA#1	NA#NA#1	A.thaliana@3702@243417#0.999920242#1	E.intestinalis@876142@Eint_051570#1.0#0.209834190564059	H.sapiens@9606@106204#0.999839662#1	M.musculus@10090@97389#0.999314626#1	N.parisii@586133@NEPG_00242#0.999879679#0.574228258376227	P.blakesleeanus@4837@12159#0.999973516#1	S.lycopersicum@4081@24239#0.999776104#1	NA#NA#1
OG_1023	NA#NA#0.871939716429	NA#NA#0.86683810271	NA#NA#0.893358887125	E.intestinalis@876142@Eint_061440#1.0#0.271616002558503	H.sapiens@9606@109429#0.995954221#0.895076764175	M.musculus@10090@68883#0.997224544#0.894784552862	N.parisii@586133@NEPG_02149#0.999997996#0.83226924426614	P.blakesleeanus@4837@1117#0.99875561095#0.892958043873	NA#NA#0.896223430018	S.purpuratus@7668@8192#0.99922606933#0.897926570173
OG_1024	A.pernix@272557@1694#0.98373085665#0.957890571002	NA#NA#0.956189828978	A.thaliana@3702@253525#0.98362698628#0.964951063327	E.intestinalis@876142@Eint_061350#0.99968588504#0.923773511804587	H.sapiens@9606@149706#0.98325730005#0.965512227736	M.musculus@10090@60168#0.9839481254#0.965416822528	N.parisii@586133@NEPG_02355#0.99625187461#0.222323679187767	P.blakesleeanus@4837@63#0.99872664087#0.964820022573	S.lycopersicum@4081@33226#0.974997601#0.965886415483

*NOTE: the wide format is not suitable for profiles containing paralogs (co-orthologs). We recommend using the long format as input because it is easy to prepare.
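
Since the long format is the recommended one, a wide matrix can be unfolded into long-format records by splitting each cell on `#`. The following sketch assumes that orthoIDs never contain `#`; wide_to_long is a hypothetical helper, and the numeric values in the demonstration are shortened for illustration.

```python
def wide_to_long(wide_text):
    """Convert wide-format text into (geneID, ncbiID, orthoID, var1, var2) tuples."""
    lines = [line for line in wide_text.splitlines() if line.strip()]
    header = lines[0].split("\t")
    if header[0] != "geneID":
        raise ValueError("the header of the first column must be geneID")
    taxa = header[1:]
    records = []
    for line in lines[1:]:
        gene_id, *cells = line.split("\t")
        if len(cells) != len(taxa):
            raise ValueError(f"{gene_id}: expected {len(taxa)} cells, got {len(cells)}")
        for ncbi_id, cell in zip(taxa, cells):
            # A bare NA is treated like NA#NA#NA.
            parts = ["NA", "NA", "NA"] if cell == "NA" else cell.split("#")
            if len(parts) != 3:
                raise ValueError(f"bad cell {cell!r} for {gene_id}/{ncbi_id}")
            records.append((gene_id, ncbi_id, *parts))
    return records


# Shortened illustrative values, not the full demo data.
wide = (
    "geneID\tncbi272557\tncbi3702\n"
    "OG_1017\tNA#NA#0.96\tA.thaliana@3702@252561#0.99#0.97\n"
    "OG_1020\tNA\tA.thaliana@3702@243417#0.99#1\n"
)
for record in wide_to_long(wide):
    print("\t".join(record))
```

Rows with fewer cells than taxon columns raise an error instead of being silently truncated, which helps catch incomplete lines in hand-edited matrices.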

Preprocessed data

When working with large phylogenetic profiles, it is advantageous to skip the preprocessing step. PhyloProfile provides an option to export the preprocessed data (menu Export data -> Processed data) for future use. The preprocessed data consists of 4 files (longDf.rds, preData.rds, sortedtaxaList.rds and fullData.rds), which must be stored together in a single folder.

Certain parameters, such as the working taxonomy rank, the reference taxon, and the order of taxa, are automatically defined from the data and cannot be modified.
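
Because all four files have to sit in the same folder, a quick completeness check can save a failed upload. This is a minimal sketch using only the file names listed above; missing_preprocessed_files is a hypothetical helper, and the temporary folder exists only for demonstration.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("longDf.rds", "preData.rds", "sortedtaxaList.rds", "fullData.rds")


def missing_preprocessed_files(folder):
    """Return the required file names that are absent from the folder."""
    folder_path = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder_path / name).is_file()]


# Demonstration on a temporary folder holding only two of the four files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("longDf.rds", "preData.rds"):
        (Path(tmp) / name).touch()
    demo_missing = missing_preprocessed_files(tmp)

print(demo_missing)
```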

How can I generate input files?

Using output files from OMA

The Orthologous Matrix (OMA) is one of the most commonly used approaches for ortholog search. OMA provides ortholog groups in OrthoXML format from both its online database and its standalone tool.

PhyloProfile provides basic Python scripts to make use of the ortholog output provided by OMA. If you calculate your own orthologous groups via OMA Standalone, one of these scripts adds the NCBI taxonomy IDs to the generated OrthoXML file. If you want to use the precalculated orthologous groups from the OMA Browser, you can easily download the corresponding files from the command line.

OMA Standalone

By default, the output of OMA Standalone does not include the correct NCBITaxId but rather gives these as <species name="Dipodomys_ordii" NCBITaxId="-1">. With scripts/convert_oma_standalone_orthoxml.py we provide a basic Python script to enable the use of OMA Standalone.

Besides the OrthoXML of OMA Standalone it only requires a simple, tab-separated mapping-file that maps the species names as generated by OMA Standalone to the NCBI Taxonomy ID. OMA Standalone uses the filenames of the protein sets you put into the DB folder as the species names, with Dipodomys_ordii.fa being transformed into the species name Dipodomys_ordii. An example mapping file (named e.g. "taxon_mapping_oma_orthoxml.csv") should look like this:

Dipodomys_ordii	10020
Mus_musculus	10090
Rattus_norvegicus	10116

To convert the OrthoXML of OMA Standalone to a PhyloProfile compatible OrthoXML you can use the script in /inst/PhyloProfile/scripts/convert_oma_standalone_orthoxml.py (https://raw.githubusercontent.com/BIONF/PhyloProfile/master/inst/PhyloProfile/scripts/convert_oma_standalone_orthoxml.py):

python convert_oma_standalone_orthoxml.py -x oma_output.orthoxml -m taxon_mapping_oma_orthoxml.csv > oma_example_phyloprofile_compatible.orthoxml

where oma_output.orthoxml is your local OMA output file.
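
The mapping file described above could be drafted from the file names in the OMA DB folder. The sketch below is an assumption-laden illustration: mapping_lines is a hypothetical helper, the species name is taken to be the file name without its last extension, and the taxon IDs must be supplied by you.

```python
from pathlib import PurePath


def mapping_lines(fasta_filenames, taxon_ids):
    """Build 'species_name<TAB>taxon_id' lines for the mapping file."""
    lines = []
    for filename in fasta_filenames:
        # Dipodomys_ordii.fa -> Dipodomys_ordii (assumed naming rule).
        species = PurePath(filename).stem
        if species not in taxon_ids:
            raise KeyError(f"no NCBI taxonomy ID known for {species}")
        lines.append(f"{species}\t{taxon_ids[species]}")
    return lines


known_ids = {"Dipodomys_ordii": 10020, "Mus_musculus": 10090, "Rattus_norvegicus": 10116}
files = ["DB/Dipodomys_ordii.fa", "DB/Mus_musculus.fa", "DB/Rattus_norvegicus.fa"]
print("\n".join(mapping_lines(files, known_ids)))
```

Raising on an unknown species keeps incomplete mappings from reaching the conversion step.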

OMA Browser

From v0.3.0 you can directly upload a list of OMA IDs or UniProt IDs into PhyloProfile as a supported input format. The obtained OMA data can optionally be downloaded for later use.

Other OrthoXML inputs?

PhyloProfile was tested with an example OrthoXML from their website and converted OrthoXML as given by OMA. If it does not work with your XML file, please let us know!

Using fDOG tool

Our new ortholog prediction tool fDOG can generate phylogenetic profiles together with domain architecture files for directly uploading into PhyloProfile.

Manually prepare input files

in Multi-FASTA, long or wide format :)

Your tool is missing? Please get in touch! We are trying to support more orthology prediction tools right out of the box.

Additional input files

Ortholog annotations (e.g. domains)

An additional sequence annotation file/folder can be provided to further enrich the phylogenetic profiles. Since the tool was initially designed to work with protein architecture annotations, the annotation file can have the following information (columns separated by tab):

1. pairID - formatted as geneID#orthologID (*)
2. orthologID/seedID (*)
3. sequence length
4. feature name - pfam domain, smart domain,etc. (*)
5. start position (*)
6. end position (*)
7. weight value for the feature - set as NA or leave blank if unavailable
8. highlighted - write Y for highlighting a domain, otherwise write N

(*) required columns, the others are optional!

Example: (or see domain files in data/demo/domain_files)

Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|acama_4692@329726@1|2416|0	576	pfam_RVT_N	12	94	NA	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|acama_4692@329726@1|2416|0	576	pfam_GIIM	367	445	NA	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|acama_4692@329726@1|2416|0	576	pfam_HNH	520	566	NA	N
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|acama_4692@329726@1|2416|0	576	pfam_HNH_4	522	569	NA	N
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|acama_4692@329726@1|2416|0	576	smart_HNHc	510	561	NA	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|acama_4692@329726@1|2416|0	576	pfam_RVT_1	108	336	NA	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|YEAST@559292@1|P03875	834	seg_low complexity regions	242	253	0.06521739	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|YEAST@559292@1|P03875	834	smart_HNHc	761	820	0.32608696	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|YEAST@559292@1|P03875	834	pfam_Intron_maturas2	602	759	0.32608696	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|YEAST@559292@1|P03875	834	tmhmm_transmembrane	13	35	0.06521739	Y
Q0050#Q0050|acama_4692@329726@1|2416|0	Q0050|YEAST@559292@1|P03875	834	pfam_RVT_1	315	577	0.2173913	Y
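
To illustrate how the eight columns map onto one record, here is a minimal parsing sketch. It is not PhyloProfile code: DomainRow and parse_domain_line are hypothetical names, and it assumes that optional columns keep their position and are either empty, NA, or omitted at the end of the line.

```python
from typing import NamedTuple, Optional


class DomainRow(NamedTuple):
    pair_id: str
    ortho_id: str
    length: Optional[int]
    feature: str
    start: int
    end: int
    weight: Optional[float]
    highlighted: bool


def parse_domain_line(line):
    """Parse one tab-separated line of a domain annotation file."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))  # pad omitted trailing optional columns
    pair_id, ortho_id, length, feature, start, end, weight, highlight = fields[:8]
    if "#" not in pair_id:
        raise ValueError("pairID must be formatted as geneID#orthologID")
    return DomainRow(
        pair_id,
        ortho_id,
        int(length) if length not in ("", "NA") else None,
        feature,
        int(start),
        int(end),
        float(weight) if weight not in ("", "NA") else None,
        highlight == "Y",
    )


line = (
    "Q0050#Q0050|acama_4692@329726@1|2416|0\tQ0050|YEAST@559292@1|P03875\t834"
    "\tpfam_RVT_1\t315\t577\t0.2173913\tY"
)
print(parse_domain_line(line))
```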

We provide two Python scripts for converting outputs from hmmscan or pfamscan into PhyloProfile-compatible domain files.

  1. Do PFAM annotation by using hmmscan or pfamscan:

    hmmscan -E 0.001 --noali --domtblout hmmscanOut.txt path_to_Pfam_A_files/Pfam-A.hmm inst/extdata/test.main.fasta

    perl pfamscan.pl -fasta data/demo/test.input.fasta -dir path_to_Pfam_A_files/ > pfamscanOut.txt

  2. Parse output files into compatible domain files using inst/PhyloProfile/scripts/hmmscanParser.py or inst/PhyloProfile/scripts/pfamscanParser.py:

    python hmmscanParser.py -i hmmscanOut.txt > hmmscanOut.domains

    python pfamscanParser.py -i pfamscanOut.txt > pfamscanOut.domains

*NOTE: To keep the IDs consistent with the sequences in the main input file, please format the sequence headers of the FASTA file before using it as input for hmmscan or pfamscan.

Aside from that, PhyloProfile also accepts the domain files generated directly from HaMStR.

Amino acid sequences (in FASTA format)

Sequences in FASTA format can optionally be provided via the FASTA config menu on the Input and settings page of PhyloProfile. There are two options for submitting the FASTA files:

  1. You can input a concatenated file containing all sequences of the genes that are present in the phylogenetic profile.
  2. If each taxon has its own multi-FASTA file, you can give the path to the folder containing those FASTA files. The file extension must be one of .fa, .fasta, .fas or .txt. The description (header line) of each sequence must have the format taxonID:sequenceID, taxonID@sequenceID, or taxonID|sequenceID (for example: homsa_8_41:149340, H.sapiens@9606@46172, H.sapiens|HUMAN_124, where homsa_8_41, H.sapiens@9606 and H.sapiens are taxonIDs and 149340, 46172 and HUMAN_124 are sequenceIDs).

In both cases, the FASTA header of each sequence has to be the same as the sequence name shown in the input profile. Demo FASTA files can be found in data/demo/fasta_files.

If the main input file is already in FASTA format, the sequences will be parsed directly from the main input file.
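
The three header layouts can be separated into taxonID and sequenceID with one regular expression. This sketch rests on an assumption inferred from the examples above, namely that the split happens at the last `:`, `@` or `|` in the header; split_header is a hypothetical helper, not PhyloProfile's own parser.

```python
import re

# Split at the LAST ':', '@' or '|' so that H.sapiens@9606@46172
# yields taxonID 'H.sapiens@9606' and sequenceID '46172'.
_HEADER = re.compile(r"^>?(?P<taxon>.+)[:@|](?P<seq>[^:@|]+)$")


def split_header(header):
    """Return (taxonID, sequenceID) for a per-taxon FASTA header."""
    match = _HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"header has no taxonID/sequenceID separator: {header!r}")
    return match.group("taxon"), match.group("seq")


for example in ("homsa_8_41:149340", "H.sapiens@9606@46172", ">H.sapiens|HUMAN_124"):
    print(split_header(example))
```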

User-defined species tree

PhyloProfile allows users to upload their own species tree in order to (1) sort taxa based on that tree instead of the NCBI taxonomy common tree, and (2) filter input taxa by the taxa present in the species tree.

The input species tree has to be in Newick format and should not contain singletons(*). As in other input files, tree nodes must have the ncbi+taxonID format.

Example:

(ncbi3702,(ncbi2234,ncbi329726,(ncbi3711,ncbi7029)));

(*) A singleton node is a node with only one descendant. These are created by extra left & right parentheses in our Newick string. For example:

(((A,B),(C,D)),E);

has no singletons; whereas:

(((((A),B),(C,D))),E);

has two singletons - one on the edge leading from the common ancestor of A & B to tip A, and another below the clade containing A, B, C, and D.

(reference)
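
Singletons can be spotted by counting, for every pair of parentheses, how many commas it directly encloses: a group without a comma has only one descendant. The sketch below assumes a simple Newick string without quoted labels that contain parentheses or commas; count_singletons is a hypothetical helper.

```python
def count_singletons(newick):
    """Count parenthesised groups that enclose exactly one descendant."""
    stack = []  # number of commas seen so far in each currently open group
    singletons = 0
    for char in newick:
        if char == "(":
            stack.append(0)
        elif char == "," and stack:
            stack[-1] += 1
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced parentheses in Newick string")
            if stack.pop() == 0:
                singletons += 1
    if stack:
        raise ValueError("unbalanced parentheses in Newick string")
    return singletons


print(count_singletons("(((A,B),(C,D)),E);"))
print(count_singletons("(((((A),B),(C,D))),E);"))
```

The same function also reports unbalanced parentheses, a common side effect of editing Newick strings by hand.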

List of sorted taxa

Another option for sorting the taxa is to provide an ordered list of taxon IDs. For example:

ncbi3702
ncbi2234
ncbi329726
ncbi3711
ncbi7029

where the first line is the reference species.

NOTE: if an ordered taxon list is given, the option for selecting reference taxon and changing the working taxonomy rank will be disabled.

Gene categories

The phylogenetic profiles can be highlighted based on their gene categories. These categories can be protein functions, metabolic pathways, or physiological characters of proteins, etc. Genes within the same category will be highlighted with the same color.

The gene categories input file is a tab-delimited file without a header, for example:

OG_1009	cat1
OG_1010	cat1
OG_1016	cat2
OG_1017	cat5
OG_1018	cat2

You can find an example for the LCA Microsporidia demo data set here.

Gene names

Instead of the default gene IDs, you can also display gene names, which you can provide in a tab-delimited file without a header, for example:

OG_1009	Gene A
OG_1010	Gene B
OG_1016	Gene C
OG_1017	Gene D
OG_1018	Gene E

NOTE: Duplicate or empty gene names are not allowed. If some genes don't have names, you can exclude them from the mapping file. Any genes without a name will be displayed by their IDs. Genes whose IDs cannot be found in the main phylogenetic profile input will be ignored.
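
The rules in the note can be checked before uploading the mapping file. This is a minimal sketch with hypothetical helper names (load_gene_names, display_name); it rejects empty and duplicate names and falls back to the gene ID when no name is given.

```python
def load_gene_names(text):
    """Read 'geneID<TAB>name' lines; reject empty and duplicate names."""
    names = {}
    seen = set()
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        gene_id, _, name = line.partition("\t")
        name = name.strip()
        if not name:
            raise ValueError(f"line {number}: empty gene name for {gene_id}")
        if name in seen:
            raise ValueError(f"line {number}: duplicate gene name {name!r}")
        seen.add(name)
        names[gene_id] = name
    return names


def display_name(gene_id, names):
    """Genes without a name are shown by their ID."""
    return names.get(gene_id, gene_id)


names = load_gene_names("OG_1009\tGene A\nOG_1010\tGene B\n")
print(display_name("OG_1009", names), display_name("OG_1016", names))
```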

Use with config file

If you don't want to upload the input files manually via the GUI and change the same settings every time, you can create a config file and run PhyloProfile directly with your predefined settings. The config file is a text file written in YAML format. For example, this is the content of a test_config.yml:

---
# input files
mainInput: /Users/vinh/test.main.long
domainInput: /Users/vinh/Desktop/101621at6656.domains
fastaInput: /Users/vinh/Desktop/bionf/fastaFiles/concatenatedFile.fa
treeInput: NULL
# define taxonomy rank / reference taxon (optional)
rank: genus
refspec: Homo
# others (optional)
# wrong value will be replaced by default value
clusterProfile: FALSE
xAxis: taxa
profileTypeClustering: binary
distMethodClustering: maximum

In a config file, only mainInput is required; it must be set to the absolute path of the main phylogenetic profile input file. The other input files (domainInput, fastaInput and treeInput) can be set to NULL if they do not exist (see treeInput in the example). Other settings are also optional. If they are not defined, or are set to unacceptable values, the default values will be used.

List of currently available parameters:

| Variable (* required) | Value type | Description | Possible values |
|---|---|---|---|
| mainInput (*) | text | Path to main input phylogenetic profiles | Absolute path to main input file |
| domainInput | text | Path to domain input | Absolute path to domain file, or NULL |
| fastaInput | text | Path to fasta file | Absolute path to fasta file, or NULL |
| treeInput | text | Path to tree file for sorting taxa | Absolute path to tree file, or NULL |
| rank | text | Working taxonomy rank | strain, species, genus, family, order, class, phylum, kingdom, superkingdom |
| refspec | text | Selected reference taxon | (e.g. homo) |
| clusterProfile | boolean | Flag to cluster profiles | TRUE, FALSE |
| profileTypeClustering | text | Type of profiles used for clustering | binary, var1, var2 |
| distMethodClustering | text | Distance method for clustering | for binary profiles only: euclidean, maximum, manhattan, canberra, binary, pearson; for both binary and non-binary profiles: mutualInformation, distanceCorrelation |
| clusterMethod | text | Clustering method | single, complete, average, mcquitty, median, centroid |
| xAxis | text | Type of x-Axis | taxa, gene |

(will continue being expanded...)

The smallest config file tiny_config.yml will look like this:

---
mainInput: /Users/vinh/test.main.long

To run PhyloProfile with a config file, simply use the command:

library(PhyloProfile)
runPhyloProfile("/absolute/path/to/my/test_config.yml")
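
When many profiles need their own config file, the file can be generated by a script. The following sketch is an illustration only: build_config is a hypothetical helper, values are written without YAML quoting (so paths containing characters such as `#` or `: ` would need extra care), and only keys documented above should be passed.

```python
import os

OPTIONAL_INPUTS = ("domainInput", "fastaInput", "treeInput")


def build_config(main_input, **settings):
    """Return the text of a config file; only mainInput is required."""
    if not os.path.isabs(main_input):
        raise ValueError("mainInput must be an absolute path")
    lines = ["---", f"mainInput: {main_input}"]
    # Missing optional input files are written as NULL.
    for key in OPTIONAL_INPUTS:
        value = settings.pop(key, None)
        lines.append(f"{key}: {value if value is not None else 'NULL'}")
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "TRUE" if value else "FALSE"  # upper-case, as in the example config
        lines.append(f"{key}: {value}")
    return "\n".join(lines) + "\n"


config_text = build_config(
    "/Users/vinh/test.main.long", rank="genus", refspec="Homo", clusterProfile=False
)
print(config_text)
```

The resulting text can be saved to a .yml file and passed to runPhyloProfile as shown above.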

How PhyloProfile retrieves taxonomy information

Input taxa can be found in NCBI taxonomy database

PhyloProfile requires NCBI taxonomy IDs to work. In all kinds of input (except OrthoXML), taxonomy IDs must have the format ncbi+taxonID (e.g. ncbi7029, ncbi3702). We use those IDs to fetch the complete taxonomy ranks and the corresponding scientific names for the input taxa.

To convert species names into ncbi+taxonID format, we provide a function called Search for NCBI taxonomy IDs, which can be found in the Function menu of the tool.

The first time PhyloProfile is run, the complete NCBI taxonomy database will be downloaded and preprocessed into your_R_library_folder/PhyloProfile/PhyloProfile/data/preProcessedTaxonomy.txt.

If you want to update this file after that, you can run these commands:

# Download and preprocess NCBI taxonomy information
library(PhyloProfile)
preProcessedTaxonomy <- processNcbiTaxonomy()
# Find the library path, where PhyloProfile has been installed and identify an output file there
path <- find.package("PhyloProfile")
outFile <- paste0(path, "/PhyloProfile/data/preProcessedTaxonomy.txt")
# Write the preprocessed taxonomy file into PhyloProfile data folder
write.table(
	preProcessedTaxonomy,
	file = outFile,
	col.names = TRUE,
	row.names = FALSE,
	quote = FALSE,
	sep = "\t"
)

or download this script and run it with Rscript in a UNIX/Mac terminal:

Rscript updateNCBITax4PhyloProfile.R

Input taxa do not exist in NCBI

If your input taxa are not in the taxonomy file taxonNamesFull.txt, PhyloProfile will report the missing taxa. There are two possible reasons for this. First, either the missing IDs or the taxonomy file are out of date. Second, your taxa are not present in the NCBI taxonomy database. In the second case, you have two options for adding new taxa to the taxonomy file:

  • Use the UI to enter the missing taxa manually.
  • Put the missing taxa into the your_R_library_folder/PhyloProfile/PhyloProfile/data/newTaxa.txt file. PhyloProfile will then automatically append the new information to the taxonomy file. For example:
ncbiID	fullName	rank	parentID
999999901	abcdef	species	3702

where 999999901 is a taxon ID defined by you (make sure that 999999901 does not exist in the NCBI taxonomy database), abcdef is the taxon name, species is the rank of this taxon, and 3702 is the already known ID of the next higher rank.
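
A line for newTaxa.txt can be assembled programmatically to avoid stray spaces or missing columns. This is a minimal sketch; new_taxon_line is a hypothetical helper, and it does not check whether the chosen ID already exists in the NCBI taxonomy database.

```python
def new_taxon_line(taxon_id, full_name, rank, parent_id):
    """Format one 'ncbiID<TAB>fullName<TAB>rank<TAB>parentID' line."""
    for label, value in (("taxon_id", taxon_id), ("parent_id", parent_id)):
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{label} must be a positive integer")
    if not full_name.strip() or "\t" in full_name:
        raise ValueError("full_name must be non-empty and must not contain tabs")
    return f"{taxon_id}\t{full_name}\t{rank}\t{parent_id}"


print(new_taxon_line(999999901, "abcdef", "species", 3702))
```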

Examples

Please take a look at the example files at inst/extdata/test.main.fasta, inst/extdata/test.main.wide, inst/extdata/test.main.long or inst/extdata/test.main.xml to see small examples for correctly formatted orthology data. In the folder inst/extdata/domainFiles and inst/extdata/fastaFiles you can find examples for the additional ortholog annotations with domains and sequence files in FASTA format.