This repository has been archived by the owner on Feb 16, 2019. It is now read-only.

ITEP ID standards

mattb112885 edited this page Apr 3, 2014 · 1 revision

Organism IDs

Organism IDs consist of two numbers separated by a period ("."):

- The TaxID (an integer), and
- A "version number" (an integer)

The version number distinguishes different annotations of the same genome or, more commonly, different genomes that share a TaxID. For example, the organism ID 83333.1 refers to an organism with TaxID 83333 and version number 1.

A mapping between organism name and ID is stored in the database and also exists in the file $ITEP_ROOT/organisms. DO NOT DELETE THIS FILE.

The organism ID will always match the regex "\d+\.\d+"
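As a minimal sketch of how a script might validate an organism ID and split it into its two parts, the Python below checks a string against the pattern with the period escaped (so that it matches only a literal "."). The function name split_organism_id is hypothetical and is not part of ITEP.

```python
import re

# The period is escaped so that it matches only a literal ".".
ORGANISM_ID_RE = re.compile(r"\d+\.\d+")


def split_organism_id(organism_id):
    """Return (taxid, version) as integers, or raise ValueError.

    Hypothetical helper for illustration; not an ITEP function.
    """
    if ORGANISM_ID_RE.fullmatch(organism_id) is None:
        raise ValueError("not an ITEP organism ID: %r" % organism_id)
    taxid, version = organism_id.split(".")
    return int(taxid), int(version)


print(split_organism_id("83333.1"))  # prints (83333, 1)
```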

Gene IDs

The gene IDs in ITEP are designed to be compatible with RAST and with PubSEED. They are generated as follows:

  • If you download a Genbank file from PubSEED, ITEP will use the same IDs automatically.
  • ITEP will also use RAST IDs if you download the tab-delimited file through the RAST web interface. See this tutorial for details.
  • Otherwise, ITEP (in particular the convertGenbank2Table.py script) will generate IDs with the format

fig|[organism_ID].peg.[Number]

Where [Number] is incremented by 1 in the order in which the genes appear in the Genbank file. The conversion from ITEP IDs to other IDs in the input Genbank files is generated automatically and stored in the file $ITEP_ROOT/aliases/aliases.
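The numbering scheme can be illustrated with a short Python sketch. This is not the code in convertGenbank2Table.py; the function name make_gene_ids is hypothetical, and the starting number of 1 is an assumption, since the text above only says the number is incremented by 1 per gene.

```python
def make_gene_ids(organism_id, n_genes, first_number=1):
    """Return gene IDs of the form fig|<organism_id>.peg.<number>.

    Hypothetical helper for illustration. first_number=1 is an
    assumption; only the increment of 1 per gene is documented.
    """
    return [
        "fig|%s.peg.%d" % (organism_id, first_number + i)
        for i in range(n_genes)
    ]


# IDs for the first three genes, in the order they appear in the file.
print(make_gene_ids("83333.1", 3))
# prints ['fig|83333.1.peg.1', 'fig|83333.1.peg.2', 'fig|83333.1.peg.3']
```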

The ITEP gene ID will always match the following regex:

fig\|\d+\.\d+\.peg\.\d+

Organism IDs can be obtained by capturing the first two numbers:

fig\|(\d+\.\d+)\.peg\.\d+
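As an example of using the capturing regex above, the following Python sketch pulls the organism ID out of a gene ID. The function name organism_from_gene_id is hypothetical and is not part of ITEP.

```python
import re

# Same pattern as above; group 1 captures the organism ID.
GENE_ID_RE = re.compile(r"fig\|(\d+\.\d+)\.peg\.\d+")


def organism_from_gene_id(gene_id):
    """Return the organism ID embedded in an ITEP gene ID.

    Hypothetical helper for illustration; raises ValueError if the
    string is not a well-formed gene ID.
    """
    match = GENE_ID_RE.fullmatch(gene_id)
    if match is None:
        raise ValueError("not an ITEP gene ID: %r" % gene_id)
    return match.group(1)


print(organism_from_gene_id("fig|83333.1.peg.42"))  # prints 83333.1
```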

Contig IDs

The contig name in the input Genbank file is appended to the ITEP organism ID, separated by a period, to form the ITEP contig ID:

Contig name from Genbank file : contig1
ITEP organism ID: 83333.1
---------------
ITEP contig ID: 83333.1.contig1

This is done because contig names are often something generic like "contig1" and we want to avoid collisions of the same contig name in different organisms.
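A minimal Python sketch of this convention is shown below. Both function names are hypothetical and are not part of ITEP. The reverse split relies on the organism ID always having exactly two numeric fields, so a contig name that itself contains periods is still recovered whole.

```python
def make_contig_id(organism_id, contig_name):
    """Join the organism ID and the Genbank contig name with a period."""
    return "%s.%s" % (organism_id, contig_name)


def split_contig_id(contig_id):
    """Return (organism_id, contig_name) from an ITEP contig ID.

    The organism ID occupies the first two period-separated fields;
    everything after the second period is the contig name.
    """
    taxid, version, contig_name = contig_id.split(".", 2)
    return "%s.%s" % (taxid, version), contig_name


print(make_contig_id("83333.1", "contig1"))  # prints 83333.1.contig1
print(split_contig_id("83333.1.contig1"))    # prints ('83333.1', 'contig1')
```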

tBLASTn IDs

When you run the tBLASTn wrapper you will get an informative ID in this format:

TBLASTN_CONTIG_$CONTIG_START_$START_STOP_$STOP

where $CONTIG is the ITEP contig ID for the tBLASTn hit, $START is the location of the first base (1-indexed) of the tBLASTn hit within that contig, and $STOP is the location of the last base. ITEP includes functions for parsing this format and supports including IDs of this format in a tree. If a tBLASTn ID is included in a Newick tree, the neighborhood computation functions will automatically compute the neighborhoods for the tBLASTn hit, so you can compare neighborhoods of called and uncalled genes.
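ITEP ships its own functions for parsing these IDs. The Python below is only an independent sketch of how the format can be taken apart, and the function name parse_tblastn_id is hypothetical. Because contig names may contain underscores, the pattern matches the contig greedily and relies on the trailing _START_ and _STOP_ fields to delimit it.

```python
import re

# The contig ID may itself contain underscores, so it is matched
# greedily and delimited by the trailing _START_/_STOP_ fields.
TBLASTN_ID_RE = re.compile(r"TBLASTN_CONTIG_(.+)_START_(\d+)_STOP_(\d+)")


def parse_tblastn_id(tblastn_id):
    """Return (contig_id, start, stop) from a tBLASTn ID.

    Hypothetical helper for illustration; not ITEP's own parser.
    start and stop are returned as integers.
    """
    match = TBLASTN_ID_RE.fullmatch(tblastn_id)
    if match is None:
        raise ValueError("not a tBLASTn ID: %r" % tblastn_id)
    contig_id, start, stop = match.groups()
    return contig_id, int(start), int(stop)


print(parse_tblastn_id("TBLASTN_CONTIG_83333.1.contig1_START_100_STOP_400"))
# prints ('83333.1.contig1', 100, 400)
```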
