Pablo Pareja Tobes edited this page May 31, 2013 · 37 revisions

This is the wiki for the project BG7.

Pipeline schema:

[Figure: BG7 pipeline schema. A linkable SVG version of the file is also available.]

Input data format

RNA sequences constraints:

The headers of the FASTA file containing the RNA sequences must follow the format of the .frn files found in RefSeq. They should look something like this:

>ref|NC_011283|:75804-75898|Sec tRNA| [locus_tag=KPK_0076]

You can find an example here for the RNA file of a Clostridium strain: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Clostridium_SY8519_uid68705/NC_015737.frn
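
As a rough illustration of the header constraint, the following Python sketch checks a header against a pattern derived only from the single example shown above. The function name `parse_frn_header`, the field names, and the optional `c` (complement strand) prefix before the coordinates are assumptions made for this sketch; they are not part of BG7.

```python
import re

# Pattern derived from the one example header on this page. The optional "c"
# prefix for complement-strand coordinates is an assumption of this sketch.
FRN_HEADER = re.compile(
    r"^>ref\|(?P<accession>[^|]+)\|:(?P<complement>c?)(?P<start>\d+)-(?P<end>\d+)"
    r"\|(?P<product>[^|]*)\|\s*\[locus_tag=(?P<locus_tag>[^\]]+)\]\s*$"
)

def parse_frn_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = FRN_HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = int(fields["start"])
    fields["end"] = int(fields["end"])
    fields["complement"] = fields["complement"] == "c"
    return fields

header = ">ref|NC_011283|:75804-75898|Sec tRNA| [locus_tag=KPK_0076]"
print(parse_frn_header(header))
```

A header that does not match returns `None`, which makes it easy to list the offending lines before starting the pipeline.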

Features

BG7

This program is the main entry point for the project. It relies on the 'executions.xml' file, which specifies the sub-programs and their arguments so that together they carry out the whole annotation process. The corresponding jar file can be found in the /jars project folder.

Expected execution times:

  • FixFastaHeaders: almost instantaneous
  • PredictGenes: about 10 minutes
  • RemoveDuplicatedGenes: 5 to 10 minutes
  • SolveOverlappings: 10 to 15 minutes
  • FillDataFromUniprot: directly proportional to the number of proteins (with a large number of proteins it sometimes stalls for a while; we suspect UniProt temporarily cuts off access from our IP)
  • FillDataFromBio4j: ~ 1 minute
  • GenerateCSVFile: almost instantaneous

Associates a unique ID with each FASTA header.
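
The idea can be sketched in a few lines of Python. The ID scheme used here (a prefix plus a running number, placed in front of the original header text) and the function name `assign_unique_ids` are hypothetical; the page does not describe the scheme BG7 actually uses.

```python
def assign_unique_ids(lines, prefix="seq"):
    """Yield FASTA lines with a sequential ID prepended to each header line."""
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            # Hypothetical ID scheme: prefix + running number, then the original header.
            yield f">{prefix}{counter}|{line[1:]}"
        else:
            # Sequence lines pass through unchanged.
            yield line

fasta = [">contig A", "ACGT", ">contig A", "GGCC"]
print(list(assign_unique_ids(fasta)))
# ['>seq1|contig A', 'ACGT', '>seq2|contig A', 'GGCC']
```

Even when two headers are identical, as in the example, each one ends up with a distinct identifier.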

Completes protein data by performing HTTP requests to the UniProt site.

Completes protein data by retrieving it from the Bio4j DB.

Removes all genes that are duplicated.
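
A minimal Python sketch of duplicate removal follows. It assumes that two genes count as duplicates when they share contig, start, end, and strand; this criterion, the dictionary keys, and the function name are assumptions of the sketch, since the page does not state how BG7 decides that two genes are duplicated.

```python
def remove_duplicated_genes(genes):
    """Keep the first gene seen for each (contig, start, end, strand) key."""
    seen = set()
    unique = []
    for gene in genes:
        # Assumed duplicate criterion: identical location on the same contig and strand.
        key = (gene["contig"], gene["start"], gene["end"], gene["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(gene)
    return unique

genes = [
    {"id": "g1", "contig": "c1", "start": 10, "end": 90, "strand": "+"},
    {"id": "g2", "contig": "c1", "start": 10, "end": 90, "strand": "+"},
    {"id": "g3", "contig": "c1", "start": 10, "end": 90, "strand": "-"},
]
print([gene["id"] for gene in remove_duplicated_genes(genes)])
# ['g1', 'g3']
```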

Resolves every overlap found between genes and RNAs.
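
The detection half of this step can be illustrated with the Python sketch below, which only finds overlapping gene/RNA pairs and does not attempt to resolve them, because the resolution rules are not described on this page. It assumes 1-based inclusive coordinates and hypothetical dictionary keys (`id`, `contig`, `start`, `end`).

```python
def find_overlaps(genes, rnas):
    """Return (gene_id, rna_id, overlap_length) for each overlapping pair on the same contig."""
    overlaps = []
    for gene in genes:
        for rna in rnas:
            if gene["contig"] != rna["contig"]:
                continue
            # With 1-based inclusive coordinates, the shared span is positive only on overlap.
            shared = min(gene["end"], rna["end"]) - max(gene["start"], rna["start"]) + 1
            if shared > 0:
                overlaps.append((gene["id"], rna["id"], shared))
    return overlaps

genes = [
    {"id": "g1", "contig": "c1", "start": 100, "end": 200},
    {"id": "g2", "contig": "c1", "start": 300, "end": 400},
]
rnas = [{"id": "r1", "contig": "c1", "start": 190, "end": 260}]
print(find_overlaps(genes, rnas))
# [('g1', 'r1', 11)]
```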

This is one of the most important programs/steps in the semi-automatic annotation process. It carries out the gene prediction phase.

Generates two multifasta files for the genes that have been predicted by the end of the process: one with the nucleotide sequences and the other with the amino acid sequences.

Generates both an XML file and a multifasta file including every intergenic sequence.
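
To make "intergenic sequence" concrete, here is a small Python sketch that computes the regions of a contig not covered by any gene. It assumes 1-based inclusive coordinates and a plain list of `(start, end)` tuples; the function name and input shape are hypothetical and unrelated to the BG7 XML format.

```python
def intergenic_regions(contig_length, genes):
    """Return 1-based inclusive (start, end) gaps not covered by any gene interval."""
    regions = []
    position = 1  # first base not yet covered by a gene
    for start, end in sorted(genes):
        if start > position:
            regions.append((position, start - 1))
        # Overlapping or nested genes must never move the cursor backwards.
        position = max(position, end + 1)
    if position <= contig_length:
        regions.append((position, contig_length))
    return regions

print(intergenic_regions(100, [(10, 20), (15, 30), (50, 60)]))
# [(1, 9), (31, 49), (61, 100)]
```

Sorting the intervals first and tracking a single cursor keeps the sketch correct when genes overlap each other.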

Generates the GFF file corresponding to the final XML results file.
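
For readers unfamiliar with GFF, the sketch below formats one feature as a nine-column, tab-separated line in the GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes). The function name, the default `source` value, and the example attribute are illustrative assumptions; the columns BG7 actually fills in are not described here.

```python
def gff_line(seqid, feature_type, start, end, strand, attributes, source="BG7"):
    """Format one feature as a nine-column, tab-separated GFF3 line."""
    attribute_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    # Score and phase are left undefined (".") in this sketch.
    columns = [seqid, source, feature_type, str(start), str(end), ".", strand, ".", attribute_text]
    return "\t".join(columns)

print(gff_line("contig1", "gene", 100, 400, "+", {"ID": "gene1"}))
```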

Exports the final XML results file to a CSV file.

Creates a new annotation XML file that excludes every dismissed gene present in the input annotation XML file.

Exporting data to other formats

Exports the final XML annotation file to EMBL format (one file for each contig).

Exports the final XML annotation file to GenBank format.

Exports the final XML annotation file to GenBank format.

Test programs

CheckForIterationQueryDefErrors

Looks for malformed <Iteration_query-def> values in BLAST output XML files, specifically those with the wrong number of '|' characters.
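
A check of this kind could look roughly like the Python sketch below, which uses the standard library XML parser. The expected number of '|' characters is passed in as a parameter because this page does not state it; the function name and the sample XML are likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

def find_bad_query_defs(blast_xml, expected_pipes):
    """Return Iteration_query-def values whose '|' count differs from expected_pipes."""
    root = ET.fromstring(blast_xml)
    bad = []
    for element in root.iter("Iteration_query-def"):
        text = element.text or ""
        if text.count("|") != expected_pipes:
            bad.append(text)
    return bad

# Made-up sample: the second query definition is missing one '|'.
sample = """<BlastOutput><BlastOutput_iterations>
<Iteration><Iteration_query-def>ref|NC_011283|gene1</Iteration_query-def></Iteration>
<Iteration><Iteration_query-def>ref|NC_011283gene2</Iteration_query-def></Iteration>
</BlastOutput_iterations></BlastOutput>"""
print(find_bad_query_defs(sample, expected_pipes=2))
# ['ref|NC_011283gene2']
```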

Generates some statistics about proteins grouped by organism.

Quality control programs

Performs a very basic (but still useful) quality control on the final annotation results XML file.

Performs an automatic quality control on a random selection of results from the final annotation XML file.

Quality control program for the GenBank files exporter program: Export GenBank files

Quality control program for the '5 columns' GenBank files (those used for genome submissions) exporter program: Export 5 columns GenBank files

Quality control program for the file generated by the program 'FixFastaHeaders'