Loading generated results into a Trinotate SQLite Database and Viewing the Output Annotation Report

Patrick Douglas edited this page Mar 25, 2019 · 3 revisions

The following commands import the results of the bioinformatic analyses performed in the previous step into a Trinotate SQLite database. All operations are performed using the included Trinotate utility. Usage is as follows:

usage: Trinotate <sqlite.db> <command> <input> [...]
<commands>:
  • Initial import of transcriptome and protein data:

    init --gene_trans_map <file> --transcript_fasta <file> --transdecoder_pep <file>
    
  • Transdecoder protein search results:

    LOAD_swissprot_blastp <file>
    LOAD_pfam <file>
    LOAD_tmhmm <file>
    LOAD_signalp <file>
    
  • Trinity transcript search results:

    LOAD_swissprot_blastx <file>
    LOAD_rnammer <file>
    
  • Load custom blast results using any searchable database:

    LOAD_custom_blast --outfmt6 <file> --prog <blastp|blastx> --dbtype <database_name>
    
  • Report generation:

    report [ -E (default: 1e-5) ] [--pfam_cutoff DNC|DGC|DTC|SNC|SGC|STC (default: DNC=domain noise cutoff)]
    

Follow the steps below to obtain a boilerplate Trinotate sqlite database and populate it with your data.

1. Load transcripts and coding regions

Begin populating the sqlite database by loading three data types:

  • Transcript sequences (de novo assembled transcripts or reference transcripts)

  • Protein sequences (currently as defined by TransDecoder)

  • Gene/Transcript relationships (tab delimited format: "gene_id(tab)transcript_id", same as used by the RSEM software). If you are using Trinity assemblies, you can generate this file like so:

    $TRINITY_HOME/util/support_scripts/get_Trinity_gene_to_trans_map.pl Trinity.fasta >  Trinity.fasta.gene_trans_map
    

Note: If you are not using Trinity transcript assemblies, you must provide the corresponding gene-to-transcript mapping file yourself.
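If you supply your own mapping file, a quick sanity check before loading can catch formatting problems early. The following is a minimal sketch and not part of Trinotate: the helper names `read_fasta_ids` and `check_gene_trans_map` are hypothetical, and it assumes that each transcript ID is the first whitespace-delimited word of its FASTA header.

```python
def read_fasta_ids(handle):
    """Return the set of sequence IDs (first word after '>') in a FASTA stream."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def check_gene_trans_map(map_handle, fasta_ids):
    """Return a list of problems found in a gene_id<TAB>transcript_id file."""
    problems = []
    seen = set()
    for lineno, line in enumerate(map_handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2 or not all(fields):
            problems.append(f"line {lineno}: expected gene_id<TAB>transcript_id")
            continue
        _gene_id, transcript_id = fields
        if transcript_id in seen:
            problems.append(f"line {lineno}: duplicate transcript {transcript_id}")
        seen.add(transcript_id)
    # Every transcript in the FASTA file should be assigned to a gene.
    for missing in sorted(fasta_ids - seen):
        problems.append(f"transcript {missing} in FASTA has no gene mapping")
    return problems
```

Open the transcript FASTA and the mapping file, pass the handles to these two functions, and treat an empty list as "no problems found". The check is deliberately strict about tabs, since a space-separated file would otherwise pass unnoticed.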

Load this information into the Trinotate sqlite database like so (example using Trinity assemblies):

Trinotate Trinotate.sqlite init --gene_trans_map Trinity.fasta.gene_trans_map --transcript_fasta Trinity.fasta --transdecoder_pep transdecoder.pep

2. Load BLAST homologies

Command                                                            Description
Trinotate Trinotate.sqlite LOAD_swissprot_blastp blastp.outfmt6    Load protein hits
Trinotate Trinotate.sqlite LOAD_swissprot_blastx blastx.outfmt6    Load transcript hits

Optional: load custom database blast hits:

Command                                                                                                                  Description
Trinotate Trinotate.sqlite LOAD_custom_blast --outfmt6 custom_db.blastp.outfmt6 --prog blastp --dbtype custom_db_name    Load protein hits
Trinotate Trinotate.sqlite LOAD_custom_blast --outfmt6 custom_db.blastx.outfmt6 --prog blastx --dbtype custom_db_name    Load transcript hits

3. Load Pfam domain entries

Trinotate Trinotate.sqlite LOAD_pfam TrinotatePFAM.out

4. Load transmembrane domains

Trinotate Trinotate.sqlite LOAD_tmhmm tmhmm.out

5. Load signal peptide predictions

Trinotate Trinotate.sqlite LOAD_signalp signalp.out
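Steps 1 to 5 can be chained in a small driver script so that they always run in the same order. This is only a sketch: it assumes `Trinotate` is on your PATH and reuses the example file names from this page, and `build_load_commands` and `run_all` are hypothetical helpers. By default it only prints the commands (dry run).

```python
import subprocess


def build_load_commands(db="Trinotate.sqlite"):
    """Return the Trinotate invocations from steps 1-5 as argv lists, in order."""
    return [
        ["Trinotate", db, "init",
         "--gene_trans_map", "Trinity.fasta.gene_trans_map",
         "--transcript_fasta", "Trinity.fasta",
         "--transdecoder_pep", "transdecoder.pep"],
        ["Trinotate", db, "LOAD_swissprot_blastp", "blastp.outfmt6"],
        ["Trinotate", db, "LOAD_swissprot_blastx", "blastx.outfmt6"],
        ["Trinotate", db, "LOAD_pfam", "TrinotatePFAM.out"],
        ["Trinotate", db, "LOAD_tmhmm", "tmhmm.out"],
        ["Trinotate", db, "LOAD_signalp", "signalp.out"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(build_load_commands())  # dry run: prints the six commands
```

Call `run_all(build_load_commands(), dry_run=False)` to actually execute the steps. Passing argv lists rather than a shell string avoids quoting issues with unusual file names.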

Trinotate: Output an Annotation Report

To generate the Trinotate annotation report, run the command below:

Trinotate Trinotate.sqlite report [opts] > trinotate_annotation_report.xls

Note: you can filter the BLAST and Pfam results that are reported by including the options below:

##################################################################
#
#  -E <float>                 maximum E-value for reporting best blast hit
#                             and associated annotations.
#                             Example: 1e-3
#  --pfam_cutoff <string>     'DNC' : domain noise cutoff (default)
#                             'DGC' : domain gathering cutoff
#                             'DTC' : domain trusted cutoff
#                             'SNC' : sequence noise cutoff
#                             'SGC' : sequence gathering cutoff
#                             'STC' : sequence trusted cutoff
#
##################################################################
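When reports are generated from a script, it can help to validate the options before calling Trinotate. The sketch below builds the report command from the two options listed above. It assumes `Trinotate` is on your PATH; `build_report_command` and `write_report` are hypothetical helper names, and the E-value is passed as a string so that it reaches the command line exactly as typed.

```python
import subprocess

# The six cutoff codes listed in the option summary above.
PFAM_CUTOFFS = {"DNC", "DGC", "DTC", "SNC", "SGC", "STC"}


def build_report_command(db, evalue=None, pfam_cutoff=None):
    """Return the argv list for 'Trinotate <db> report' with optional filters."""
    argv = ["Trinotate", db, "report"]
    if evalue is not None:
        float(evalue)  # raises ValueError if this is not a number such as "1e-3"
        argv += ["-E", str(evalue)]
    if pfam_cutoff is not None:
        if pfam_cutoff not in PFAM_CUTOFFS:
            raise ValueError(f"unknown Pfam cutoff: {pfam_cutoff}")
        argv += ["--pfam_cutoff", pfam_cutoff]
    return argv


def write_report(argv, out_path):
    """Run the report command and redirect its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(argv, stdout=out, check=True)
```

For example, `build_report_command("Trinotate.sqlite", evalue="1e-3", pfam_cutoff="DGC")` yields the argv list for a report restricted to hits with an E-value of at most 1e-3 and Pfam domains passing the domain gathering cutoff.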

The output has the following column headers:

0       #gene_id
1       transcript_id
2       sprot_Top_BLASTX_hit
3       RNAMMER
4       prot_id
5       prot_coords
6       sprot_Top_BLASTP_hit
7       custom_pombe_pep_BLASTX
8       custom_pombe_pep_BLASTP
9       Pfam
10      SignalP
11      TmHMM
12      eggnog
13      Kegg
14      gene_ontology_blast
15      gene_ontology_pfam
16      transcript
17      peptide

and the data are formatted like so:

0       TRINITY_DN179_c0_g1
1       TRINITY_DN179_c0_g1_i1
2       GCS1_SCHPO^GCS1_SCHPO^Q:53-2476,H:1-808^100%ID^E:0^RecName: Full=Probable mannosyl-oligosaccharide glucosidase;^Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces
3       .
4       TRINITY_DN179_c0_g1_i1|m.1
5       2-2479[+]
6       GCS1_SCHPO^GCS1_SCHPO^Q:18-825,H:1-808^100%ID^E:0^RecName: Full=Probable mannosyl-oligosaccharide glucosidase;^Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; Schizosaccharomyces
7       SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^Q:53-2476,H:1-808^100%ID^E:0^.^.
8       SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^SPAC6G10_09_SPAC6G10_09_I_alpha_glucosidase_I_Gls1_predicte^Q:18-825,H:1-808^100%ID^E:0^.^.
9       PF16923.2^Glyco_hydro_63N^Glycosyl hydrolase family 63 N-terminal domain^58-275^E:6.9e-60`PF03200.13^Glyco_hydro_63^Glycosyl hydrolase family 63 C-terminal domain^315-823^E:5.1e-187
10      .
11      .
12      .
13      KEGG:spo:SPAC6G10.09`KO:K01228
14      GO:0005783^cellular_component^endoplasmic reticulum`GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0004573^molecular_function^mannosyl-oligosaccharide glucosidase activity`GO:0009272^biological_process^fungal-type cell wall biogenesis`GO:0009311^biological_process^oligosaccharide metabolic process`GO:0006487^biological_process^protein N-linked glycosylation
15      .
16      .
17      .

Note: Add the options --incl_pep and --incl_trans to the report command to include the protein and transcript sequences in the tab-delimited report above.
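Because the report is plain tab-delimited text, it can be read with a few lines of code. The sketch below yields one dictionary per row, keyed by the column headers. It assumes that the first non-empty line is the header line beginning with `#gene_id`, and that a lone `.` marks an empty field, as in the example above; `read_report` is a hypothetical helper name.

```python
def read_report(handle):
    """Yield one dict per report row, keyed by column header.

    A field containing only '.' is treated as empty and mapped to None.
    """
    header = None
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if header is None:
            # The header line starts with '#gene_id'; drop the leading '#'.
            fields[0] = fields[0].lstrip("#")
            header = fields
            continue
        yield {name: (None if value == "." else value)
               for name, value in zip(header, fields)}
```

Open `trinotate_annotation_report.xls` in text mode and iterate over `read_report(handle)`; each row then gives access to values such as `row["transcript_id"]` or `row["Pfam"]`. Splitting on tabs directly, rather than using a CSV parser, avoids field-size limits when long sequences are included in the report.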

Example rRNA entry

 0       TRINITY_DN2464_c0_g1
 1       TRINITY_DN2464_c0_g1_i1
 2       ART2_YEAST^ART2_YEAST^Q:6813-6646,H:1-56^85.71%ID^E:2e-23^RecName: Full=Putative uncharacterized protein ART2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces
 3       18s_rRNA^1258-3098`28s_rRNA^3521-7502
 4       TRINITY_DN2464_c0_g1_i1|m.606
 5       6628-6960[-]
 6       ART2_YEAST^ART2_YEAST^Q:50-105,H:1-56^85.71%ID^E:3e-28^RecName: Full=Putative uncharacterized protein ART2;^Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces
 7       .
 8       .
 9       .
 10      .
 11      .
 12      .
 13      .
 14      .
 15      .
 16      .
 17      .

Note: The Trinity-assembled 18S/28S S. pombe rRNA region includes a TransDecoder-predicted ORF with a BLAST match to the S. cerevisiae protein "Antisense to ribosomal RNA transcript protein 2" (ART2).
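Cases like this one can be flagged automatically by comparing the prot_coords column with the RNAMMER column. The following is a minimal sketch under the assumption that both fields follow the layout of the example entry (`name^start-end` items separated by backticks, and `start-end[strand]` for the ORF); the function names are hypothetical.

```python
import re


def parse_rnammer(value):
    """Parse e.g. '18s_rRNA^1258-3098`28s_rRNA^3521-7502' into (name, start, end)."""
    features = []
    if value in ("", "."):
        return features
    for item in value.split("`"):
        name, span = item.split("^")
        start, end = (int(x) for x in span.split("-"))
        features.append((name, min(start, end), max(start, end)))
    return features


def parse_prot_coords(value):
    """Parse e.g. '6628-6960[-]' into (start, end, strand)."""
    match = re.fullmatch(r"(\d+)-(\d+)\[([+-])\]", value)
    if match is None:
        raise ValueError(f"unexpected prot_coords value: {value!r}")
    start, end = int(match.group(1)), int(match.group(2))
    return min(start, end), max(start, end), match.group(3)


def overlapping_rrna(rnammer_value, prot_coords_value):
    """Return the names of rRNA features that overlap the predicted ORF."""
    orf_start, orf_end, _strand = parse_prot_coords(prot_coords_value)
    return [name for name, start, end in parse_rnammer(rnammer_value)
            if start <= orf_end and orf_start <= end]
```

For the entry above, `overlapping_rrna` reports that the ORF at 6628-6960 lies within the 28S feature (3521-7502), which marks the predicted protein as a likely artifact of the rRNA region rather than a genuine coding gene.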

Backticks (`) and carets (^) are used as delimiters for data packed within an individual field, such as E-values, percent identity, and taxonomic information for best matches. When a field holds multiple assignments, the assignments are separated by backticks (`) and the fields within an assignment are separated by carets (^). In a future release (post Feb-2013), the backticks and carets will be used more uniformly than above, for example with carets as BLAST field separators, and more than the top hit will be included.
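The two-level delimiter scheme can be unpacked with two nested splits. The sketch below is illustrative only: `unpack_field` and `parse_blast_hit` are hypothetical names, and `parse_blast_hit` assumes the seven-part layout seen in the example BLAST rows on this page (accession, accession, Q:/H: ranges, percent identity, E-value, description, taxonomy), which may differ between releases.

```python
import re


def unpack_field(value):
    """Split a packed field into assignments (backtick), each a list of sub-fields (caret)."""
    if value in ("", "."):
        return []
    return [assignment.split("^") for assignment in value.split("`")]


_RANGES = re.compile(r"Q:(\d+)-(\d+),H:(\d+)-(\d+)")


def parse_blast_hit(assignment):
    """Turn one unpacked top-hit assignment into a dict (assumed 7-part layout)."""
    if len(assignment) != 7:
        raise ValueError(f"expected 7 sub-fields, got {len(assignment)}")
    accession, _, ranges, identity, evalue, description, taxonomy = assignment
    match = _RANGES.fullmatch(ranges)
    if match is None or not identity.endswith("%ID") or not evalue.startswith("E:"):
        raise ValueError("assignment does not follow the assumed BLAST hit layout")
    q_start, q_end, h_start, h_end = (int(x) for x in match.groups())
    return {
        "accession": accession,
        "query_range": (q_start, q_end),
        "hit_range": (h_start, h_end),
        "percent_identity": float(identity[:-len("%ID")]),
        "evalue": float(evalue[len("E:"):]),
        "description": description,
        # Taxonomy is a semicolon-separated lineage; strip stray spaces.
        "taxonomy": [t.strip() for t in taxonomy.split(";") if t.strip()],
    }
```

For example, calling `unpack_field` on the Pfam column of the first entry above returns two assignments of five sub-fields each, and `parse_blast_hit(unpack_field(value)[0])` converts a sprot_Top_BLASTX_hit value into numeric ranges, identity, and E-value that can be filtered directly.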
