
Software installation and data required

Patrick Douglas edited this page Mar 18, 2019 · 17 revisions

1. Software Required

Note: Trinity is not strictly required. Trinotate can be used with other sources of transcript data as long as suitable inputs are available.

Below are optional but recommended:

After downloading and unpacking signalP, edit the following line in the main signalP script (e.g. /home/patrick/signalp-4.1/signalp, found where you unpacked the signalP binaries) to increase the maximum number of entries that can be processed:

my $MAX_ALLOWED_ENTRIES=2000000;  # default is only 10000

In the same file, also update the path to where signalP is installed, e.g.:

$ENV{SIGNALP} = '/usr/local/src/signalp-4.1';
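If you prefer not to edit the file by hand, the two changes above could be applied with sed. This is a minimal sketch: the function name patch_signalp is made up for illustration, and it assumes each setting sits on its own line in the signalp script (check the backup and the result afterwards).

```shell
# Sketch: apply both signalP edits with sed, keeping a .bak backup.
# Assumption: each setting is on its own line in the signalp script.
# patch_signalp is a hypothetical helper name, not part of signalP.
patch_signalp() {
    script="$1"        # path to the signalp main script
    install_dir="$2"   # directory where signalP is installed
    sed -i.bak -E \
        -e 's/^([[:space:]]*my \$MAX_ALLOWED_ENTRIES)[[:space:]]*=.*/\1=2000000;  # default is only 10000/' \
        -e "s|^([[:space:]]*\\\$ENV\{SIGNALP\})[[:space:]]*=.*|\1 = '${install_dir}';|" \
        "$script"
}

# Example (paths are the ones used on this page; adjust to your system):
# patch_signalp /home/patrick/signalp-4.1/signalp /usr/local/src/signalp-4.1
```

The original file is kept next to the script with a .bak suffix, so the change is easy to revert.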
  • tmhmm v2 (free academic download)

http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm

You might need to edit the header lines of the scripts tmhmm and tmhmmformat.pl to read:

#!/usr/bin/env perl
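The header change can also be scripted. The sketch below uses a made-up helper name (fix_perl_shebang) and only rewrites the first line when it is already a "#!" interpreter line; note that any interpreter flags on the old line are dropped.

```shell
# Sketch: rewrite the first line of each given script to "#!/usr/bin/env perl",
# but only if that line is already a "#!" interpreter line.
# fix_perl_shebang is a hypothetical helper name.
fix_perl_shebang() {
    for script in "$@"; do
        sed -i.bak '1s|^#!.*$|#!/usr/bin/env perl|' "$script"
    done
}

# Example, run from the directory that holds the two scripts:
# fix_perl_shebang tmhmm tmhmmformat.pl
```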
  • RNAMMER (free academic download)

http://www.cbs.dtu.dk/cgi-bin/sw_request?rnammer

Installation notes: Installing RNAMMER unfortunately requires a little hacking, but following the instructions below should get it working. When you obtain the software bundle from the above website, be sure to untar it in a new directory. For example:

mkdir RNAMMER
cd RNAMMER
mv /path/to/rnammer-1.2.src.tar.Z .
tar zxvf rnammer-1.2.src.tar.Z

a. RNAMMER requires the older version of hmmsearch (v2), which you can obtain here. After building the software, rename this version of hmmsearch to hmmsearch2.

b. In the software configuration section of the rnammer script, point the script at hmmsearch2 and at your installation directory:

$HMMSEARCH_BINARY = "/path/to/hmmsearch2";
# be sure to give the complete path to where you installed hmmsearch2.

# update the INSTALL_PATH setting:
$INSTALL_PATH = "/dir/where/you/installed/RNAMMER";

c. Edit the core-rnammer script like so:

There are two places where you'll find --cpu 1 --compat. Remove the --cpu 1 at each of these places, and retain the --compat.
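As a sketch, the same edit can be made with a single sed substitution. The helper name strip_cpu_flag is made up for illustration, and the function assumes the text appears exactly as "--cpu 1 --compat" (single spaces), as described above.

```shell
# Sketch: drop "--cpu 1" wherever it precedes "--compat", keeping a .bak backup.
# Assumption: the flags appear exactly as "--cpu 1 --compat".
# strip_cpu_flag is a hypothetical helper name.
strip_cpu_flag() {
    core_script="$1"   # path to the core-rnammer script
    sed -i.bak 's/--cpu 1 --compat/--compat/g' "$core_script"
    # Fail loudly if any "--cpu 1" survived (e.g. because of different spacing).
    if grep -q -- '--cpu 1' "$core_script"; then
        echo "some --cpu 1 occurrences were not removed in $core_script" >&2
        return 1
    fi
}

# Example:
# strip_cpu_flag /dir/where/you/installed/RNAMMER/core-rnammer
```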

d. Be sure that rnammer functions correctly by executing it on their provided sample data. RNAMMER is quite useful, but the current implementation is not robust to error, so check carefully.

Visit the example directory included in rnammer
cd RNAMMER/example
Now run the example command like so:
../rnammer -S bac -m lsu,ssu,tsu -xml ecoli.xml -gff ecoli.gff -h ecoli.hmmreport < ecoli.fsa
If it runs without error AND generates new ecoli.xml, ecoli.gff, and ecoli.hmmreport files (check the datestamps on the files via ls -ltr), then congratulate yourself for successfully installing rnammer.
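The datestamp check can also be automated. The following is only a sketch with a made-up helper name (outputs_are_fresh): it touches nothing itself and simply tests that each output file is non-empty and newer than a marker file created just before the run.

```shell
# Sketch: succeed only if every listed file is non-empty and newer than the marker.
# outputs_are_fresh is a hypothetical helper name.
outputs_are_fresh() {
    marker="$1"; shift
    for f in "$@"; do
        if [ ! -s "$f" ] || [ ! "$f" -nt "$marker" ]; then
            echo "stale, empty, or missing: $f" >&2
            return 1
        fi
    done
}

# Example, from RNAMMER/example:
# touch run.marker
# ../rnammer -S bac -m lsu,ssu,tsu -xml ecoli.xml -gff ecoli.gff -h ecoli.hmmreport < ecoli.fsa
# outputs_are_fresh run.marker ecoli.xml ecoli.gff ecoli.hmmreport && echo "rnammer OK"
```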

NOTE: When running this example you may sometimes see the error below. If you get this error, please take a look here

Can't locate XML/Simple.pm in @INC (you may need to install the XML::Simple module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.22.1 /usr/local/share/perl/5.22.1 /usr/lib/x86_64-linux-gnu/perl5/5.22 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.22 /usr/share/perl/5.22 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base .) ...

2. Sequence Databases Required

Trinotate relies heavily on SwissProt and Pfam, and custom protein files are generated as described below specifically for use with Trinotate. You can obtain the protein database files by running the Trinotate build process. This step downloads several data resources, including the latest versions of SwissProt, Pfam, and other companion resources; creates and populates a Trinotate boilerplate SQLite database (Trinotate.sqlite); and yields the uniprot_sprot.pep file to be used with BLAST and the Pfam-A.hmm.gz file to be used for Pfam searches. Run the build process like so:

$TRINOTATE_HOME/admin/Build_Trinotate_Boilerplate_SQLite_db.pl  Trinotate

Once it completes, it will provide:

Trinotate.sqlite
uniprot_sprot.pep
Pfam-A.hmm.gz

Prepare the protein database for BLAST searches by running:

makeblastdb -in uniprot_sprot.pep -dbtype prot

Uncompress and prepare the Pfam database for use with 'hmmscan' like so:

gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm

3. Running Sequence Analyses

  • Files needed for execution

    • Trinity.fasta - Final product containing all the transcripts assembled by Trinity

    • Trinity.fasta.transdecoder.pep - Most likely Longest-ORF peptide candidates generated from the Trinity Assembly. Instructions for generation of this file can be found here: http://transdecoder.github.io/

  • Capturing BLAST Homologies

Instructions for installing the command-line standalone BLAST can be found here: http://www.ncbi.nlm.nih.gov/books/NBK52640/ NOTE: This step will undoubtedly take the longest. For very large files, execution on a multi-CPU server or HPC environment is highly recommended, and your thread count should equal the number of CPUs present on the node the job runs on.

Blast Commands

| Command | Description |
| --- | --- |
| blastx -query Trinity.fasta -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6 | Search Trinity transcripts |
| blastp -query transdecoder.pep -db uniprot_sprot.pep -num_threads 8 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastp.outfmt6 | Search Transdecoder-predicted proteins |

Note the use of '-max_target_seqs 1'. While this might not report the best match with DNA-level searches (see Shah 2018), I find it does seem to work as expected with protein database searches.

In addition to searching uniprot_sprot.pep, you can search any other protein database and load the results in as a custom protein database. Searching SwissProt, however, is critical to Trinotate, because that is where it retrieves the KEGG, GO, eggNOG, and other annotations from.

Note

  • If you have access to a compute farm running LSF, SGE, PBS, or SLURM, consider using HPC GridRunner to maximally parallelize your blast searches.
  • num_threads should equal the number of cores available.
  • To see the number of cores on your machine, open a terminal window and run the nproc command.
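As a minimal sketch, the core count can be captured in a variable and passed straight to -num_threads instead of hard-coding 8. The variable name THREADS and the getconf fallback (for systems without nproc) are additions for illustration; the search itself is guarded so that nothing runs unless blastx and Trinity.fasta are present.

```shell
# Detect the number of available cores; fall back to getconf where nproc is missing.
THREADS=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "Using $THREADS threads"

# Run the search only when blastx and the input file are present.
if command -v blastx >/dev/null 2>&1 && [ -f Trinity.fasta ]; then
    blastx -query Trinity.fasta -db uniprot_sprot.pep -num_threads "$THREADS" \
        -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6
fi
```

The same variable could be passed to hmmscan via --cpu "$THREADS".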
  • Running HMMER to identify protein domains

hmmscan (HMMER) command:

| Command | Description |
| --- | --- |
| hmmscan --cpu 12 --domtblout TrinotatePFAM.out Pfam-A.hmm transdecoder.pep > pfam.log | Run hmmscan |

Note

  • In --cpu, 12 should be replaced by the number of cores available.
  • To see the number of cores on your machine, open a terminal window and run the nproc command.

  • Running signalP to predict signal peptides

signalP command:

| Command | Description |
| --- | --- |
| signalp -f short -n signalp.out transdecoder.pep | Run signalP |

  • Running tmHMM to predict transmembrane regions

tmhmm command:

| Command | Description |
| --- | --- |
| tmhmm --short < transdecoder.pep > tmhmm.out | Run tmhmm |

  • Running RNAMMER to identify rRNA transcripts

RNAMMER was originally developed to identify rRNA genes in genomic sequences. To have it identify rRNA sequences among our large sets of transcriptome sequences, we first concatenate all the transcripts together into a single super-scaffold, run RNAMMER to identify rRNA homologies, and then transform the rRNA feature coordinates in the super-scaffold back to the transcriptome reference coordinates. The following script will perform all of these operations:

$TRINOTATE_HOME/util/rnammer_support/RnammerTranscriptome.pl
################################################################################
#
#  --transcriptome <string>      Transcriptome assembly fasta file
#
#  --path_to_rnammer <string>    Path to the rnammer software
#                                (ie.  /usr/bin/software/rnammer_v1.2/rnammer)
#
#  Optional:
#
#  --org_type <string>           arc|bac|euk   (default: euk)
#
################################################################################

And so, you might execute it like so:

$TRINOTATE_HOME/util/rnammer_support/RnammerTranscriptome.pl --transcriptome Trinity.fasta --path_to_rnammer /usr/bin/software/rnammer_v1.2/rnammer

Once complete, it will have generated a file: Trinity.fasta.rnammer.gff, which can be loaded into Trinotate as described in sections below.
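To illustrate the coordinate transformation described above, here is a small sketch of mapping a feature on the concatenated super-scaffold back to its source transcript. It is not the code used by RnammerTranscriptome.pl: the function name map_to_transcript, the two-column "name, length" input file, and the assumption that transcripts are joined end to end with no spacer and 1-based coordinates are all made up for illustration.

```shell
# Sketch: map a feature on the super-scaffold back to transcript coordinates.
# Assumptions: transcripts are concatenated with no spacer; coordinates are 1-based.
# map_to_transcript is a hypothetical helper, not part of Trinotate.
map_to_transcript() {
    # $1 = file with "name<TAB>length" per transcript, in concatenation order
    # $2 = feature start on the super-scaffold, $3 = feature end
    awk -F'\t' -v s="$2" -v e="$3" '
        {
            first = offset + 1      # scaffold position of this transcript, base 1
            last  = offset + $2     # scaffold position of its final base
            if (s >= first && e <= last) {
                # Subtract the offset to get transcript-local coordinates.
                printf "%s\t%d\t%d\n", $1, s - offset, e - offset
                found = 1
                exit
            }
            offset = last
        }
        END { if (!found) exit 1 }  # feature spans a boundary or lies outside
    ' "$1"
}

# Example:
# printf 'tx1\t100\ntx2\t50\n' > lengths.tsv
# map_to_transcript lengths.tsv 120 140
```

A feature that crosses a transcript boundary produces no output and a non-zero exit status, so such hits are easy to flag rather than being silently assigned to one transcript.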

4. Now you can proceed to the next phase: Loading Results into a Trinotate SQLite Database
