Perl script to append taxonomic info to BLAST+ output and get summary stats

AnnaFriedlander/taxon4blast

###############################################################################
#                                                                             #
# Copyright 2012 Anna Friedlander and Monica Gruber.                          #
# (anna.fr@gmail.com and monica.gruber@vuw.ac.nz)                             #
#                                                                             #     
# This program is free software; you can redistribute it and/or modify it     #
# under the same terms as Perl itself (either: the GNU General Public License #
# as published by the Free Software Foundation; or the Artistic License).     #
# See http://dev.perl.org/licenses/ for more information.                     #
#                                                                             #
# Comments and queries are welcome to either or both authors, however we are  #
# unable to provide dedicated technical support.                              #
#                                                                             #
# Please acknowledge the authors if you use the programme in any work         #
# resulting in publication. Please cite it as:                                #
#                                                                             #
#   Friedlander, A.M. and Gruber, M.A.M. 2012. taxon4blast.pl v 1.0.          #
#   available from anna.fr@gmail.com or monica.gruber@vuw.ac.nz               #
#                                                                             #
###############################################################################


---------------------PROGRAM DESCRIPTION---------------------------------------

taxon4blast.pl is a program to parse BLAST+ output (output format 6 with 
default fields) and append taxonomic information. 

It contains functions to: 
    * summarise BLAST+ output by taxon and calculate summary statistics 
      (number of hits; total, average, and median sequence length; average
      e-value; average bit-score);
    * compare BLAST+ output files by number of hits and sequence length at
      a particular taxon level (or taxon level and GI);
    * create FASTA files and split data based on whether an input sequence
      hit a particular taxon.

More detailed information can be found in the help menu by using the command:

      % perl taxon4blast.pl -help


---------------------RUNNING REQUIREMENTS--------------------------------------

taxon4blast.pl parses files of the following format:

    BLAST+ output in output format 6 (tab-delimited, no comment lines) with 
    default fields (qseqid sseqid pident length mismatch gapopen qstart qend 
    sstart send evalue bitscore)
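
As an illustration of this layout, here is a minimal sketch that splits one
record into named fields. It is written in Python for illustration only and is
not taken from taxon4blast.pl (which is Perl); the function name parse_hit and
the example record are hypothetical.

```python
# Default BLAST+ output format 6 column names, in the order listed above.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = {"length", "mismatch", "gapopen",
              "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_hit(line):
    """Split one tab-delimited record into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} columns, got {len(values)}")
    hit = {}
    for name, value in zip(FIELDS, values):
        if name in INT_FIELDS:
            hit[name] = int(value)
        elif name in FLOAT_FIELDS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


# Made-up record, for illustration only.
example = ("read1\tgi|12345|gb|AB000001.1|\t98.50\t200\t3\t0"
           "\t1\t200\t501\t700\t1e-100\t365")
hit = parse_hit(example)
```

A record with a different number of columns (for example, output produced with
custom fields) raises an error here rather than being silently misread.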

To run taxon4blast.pl, Perl 5 (or later) and the following Perl modules must be
installed:

    *BioPerl
    *Bio::LITE::Taxonomy::NCBI::Gi2taxid
    *Sort::Naturally

as well as the core modules Getopt::Long and List::Util, which ship with Perl.

The appropriate (nucl or prot) NCBI Taxonomy flatfiles are required:

    *nodes.dmp
    *names.dmp
    *gi_taxid_nucl.dmp or gi_taxid_prot.dmp

which can be obtained from ftp://ftp.ncbi.nih.gov/pub/taxonomy (the first two 
files will be in a taxdump archive)

NOTE: the first time the program is run, use the gi_taxid_<nucl/prot>.dmp file;
      a gi_taxid_<nucl/prot>.bin file will be created, which can be used in
      place of the .dmp file on subsequent runs.

To use -sequence_extract, input files to the BLAST+ search (in FASTA format) are
required.


---------------------FUNCTION SUMMARY------------------------------------------

The following functions are available to the user:

taxon_info
fetches the taxid and taxonomic hierarchy for each GI, appends them to the
BLAST+ output and prints to file (this output is the input to the other four
functions)
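
The lookup chain described here (GI, then taxid, then hierarchy) can be
sketched as follows. This is a Python illustration only, not the program's
Perl code. It assumes the subject id carries the GI in the form
"gi|<number>|...", it uses two small in-memory dictionaries as hypothetical
stand-ins for the NCBI flatfiles, and the appended column layout is an
assumption rather than the actual taxon_info output format.

```python
import re

# Hypothetical stand-ins for gi_taxid_*.dmp and nodes.dmp/names.dmp.
# The GI below is made up for the example.
GI_TO_TAXID = {"12345": 562}
TAXID_TO_LINEAGE = {562: ["Bacteria", "Escherichia", "Escherichia coli"]}

# Matches "gi|<digits>" at the start of the id or after a "|" separator.
GI_PATTERN = re.compile(r"(?:^|\|)gi\|(\d+)")


def extract_gi(sseqid):
    """Return the GI number in an sseqid as a string, or None if absent."""
    match = GI_PATTERN.search(sseqid)
    return match.group(1) if match else None


def append_taxonomy(line):
    """Append taxid and lineage columns to one tab-delimited BLAST record."""
    fields = line.rstrip("\n").split("\t")
    gi = extract_gi(fields[1])          # sseqid is the second column
    taxid = GI_TO_TAXID.get(gi)
    lineage = TAXID_TO_LINEAGE.get(taxid, [])
    taxid_text = str(taxid) if taxid is not None else "NA"
    lineage_text = ";".join(lineage) if lineage else "NA"
    return "\t".join(fields + [taxid_text, lineage_text])


record = ("read1\tgi|12345|gb|AB000001.1|\t98.50\t200\t3\t0"
          "\t1\t200\t501\t700\t1e-100\t365")
annotated = append_taxonomy(record)
```

Records whose GI is missing or unknown get "NA" in both appended columns in
this sketch; how the real program treats such records is not stated here.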

taxon_summary
sorts taxon_info output according to taxon specified by the user, calculates 
summary stats (number of hits; total, average, median sequence length; average 
e-value; average bit-score) and prints to file
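
The statistics listed above can be sketched as follows (Python, for
illustration only, not the program's Perl code). The function name summarise,
the tuple layout of the input and the example values are hypothetical, and
"average" is assumed to mean the arithmetic mean.

```python
from collections import defaultdict
from statistics import mean, median


def summarise(hits):
    """Group (taxon, length, evalue, bitscore) tuples by taxon.

    Returns a dict: taxon -> dict of summary statistics.
    """
    groups = defaultdict(list)
    for taxon, length, evalue, bitscore in hits:
        groups[taxon].append((length, evalue, bitscore))

    summary = {}
    for taxon, rows in groups.items():
        lengths = [row[0] for row in rows]
        summary[taxon] = {
            "hits": len(rows),
            "total_length": sum(lengths),
            "mean_length": mean(lengths),
            "median_length": median(lengths),
            "mean_evalue": mean([row[1] for row in rows]),
            "mean_bitscore": mean([row[2] for row in rows]),
        }
    return summary


# Made-up hits, for illustration only.
hits = [
    ("Bacteria", 100, 1e-10, 200.0),
    ("Bacteria", 300, 1e-20, 400.0),
    ("Eukaryota", 150, 1e-5, 90.0),
]
stats = summarise(hits)
```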

taxon_overlap
sorts taxon_info output according to the taxon specified by the user and
compares n files by number of hits and total length of hits for each taxon
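
A comparison of this kind can be sketched as a table keyed by taxon, with one
cell per input file (Python, for illustration only, not the program's Perl
code). The function name, the input layout and the file labels are
hypothetical.

```python
def taxon_overlap(files):
    """Tabulate hits per taxon across several files.

    files: dict of file label -> list of (taxon, length) tuples.
    Returns: dict of taxon -> {file label: (hit count, total length)}.
    """
    table = {}
    for label, hits in files.items():
        for taxon, length in hits:
            per_file = table.setdefault(taxon, {})
            count, total = per_file.get(label, (0, 0))
            per_file[label] = (count + 1, total + length)
    return table


# Made-up data for two files, for illustration only.
files = {
    "sampleA": [("Bacteria", 100), ("Bacteria", 250), ("Fungi", 80)],
    "sampleB": [("Bacteria", 120)],
}
table = taxon_overlap(files)

# Print one row per taxon; files with no hits for a taxon show (0, 0).
for taxon in sorted(table):
    cells = [table[taxon].get(label, (0, 0)) for label in sorted(files)]
    print(taxon, cells)
```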

taxon_sequence_overlap
sorts taxon_info output according to the taxon specified by the user and
compares n files by number of hits and total length of hits for each taxon+GI.
Using the -nonunique option prints only records found in more than one file.
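
The taxon+GI comparison and the effect of -nonunique can be sketched as
follows (Python, for illustration only, not the program's Perl code). The
function name, the input layout and the GI values are hypothetical.

```python
def sequence_overlap(files, nonunique=False):
    """Tabulate hits per (taxon, GI) pair across several files.

    files: dict of file label -> list of (taxon, gi, length) tuples.
    Returns: dict of (taxon, gi) -> {file label: (hit count, total length)}.
    With nonunique=True, only pairs seen in more than one file are kept.
    """
    table = {}
    for label, hits in files.items():
        for taxon, gi, length in hits:
            per_file = table.setdefault((taxon, gi), {})
            count, total = per_file.get(label, (0, 0))
            per_file[label] = (count + 1, total + length)
    if nonunique:
        table = {key: cells for key, cells in table.items()
                 if len(cells) > 1}
    return table


# Made-up data, for illustration only.
files = {
    "sampleA": [("Bacteria", "111", 100), ("Bacteria", "222", 90)],
    "sampleB": [("Bacteria", "111", 140)],
}
everything = sequence_overlap(files)
shared = sequence_overlap(files, nonunique=True)
```

Here `shared` keeps only ("Bacteria", "111"), since GI 222 was hit in one file
only.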

sequence_extract
given taxon_info output, the original input files to BLAST (which were used
to create the BLAST output used as input to taxon_info), and a taxon level and
taxon name (e.g. -level superkingdom -name bacteria), this function creates two
FASTA files. One file contains all sequences that hit the taxon name specified;
the other contains all sequences that hit other taxa. Note that there will
likely be overlap between the output files (e.g. an input sequence may hit both
bacterial sequences and nonbacterial sequences).
Using the -subseq option prints only the part of each sequence that was
matched (using qstart and qend)
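
The split and the -subseq behaviour can be sketched as follows (Python, for
illustration only, not the program's Perl code). The function name and input
layout are hypothetical, and matching the taxon name case-insensitively is an
assumption. BLAST coordinates are 1-based and inclusive, so the matched region
is the slice from qstart-1 to qend.

```python
def split_by_taxon(sequences, hits, name, subseq=False):
    """Split query sequences by whether they hit the named taxon.

    sequences: dict of query id -> sequence string.
    hits: list of (qseqid, taxon, qstart, qend) tuples.
    Returns two lists of (query id, sequence): matched and other.
    """
    matched, other = [], []
    for qseqid, taxon, qstart, qend in hits:
        sequence = sequences[qseqid]
        if subseq:
            # 1-based inclusive coordinates; tolerate qstart > qend.
            low, high = sorted((qstart, qend))
            sequence = sequence[low - 1:high]
        target = matched if taxon.lower() == name.lower() else other
        record = (qseqid, sequence)
        if record not in target:        # avoid writing a record twice
            target.append(record)
    return matched, other


def to_fasta(records):
    """Format (id, sequence) pairs as FASTA text."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in records)


# Made-up data, for illustration only.
sequences = {"read1": "ACGTACGTAC", "read2": "TTTTGGGGCC"}
hits = [
    ("read1", "Bacteria", 2, 5),
    ("read1", "Eukaryota", 6, 10),
    ("read2", "Eukaryota", 1, 4),
]
whole_matched, whole_other = split_by_taxon(sequences, hits, "bacteria")
sub_matched, sub_other = split_by_taxon(sequences, hits, "bacteria",
                                        subseq=True)
```

In this example read1 lands in both outputs, which is the overlap noted above.
The sketch does not reverse-complement a region when qstart is greater than
qend.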


---------------------SUGGESTED IMPROVEMENTS------------------------------------

The following are improvements that the user may wish to make to the program:

    *allow input of any BLAST output type, by requiring the user to specify 
     what column the qseqid, sseqid, evalue etc are in and how fields are 
     delimited (at the moment BLAST+ output format 6 with default fields is 
     hardcoded in).
    *graphical output
    *improve efficiency (e.g. use three hashes in taxon_info: line->gi; 
     gi->taxid; taxid->taxon_info); at present there is duplication in this
     function
    *utilise parallel processing (at present processing is sequential)
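
The three-hash suggestion above could look something like the sketch below
(Python, for illustration only; it is one possible reading of the suggestion,
not existing code). The lookup functions passed in are hypothetical
placeholders for whatever performs the real GI and taxonomy lookups. Each
distinct GI and each distinct taxid is looked up once, however many records
share it.

```python
def annotate(lines, gi_of, taxid_of, info_of):
    """Return the taxon info for each line, caching repeated lookups.

    lines: list of record strings.
    gi_of, taxid_of, info_of: caller-supplied lookup functions.
    """
    line_to_gi = {}       # line number -> GI
    gi_to_taxid = {}      # GI -> taxid (one lookup per distinct GI)
    taxid_to_info = {}    # taxid -> taxon info (one per distinct taxid)

    for number, line in enumerate(lines):
        gi = gi_of(line)
        line_to_gi[number] = gi
        if gi not in gi_to_taxid:
            gi_to_taxid[gi] = taxid_of(gi)
        taxid = gi_to_taxid[gi]
        if taxid not in taxid_to_info:
            taxid_to_info[taxid] = info_of(taxid)

    return [taxid_to_info[gi_to_taxid[line_to_gi[number]]]
            for number in range(len(lines))]


# Demonstration with made-up GIs and counters that record lookup calls.
calls = {"taxid": 0, "info": 0}


def demo_taxid_of(gi):
    calls["taxid"] += 1
    return {"111": 562, "222": 562}[gi]


def demo_info_of(taxid):
    calls["info"] += 1
    return "Bacteria;Escherichia;Escherichia coli"


demo_lines = ["q1\t111", "q2\t111", "q3\t222"]
result = annotate(demo_lines, lambda line: line.split("\t")[1],
                  demo_taxid_of, demo_info_of)
```

With three records, two distinct GIs and one distinct taxid, the demo performs
two taxid lookups and one taxon info lookup.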
