Skip to content

ndaniels/cablastp2

 
 

Repository files navigation

ABOUT
=====
This software has been deprecated in favor of [MICA](https://github.com/ndaniels/MICA) which accelerates both DIAMOND and BLASTX.


CaBLASTP is a family of programs for performing compressively-accelerated
protein sequence searches based on the BLASTP family of tools (including BLASTX,
PSI-BLAST and DELTA-BLAST), as well as a compression tool (cablastp-compress)
for creating searchable, compressed databases based on an input FASTA file.

If you use CaBLASTP, please cite:
Daniels N, Gallant A, Peng J, Cowen L, Baym M, Berger B
(2013) Compressive Genomics for Protein Databases. 
Bioinformatics.

AND

Yu W, Daniels N, Danko D, Berger B:
Entropy-scaling search of massive biological data
Submitted for publication

CaBLASTP is licensed under the GNU public license version 2.0. If you would
like to license CaBLASTP in an environment where the GNU public license is
unacceptable (such as inclusion in a non-GPL software package) commercial
CaBLASTP licensing is available through MIT office of Technology Transfer.
Contact bab@mit.edu for more information.
Contact ndaniels@cs.tufts.edu for issues involving the code.


QUICK EXAMPLE
=============
Assuming you have Go and BLAST+ installed, here is a quick example of how to 
perform a compressively accelerated BLASTP search using a compressed database 
that has already been created.

    # Install CaBLASTP
    go get github.com/ndaniels/cablastp2/...

    # Download and extract the database. It is large and could take a while.
    # Make sure to check for a newer version!
    wget http://giant.csail.mit.edu/gems/nr-20140917-cablastx.tgz
    tar zxf nr-20140917-cablastx.tgz

    # Compressive BLAST search.
    cablastp-xsearch nr-20140917-cablastx query.fasta

There are more examples covering more use cases further down.


INSTALLATION
============
The easiest way to install is to download binaries compiled for your operating
system. No other dependencies are required (sans BLASTP+).
They can be downloaded here: http://gems.csail.mit.edu/

Compiling from source is also easy; compiling CaBLASTP only requires that git 
and Go are installed. If Go is not already available via your package manager,
it can be installed from source by following the directions here:
http://golang.org/doc/install

Once Go is installed, you'll need to set your GOPATH, which is where CaBLASTP 
(and other Go packages) will be installed. We recommend running

    mkdir $HOME/go

And adding the following to your `~/.profile` or equivalent:

    export GOPATH="$HOME/go"
    export PATH="$PATH:$GOPATH/bin"

Finally, run the following command to download, compile and install CaBLASTP:

    go get github.com/ndaniels/cablastp2/...

The CaBLASTP executables should be installed in `$GOPATH/bin`.

CaBLAST has been tested against Go 1.x.


EXECUTABLES
===========
There are five binary executables in the CaBLASTP suite, also available as 
binaries for users without Go installed. They are: 

    cablastp-compress     Compresses FASTA input files (such as nr.fasta or
                          nr.gz) into a compressed database for quick searching.

    cablastp-decompress   A rarely-needed inverse of cablastp-compress.
    
    cablastp-deltasearch  A compressively accelerated version of BLASTX.

    cablastp-search       A compressively accelerated version of BLASTP.

    cablastp-psisearch    A compressively accelerated version of PSI-BLAST.

    cablastp-xsearch      A compressively accelerated version of BLASTX.

Every executable can be run with the `--help` flag to get a list of command 
line options.


PREREQUISITES
=============
CaBLASTP boosts BLAST+ protein search, and as such it is not completely
self-contained. It relies on BLAST+.

To use CaBLASTP, you must already have BLAST+ 2.2 or later installed, so that
the BLAST binaries are in your PATH. DELTA-BLAST requires BLAST+ 2.2.26 or 
later and we recommend 2.2.27. DELTA-BLAST also requires an RPS database 
configured per NCBI's instructions.

We provide binaries for Mac OS X (64-bit intel, tested on OS X 10.8.2 and
built with Go 1.0.3) and Linux (64-bit intel/AMD, tested on Linux kernel 3.6.1 
and Go 1.1.1). With Go installed, CaBLASTP should work on Microsoft Windows but 
is untested.

You do not need the Go compiler installed to use the binary distributions of
CaBLASTP.


ADDITIONAL FILES
================
As compression is compute-intensive, we provide an already-compressed database
based on NCBI's NR from September 17, 2014, which we will update thrice yearly.
Since the CaBLASTP compressed database format is actually a directory 
structure, we provide it as a .tgz file, so should be unarchived with 
`tar zxf nr-20140917-cablastx.tgz`.

The result will be a directory, 'nr-20140917-cablastx', which contains the 
various files necessary for CaBLASTP to run.

Should you wish to create your own compressed database, you would use the
cablastp-compress binary. The database we provide was created with:

    cablastp-compress --ext-seed-size 0 --match-seq-id-threshold 70 
                      --ext-seq-id-threshold 60 --max-seeds 20 -p 40
                      nr-20140917-cablastx nr.fasta

Several of the command-line arguments are tuning parameters that affect the
run-time performance of compression.

The --max-seeds argument caps the size of the seeds table to, in this case, 20
gigabytes. Compressing large databases can require a great deal of RAM. A
significantly smaller cap will harm compression.

The --ext-seed-size argument allows for larger k-mer seeds without the memory
overhead associated with the larger size, by greedily requiring the additional
residues to be exact matches.

The --match-seq-id-threshold argument sets the sequence identity percentage
required for a match during compression.

The --ext-seq-id-threshold argument sets the sequence identity percentage
required for a single instance of extension during compression.

The -p argument simply sets the number of processor cores used during 
compression, and bears no relevance to the resulting compressed database.

In this case, the input file is `nr.fasta`, and the output name for the
compressed database is `nr-20140917-cablastx`.
Note that the compressed database is actually a directory that will be created
by `cablastp-compress`.


USAGE
=====
Run cablastp-compress -help, cablastp-deltasearch -help, cablastp-search -help, 
or cablastp-psisearch -help for detailed help as to command-line arguments.


EXAMPLES
========
To perform a compressively accelerated DELTA-BLAST search, you might do:

    cablastp-deltasearch -rpspath /path/to/cdd_delta
                         /path/to/cablastp_database /path/to/query.fasta

where:

    /path/to/cdd_delta is the local file path to your conserved domain
    database (required for standard delta-blast as well)

    /path/to/cablastp_database is the local file path to your cablastp 
    compressed database (it will be the path to cablastp-nr20121212 if you are 
    using the provided December, 2012 database)

    /path/to/query.fasta is simply the local file path to the FASTA file you 
    wish to use as a query.

To perform a compressively accelerated BLASTP search, you might do:

    cablastp-search /path/to/cablastp_database /path/to/query.fasta

where:

    /path/to/cablastp_database is the local file path to the cablastp 
    compressed database, and

    /path/to/query.fasta is the local file path to the FASTA file you wish to 
    use as a query.

Arguments the user wishes to pass to the underlying BLAST program, such as
adjusting the output format or the E-value threshold, may be passed via the
`--blast-args` flag.

For example, to specify XML output, one might run:

    cablastp-search /path/to/cablastp_database /path/to/query.fasta
                    --blast-args -outfmt 5

Where `-outfmt 5` is, as indicated in the NCBI blastp user guide, the 
command-line argument for XML output.


REPORTING BUGS
==============
If you find any bugs or have any problems using CaBLASTP, please submit a bug
report on our issue tracker:

    https://github.com/ndaniels/cablastp2/issues

About

Performs BLAST on compressed proteomic data.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Go 94.6%
  • Ruby 3.3%
  • Python 1.6%
  • Other 0.5%