ABOUT
=====
CaBLASTP is a family of programs for performing compressively-accelerated
protein sequence searches based on the BLASTP family of tools (including
PSI-BLAST and DELTA-BLAST), as well as a compression tool (cablastp-compress)
for creating searchable, compressed databases based on an input FASTA file.
If you use CaBLASTP, please cite:
Daniels N, Gallant A, Peng J, Cowen L, Baym M, Berger B
(2013) Compressive Genomics for Protein Databases.
Submitted for publication.
CaBLASTP is licensed under the GNU General Public License version 2.0. If you
would like to license CaBLASTP in an environment where the GPL is
unacceptable (such as inclusion in a non-GPL software package), commercial
CaBLASTP licensing is available through the MIT office of Technology Transfer.
Contact bab@mit.edu for more information.
Contact ndaniels@cs.tufts.edu for issues involving the code.
QUICK EXAMPLE
=============
Assuming you have Go and BLAST+ installed, here is a quick example of how to
perform a compressively accelerated BLASTP search using a compressed database
that has already been created.
# Install CaBLASTP
go get github.com/BurntSushi/cablastp/...
# Download and extract the database. It is large and could take a while.
# Make sure to check for a newer version!
wget http://groups.csail.mit.edu/cb/cablastp/cablastp-nr20121212.tar.gz
tar zxf cablastp-nr20121212.tar.gz
# Compressive BLAST search.
cablastp-search cablastp-nr20121212 query.fasta
There are more examples covering more use cases further down.
INSTALLATION
============
The easiest way to install is to download binaries compiled for your operating
system. No other dependencies are required (apart from BLAST+).
They can be downloaded here: http://groups.csail.mit.edu/cb/cablastp
Compiling from source is also easy; compiling CaBLASTP only requires that git
and Go are installed. If Go is not already available via your package manager,
it can be installed from source by following the directions here:
http://golang.org/doc/install
Once Go is installed, you'll need to set your GOPATH, which is where CaBLASTP
(and other Go packages) will be installed. We recommend running
mkdir $HOME/go
And adding the following to your `~/.profile` or equivalent:
export GOPATH="$HOME/go"
export PATH="$PATH:$GOPATH/bin"
Finally, run the following command to download, compile and install CaBLASTP:
go get github.com/BurntSushi/cablastp/...
The CaBLASTP executables should be installed in `$GOPATH/bin`.
CaBLASTP has been tested against Go 1.x.
EXECUTABLES
===========
The CaBLASTP suite consists of five executables, which are also available as
prebuilt binaries for users without Go installed. They are:
cablastp-compress      Compresses FASTA input files (such as nr.fasta or
                       nr.gz) into a compressed database for quick searching.
cablastp-decompress    A rarely-needed inverse of cablastp-compress.
cablastp-search        A compressively accelerated version of BLASTP.
cablastp-psisearch     A compressively accelerated version of PSI-BLAST.
cablastp-deltasearch   A compressively accelerated version of DELTA-BLAST.
Every executable can be run with the `--help` flag to get a list of command
line options.
PREREQUISITES
=============
CaBLASTP boosts BLAST+ protein search, and as such it is not completely
self-contained. It relies on BLAST+.
To use CaBLASTP, you must already have BLAST+ 2.2 or later installed, so that
the BLAST binaries are in your PATH. DELTA-BLAST requires BLAST+ 2.2.26 or
later and we recommend 2.2.27. DELTA-BLAST also requires an RPS database
configured per NCBI's instructions.
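As a quick way to confirm the PATH requirement above, the following minimal Go sketch checks whether some BLAST+ executables can be found. The helper `missingTools` is hypothetical and is not part of CaBLASTP; the executable names `blastp`, `psiblast`, and `deltablast` are the usual BLAST+ names and are assumed here to be the ones you need.

```go
package main

import (
	"fmt"
	"os/exec"
)

// missingTools returns the subset of names that cannot be found in PATH.
// Hypothetical helper for illustration; not part of CaBLASTP.
func missingTools(names []string) []string {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Assumed executable names from a standard BLAST+ installation.
	tools := []string{"blastp", "psiblast", "deltablast"}
	if m := missingTools(tools); len(m) > 0 {
		fmt.Println("not found in PATH:", m)
		return
	}
	fmt.Println("all listed BLAST+ tools were found in PATH")
}
```

This only checks that the executables are present; it does not check the BLAST+ version or the RPS database that DELTA-BLAST needs.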
We provide binaries for Mac OS X (64-bit Intel, tested on OS X 10.8.2 and
built with Go 1.0.3) and Linux (64-bit Intel/AMD, tested on Linux kernel 3.6.1
and Go 1.1.1). With Go installed, CaBLASTP should work on Microsoft Windows,
but this is untested.
You do not need the Go compiler installed to use the binary distributions of
CaBLASTP.
ADDITIONAL FILES
================
As compression is compute-intensive, we provide an already-compressed database
based on NCBI's NR from December 12, 2012, which we will update thrice yearly.
Since the CaBLASTP compressed database format is actually a directory
structure, we provide it as a .tar.gz file, which should be unarchived with
`tar zxf cablastp-nr20121212.tar.gz`.
The result will be a directory, 'cablastp-nr20121212', which contains the
various files necessary for CaBLASTP to run.
Should you wish to create your own compressed database, you would use the
cablastp-compress binary. The database we provide was created with:
cablastp-compress --ext-seed-size 0 --match-seq-id-threshold 70 \
--ext-seq-id-threshold 60 --max-seeds 20 -p 40 \
cablastp-nr20121212 nr.fasta
Several of the command-line arguments are tuning parameters that affect the
run-time performance of compression.
The --max-seeds argument caps the size of the seeds table to, in this case, 20
gigabytes. Compressing large databases can require a great deal of RAM. A
significantly smaller cap will harm compression.
The --ext-seed-size argument allows for larger k-mer seeds without the memory
overhead associated with the larger size, by greedily requiring the additional
residues to be exact matches.
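The idea behind extended seeds can be sketched in Go as follows. This is a simplified illustration, not CaBLASTP's actual seed table: the table is keyed only by short k-mers, and a hit is accepted only if the residues that follow also match exactly, so the table never has to store the longer k-mers. The function names `buildSeeds` and `extendedHits` and the sample sequences are hypothetical.

```go
package main

import "fmt"

// buildSeeds maps every k-mer in ref to the offsets at which it starts.
// Only k-mers of the base size k are stored, which bounds the table size.
func buildSeeds(ref string, k int) map[string][]int {
	seeds := make(map[string][]int)
	for i := 0; i+k <= len(ref); i++ {
		kmer := ref[i : i+k]
		seeds[kmer] = append(seeds[kmer], i)
	}
	return seeds
}

// extendedHits returns the offsets in ref whose k-mer equals
// query[qpos:qpos+k] and whose next ext residues also match exactly.
// With ext == 0 this is an ordinary k-mer lookup.
func extendedHits(seeds map[string][]int, ref, query string, qpos, k, ext int) []int {
	if qpos+k+ext > len(query) {
		return nil
	}
	var hits []int
	for _, r := range seeds[query[qpos:qpos+k]] {
		if r+k+ext <= len(ref) && ref[r+k:r+k+ext] == query[qpos+k:qpos+k+ext] {
			hits = append(hits, r)
		}
	}
	return hits
}

func main() {
	ref := "MKTAYIAKMKTAWW" // made-up sequence for illustration
	seeds := buildSeeds(ref, 3)
	fmt.Println("ext=0:", extendedHits(seeds, ref, "MKTAY", 0, 3, 0))
	fmt.Println("ext=2:", extendedHits(seeds, ref, "MKTAY", 0, 3, 2))
}
```

In this sketch, raising `ext` filters out hits that agree only on the first `k` residues, which is the effect a larger seed would have, while the map itself stays keyed on `k`-mers.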
The --match-seq-id-threshold argument sets the sequence identity percentage
required for a match during compression.
The --ext-seq-id-threshold argument sets the sequence identity percentage
required for a single instance of extension during compression.
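To make the two thresholds concrete, here is a minimal Go sketch of a sequence identity percentage check. It assumes two already-aligned, equal-length residue strings with no gaps. How CaBLASTP itself computes identity is not described here and may differ; `percentIdentity` and the sample sequences are hypothetical.

```go
package main

import "fmt"

// percentIdentity returns the integer percentage of positions at which two
// equal-length, already-aligned residue strings agree. It returns 0 for
// empty input or mismatched lengths. Hypothetical helper for illustration.
func percentIdentity(a, b string) int {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	same := 0
	for i := 0; i < len(a); i++ {
		if a[i] == b[i] {
			same++
		}
	}
	return same * 100 / len(a)
}

func main() {
	// Threshold value taken from the example command's
	// --match-seq-id-threshold 70.
	const matchThreshold = 70
	a, b := "MKTAYIAKQR", "MKTAYLAKQR" // made-up sequences
	id := percentIdentity(a, b)
	fmt.Printf("identity=%d%% passes=%v\n", id, id >= matchThreshold)
}
```

With a threshold of 70, the pair above (one mismatch in ten residues) would count as a match, while a pair agreeing at only six of ten positions would not.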
The -p argument simply sets the number of processor cores used during
compression, and bears no relevance to the resulting compressed database.
In this case, the input file is `nr.fasta`, and the output name for the
compressed database is `cablastp-nr20121212`.
Note that the compressed database is actually a directory that will be created
by `cablastp-compress`.
USAGE
=====
Run cablastp-compress --help, cablastp-deltasearch --help, cablastp-search
--help, or cablastp-psisearch --help for detailed help on command-line
arguments.
EXAMPLES
========
To perform a compressively accelerated DELTA-BLAST search, you might do:
cablastp-deltasearch -rpspath /path/to/cdd_delta \
/path/to/cablastp_database /path/to/query.fasta
where:
/path/to/cdd_delta is the local file path to your conserved domain
database (required for standard delta-blast as well)
/path/to/cablastp_database is the local file path to your cablastp
compressed database (it will be the path to cablastp-nr20121212 if you are
using the provided December, 2012 database)
/path/to/query.fasta is simply the local file path to the FASTA file you
wish to use as a query.
To perform a compressively accelerated BLASTP search, you might do:
cablastp-search /path/to/cablastp_database /path/to/query.fasta
where:
/path/to/cablastp_database is the local file path to the cablastp
compressed database, and
/path/to/query.fasta is the local file path to the FASTA file you wish to
use as a query.
Arguments the user wishes to pass to the underlying BLAST program, such as
adjusting the output format or the E-value threshold, may be passed via the
`--blast-args` flag.
For example, to specify XML output, one might run:
cablastp-search /path/to/cablastp_database /path/to/query.fasta \
--blast-args -outfmt 5
Where `-outfmt 5` is, as indicated in the NCBI blastp user guide, the
command-line argument for XML output.
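One way a wrapper program could forward trailing arguments like this is sketched below in Go. This is an assumption about the general technique, not a description of CaBLASTP's actual argument handling: everything after a `--blast-args` marker is treated as arguments for the underlying BLAST program. The function `splitBlastArgs` is hypothetical.

```go
package main

import "fmt"

// splitBlastArgs splits args at the first "--blast-args" marker. Arguments
// before the marker belong to the wrapper; arguments after it are meant to
// be forwarded unchanged. Hypothetical helper for illustration.
func splitBlastArgs(args []string) (own, blast []string) {
	for i, a := range args {
		if a == "--blast-args" {
			return args[:i], args[i+1:]
		}
	}
	return args, nil
}

func main() {
	args := []string{
		"/path/to/cablastp_database", "/path/to/query.fasta",
		"--blast-args", "-outfmt", "5",
	}
	own, blast := splitBlastArgs(args)
	fmt.Println("wrapper arguments:", own)
	fmt.Println("forwarded to BLAST:", blast)
}
```

Under this scheme the marker must come last on the command line, since anything after it is no longer interpreted by the wrapper.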
REPORTING BUGS
==============
If you find any bugs or have any problems using CaBLASTP, please submit a bug
report on our issue tracker:
https://github.com/BurntSushi/cablastp/issues