vsearch

Pas-Kapli edited this page Oct 16, 2016 · 5 revisions

Commands for installing vsearch

Cloning the repo. You will need Git, autoconf and automake to clone the repository and install VSEARCH. On a Debian-based Linux system, the three packages can be installed using the commands:

sudo apt-get install git autoconf automake

To clone the repository and install VSEARCH use the following commands:

$ git clone https://github.com/torognes/vsearch.git
$ cd vsearch
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Binary distribution. If cloning/compiling fails, you may directly download the pre-compiled VSEARCH binary for your system. If you are on a Linux system:

wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-linux-x86_64.tar.gz
tar xzf vsearch-2.3.0-linux-x86_64.tar.gz

Or, if you are on a macOS system:

wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-osx-x86_64.tar.gz
tar xzf vsearch-2.3.0-osx-x86_64.tar.gz

You will now have the binary distribution in a folder called vsearch-2.3.0-linux-x86_64 (or the corresponding osx folder on a Mac), in which you will find three subfolders: bin, man and doc. We recommend making a copy of, or a symbolic link to, the vsearch binary bin/vsearch in a folder included in your $PATH, and a copy of, or a symbolic link to, the vsearch man page man/vsearch.1 in a folder included in your $MANPATH. The PDF version of the manual is available in doc/vsearch_manual.pdf.
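
The linking step can be sketched as a small shell function. This is only an illustration: the function name link_vsearch and the target directories in the example are assumptions, not part of VSEARCH. The man page is placed in a man1 subfolder because man looks for section 1 pages there.

```bash
# link_vsearch DIST_DIR BIN_DIR MAN_DIR   (hypothetical helper, not shipped with VSEARCH)
# Symlink the vsearch binary and man page from an unpacked binary
# distribution into directories that are on $PATH and $MANPATH.
link_vsearch() {
  local dist_dir bin_dir=$2 man_dir=$3
  # Resolve to an absolute path so the links stay valid from any directory.
  dist_dir=$(cd "$1" && pwd) || return 1
  mkdir -p "$bin_dir" "$man_dir/man1"
  ln -sf "$dist_dir/bin/vsearch" "$bin_dir/vsearch"
  ln -sf "$dist_dir/man/vsearch.1" "$man_dir/man1/vsearch.1"
}

# Example (target directories are assumptions; use any that are on your paths):
# link_vsearch vsearch-2.3.0-linux-x86_64 "$HOME/bin" "$HOME/man"
```

Symbolic links are used rather than copies so that the unpacked distribution remains the single location of the files.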

Introduction to VSEARCH

Overview. VSEARCH includes commands to perform de novo clustering using a greedy, heuristic, centroid-based algorithm with an adjustable sequence similarity threshold specified with the --id option (e.g., --id 0.97). The input sequences are either processed in the user-supplied order (--cluster_smallmem) or pre-sorted by length (--cluster_fast) or abundance (--cluster_size).

Method. Each input sequence is used as a query against an initially empty database of centroid sequences. The query sequence is clustered with the first centroid sequence found with similarity equal to or above the threshold (--id). If no matches are found, the query sequence becomes the centroid of a new cluster and is added to the database. If --maxaccepts is higher than 1 (default: 1), several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid presenting the highest sequence similarity (distance-based greedy clustering), or, if the --sizeorder option is used, the centroid with the highest abundance (abundance-based greedy clustering).
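
The method can be illustrated with a toy script. This is a minimal sketch, not VSEARCH's implementation: the function names are hypothetical, it mimics only the default behaviour with --maxaccepts 1 (first centroid at or above the threshold), and it assumes equal-length sequences so that similarity is simply the percentage of matching positions, whereas VSEARCH uses pairwise alignment.

```bash
#!/usr/bin/env bash
# Toy greedy centroid clustering (illustration only, not VSEARCH's code).

# percent_identity A B: integer percentage of positions where A and B agree.
# Assumption: A and B are non-empty and have the same length.
percent_identity() {
  local a=$1 b=$2 matches=0 i
  for ((i = 0; i < ${#a}; i++)); do
    if [[ ${a:i:1} == "${b:i:1}" ]]; then
      matches=$((matches + 1))
    fi
  done
  echo $((100 * matches / ${#a}))
}

# greedy_cluster THRESHOLD SEQ...
# Prints one line per sequence: "<sequence> <cluster index>".
greedy_cluster() {
  local threshold=$1; shift
  local centroids=() n=0 seq c assigned
  for seq in "$@"; do
    assigned=-1
    for ((c = 0; c < n; c++)); do
      if (( $(percent_identity "$seq" "${centroids[c]}") >= threshold )); then
        assigned=$c        # first centroid at or above the threshold wins
        break
      fi
    done
    if (( assigned < 0 )); then
      centroids+=("$seq")  # no match: the query founds a new cluster
      assigned=$n
      n=$((n + 1))
    fi
    echo "$seq $assigned"
  done
}
```

For example, greedy_cluster 75 AAAA AAAT AATT yields two clusters, while greedy_cluster 75 AAAT AAAA AATT puts all three sequences in one cluster, because AAAT is within 75% of both other sequences. The same sequences and threshold can therefore give different clusterings depending on input order.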

Delimitation with VSEARCH

VSEARCH clustering results are sensitive to two factors: the similarity threshold (--id) and the order of the input sequences. As mentioned earlier, VSEARCH offers multiple commands that control the order of sequences (--cluster_smallmem, --cluster_fast, --cluster_size).

Input file: BR_cob_57ind_no_outgr.fasta

vsearch --cluster_fast BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-cf-97.fa --msaout BR_msaout-cf-97.txt
vsearch --cluster_smallmem BR_cob_57ind_no_outgr.fasta --usersort --id 0.97 --centroids BR_centroids-sm-97.fa --msaout BR_msaout-sm-97.txt
vsearch --cluster_size BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-sz-97.fa --msaout BR_msaout-sz-97.txt

Note: When using --cluster_smallmem, the --usersort option indicates that the input sequences are not pre-sorted by length.
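
Each --centroids file is a FASTA file with one record per cluster, so counting its header lines gives the number of clusters for a run. A minimal sketch follows; count_clusters is a hypothetical helper name, not a VSEARCH command.

```bash
# count_clusters FILE...   (hypothetical helper)
# Print "<file><TAB><number of FASTA records>" for each centroid file.
count_clusters() {
  local f
  for f in "$@"; do
    # Every FASTA record starts with a '>' header line.
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
  done
}

# Example: compare the three sequence orders at the 0.97 threshold
# count_clusters BR_centroids-cf-97.fa BR_centroids-sm-97.fa BR_centroids-sz-97.fa
```

Comparing the counts side by side shows whether the sequence order changed the number of clusters.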

Re-run the delimitations using --id 0.99 and --id 0.98.
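
The re-runs can be scripted as a loop over thresholds and clustering commands. This is a sketch under assumptions: run_delimitations and the VSEARCH variable (path to the binary, defaulting to vsearch) are hypothetical names, and the output names follow the cf/sm/sz pattern used in the commands above.

```bash
# run_delimitations INPUT PREFIX ID...   (hypothetical helper)
# Runs the three clustering commands once per --id value.
run_delimitations() {
  local input=$1 prefix=$2 id pct mode opts
  shift 2
  for id in "$@"; do
    pct=${id#0.}                      # 0.99 -> 99, used in output names
    for mode in cf sm sz; do
      case $mode in
        cf) opts="--cluster_fast $input" ;;
        sm) opts="--cluster_smallmem $input --usersort" ;;
        sz) opts="--cluster_size $input" ;;
      esac
      # $opts is unquoted on purpose so it splits into separate arguments
      # (this assumes the input file name contains no spaces).
      "${VSEARCH:-vsearch}" $opts --id "$id" \
        --centroids "${prefix}_centroids-${mode}-${pct}.fa" \
        --msaout "${prefix}_msaout-${mode}-${pct}.txt"
    done
  done
}

# Example:
# run_delimitations BR_cob_57ind_no_outgr.fasta BR 0.99 0.98
```

Setting VSEARCH=echo before the call prints the commands instead of running them, which is a convenient way to check the generated file names first.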

Re-do the exercise for Carabus using this input file

Produced output files

Download all output files for Branchiomma and for Carabus

More information on VSEARCH

Check the VSEARCH wiki page on clustering.