
vsearch

Pas-Kapli edited this page Oct 13, 2016 · 5 revisions

Commands for installing vsearch

Cloning the repo. You will need Git, autoconf and automake to clone the repository and install VSEARCH. On a Debian-based Linux system, the three packages can be installed using the command:

sudo apt-get install git autoconf automake

To clone the repository and install VSEARCH use the following commands:

$ git clone https://github.com/torognes/vsearch.git
$ cd vsearch
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Binary distribution. If cloning/compiling fails, you may directly download the pre-compiled VSEARCH binary for your system. If you are on a Linux system:

wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-linux-x86_64.tar.gz
tar xzf vsearch-2.3.0-linux-x86_64.tar.gz

Or, if you are on a Mac system:

wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-osx-x86_64.tar.gz
tar xzf vsearch-2.3.0-osx-x86_64.tar.gz

You will now have the binary distribution in a folder called vsearch-2.3.0-linux-x86_64 (or vsearch-2.3.0-osx-x86_64 on a Mac), in which you will find three subfolders: bin, man and doc. We recommend making a copy of, or a symbolic link to, the vsearch binary bin/vsearch in a folder included in your $PATH, and a copy of, or a symbolic link to, the vsearch man page man/vsearch.1 in a folder included in your $MANPATH. The PDF version of the manual is available in doc/vsearch_manual.pdf.
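The linking step can be sketched as a small shell function. The function name link_vsearch and the default prefix $HOME/.local are assumptions made for illustration, not part of VSEARCH; adjust them to whatever folders are already on your $PATH and $MANPATH.

```shell
# Hypothetical helper (not part of VSEARCH): link the unpacked binary and
# man page under a prefix.
# $1 = unpacked distribution folder, $2 = prefix (assumed default: $HOME/.local)
link_vsearch() {
    # Resolve to an absolute path so the symbolic links do not dangle.
    dist=$(cd "$1" && pwd) || return 1
    prefix=${2:-$HOME/.local}
    mkdir -p "$prefix/bin" "$prefix/share/man/man1" || return 1
    ln -sf "$dist/bin/vsearch" "$prefix/bin/vsearch"
    ln -sf "$dist/man/vsearch.1" "$prefix/share/man/man1/vsearch.1"
}

# Example (run from the folder where the archive was unpacked):
#   link_vsearch vsearch-2.3.0-linux-x86_64
#   export PATH="$HOME/.local/bin:$PATH"
#   export MANPATH="$HOME/.local/share/man:${MANPATH:-}"
```

Symbolic links are used here rather than copies so that replacing the unpacked folder with a newer release only requires re-running the function.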

Usage

Overview. VSEARCH includes commands to perform de novo clustering using a greedy, heuristic, centroid-based algorithm with an adjustable sequence similarity threshold specified with the --id option (e.g., --id 0.97). The input sequences are either processed in the user-supplied order (--cluster_smallmem) or pre-sorted by decreasing length (--cluster_fast) or decreasing abundance (--cluster_size).

Method. Each input sequence is used as a query against an initially empty database of centroid sequences. The query sequence is clustered with the first centroid sequence found with similarity equal to or above the threshold (--id). If no matches are found, the query sequence becomes the centroid of a new cluster and is added to the database. If --maxaccepts is higher than 1 (default: 1), several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid presenting the highest sequence similarity (distance-based greedy clustering), or, if the --sizeorder option is used, the centroid with the highest abundance (abundance-based greedy clustering).
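The method above can be illustrated with a toy sketch. This is not VSEARCH's implementation: it assumes one sequence per input line, all sequences of equal length, and it measures identity as the fraction of matching positions instead of using pairwise alignment. Centroids are scanned in the order they were created and the first match wins, which corresponds to the default --maxaccepts 1 behaviour only in spirit. The function name greedy_cluster is hypothetical.

```shell
# Toy sketch of greedy centroid-based clustering (not VSEARCH's implementation).
# Assumptions: one sequence per line, equal lengths, identity = fraction of
# matching positions (no alignment).
# $1 = identity threshold, e.g. 0.8
greedy_cluster() {
    awk -v id="$1" '
    # Fraction of positions at which strings a and b agree.
    function identity(a, b,    i, n, m) {
        n = length(a)
        if (n == 0 || length(b) != n) { return 0 }
        m = 0
        for (i = 1; i <= n; i++) {
            if (substr(a, i, 1) == substr(b, i, 1)) { m++ }
        }
        return m / n
    }
    {
        hit = 0
        # Compare the query with each centroid; stop at the first match.
        for (c = 1; c <= ncent; c++) {
            if (identity($0, cent[c]) >= id) { hit = c; break }
        }
        # No match: the query becomes the centroid of a new cluster.
        if (hit == 0) { cent[++ncent] = $0; hit = ncent }
        # Print the cluster number and the sequence, tab-separated.
        print hit "\t" $0
    }'
}

printf '%s\n' AAAAAAAAAA AAAAAAAAAT CCCCCCCCCC AAAAAAATTT CCCCCCCCCG |
    greedy_cluster 0.8
```

With a threshold of 0.8, the five toy sequences are assigned to clusters 1, 1, 2, 3 and 2: the fourth sequence matches the first centroid at only 7 of 10 positions, so it starts a new cluster.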

Examples

vsearch --cluster_fast BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids centroids-cf.fa --msaout msaout-cf.txt
vsearch --cluster_smallmem BR_cob_57ind_no_outgr.fasta --usersort --id 0.97 --centroids centroids-sm.fa --msaout msaout-sm.txt
vsearch --cluster_size BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids centroids-sz.fa --msaout msaout-sz.txt

Note: When using --cluster_smallmem, the --usersort option tells VSEARCH to accept the input sequences in any order; without it, the input is expected to be sorted by decreasing length.
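An alternative to --usersort is to sort the input by decreasing length first and then run --cluster_smallmem on the sorted file. The sketch below uses the VSEARCH --sortbylength command with --output; the wrapper name cluster_presorted and the name of the intermediate sorted file are assumptions made for illustration.

```shell
# Hypothetical helper (not part of VSEARCH): sort by decreasing length,
# then cluster the sorted file with --cluster_smallmem.
# $1 = input FASTA file, $2 = identity threshold (assumed default: 0.97)
cluster_presorted() {
    input=$1
    id=${2:-0.97}
    # Intermediate file name derived from the input name (an assumption).
    sorted="${input%.*}.sorted.fasta"
    vsearch --sortbylength "$input" --output "$sorted" &&
    vsearch --cluster_smallmem "$sorted" --id "$id" --centroids centroids-sm.fa
}

# Example:
#   cluster_presorted BR_cob_57ind_no_outgr.fasta 0.97
```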

Exercise files

Input files

Filename Description
CHANGE FILES
Anolis.fas Input sequences

Produced output files

Filename Description
[centroids-cf.fa](place link) Centroids for --cluster_fast
[centroids-sm.fa](place link) Centroids for --cluster_smallmem
[centroids-sz.fa](place link) Centroids for --cluster_size
[msaout-cf.txt](place link) Clusters for --cluster_fast
[msaout-sm.txt](place link) Clusters for --cluster_smallmem
[msaout-sz.txt](place link) Clusters for --cluster_size
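A quick way to compare the three runs is to count the centroids each one produced, since every cluster has exactly one centroid. The sketch below simply counts FASTA header lines; the function name count_fasta_records is hypothetical.

```shell
# Hypothetical helper: count the records in a FASTA file by counting the
# header lines, which start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the number of centroids (clusters) for each output file that exists.
for f in centroids-cf.fa centroids-sm.fa centroids-sz.fa; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
    fi
done
```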

More information on VSEARCH

Check the VSEARCH wiki page on clustering.
