vsearch

Introduction Vsearch

Overview. VSEARCH includes commands to perform de novo clustering using a greedy, heuristic, centroid-based algorithm with an adjustable sequence similarity threshold specified with the --id option (e.g., --id 0.97). The input sequences are either processed in the user-supplied order (--cluster_smallmem) or pre-sorted by length (--cluster_fast) or abundance (--cluster_size).

Method. Each input sequence is used as a query against an initially empty database of centroid sequences. The query sequence is clustered with the first centroid sequence found with similarity equal to or above the threshold (--id). If no matches are found, the query sequence becomes the centroid of a new cluster and is added to the database. If --maxaccepts is higher than 1 (default: 1), several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid presenting the highest sequence similarity (distance-based greedy clustering), or, if the --sizeorder option is used, the centroid with the highest abundance (abundance-based greedy clustering).
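The greedy procedure described above can be illustrated with a toy sketch. This is not how VSEARCH is implemented: it assumes equal-length sequences and defines similarity as the fraction of identical positions, whereas VSEARCH relies on k-mer filtering and pairwise alignment. The function name greedy_cluster and the test sequences are made up for illustration, and only the default behaviour with --maxaccepts 1 (first sufficiently similar centroid wins) is mimicked.

```shell
# Toy greedy centroid clustering, one sequence per input line.
# Assumption: all sequences have the same length; similarity is the
# fraction of identical positions (a simplification, not the VSEARCH metric).
greedy_cluster() {
  # $1 = similarity threshold, playing the role of --id
  awk -v id="$1" '
    function sim(a, b,    i, n, same) {
      n = length(a); same = 0
      for (i = 1; i <= n; i++)
        if (substr(a, i, 1) == substr(b, i, 1)) same++
      return same / n
    }
    {
      hit = 0
      # Compare the query with each centroid in order; stop at the first match.
      for (c = 1; c <= ncent; c++)
        if (sim($0, cent[c]) >= id) { hit = c; break }
      # No match: the query becomes the centroid of a new cluster.
      if (!hit) { cent[++ncent] = $0; hit = ncent }
      print $0 "\tcluster" hit
    }'
}

printf '%s\n' AAAAAAAAAA AAAAAAAAAT CCCCCCCCCC AAAAAAAATT | greedy_cluster 0.85
```

With a threshold of 0.85, the fourth sequence starts its own cluster even though it is 90% identical to the second one, because the second sequence is a cluster member and not a centroid. This is one reason why the input order matters.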

Check the VSEARCH wiki page on clustering.

Commands for installing vsearch

Cloning the repo. You will need Git, autoconf and automake to clone the repository and build VSEARCH. On a Debian-based Linux system, the three packages can be installed with the command:

sudo apt-get install git autoconf automake

To clone the repository and install VSEARCH use the following commands:

$ git clone https://github.com/torognes/vsearch.git
$ cd vsearch
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install

Binary distribution. If cloning/compiling fails, you may directly download the pre-compiled VSEARCH binary for your system. If you are on a Linux system:

wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-linux-x86_64.tar.gz
tar xzf vsearch-2.3.0-linux-x86_64.tar.gz

Or, if you are on a macOS system:

wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-osx-x86_64.tar.gz
tar xzf vsearch-2.3.0-osx-x86_64.tar.gz

You will now have the binary distribution in a folder called vsearch-2.3.0-linux-x86_64 (or vsearch-2.3.0-osx-x86_64 on macOS), in which you will find three subfolders: bin, man and doc. We recommend making a copy of, or a symbolic link to, the vsearch binary bin/vsearch in a folder included in your $PATH, and a copy of, or a symbolic link to, the vsearch man page man/vsearch.1 in a folder included in your $MANPATH. The PDF version of the manual is available in doc/vsearch_manual.pdf.
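One possible way to create the symbolic links is sketched below. The helper name link_vsearch and the target folders ~/bin and ~/man are assumptions, not part of the VSEARCH distribution; use folders that are actually listed in your own $PATH and $MANPATH. The man1 subfolder is also an assumption about how man pages are laid out on your system.

```shell
# Hypothetical helper: link the vsearch binary and man page into chosen folders.
# $1 = unpacked distribution folder, $2 = folder on $PATH, $3 = folder on $MANPATH
link_vsearch() {
  dist="$1"; bindir="$2"; mandir="$3"
  mkdir -p "$bindir" "$mandir/man1"
  # -s creates a symbolic link, -f replaces an existing link of the same name
  ln -sf "$dist/bin/vsearch" "$bindir/vsearch"
  ln -sf "$dist/man/vsearch.1" "$mandir/man1/vsearch.1"
}

# Example call (paths are assumptions; adjust them to your system):
# link_vsearch "$PWD/vsearch-2.3.0-linux-x86_64" "$HOME/bin" "$HOME/man"
```

Afterwards, which vsearch should print the path of the link if the chosen folder is on your $PATH.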

Exercise: Delimitation with Vsearch

Working directory: ~/workshop_exercises/distance_methods/branchiomma/vsearch

Note: If you did not create this directory during the Linux tutorial, create it now using mkdir -p.

Vsearch is sensitive to two parameters: the similarity threshold (--id) and the order of the input sequences. As mentioned earlier, Vsearch offers three commands that control the order of sequences (--cluster_smallmem, --cluster_fast, --cluster_size).

Input file: BR_cob_57ind_no_outgr.fasta

$ vsearch --cluster_fast BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-cf-97.fa --msaout BR_msaout-cf-97.txt
vsearch v2.3.0_linux_x86_64, 7.7GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file BR_cob_57ind_no_outgr.fasta 100%  
21610 nt in 53 seqs, min 311, max 435, avg 408
Masking 100% 
Sorting by length 100%
Counting unique k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 13 Size min 1, max 10, avg 4.1
Singletons: 3, 5.7% of seqs, 23.1% of clusters
Multiple alignments 100%

The number of clusters is printed on the screen (Clusters: 13 Size min 1, max 10, avg 4.1) and is equal to the number of centroid sequences saved in the output file defined with the --centroids argument.

Count the centroid sequences saved in the BR_centroids-cf-97.fa file with the following bash command:

$ grep ">" BR_centroids-cf-97.fa | wc -l
13

Continue with the two other sequence order options:

$ vsearch --cluster_smallmem BR_cob_57ind_no_outgr.fasta --usersort --id 0.97 --centroids BR_centroids-sm-97.fa --msaout BR_msaout-sm-97.txt

$ vsearch --cluster_size BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-sz-97.fa --msaout BR_msaout-sz-97.txt

Note: When using `--cluster_smallmem`, the option `--usersort` indicates that sequences are not pre-sorted by length.
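To compare how the sequence order affects the number of clusters, the centroid count can be taken for every centroid file at once. The sketch below assumes the output files follow the BR_centroids-*.fa naming used in this exercise; count_seqs is a helper name introduced here for illustration.

```shell
# Count FASTA records in a file: every record starts with a ">" header line.
count_seqs() {
  grep -c '^>' "$1"
}

# Print one line per centroid file: file name, tab, number of centroids.
for f in BR_centroids-*.fa; do
  [ -e "$f" ] || continue   # skip when no file matches the pattern
  printf '%s\t%s\n' "$f" "$(count_seqs "$f")"
done
```

Anchoring the pattern with ^ counts only lines that begin with ">", so a ">" appearing elsewhere in a header is not counted twice.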

Re-run the delimitations using `--id 0.99` and `--id 0.98` for each sequence order option
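The six re-runs can be scripted instead of typed one by one. The following is a sketch under a few assumptions: it reuses the input file and output naming scheme from the commands above, and DRY_RUN is a hypothetical switch introduced here so that the commands are only printed by default. Set DRY_RUN=0 to actually execute them.

```shell
# Sketch: one vsearch command per sequence order option and threshold.
# DRY_RUN is a hypothetical switch; the default (1) only prints the commands.
input="BR_cob_57ind_no_outgr.fasta"

run_all() {
  for spec in cluster_fast:cf cluster_smallmem:sm cluster_size:sz; do
    mode="${spec%%:*}"   # vsearch clustering command
    tag="${spec##*:}"    # short tag used in the output file names
    for id in 0.99 0.98; do
      pct="${id#0.}"     # 0.99 -> 99
      set -- vsearch "--$mode" "$input"
      # --usersort tells --cluster_smallmem the input is not length-sorted
      if [ "$mode" = "cluster_smallmem" ]; then
        set -- "$@" --usersort
      fi
      set -- "$@" --id "$id" --centroids "BR_centroids-$tag-$pct.fa" --msaout "BR_msaout-$tag-$pct.txt"
      if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
      else
        "$@"
      fi
    done
  done
}

run_all
```

Printing the commands first makes it easy to check the file names before any existing output is overwritten.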

Re-do the exercise for Carabus using this input file

Produced output files

Download all output files for Branchiomma and for Carabus
