vsearch
Overview. VSEARCH includes commands to perform de novo clustering using a greedy and heuristic
centroid-based algorithm with an adjustable sequence similarity threshold specified with the --id
option (e.g., --id 0.97). The input sequences are either processed in the user-supplied order
(--cluster_smallmem) or pre-sorted based on length (--cluster_fast) or abundance (--cluster_size).
Method. Each input sequence is used as a query against an initially empty database of centroid sequences. The query sequence is clustered with the first centroid sequence found with similarity equal to or above the threshold (--id). If no matches are found, the query sequence becomes the centroid of a new cluster and is added to the database. If --maxaccepts is higher than 1 (default: 1), several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid presenting the highest sequence similarity (distance-based greedy clustering), or, if the --sizeorder option is used, the centroid with the highest abundance (abundance-based greedy clustering).
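The greedy loop described above can be illustrated with a small toy sketch. This is not how VSEARCH is implemented: the function name `greedy_cluster` is hypothetical, and identity is assumed here to be the fraction of matching positions between unaligned sequences, whereas VSEARCH computes similarity from pairwise alignments. The sketch mimics only the default behaviour with --maxaccepts 1, where a query joins the first centroid that meets the threshold.

```shell
# Toy greedy centroid clustering (illustration only, not VSEARCH's code).
# Input:  one sequence per line on stdin; $1 is the identity threshold (0-1).
# Output: "<query><TAB><centroid>" for every input sequence.
# Assumption: identity = matching positions / length of the longer sequence,
# compared position by position without any alignment.
greedy_cluster() {
  awk -v id="$1" '
    function identity(a, b,    i, n, m, la, lb) {
      la = length(a); lb = length(b)
      n = (la > lb) ? la : lb
      if (n == 0) return 0
      m = 0
      for (i = 1; i <= n; i++)
        if (i <= la && i <= lb && substr(a, i, 1) == substr(b, i, 1)) m++
      return m / n
    }
    NF == 0 { next }                      # skip blank lines
    {
      hit = ""
      # Search the centroid database in order; stop at the first hit.
      for (k = 1; k <= nc; k++)
        if (identity($0, centroid[k]) >= id) { hit = centroid[k]; break }
      # No hit: the query becomes the centroid of a new cluster.
      if (hit == "") { centroid[++nc] = $0; hit = $0 }
      printf "%s\t%s\n", $0, hit
    }
  '
}

# Four made-up sequences clustered at a 0.75 threshold.
printf '%s\n' ACGTACGT ACGTACGA TTTTCCCC ACGTACGT | greedy_cluster 0.75
```

With these made-up sequences, the second and fourth queries join the first centroid, and the third founds its own cluster. Feeding the same sequences in a different order can change which ones become centroids, which is why the input order matters.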
Check the VSEARCH wiki page on clustering.
Commands for installing vsearch
Cloning the repo. You will need Git, autoconf and automake to clone the repository and install VSEARCH. On a Debian-based Linux system, the three packages can be installed using the command:
sudo apt-get install git autoconf automake
To clone the repository and install VSEARCH, use the following commands:
$ git clone https://github.com/torognes/vsearch.git
$ cd vsearch
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
Binary distribution. If cloning/compiling fails, you may directly download the pre-compiled VSEARCH binary for your system. If you are on a Linux system:
wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-linux-x86_64.tar.gz
tar xzf vsearch-2.3.0-linux-x86_64.tar.gz
Or, if you are on a Mac system:
wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-osx-x86_64.tar.gz
tar xzf vsearch-2.3.0-osx-x86_64.tar.gz
You will now have the binary distribution in a folder called vsearch-2.3.0-linux-x86_64 (or vsearch-2.3.0-osx-x86_64 on a Mac), in which you will find three subfolders: bin, man and doc. We recommend making a copy or a symbolic link to the vsearch binary bin/vsearch in a folder included in your $PATH, and a copy or a symbolic link to the vsearch man page man/vsearch.1 in a folder included in your $MANPATH. The PDF version of the manual is available in doc/vsearch_manual.pdf.
VSEARCH clustering results are sensitive to two parameters: the similarity threshold (--id) and the order in which the sequences are processed. As mentioned above, VSEARCH offers three commands that control the order of sequences (--cluster_smallmem, --cluster_fast, --cluster_size).
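One way to explore this sensitivity is to run every clustering command at several thresholds and compare the results. The sketch below only prints the command lines, so they can be inspected before anything is run. The helper name `print_sweep`, the output prefix argument, and the thresholds 0.95 and 0.99 are assumptions made for illustration; the output file names follow the naming pattern used in this exercise.

```shell
# Print (do not run) one vsearch command per clustering mode and threshold.
# Usage: print_sweep <input.fasta> <output prefix> <id> [<id> ...]
# The thresholds are whatever the caller passes; none are prescribed here.
print_sweep() {
  input=$1
  prefix=$2
  shift 2
  for id in "$@"; do
    # Turn 0.97 into 97 for use in the output file names.
    pct=$(awk -v x="$id" 'BEGIN { printf "%d", x * 100 + 0.5 }')
    for mode in cluster_fast:cf cluster_smallmem:sm cluster_size:sz; do
      cmd=${mode%%:*}
      tag=${mode##*:}
      extra=""
      # --cluster_smallmem needs --usersort when the input is not length-sorted.
      if [ "$cmd" = "cluster_smallmem" ]; then
        extra=" --usersort"
      fi
      echo "vsearch --$cmd $input$extra --id $id --centroids ${prefix}_centroids-$tag-$pct.fa --msaout ${prefix}_msaout-$tag-$pct.txt"
    done
  done
}

# Example: three modes at three thresholds gives nine command lines.
print_sweep BR_cob_57ind_no_outgr.fasta BR 0.95 0.97 0.99
```

Once the printed commands look right, piping the output to `sh` runs them one after another.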
Input file: BR_cob_57ind_no_outgr.fasta
vsearch --cluster_fast BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-cf-97.fa --msaout BR_msaout-cf-97.txt
vsearch --cluster_smallmem BR_cob_57ind_no_outgr.fasta --usersort --id 0.97 --centroids BR_centroids-sm-97.fa --msaout BR_msaout-sm-97.txt
vsearch --cluster_size BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-sz-97.fa --msaout BR_msaout-sz-97.txt
Note: When using `--cluster_smallmem`, the `--usersort` option tells VSEARCH to accept the sequences in the order given, even though they are not pre-sorted by length.
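A quick way to compare the three runs is to count the centroids each one produced, since every centroid corresponds to one cluster. The sketch below assumes the centroid files are ordinary FASTA files with one header line starting with ">" per sequence; the helper name `count_records` is hypothetical.

```shell
# Count FASTA records (header lines starting with ">") in each file given.
# Prints "<file><TAB><count>", or "<file><TAB>missing" if a file is unreadable.
count_records() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    else
      printf '%s\tmissing\n' "$f"
    fi
  done
}

# Compare the number of clusters found by the three commands above.
count_records BR_centroids-cf-97.fa BR_centroids-sm-97.fa BR_centroids-sz-97.fa
```

If the three counts differ, the difference comes from sequence order alone, because all three runs use the same threshold.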
Re-do the exercise for Carabus using this input file
Download all output files for Branchiomma and for Carabus