vsearch
Overview. VSEARCH includes commands to perform de novo clustering using a greedy and heuristic
centroid-based algorithm with an adjustable sequence similarity threshold specified with the --id
option (e.g., --id 0.97). The input sequences are either processed in the user-supplied order
(--cluster_smallmem) or pre-sorted based on length (--cluster_fast) or abundance (--cluster_size).
Method. Each input sequence is used as a query against an initially empty database of centroid sequences. The query sequence is clustered with the first centroid sequence found with similarity equal to or above the threshold (--id). If no matches are found, the query sequence becomes the centroid of a new cluster and is added to the database. If --maxaccepts is higher than 1 (default: 1), several centroids with sufficient sequence similarity may be found and considered. By default, the query is clustered with the centroid presenting the highest sequence similarity (distance-based greedy clustering), or, if the --sizeorder option is used, the centroid with the highest abundance (abundance-based greedy clustering).
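To make the method concrete, here is a toy sketch of greedy centroid clustering in shell and awk. It is not VSEARCH's implementation: the function name greedy_cluster, the equal-length input sequences, and the identity measure (fraction of matching positions, with no alignment) are assumptions made for illustration. Centroids are scanned in the order they were created and the first one at or above the threshold is accepted, which only mimics the default behaviour of stopping at the first accepted centroid.

```shell
# Toy greedy centroid clustering (illustration only).
# Assumptions: all sequences have the same length, and identity is the
# fraction of matching positions rather than an alignment-based score.
greedy_cluster() {
  # $1 = identity threshold (plays the role of --id)
  # Sequences are read from stdin, one per line.
  awk -v id="$1" '
    function identity(a, b,    i, n, m) {
      n = length(a); m = 0
      for (i = 1; i <= n; i++)
        if (substr(a, i, 1) == substr(b, i, 1)) m++
      return m / n
    }
    {
      hit = 0
      # Scan centroids in insertion order and stop at the first match
      for (c = 1; c <= ncent; c++)
        if (identity($0, centroid[c]) >= id) { hit = c; break }
      # No match: the query becomes the centroid of a new cluster
      if (!hit) { centroid[++ncent] = $0; hit = ncent }
      printf "%s\tcluster %d\n", $0, hit
    }
  '
}

printf '%s\n' AAAAAAAAAA AAAAAAAAAT CCCCCCCCCC AAAAAAAATT | greedy_cluster 0.85
```

With a threshold of 0.85, the second sequence joins the first centroid (identity 0.9). The fourth sequence starts its own cluster because queries are compared only with centroids: its identity to the first centroid is 0.8, even though it is 0.9 identical to the second sequence.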
Check the VSEARCH wiki page on clustering.
Commands for installing vsearch
Cloning the repo. You will need Git, autoconf and automake to clone the repository and install VSEARCH. On a Debian-based Linux system, the three packages can be installed using the commands:
sudo apt-get install git autoconf automake
To clone the repository and install VSEARCH, use the following commands:
$ git clone https://github.com/torognes/vsearch.git
$ cd vsearch
$ ./autogen.sh
$ ./configure
$ make
$ sudo make install
Binary distribution. If cloning/compiling fails, you may directly download the pre-compiled VSEARCH binary for your system. If you are on a Linux system:
wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-linux-x86_64.tar.gz
tar xzf vsearch-2.3.0-linux-x86_64.tar.gz
Or, if you are on a Mac system:
wget https://github.com/torognes/vsearch/releases/download/v2.3.0/vsearch-2.3.0-osx-x86_64.tar.gz
tar xzf vsearch-2.3.0-osx-x86_64.tar.gz
You will now have the binary distribution in a folder called vsearch-2.3.0-linux-x86_64 (or vsearch-2.3.0-osx-x86_64 on a Mac), in which you will find three subfolders: bin, man and doc. We recommend making a copy of, or a symbolic link to, the vsearch binary bin/vsearch in a folder included in your $PATH, and likewise for the vsearch man page man/vsearch.1 in a folder included in your $MANPATH. The PDF version of the manual is available in doc/vsearch_manual.pdf.
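The recommended symbolic link to the binary could be created roughly as follows. This is a minimal sketch: the helper name link_vsearch and the target directory ~/bin are assumptions, so substitute any folder that is already in your $PATH.

```shell
# Link the unpacked vsearch binary into a directory on $PATH.
# link_vsearch is a hypothetical helper, not part of the VSEARCH distribution.
link_vsearch() {
  dist_dir=$1   # unpacked folder, e.g. vsearch-2.3.0-linux-x86_64
  bin_dir=$2    # a directory on your $PATH, e.g. "$HOME/bin" (assumption)
  if [ ! -x "$dist_dir/bin/vsearch" ]; then
    echo "no executable found at $dist_dir/bin/vsearch" >&2
    return 1
  fi
  mkdir -p "$bin_dir"
  # Use an absolute target so the link works from any directory
  ln -sf "$(cd "$dist_dir/bin" && pwd)/vsearch" "$bin_dir/vsearch"
}

# Example call, commented out so that nothing is changed by accident:
# link_vsearch vsearch-2.3.0-linux-x86_64 "$HOME/bin"
```

The same pattern applies to man/vsearch.1 and a folder in your $MANPATH.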
Working directory: ~/workshop_exercises/distance_methods/branchiomma/vsearch
Note: If you didn't create this directory during the Linux tutorial, create it now using mkdir.
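For example, the working directory can be created in one step like this. The variable name WORKDIR is only a convenience introduced here.

```shell
# -p creates any missing parent directories and does nothing if the path exists
WORKDIR="$HOME/workshop_exercises/distance_methods/branchiomma/vsearch"
mkdir -p "$WORKDIR"
cd "$WORKDIR"
pwd
```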
VSEARCH clustering is sensitive to two factors: the sequence similarity threshold (--id) and the order of the sequences. As mentioned earlier, VSEARCH offers several options for controlling the order of sequences (--cluster_smallmem, --cluster_fast, --cluster_size).
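To see what ordering by length means in practice, the sketch below sorts a small FASTA file by decreasing sequence length using awk and sort. It only imitates the pre-sorting that --cluster_fast applies before clustering. The function name sort_fasta_by_length and the three toy sequences are made up for illustration, and headers are assumed to contain no tab characters.

```shell
# Sort FASTA records by decreasing sequence length (toy stand-in for the
# length pre-sorting step; not how VSEARCH implements it).
sort_fasta_by_length() {
  awk '
    /^>/ { if (hdr != "") print length(seq) "\t" hdr "\t" seq
           hdr = $0; seq = ""; next }
         { seq = seq $0 }                 # join wrapped sequence lines
    END  { if (hdr != "") print length(seq) "\t" hdr "\t" seq }
  ' "$1" |
  sort -t "$(printf '\t')" -k1,1nr -s |   # -s keeps input order for ties
  awk -F '\t' '{ print $2; print $3 }'
}

# Toy input: s2 is wrapped over two lines and is the longest sequence
demo=$(mktemp)
cat > "$demo" <<'EOF'
>s1
ACGT
>s2
ACGTAC
GT
>s3
ACGTAC
EOF
sort_fasta_by_length "$demo"
rm -f "$demo"
```

The records come out in the order s2, s3, s1. Because the first sequences processed become the first centroids, changing this order can change which clusters are formed.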
Input file: BR_cob_57ind_no_outgr.fasta
$ vsearch --cluster_fast BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-cf-97.fa --msaout BR_msaout-cf-97.txt
vsearch v2.3.0_linux_x86_64, 7.7GB RAM, 4 cores
https://github.com/torognes/vsearch
Reading file BR_cob_57ind_no_outgr.fasta 100%
21610 nt in 53 seqs, min 311, max 435, avg 408
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 13 Size min 1, max 10, avg 4.1
Singletons: 3, 5.7% of seqs, 23.1% of clusters
Multiple alignments 100%
The number of clusters is printed on the screen (Clusters: 13 Size min 1, max 10, avg 4.1) and is equal to the number of centroid sequences saved in the output file defined with the --centroids argument.
Count the centroid sequences saved in the BR_centroids-cf-97.fa file with the following bash command:
$ grep ">" BR_centroids-cf-97.fa | wc -l
13
Continue with a different order of the input sequences:
$ vsearch --cluster_smallmem BR_cob_57ind_no_outgr.fasta --usersort --id 0.97 --centroids BR_centroids-sm-97.fa --msaout BR_msaout-sm-97.txt
$ vsearch --cluster_size BR_cob_57ind_no_outgr.fasta --id 0.97 --centroids BR_centroids-sz-97.fa --msaout BR_msaout-sz-97.txt
Note: When using --cluster_smallmem, the option --usersort indicates that sequences are not pre-sorted by length.
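Once all three runs have finished, the number of clusters from each ordering can be compared with a short loop. This is a sketch: the helper name count_seqs is introduced here, and the file names are the --centroids outputs used in the commands above.

```shell
# Count FASTA records, i.e. lines that start with ">"
count_seqs() {
  grep -c '^>' "$1"
}

# One line per centroid file: file name, then number of centroids
for f in BR_centroids-cf-97.fa BR_centroids-sm-97.fa BR_centroids-sz-97.fa; do
  if [ -f "$f" ]; then
    printf '%s\t%s\n' "$f" "$(count_seqs "$f")"
  else
    printf '%s\tnot found (run the matching vsearch command first)\n' "$f"
  fi
done
```

Differences between the three counts reflect the effect of sequence order at the same --id threshold.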
Re-do the exercise for Carabus using this input file
Download all output files for Branchiomma and for Carabus