Pas-Kapli edited this page Oct 15, 2016 · 8 revisions

Introduction Crop

Crop stands for "Clustering 16S rRNA for OTU Prediction" and is designed mostly for large datasets in microbial ecology. It is a Bayesian Hierarchical Clustering (BHC) method and can therefore be much slower than other clustering methods. To reduce computation time, the initial set of sequences is split into smaller subsets (blocks) and BHC is performed within each block. Crop then performs another BHC step that compares the clusters found in the individual blocks; some of those clusters are merged in this phase. This procedure is repeated until one of the following conditions is met:

  1. The number of clusters is >90% of the number of sequences. (This means most sequences form a cluster by themselves, so the split and merge process can no longer reduce the dimension of the data efficiently.)

  2. The number of clusters is smaller than a predetermined threshold.

  3. The process has run N times, where N is a predetermined threshold (defined by the parameter -m).
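
The three stopping conditions above can be sketched as a small shell function. This is only an illustration of the logic described here, not Crop's actual code; the function name, argument names, and the labels it prints are all hypothetical.

```shell
# stop_reason N_CLUSTERS N_SEQS MIN_CLUSTERS ROUND MAX_ROUNDS
# Prints why the split-and-merge loop would stop, or "continue".
# All names are illustrative; this is not Crop's actual code.
stop_reason() {
    n_clusters=$1; n_seqs=$2; min_clusters=$3; round=$4; max_rounds=$5
    # 1. Clusters exceed 90% of the sequences (integer form of n_clusters > 0.9 * n_seqs).
    if [ $((n_clusters * 10)) -gt $((n_seqs * 9)) ]; then
        echo "mostly-singletons"
    # 2. Fewer clusters than the predetermined threshold.
    elif [ "$n_clusters" -lt "$min_clusters" ]; then
        echo "few-clusters"
    # 3. The maximum number of rounds has been reached.
    elif [ "$round" -ge "$max_rounds" ]; then
        echo "max-rounds"
    else
        echo "continue"
    fi
}

stop_reason 95 100 10 1 20   # prints "mostly-singletons"
stop_reason 40 100 10 3 20   # prints "continue"
```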

Software:

Crop is available as standalone software implemented in C.

Install:

$ git clone https://github.com/tingchenlab/CROP

$ cd CROP

$ make

Input files

Sequences in fasta format (not aligned).
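
Before running Crop it can help to check that the input really is unaligned. The helpers below are a minimal sketch using standard Unix tools; the function names are made up for this example, and the gap check assumes that alignment gaps are written as `-`.

```shell
# count_seqs FILE: number of FASTA records (lines starting with '>').
count_seqs() {
    grep -c '^>' "$1"
}

# has_gaps FILE: succeeds if any sequence line contains '-',
# which suggests the file is an alignment rather than raw sequences.
has_gaps() {
    awk '!/^>/ && /-/ { found = 1 } END { exit !found }' "$1"
}

# strip_gaps FILE: print the file with '-' removed from sequence lines only;
# header lines are left untouched.
strip_gaps() {
    sed '/^>/!s/-//g' "$1"
}
```

For example, `has_gaps input.fasta && strip_gaps input.fasta > input.nogaps.fasta` would write a gap-free copy only when gaps are found (both file names are placeholders).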

Crop delimitation

$ CROPLinux -i BR_cob_57ind_no_outgr.fasta -o BR_cob_57ind_no_out.CROP -z 50 -s

BR_cob_57ind_no_outgr.fasta

-z defines the size of the blocks (subsets of the original set of sequences). In this example the number of sequences is small, so this parameter makes no difference: all sequences can be compared within one block. In a large dataset (hundreds of sequences), different -z values could result in different clusters.

Output files:

Unique sequences of the input fasta file, as a list and in fasta format.

Clusters of sequences in a list and representative sequences per cluster in fasta.

Try other threshold values based on the following parameters:

parameters       Cut-off threshold
l=0.2 u=0.5      1%
l=0.5 u=1.0      2%
l=1.0 u=1.5      3%
l=1.5 u=2.5      5%
l=3   u=4        8%
l=4   u=5        10%
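
To try all of the settings in the table above without retyping the command, the sketch below prints one Crop command per row. It assumes that the l and u values are passed on the command line as `-l` and `-u` in place of `-s` (check the usage message of your Crop build to confirm), and the output names ending in `_1pct`, `_2pct`, and so on are invented for this example.

```shell
# crop_commands INPUT.fasta: print one Crop command per (l, u) pair.
# Assumption: l and u are given as -l and -u; output names are illustrative.
crop_commands() {
    input=$1
    while read -r l u cutoff; do
        echo "CROPLinux -i $input -o ${input%.fasta}.CROP_${cutoff}pct -z 50 -l $l -u $u"
    done <<EOF
0.2 0.5 1
0.5 1.0 2
1.0 1.5 3
1.5 2.5 5
3 4 8
4 5 10
EOF
}

crop_commands BR_cob_57ind_no_outgr.fasta
```

The function only prints the commands so they can be inspected first; piping the output to `sh` would run them one after another.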

Q: How do the thresholds change the delimitation?
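
One way to approach this question is to compare the number of clusters obtained in each run. The sketch below counts the records in each fasta file of representative sequences, one record per cluster. The function name is hypothetical, and you need to pass the actual names of the representative-sequence files that your runs produced.

```shell
# cluster_counts FILE...: print "<file><TAB><number of FASTA records>" for
# each file. With one representative sequence per cluster, the record count
# equals the number of clusters.
cluster_counts() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}
```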
