Pas-Kapli edited this page Oct 25, 2016 · 8 revisions

Introduction Crop

Crop stands for "Clustering 16S rRNA for OTU Prediction" and is mostly designed for large datasets in microbial ecology. It is a Bayesian Hierarchical Clustering (BHC) method and can therefore be much slower than other clustering methods. To reduce computation time, the initial set of sequences is split into smaller subsets (blocks) and BHC is performed within each block. Crop then performs another BHC step that compares the initial clusters found within the blocks; some of those clusters are merged in this phase. This procedure is repeated until one of the following requirements is met:

  1. The number of clusters is >90% of the number of sequences. (This means most sequences form a cluster by themselves, so the split-and-merge process can no longer reduce the dimension of the data efficiently.)

  2. The number of clusters is smaller than a predetermined threshold.

  3. The process has run N times, where N is a predetermined threshold (defined by the parameter -m).
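The three stopping rules above can be sketched as a single check. This is only an illustration of the logic as described here, not Crop's actual C code; the function name, argument names and example numbers are hypothetical.

```python
def should_stop(n_clusters, n_sequences, min_clusters, rounds_done, max_rounds):
    """Return the reason to stop the split-and-merge loop, or None to continue.

    All names and values are hypothetical; they only mirror the three rules
    listed in the text.
    """
    # Rule 1: more than 90% of the sequences form a cluster by themselves.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if 10 * n_clusters > 9 * n_sequences:
        return "mostly singletons"
    # Rule 2: the number of clusters dropped below the predetermined threshold.
    if n_clusters < min_clusters:
        return "few clusters"
    # Rule 3: the round limit (the -m parameter) has been reached.
    if rounds_done >= max_rounds:
        return "round limit"
    return None


print(should_stop(95, 100, 10, 1, 20))   # rule 1 applies
print(should_stop(40, 100, 50, 1, 20))   # rule 2 applies
print(should_stop(60, 100, 50, 20, 20))  # rule 3 applies
print(should_stop(60, 100, 50, 3, 20))   # no rule applies, keep iterating
```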

Software:

Crop is available as standalone software implemented in C.

Install:

$ git clone https://github.com/tingchenlab/CROP

$ cd CROP

$ make

Input files

Sequences in fasta format (not aligned).
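If your sequences come from an alignment, the gap characters have to be removed before the file can serve as unaligned input. A minimal sketch, assuming gaps are written as `-` or `.`; the function name and the toy sequences are hypothetical.

```python
def degap_fasta(text, gap_chars="-."):
    """Remove alignment gap characters from the sequence lines of FASTA text."""
    table = str.maketrans("", "", gap_chars)
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Header lines are kept unchanged.
            out.append(line)
        else:
            stripped = line.translate(table)
            # Skip sequence lines that consisted only of gaps.
            if stripped:
                out.append(stripped)
    return "\n".join(out) + "\n"


aligned = ">ind1\nAC-GT\n>ind2\nA--.T\n"  # toy example, not real data
print(degap_fasta(aligned), end="")
```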

Exercise: delimitation with Crop

Working directory: ~/workshop_exercises/distance_methods/branchiomma/crop

Note: If you didn't create this directory during the Linux tutorial, create it now using mkdir.

$ CROPLinux -i BR_cob_57ind_no_outgr.fasta -o BR_cob_57ind_no_out.CROP -z 50 -l 1.0 -u 1.5

BR_cob_57ind_no_outgr.fasta

-z defines the size of the blocks (subsets of the original set of sequences). In this example the number of sequences is small, so this parameter makes no difference: all sequences can be compared within one block. In a large dataset (hundreds of sequences), different -z values could result in different clusters.
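To see why -z only matters for larger datasets, here is a rough sketch of block splitting. It assumes blocks are simply consecutive chunks of at most z sequences; how Crop actually assigns sequences to blocks may differ, and the function name and sequence IDs are hypothetical.

```python
def split_into_blocks(seq_ids, z):
    """Split a list of sequence IDs into consecutive blocks of at most z items.

    Assumption for illustration only: Crop's real partitioning may differ.
    """
    return [seq_ids[i:i + z] for i in range(0, len(seq_ids), z)]


small = [f"seq{i}" for i in range(1, 41)]    # 40 hypothetical sequences
large = [f"seq{i}" for i in range(1, 121)]   # 120 hypothetical sequences

# With fewer sequences than z, everything ends up in a single block.
print([len(b) for b in split_into_blocks(small, 50)])
# With more sequences than z, the data are spread over several blocks,
# so a different z changes which sequences are compared first.
print([len(b) for b in split_into_blocks(large, 50)])
print([len(b) for b in split_into_blocks(large, 30)])
```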

Output files:

Unique sequences of the input fasta file in a list and in fasta.

Clusters of sequences in a list and representative sequences per cluster in fasta.

Try other threshold values based on the following parameters:

Parameters        Cut-off threshold
-l 0.2 -u 0.5     1%
-l 0.5 -u 1.0     2%
-l 1.0 -u 1.5     3%
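The three runs can be prepared in one go. The sketch below only builds and prints the command lines, using the same flags as the exercise command; the output file names are hypothetical (chosen so that the runs do not overwrite each other), and the commented-out subprocess call assumes CROPLinux is on your PATH.

```python
import shlex

# -l / -u pairs from the table, keyed by the cut-off threshold they correspond to.
THRESHOLDS = {"1%": ("0.2", "0.5"), "2%": ("0.5", "1.0"), "3%": ("1.0", "1.5")}


def crop_command(fasta, out_prefix, lower, upper, block_size=50):
    """Build the CROPLinux argument list with the flags used in this exercise."""
    return ["CROPLinux", "-i", fasta, "-o", out_prefix,
            "-z", str(block_size), "-l", lower, "-u", upper]


commands = []
for label, (lower, upper) in THRESHOLDS.items():
    # Hypothetical output name: one file set per threshold.
    out_prefix = f"BR_cob_57ind_no_out_{label.rstrip('%')}pct.CROP"
    cmd = crop_command("BR_cob_57ind_no_outgr.fasta", out_prefix, lower, upper)
    commands.append(cmd)
    print(shlex.join(cmd))
    # To actually run it (assumes CROPLinux is installed and on PATH):
    # import subprocess; subprocess.run(cmd, check=True)
```

Copy the printed lines into the terminal, or enable the subprocess call to run them from Python.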

Find the output files for all the threshold values here: threshold1, threshold2, threshold3

Repeat the exercise for the Carabus sequences.

Don't forget to work in the right directory: ~/workshop_exercises/distance_methods/branchiomma/crop

This might take a while. If so, check the output files here: threshold1, threshold2, threshold3
