StoatyDive is a tool to evaluate and classify predicted peak profiles to assess the binding specificity of a protein to its targets. It can be used for sequencing data such as CLIP-seq or ChIP-Seq, or any other type of peak profile data.
You can install StoatyDive via conda or with a direct download of this repository.
conda install stoatydive
git clone https://github.com/BackofenLab/StoatyDive.git cd StoatyDive sudo apt-get install -y r-base python setup.py install
Requirements: python >= 3.6, bedtools >= 2.27.1, numpy>=1.13.3, matplotlib>=2.1, scipy >= 1.3, R >= 3.4.4
StoatyDive.py [-h] [options] -a *.bed -b *.bam/*bed -c *.txt
optional arguments: -h, --help show this help message and exit -v, --version show program's version number and exit -a *.bed, --input_bed *.bed Path to the peak file in bed6 format. -b *.bam/*.bed, --input_bam *.bam/*.bed Path to the read file used for the peak calling in bed or bam format. -c *.txt, --chr_file *.txt Path to the chromosome length file. -o path/, --output_folder path/ Write results to this path. [Default: Operating Path] -t float, --thresh float Set a normalized CV threshold to divide the peak profiles into more specific (0) and more unspecific (1). [Default: 1.0] --peak_correction Activate peak correction. The peaks are recentered (shifted) for the correct sumit. --max_translocate Set this flag if you want to shift the peak profiles based on the maximum value inside the profile instead of a Gaussian blur translocation. --peak_length int Set maximum peak length for the constant peak length. --max_norm_value float Provide a maximum value for CV to make the normalized CV plot more comparable. --border_penalty Adds a penalty for non-centered peaks. --scale_max float Provide a maximum value for the CV plot. --maxcl int Maximal number of clusters of the kmeans clustering of the peak profiles. The algorithm will be optimized, i.e., the parameter is just a constraint and not absolute. [Default: 15] -k int, --numcl int You can forcefully set the number of cluster of peak profiles. --sm Turn on the peak profile smoothing for the peak profile classification. It is recommended to turn it on. --lam float Parameter for the peak profile classification. Set lambda for the smoothing of the peak profiles. A higher value (> default) will underfit. A lower value (< default) will overfit. [Default: 0.3] --turn_off_classification Turn off the peak profile classification.
Border Penalty and Length Normalization
It is recommended to use StoatyDive with
Adding the border penalty takes care of peaks that are not correctly centered and might
just overlap with a short appendage of a read stack. The length normalization takes
care of different sized peaks. All peak are extended to a certain length with
--peak_correction. The user can either provide a peak length with
StoatyDive just takes the maximal peak length of the given peak set.
The user can set a CV threshold with
-t, --thresh to divide the predicted
peaks into more specific (0) and more unspecific sites (1). The default is set
StoatyDive runs uMAP and k-means clustering to classify the peak profiles.
You can skip this step with
--turn_off_classification. The parameter
is crucial for the k-means clustering. Just leave it in default, since the number
of clusters will be optimized internally. The parameter
--maxcl is just an upper
bound. If you assume that 15 cluster is not enough then increase the parameter. You
can also force StoatyDive to use k clusters with
Smoothing of the Peak Profiles
StoatyDive can smooths the peak profiles, with a spline regression, using the
--sm, which is recommended. This helps reduce the amount of noise. The
--lam can be adjusted for your data. The default of
0.3 was optimized
with some test data. A higher value (> default) will underfit. A lower value (< default) will overfit.
Translocation of the Peak Profile
StoatyDive translocates (centers) the profiles with a Gaussian blur, because the dimensional reduction
methods uMAP is not translation invariant. You can also decide to center the peak profiles based on the
maximal intensity value inside with
--max_translocate. This can be useful, for example, if you use
truncation events of iCLIP.
- You can set a maximal value for the normalized CV distribution plot with
max_norm_value. This option helps, if you want to compare several normalized CV distribution plots from different experiments. Take the highest CV from all experiments as a maximal value.
- You can set a maximal value for the CV distribution plot with
scale_max. This option helps, if you want to compare several normalized CV distribution plots from different experiments. Take the highest CV from all experiments as a maximal value.
StoatyDive.py -a test/broad_peaks/peaks.bed -b test/broad_peaks/reads.bed -c test/chrom_sizes.txt --peak_correction --border_penalty --turn_off_classification -o test/broad_peaks/
StoatyDive.py -a test/sharp_peaks/peaks.bed -b test/sharp_peaks/reads.bed -c test/chrom_sizes.txt --peak_correction --border_penalty --turn_off_classification -o test/sharp_peaks/
StoatyDive.py -a test/mixed_peaks/peaks.bed -b test/mixed_peaks/reads.bam -c test/chrom_sizes.txt --peak_correction --peak_length 50 --border_penalty --sm -o test/mixed_peaks/
CV distribution plot
The diagram will give you a first impression of the binding specificity of your protein of interest. The diagram also tells you about the performance/quality of your experiment. An experiment with lots of unspecific binding sites will have a CV distribution close to zero, as in our example A. An experiment with lots of specific binding sites will have a CV distribution with a high expected CV, as in our example B.
Normalized CV distribution plot
The normalized CV distribution helps to identify specific and unspecific sites within an experiment. The normalized CV is in a range [0,1]. A specific site will have a value of 1. An unspecific site will have a value of 0.
The final tabular file is a ranked, tab separated list of your predicted binding sites:
- Start of Peak
- End of peak
- Peak ID/Name
- Peak length
- r (hyperparameter of negative binomial)
- p (hyperparameter of negative binomial)
- Normalized CV
- Difference between the maximal value of the left border and the maximal value of the center region of the peak. (Penalty used for the border penalty.)
- Difference between the maximal value of the right border and the maximal value of the center region of the peak. (Penalty used for the border penalty.)
- Internal peak index
- Type of Peak: 0 = More specifc binding site; 1 = More unpsecific binding site.
- Relative Position of the maximal coverage/intensity (sumit) inside the peak.
- Class of the Peak (Cluster).
14 to divide your peak into the two general categories of
specific and unspecific. Use column
16 to find peaks with a specific shape. If the user
skipped the classification, then the final_tab_.bed is the CV_tab_.bed. The
tabular has no 15/16th column.
You will get some plots for the classification, saved in the folder
Individual Cluster Profiles (pdf)
If you turned on the smoothing you will get four types of cluster sets. The first one shows you
some example raw peak profiles assigned to the specific cluster (e.g. cluster_1.pdf
for cluster 1; Figure A). The second one shows you some example smoothed and sometimes translocated peak profiles to the specific cluster (e.g. cluster_smoothed1.pdf for cluster 1; Figure B). Profile like figure B are used for the classification. The profiles are colored based on the clusters as seen as in the uMAP plot.
Overview Cluster Profiles (pdf)
The third one is an example profile, such as Figure A for each cluster in one plot (overview_cluster.pdf; Figure C). The fourth one are the average profiles of each cluster (e.g. cluster_average_profiles.pdf; Figure D).
kmeans_Optimization.pdf shows you the optimization scheme. If you data
has a very low complexity, that is to say, you have lots of similar peak profiles,
then the percent of variance explained will be very low (second plot). It is also
indicated by strong fluctuations in the other diagrams. If you have very distinguishable
peak profiles, as in our example, then the variance explained will be
uMAP.pdf shows you the data in the new dimension found by the uMAP
dimensional reduction algorithm. In correspondance to the the k-means optimization,
highly distinguishable peaks will appear in the plot as very clearly separated