This add-on of TADbit allows you to deconvolve the structural signal from a set of regions of interest and obtain the different structural clusters.
Metawaffle was written and tested using Python 2.7.
It requires the packages TADbit
(0.4.97),Neupy
(0.6.4),matplotlib
(3.2.1), scipy
(1.0.0), numpy
(1.14.0), pandas
(0.22.0),pysam
(0.15.2), pickle
(2.0.0), functools
(3.2.3), multiprocess
(0.70.9) and scikit-learn
(0.22.2).
In order to install the package you can download the code from our GitHub repo and install it manually. If needed, the dependencies will be downloaded automatically; if you encounter problems, try to install the other requirements manually.
git clone https://github.com/3DGenomes/metawaffle # or download manually
cd metawaffle
python setup.py install
After the installation, you can run the provided example to familiarize with the functions of metawaffle.
Use metawaffle -h
for quick help and orientation.
For the tutorial will be used the Hi-C data from the chromosome 22 of GM12878 cell line from the merged replicates listed in (META-WAFFLE ARTICLE).
The biases have to be computed before running bam2count
. We use TADbit
as it gives the possibility to obtain the biases using various normalization methods (check https://github.com/3DGenomes/TADbit). For this example we have extracted the biases using the OneD normalization (E Vidal et al. 2018) at 5 kb resolution. The biases will be stored in a pickle file. We will use a 5 kb resolution.
metawaffle bam2count \
example/chr22_GM12878.bam \
5000 \
example/ \
example/biases_5kb_GM12878.pickle \
You can notice that if you do not have a BAM format file, you can use a tabseparated format with the same information as the output file. The output file contains 4 columns:
- bin i: genomic bin i.
- bin j: genomic bin j.
- raw contact value: The raw interaction value.
- normalized contact value: The normalized interaction value.
For the tutorial we will use the ChIP-seq peaks from CTCF in the chromosome 2 (listed in PAPER).First of all we can run a quality check to know the distribution and the width of the peaks.
metawaffle check_peaks \
example/ctcf_chr2.bed \
example/ \
The quality check will provide worthy information to decide the genomic interval of distance and the windows span around the center of the peak. Then, the pair list will be generated running the pairlist
script.
metawaffle pairlist \
example/ctcf_chr2.bed \
example/ \
example/human_grch38_size \
ctcf_chr2 \
27500 \
2500000 \
255000-2500000 \
This command will generate a file with the coordinate pairs with the CTCF peak centered in 55 kb region. Only those coordinates within the interval distance of 255 kb - 2.5 Mb will be stored.
The output will be a two column file:
chr22:40502977-40547977 chr22:41644314-41689314
chr22:39308838-39353838 chr22:40502977-40547977
chr22:40502977-40547977 chr22:41650826-41695826
chr22:39291974-39336974 chr22:40502977-40547977
chr22:39492032-39537032 chr22:40502977-40547977
The next step will be to extract the contact matrices from the coordinate pair list. For the example we will extract the all the matrices using the tag -A True. If you only want to use a random set of coordinates from the pairlist, -A True and specify the sample size.
metawaffle peak2matrix \
example/ctcf_chr2_255000_2500000.tsv \
example/human_grch38_size \
5000 \
example/tmpdir/ \
example/ \
8 \
ctcf_chr2 \
example/biases_5kb_GM12878.pickle \
example/matrix/ \
27500 \
-A True \
The output file will contain row-wise the coordinates and their contact matrices from the pairlist file.
To classify the extracted contact matrices according to their structural pattern, a competitive neural network provided in Neupy
package, the Self-Organizing Feature Map (SOFM or SOM).
metawaffle sofm \
example/matrices_ctcf_chr2.tsv \
example/sofm/ \
ctcf_chr2 \
20 \
12 \
5 \
1 \
0.01 \
0.01 \
-plot True \
If you want to know more about SOFM: http://neupy.com/apidocs/neupy.algorithms.competitive.sofm.html . After running this command, you will obtain various files:
- Cluster_n.bed: You will obtain bed files for each of the neurons according to the grid size, with the classified coordinates.
- Heatmap_info : image with the number of coordinates classified per neuron, in order to check how the coordinates have been distributed in the SOFM map.
metawaffle has these basic commands:
bam2count
to convert the bamfile into tabseparated counts.check_peaks
to evaluate peaks of interest, and get the widht and distance between contiguous peaks.pairlist
used to generate the coordinates list to extract the matrices.peak2matrix
contains the tools to extract the contact matrices from the pairs list.sofm
used to classify the contact matrices according to their structural pattern.
Check the label FAQ <https://github.com/3DGenomes/TADbit/issues?utf8=%E2%9C%93&q=is%3Aissue+label%3AFAQ+>
_ in TADbit issues.
If your question is still unanswered feel free to open a new issue.
This add-on of TADbit is currently developed at the MarciusLab <http://www.marciuslab.org>
_ with the contributions of Silvia Galan and François Serra.
Please, cite this article if you use TADbit.
Serra, F., Baù, D., Goodstadt, M., Castillo, D. Filion, G., & Marti-Renom, M.A. (2017).
Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors.
PLOS Comp Bio 13(7) e1005665. doi:10.1371/journal.pcbi.1005665 <https://doi.org/10.1371/journal.pcbi.1005665>
_