Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing is a technology to study genome-wide long-range chromatin interactions bound by protein factors. ChIA-PET Tool V3, a software package for automatic processing of ChIA-PET sequence data, including: 0. linker detect (--stop_step 0, only run linker detect)
- linker filtering
- mapping the paired-end reads to a reference genome
- purifying the mapped reads
- dividing the reads into different categories
- peak calling
- interaction calling
- visualizing the results
Java is a popular platform-independent programming language and can be run on any machines with a Java Virtual Machine (JVM). BWA is used to map ChIA-PET sequencing reads to a reference genome. SAMtools is used to convert the alignment output from SAM format to BAM format. BEDTools is required to convert the files from BAM format to bed format. R environment and its packages are used to compute the p-values in peak calling and interaction calling and generate the graphs for visualization. ChIA-PET Tool V3 requires the following softwares:
- JDK>=1.8(https://www.oracle.com/technetwork/java/javase/downloads/index.html)
- BWA(http://bio-bwa.sourceforge.net/)
- SAMtools(http://samtools.sourceforge.net/)
- BEDTools(https://bedtools.readthedocs.io/en/latest/)
- R(https://www.r-project.org/)
- R package grid(install.packages("grid"))
- R package xtable(install.packages("xtable"))
- R package RCircos(install.packages("RCircos"))
Download the ChIA-PET Tool V3 package from https://github.com/GuoliangLi-HZAU/ChIA-PET_Tool_V3.
Unpack the package using the following command in your selected directory:
$ unzip ChIA-PET_Tool_V3.zip
short-read test data set is part of RNAPII ChIA-PET data from human K562 cells: https://1drv.ms/u/s!AqzVTcWMvT40bhONKvBSCZDxqjA
long-read test data set is part of CTCF ChIA-PET data from human GM12878 cells: https://1drv.ms/u/s!AqzVTcWMvT40bzU5gKTuxxGRvz8
RNAPII ChIA-PET data from human K562 cells:
GSM832464 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM832464)
CTCF ChIA-PET data from human GM12878 cells:
GSM1872886 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1872886)
Before excuting the ChIA-PET Tool V3, you need to create genome index by BWA referring to http://bio-bwa.sourceforge.net/bwa.shtml and configure environment variables of bwa, samools and bamToBed (bedtools). After that, we can simply run it with one command line:
$ java -jar ChIA-PET.jar [options]
The options are as follows:
options:
--mode: There are two modes for ChIA-PET Tool V3. 0 for short read, 1 for long read.
--fastq1: path of read1 fastq file.
--fastq2: path of read2 fastq file.
--autolinker: detect linker by our program, then no need provide --linker and --mode paramater. default: true
When the parameter "--stop_step 0" is present, only the automatic detection linker program will be run.
When the parameter "--stop_step 1" is present, only the automatic detection linker AND linker filter program will be run.
--linker: linker file.
--fastp: fastp path, strong suggest for ChIA-PET data.
--minimum_linker_alignment_score: Specifies the allowed minimum alignment score.
--GENOME_INDEX: specifies the path of BWA index file.
--GENOME_LENGTH: specifies the number of base pairs in the whole genome.
--CHROM_SIZE_INFO: specifies the file that contains the length of each chromosome.
--CYTOBAND_DATA: specifies the ideogram data used to plot intra-chromosomal peaks and interactions.
--SPECIES: specifies the genome used to plot inter-chromosomal interactions, 1 for human, 2 for mouse and 3 for
--hichip: Y(es) or N(o)[default] or O(nly print restriction site file without run other step), need for hichip data
--ligation_site: It can be the name of restriction enzyme, such as HindIII, MboI, DpnII, Bglii, Sau3AI, Hinf1, NlaIII, AluI
or the site of enzyme digestion, A^AGCTT, ^GATC, ^GATC, A^GATCT, G^ANTC, CATG^, AG^CT or others.
multipe restriction enzyme can be seperated by comma, such as G^ANTC,^GATC.
restriction site with '^' and contains 'ATCG' without other character!!!
if the genomic enzyme digestion file --restrictionsiteFile is provided,
this parameter does not need to be provided. only needed for hichip data;
--ResRomove: Y or N, whether remove PET in same restriction contig. default: Y
--restrictionsiteFile: restriction site file, can be genarated while has --ligation_site and without this paramater or
provide restriction enzyme information with --ligation_site, we will automatically generate the file.
only needed for hichip data
--genomefile\tgenome fasta file path, needed for with --ligation_site and without --restrictionsiteFile only needed for hichip data
--start_step: start with which step, 1: linker filtering; 2: mapping to genome; 3: removing redundancy; 4:
categorization of PETs; 5: peak calling; 6: interaction calling; 7: visualizing, default: 1"
--stop_step: stop with which step, see above
--output: specifies the directory to store the output data from ChIA-PET Tool V3,
note: please specify the absolute path of output directory.
--prefix: specifies the prefix of all the output files, default: out.
--minimum_tag_length: Specifies the minimum tag length. Tag is the sequence after linker removal. This parameter
is better to be set above 18bp. Default: 18.
--maximum_tag_length: Specifies the maximum tag length. Default:1000 Specifies the maximum tag length. Default:
1000.
--minSecondBestScoreDiff: Specifies the score difference between the best-aligned and the second-best aligned
linkers. Default: 3.
--output_data_with_ambiguous_linker_info: Determines whether to print the linker-ambiguity PETs. 0: not print; 1:
print, Default: 1.
--thread: the number of threads used in linker filtering and mapping to genome. Default: 1.
--MAPPING_CUTOFF: The mapping threshold to remove the PETs with poor quality. Default: 30.
--MERGE_DISTANCE: specifies the distance limit to merge the PETs with similar mapping locations. Default: 2.
--SELF_LIGATION_CUFOFF: specifies the distance threshold between self-ligation PETs and intra- chromosomal
inter-ligation PETs. Default: 8000.
--EXTENSION_LENGTH: specifies the extension length from the location of each tag, which is determined by the
median span of the self-ligation PETs. Default: 500.
--MIN_COVERAGE_FOR_PEAK: specifies the minimum coverage to define peak regions. Default:5.
--PEAK_MODE: There are two modes for peak calling. Number 1 is "peak region" mode, which takes all the
overlapping PET regions above the coverage threshold as peak regions, and number 2 is "peak summit" mode, which
takes the highest coverage of overlapping regions as peak regions. Default: 2.
--MIN_DISTANCE_BETWEEN_PEAK: specifies the minimum distance between two peaks. If the distance of two peak
regions is below the threshold, only the one with higher coverage will be kept. Default: 500.
--GENOME_COVERAGE_RATIO: specifies the estimated proportion of the genome covered by the reads. Default: 0.8.
--PVALUE_CUTOFF_PEAK: specifies p-value to filter peaks that are not statistically significant. Default: 0.00001.
--INPUT_ANCHOR_FILE: a file which contains user-specified anchors for interaction calling. If you don't have this
file, please specify the value of this variable as "null" instead. Default: null.
--PVALUE_CUTOFF_INTERACTION: specifies p-value to filter false positive interactions. Default:0.05.
--zipbedpe: gzip bedpe related file, after analysis done. default: N. Y for gzip, N for not.
--zipsam: Convert sam file to bam, after analysis done. default: N
--deletesam: Delete sam files. default: N
--keeptemp: Keep temp sam and bedpe file. default: N
--map_ambiguous: Also mapping ambiguous reads without linker. default: N
--skipmap: Skip mapping read1 and read2, start from paired R1.sam and R2.sam, only valid in HiChIP mode now. default: N
--macs2: macs2 path, using macs2 callpeak to detect anchor peak with alignment file. default: N
--nomodel macs2 parameter, Whether or not to build the shifting model in macs2. default: N
--shortestP extend and keep shorest peak length longer than N for loop calling, suggest 1500, user can set 0 to skip this step. default: 1500
--shortestA extend and keep shorest anchor length longer than N for loop calling, user can set 0 to skip this step. default: 0
--XOR_cluster Whether keep loops if only one side of anchor is overlap with peak. default: N
--addcluster Keep all regions with more than 2 count reads as potential anchor for calling loop. default: N. if peaks number of macs2 smaller than 10000, this paramater will work automaticly.
Especially, the directories of data should be set properly to make sure that the programs could run smoothly. ChIA-PET Tool V3 will create a folder named by OUTPUT_PREFIX
in the OUTPUT_DIRECTORY
. The default value of OUTPUT_DIRECTORY
is in the master folder "ChIA-PET_Tool_V3/
" and OUTPUT_PREFIX
is "out". Examples of CHROM_SIZE_INFO
and CYTOBAND_DATA
are both in the master folder "ChIA-PET_Tool_V3/chromInfo/
". We recommend users to select one or create a new one in that folder.
The results will be visualized by a HTML file which is in the output folder "OUTPUT_DIRECTORY/OUTPUT_PREFIX/OUTPUT_PREFIX.ChIA-PET_Report
".
short-read:
java -jar ChIA-PET.jar --mode 0 --fastq1 test_short_1.fastq --fastq2 test_short_2.fastq --linker ChIA-PET_Tool_V3/linker/linker.txt --minimum_linker_alignment_score 8 --GENOME_INDEX hg19.fa --GENOME_LENGTH 3E9 --CHROM_SIZE_INFO ChIA-PET_Tool_V3/chromInfo/hg19.chromSize.txt --CYTOBAND_DATA ChIA-PET_Tool_V3/chromInfo/hg19_cytoBandIdeo.txt --SPECIES 1 --output test_short --prefix K562 --thread 4
long-read:
java -jar ChIA-PET.jar --mode 1 --fastq1 test_long_1.fastq --fastq2 test_long_2.fastq --linker ChIA-PET_Tool_V3/linker/linker_long.txt --minimum_linker_alignment_score 14 --GENOME_INDEX hg19.fa --GENOME_LENGTH 3E9 --CHROM_SIZE_INFO ChIA-PET_Tool_V3/chromInfo/hg19.chromSize.txt --CYTOBAND_DATA ChIA-PET_Tool_V3/chromInfo/hg19_cytoBandIdeo.txt --SPECIES 1 --output test_long --prefix GM12878 --thread 4
- Example of peak file named
OUTPUT_PREFIX
.peak.FDRfiltered.txt
chrom | summit start | summit end | peak coverage | p-value | p.adjust |
---|---|---|---|---|---|
chr1 | 840305 | 840556 | 21 | 1.18e-08 | 2.57e-07 |
chrom: chromosome name.
summit start: The start coordinate of peak summit.
summit end: The end coordinate of peak summit.
peak coverage: The highest coverage by tags in a peak.
p-value: This value represents the statistical significance of a peak, which is calculated by Poisson distribution.
p.adjust: P.adjust means p-value adjusted with Benjamini-Hockberg method for multiple hypothesis testing.
- Example of interaction file named
OUTPUT_PREFIX
.cluster.FDRfiltered.txt
chrom1 | start1 | end1 | chrom2 | start1 | end2 | ipet counts | type | distance | tag count within anchor 1 | tag count within anchor 2 | p-value | p.adjust | -log10(p-value) | -log10(p.adjust) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr7 | 562341 | 563231 | chr7 | 574231 | 575143 | 2 | 1 | 11901 | 3 | 9 | 7.11e-11 | 1.99E-10 | 10.15 | 9.7 |
chrom1: The name of the chromosome on which the cluster anchor 1 exists.
start1: The start coordinate of cluster anchor 1.
end1: The end coordinate of cluster anchor 1.
chrom2: The name of the chromosome on which the cluster anchor 2 exists.
start2: The start coordinate of cluster anchor 2.
end2: The end coordinate of cluster anchor 2.
ipet count: Number of PETs between cluster anchor 1 and cluster anchor 2.
type: Interactions type. 1 represents intra-chromosomal interaction, and 0 represents inter-chromosomal interaction.
distance: Distance between anchors of an intra-chromosomal interaction cluster. If two anchors are located on different chromosomes, the value is set to 2,147,483,647.
tag count within anchor 1: Number of tags that fall in the cluster anchor 1.
tag count within anchor 2: Number of tags that fall in the cluster anchor 2.
p-value: This value represents the statistical significance of a chromatin interaction, which is calculated by hyper-geometric distribution.
p.adjust: P.adjust means p-value adjusted with Benjamini-Hockberg method for multiple hypothesis testing.
-log10(p-value): The negative logarithm of p-value.
-log10(p.adjust): The negative logarithm of adjusted p-value.
- Li G, Cai L, Chang H, Hong P, Zhou Q, Kulakova EV, Kolchanov NA, Ruan Y. Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing technology and application. BMC Genomics, 2014, 15, S11
- Li G, Chen Y, Snyder MP, Zhang MQ. ChIA-PET2: a versatile and flexible pipeline for ChIA-PET data analysis. Nucleic Acids Research, 2016: 1-10
- Li G, Fullwood MJ, Xu H, Mulawadi FH, Velkov S, Vega V, Ariyaratne PN, Mohamed YB, Ooi H, Tennakoon C, Wei C, Ruan Y, Sung W. Software ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biology 2010, 11, R22
- Li X, Luo OJ, Wang P, Zheng M, Wang D, Piecuch E, Zhu JJ, Tian SZ, Tang Z, Li G, Ruan Y. Long-read ChIA-PET for base-pair-resolution mapping of haplotype-specific chromatin interactions. Protocol 2017, 12: 899-915
- Paulsen J, Rødland EA, Holden L, Holden M, Hovig E. A statistical model of ChIA-PET data for accurate detection of chromatin 3D interactions. Nucleic Acids Research, 2014, 42, e143
- Phanstiel DH, Boyle AP, Heidari N, Snyder MP. Mango: a bias-correcting ChIA-PET analysis pipeline. Genome analysis, 2015, 31: 3092-3098