We introduce the TESA algorithm, which is a novel motif discovery algorithm designed for ChIP-exo data. ChIP-exo is a high-resolution DNA sequencing technique that improves the accuracy and precision of motif discovery compared to traditional ChIP-seq methods. TESA utilizes a weighted two-stage alignment procedure that takes into account positional sequencing coverage to allocate weights to sequence positions. This allows for the accurate localization of TF-DNA interaction sites and the assessment of closely located or overlapping binding occurrences. Experimental results demonstrate that TESA outperforms other motif discovery algorithms in terms of precision and the reconstruction of Position Weight Matrices (PWM) when applied to both prokaryotic and eukaryotic ChIP-exo datasets. The availability of the TESA source code and benchmark datasets further enhance its potential as a valuable tool in genomic research.
Overall, we highlight the significance of TESA in improving motif discovery accuracy and precision by leveraging the higher resolution and precise localization offered by ChIP-exo data. The algorithm's weighted two-stage alignment procedure and incorporation of a "bookend" model contribute to its superior performance in identifying DNA binding patterns. The comparisons with other motif discovery algorithms and the availability of benchmark datasets further validate the effectiveness of TESA in genomic research.
TESA incorporates a weighted two-stage alignment procedure and a "bookend" model to accurately identify DNA binding patterns. The algorithm consists of the following steps:
TESA constructs a matrix with dimensions 2𝑚 × 𝑛, where 𝑚 is the number of input sequences and 𝑛 is the length of each sequence. The matrix represents both the input sequences and their reverse complementary sequences. Each position in the matrix is assigned a normalized sequencing coverage. TESA allocates weights to pairs of segments, each of length 𝑙, between different sequences based on string similarity and sequencing coverage. The statistical significance of string similarity between segments is evaluated using the binomial distribution.
After the two-stage alignment, TESA constructs a graph 𝐺 using the top-scoring positions from the alignment. The graph incorporates both the input sequences and their reverse complements, with each position connected by an edge.
Cliques (complete subgraphs) are detected within the constructed graph using a heuristic algorithm. These cliques serve as potential motifs.
A specialized "bookend" method is deployed to optimize the length of the motifs by assessing the sequential overlaps between pairs of potential motifs.
In the final step, TESA expands the sets of motif instances identified in Step 4. The algorithm selects segments with elevated motif match scores and uses these instances to construct the motif PWM.
The sequence set refers to the collection of DNA sequences that are used as input data for motif discovery algorithms. The sequence set is specifically derived from ChIP-exo data. The ChIP-exo data includes the reference genome file in FASTA
format, a narrow peak file in BED format, and one or multiple sequencing coverage files in bigWig
format. These files are processed and transformed into the TESA
format, which combines the narrow peaks represented in FASTA
format with normalized sequencing coverages. This sequence set is then used as input for the TESA algorithm to identify DNA binding motifs.
GCC
version 9.4.0 or above is required. Otherwise you can installGCC
with root privileges using the following code:
yum -y install gcc
Alternatively you can meet all requirements by installing the developer toolset build-essential
:
sudo apt install build-essential
-
If you are running TESA using an input file with sequencing coverage, both
BEDTools
andBigWigMerge
are required.BEDTools
: https://BEDTools.readthedocs.io/en/latestBigWigMerge
: https://anaconda.org/bioconda/ucsc-bigwigmergeFollowing script can be used to install
BEDTools
2.29.1 andBigWigMerge
throughconda
:
BEDTools 2.29.1:
wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
tar -zxvf bedtools-2.29.1.tar.gz
cd bedtools2
make
BigWigMerge:
conda install -c bioconda ucsc-bigwigmerge
Tip: If you don't have root privileges, conda
can be used to build a virtual environment and perform the above operations in the conda
virtual environment.
It is strongly recommended to add the above software to the environment PATH
after installation. For example, one way, add export PATH=$PATH:<path_of_BEDTools_bin_path>
to ~/.bash_profile
, where <path_of_BEDTools_bin_path>
is your bedtools
binary files location.
Enter the folder tesa
and type make
, then the compiled codes are within the same directory as the source.
git clone https://github.com/OSU-BMBL/tesa.git
cd tesa/src
make clean && make
The major program in the provided package is tesa
. It is capable of parsing standard FASTA
format files, allowing sequences with different lengths. Example files are provided for reference.
To access help and view all available options.
./tesa -h (./tesa)
Take a look at an instance file in FASTA
format, test.fasta
, in the example folder. Then, you can run tesa
with a specific length parameter l
, where the program will handle segments of that length or greater during the two-stage alignment process to identify potential motifs. For example, you can run the following command to specify segments with length exactly 14.
./tesa -i ../example/test.fasta -l 14
For each input file with a specific length parameter l
, the program will generate an output file, namely, '.closures'
file. This file contains all the closures, which represent the instances of identified motifs.
Additionally, you can run tesa
recognizing the correct segment length in the scope [L,U]
by our program automatically.
./tesa -i ../example/test.fasta -L 14 -U 16
L
and U
are lower and upper length of segments during two-stage alignment separately. This is useful when the accurate length is not known in advance. We sort the top n
closures under each specific length in the increasing order of their P-values and save the top o
closures in the '.closures'
file. Especially, when the input value of L
equals to U
, it is equivalent to finding motifs based on segments of a specific length. For example, './tesa -i ../test.fasta -L 14 -U 14'
is equivalent to './tesa -i ../test.fasta -l 14'
.
You can run TESA with another option using sequencing coverage according to the following instructions:
-
Make sure both
BEDTools
andBigWigMerge
are ready and added to your environmentPATH
so that they can be directly called. -
A peak file with
BED
format (for instance,[PREFIX].bed
), severalbigWig
files for example named[PREFIX]_Forward.bw
and[PREFIX]_Reverse.bw
respectively and a reference file withFASTA
format with its index file ofFASTA.FAI
format are required. For instance, there is a toy run withtest.bed
,test_Forward.bw
,test_Reverse.bw
,reference.fa
andreference.fa.fai
as input. Additionally, you can generate reference usingSAMTools
.SAMTools
: https://github.com/samtools/samtools -
Run preprocessing script and generate output file
*.tesa
.
./preprocess.sh [PEAK_PREFIX] [REFERENCE_FILE] [REFERENCE_INDEX_FILE] [OUTPUT_PREFIX]
- Run TESA beyond new input with sequencing coverage.
./tesa [OUTPUT_PREFIX].tesa
For instance:
cd example
chmod +x ../script/preprocess.sh
../script/preprocess.sh test reference.fa reference.fa.fai test_out
../src/tesa -i test_out.tesa
Option | Parameter | Description | Default |
---|---|---|---|
-l | length |
Specify length of segments during two-stage alignment. In general, this value should be greater than 5. | The program uses length 14 in default. |
-L | minlength |
Specify minimum length of segments during two-stage alignment. This parameter only takes effect when l is not specified. |
This parameter does not effect beyond default parameters. |
-U | maxlength |
Specify maximum length of segments during two-stage alignment. Must have a value greater than L . This parameter only takes effect when l is not specified. |
This parameter does not effect beyond default parameters. |
-R | range |
The range when we use [L,U] . |
The default range is 1 . |
-o | output closures |
Number of closures to report (used under a specific input length). Only effective when the value is greater than the total number of closures found by this tool. | The program reports 10 closures in default. |
-n | number of closures |
Top n closures under each length are used when L < U . Only effective when the value is greater than the total number of closures found by this tool. |
This parameter does not effect beyond default parameters. |
-B | reverse complement |
TRUE or FALSE whether to search reverse complement. If it is set TRUE ,the program will search both the forward complement and reverse complement. |
This flag is set as TRUE in default which means input sequence should be bidirectional. |
-w | weight |
Should be set with a positive integer. The weight of the two motif ends. | The program uses default weight 2 . |
-k | motif seed size |
Should be set with a positive integer. The minimum size of the initial motif seeds. | It is strongly recommended not to set it smaller than the default value 3 . |
-c | consistency ratio |
The consistency level of the motif seeds. Should be set between (0.5-1.0] . |
The program uses ratio 1 in default. It must be strictly greater than 0.5 . |
-s | simulation times |
The number of simulation times when generating motif closures. A positive integer greater than or equal to 5 . |
Increasing this parameter will increase the run time. The program uses 5 in default. |
-u | similarity score |
The threshold of two closures' similarity score (0,1] . |
Default threshold is set as 0.95 . |
-a | upper limit |
The upper limit of conservation level (N,10] . |
The program uses 9 in default. |
-N | lower limit |
The lower limit of conservation level (0,a) . |
Lower limit is set as 6 in default. |
-P | palindromic |
TRUE or FALSE , the flag of palindromic of TFBS. |
The flag of palindromic is FALSE in default. |
-M | mirror |
TRUE or FALSE , the flag of mirror of TFBS. |
The flag of mirror of TFBS is FALSE in default. |
-e | enlargement |
The times of seed alignments enlargement. A positive integer between [1,3] . |
The program run 3 times for seed alignments enlargement in default. |
-b | background |
The conserve level (0,1] when search in background genome. Only effective when background gene exists. |
If background gene exists, the level uses 0.95 in default. |
-W | sequences weight |
TRUE or FALSE , the flag of considering sequences weight. |
The flag is set as FALSE in default. |
-A | approximation |
TRUE or FALSE , the flag of approximation of P-value calculation. |
The flag of approximation of P-value calculation is FALSE in default. |
-G | global |
TRUE or FALSE , the flag of global TF prediction. |
Global TF prediction flag is FALSE in default. |
-C | local |
TRUE or FALSE , the flag of local TF prediction. |
Local TF prediction flag is FALSE in default. |
-E | expansion |
TRUE or FALSE , the flag of expansion of closures base on the threshold [0.3,0.8] . |
No expansion of closures with FALSE in default. |
-F | fast |
TRUE or FALSE , the flag of fast version of TESA which just enhance two ends of motifs. |
In default, it is set as FALSE which means runs a full version. |
Any questions, problems, bugs are welcome and should be dumped to Cankun Wang cankun.wang@osumc.edu.