EP2vec


We propose EP2vec, a novel computational framework to assay three-dimensional genomic interactions. We first extract sequence embedding features, defined as fixed-length vector representations learned from variable-length sequences using an unsupervised deep learning method. Then, we train a classifier to predict enhancer-promoter interactions (EPIs) using the learned representations in a supervised way. Experimental results demonstrate that EP2vec obtains F1 scores ranging from 0.841 to 0.933 on different datasets, outperforming both methods based on experimental features and methods based on sequences. We demonstrate the robustness of sequence embedding features by carrying out a sensitivity analysis. In addition, we identify motifs that represent cell line-specific information by analyzing the learned sequence embedding features with an attention mechanism. Finally, we show that even better performance, with F1 scores of 0.889-0.940, can be achieved by combining sequence embedding features with experimental features, which indicates that these two types of features are complementary. In conclusion, EP2vec sheds light on feature extraction for DNA sequences of arbitrary length, provides a powerful approach for identifying three-dimensional interactions, and finds significant motifs through the interpretation of sequence embedding features. In "EP2vec: a deep learning approach for extracting sequence embedding features to predict enhancer-promoter interactions", we

  • Use Paragraph Vector to train enhancer and promoter sequence embedding features separately in an unsupervised way
  • Identify the true enhancer-promoter interactions from other possible interactions in a supervised way



The two-stage workflow of EP2vec. Stage 1 of EP2vec is unsupervised feature extraction, which transforms the enhancer sequences and promoter sequences in a cell line into sequence embedding features separately. Given the set of all known enhancers or promoters in a cell line, we first split each sequence into k-mer words with stride s=1 and assign a unique ID to each sequence. Regarding the preprocessed sequences as sentences, we embed each sentence into a vector using Paragraph Vector. Concretely, we use the vectors of the words in a context, together with the sentence vector, to predict the next word in the context with a softmax classifier. After training converges, we obtain embedding vectors for all words and sentences, where the sentence vectors are exactly the sequence embedding features that we need. Note that in a sentence ID, SEQUENCE is a placeholder for ENHANCER or PROMOTER, and the index runs from 1 to the total number of enhancers or promoters in the cell line. Stage 2 is supervised learning for predicting EPIs. Given a pair of sequences, namely an enhancer sequence and a promoter sequence, we represent the two sequences using the pre-trained vectors and concatenate them to obtain the feature representation of the pair. Lastly, we train a Gradient Boosted Regression Trees classifier to predict whether the pair is a true EPI.
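The two stages map naturally onto gensim's Doc2Vec and scikit-learn's GradientBoostingClassifier. The following is a minimal sketch of the workflow, not the repository's exact code: the toy sequences, tag names, and hyperparameters are illustrative assumptions, and it assumes gensim >= 4, where document vectors live in model.dv. See ep2vec.py for the actual implementation.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

def to_kmer_sentence(seq, k=6, s=1):
    # split a DNA sequence into overlapping k-mer "words" with stride s
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, s)]

# Stage 1: unsupervised feature extraction with Paragraph Vector.
# In practice these lists hold all known enhancers/promoters of a cell line.
enhancers = ["ACGTACGTACGTACGTACGT", "TTGACCGGTAACGTTACCGA"]  # toy sequences
promoters = ["GGCATCGATCGATTACGGCA", "CCGTAGCTAGGCTAACCGTA"]

enh_docs = [TaggedDocument(to_kmer_sentence(s), ["ENHANCER_%d" % i])
            for i, s in enumerate(enhancers)]
pro_docs = [TaggedDocument(to_kmer_sentence(s), ["PROMOTER_%d" % i])
            for i, s in enumerate(promoters)]

# One model per sequence type, since the two feature spaces are learned
# separately; dm=1 selects the distributed-memory variant that matches
# the next-word prediction described above.
enh_model = Doc2Vec(enh_docs, vector_size=100, window=5, min_count=1, dm=1, epochs=20)
pro_model = Doc2Vec(pro_docs, vector_size=100, window=5, min_count=1, dm=1, epochs=20)

# Stage 2: supervised EPI prediction on concatenated pair features.
pairs = [(0, 0, 1), (1, 1, 0)]  # (enhancer index, promoter index, label)
X = np.array([np.concatenate([enh_model.dv["ENHANCER_%d" % e],
                              pro_model.dv["PROMOTER_%d" % p]])
              for e, p, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = GradientBoostingClassifier()  # ep2vec.py defines the real hyperparameters
clf.fit(X, y)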

Training Data


EP2vec uses the same training data as TargetFinder, where interacting enhancer-promoter pairs are annotated using high-resolution genome-wide Hi-C data (Rao et al., 2014). The labeled training datasets used by TargetFinder are available at https://github.com/shwhalen/targetfinder.git. We downsample the training sets to a 1:1 ratio of true to false EPIs in all cell lines; K562train.csv, GM12878train.csv, NHEKtrain.csv, IMR90train.csv, HUVECtrain.csv, and HeLa-S3train.csv are the downsampled training sets (a sketch of this downsampling step is given after the table below). In addition, we need all known enhancers and promoters in a cell line to perform unsupervised feature extraction; we use the enhancer and promoter annotations from TargetFinder at https://github.com/shwhalen/targetfinder.git. The following table shows the details of each cell line dataset. The enhancers (or promoters) column gives the number of all known active enhancers (or promoters) for each cell line, which are used for unsupervised feature learning on enhancer (or promoter) sequences.

Dataset   Enhancers  Promoters  True EPIs  False EPIs
K562          82806       8196       1977        1975
IMR90        108996       5253       1254        1250
GM12878      100036       8453       2113        2110
HUVEC         65358       8180       1524        1520
HeLa-S3      103460       7794       1740        1740
NHEK         144302       5254       1291        1280
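As referenced above, the 1:1 downsampling of the TargetFinder training sets can be reproduced with a few lines of pandas. This is a hedged sketch: the input path is an assumption, as is the presence of a binary 'label' column (1 = true EPI) in the training CSV; both should be verified against the downloaded TargetFinder data.

import pandas as pd

df = pd.read_csv("targetfinder/paper/targetfinder/K562/output-ep/training.csv.gz")  # path is an assumption
pos = df[df["label"] == 1]
neg = df[df["label"] == 0].sample(n=len(pos), random_state=0)  # match the number of positives
pd.concat([pos, neg]).sample(frac=1, random_state=0).to_csv("K562train.csv", index=False)  # shuffle and save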

Model Training


ep2vec.py accepts four parameters: the k-mer length, the stride length, the embedding vector dimension, and the cell line of interest. For example, to get the result for 6-mers with stride 1 and embedding dimension 100 in K562, simply run

python ep2vec.py 6 1 100 K562

and you will get the auROC, F1, and auPRC scores from 10-fold cross-validation. More details can be found in the source code, ep2vec.py.
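These three metrics correspond to scikit-learn's roc_auc, f1, and average_precision scorers under 10-fold cross-validation. A self-contained sketch, using synthetic stand-in features since real feature extraction is shown above:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=200, random_state=0)  # stand-in for pair features
scores = cross_validate(GradientBoostingClassifier(), X, y, cv=10,
                        scoring=["roc_auc", "f1", "average_precision"])  # auROC, F1, auPRC
print(scores["test_roc_auc"].mean(), scores["test_f1"].mean(),
      scores["test_average_precision"].mean())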

Methods Comparison


We compared the performance of four methods, namely EP2vec, TargetFinder, gkmSVM, and SPEID, on the different datasets. We reimplemented TargetFinder and SPEID according to their papers in targetfinder.py and speid.py. Overall, the F1 scores of these four methods across the six cell line datasets range over 0.867-0.933, 0.844-0.922, 0.731-0.822, and 0.809-0.900, respectively.

Dependency


EP2vec requires:

  • Python
  • gensim
  • scikit-learn
  • bedtools
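For example (a suggested setup, not pinned by the repository), the Python packages can be installed with pip and bedtools through bioconda:

pip install gensim scikit-learn
conda install -c bioconda bedtools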
