
Prediction of enhancers

Tianshun Gao edited this page Jan 26, 2025 · 3 revisions

Tutorial on single-cell enhancer prediction

dbscATAC has identified 13,470,526 enhancers and 10,402,346 enhancer-gene interactions derived from 1,668,076 single cells spanning 1,028 tissue/cell types in 13 species.

Download the raw data

Taking GSE149683 as an example, the raw data can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149683. To obtain the Seurat RDS files for all tissue samples, run:

perl Enh_download_rawdata.pl

Preprocess the large matrix into cell-type-specific matrices

Based on the tissue/cell type annotations assigned to all cells, the large matrix extracted from the RDS file is divided into tissue/cell-type-specific sub-populations. To perform this split, run:

Rscript Enh_RDS_split_into_celltype_matrix.R
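The idea behind this split can be illustrated with a minimal Python sketch. It is not the code in Enh_RDS_split_into_celltype_matrix.R; the function name `split_by_cell_type`, the peaks-by-cells list-of-lists layout, and the toy annotations are all assumptions made for illustration.

```python
from collections import defaultdict


def split_by_cell_type(matrix, cell_ids, annotations):
    """Split a peaks-by-cells matrix into one sub-matrix per cell type.

    matrix:      list of rows (one per peak); each row holds one value per cell.
    cell_ids:    column labels, in the same order as the values in each row.
    annotations: dict mapping cell id -> tissue/cell type label.
    """
    # Collect the column indices that belong to each cell type.
    columns_by_type = defaultdict(list)
    for col, cell in enumerate(cell_ids):
        columns_by_type[annotations[cell]].append(col)

    # Slice the same columns out of every peak row.
    return {
        cell_type: {
            "cells": [cell_ids[c] for c in cols],
            "matrix": [[row[c] for c in cols] for row in matrix],
        }
        for cell_type, cols in columns_by_type.items()
    }


# Toy input (hypothetical): 3 peaks x 4 cells.
matrix = [
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [1, 1, 0, 0],
]
cell_ids = ["c1", "c2", "c3", "c4"]
annotations = {"c1": "T cell", "c2": "B cell", "c3": "T cell", "c4": "B cell"}

parts = split_by_cell_type(matrix, cell_ids, annotations)
for cell_type, part in parts.items():
    print(cell_type, part["cells"], part["matrix"])
```

Each resulting sub-matrix keeps all peaks but only the cells annotated with that tissue/cell type, which is the shape of input the enhancer-calling step works on.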

Calling for putative single-cell enhancers

To identify typical single-cell enhancers from hundreds of tissue/cell-type-specific single cells, we improved a previously designed unsupervised method by introducing a weighting system that assigns a quality score to each single cell and then combines the peak profiles of all cells to identify typical enhancers.
In this approach, the ATAC peak profile of each single cell is treated as an independent dataset. The method assumes that higher-quality datasets are more strongly associated with predicted enhancers, while lower-quality datasets are more weakly associated. By comparing the similarities among all single-cell datasets, a relative quality score is assigned to each dataset. Traditionally, the scATAC-seq matrix is binarized to reflect the 'open' or 'closed' state of chromatin, a choice motivated by the sparsity of the data and the conceptual framework of chromatin accessibility. However, a recent study demonstrated that modeling fragment counts, rather than binarizing the matrix, preserves quantitative regulatory information and improves the analysis of scATAC-seq data.
To better evaluate the similarity between the datasets of any two single cells, we use the Tanimoto coefficient. The improved unsupervised learning approach is integrated with the Cicero tool to identify single-cell enhancers and their target genes.
To call putative enhancers, run:

Rscript Enh_RDS_split_into_celltype_matrix.R
perl Enh_Calling_for_putative_enhancers.pl
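The similarity and weighting idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation in Enh_Calling_for_putative_enhancers.pl: the Tanimoto coefficient is written in its vector form, dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)), which accepts fragment counts and reduces to the Jaccard index on 0/1 profiles. Taking the mean similarity to all other cells as the relative quality score, and using a quality-weighted open fraction per peak, are assumptions chosen for simplicity; all function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length non-negative vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same peaks")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0


def relative_quality_scores(profiles):
    """Assumed scoring: mean Tanimoto similarity of each cell to all others."""
    scores = []
    for i, profile in enumerate(profiles):
        sims = [tanimoto(profile, other)
                for j, other in enumerate(profiles) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores


def weighted_peak_support(profiles, scores):
    """Quality-weighted fraction of cells in which each peak is open."""
    n_peaks = len(profiles[0])
    total = sum(scores)
    if total == 0:
        return [0.0] * n_peaks
    return [
        sum(w for p, w in zip(profiles, scores) if p[k] > 0) / total
        for k in range(n_peaks)
    ]


# Toy input (hypothetical): 3 cells x 4 peaks, one row per cell.
profiles = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
scores = relative_quality_scores(profiles)
support = weighted_peak_support(profiles, scores)
print(scores)
print(support)
```

In this toy input the third cell shares no peaks with the others, so it receives a score of zero and contributes nothing to the combined profile, which mirrors the assumption that low-quality datasets are weakly associated with the predicted enhancers.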

Filtering by promoters, silencers, and exon regions

To obtain the final enhancers, the putative single-cell enhancers are filtered against promoters, silencers, and exon regions. Only single-cell enhancers that do not overlap any promoter, silencer, or exon region are retained:

perl Enh_filterring_by_pro_exon_silencer.pl
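The overlap rule can be sketched in Python as below. This is a minimal illustration, not the logic of Enh_filterring_by_pro_exon_silencer.pl: it assumes BED-style half-open coordinates (start inclusive, end exclusive) and treats any shared base as an overlap. The function names and toy regions are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def filter_enhancers(enhancers, excluded):
    """Keep enhancers that overlap no excluded region on the same chromosome.

    enhancers, excluded: lists of (chrom, start, end) tuples.
    """
    # Group excluded regions (promoters, silencers, exons) by chromosome.
    excluded_by_chrom = {}
    for chrom, start, end in excluded:
        excluded_by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in enhancers:
        regions = excluded_by_chrom.get(chrom, [])
        if not any(overlaps(start, end, s, e) for s, e in regions):
            kept.append((chrom, start, end))
    return kept


# Toy input (hypothetical coordinates).
putative = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
excluded = [
    ("chr1", 150, 160),  # promoter inside the first enhancer
    ("chr1", 600, 700),  # exon that only touches the second enhancer's end
    ("chr3", 100, 200),  # silencer on a different chromosome
]

final = filter_enhancers(putative, excluded)
print(final)
```

The linear scan is adequate for a small example; for genome-scale inputs a sorted sweep or an interval tree would be the usual choice.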

Cell-type-specific single-cell enhancers can be visualized through the module "Search single-cell enhancers" on the home page of our website:
[Figure: Enh_example]
