Skip to content

Dowell-Lab/DAStk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DAStk

The Differential ATAC-seq Toolkit (DAStk) is a set of command-line tools to aid analyzing differential ATAC-seq data. By leveraging changes in accessible chromatin, we can detect significant changes in transcription factor (TF) activity. This is a simple but powerful tool for cellular perturbation analysis. In fact, the applications of DAStk are not necessarily limited to ATAC-seq, but can extend to comparing any pair of bedGraphs containing regions of interest, e.g. transcription regulatory elements (TRE) from nascent transcription assays like PRO-seq, or others like ChIP-seq peaks.

You will need the following inputs:

  • A pair of files listing peaks of ATAC-seq signal in two biological conditions (e.g. DMSO and drug-treated) in any BedGraph-compatible format (tab-delimited)
  • A set of files listing the putative binding sites across the reference genome of choice, one file per transcription factor motif, also in any BedGraph-like format. These are normally generated from position weight matrices (PWMs) available at TF model databases like HOCOMOCO. These files are expected to have a .bed, .BedGraph or .txt extension.

IMPORTANT: All files mentioned above (ATAC-seq peaks and computed motif sites) MUST be sorted by the same criteria. Different bioinformatics tools use different lexical sorting criteria (e.g. chr10 after chr1, or chr2 after chr1) so please ensure the sorting criteria is uniform.

Install

You can install DAStk using pip:

$ pip install DAStk

... or:

$ pip install --upgrade DAStk 

This is the simplest option, and it will also create the executable commands process_atac and differential_md_score. Alternatively, you can clone this repository by running:

$ git clone https://github.com/Dowell-Lab/DAStk.git

Required Python libraries (automatically taken care of if installed thru pip):

  • numpy
  • argparse
  • matplotlib
  • scipy
  • adjustText
  • pandas
  • multiprocessing
  • pybedtools
  • futures
  • scikit-learn
  • itertools
  • networkx
  • upsetplot

These scripts feature comprehensive help when called with the --help argument. Every argument provides a short and long form (i.e. -t or --threads). There are normally two steps in a typical workflow:

  1. Process the ATAC-seq peak files (process_atac) to calculate the MD-score statistic for each motif provided.
  2. Detect the most statistically significant changes in MD-score (differential_md_score) between both biological conditions (taking into the account the peaks involved), and generate MA and barcode plots.

TL;DR;

Process each ATAC-seq peak bedGraph file, the genome abbreviations are "hg38", "mm10", etc:

$ process_atac -e PEAKS_FILENAME_1 -m MOTIF_SITES_DIRECTORY -g GENOME_ABBREVIATION -o OUTPUT_DIRECTORY
$ process_atac -e PEAKS_FILENAME_2 -m MOTIF_SITES_DIRECTORY -g GENOME_ABBREVIATION -o OUTPUT_DIRECTORY

Perform differential analysis on the calculated motif displacement scores, highlighting the significant motifs at your p-value threshold of choice:

$ differential_md_score -1 MD_OUTPUT_FOR_FILENAME_1 -2 MD_OUTPUT_FOR_FILENAME_2 -p 0.0000001 -m "DMSO" -n "Drug treatment" -b -o OUTPUT_DIRECTORY

Perform downstream analysis to discover what is common among those significant TFs:

$ tf_result_explanations -p 0.0000001 -d DIFFERENTIAL_MD_SCORE_OUTPUT_FILENAME -o EXPLANATION_FILENAME

Questions?

$ process_atac --help
$ differential_md_score --help

Data Processing and Plotting

There are three plotting functions to assess the results from process_atac and differential_md_score:

  • ma_plot
  • barcode_plot
  • tf_intersect

The first two allow you to generate custom MA and Barcode plots using the stats file produced using the differential_md_score module. While the MA and Barcode plots will be generated by default when running the differential_md_score module, it may be useful to adjust the labels, adjust the p-value threshold, compare motifs with non-significant p-values etc. The default plots will generate labels based on your file basenames (everything before the _md_score.txt extension if you kept the default file names from process_atac). However, to keep plots scaled appropriately, this label is limited to 19 characters. If it is longer, it will be truncated.

The third plotting module, tf_intersect can be used to generate a venn diagram (3 or fewer conditions) or an upset plot using multiple result files as inputs to compare common motifs between conditions. For example, using the -e/--enriched argument and by providing multiple *_md_score.txt files files from process_atac, all motifs with an MD score >0.2 will be intersected. Additionally, if more than three input stats files are specified, three catplots will be generated alongside the upset plot by taking the mean value of all of the stats files for total genome-wide motif hits, motif hits within 3kb of the region file, and the MD score.

IMPORTANT Either raw MD score outputs from process_atac or differential MD scores from differential_md_score may be used, however it is not recommended to mix file input types. There are also some plotting functions unique to each (e.g. intersecting significant motifs is only relevant to the differential_md_score outputs. Furthermore, -e/--enriched and -d/--depleted have different relative values between the two output types. Using the process_atac results, the raw MD score will be used and defaults are relative to a background singal of 0.1. Using the differential_md_score stats output, the enriched/depleted is based on the differential MD score (i.e. the difference in MD scores between the two conditions) and therefore the thresholds are specified by a different argument.

A stats file of the intersection data will also be generated in the format needed to generate the upset plot (see https://pypi.org/project/pyupset/0.1.1.post4/). There is a limit of 12 input files for plotting. However, if more than 12 files are specified, a stats file output will still be generated specifying motif intersections.

The following will provide a full list of arguments and usage instructions instructions for each of these plotting tools:

$ ma_plot --help
$ barcode_plot --help
$ tf_intersect --help

While these plots will be generated by default when running the differential_md_score module, it may be useful to adjust the labels, adjust the p-value threshold, compare motifs with non-significant p-values etc. The default plots will generate labels based on your file basenames (everything before the _md_score.txt extension if you kept the default file names from process_atac). However, to keep plots scaled appropriately, this label is limited to 19 characters. If it is longer, it will be truncated.

Inquiring about TF-TF relations

In release v1.0.0 a new functionality has been included, if using assays on human tissue and the TF motifs from HOCOMOCO:

$ tf_result_explanations --help

This build now includes a graph that combines all human proteins in Uniprot, with their respective annotations, and all human Reactome data on biological pathways, biochemical reactions, and complex formation. This allows the user to query for existing knowledge about what aspects of the significantly changing TFs are shared among two or more of them. They may participate in the same pathway or biological process, be members of the same complex, have shared cofactors, directly interact with each other, etc.

You can optionally provide a list of "uninteresting" intersections by listing the ontology concept URIs in a file (see our example list, for things in common between TFs that are too obvious), so that they are omitted from the output. You can also specify additional entities to include in the search, besides the TFs showing a change of activity beyond the provided cutoff, also in a different file listing ontology URIs. The list of labels for each concept used shows the possible URIs that could be included as an additional search endpoint. For example, if we end up with 5 different TFs that are significantly changing between conditions and we also want to search how they all relate to zinc, we can provide a file with the URI http://purl.obolibrary.org/obo/CHEBI_29105. This tool produces a report that looks like the excerpt below:

Transcription factors displaying a significant difference in activity:
CEBPB, CEBPD, CEBPE, ELK1, HLF, NFIA, NFIB, NFIX, NFYB, NRF1, TP53, ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, ZNF93

Here's what we know about these TFs presenting significant activity changes (p=1.00E-03):

Direct interactions between each of these TFs:
NFYB interacts with TP53
TP53 interacts with CEBPB

Other ways these TFs are related:
CEBPB, CEBPE, ELK1, HLF, NFIA, NFIB, NFIX, NFYB, NRF1, TP53, ZNF180, ZNF341, ZNF396, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: located in nucleus

ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: has component Zinc finger, C2H2 type

ZNF180, ZNF341, ZNF396, ZNF432, ZNF441, ZNF519, ZNF529, and ZNF540: has function metal ion binding

ZNF180, ZNF432, ZNF441, ZNF519, ZNF529, ZNF540, and ZNF93: has component KRAB box

CEBPB, CEBPD, CEBPE, NFYB, TP53, and ZNF396: has function protein heterodimerization activity

ZNF180, ZNF432, ZNF441, ZNF519, ZNF529, and ZNF540: molecularly interacts with KRAB-ZNF / KAP Complex [nucleoplasm]

CEBPB, CEBPD, CEBPE, and HLF: has component Basic region leucine zipper

[...]

CEBPB and CEBPD: interacts with ATF4

CEBPB and CEBPD: interacts with CEBPA

CEBPB and CEBPD: participates in positive regulation of osteoblast differentiation

CEBPB and CEBPE: participates in cellular response to lipopolysaccharide

CEBPB and CEBPE: participates in defense response to bacterium

Usage examples

Unpack the motif files (see below for how to create your own, instead):

$ mkdir /path/to/grch38_motifs
$ tar xvfz motifs/HOCOMOCO_v11_p1e-6_grch38.tar.gz --directory /path/to/grch38_motifs

Calculate the MD-scores for the first biological condition:

$ process_atac --threads 8 --atac-peaks /path/to/DMSO/ATAC/peaks/file \
  --genome hg38 \
  --motif-path /path/to/directory/containing/motif/files \
  --output /path/to/output/directory

The above command generates a file called BASENAME_md_scores.txt. It's generally a good idea to use the cell type (or sample number) and a brief condition description (e.g. k562_DMSO or SRR1234123_Metronidazole) in the file name provided.

We would then generate the same file, for the other biological condition we are comparing against:

$ process_atac --threads 8 --atac-peaks /path/to/treatment/ATAC/peaks/file \
  --genome hg38 \
  --motif-path /path/to/directory/containing/motif/files \
  --output /path/to/output/directory

The above generates a file called BASENAME_md_scores.txt. Finally:

$ differential_md_score --assay-1 DMSO --assay-2 Treatment --p-value 0.0000001 --barcodes --output /path/to/output/directory

The above generates a tab-delimited file with all differential MD scores for each motif and their p-values (sorted by p-value), an MA plot that labels the most significant TF activity changes, at a p-value cutoff of 1e-7. Note that better plot-friendly condition names (say, "DMSO" and "Treatment") can be provided using the --label-1 and --label-2 arguments. The plots look like the example below:

Sample MA plot

The -b flag also generates a "barcode plot" of each of these statistically significat motif differences that depicts how close the ATAC-seq peaks were to the (putative TF-binding) motif sites, within a 1500 base-pair radius of the motif center:

Sample barcode plot

If you can take advantage of multiprocessing, you can calculate MD-scores for both conditions simultaneously, assigning several threads to each, then generate the plots once both *_md_scores.txt files are ready.

The columns for the tab-separated output file from differential_md_score are:

Motif name , p-value , # total motif hits, # nearby peaks in condition 1 , # nearby peaks in condition 2 , MD-score in condition 1 , MD-score in condition 2, differential MD-score

To query what we know about these highlighted TFs displaying a significant difference in activity, we can use:

$ tf_result_explanations -p 0.0000001 \
  -d /path/to/output/directory/your_condition1_vs_condition2_differential_md_score.txt \
  -o /path/to/report_file.txt

Additional Arguments

Genome File

If your genome is not incuded in the UCSC genome repository, you will instead need to provide a chromosome sizes file in processess_atac. This can be generated using samtools faidx as follows:

$ samtools faidx genome.fa
$ cut -f1,2 genome.fa.fai > genome.chrom.sizes

This file can then be specified using the -c/--chromosomes argument in process_atac. Scaffold chromosomes will be removed. Please note that the --genome and --chromosomes arguments are mutually exclusive, and that the --genome argument assumes the [fetchChromsizes](https://anaconda.org/bioconda/ucsc-fetchchromsizes) tool is installed.

Normalization

If the -g/--global-normalization argument is used in the differential_md_score module, then the total number of genome-wide motif hits will be used to normalize the barcode plots. This may be helpful in better assessing changes across different perturbations/experiments or cell lines. Otherwise, the barcode plots will simply be set to the same max heat to facilitate better visualization of relative motif density between conditions. This normalization argument has also been implemented in the barcode_plot plotting module and as such the output stats file from differential_md_score now include the total number of genome-wide motif hits for each motif.

NOTE Using the -g/--global-normalization argument will also result in a different heatmap color (YlOrRd) rather than the default (cividis) so that the two normalization types are easily differentiated.

Altering Window Size

While we strongly recommend using the default 1500bp radius window in calculating the MD score (and differential MD score), as of v0.3.0 we now have a radius argument (-r/--radius) which will allow you to expand or shrink this window. If changed, the MD score calculation will follow the same principle in that it will be a ratio of motifs hits within the cetner 1/10th of the window relative divided by the number of total motif hits within the window. For example, if the user specifies a radius of 2000bp, there will be a window size of 4000bp, a center of 400bp around the features of interest, and the MD score will be # motif hits within 400bp/ # motifs within 4000bp. Keep in mind that expanding this window may be useful in visualization, but will result in an MD score approaching 0.1 (background).

Motif Files

Feel free to use the pre-scanned motif files provided (available through Zenodo): HOCOMOCO_v11_p1e-6_grch38.tar.gz, HOCOMOCO_v11_p1e-6_hg19.tar.gz and HOCOMOCO_v11_p1e-6_mm10.tar.gz for the GRCh38/hg38, hg19 and mm10 reference genomes, respectively. They have been generated from HOCOMOCO's v11 mononucleotide model, background-corrected for each reference genome. To generate your own bed files for each motif from this or any other source, you can use FIMO in combination with the downloaded .meme files from your TF database of choice. For example, if using HOCOMOCO, you can create the motif file for TP53 using their mononucleotide model with a p-value threshold of 0.000001 by:

$ fimo -max-stored-scores 10000000 --thresh 1e-6 -oc /path/to/output/directory -motif /path/to/motif/file \
  /path/to/HOCOMOCOv11_HUMAN_mono_meme_format.meme /path/to/whole_genome.fa

Please refer to the complete FIMO documentation for any questions.


Citation

Please cite DAStk if you have used it in your research!

Tripodi, I.J.; Allen, M.A.; Dowell, R.D. Detecting Differential Transcription Factor Activity from ATAC-Seq Data. Molecules 2018, 23, 1136.

If you have used the provided scanned motif regions from HOCOMOCO, please cite them as well:

Kulakovskiy, I.V., Vorontsov, I.E., Yevshin, I.S., Sharipov, R.N., Fedorova, A.D., Rumynskiy, E.I., Medvedeva, Y.A., Magana-Mora, A., Bajic, V.B., Papatsenko, D.A., et al. (2018). HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res 46, D252–D259.

For any questions or bug reports, please use the Issue Tracker.

Ignacio Tripodi (ignacio.tripodi at colorado.edu)
Computer Science Department, BioFrontiers Institute
University of Colorado, Boulder, USA

Margaret Gruca (margaret.gruca at colorado.edu)
BioFrontiers Institute
University of Colorado, Boulder, USA