This is Tobias Rausch's ATACseq pipeline modified to run on our system and to support hg38, his relevant work should be cited (citation info at end).
Some misc additions like bigWig generation added too.
Create a directory where you'd like to install this pipeline and cd into it.
git clone https://github.com/FrietzeLabUVM/ATACseq.git
cd ATACseq
make all
A few R packages are required
conda activate bin/envs/atac2
R --no-init-file -e "install.packages('BiocManager', repos='http://cran.rstudio.com/'); BiocManager::install('GenomicFeatures'); BiocManager::install('DESeq2'); BiocManager::install('vsn'); BiocManager::install('pheatmap'); BiocManager::install('RMariaDB'); BiocManager::install('corrplot'); BiocManager::install('hexbin')"
Patches for hg38 support.
fetchChromSizes hg38 > bin/envs/atac/share/igvtools-2.3.93-0/genomes/hg38.chrom.sizes
perl bin/envs/atac2/share/homer-4.9.1-6/.//configureHomer.pl -install hg38
conda deactivate
Pretty sure this shouldn't be necessary, but it is, and I can't fight conda anymore.
cp bin/envs/atac/lib/libcrypto.so.1.0.0 bin/envs/atac2/lib
If one of the above commands fail your operating system probably lacks some build essentials. These are usually pre-installed but if you lack them you need to install these. For instance, for Ubuntu this would require:
apt-get install build-essential g++ git wget unzip
To annotate motifs and estimate TSS enrichments some simple scripts are included in this repository to download these databases.
cd bed/ && Rscript promoter.R && cd ..
cd motif/ && ./downloadMotifs.sh && cd ..
./src/atac.sh <hg19|mm10> <read1.fq.gz> <read2.fq.gz> <genome.fa> <output prefix>
The pipeline produces at various steps JSON QC files (*.json.gz
). You can upload and interactively browse these files at https://gear.embl.de/alfred/. In addition, the pipeline produces a succinct QC file for each sample. If you have multiple output folders (one for each ATAC-Seq sample) you can simply concatenate the QC metrics of each sample.
head -n 1 ./*/*.key.metrics | grep "TssEnrichment" | uniq > summary.tsv
cat ./*/*.key.metrics | grep -v "TssEnrichment" >> summary.tsv
To plot the distribution for all QC parameters.
Rscript R/metrics.R summary.tsv
The ATAC-Seq pipeline produces various output files.
- Bowtie BAM alignment files filtered for duplicates and mitochondrial reads.
- Quality control output files from alfred, samtools, FastQC and cutadapt adapter filter metrics.
- Macs peak calling files and IDR filtered peak lists.
- Succinct browser tracks in bedGraph format and IGV's tdf format.
- Footprint track of nucleosome positions and/or transcription factor bound DNA.
- Homer motif finding results.
Merge peaks across samples and create a raw count matrix.
ls ./Sample1/Sample1.peaks ./Sample2/Sample2.peaks ./SampleN/SampleN.peaks > peaks.lst
ls ./Sample1/Sample1.bam ./Sample2/Sample2.bam ./SampleN/SampleN.bam > bams.lst
./src/count.sh hg19 peaks.lst bams.lst <output prefix>
To call differential peaks on a count matrix for TSS peaks, called counts.tss.gz, using DESeq2 we first need to create a file with sample level information (sample.info). For instance, if you have 2 replicates per condition:
echo -e "name\tcondition" > sample.info
zcat counts.tss.gz | head -n 1 | cut -f 5- | tr '\t' '\n' | sed 's/.final$//' | awk '{print $0"\t"int((NR-1)/2);}' >> sample.info
Rscript R/dpeaks.R counts.tss.gz sample.info
Peaks can of course be intersected with enhancer or conserved element tracks, i.e.:
cd tracks/ && downloadTracks.sh
bedtools intersect -a ./Sample2/Sample2.peaks -b tracks/conserved.bed
There is a basic Rscript available for plotting peak densities.
Rscript R/karyoplot.R input.peaks
As part of the Clinical Genomics and NGS course we put together a ATAC-Seq tutorial in R Statistics.
Tobias Rausch, Markus Hsi-Yang Fritz, Jan O Korbel, Vladimir Benes.
Alfred: Interactive multi-sample BAM alignment statistics, feature counting and feature annotation for long- and short-read sequencing.
Bioinformatics. 2018 Dec 6.
B Erarslan, JB Kunz, T Rausch, P Richter-Pechanska et al.
Chromatin accessibility landscape of pediatric T‐lymphoblastic leukemia and human T‐cell precursors
EMBO Mol Med (2020)
This ATAC-Seq pipeline is distributed under the GPLv3.