# Transcription Factor Footprinting with ATAC-STARR Accessibility Data

## Introduction

ATAC-STARR yields chromatin accessibility data, which we can use for transcription factor (TF) footprinting to identify genomic locations bound by a TF. To do this, we use the TOBIAS software package.

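We analyze the Buenrostro GM12878 ATAC-seq data in parallel as our benchmark. For the Buenrostro data we only need the corrected cut-counts file from the first step (ATACorrect), not the output of the second (ScoreBigwig) or third (BINDetect) steps.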
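
As a rough orientation, the three TOBIAS steps used in this notebook can be sketched as one dry-run function. This is only a sketch: `run`, `tobias_steps`, `DRY_RUN`, and the placeholder file names in the example call are hypothetical and not part of TOBIAS, and the BINDetect output directory name is an assumption. Only flags that appear in the cells below are used.

```bash
#!/usr/bin/env bash
# Dry-run sketch of the three TOBIAS steps used in this notebook.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that prints a command or executes it.
run() {
    if [[ "$DRY_RUN" == "1" ]]; then
        printf '%q ' "$@"
        printf '\n'
    else
        "$@"
    fi
}

# tobias_steps: hypothetical wrapper; arguments are
# prefix, bam, peaks, genome, blacklist, motifs, output directory.
tobias_steps() {
    local prefix="$1" bam="$2" peaks="$3" genome="$4" blacklist="$5" motifs="$6" out="$7"

    # Step 1: Tn5-bias correction of cut counts.
    run TOBIAS ATACorrect --bam "$bam" --genome "$genome" --peaks "$peaks" \
        --blacklist "$blacklist" --outdir "${out}/ATACorrect" \
        --cores 12 --prefix "$prefix"

    # Step 2: footprint scores from the corrected signal.
    run TOBIAS ScoreBigwig --signal "${out}/ATACorrect/${prefix}_corrected.bw" \
        --regions "$peaks" --output "${out}/ScoreBigwig/${prefix}_footprints.bw" --cores 12

    # Step 3: bound/unbound calls per motif (output directory name is an assumption).
    run TOBIAS BINDetect --bound-pvalue 0.05 --motifs "$motifs" \
        --signals "${out}/ScoreBigwig/${prefix}_footprints.bw" \
        --genome "$genome" --peaks "$peaks" \
        --outdir "${out}/BINDetect/${prefix}" --cond_names "$prefix" --cores 12
}

# Example call with placeholder file names.
tobias_steps sample sample.bam peaks.narrowPeak genome.fa blacklist.bed motifs.txt results
```

For the Buenrostro benchmark only the first of these three commands is needed.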

## Run ATACorrect to get Tn5-bias-corrected cut counts

In [1]:
%%bash
hg38='/data/hodges_lab/hg38_genome/hg38.fa'
DIR='/data/hodges_lab/public_data/GM12878/obtained_as_raw_files/ATAC_GM12878_2013-buenrostro/data'
OUT='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'
#GM12878 Buenrostro ATAC-seq. Only the corrected cut-counts file is needed for Buenrostro.
TOBIAS ATACorrect --bam ${DIR}/bams/pos-sorted/GM12878_ATAC-seq_buenrostro_merged.unique.pos-sorted.bam --genome $hg38 \
    --peaks ${DIR}/genrich_peaks/GM12878_ATAC-seq_buenrostro_genrich_0.0001-qvalue.narrowPeak \
    --blacklist /home/hansetj1/hg38_encode_blacklist_ENCFF356LFX.bed --outdir ${OUT}/ATACorrect \
    --cores 12 --prefix GM12878_ATAC-seq_buenrostro

# TOBIAS 0.12.12 ATACorrect (run started 2021-11-10 11:37:40.593256)
# Working directory: /gpfs52/data/hodges_lab/ATAC-STARR_B-cells/bin
# Command line call: TOBIAS ATACorrect --bam /data/hodges_lab/public_data/GM12878/obtained_as_raw_files/ATAC_GM12878_2013-buenrostro/data/bams/pos-sorted/GM12878_ATAC-seq_buenrostro_merged.unique.pos-sorted.bam --genome /data/hodges_lab/hg38_genome/hg38.fa --peaks /data/hodges_lab/public_data/GM12878/obtained_as_raw_files/ATAC_GM12878_2013-buenrostro/data/genrich_peaks/GM12878_ATAC-seq_buenrostro_genrich_0.0001-qvalue.narrowPeak --blacklist /home/hansetj1/hg38_encode_blacklist_ENCFF356LFX.bed --outdir /data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect --cores 12 --prefix GM12878_ATAC-seq_buenrostro

# ----- Input parameters -----
# bam:	/data/hodges_lab/public_data/GM12878/obtained_as_raw_files/ATAC_GM12878_2013-buenrostro/data/bams/pos-sorted/GM12878_ATAC-seq_buenrostro_merged.unique.pos-sorted.bam
# genome:	/data/hodge

In [2]:
%%bash
#GM12878 ATAC-STARR Accessibility Peaks
hg38='/data/hodges_lab/hg38_genome/hg38.fa'
DIR='/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR'
OUT='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'
TOBIAS ATACorrect --bam ${DIR}/bams/merged_replicates/GM12878inGM12878_DNA_merged.unique.pos-sorted.bam --genome $hg38 \
    --peaks ${DIR}/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue.narrowPeak \
    --blacklist /home/hansetj1/hg38_encode_blacklist_ENCFF356LFX.bed --outdir ${OUT}/ATACorrect \
    --cores 12 --prefix GM12878inGM12878_DNA

# TOBIAS 0.12.12 ATACorrect (run started 2021-11-10 12:16:19.906043)
# Working directory: /gpfs52/data/hodges_lab/ATAC-STARR_B-cells/bin
# Command line call: TOBIAS ATACorrect --bam /data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/bams/merged_replicates/GM12878inGM12878_DNA_merged.unique.pos-sorted.bam --genome /data/hodges_lab/hg38_genome/hg38.fa --peaks /data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue.narrowPeak --blacklist /home/hansetj1/hg38_encode_blacklist_ENCFF356LFX.bed --outdir /data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect --cores 12 --prefix GM12878inGM12878_DNA

# ----- Input parameters -----
# bam:	/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/bams/merged_replicates/GM12878inGM12878_DNA_merged.unique.pos-sorted.bam
# genome:	/data/hodges_lab/hg38_genome/hg38.fa
# peaks:	/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/ChrAcc_peaks/GM12878inGM12878_DNA_genr
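
Before moving on, it can help to confirm that both corrected bigWigs were actually written, since every later step reads them. The check below is a minimal sketch: `require_files` is a hypothetical helper (not part of TOBIAS), and the paths assume the `--outdir` and `--prefix` values used in the two ATACorrect cells above.

```bash
#!/usr/bin/env bash
# require_files: hypothetical helper (not part of TOBIAS).
# Reports every path that is missing or empty and returns 1 if any were found.
require_files() {
    local missing=0
    local f
    for f in "$@"; do
        if [[ ! -s "$f" ]]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Paths assume the --outdir and --prefix values from the ATACorrect cells above.
OUT="${OUT:-/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting}"

require_files \
    "${OUT}/ATACorrect/GM12878_ATAC-seq_buenrostro_corrected.bw" \
    "${OUT}/ATACorrect/GM12878inGM12878_DNA_corrected.bw" \
    || echo "ATACorrect output incomplete; rerun before ScoreBigwig" >&2
```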

## Run ScoreBigwig to create footprint signal files

In [3]:
%%bash
DIR='/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR'
OUT='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'
TOBIAS ScoreBigwig --signal ${OUT}/ATACorrect/GM12878inGM12878_DNA_corrected.bw \
    --regions ${DIR}/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue.narrowPeak \
    --output ${OUT}/ScoreBigwig/GM12878inGM12878_DNA_footprints.bw --cores 12

# TOBIAS 0.12.12 ScoreBigwig (run started 2021-11-10 12:57:52.954379)
# Working directory: /gpfs52/data/hodges_lab/ATAC-STARR_B-cells/bin
# Command line call: TOBIAS ScoreBigwig --signal /data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect/GM12878inGM12878_DNA_corrected.bw --regions /data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue.narrowPeak --output /data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ScoreBigwig/GM12878inGM12878_DNA_footprints.bw --cores 12

# ----- Input parameters -----
# signal:	/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect/GM12878inGM12878_DNA_corrected.bw
# output:	/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ScoreBigwig/GM12878inGM12878_DNA_footprints.bw
# regions:	/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue

## Run BINDetect to identify TF families that are bound to the GM12878 genome

In [4]:
%%bash
DIR='/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR'
OUT='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'
MOTIF='/home/hansetj1/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt'
hg38='/data/hodges_lab/hg38_genome/hg38.fa'

TOBIAS BINDetect --bound-pvalue 0.05 --motifs $MOTIF \
    --signals ${OUT}/ScoreBigwig/GM12878inGM12878_DNA_footprints.bw \
    --genome $hg38 --peaks ${DIR}/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue.narrowPeak \
    --outdir ${OUT}/BINDetect/GM12878inGM12878_0.05 --cond_names GM12878inGM12878_DNA --cores 12

# TOBIAS 0.12.12 BINDetect (run started 2021-11-10 13:17:00.895320)
# Working directory: /gpfs52/data/hodges_lab/ATAC-STARR_B-cells/bin
# Command line call: TOBIAS BINDetect --bound-pvalue 0.05 --motifs /home/hansetj1/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt --signals /data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ScoreBigwig/GM12878inGM12878_DNA_footprints.bw --genome /data/hodges_lab/hg38_genome/hg38.fa --peaks /data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/ChrAcc_peaks/GM12878inGM12878_DNA_genrich_3-replicates_0.0001-qvalue.narrowPeak --outdir /data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/BINDetect/GM12878inGM12878_0.05 --cond_names GM12878inGM12878_DNA --cores 12

# ----- Input parameters -----
# signals:	['/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ScoreBigwig/GM12878inGM12878_DNA_footprints.bw']
# peaks:	/data/hodges_lab/ATAC-STARR_B-cells/data/ATAC-STARR/ChrAcc_peaks/GM12878inGM128

Matplotlib is building the font cache; this may take a moment.
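
BINDetect writes one directory per motif, and the cells below read `<motif>_all.bed`, `<motif>_<condition>_bound.bed`, and `<motif>_<condition>_unbound.bed` from its `beds/` subfolder. A quick way to see how many sites were called bound versus unbound is to count lines in those files. The sketch below assumes that layout; `count_sites` is a hypothetical helper, not a TOBIAS command.

```bash
#!/usr/bin/env bash
# count_sites: hypothetical helper (not part of TOBIAS).
# Prints "<motif><TAB><bound sites><TAB><unbound sites>" for one motif directory.
count_sites() {
    local bindetect_dir="$1" motif="$2" cond="$3"
    local beds="${bindetect_dir}/${motif}/beds"
    local bound unbound
    # awk prints the number of lines (one BED record per line).
    bound=$(awk 'END { print NR }' "${beds}/${motif}_${cond}_bound.bed")
    unbound=$(awk 'END { print NR }' "${beds}/${motif}_${cond}_unbound.bed")
    printf '%s\t%s\t%s\n' "$motif" "$bound" "$unbound"
}

# Output directory and condition name as used in the BINDetect cell above.
BED_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/BINDetect/GM12878inGM12878_0.05'

if [[ -d "$BED_DIR" ]]; then
    for motif in CTCF_MA0139.1 JUNB_MA0490.2; do
        count_sites "$BED_DIR" "$motif" GM12878inGM12878_DNA
    done
fi
```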


## Generate CTCF and JUNB heatmaps

In [None]:
%%bash
#BedFiles
BED_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/BINDetect/GM12878inGM12878_0.05'

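#BigWigs
#Note: BUEN_DIR is not the --outdir used in the Buenrostro ATACorrect cell above; the corrected Buenrostro bigWig is assumed to have been placed here separately.
BUEN_DIR='/data/hodges_lab/public_data/GM12878/obtained_as_raw_files/ATAC_GM12878_2013-buenrostro/data/footprinting/ATACorrect'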
AS_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect'
ENCODE_DIR='/data/hodges_lab/public_data/GM12878/obtained_as_processed_files/from-ENCODE/bigWig'

#Output
OUTPUT_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'

#CTCF
#Make signal heatmaps from the ENCODE CTCF ChIP-seq bigWig and the TOBIAS-corrected cut-count bigWigs at CTCF motif sites. Rank by mean ChIP-seq signal. Show the heatmap and colorbar only (no profile plot).
computeMatrix reference-point -S ${ENCODE_DIR}/GM12878_CTCF_ChIP_hg38_ENCFF485CGE.bw \
    ${BUEN_DIR}/GM12878_ATAC-seq_buenrostro_corrected.bw \
    ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/CTCF_MA0139.1/beds/CTCF_MA0139.1_all.bed \
    -a 200 -b 200 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_CTCF.gz

plotHeatmap -m ${OUTPUT_DIR}/matrix_footprinting_CTCF.gz -o ${OUTPUT_DIR}/heatmap_footprinting_CTCF.pdf \
    --dpi 300 --plotFileFormat pdf --sortUsing mean --sortUsingSamples 1 \
    --heatmapHeight 15 --refPointLabel center  --regionsLabel "Accessible CTCF Motifs" \
    --samplesLabel "CTCF ChIP-seq" "Buenrostro" "ATAC-STARR Accessibility" --zMin 0 0 0 --zMax 25 0.1 0.1 \
    --colorMap Blues gist_heat_r gist_heat_r --whatToShow "heatmap and colorbar"

#JUNB
#Make signal heatmaps from the ENCODE JUNB ChIP-seq bigWig and the TOBIAS-corrected cut-count bigWigs at JUNB motif sites. Rank by mean ChIP-seq signal. Show the heatmap and colorbar only (no profile plot).
computeMatrix reference-point -S ${ENCODE_DIR}/GM12878_JUNB_hg38_ENCFF245XUS.bw \
    ${BUEN_DIR}/GM12878_ATAC-seq_buenrostro_corrected.bw \
    ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/JUNB_MA0490.2/beds/JUNB_MA0490.2_all.bed \
    -a 200 -b 200 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_JUNB.gz

plotHeatmap -m ${OUTPUT_DIR}/matrix_footprinting_JUNB.gz -o ${OUTPUT_DIR}/heatmap_footprinting_JUNB.pdf \
    --dpi 300 --plotFileFormat pdf --sortUsing mean --sortUsingSamples 1 \
    --heatmapHeight 15 --refPointLabel center  --regionsLabel "Accessible JUNB Motifs" \
    --samplesLabel "JUNB ChIP-seq" "Buenrostro" "ATAC-STARR Accessibility" --zMin 0 0 0 --zMax 25 0.1 0.1 \
    --colorMap Blues gist_heat_r gist_heat_r --whatToShow "heatmap and colorbar"
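
A computeMatrix `.gz` file is gzipped text: a header line starting with `@`, followed by one row per region. Counting the data rows is a cheap sanity check that the heatmap covers the expected number of motif sites. The sketch below is illustrative only; `matrix_regions` is a hypothetical helper, and the row count can be lower than the BED line count if some regions fall on chromosomes absent from the bigWigs.

```bash
#!/usr/bin/env bash
# matrix_regions: hypothetical helper (not part of deepTools).
# Counts data rows in a computeMatrix output, i.e. all lines not starting with '@'.
matrix_regions() {
    gzip -dc "$1" | grep -vc '^@'
}

# Matrix path as written by the CTCF heatmap cell above.
OUTPUT_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'
MATRIX="${OUTPUT_DIR}/matrix_footprinting_CTCF.gz"

if [[ -s "$MATRIX" ]]; then
    echo "CTCF regions in matrix: $(matrix_regions "$MATRIX")"
fi
```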


## Generate aggregate plots

In [6]:
%%bash
#BedFiles
BED_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/BINDetect/GM12878inGM12878_0.05'

#BigWigs
BUEN_DIR='/data/hodges_lab/public_data/GM12878/obtained_as_raw_files/ATAC_GM12878_2013-buenrostro/data/footprinting/ATACorrect'
AS_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect'
ENCODE_DIR='/data/hodges_lab/public_data/GM12878/obtained_as_processed_files/from-ENCODE/bigWig'

#Output
OUTPUT_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'

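#Make aggregate plots using cutcounts signal file at CTCF sites. Separate by unbound and bound.
#Note: this overwrites the matrix_footprinting_CTCF.gz written by the heatmap cell above.
computeMatrix reference-point -S ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \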
    -R ${BED_DIR}/CTCF_MA0139.1/beds/CTCF_MA0139.1_GM12878inGM12878_DNA_bound.bed \
    ${BED_DIR}/CTCF_MA0139.1/beds/CTCF_MA0139.1_GM12878inGM12878_DNA_unbound.bed \
    -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_CTCF.gz

plotProfile -m ${OUTPUT_DIR}/matrix_footprinting_CTCF.gz -o ${OUTPUT_DIR}/aggregate_footprinting_CTCF.pdf \
    --dpi 300 --plotFileFormat pdf --colors black grey \
    --refPointLabel center --yAxisLabel "Accessible TF Motifs" --regionsLabel "bound motif" "unbound motif" \
    --samplesLabel "CTCF" --plotWidth 10 --plotHeight 8 #in cm

#Make aggregate plots using cutcounts signal file at IRF4 sites. Separate by unbound and bound.
computeMatrix reference-point -S ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/IRF4_MA1419.1/beds/IRF4_MA1419.1_GM12878inGM12878_DNA_bound.bed \
    ${BED_DIR}/IRF4_MA1419.1/beds/IRF4_MA1419.1_GM12878inGM12878_DNA_unbound.bed \
    -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_IRF4.gz

plotProfile -m ${OUTPUT_DIR}/matrix_footprinting_IRF4.gz -o ${OUTPUT_DIR}/aggregate_footprinting_IRF4.pdf \
    --dpi 300 --plotFileFormat pdf --colors black grey \
    --refPointLabel center --yAxisLabel "Accessible TF Motifs" --regionsLabel "bound motif" "unbound motif" \
    --samplesLabel "IRF4" --plotWidth 10 --plotHeight 8 #in cm

#Make aggregate plots using cutcounts signal file at PU.1/SPI1 sites. Separate by unbound and bound.
computeMatrix reference-point -S ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/SPI1_MA0080.5/beds/SPI1_MA0080.5_GM12878inGM12878_DNA_bound.bed \
    ${BED_DIR}/SPI1_MA0080.5/beds/SPI1_MA0080.5_GM12878inGM12878_DNA_unbound.bed \
    -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_SPI1.gz

plotProfile -m ${OUTPUT_DIR}/matrix_footprinting_SPI1.gz -o ${OUTPUT_DIR}/aggregate_footprinting_SPI1.pdf \
    --dpi 300 --plotFileFormat pdf --colors black grey \
    --refPointLabel center --yAxisLabel "Accessible TF Motifs" --regionsLabel "bound motif" "unbound motif" \
    --samplesLabel "SPI1" --plotWidth 10 --plotHeight 8 #in cm

#Make aggregate plots using cutcounts signal file at JUNB sites. Separate by unbound and bound.
#Note: this overwrites the matrix_footprinting_JUNB.gz written by the heatmap cell above.
computeMatrix reference-point -S ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/JUNB_MA0490.2/beds/JUNB_MA0490.2_GM12878inGM12878_DNA_bound.bed \
    ${BED_DIR}/JUNB_MA0490.2/beds/JUNB_MA0490.2_GM12878inGM12878_DNA_unbound.bed \
    -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_JUNB.gz

plotProfile -m ${OUTPUT_DIR}/matrix_footprinting_JUNB.gz -o ${OUTPUT_DIR}/aggregate_footprinting_JUNB.pdf \
    --dpi 300 --plotFileFormat pdf --colors black grey \
    --refPointLabel center --yAxisLabel "Accessible TF Motifs" --regionsLabel "bound motif" "unbound motif" \
    --samplesLabel "JUNB" --plotWidth 10 --plotHeight 8 #in cm

#Make aggregate plots using cutcounts signal file at ELK1 sites. Separate by unbound and bound.
computeMatrix reference-point -S ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/ELK1_MA0028.2/beds/ELK1_MA0028.2_GM12878inGM12878_DNA_bound.bed \
    ${BED_DIR}/ELK1_MA0028.2/beds/ELK1_MA0028.2_GM12878inGM12878_DNA_unbound.bed \
    -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_ELK1.gz

plotProfile -m ${OUTPUT_DIR}/matrix_footprinting_ELK1.gz -o ${OUTPUT_DIR}/aggregate_footprinting_ELK1.pdf \
    --dpi 300 --plotFileFormat pdf --colors black grey \
    --refPointLabel center --yAxisLabel "Accessible TF Motifs" --regionsLabel "bound motif" "unbound motif" \
    --samplesLabel "ELK1" --plotWidth 10 --plotHeight 8 #in cm

#Make aggregate plots using cutcounts signal file at NFKB1 sites. Separate by unbound and bound.
computeMatrix reference-point -S ${AS_DIR}/GM12878inGM12878_DNA_corrected.bw \
    -R ${BED_DIR}/NFKB1_MA0105.4/beds/NFKB1_MA0105.4_GM12878inGM12878_DNA_bound.bed \
    ${BED_DIR}/NFKB1_MA0105.4/beds/NFKB1_MA0105.4_GM12878inGM12878_DNA_unbound.bed \
    -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
    -o ${OUTPUT_DIR}/matrix_footprinting_NFKB1.gz

plotProfile -m ${OUTPUT_DIR}/matrix_footprinting_NFKB1.gz -o ${OUTPUT_DIR}/aggregate_footprinting_NFKB1.pdf \
    --dpi 300 --plotFileFormat pdf --colors black grey \
    --refPointLabel center --yAxisLabel "Accessible TF Motifs" --regionsLabel "bound motif" "unbound motif" \
    --samplesLabel "NFKB1" --plotWidth 10 --plotHeight 8 #in cm
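
The six blocks above differ only in the motif ID, so they could be collapsed into a loop. The following is a sketch under assumptions, not the code that produced the figures: `run`, `aggregate_plot`, and `DRY_RUN` are hypothetical names, and the matrices are written to `matrix_aggregate_<TF>.gz` (a deliberate deviation from the cell above) so that they do not overwrite the heatmap matrices. With `DRY_RUN=1`, the default here, the commands are only printed.

```bash
#!/usr/bin/env bash
# Loop version of the aggregate-plot cell; prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper that prints a command or executes it.
run() {
    if [[ "$DRY_RUN" == "1" ]]; then
        printf '%q ' "$@"
        printf '\n'
    else
        "$@"
    fi
}

BED_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/BINDetect/GM12878inGM12878_0.05'
AS_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting/ATACorrect'
OUTPUT_DIR='/data/hodges_lab/ATAC-STARR_B-cells/results/ATAC-STARR_TF-footprinting'
COND='GM12878inGM12878_DNA'

# aggregate_plot: hypothetical wrapper around computeMatrix + plotProfile for one motif.
aggregate_plot() {
    local motif="$1"
    local tf="${motif%%_*}"   # e.g. CTCF_MA0139.1 -> CTCF
    local beds="${BED_DIR}/${motif}/beds"
    # Assumed file name, chosen to avoid clobbering the heatmap matrices.
    local matrix="${OUTPUT_DIR}/matrix_aggregate_${tf}.gz"

    run computeMatrix reference-point -S "${AS_DIR}/${COND}_corrected.bw" \
        -R "${beds}/${motif}_${COND}_bound.bed" "${beds}/${motif}_${COND}_unbound.bed" \
        -a 75 -b 75 --referencePoint center --missingDataAsZero --binSize 1 -p 12 \
        -o "$matrix"

    run plotProfile -m "$matrix" -o "${OUTPUT_DIR}/aggregate_footprinting_${tf}.pdf" \
        --dpi 300 --plotFileFormat pdf --colors black grey \
        --refPointLabel center --yAxisLabel "Accessible TF Motifs" \
        --regionsLabel "bound motif" "unbound motif" \
        --samplesLabel "$tf" --plotWidth 10 --plotHeight 8
}

for motif in CTCF_MA0139.1 IRF4_MA1419.1 SPI1_MA0080.5 \
             JUNB_MA0490.2 ELK1_MA0028.2 NFKB1_MA0105.4; do
    aggregate_plot "$motif"
done
```

Adding another TF then only requires appending its JASPAR motif ID to the loop list.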