# ATAC-seq Module 3: Downstream Analysis

<img src=Tutorial3/LessonImages/ATACseqWorkflowLesson3.jpg alt="Drawing" style="width: 1000px;"/>

## Overview & Purpose
In the previous sections of this module we performed preprocessing quality control, mapping, deduplication, visualization, profiling around TSSs, and peak identification. In this section we will focus on differential peak identification, motif footprinting, and annotation of nearby genomic features. 

### Required Files
In this stage of the module you will use several of the files that we prepared in the previous sections. Don't worry if you are just jumping in now: we have examples of these files saved and will include a step that copies them for your use. You can also use this module on your own data or any published ATAC-seq dataset, but you should complete the mapping and deduplication steps first.

<div class="alert-info" style="font-size:200%">
STEP 1: Set Up Environment
</div>

First we configure the Google Cloud environment. In this step we will use conda and pip to install the following packages:

Differential Peak Identification:
[manorm](https://anaconda.org/bioconda/manormfast)

Genome Annotation:
[homer](https://anaconda.org/bioconda/homer)

Motif Analysis:
[tobias](https://anaconda.org/bioconda/tobias)

In [None]:
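# Count the available CPUs and subtract one so a core stays free for the notebook.
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'
# The shell capture returns a list of strings, so convert the first entry to an integer.
numthreadsint = int(numthreads[0])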
!~/mambaforge/bin/mamba install -c bioconda homer -y
!pip install manorm
!pip install tobias
!pip install jupyterquiz==2.0.7
from jupyterquiz import display_quiz
from IPython.display import IFrame
from IPython.display import display
import pandas as pd
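
The `lscpu` one-liner above reserves one core by subtracting one from the CPU count. As a minimal alternative sketch in pure Python (the name `numthreads_py` is hypothetical and not part of the original setup), the same count can be derived with a floor of one, so that a single-core machine never requests zero threads:

```python
import os

# os.cpu_count() may return None when the count cannot be determined.
detected = os.cpu_count() or 1

# Leave one core free, but never drop below one thread.
numthreads_py = max(1, detected - 1)
print(numthreads_py)
```

If you prefer this version, you could pass `$numthreads_py` to the later commands in place of `$numthreadsint`.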

## Set Up File System
Now let's create some folders to stay organized.

In [None]:
# These commands create our directory structure.
!mkdir -p Tutorial3/InputFiles
!mkdir -p Tutorial3/GenomeAnnotation
!mkdir -p Tutorial3/DiffPeaks
!mkdir -p Tutorial3/MotifFootprinting
!mkdir -p Tutorial3/Plots

# This variable stores the Google Cloud Storage bucket where the example files are held.
original_bucket = "gs://nigms-sandbox/unmc_atac_data_examples/Tutorial3"

# This command copies our example files to the InputFiles folder that we created above.
!gsutil -m cp $original_bucket/InputFiles/* Tutorial3/InputFiles

### OK
Let's make sure that the files copied correctly. You should see two .bam files, two .bai files, and two .narrowPeak files after running the following command:

In [None]:
!ls Tutorial3/InputFiles/*
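
If you would rather check the copy programmatically than read the listing by eye, the following is a minimal sketch. The helper `missing_inputs` and the `EXPECTED` table are hypothetical names introduced here; the counts come from the sentence above and should be adjusted for your own data.

```python
from collections import Counter
from pathlib import Path

# Expected counts taken from the text above; adjust for your own data.
EXPECTED = {".bam": 2, ".bai": 2, ".narrowPeak": 2}

def missing_inputs(directory, expected=EXPECTED):
    """Return {suffix: shortfall} for every suffix with too few files."""
    folder = Path(directory)
    if folder.is_dir():
        found = Counter(p.suffix for p in folder.iterdir())
    else:
        found = Counter()
    return {s: n - found[s] for s, n in expected.items() if found[s] < n}

print(missing_inputs("Tutorial3/InputFiles"))
```

An empty dictionary means that at least the expected number of each file type is present.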

<div class="alert-info" style="font-size:200%">
Differential Peak Identification
</div>
If you have two or more samples and want to find differential peaks, we recommend using manorm. Novices may be tempted to simply intersect the two peak lists to find the overlap; however, this is highly inadvisable.

<div class="alert-info" style="font-size:200%">
Interactive Quiz Question 1: Click on the correct answer in the following cell.
</div>

In [None]:
display_quiz("quiz_files/DiffPeaks.json")

### Consider the peak below, which was identified in both the control and mutant samples. A simple intersect would report this peak as unchanged between the two samples. To capture the difference in signal, we will use [manorm](https://anaconda.org/bioconda/manormfast).

<img src="Tutorial3/LessonImages/PeakOverlapProblem.jpg" alt="Drawing" style="width: 100px;"/>
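
To make the problem concrete, here is a small sketch with made-up numbers. It is a simplified illustration only: it scales read counts by library size, whereas manorm fits its own normalization model, so the function `log2_fold_change` and all values below are assumptions for teaching purposes.

```python
import math

# Toy numbers (hypothetical): reads falling in one shared peak, plus library sizes.
ctl_reads, mutant_reads = 400, 50
ctl_total, mutant_total = 1_000_000, 1_000_000

def log2_fold_change(reads_a, total_a, reads_b, total_b, pseudocount=1):
    """Log2 ratio of library-size-scaled read counts (a simplified M value)."""
    density_a = (reads_a + pseudocount) / total_a
    density_b = (reads_b + pseudocount) / total_b
    return math.log2(density_a / density_b)

# An interval intersect only asks whether the coordinates overlap.
ctl_peak = ("chr4", 1000, 1500)
mutant_peak = ("chr4", 1050, 1450)
overlaps = (ctl_peak[0] == mutant_peak[0]
            and ctl_peak[1] < mutant_peak[2]
            and mutant_peak[1] < ctl_peak[2])

m_value = log2_fold_change(ctl_reads, ctl_total, mutant_reads, mutant_total)
print(overlaps, round(m_value, 2))
```

The intersect reports the peak as shared, while the log2 ratio shows that the signal is several-fold stronger in one sample.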

In [None]:
# We specify several non-default parameters to better reflect ATAC-seq data: --pe (paired-end mode), -w 1000 (window size in bp), and --wa (write all output files).
!manorm --p1 Tutorial3/InputFiles/CTL_peaks.narrowPeak --p2 Tutorial3/InputFiles/Mutant_peaks.narrowPeak --r1 Tutorial3/InputFiles/CTL_dedup.bam --r2 Tutorial3/InputFiles/Mutant_dedup.bam --rf bam --n1 CTL --n2 Mutant --pe -w 1000 -o Tutorial3/DiffPeaks --wa 2> Tutorial3/DiffPeaks/log_manorm.txt
print("done")

The above command will write out several files including the differential peaks for each sample as well as the unchanged peaks.
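
The output file names refer to an M value cutoff (for example `M_above_1.0`). As a hedged sketch of how such a cutoff could split peaks into groups, the toy table below uses an assumed column name `m_value` and a hypothetical helper `classify`. manorm may apply additional criteria, such as a p-value cutoff, that are not modeled here.

```python
import pandas as pd

# Toy table; the column name `m_value` is an assumption for illustration.
peaks = pd.DataFrame({
    "chr": ["chr4", "chr4", "chr4", "chr4"],
    "start": [100, 900, 2000, 5000],
    "end": [400, 1300, 2600, 5400],
    "m_value": [2.3, 0.2, -1.7, -0.4],
})

def classify(m, cutoff=1.0):
    """Label a peak by which sample it favours at the given |M| cutoff."""
    if m >= cutoff:
        return "CTL_biased"
    if m <= -cutoff:
        return "Mutant_biased"
    return "unbiased"

peaks["group"] = peaks["m_value"].apply(classify)
print(peaks["group"].value_counts().to_dict())
```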

In [None]:
!ls Tutorial3/DiffPeaks/output_filters

In [None]:
# Let's also check the format of these files.
!head Tutorial3/DiffPeaks/output_filters/CTL_vs_Mutant_M_above_1.0_biased_peaks.bed

In [None]:
# We can also count how many are in each.
!wc -l Tutorial3/DiffPeaks/output_filters/*bed

In [None]:
# Our log file tells us this information as well.
!tail Tutorial3/DiffPeaks/log_manorm.txt

<div class="alert-info" style="font-size:200%">
Annotating Peaks
</div>

Let's take the differential peaks, annotate them with nearby genes, and perform gene ontology analysis using [homer](https://anaconda.org/bioconda/homer).

First we need to reformat the differential peaks file to the format required by homer.

In an earlier command, we examined the format of manorm's output using head and saw that it outputs a five-column format. We will change this to a six-column BED format that includes a unique name for each peak.

In [None]:
# Reformat the peaks file for homer:
#   - append the line number (NR) to each peak name so that every name is unique
#   - add a placeholder strand in the 6th column (peaks don't necessarily have a strand, but the format requires this column)
# The -F'\t' option tells awk that the file is tab delimited.
!awk -F'\t' '{print $1"\t"$2"\t"$3"\t"$4"_"NR"\t"$5"\t+"}' Tutorial3/DiffPeaks/output_filters/CTL_vs_Mutant_M_above_1.0_biased_peaks.bed > Tutorial3/GenomeAnnotation/CTL_specific_peaks.bed
# Let's view the first lines of the new file to compare.
!head Tutorial3/GenomeAnnotation/CTL_specific_peaks.bed
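
For readers more comfortable with pandas than awk, the same five-to-six-column conversion can be sketched as follows. The two input rows and the column labels are made up for illustration; with real data you would read the manorm output file instead of the in-memory string.

```python
import io
import pandas as pd

# Two made-up rows standing in for a five-column, tab-delimited peak file.
raw = "chr4\t100\t400\tpeak\t2.3\nchr4\t900\t1300\tpeak\t1.4\n"
cols = ["chrom", "start", "end", "name", "score"]
bed5 = pd.read_csv(io.StringIO(raw), sep="\t", header=None, names=cols)

# Append the 1-based row number to each name, mirroring awk's NR.
bed5["name"] = [f"{name}_{i}" for i, name in enumerate(bed5["name"], start=1)]

# Add a placeholder strand as the sixth column.
bed5["strand"] = "+"

bed6_text = bed5.to_csv(sep="\t", header=False, index=False)
print(bed6_text)
```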

Now let's configure homer to recognize our genome build. We aligned our reads to hg38, so we'll have homer use that.

In [None]:
!perl /opt/conda/share/homer/configureHomer.pl -install hg38 2> Tutorial3/DiffPeaks/homer_log1.txt
print("done")

Let's use that reformatted peak file to get nearby genes and perform gene ontology analysis.

In [None]:
!annotatePeaks.pl Tutorial3/GenomeAnnotation/CTL_specific_peaks.bed hg38 -go Tutorial3/GenomeAnnotation/CTL_GO -annStats Tutorial3/GenomeAnnotation/CTL_annStats.txt > Tutorial3/GenomeAnnotation/CTL_specific_Annotated.txt

Let's look at the output files. First, let's look at our annotation statistics.

In [None]:
# Clean up duplicate entries.
!sort -u Tutorial3/GenomeAnnotation/CTL_annStats.txt | grep -v Annotation > Tutorial3/GenomeAnnotation/CTL_annStats_clean.txt

# Load results into a pandas table.
annstats = pd.read_csv("Tutorial3/GenomeAnnotation/CTL_annStats_clean.txt", sep='\t', header=None, names=['annotation','peakcount','size','foldenrichment','log10significance'])

# View entries sorted by enrichment.
annstats_sorted = annstats.sort_values(by=["foldenrichment"], ascending=False)
display(annstats_sorted)

From this we can see the highest enrichment in 5' UTRs and promoters.
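
One way to think about an enrichment value of this kind is as a comparison between the number of peaks observed in an annotation and the number expected from the annotation's size alone. The sketch below illustrates that idea with hypothetical numbers and a uniform-background assumption; homer's exact calculation may differ.

```python
import math

# Hypothetical toy numbers: peaks per annotation and annotation size in bp.
annotations = {
    "promoter": {"peaks": 30, "size_bp": 1_000_000},
    "intron": {"peaks": 50, "size_bp": 40_000_000},
    "intergenic": {"peaks": 20, "size_bp": 59_000_000},
}

total_peaks = sum(a["peaks"] for a in annotations.values())
total_bp = sum(a["size_bp"] for a in annotations.values())

log2_ratio = {}
for name, a in annotations.items():
    # Expected peaks if peaks landed uniformly across the annotated bases.
    expected = total_peaks * a["size_bp"] / total_bp
    log2_ratio[name] = math.log2(a["peaks"] / expected)

# Print annotations from most enriched to most depleted.
for name, value in sorted(log2_ratio.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}\t{value:.2f}")
```

Small annotations such as promoters can show strong enrichment even when they hold fewer peaks than large annotations such as introns.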

Let's plot the results as a bar plot.

In [None]:
annstats_sorted.plot.bar(x="annotation", y="foldenrichment")

Homer also outputs the nearest annotation for each peak. Let's look at the first few lines of our annotation file.

In [None]:
!head -4 Tutorial3/GenomeAnnotation/CTL_specific_Annotated.txt

Lastly, let's take a look at the gene ontology results.

In [None]:
# List the files in our GO directory.
!ls Tutorial3/GenomeAnnotation/CTL_GO/

Let's view the top terms in the biological_process category.

In [None]:

bp_GO = pd.read_csv("Tutorial3/GenomeAnnotation/CTL_GO/biological_process.txt", sep='\t')

# Keep most significant.
bp_GO_top10 = bp_GO.nsmallest(10, "logP")
display(bp_GO_top10)
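
A common way to score the overlap between a gene list and an ontology term is a hypergeometric test. The sketch below illustrates that general idea only; it is not homer's own code, and the function name and all numbers are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(total_genes, term_genes, target_genes, overlap):
    """P(X >= overlap) when drawing target_genes from total_genes without replacement."""
    denom = comb(total_genes, target_genes)
    upper = min(term_genes, target_genes)
    tail = sum(
        comb(term_genes, k) * comb(total_genes - term_genes, target_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denom

# 1000 genes in total, 50 in the term, 20 near our peaks, 5 of those in the term.
p = hypergeom_upper_tail(total_genes=1000, term_genes=50, target_genes=20, overlap=5)
print(f"{p:.3g}")
```

Here only one gene would be expected to overlap by chance (20 × 50 / 1000), so observing five gives a small p-value.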

We can also plot the enrichment scores.

Note that our results may look a little odd because we have severely downsampled the data to run quickly and focus on a single region of chr4. 

In [None]:
bp_GO_top10.plot.bar(x="Term", y="Enrichment")

Homer also saves an HTML file where you can navigate through the various categories.

In [None]:
# View the html results.
IFrame(src='Tutorial3/GenomeAnnotation/CTL_GO/geneOntology.html', width=900, height=600)

In the above HTML you can click through the different ontology categories to view enriched terms and scores for genes near our differential peaks. Note that there are links to motifs, but these lead to "pages not found" because we have yet to do this analysis. We will run motif analysis in the next section using TOBIAS.

<div class="alert-info" style="font-size:200%">
Motif Footprinting
</div>

### ATAC-seq can be used to identify accessibility at transcription factor (TF) binding sites. We'll use [tobias](https://anaconda.org/bioconda/tobias).

<img src="Tutorial3/LessonImages/TobiasFigure.jpg" alt="Drawing" style="width: 800px;"/>

From: [Bentsen et al., Nat. Comm. 2020](https://www.nature.com/articles/s41467-020-18035-1)

Tn5 insertion during ATAC-seq has a sequence bias. In our first step, let's correct for that bias.

In [None]:
# Index the BAM files.
!samtools index Tutorial3/InputFiles/CTL_dedup.bam
!samtools index Tutorial3/InputFiles/Mutant_dedup.bam
# Tn5 has an insertion sequence bias, which TOBIAS can correct for.
# We'll use the master list of peaks provided by manorm, but first we need to remove the header and extra columns.
!cut -f 1-3 Tutorial3/DiffPeaks/CTL_vs_Mutant_all_MAvalues.xls | grep -v start > Tutorial3/MotifFootprinting/MasterPeakList.bed

# Now let's do the signal correction.
!TOBIAS ATACorrect --bam Tutorial3/InputFiles/CTL_dedup.bam --genome Tutorial3/InputFiles/chr4.fa --peaks Tutorial3/MotifFootprinting/MasterPeakList.bed --outdir Tutorial3/MotifFootprinting --prefix CTL --cores $numthreadsint --verbosity 1
# Let's also do this for the mutant.
!TOBIAS ATACorrect --bam Tutorial3/InputFiles/Mutant_dedup.bam --genome Tutorial3/InputFiles/chr4.fa --peaks Tutorial3/MotifFootprinting/MasterPeakList.bed --outdir Tutorial3/MotifFootprinting --prefix Mutant --cores $numthreadsint --verbosity 1

print("done")

Now let's use the bias-corrected bigwig files to calculate footprint scores around peaks.

In [None]:
!TOBIAS ScoreBigwig -s Tutorial3/MotifFootprinting/CTL_corrected.bw -r Tutorial3/MotifFootprinting/MasterPeakList.bed -o Tutorial3/MotifFootprinting/CTL_footprintscores.bw --cores $numthreadsint --verbosity 1

# Let's do the same for our mutant sample.
!TOBIAS ScoreBigwig -s Tutorial3/MotifFootprinting/Mutant_corrected.bw -r Tutorial3/MotifFootprinting/MasterPeakList.bed -o Tutorial3/MotifFootprinting/Mutant_footprintscores.bw --cores $numthreadsint --verbosity 1
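
The intuition behind a footprint score is that a bound factor protects its site from Tn5, leaving a dip in signal relative to the flanking sequence. The following is a deliberately simplified sketch of that intuition with made-up values; it is not the TOBIAS scoring algorithm, and `footprint_depth` is a hypothetical helper.

```python
from statistics import mean

# Toy bias-corrected cut-site signal across one site (hypothetical values).
signal = [5, 6, 7, 6, 1, 0, 1, 1, 6, 7, 5, 6]

def footprint_depth(values, center_start, center_end):
    """Mean flank signal minus mean signal inside [center_start, center_end)."""
    center = values[center_start:center_end]
    flanks = values[:center_start] + values[center_end:]
    return mean(flanks) - mean(center)

# Positions 4 to 7 form the protected center in this toy example.
score = footprint_depth(signal, 4, 8)
print(score)
```

A larger positive value indicates a deeper dip at the center relative to its flanks.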

Now that we have our corrected signal and footprint scores, let's do TF binding site prediction as well as differential footprinting.

Caution: this step scans the signal at every location that matches a motif in your JASPAR file. Here we use all the motifs in the JASPAR CORE vertebrates non-redundant set, so this can take several minutes.

In [None]:
# First, we'll download the current jaspar motifs.
!wget https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt -P Tutorial3/MotifFootprinting/

# Pin adjusttext to version 0.7.3; otherwise the next command will throw an error.
!pip install adjusttext==0.7.3

# Next we can calculate statistics for each motif represented in our jaspar motif file. If we list both our CTL and Mutant sample, it will calculate the differential footprint score for us as well.
!TOBIAS BINDetect --motifs Tutorial3/MotifFootprinting/JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt --signals Tutorial3/MotifFootprinting/CTL_footprintscores.bw Tutorial3/MotifFootprinting/Mutant_footprintscores.bw --genome Tutorial3/InputFiles/chr4.fa --peaks Tutorial3/MotifFootprinting/MasterPeakList.bed --outdir Tutorial3/MotifFootprinting/DiffMotifs --cond_names CTL Mutant --cores $numthreadsint --verbosity 1

print("done")

In [None]:
# View the HTML results.
IFrame(src='Tutorial3/MotifFootprinting/DiffMotifs/bindetect_CTL_Mutant.html', width=900, height=600)

In the above HTML file you can hover over each point to see the motif name and the sequence. This type of plot is a volcano plot showing the differential signal on the x-axis and the significance values on the y-axis.
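
To show how the two axes of a volcano plot relate to a results table, here is a minimal sketch. The motif names, column names, values, and cutoffs are all made up for illustration and do not come from the BINDetect output.

```python
import math
import pandas as pd

# Toy results; motif names, column names, and values are made up for illustration.
results = pd.DataFrame({
    "motif": ["TF_A", "TF_B", "TF_C"],
    "change": [0.45, -0.02, -0.38],
    "pvalue": [1e-20, 0.6, 1e-12],
})

# The y-axis of a volcano plot is usually -log10 of the p-value.
results["neg_log10_p"] = results["pvalue"].apply(lambda p: -math.log10(p))

# Flag points that are extreme on both axes (cutoffs here are arbitrary).
results["flagged"] = (results["change"].abs() > 0.2) & (results["pvalue"] < 0.05)
print(results[["motif", "change", "neg_log10_p", "flagged"]])
```

Flagged motifs sit in the upper left and upper right corners of the plot, far from the cloud of unchanged motifs near the origin.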

For example, the original paper focused on TP63, which is one of the differential points in our HTML file.

<img src="Tutorial3/LessonImages/TP63_volcano.jpg" alt="Drawing" style="width: 500px;"/>

Let's visualize the average footprint at TP63 motifs.

In [None]:
# Plot the aggregate footprint at TP63 motif sites, comparing the corrected CTL and Mutant signals.
!TOBIAS PlotAggregate --TFBS Tutorial3/MotifFootprinting/DiffMotifs/TP63_MA0525.2/beds/TP63_MA0525.2_all.bed --signals Tutorial3/MotifFootprinting/CTL_corrected.bw Tutorial3/MotifFootprinting/Mutant_corrected.bw --output Tutorial3/MotifFootprinting/TP63_footprint_compare.png --share_y both --verbosity 1 --plot_boundaries --flank 60 --smooth 2 --signal-on-x

In [None]:
IFrame(src='Tutorial3/MotifFootprinting/TP63_footprint_compare.png', width=600, height=400) 

We can also get all the motifs that have differential footprints:

In [None]:
# Load the BINDetect results (Tutorial3/MotifFootprinting/DiffMotifs/bindetect_results.txt) as a pandas table.
dframe = pd.read_csv("Tutorial3/MotifFootprinting/DiffMotifs/bindetect_results.txt", sep='\t')
display(dframe)
# Keep motifs with a differential footprint p-value below 0.05.
DiffMotifs = dframe[dframe['CTL_Mutant_pvalue'] < .05]
# Write out to a tab-separated file.
DiffMotifs.to_csv('Tutorial3/MotifFootprinting/DiffMotifs_p05.txt', sep='\t', index=False)

<div class="alert-success" style="font-size:200%">
Great job! 
</div>
Thank you for completing these tutorials. Feel free to download these notebooks, customize, and use them to process your own data. Please see Tutorial 4 for Single-Cell ATAC-seq analysis.

<div class="alert alert-block alert-danger">
    <b>&#128721; Caution:</b> Remember to shut down your VM after you are finished with your work in order to avoid incurring additional charges.
</div>