# ATAC-seq Module 1: Preprocessing and Quality Control

<img src="../images/Tutorial1/LessonImages/ATACseqWorkflowLesson1.jpg" alt="Drawing" style="width: 1000px;"/>

## Overview
ATAC-seq generates genome-wide chromatin accessibility profiles. Tn5 transposase integrates DNA sequencing adapters into accessible chromatin, producing short fragments, whereas DNA bound by nucleosomes and transcription factors (TFs) is protected from insertion. By analyzing these data, we can uncover differential accessibility and identify TF footprints.

<img src="../images/Tutorial1/LessonImages/MethodAnimation.gif" alt="Drawing" style="width: 500px;"/>

This short tutorial demonstrates the initial processing steps for ATAC-seq analysis. In this module we focus on generating quality reports of the fastq files, adapter trimming, mapping, and removal of PCR duplicates.

In this tutorial we will process a randomly chosen published dataset, which is available from GEO under accession GSE67382:
Bao X, Rubin AJ, Qu K, Zhang J et al. A novel ATAC-seq approach reveals lineage-specific reinforcement of the open chromatin landscape via cooperation between BAF and p63. Genome Biol 2015 Dec 18;16:284. PMID: [26683334](https://pubmed.ncbi.nlm.nih.gov/26683334/)

This dataset is paired-end 50 bp sequencing. We will analyze two samples: NHEK cells with BAF depletion and a control. Note that, to allow faster processing, we have limited the reads to those from chromosome 4.

## Learning Objectives

- **Setting up the computational environment:** Learners will learn how to install necessary bioinformatics tools (FastQC, MultiQC, Trimmomatic, Bowtie2, Samtools, Picard).

- **Understanding ATAC-seq data formats:** Learners will become familiar with FASTQ files and their structure.

- **Performing quality control (QC):** Learners will perform quality control on raw reads using FastQC and MultiQC to assess sequencing quality.  They'll interpret the QC reports to identify potential issues like adapter contamination and base quality.

- **Trimming reads:** Learners will learn to use Trimmomatic to trim adapter sequences and low-quality bases from the reads, improving the quality of downstream analysis.  They will understand why trimming is crucial for ATAC-seq.

- **Mapping reads to a reference genome:** Learners will map reads to a reference genome using Bowtie2, understanding the importance of a reference genome and the process of generating Bowtie2 indexes (although the indexes are pre-made in this example).  They'll also convert SAM files to BAM files using samtools.

- **Removing PCR duplicates:** Learners will use Picard to remove PCR duplicates from the mapped reads, understanding the impact of PCR duplicates on ATAC-seq analysis.

- **Interpreting results:** Learners will interpret the results of each step, assessing the quality of the processed data and understanding the impact of each preprocessing step.

<div class="alert alert-block alert-success">
    <i class="fa fa-hand-paper-o" aria-hidden="true"></i>
    <b>Tip: </b>  If you're having trouble with any part of this tutorial, feel free to leverage AWS Bedrock (Amazon's advanced generative AI tool) at the bottom of this module.
</div>  

## Prerequisites
+ **Software:**

    * **Jupyter Notebook:**  Needed to run the notebook itself.
    * **Mamba:** Used for package management.  The notebook uses mamba, a faster alternative to conda.
    * **Bioconda Channel:**  The notebook installs several packages from the Bioconda channel, a channel specifically for bioinformatics tools.
    * **Specific Python Packages:**
        * `jupyterquiz`: For interactive quizzes within the notebook.
        * `jupytercards`: For flashcards.
        * `pandas`: For data manipulation and analysis.
        * `IPython`: For enhanced interactive computing.
    * **Bioinformatics Tools:**
        * `fastqc`: For assessing the quality of raw sequencing reads.
        * `multiqc`: For aggregating and summarizing quality control reports from multiple tools.
        * `trimmomatic`: For trimming adapters and low-quality bases from sequencing reads.
        * `bowtie2`: For aligning reads to a reference genome.
        * `samtools`: For manipulating and converting sequence alignment/map (SAM/BAM) files.
        * `picard`: For various bioinformatics tasks, particularly duplicate marking and removal in this case.

+ **APIs:**

    * **Amazon S3 API:**  The notebook uses `aws s3` to interact with an Amazon S3 bucket to copy example data files (`aws s3 cp`). This requires that the appropriate AWS services are enabled in your AWS account.

## Get Started
### Required Files
In this stage of the module you will use the fastq files that have been prepared. In Step 1 we will copy these files over to your instance. You can also use this module on your own data or any published ATAC-seq dataset. 

<div class="alert-info" style="font-size:200%">
STEP 1: Set Up Environment
</div>

In this step we will use mamba to install the following packages:

Quality Reporting:
[fastqc](https://anaconda.org/bioconda/fastqc), [multiqc](https://anaconda.org/bioconda/multiqc)

Read Trimming: 
[trimmomatic](https://anaconda.org/bioconda/trimmomatic)

Mapping:
[bowtie2](https://anaconda.org/bioconda/bowtie2)

Deduplication:
[samtools](https://anaconda.org/bioconda/samtools), [picard](https://anaconda.org/bioconda/picard)

In [None]:
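# Count the available CPUs and leave one free for the notebook, but always keep at least one thread.
numthreads=! lscpu | grep '^CPU(s)'| awk '{print $2-1}'
numthreadsint = max(1, int(numthreads[0]))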
! mamba install -c bioconda fastqc bowtie2 picard multiqc samtools trimmomatic -y
! pip install jupyterquiz jupytercards

In [None]:
from jupyterquiz import display_quiz
from IPython.display import IFrame
from IPython.display import display
from jupytercards import display_flashcards
import pandas as pd

## Set Up File System
Now let's create some folders to stay organized and copy over our prepared fastq files. We're going to create a directory called "Tutorial1", which we'll use for this module. We'll then create sub-folders for our input files and for the files that we'll be creating during this module. We'll also copy over the fasta file for chromosome 4 as well as some bowtie2 index files (don't worry, we'll teach you how to create these index files).

In [None]:
# These commands create our directory structure.
! mkdir -p Tutorial1/InputFiles
! mkdir -p Tutorial1/QC
! mkdir -p Tutorial1/Trimmed
! mkdir -p Tutorial1/Mapped
! mkdir -p Tutorial1/RefGenome

# This variable stores the location of the Amazon S3 bucket where the example files are held.
original_bucket = "s3://nigms-sandbox/unmc_atac_data_examples/Tutorial1"

# These commands copy our example files from the Amazon S3 bucket to the Tutorial1/InputFiles and Tutorial1/RefGenome folders that we created above.
! aws s3 cp --recursive $original_bucket/InputFiles/ Tutorial1/InputFiles/
! aws s3 cp --recursive $original_bucket/RefGenome/ Tutorial1/RefGenome/


### OK
Let's make sure that the files copied correctly. You should see four files after running the following command:

In [None]:
! ls Tutorial1/InputFiles

<div class="alert-info" style="font-size:200%">
STEP 2: QC
</div>

Sequences are typically provided as files in fastq format. This format includes four lines per sequence.


In [None]:
display_flashcards('../quiz_files/FastqFlashCard.json')

### Click on the above image to see what each line represents.
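
As a rough illustration of the four-line layout, the sketch below parses one made-up FASTQ record in plain Python. The record contents and the helper name `parse_fastq_record` are hypothetical, and the quality decoding assumes the Phred+33 encoding (quality score = ASCII code minus 33).

```python
# Hypothetical FASTQ record for illustration (not taken from the tutorial data).
record = """@read1 example
GATTACAGATTACA
+
IIIIIIIIII##!!
"""

def parse_fastq_record(text):
    """Split one four-line FASTQ record into name, sequence, and quality scores."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumed Phred+33 encoding: one quality character per base.
    scores = [ord(ch) - 33 for ch in quality]
    return {"name": header[1:], "sequence": sequence, "scores": scores}

parsed = parse_fastq_record(record)
print(parsed["name"], parsed["sequence"], parsed["scores"])
```

Each base gets exactly one quality character, which is why the sequence and quality lines must have the same length.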

Next, let's take a look at the sequence quality of the raw reads using fastqc:

In [None]:
# This command runs fastqc on each fastq.gz file inside our InputFiles directory and stores the output reports in our QC directory.
! fastqc -t $numthreadsint -q -o Tutorial1/QC Tutorial1/InputFiles/*fastq.gz

# We then use multiqc to summarize the report.
! multiqc -o Tutorial1/QC -f Tutorial1/QC 2> Tutorial1/QC/multiqc_log.txt

# We'll load this into a pandas table to work in this context, but fastqc also produces an html report that you can browse.
dframe = pd.read_csv("Tutorial1/QC/multiqc_data/multiqc_fastqc.txt", sep='\t')
display(dframe)

Alternatively, we can view the fastqc HTML files:

In [None]:
# We can display the resulting fastqc results.
IFrame(src='Tutorial1/QC/CTL_R1_fastqc.html', width=1080, height=800)

Look at the "Per base sequence content" section in the above FastQC report. We'll trim the reads to remove some of this effect. For now, think about possible explanations for this result.

Also look at the "Sequence Duplication Levels". Sometimes duplicates appear due to the PCR amplification step of library preparation. We'll remove duplicates in a later step. 

Lastly, look at the "Overrepresented sequences" section of the report. What are some possible explanations for this result?

<div class="alert-info" style="font-size:120%">
Trimming
</div>
Next let's trim our sequences.

Why is it particularly important to trim the reads in ATAC-seq? To understand let's review how ATAC-seq works. Tn5 inserts adapter sequences into accessible regions. 

<img src="../images/Tutorial1/LessonImages/adapterinsert.jpg" alt="Drawing" style="width: 400px;"/>

Image source: [Grandi et al., Nature Protocols 2022](https://www.nature.com/articles/s41596-022-00692-9)


What would happen if the distance between inserted sites is short? For example our sequencing length in the example dataset is 50 bp, so what would the sequence look like if our fragment (insert size) is only 30 bp long?  

<div class="alert-info" style="font-size:200%">
Interactive Quiz Question 1: Click on the correct answer in the following cell.
</div>

In [None]:
display_quiz('../quiz_files/adapterQuiz.json')

## Let's use trimmomatic to prepare the sequences before mapping.

In [None]:
# This trims the Nextera adapters used in ATAC-seq library preparation (ILLUMINACLIP) and removes low-quality or N bases from the read ends (LEADING:3 TRAILING:3), placing the trimmed reads in our Trimmed folder.
! trimmomatic PE -threads $numthreadsint Tutorial1/InputFiles/CTL_R1.fastq.gz Tutorial1/InputFiles/CTL_R2.fastq.gz Tutorial1/Trimmed/CTLtrimmed_R1.fastq.gz Tutorial1/Trimmed/CTLunpaired_R1.fastq.gz Tutorial1/Trimmed/CTLtrimmed_R2.fastq.gz Tutorial1/Trimmed/CTLunpaired_R2.fastq.gz ILLUMINACLIP:Tutorial1/RefGenome/NexteraPE.fa:2:30:10 LEADING:3 TRAILING:3

## Let's do this for the other sample as well.

In [None]:
# As above, this trims the Nextera adapters (ILLUMINACLIP) and removes low-quality or N bases from the read ends (LEADING:3 TRAILING:3), placing the trimmed reads in our Trimmed folder.
! trimmomatic PE -threads $numthreadsint Tutorial1/InputFiles/Mutant_R1.fastq.gz Tutorial1/InputFiles/Mutant_R2.fastq.gz Tutorial1/Trimmed/Mutanttrimmed_R1.fastq.gz Tutorial1/Trimmed/Mutantunpaired_R1.fastq.gz Tutorial1/Trimmed/Mutanttrimmed_R2.fastq.gz Tutorial1/Trimmed/Mutantunpaired_R2.fastq.gz ILLUMINACLIP:Tutorial1/RefGenome/NexteraPE.fa:2:30:10 LEADING:3 TRAILING:3

## Now let's summarize the trimming results.

In [None]:
! fastqc -t $numthreadsint -q -o Tutorial1/Trimmed Tutorial1/Trimmed/*fastq.gz
! multiqc -o Tutorial1/QC -f Tutorial1/Trimmed 2> Tutorial1/QC/multiqc_log.txt

dframe = pd.read_csv("Tutorial1/QC/multiqc_data/multiqc_general_stats.txt", sep='\t')
display(dframe)


Trimming can be particularly important for ATAC-seq to remove adapter sequences from small-sized fragments. However, keep in mind that after trimming, small fragments are sometimes too short to be used. You'll notice in the trimmomatic output that there were a few reads that were dropped. 
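
To make the adapter read-through concrete, here is a minimal sketch using made-up sequences. The adapter string is a lowercase placeholder (not the real Nextera sequence), the helper names are hypothetical, and the exact-match trimming is a simplification: Trimmomatic's own matching is more tolerant of mismatches.

```python
READ_LENGTH = 50  # read length of the example dataset

# Made-up sequences: fragments are uppercase "genomic" DNA, ADAPTER is a lowercase
# placeholder (not the real Nextera sequence) so the two are easy to tell apart.
ADAPTER = "tcagtcagtcagtcagtcagtcag"
SHORT_FRAGMENT = "GATTACA" * 4 + "GA"  # 30 bp insert
LONG_FRAGMENT = "GATTACA" * 12         # 84 bp insert

def sequence_read(fragment, adapter=ADAPTER, read_length=READ_LENGTH):
    """Return what the sequencer reports: the fragment, then adapter if the fragment is short."""
    return (fragment + adapter)[:read_length]

def trim_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Naive exact-match trimming: cut the read where the adapter (or its prefix) begins."""
    for start in range(len(read)):
        candidate = read[start:start + len(adapter)]
        if len(candidate) >= min_overlap and adapter.startswith(candidate):
            return read[:start]
    return read

short_read = sequence_read(SHORT_FRAGMENT)
long_read = sequence_read(LONG_FRAGMENT)
print(short_read, "->", trim_adapter(short_read))
print(long_read, "->", trim_adapter(long_read))
```

In this sketch the read from the 30 bp insert carries 20 bases of adapter that would not align to the genome, while the read from the longer insert is left untouched.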

We can examine the reads further to see the impact of trimming by summarizing the size distribution.

In [None]:
# Example code to plot read sizes: take each sequence line, count the reads of each length, and save "length,log(count)" (natural log).
! gunzip -c Tutorial1/Trimmed/CTLtrimmed_R1.fastq.gz | sed -n '2~4p' | awk '{print length($1)}' | sort -k 1bn,1b | uniq -c | awk '{print $2","log($1)}' > Tutorial1/Trimmed/readlengths
dframe = pd.read_csv("Tutorial1/Trimmed/readlengths", sep=',', header=None, names=['read length', 'log counts of read length'], index_col='read length')

dframe.plot.area()


In our example dataset, we used 50 bp paired-end sequencing, and trimming did not impact the majority of reads. With your own datasets you may see more reads that are trimmed, depending on the sequencing length and the sizes of the fragments that were obtained. In the next steps we will introduce mapping the reads. Keep in mind that reads that are too short may not map to the genome. If the majority of reads lie in the 10-20 bp range after trimming, this could indicate a problem with the library.
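
One way to put a number on this check is to compute the fraction of reads at or below a length cutoff. The sketch below is only illustrative: the helper name `short_read_fraction`, the 20 bp cutoff (taken from the 10-20 bp range mentioned above), and the example lengths are all assumptions rather than values from the tutorial data.

```python
from collections import Counter

def short_read_fraction(lengths, cutoff=20):
    """Fraction of reads whose length is at or below the cutoff."""
    counts = Counter(lengths)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    short = sum(n for length, n in counts.items() if length <= cutoff)
    return short / total

# Hypothetical post-trimming read lengths, for illustration only.
example_lengths = [50] * 90 + [35] * 6 + [18] * 3 + [12] * 1
fraction = short_read_fraction(example_lengths)
print(f"{fraction:.1%} of reads are 20 bp or shorter")
```

A small fraction, as in this made-up example, is unremarkable; a fraction above one half would match the warning sign described above.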

<div class="alert-info" style="font-size:200%">
STEP 3: Mapping
</div>
Our fastq files include sequences and quality scores for each base, but we want to figure out which genomic location these sequences came from. To do this we will map each sequence to a reference genome using bowtie2. 
 

Mapping reads requires a reference genome. Due to time and memory considerations, we prepared that file for you in this tutorial and will only map to chr4. In a full analysis, however, we would map to the entire genome. To do so you would need a fasta file corresponding to the reference genome (e.g. [hg38.fa](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/)), from which you'd create an index of the genome using bowtie2-build. This can be done with the command:

`bowtie2-build reference_genome_file.fa outputprefix`

As mentioned, we've gone ahead and created the index for you, and you copied its files into the RefGenome directory earlier. These index files end in the bt2 extension.

In [None]:
! ls Tutorial1/RefGenome/*bt2

These index files were created from our fasta file:

In [None]:
! ls Tutorial1/RefGenome/*fa

Notice that the single fasta file created multiple index files. When we align we'll specify the prefix of the index files.
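
If an alignment fails because the index cannot be found, a quick check like the sketch below can help. It assumes the six file suffixes that bowtie2-build writes for a standard (small) index and the `hg38chr4` prefix used in this tutorial; the helper name `missing_index_files` is hypothetical.

```python
from pathlib import Path

# Suffixes written by bowtie2-build for a standard (small) index.
BT2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]

def missing_index_files(prefix):
    """Return the expected index files that are absent for the given index prefix."""
    return [prefix + suffix for suffix in BT2_SUFFIXES
            if not Path(prefix + suffix).exists()]

missing = missing_index_files("Tutorial1/RefGenome/hg38chr4")
print("index complete" if not missing else f"missing: {missing}")
```

The prefix passed here is the same string that is given to bowtie2 with the -x option.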

In [None]:
# Notes: The -p option sets the number of threads. The -x option specifies the prefix of the index. -1 specifies the file of trimmed first-mate (R1) reads. -2 specifies the file of trimmed second-mate (R2) reads. -S specifies our output file in sam format.
! bowtie2 -p $numthreadsint -x Tutorial1/RefGenome/hg38chr4 -1 Tutorial1/Trimmed/CTLtrimmed_R1.fastq.gz -2 Tutorial1/Trimmed/CTLtrimmed_R2.fastq.gz -S Tutorial1/Mapped/CTL.sam


In [None]:
# Let's do the same thing for our other sample.
! bowtie2 -p $numthreadsint -x Tutorial1/RefGenome/hg38chr4 -1 Tutorial1/Trimmed/Mutanttrimmed_R1.fastq.gz -2 Tutorial1/Trimmed/Mutanttrimmed_R2.fastq.gz -S Tutorial1/Mapped/Mutant.sam

### Answer the following question only if you are using the example dataset we provided. This question is simply a check to ensure everything was processed correctly.

In [None]:
display_flashcards('../quiz_files/alignment.json')

## Bowtie2 outputs a file in [sam format](https://samtools.github.io/hts-specs/SAMv1.pdf), which contains the original sequence, quality scores, and the genomic coordinates matching each read. 
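
To give a feel for the format, the sketch below splits one made-up alignment line into the 11 mandatory SAM fields. The example line and the helper name `parse_sam_line` are hypothetical; header lines (which start with "@") are not handled here.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields of one SAM alignment line."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Anything after the 11th column is an optional TAG:TYPE:VALUE field.
    record["tags"] = values[11:]
    return record

# Made-up alignment line for illustration only.
example = "read1\t99\tchr4\t1000\t42\t10M\t=\t1100\t110\tGATTACAGAT\tIIIIIIIIII\tAS:i:0"
aln = parse_sam_line(example)
print(aln["RNAME"], aln["POS"], aln["MAPQ"])
```

The RNAME and POS fields hold the genomic coordinates of the read, and MAPQ holds the mapping quality that is used for filtering later in this step.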

In the next commands we'll convert the file to the more compressed [bam format](https://genome.ucsc.edu/goldenPath/help/bam.html) and sort the reads by chromosomal coordinates.

In [None]:
# This converts the sam file to bam using samtools view with the -b option. The -h option keeps the header in the output, and -S (sam input) is accepted but ignored by current samtools versions, which detect the input format automatically. We pipe the result to samtools sort. Pay attention to the "-" at the end of the sort command, which tells samtools to read from stdin.
! samtools view -q 10 -bhS Tutorial1/Mapped/CTL.sam | samtools sort -o Tutorial1/Mapped/CTL.bam - 
print("done")

In [None]:
# Let's do the same thing for our Mutant sample.
! samtools view -q 10 -bhS Tutorial1/Mapped/Mutant.sam | samtools sort -o Tutorial1/Mapped/Mutant.bam - 
print("done")

You may have noticed the parameters -bhS and -q 10 in the above commands. Briefly, -b tells samtools to write bam output and -h tells it to include the header in that output. The S option once indicated sam input; current samtools versions detect the input format automatically and ignore it. We also specified -q 10, which removes reads with a mapping quality (MAPQ) below 10.
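
For intuition about the threshold, the SAM specification defines MAPQ as -10 * log10 of the probability that the mapping position is wrong. The sketch below applies that definition; the function names are hypothetical, and aligners estimate this probability in their own ways, so treat the numbers as a guide rather than exact error rates.

```python
import math

def mapq_to_error_probability(mapq):
    """Probability that the mapping position is wrong, per the SAM definition of MAPQ."""
    return 10 ** (-mapq / 10)

def error_probability_to_mapq(probability):
    """Inverse conversion, rounded to the nearest integer as MAPQ is stored."""
    return round(-10 * math.log10(probability))

for mapq in (0, 10, 20, 30):
    print(mapq, mapq_to_error_probability(mapq))
```

Under this definition, raising the -q value keeps only reads whose placement is more certain, at the cost of discarding more reads.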

<div class="alert-info" style="font-size:200%">
Interactive Quiz Question 2: Click on the correct answer in the following cell.
</div>

In [None]:
display_quiz("../quiz_files/mappingquality.json")

<div class="alert-info" style="font-size:200%">
STEP 4: Removal of Duplicates
</div>
It's important to remove duplicates from our reads because part of the ATAC-seq method includes a PCR step for library amplification. This can create biases in the data resulting from PCR duplicates. To understand how PCR duplicates can affect the analysis, let's jump ahead a bit. Accessible sites are represented by ATAC-seq "peaks" of signal.

<img src="../images/Tutorial1/LessonImages/PeaksExample.jpg" alt="Drawing" style="width: 200px;"/>

<div class="alert-info" style="font-size:200%">
Interactive Quiz Question 3: Click on the correct answer in the following cell.
</div>

In [None]:
display_quiz("../quiz_files/duplicateQuiz.json")

Okay, let's remove these duplicates using picard.

In [None]:
# This will take the sorted bam file and remove duplicates, saving a new bam file and a summary in a text file.
! picard MarkDuplicates --REMOVE_DUPLICATES TRUE -I Tutorial1/Mapped/CTL.bam -O Tutorial1/Mapped/CTL_dedup.bam --METRICS_FILE Tutorial1/Mapped/CTL_dedup_metrics.txt --QUIET 2> Tutorial1/Mapped/PicardLog.txt
print("done")

In [None]:
# We also should do this for the other sample.
! picard MarkDuplicates --REMOVE_DUPLICATES TRUE -I Tutorial1/Mapped/Mutant.bam -O Tutorial1/Mapped/Mutant_dedup.bam --METRICS_FILE Tutorial1/Mapped/Mutant_dedup_metrics.txt --QUIET 2> Tutorial1/Mapped/PicardLog2.txt
print("done")

In [None]:
# We can use multiqc to summarize the metrics.
! multiqc -o Tutorial1/QC -f Tutorial1/Mapped 2> Tutorial1/Mapped/multiqc_log.txt
dframe = pd.read_csv("Tutorial1/QC/multiqc_data/multiqc_general_stats.txt", sep='\t')
display(dframe)
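
If you would rather read the duplication rate directly from a Picard metrics file, a small parser along the following lines may help. This is a sketch that assumes the usual layout (a "## METRICS CLASS" line followed by a tab-separated header row and data row); the example text is made up and abbreviated, and the helper name `read_picard_metrics` is hypothetical.

```python
def read_picard_metrics(text):
    """Return the first metrics row of a Picard metrics file as a dict of strings."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no METRICS CLASS section found")

# Made-up, abbreviated metrics text for illustration (real files have more columns).
example_metrics = "\n".join([
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "LIBRARY\tREAD_PAIRS_EXAMINED\tREAD_PAIR_DUPLICATES\tPERCENT_DUPLICATION",
    "Unknown Library\t1000\t150\t0.15",
    "",
])

metrics = read_picard_metrics(example_metrics)
print(metrics["LIBRARY"], float(metrics["PERCENT_DUPLICATION"]))
```

To use it on the tutorial output, pass in the contents of a metrics file, for example the text read from `Tutorial1/Mapped/CTL_dedup_metrics.txt`.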

## Conclusion

This Jupyter Notebook detailed the initial preprocessing and quality control steps crucial for accurate ATAC-seq data analysis.  We utilized a published dataset (GSE67382) to demonstrate the workflow, focusing on quality reporting with FastQC and MultiQC, adapter trimming with Trimmomatic, read mapping using Bowtie2, and PCR duplicate removal with Picard.  The notebook incorporated interactive elements, such as quizzes and flashcards, to reinforce key concepts related to ATAC-seq methodology and data interpretation.  The resulting processed BAM files, free of adapter sequences and PCR duplicates, are now ready for downstream analysis, such as peak calling and differential accessibility testing, as detailed in the subsequent tutorial.  This foundational preprocessing ensures the reliability and accuracy of subsequent analyses, laying the groundwork for robust biological conclusions.

## Clean Up
We have completed the preprocessing steps and are ready to move on to some downstream analysis. Take a break here or move on to the next tutorial: 

[Visualization and Peak Detection](./ATACseq_Tutorial2_PeakDetection.ipynb). 


<div class="alert alert-block alert-danger">
    <b>&#128721; Caution:</b> Remember to shut down your VM after you are finished with your work in order to avoid incurring additional charges.
</div>

## AWS Bedrock (Optional)
--------

If you're having trouble with this submodule (or others within this tutorial), feel free to leverage Bedrock by running the cell below. Bedrock is a fully managed service that simplifies building and scaling generative AI applications. It provides access to various foundation models (FMs) from Amazon and other AI companies.

Before you can use the chatbot, you must request access to the **Llama 3 8B Instruct** model through AWS Bedrock. To do this, follow the instructions for requesting model access in the [AWS Bedrock Intro Notebook](https://github.com/STRIDES/NIHCloudLabAWS/blob/main/notebooks/GenAI/AWS_Bedrock_Intro.ipynb). After you request access to Llama 3 8B Instruct, approval should take only about a minute. While waiting for model approval, attach the **AmazonBedrockFullAccess** permission to your notebook service role. Once approved, run the following code cell to use the model within the notebook.

In [None]:
# Ensure you have the necessary libraries installed
!pip install -q ipywidgets
import sys
import os
util_path = os.path.join(os.getcwd(), 'util')
if util_path not in sys.path:
    sys.path.append(util_path)

# Import the display_widgets function from your Python file
from genai import display_widgets

# Call the function to display the widgets
display_widgets()