# Module 1: Preprocessing and Quality Control

<img src="images/LessonPlan.jpg" alt="Drawing" style="width: 1000px;"/>

## Overview & Purpose
## ChIP-seq, CUT&RUN, and CUT&Tag
This short tutorial demonstrates the initial processing steps for ChIP-seq, CUT&RUN, and CUT&Tag analysis. In this first module, we focus on generating quality reports of the raw sequences, adapter trimming, mapping, and removal of PCR duplicates. 

<img src="images/all3methods.jpg" alt="Drawing" style="width: 500px;"/>

To demonstrate the process, this tutorial analyzes published datasets that compare a BAF depletion or BAF mutant sample to a control sample. This module covers the processing of data from three distinct but similar methods, using downsampled data to reduce runtime. The original data can be found in the GEO repository under the following accessions: ChIP-seq - p63 occupancy from GSM1645714 & GSM1645715; CUT&RUN - H3K27ac occupancy from GSM4743882 & GSM4743880; CUT&Tag - H3K27ac occupancy from GSM6411302 & GSM6411302. These data were published in the following articles:

Bao X, et al. A novel ATAC-seq approach reveals lineage-specific reinforcement of the open chromatin landscape via cooperation between BAF and p63. Genome Biol 2015 PMID: 26683334

Chang CY, et al. Increased ACTL6A occupancy within mSWI/SNF chromatin remodelers drives human squamous cell carcinoma. Mol Cell 2021 PMID: 34687603

Patil A, et al. A disordered region controls cBAF activity via condensation and partner recruitment. Cell 2023 PMID: 37788668

Note that, to allow faster processing, we have limited the reads to those from a single chromosome (chr4).

While ChIP-seq, CUT&RUN, and CUT&Tag are all used to identify chromatin binding sites genome-wide, they differ in implementation, which can affect the analysis.
<img src="images/methodcompare.png" alt="Drawing" style="width: 1000px;"/>
Image credit: Kaya-Okur et al., 2020. Efficient low-cost chromatin profiling with CUT&Tag. Nature Protocols.

## ChIP-seq
This method uses random fragmentation followed by immunoprecipitation.
<img src="images/chipseq.gif" alt="Drawing" style="width: 500px;"/>

## CUT&RUN
This method uses pA-MNase (a protein A-micrococcal nuclease fusion) to target the fragmentation to the site where your protein is bound.
<img src="images/CUTRUN.gif" alt="Drawing" style="width: 700px;"/>

## CUT&Tag
This method uses pA-Tn5 (a protein A-Tn5 transposase fusion) to directly insert sequencing adapters next to where your protein is bound.
<img src="images/CUTTag.gif" alt="Drawing" style="width: 300px;"/>


### Ways to use this module
Throughout this module, commands are color-coded by method: ChIP-seq, CUT&RUN, or CUT&Tag. You can therefore use this module to learn the processing of each method individually, to compare the methods with one another, or to follow a single color and process only one data type.
Commands for each method are marked by that method's logo placed before the command, as in the following examples:

<img src="images/ChIPseqLogo.jpg" alt="Drawing" style="width: 250px;"/>

In [None]:
#run this cell for ChIP-seq
print("Code for ChIP-seq will be placed after the above image. Run these cells if performing ChIP-seq analysis.")

<img src="images/CUT&RUNLogo.jpg" alt="Drawing" style="width: 250px;"/>

In [None]:
#run this cell for CUT&RUN
print("Code for CUT&RUN will be placed after the above image. Run these cells if performing CUT&RUN analysis.")

<img src="images/CUT&TagLogo.jpg" alt="Drawing" style="width: 250px;"/>

In [None]:
#run this cell for CUT&Tag
print("Code for CUT&Tag will be placed after the above image. Run these cells if performing CUT&Tag analysis.")

<div class="alert alert-block alert-success" style="font-size:100%">
<span style="color:black"> By following the colors/images, you can run one, two, or all three types of analyses.</span>
</div>

### Required Files
In this stage of the module, you will use the fastq files that have been prepared. You can also use this module on your own data or any published ChIP-seq, CUT&RUN, or CUT&Tag dataset. 

<div class="alert-info" style="font-size:200%">
STEP 1: Set Up Environment
</div>

First, configure your cloud environment. In this step we will use conda to install the following packages:

Quality Reporting:
[fastqc](https://anaconda.org/bioconda/fastqc), [multiqc](https://anaconda.org/bioconda/multiqc)

Read Trimming: 
[trimmomatic](https://anaconda.org/bioconda/trimmomatic)

Mapping:
[bowtie2](https://anaconda.org/bioconda/bowtie2)

Deduplication:
[samtools](https://anaconda.org/bioconda/samtools), [picard](https://anaconda.org/bioconda/picard)

In [None]:
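# Count the available CPUs and leave one free for the notebook itself (never fewer than one thread).
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2}'
numthreadsint = max(1, int(numthreads[0]) - 1)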
!conda install -c bioconda fastqc bowtie2 picard multiqc samtools trimmomatic -y
!pip install jupyterquiz==2.0.7 jupytercards
from jupyterquiz import display_quiz
from IPython.display import IFrame
from IPython.display import display
from jupytercards import display_flashcards
import pandas as pd
print("done")

In [None]:
# These commands move into our Tutorial 1 directory and create our subdirectory structure.
%cd Tutorial1/
!mkdir -p InputFiles
!mkdir -p QC
!mkdir -p Trimmed
!mkdir -p Mapped

In [None]:
!ls InputFiles

<div class="alert-info" style="font-size:200%">
STEP 2: QC
</div>

Sequences are typically provided as files in fastq format. This format includes four lines per sequence.
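
To make the four-line layout concrete, here is a minimal sketch of a FASTQ reader in plain Python. The function name `read_fastq` and the two example reads are made up for illustration and are not part of this module's data or tools.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of FASTQ lines."""
    iterator = iter(lines)
    while True:
        # Each record is exactly four lines: header, sequence, separator, quality.
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            break
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield header[1:], sequence, quality


# Two hypothetical reads, written the way they would appear in a fastq file.
example = [
    "@read1\n", "GATTACA\n", "+\n", "IIIIIII\n",
    "@read2\n", "ACGTN\n", "+\n", "II#I!\n",
]
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, len(quality))
```

The same function should work on a real file opened in text mode, for example with `gzip.open(path, "rt")` for a `.fastq.gz` file, since it only needs an iterable of lines.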


In [None]:
display_flashcards('../flashcards/FastqFlashCard.json')

### Click on the flashcard above to see what each line represents.
Next, let's take a look at the sequence quality of the raw reads using fastqc:

In [None]:
# This command runs fastqc on each fastq.gz file inside our InputFiles directory and stores the output reports in our QC directory.
!fastqc -t $numthreadsint -q -o QC InputFiles/*fastq.gz

In [None]:
# Display the fastqc report for the control sample.
IFrame(src='Tutorial1/QC/p63ChIPseq_ctl_fastqc.html', width=1080, height=800)
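
Besides the HTML report, FastQC normally writes a zip archive per sample that contains a tab-separated `summary.txt` with a PASS/WARN/FAIL status for each QC module. The sketch below shows one way to read that file in Python. The helper name `fastqc_summary` is hypothetical, and the demo builds a small in-memory archive with made-up statuses rather than using real results.

```python
import io
import zipfile


def fastqc_summary(zip_source):
    """Return {module_name: status} from the summary.txt inside a FastQC zip."""
    with zipfile.ZipFile(zip_source) as archive:
        # summary.txt sits inside a folder named after the sample.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(name).decode("utf-8")
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Columns: status, module name, input file name.
        status, module, _filename = line.split("\t", 2)
        results[module] = status
    return results


# Demo archive with invented statuses, mimicking the layout of a FastQC zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo_fastqc/summary.txt",
        "PASS\tBasic Statistics\tdemo.fastq.gz\n"
        "WARN\tPer base sequence content\tdemo.fastq.gz\n"
        "FAIL\tAdapter Content\tdemo.fastq.gz\n",
    )
buffer.seek(0)
demo = fastqc_summary(buffer)
print(demo)
```

On real output you would pass a path instead of the in-memory buffer, for example `fastqc_summary("QC/p63ChIPseq_ctl_fastqc.zip")`, assuming FastQC wrote the zip file next to the HTML report in the QC directory.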

<div class="alert-success" style="font-size:100%">
After you've browsed the above report, try to visualize the quality of the other sample(s). Use the cell below to adapt the commands to visualize the bafi sample.
</div>

In [None]:
#Type the code to visualize the fastqc report on the other sample: Tutorial1/QC/p63ChIPseq_bafi_fastqc.html



<div class="alert-info" style="font-size:200%">
Trimming
</div>
<img src="images/trimming.jpg" alt="Drawing" style="width: 250px;"/>
Now that we've viewed the quality, let's trim the sequences to remove poor-quality bases and any adapter contamination. We'll use a package called Trimmomatic.
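
To illustrate what trimming poor-quality bases from the read ends means, here is a simplified sketch in Python. It is not Trimmomatic's implementation: it assumes Phred+33 quality encoding, handles only end trimming (no adapter removal), and the names `phred33` and `trim_ends` are made up for this example.

```python
def phred33(quality):
    """Convert a Phred+33 encoded quality string to a list of integer scores."""
    return [ord(char) - 33 for char in quality]


def trim_ends(sequence, quality, threshold=3):
    """Remove bases with quality below `threshold` from both ends of a read."""
    scores = phred33(quality)
    start, end = 0, len(scores)
    # Walk in from the 5' end while bases are below the threshold.
    while start < end and scores[start] < threshold:
        start += 1
    # Walk in from the 3' end while bases are below the threshold.
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return sequence[start:end], quality[start:end]


# Hypothetical read: the first and last bases are low-quality N calls.
trimmed_sequence, trimmed_quality = trim_ends("NACGTAN", "!IIIII#")
print(trimmed_sequence, trimmed_quality)
```

Here `!` encodes a score of 0 and `#` a score of 2, so both end bases fall below the threshold of 3 and are removed, while the high-quality bases in between are kept.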

## Let's use trimmomatic to prepare the sequences before mapping.

In [None]:
# This trims low-quality bases and N's from both ends of each read (LEADING:3, TRAILING:3) and removes the Nextera adapters (ILLUMINACLIP) present in the ChIP-seq library preparation, placing the trimmed reads in our Trimmed folder.
!trimmomatic SE -threads $numthreadsint InputFiles/p63ChIPseq_ctl.fastq.gz Trimmed/p63ChIPseq_ctl_trimmed.fastq.gz ILLUMINACLIP:RefGenome/NexteraPE.fa:2:30:10 LEADING:3 TRAILING:3
!trimmomatic SE -threads $numthreadsint InputFiles/p63ChIPseq_bafi.fastq.gz Trimmed/p63ChIPseq_bafi_trimmed.fastq.gz ILLUMINACLIP:RefGenome/NexteraPE.fa:2:30:10 LEADING:3 TRAILING:3

#Now let's do that for the other sample:
