# ATAC-seq Bonus Submodule: Tutorial 1 with custom data

<img src="./Tutorial1/LessonImages/ATACseqWorkflowLesson1.jpg" alt="Drawing" style="width: 1000px;"/>

## Overview & Purpose
In this notebook, we explore how to run this module with a new dataset. The submodules provide a framework for a rigorous and scalable ATAC-seq analysis, but a few changes are needed before you can run them on your own data. We walk through those changes here so that you can take these notebooks to your research group and use them for your own analysis. Not every code block includes the answer; if you get stuck, use the dropdowns for help.

<div class="alert-info" style="font-size:200%">
STEP 1: Set Up Environment
</div>


In [None]:
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
!bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

In [None]:
# add mamba to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"

In [None]:
# install dependencies
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'
numthreadsint = int(numthreads[0])
! mamba install -c bioconda fastqc bowtie2 picard multiqc samtools trimmomatic  -y
# Quote the version specifier so the shell does not treat ">" as an output redirect.
! pip install jupyterquiz==2.0.7 jupytercards "jupyter-book>=0.14"
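
Before moving on, it can help to confirm that the installed tools are visible on your `PATH` and to pick a thread count that is never zero. The following is a minimal sketch, not part of the original module: the helper names `find_missing_tools` and `worker_threads` are hypothetical, and the tool list simply mirrors the mamba install command above.

```python
import os
import shutil

# Tools requested in the mamba install cell above.
REQUIRED_TOOLS = ["fastqc", "bowtie2", "picard", "multiqc", "samtools", "trimmomatic"]


def find_missing_tools(tools):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def worker_threads():
    """Leave one CPU free for the notebook, but never return fewer than 1."""
    return max(1, (os.cpu_count() or 1) - 1)


missing = find_missing_tools(REQUIRED_TOOLS)
print("Missing tools:", ", ".join(missing) if missing else "none")
print("Threads to use:", worker_threads())
```

If any tool is reported as missing, re-run the PATH and install cells before continuing.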

In [None]:
from jupyterquiz import display_quiz
from IPython.display import IFrame
from IPython.display import display
from jupytercards import display_flashcards
import pandas as pd

<div class="alert-info" style="font-size:200%">
STEP 2: Get new fastq data
</div>

We are going to pull a new dataset from SRA. In this example, we use data from an experiment that compared cis-regulatory elements across tissues in zebrafish. The BioProject ID is [PRJNA553572](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA553572). There are over 200 samples in this experiment, but for simplicity, we are only going to use one liver sample and one muscle sample. The accession numbers for these samples are SRR12173474 and SRR12173476. To learn how to pull data from SRA, we recommend you consult the [STRIDES tutorial on SRA downloads](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/tutorials/notebooks/SRADownload/SRA-Download.ipynb). For this tutorial, we have already pulled the data from SRA and stored it in our storage bucket, so we will copy it from there into our working directory.

In [None]:
# These commands create our directory structure.
! mkdir -p Tutorial1/InputFiles
! mkdir -p Tutorial1/QC
! mkdir -p Tutorial1/Trimmed
! mkdir -p Tutorial1/Mapped
! mkdir -p Tutorial1/RefGenome

# This variable stores the path to the Google Cloud Storage bucket where the example files are held.
original_bucket = "gs://nigms-sandbox/unmc_atac_data_examples/Tutorial1"

Now copy the input data from the Cloud Storage Bucket to your local Tutorial1/InputFiles directory.
The original fastq files were very large, so here we are using down-sampled versions that contain 10% of the original dataset.

In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  Make sure you include the `!` in front of your command! 
    
  
`gsutil -m cp $original_bucket/InputFiles/Liver* Tutorial1/InputFiles/`  
`gsutil -m cp $original_bucket/InputFiles/Muscle* Tutorial1/InputFiles/`
  

</details>
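
After copying, it is worth checking that every sample has both mates of its paired-end reads. The following is a minimal sketch, assuming the `<sample>_sub_1.fastq.gz` / `<sample>_sub_2.fastq.gz` naming used by the trimming commands later in this notebook; the helper `find_read_pairs` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path


def find_read_pairs(input_dir, suffix_r1="_sub_1.fastq.gz", suffix_r2="_sub_2.fastq.gz"):
    """Map each sample name to its (R1, R2) paths; raise if a mate file is missing."""
    input_dir = Path(input_dir)
    if not input_dir.is_dir():
        return {}
    pairs = {}
    for r1 in sorted(input_dir.glob(f"*{suffix_r1}")):
        sample = r1.name[: -len(suffix_r1)]
        r2 = r1.with_name(sample + suffix_r2)
        if not r2.exists():
            raise FileNotFoundError(f"No mate file for {r1.name}: expected {r2.name}")
        pairs[sample] = (r1, r2)
    return pairs


for sample, (r1, r2) in find_read_pairs("Tutorial1/InputFiles").items():
    print(sample, r1.name, r2.name)
```

With your own data, adjust the two suffixes to match how your sequencing facility names the read files.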


Now copy the reference genome from the Cloud Storage Bucket to your local Tutorial1/RefGenome directory.


In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
    
`gsutil -m cp $original_bucket/RefGenome/* Tutorial1/RefGenome`

</details>


In [None]:
! gsutil -m cp $original_bucket/RefGenome/* Tutorial1/RefGenome


<div class="alert-info" style="font-size:200%">
STEP 3: QC
</div>


Now run fastqc on your data. You can use the wildcard \*.fastq.gz to find the files, like this: Tutorial1/InputFiles/*fastq.gz

In [None]:
# This command runs fastqc on each fastq.gz file inside our InputFiles directory and stores the output reports in our QC directory.
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    

  `fastqc -t $numthreadsint -q -o Tutorial1/QC Tutorial1/InputFiles/*fastq.gz`


</details>


We then use multiqc to summarize the report.

In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    

  `multiqc -o Tutorial1/QC -f Tutorial1/QC 2> Tutorial1/QC/multiqc_log.txt`

</details>


We'll load the outputs into a pandas data frame, but fastqc and multiqc also produce HTML reports that you can browse.


In [None]:
<YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  This is a python command, so we don't use a `!` at the start.   
  `dframe = pd.read_csv("Tutorial1/QC/multiqc_data/multiqc_fastqc.txt", sep='\t')`  
  `display(dframe)`

</details>
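
Once the table is loaded, a quick way to triage it is to list, for each sample, the columns whose status is `warn` or `fail`. The following is a minimal sketch using only the standard library. It assumes the table has a `Sample` column and per-module status columns holding `pass`, `warn`, or `fail`. The toy rows are made up for illustration and are not real results, and `flag_modules` is a hypothetical helper name.

```python
import csv
import io


def flag_modules(tsv_text, statuses=("warn", "fail")):
    """Return {sample: [columns whose value is one of `statuses`]} for a TSV table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    flagged = {}
    for row in reader:
        flagged[row["Sample"]] = [col for col, val in row.items() if val in statuses]
    return flagged


# Toy table for illustration only; these are NOT real results from this dataset.
toy_table = (
    "Sample\tper_base_sequence_quality\tadapter_content\n"
    "Liver_sub_1\tpass\tfail\n"
    "Muscle_sub_1\tpass\tpass\n"
)

for sample, modules in flag_modules(toy_table).items():
    print(sample, modules)
```

To apply it to the real report, pass the text of `Tutorial1/QC/multiqc_data/multiqc_fastqc.txt` instead of the toy table.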


As we come to the trimming step, we need to make some changes. As written, the original notebook uses the sample names CTL and MUTANT, but our new samples are named Liver and Muscle. We will change the file names in the command to Liver and Muscle, respectively. Try it out in the cell below.

In [None]:
! <YOUR COMMAND HERE> 

<details>
  <summary>Click for help</summary>
    

`trimmomatic PE -threads $numthreadsint Tutorial1/InputFiles/Liver_sub_1.fastq.gz Tutorial1/InputFiles/Liver_sub_2.fastq.gz Tutorial1/Trimmed/Liver_trimmed_R1.fastq.gz Tutorial1/Trimmed/Liver_unpaired_R1.fastq.gz Tutorial1/Trimmed/Liver_trimmed_R2.fastq.gz Tutorial1/Trimmed/Liver_unpaired_R2.fastq.gz ILLUMINACLIP:Tutorial1/RefGenome/NexteraPE.fa:2:30:10 LEADING:3 TRAILING:3`

</details>


In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
  
`trimmomatic PE -threads $numthreadsint Tutorial1/InputFiles/Muscle_sub_1.fastq.gz Tutorial1/InputFiles/Muscle_sub_2.fastq.gz Tutorial1/Trimmed/Muscle_trimmed_R1.fastq.gz Tutorial1/Trimmed/Muscle_unpaired_R1.fastq.gz Tutorial1/Trimmed/Muscle_trimmed_R2.fastq.gz Tutorial1/Trimmed/Muscle_unpaired_R2.fastq.gz ILLUMINACLIP:Tutorial1/RefGenome/NexteraPE.fa:2:30:10 LEADING:3 TRAILING:3`

</details>
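
Because the Liver and Muscle commands differ only in the sample name, the substitution can be expressed as a small function. The following is a minimal sketch that rebuilds the trimmomatic command shown in the dropdowns for any sample name. The function `trimmomatic_command` and its parameters are hypothetical names introduced here, and the block only prints the commands rather than running them.

```python
import shlex


def trimmomatic_command(sample, threads, base="Tutorial1"):
    """Build the paired-end trimmomatic command for one sample as an argument list."""
    inp, out = f"{base}/InputFiles", f"{base}/Trimmed"
    return [
        "trimmomatic", "PE", "-threads", str(threads),
        f"{inp}/{sample}_sub_1.fastq.gz",
        f"{inp}/{sample}_sub_2.fastq.gz",
        f"{out}/{sample}_trimmed_R1.fastq.gz",
        f"{out}/{sample}_unpaired_R1.fastq.gz",
        f"{out}/{sample}_trimmed_R2.fastq.gz",
        f"{out}/{sample}_unpaired_R2.fastq.gz",
        f"ILLUMINACLIP:{base}/RefGenome/NexteraPE.fa:2:30:10",
        "LEADING:3", "TRAILING:3",
    ]


# Print the commands for inspection; nothing is executed here.
for sample in ["Liver", "Muscle"]:
    print(shlex.join(trimmomatic_command(sample, threads=3)))
```

To run a command from Python instead of a `!` cell, one option is `subprocess.run(trimmomatic_command("Liver", numthreadsint), check=True)`.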


The summary step can proceed as normal.

In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    
   
  `fastqc -t $numthreadsint -q -o Tutorial1/Trimmed Tutorial1/Trimmed/*fastq.gz`  
  `multiqc -o Tutorial1/QC -f Tutorial1/Trimmed 2> Tutorial1/QC/multiqc_log.txt`  
  `dframe = pd.read_csv("Tutorial1/QC/multiqc_data/multiqc_general_stats.txt", sep='\t')`  
  `display(dframe)`  
  
</details>

<div class="alert-info" style="font-size:200%">
STEP 4: Mapping
</div>


The most important change in the mapping step is the reference genome. Our original notebook was based on human sequences, but our new dataset consists of zebrafish sequences, so we will use the GRCz11 reference genome. We have already stored it in our storage bucket, but you can get prebuilt bowtie2 indexes for several common organisms [here](https://benlangmead.github.io/aws-indexes/bowtie). Simply put the .bt2 files in your reference genome directory. We can confirm that we have downloaded them in the cell below.

In [None]:
!ls Tutorial1/RefGenome/*bt2
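
Listing the files shows what is present, but not what is missing. The following is a minimal sketch that checks for the six files of a standard (small) bowtie2 index under a given prefix. The prefix `Tutorial1/RefGenome/GRCz11` is taken from the mapping command in this notebook, the helper `missing_index_files` is a hypothetical name, and large indexes use a `.bt2l` extension that this sketch does not cover.

```python
from pathlib import Path

# A standard (small) bowtie2 index consists of these six files per prefix.
BT2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def missing_index_files(prefix):
    """Return the expected index files that do not exist for this prefix."""
    return [prefix + s for s in BT2_SUFFIXES if not Path(prefix + s).is_file()]


missing = missing_index_files("Tutorial1/RefGenome/GRCz11")
print("Missing index files:", missing if missing else "none")
```

An empty result means the value you pass to `bowtie2 -x` points at a complete index.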

Next, we will run the bowtie2 mapping step using the GRCz11 .bt2 reference files. As above, we need to change the sample names from CTL and MUTANT to our tissues.

In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    

  `bowtie2 -p $numthreadsint -x Tutorial1/RefGenome/GRCz11 -1 Tutorial1/Trimmed/Liver_trimmed_R1.fastq.gz -2 Tutorial1/Trimmed/Liver_trimmed_R2.fastq.gz -S Tutorial1/Mapped/Liver.sam`

</details>


Do the same thing for the Muscle sample. Based on the command above, make the necessary changes to the bowtie2 command to run on Muscle samples rather than Liver.

In [None]:
! <YOUR COMMAND HERE>
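
If you want to check your Muscle command, one option is to generate it from the sample name. The following is a minimal sketch that mirrors the Liver command from the dropdown above. The function `bowtie2_command` and its default arguments are hypothetical names introduced for illustration, and the block only prints the command.

```python
import shlex


def bowtie2_command(sample, threads, index_prefix="Tutorial1/RefGenome/GRCz11", base="Tutorial1"):
    """Build the paired-end bowtie2 mapping command for one sample."""
    return [
        "bowtie2", "-p", str(threads), "-x", index_prefix,
        "-1", f"{base}/Trimmed/{sample}_trimmed_R1.fastq.gz",
        "-2", f"{base}/Trimmed/{sample}_trimmed_R2.fastq.gz",
        "-S", f"{base}/Mapped/{sample}.sam",
    ]


# Print the command for inspection; nothing is executed here.
print(shlex.join(bowtie2_command("Muscle", threads=3)))
```

Only the sample name changes between tissues; the index prefix stays the same because both samples come from zebrafish.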

In the next commands we'll convert the file to the more compressed [bam format](https://genome.ucsc.edu/goldenPath/help/bam.html) and sort the reads by chromosomal coordinates. Again, we change the samples to match Liver and Muscle.

In [None]:
! <YOUR COMMAND HERE>

<details>
  <summary>Click for help</summary>
    

  `samtools view -q 10 -bhS Tutorial1/Mapped/Liver.sam | samtools sort -o Tutorial1/Mapped/Liver.bam -`  
  `print("done")`

</details>


Now do the same thing for Muscle using the Liver command above as a guide.

In [None]:
! <YOUR COMMAND HERE>
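
The filter-and-sort step is a two-command pipe, which can also be driven from Python. The following is a minimal sketch, assuming the same `samtools view | samtools sort` pipeline shown in the Liver dropdown. The helpers `samtools_commands` and `run_sam_to_sorted_bam` are hypothetical names, and the block only prints the pipeline unless you call the run function yourself.

```python
import shlex
import subprocess


def samtools_commands(sample, mapped_dir="Tutorial1/Mapped"):
    """Return the (view, sort) argument lists for one sample."""
    sam = f"{mapped_dir}/{sample}.sam"
    bam = f"{mapped_dir}/{sample}.bam"
    view = ["samtools", "view", "-q", "10", "-bhS", sam]  # keep reads with MAPQ >= 10
    sort = ["samtools", "sort", "-o", bam, "-"]           # read BAM from stdin, sort by coordinate
    return view, sort


def run_sam_to_sorted_bam(sample):
    """Pipe `samtools view` into `samtools sort` and fail loudly on errors."""
    view, sort = samtools_commands(sample)
    view_proc = subprocess.Popen(view, stdout=subprocess.PIPE)
    sort_proc = subprocess.Popen(sort, stdin=view_proc.stdout)
    view_proc.stdout.close()  # let `view` receive SIGPIPE if `sort` exits early
    sort_rc = sort_proc.wait()
    view_rc = view_proc.wait()
    if view_rc != 0 or sort_rc != 0:
        raise RuntimeError(f"samtools failed for {sample}: view={view_rc}, sort={sort_rc}")


# Print the pipeline for inspection; call run_sam_to_sorted_bam("Muscle") to execute it.
view, sort = samtools_commands("Muscle")
print(shlex.join(view), "|", shlex.join(sort))
```

Checking both return codes matters because a failure in `samtools view` would otherwise go unnoticed behind a successful sort of truncated input.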

<div class="alert-info" style="font-size:200%">
STEP 5: Removal of Duplicates
</div>

As with most of the other steps here, the duplicate removal step can be updated for the new data by changing the sample names in the input files. We leave this step as an exercise for the user. Feel free to look at Tutorial 1 for a template of the picard MarkDuplicates command.

In [None]:
! <YOUR COMMAND HERE>

In [None]:
! <YOUR COMMAND HERE>
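
If you would rather generate the command than copy it, the same sample-name substitution applies. The following is a rough sketch only: it uses picard's `I=`, `O=`, `M=`, and `REMOVE_DUPLICATES=` arguments for MarkDuplicates, but the output and metrics file names are hypothetical and may differ from those in Tutorial 1, so compare against that template before relying on it. The function name `markduplicates_command` is also an assumption.

```python
import shlex


def markduplicates_command(sample, mapped_dir="Tutorial1/Mapped", remove=True):
    """Build a picard MarkDuplicates command for one sorted BAM file."""
    return [
        "picard", "MarkDuplicates",
        f"I={mapped_dir}/{sample}.bam",
        f"O={mapped_dir}/{sample}_dedup.bam",        # hypothetical output name
        f"M={mapped_dir}/{sample}_dup_metrics.txt",  # hypothetical metrics name
        f"REMOVE_DUPLICATES={'true' if remove else 'false'}",
    ]


# Print the commands for inspection; nothing is executed here.
for sample in ["Liver", "Muscle"]:
    print(shlex.join(markduplicates_command(sample)))
```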

The last step is the same. We can look at the metrics the same way regardless of the samples used above.

In [None]:
# We can use multiqc to summarize the metrics.
!multiqc -o Tutorial1/QC -f Tutorial1/Mapped 2> Tutorial1/Mapped/multiqc_log.txt
dframe = pd.read_csv("Tutorial1/QC/multiqc_data/multiqc_general_stats.txt", sep='\t')
display(dframe)

<div class="alert-success" style="font-size:200%">
Great job! 
</div>

That wraps up the preprocessing notebook on our new dataset. Overall, it is usually a matter of changing the commands to match the new filenames from the new data. Using this notebook as a guide, try to think through how you could update the others to run on the Liver and Muscle samples. 

Using the subsampled datasets, this tutorial should complete in about 10 minutes of runtime using the n1-standard-4 machine recommended by the module README file. Feel free to adjust the machine type and see how the runtime of different steps varies with more memory and compute resources. If you want to run the full dataset without subsampling, it would take a couple of hours using an n1-standard-64 machine.

If you want to continue to adapt this for real-world data, you could also try to modify the notebooks to run on multiple samples. Currently, they rely on one case and one control sample, but in a real sequencing run you would likely have several samples of each. How could you modify the code here to handle that type of situation?
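
One possible starting point for that question is a sample sheet that records which tissue each sample belongs to, so that per-sample steps can loop over it and downstream steps can group the results. The following is a minimal sketch under stated assumptions: the replicate names are hypothetical and do not correspond to files in the tutorial bucket, and `bams_by_condition` is a helper name introduced here.

```python
from collections import defaultdict

# Hypothetical sample sheet of (sample name, tissue). The replicate names are
# illustrative only and do not exist in the tutorial bucket.
SAMPLE_SHEET = [
    ("Liver_rep1", "Liver"),
    ("Liver_rep2", "Liver"),
    ("Muscle_rep1", "Muscle"),
    ("Muscle_rep2", "Muscle"),
]


def bams_by_condition(sample_sheet, mapped_dir="Tutorial1/Mapped"):
    """Group the expected per-sample BAM paths by tissue."""
    groups = defaultdict(list)
    for sample, condition in sample_sheet:
        groups[condition].append(f"{mapped_dir}/{sample}.bam")
    return dict(groups)


# Per-sample steps (trimming, mapping, sorting) would loop over the sample names.
for sample, condition in SAMPLE_SHEET:
    print(f"{sample}: process as a {condition} sample")

# Steps that compare tissues can then work from the grouped BAM files.
for condition, bams in bams_by_condition(SAMPLE_SHEET).items():
    print(condition, bams)
```

Keeping the sample-to-tissue mapping in one place means that adding a replicate is a one-line change rather than an edit to every command.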