# ATAC-seq Bonus Submodule: Tutorial 1 with custom data

<img src="./Tutorial1/LessonImages/ATACseqWorkflowLesson1.jpg" alt="Drawing" style="width: 1000px;"/>

## Overview & Purpose
In this notebook, we explore how to run this module with a new dataset. The submodules provide a framework for a rigorous and scalable ATAC-seq analysis, but running them on your own data requires some adjustments. We walk through that process here so that you can take these notebooks to your research group and use them for your own analysis. Note that we do not give you all the answers in the code blocks; if you get stuck, use the dropdowns for help.

<div class="alert-info" style="font-size:200%">
STEP 1: Set Up Environment
</div>


In [None]:
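# Mambaforge installers are no longer published; Miniforge3 ships mamba and is installed to the same prefix here.
!curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
!bash Miniforge3-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge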

In [None]:
# add mamba to your path
import os
os.environ["PATH"] += os.pathsep + os.environ["HOME"]+"/mambaforge/bin"
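Re-running the cell above appends the same directory to `PATH` each time. As a minimal sketch (the helper name `add_to_path` is hypothetical and not part of the original module), the following adds the directory only once and then checks whether `mamba` can be found:

```python
import os
import shutil


def add_to_path(directory, environ=os.environ):
    """Append `directory` to PATH unless it is already listed.

    Returns True if PATH was modified, False otherwise.
    """
    parts = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if directory in parts:
        return False
    environ["PATH"] = os.pathsep.join(parts + [directory])
    return True


# Assumes the installer prefix used earlier in this notebook ($HOME/mambaforge).
mamba_bin = os.path.join(os.path.expanduser("~"), "mambaforge", "bin")
add_to_path(mamba_bin)

# shutil.which returns None when the executable is not on PATH.
print("mamba found at:", shutil.which("mamba"))
```

If the last line prints `None`, the install step most likely did not complete or used a different prefix.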

In [None]:
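# determine how many threads to use (all CPUs but one)
numthreads=!lscpu | grep '^CPU(s)'| awk '{print $2-1}'
numthreadsint = int(numthreads[0])
# install dependencies
! mamba install -c bioconda fastqc bowtie2 picard multiqc samtools trimmomatic -y
# quote the version specifier so the shell does not treat ">" as an output redirect
! pip install jupyterquiz==2.0.7 jupytercards "jupyter-book>=0.14"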
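The thread count in the cell above comes from `lscpu`, which is only available on Linux, and it evaluates to 0 on a single-CPU machine. A minimal sketch of a portable alternative, assuming `os.cpu_count()` reports the same logical CPU count on your machine (`worker_threads` is a hypothetical helper name, not part of the original module):

```python
import os


def worker_threads(reserve=1):
    """Return the number of threads to use, leaving `reserve` CPUs free.

    Always returns at least 1, even on a single-CPU machine.
    """
    total = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, total - reserve)


numthreadsint = worker_threads()
print("threads available for analysis:", numthreadsint)
```

Leaving one CPU free keeps the notebook responsive while tools such as `bowtie2` run.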

In [None]:
from jupyterquiz import display_quiz
from IPython.display import IFrame
from IPython.display import display
from jupytercards import display_flashcards
import pandas as pd

<div class="alert-info" style="font-size:200%">
STEP 2: Get new fastq data
</div>

We are going to pull a new dataset from SRA. In this example, we use data from an experiment that compared cis-regulatory elements across tissues in zebrafish. The BioProject ID is [PRJNA553572](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA553572). There are over 200 samples in this experiment, but for simplicity we use only one liver sample and one muscle sample. The accession numbers for these samples are SRR12173474 and SRR12173476. To learn how to pull data from SRA, we recommend you consult the [STRIDES tutorial on SRA downloads](https://github.com/STRIDES/NIHCloudLabGCP/blob/main/tutorials/notebooks/SRADownload/SRA-Download.ipynb). In this case, we have already pulled the data from SRA and placed it in our storage bucket, so we will copy it to our directory.
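If you do want to download the runs yourself, the sketch below only builds and prints the commands; it does not run them. It assumes the SRA Toolkit (`prefetch` and `fasterq-dump`) is installed, which is not part of the dependency list in this notebook. The helper name `sra_download_commands`, the output directory, and the thread count are illustrative choices. Whether `--split-files` produces two files per run depends on the run being paired-end, which you should confirm on the SRA run page.

```python
# Accessions named in this tutorial (one liver and one muscle sample).
accessions = ["SRR12173474", "SRR12173476"]


def sra_download_commands(accession, outdir="Tutorial1/InputFiles", threads=2):
    """Return the SRA Toolkit commands for one run as argument lists.

    Nothing is executed here; the lists can be passed to subprocess.run
    or joined into shell strings for inspection.
    """
    return [
        # Fetch the run archive into the local SRA cache.
        ["prefetch", accession],
        # Convert the archive to FASTQ, one file per read for paired runs.
        ["fasterq-dump", "--split-files", "--threads", str(threads),
         "--outdir", outdir, accession],
    ]


for acc in accessions:
    for cmd in sra_download_commands(acc):
        print(" ".join(cmd))
```

Downloading this way yields the full-size runs, so expect much larger files than the copies held in the storage bucket.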

In [None]:
# These commands create our directory structure.
! mkdir -p Tutorial1/InputFiles
! mkdir -p Tutorial1/QC
! mkdir -p Tutorial1/Trimmed
! mkdir -p Tutorial1/Mapped
! mkdir -p Tutorial1/RefGenome

# This variable stores the path to the Google Cloud Storage bucket where the example files are held.
original_bucket = "gs://nigms-sandbox/unmc_atac_data_examples/Tutorial1"
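The same directory layout can also be created from Python, which may be convenient when you adapt the notebook to a dataset with a different root folder. This is a minimal sketch; `make_layout` and `SUBDIRS` are hypothetical names introduced for illustration:

```python
from pathlib import Path

# Subdirectories used in this tutorial.
SUBDIRS = ["InputFiles", "QC", "Trimmed", "Mapped", "RefGenome"]


def make_layout(root="Tutorial1", subdirs=SUBDIRS):
    """Create root/<subdir> for each entry and return the resulting paths.

    Existing directories are left untouched, like `mkdir -p`.
    """
    created = []
    for name in subdirs:
        path = Path(root) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_layout()` reproduces the `mkdir -p` commands above, and calling `make_layout(root="MyProject")` creates the same structure under a different folder.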

Now copy the input data from the Cloud Storage Bucket to your local Tutorial1/InputFiles directory.
The original fastq files were very large, so here we are using down-sampled versions that contain 10% of the original dataset.

In [None]:
! <YOUR COMMAND HERE>
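Once your copy command has run, you may want to confirm that the files arrived. The sketch below lists FASTQ files in the input directory along with their sizes. The helper name `list_fastq` is hypothetical, and the file extensions are assumptions, since the exact file names in the bucket are not stated here:

```python
from pathlib import Path


def list_fastq(directory="Tutorial1/InputFiles"):
    """Return a sorted list of FASTQ-like files found directly in `directory`."""
    patterns = ("*.fastq", "*.fq", "*.fastq.gz", "*.fq.gz")
    found = set()
    for pattern in patterns:
        found.update(Path(directory).glob(pattern))
    return sorted(found)


# Print each file with its size in megabytes.
for path in list_fastq():
    print(f"{path.name}\t{path.stat().st_size / 1e6:.1f} MB")
```

An empty listing suggests the copy did not succeed or the files use a different extension.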