# SeqAcademy Multiomics Tutorial

# Contents
1. Intro to Jupyter Notebook
2. Installation
    1. Set up channels
    2. Create an environment and install the packages
3. Alignment
    1. HISAT 
    2. Samtools
    3. MultiQC
4. ChIP-Seq Analysis
    1. MACS
    2. Bedtools
    3. IGV
5. RNA-Seq Analysis
    1. HTSeq
    2. DESeq
    3. Visualization

# 1. Intro to Jupyter Notebook

http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html

This tutorial is itself a Jupyter Notebook, the continuation of the IPython Notebook project. The Jupyter Notebook App runs in any modern web browser.

A notebook contains various features, such as a kernel that can execute code and the ability to render markdown text.  Thus, there are two important types of cells:

* Code cells - type commands in a variety of languages (Python, bash, R, etc.) and execute them
* Markdown cells - format text and data for a clear presentation of material

Jupyter notebooks are self-contained documents that provide a convenient way to annotate and execute code in a variety of languages, all from within a web browser.

# 2. Installation

Before running any programs, we'll make sure that each program is installed correctly. This tutorial uses Bioconda (https://bioconda.github.io/), a channel for the conda package manager that specializes in bioinformatics software. The available packages are listed here: https://bioconda.github.io/recipes.html#recipes.

Corey Schafer, among others, has made a YouTube tutorial on downloading and installing Anaconda and on setting up virtual environments, which are discussed in the "Create an environment and install the packages" portion of this tutorial.
https://www.youtube.com/watch?v=YJC6ldI3hWk

### 2A. Set up channels

You will need to add the bioconda channel as well as the other channels bioconda depends on. It is important to add them in this order so that the priority is set correctly (that is, bioconda is highest priority).

The conda-forge channel contains many general-purpose packages not already found in the defaults channel. The r channel is only included due to backward compatibility. It is not mandatory, but without the r channel packages compiled against R 3.3.1 might not work.

This tutorial uses cells written in Python and Unix shell commands to perform its analyses. Lines that are shell commands are prefixed with an exclamation point.

Select the following cell and run it. To run a cell, select it, click "Cell" in the upper menu bar, and choose "Run Cells", or press Shift + Enter. Alternatively, the contents of any cell may be copied and pasted into a terminal emulator (without the leading exclamation point).

`--add channels` tells `conda config` to add the channel you name. Each `--add` places the new channel at the top of the channel list, so adding "defaults", "conda-forge", and "bioconda" in that order leaves bioconda with the highest priority.

In [1]:
!conda config --add channels defaults
!conda config --add channels conda-forge
!conda config --add channels bioconda
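
To see why the order of the three commands matters, here is a small Python sketch of how repeated `--add` calls build the channel list. The function `add_channel` is a hypothetical stand-in written for this illustration, not part of conda; it only mimics the behavior of placing each newly added channel at the front.

```python
def add_channel(channels, name):
    """Mimic `conda config --add channels NAME`: put NAME at the front of the list."""
    return [name] + [c for c in channels if c != name]


channels = []
for name in ["defaults", "conda-forge", "bioconda"]:
    channels = add_channel(channels, name)

# The first entry has the highest priority.
print(channels)
```

The channel added last ends up first in the list, which is why bioconda is added last.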



### 2B. Create an environment and install the packages

In this tutorial we will create an environment named "tutorial" and install the packages there. An environment is an isolated set of packages, so different projects can use different versions of Python and other software without conflict. You can create, export, list, remove, and update environments. Switching between environments is called activating an environment. You can also share an environment file.

Run the following command to create the environment. The `-n` flag specifies the name of the environment to create ("tutorial"), and the names that follow are the packages that will be installed in it. The `--yes` flag answers the confirmation prompt automatically.

This will most likely take 10-15 minutes.

The Corey Schafer YouTube video linked at the start of Section 2 walks through these steps.

In [None]:
!conda create -n tutorial hisat2 multiqc macs2 bioconductor-deseq matplotlib ggplot samtools bioconductor-rsamtools bedtools htseq  --yes

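Then activate the environment with the following command. Note that each `!` command runs in its own temporary shell, so an activation run this way does not carry over to later cells. If later cells report that a program is not found, activate the environment in a terminal and start Jupyter from there.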

In [3]:
# For Mac and Linux

!source activate tutorial

In [None]:
# For Windows

!activate tutorial

# 3. Alignment
### 3A. HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts)

In this tutorial, we'll use HISAT2 to align the sample reads to a reference genome. Given an SRA accession, HISAT2 downloads the reads itself, so they are ready to be aligned without a separate download step. HISAT (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp.

The RNA-Seq data we'll use is from https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP106028 and the ChIP-Seq data is from https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP132584

The model organism for this project is yeast (Saccharomyces cerevisiae). For RNA-Seq, yeast data from euploid and aneuploid conditions will be compared. For ChIP-Seq, yeast data from 3AT-treated and untreated conditions will be compared.

The following cell contains python code. 

The Python programming language comes with a variety of built-in functions. Among these are several common functions, including:

+ print() which prints expressions out
+ abs() which returns the absolute value of a number
+ int() which converts another data type to an integer
+ len() which returns the length of a sequence or collection

These built-in functions, however, are limited, and we can make use of modules to make more sophisticated programs.

Modules are Python .py files that consist of Python code. Any Python file can be referenced as a module. A Python file called hello.py has the module name of hello that can be imported into other Python files or used on the Python command line interpreter. 
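
As a quick illustration of the built-in functions listed above and of importing a module, here is a short sketch. It uses `math` from the standard library as the example module; the list `values` is made up for the demonstration.

```python
import math  # a module that ships with Python

values = [-3, 7, -12]

count = len(values)            # number of items in the list
magnitude = abs(values[0])     # absolute value of -3
number = int("42") + 1         # convert text to an integer, then add
root = math.sqrt(16)           # a function provided by the math module

print(count, magnitude, number, root)
```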

Run the following cell to see what it does and observe the output

In [4]:
from pandas import read_csv

# Read the RNA-Seq run table and build one output path per run accession.
# Adding a string to a pandas Series applies the addition to every element.
RNASeqSRARunTableFile = 'data/RNASeqSRA.tsv'
RNASeqSRATable = read_csv(RNASeqSRARunTableFile, delimiter='\t')
RNASeqoutrun = RNASeqSRATable["Run"].astype(str)
RNASeqoutputSam = "test/" + RNASeqoutrun + ".sam"
RNASeqoutputAlignmentSummary = "test/" + RNASeqoutrun + ".txt"
RNASeqoutputMetrics = "test/" + RNASeqoutrun + ".metrics"
RNASeqoutputSortBam = "test/" + RNASeqoutrun + ".sorted.bam"

# Same for the ChIP-Seq run table
ChIPSeqSRARunTableFile = 'data/ChIPSeqSRA.tsv'
ChIPSeqSRATable = read_csv(ChIPSeqSRARunTableFile, delimiter='\t')
ChIPSeqoutrun = ChIPSeqSRATable["Run"].astype(str)
ChIPSeqoutputSam = "test/" + ChIPSeqoutrun + ".sam"
ChIPSeqoutputAlignmentSummary = "test/" + ChIPSeqoutrun + ".txt"
ChIPSeqoutputMetrics = "test/" + ChIPSeqoutrun + ".metrics"
ChIPSeqoutputSortBam = "test/" + ChIPSeqoutrun + ".sorted.bam"

print("RNA-Seq run " + RNASeqoutrun)
print("ChIP-Seq run " + ChIPSeqoutrun)

0    RNA-Seq run SRR5494627
1    RNA-Seq run SRR5494630
Name: Run, dtype: object
0    ChIP-Seq run SRR6703656
1    ChIP-Seq run SRR6703661
Name: Run, dtype: object
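
For readers new to pandas, the path construction in the cell above can be sketched with the standard library alone. This is an illustrative equivalent, not the notebook's own method: the table text is a hypothetical stand-in for `data/RNASeqSRA.tsv` (only the two run accessions come from the output above, and the `condition` column is invented).

```python
import csv
import io

# Hypothetical stand-in for data/RNASeqSRA.tsv; only the "Run" column is used.
run_table_text = "Run\tcondition\nSRR5494627\tsample_a\nSRR5494630\tsample_b\n"

reader = csv.DictReader(io.StringIO(run_table_text), delimiter="\t")
runs = [row["Run"] for row in reader]

# One output path per run, following the same naming scheme as the cell above.
sam_paths = ["test/" + run + ".sam" for run in runs]
sorted_bam_paths = ["test/" + run + ".sorted.bam" for run in runs]

for run, sam, bam in zip(runs, sam_paths, sorted_bam_paths):
    print(run, sam, bam)
```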


Next we build the yeast index. The bash script `makeYeastIndex.sh` in the directory `scripts` downloads the Saccharomyces cerevisiae genome sequence from Ensembl and builds a HISAT2 index from it. By default, it builds an index for just the base files, since alignments to those sequences are the most useful. To change which categories are built by this script, edit the CHRS_TO_INDEX variable in the `scripts/makeYeastIndex.sh` file.

We will also use the `mkdir` command to make a directory `test` that will be used for storing the output from this tutorial.

We begin by creating a directory to hold the yeast index and the files used to create it. A `!cd` on one line does not affect later lines or cells, so any cell that needs to work inside this directory changes into it itself.

In [None]:
!mkdir -p yeast_index

Our next step is to download the sequences for release 84 of the Saccharomyces cerevisiae genome and build the index. Each `!` line runs in a separate shell, so variables and multi-line `if` blocks do not survive from one line to the next. The steps are therefore run together in a single `%%bash` cell.

In [11]:
%%bash
mkdir -p yeast_index
cd yeast_index

ENSEMBL_RELEASE=84
ENSEMBL_YEAST_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/fasta/saccharomyces_cerevisiae/dna

F=Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
# Download and unpack the genome unless it is already present
if [ ! -f genome.fa ] ; then
	wget ${ENSEMBL_YEAST_BASE}/$F.gz || { echo "Error getting $F"; exit 1; }
	gunzip $F.gz || { echo "Error unzipping $F"; exit 1; }
	mv $F genome.fa
fi

# Build the HISAT2 index files with the basename "genome"
CMD="hisat2-build genome.fa genome"
echo Running $CMD
if $CMD ; then
	echo "genome index built; you may remove fasta files"
else
	echo "Index building failed; see error message"
fi

rm genome.fa

The contents of `scripts/makeYeastIndex.sh` are shown here for reference:

In [None]:
#!/bin/sh
mkdir yeast_index
cd yeast_index
#
# Downloads sequences for the latest Yeast release from Ensembl.
#
# By default, this script builds an index for just the base files,
# since alignments to those sequences are the most useful.  To change
# which categories are built by this script, edit the CHRS_TO_INDEX
# variable below.
#

ENSEMBL_RELEASE=84
ENSEMBL_YEAST_BASE=ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/fasta/saccharomyces_cerevisiae/dna/

F=Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
if [ ! -f $F ] ; then
	wget ${ENSEMBL_YEAST_BASE}/$F.gz || (echo "Error getting $F" && exit 1)
	gunzip $F.gz || (echo "Error unzipping $F" && exit 1)
	mv $F genome.fa
fi

CMD="hisat2-build genome.fa genome" 
echo Running $CMD
if $CMD ; then
	echo "genome index built; you may remove fasta files"
else
	echo "Index building failed; see error message"
fi

rm genome.fa


In [5]:
!bash scripts/makeYeastIndex.sh
!mkdir test

scripts/makeYeastIndex.sh: line 18: wget: command not found
Error getting Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
gunzip: can't stat: Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz (Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz.gz): No such file or directory
Error unzipping Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
mv: rename Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa to genome.fa: No such file or directory
Running hisat2-build genome.fa genome
scripts/makeYeastIndex.sh: line 25: hisat2-build: command not found
Index building failed; see error message
rm: genome.fa: No such file or directory
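
The "command not found" messages in the output above mean that `wget` and `hisat2-build` were not on the PATH when the script ran. A minimal sketch for checking this before starting a long step is shown below. The helper `missing_tools` and the list `required` are names chosen for this illustration; adjust the list to the programs you actually need.

```python
import shutil


def missing_tools(tools):
    """Return the tool names that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Programs called by the alignment steps of this tutorial.
required = ["wget", "gunzip", "hisat2-build", "hisat2", "samtools"]

missing = missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If the conda-installed programs are reported missing, the "tutorial" environment is probably not active in the session that launched Jupyter.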


Align the RNA-Seq samples using HISAT2.

This step will most likely take several hours. 

In [None]:
# Align each run; HISAT2 fetches the reads for the accession given to --sra-acc
for run, summary, metrics, sam in zip(RNASeqoutrun, RNASeqoutputAlignmentSummary, RNASeqoutputMetrics, RNASeqoutputSam):
    !hisat2 -x yeast_index/genome --sra-acc $run --new-summary --summary-file $summary --met-file $metrics -S $sam
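
Before launching a step that takes hours, it can help to print the exact commands first. The sketch below rebuilds the `hisat2` command line used above for each run without executing it. The function `hisat2_command` and its parameters `out_dir` and `index` are hypothetical names introduced here; the flags are the ones already used in this tutorial.

```python
import shlex


def hisat2_command(run, out_dir="test", index="yeast_index/genome"):
    """Build the hisat2 argument list for one SRA run (nothing is executed)."""
    return [
        "hisat2", "-x", index,
        "--sra-acc", run,
        "--new-summary",
        "--summary-file", f"{out_dir}/{run}.txt",
        "--met-file", f"{out_dir}/{run}.metrics",
        "-S", f"{out_dir}/{run}.sam",
    ]


for run in ["SRR5494627", "SRR5494630"]:
    print(shlex.join(hisat2_command(run)))
```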

### 3B. Samtools 

We'll use samtools to sort the output files and convert them to BAM files. In `samtools view`, `-b` writes BAM output, `-S` marks the input as SAM (recent samtools versions detect the format automatically and ignore this flag), and `-F 4` drops unmapped reads.

In [None]:
# Drop unmapped reads, convert to BAM, and sort by coordinate
for sam, bam in zip(RNASeqoutputSam, RNASeqoutputSortBam):
    !samtools view -bSF4 $sam | samtools sort -o $bam
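
The `-F 4` part of the `samtools view` call filters on the FLAG field of each SAM record: any record with bit `0x4` (read unmapped) set is excluded. A small sketch of that bit test follows. The function name `passes_F4` and the example FLAG values are chosen for illustration only.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "this read is unmapped"


def passes_F4(flag):
    """True if a record with this FLAG would be kept by `samtools view -F 4`."""
    return (flag & UNMAPPED) == 0


# Example FLAG values: 0 and 16 are mapped single-end reads (forward/reverse),
# 4 is an unmapped read, 77 is an unmapped first mate, 99 is a properly paired first mate.
flags = [0, 4, 16, 77, 99]
kept = [flag for flag in flags if passes_F4(flag)]
print(kept)
```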
    

Do the same for the ChIP-Seq samples: align with HISAT2, then sort and convert with samtools.

In [None]:
hisat_index = "yeast_index/genome"
# -p 2 runs HISAT2 with two threads
for run, summary, metrics, sam in zip(ChIPSeqoutrun, ChIPSeqoutputAlignmentSummary, ChIPSeqoutputMetrics, ChIPSeqoutputSam):
    !hisat2 -p 2 -x $hisat_index --sra-acc $run --new-summary --summary-file $summary --met-file $metrics -S $sam

In [None]:
# Drop unmapped reads, convert to BAM, and sort by coordinate
for sam, bam in zip(ChIPSeqoutputSam, ChIPSeqoutputSortBam):
    !samtools view -bSF4 $sam | samtools sort -o $bam

### 3C. MultiQC

This section details quality control checks on the RNA-Seq and ChIP-Seq read data using MultiQC. MultiQC takes the output and log files from an alignment program and aggregates the information from all samples into one convenient report (HTML by default).

MultiQC was installed earlier in the tutorial, so all we need to do is run it on the data.

MultiQC runs the same way no matter what type of sequencing data is available, so the same command can be used for either our RNA-Seq data or our ChIP-Seq data. We pass it the HISAT2 alignment summary files written during the alignment step; because those were produced with `--new-summary`, MultiQC can recognize them. See http://multiqc.info/docs/ for more information.

We use the `--force` option to overwrite any previous versions of the report and `--outdir` to choose where it is written. `--quiet` only shows log warnings.

In [None]:
# Join the summary paths into one space-separated string for the shell
RNASeqSummaries = " ".join(RNASeqoutputAlignmentSummary)
ChIPSeqSummaries = " ".join(ChIPSeqoutputAlignmentSummary)
!multiqc $RNASeqSummaries --quiet --outdir test/multiqc_rnaseq --force
!multiqc $ChIPSeqSummaries --quiet --outdir test/multiqc_chipseq --force

# 4. ChIP-Seq Analysis
### 4A. MACS (Model-based Analysis for ChIP-Seq)

Peak calling is one of the main steps scientists use to determine where a protein is bound to DNA. Peak detection software, such as MACS (Model-based Analysis for ChIP-Seq), takes the aligned sequences as input and returns the locations of predicted peaks as output. In this tutorial, we'll use MACS.

More information about MACS: http://liulab.dfci.harvard.edu/MACS/Download.html

In [None]:
# Split the runs by condition: untreated runs are the controls
ChIPSeqControl = ChIPSeqSRATable.loc[ChIPSeqSRATable["source_name"] == "Untreated", "Run"].astype(str)
ChIPSeqTreatment = ChIPSeqSRATable.loc[ChIPSeqSRATable["source_name"] != "Untreated", "Run"].astype(str)

print("Control sample " + ChIPSeqControl)
print("Treatment sample " + ChIPSeqTreatment)

In [None]:
for index, individual in enumerate(ChIPSeqControl):
    outputdirectory = "test/" + ChIPSeqTreatment.iloc[index]
    name = ChIPSeqTreatment.iloc[index]
    immunoprecipitate = "test/" + ChIPSeqTreatment.iloc[index] + ".sorted.bam"
    control = "test/" + ChIPSeqControl.iloc[index] + ".sorted.bam"
    !macs2 callpeak -c $control -t $immunoprecipitate -n $name --outdir $outputdirectory

For an in-depth discussion of what MACS2 does: https://github.com/taoliu/MACS/wiki/Advanced:-Call-peaks-using-MACS2-subcommands

### 4B. Bedtools

In this tutorial, we'll use Bedtools to extract the intersecting regions of the MACS output between the experimental conditions.

The Bedtools suite of programs is widely used for genomic interval manipulation or "genome algebra". 

First we'll sort the output. The following cell uses the `sort` command to order each MACS peak file by chromosome (column 1) and then numerically by start position (column 2).

In [None]:
# Take the sample name from the loop so every treatment run is sorted
for name in ChIPSeqTreatment:
    macs_output = "test/" + name + "/" + name + "_peaks.narrowPeak"
    sorted_output = macs_output + ".sorted"
    !sort -k 1,1 -k2,2n $macs_output > $sorted_output
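
The two sort keys used above (`-k 1,1` for the chromosome as text, `-k2,2n` for the start position as a number) can be sketched in Python as follows. The narrowPeak lines here are hypothetical values invented for illustration, and text ordering in Python may differ from `sort` under some locale settings.

```python
# Hypothetical narrowPeak lines (tab-separated); all values are made up.
lines = [
    "II\t5000\t5400\tpeak_3\t120\t.\t6.1\t14.2\t12.0\t180",
    "I\t20000\t20350\tpeak_2\t300\t.\t9.8\t33.5\t30.1\t150",
    "I\t3000\t3500\tpeak_1\t85\t.\t4.2\t10.7\t8.5\t210",
]


def sort_key(line):
    """Chromosome name as text, then start position as an integer."""
    fields = line.split("\t")
    return (fields[0], int(fields[1]))


sorted_lines = sorted(lines, key=sort_key)
peak_names = [line.split("\t")[3] for line in sorted_lines]
print(peak_names)
```

Converting the start to an integer matters: compared as text, "20000" would sort before "3000".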

Then we'll find the intersecting regions between the different experimental conditions.

In [None]:
!bedtools intersect -a test/SRR6703661/SRR6703661_peaks.narrowPeak.sorted -b test/SRR6703663/SRR6703663_peaks.narrowPeak.sorted -u > test/ChIPSeqintersect.bed
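
With `-u`, `bedtools intersect` reports each interval from file A once if it overlaps at least one interval in file B. The sketch below mimics that rule on small in-memory lists. It is a simplified illustration, not how bedtools is implemented, and the intervals are hypothetical. Coordinates follow the BED convention of zero-based starts and exclusive ends.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect_u(a_intervals, b_intervals):
    """Keep each A interval once if it overlaps any B interval."""
    return [a for a in a_intervals if any(overlaps(a, b) for b in b_intervals)]


# Hypothetical peaks from two conditions.
a_peaks = [("I", 100, 200), ("I", 500, 600), ("II", 100, 200)]
b_peaks = [("I", 150, 250), ("I", 180, 220), ("II", 200, 300)]

shared = intersect_u(a_peaks, b_peaks)
print(shared)
```

The first A peak overlaps two B peaks but is listed only once, and the peak on chromosome II is dropped because the two intervals merely touch.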

### 4C. Integrative Genomics Viewer

A BAM file viewer will allow you to see your reads in an interactive graphical display. There are many different viewers available such as UCSC Genome Browser, Integrative Genomics Viewer (IGV), and NCBI Genome Workbench.

To load .bam files, first transfer the data to your computer. Amazon provides documentation on copying files with scp (Linux, macOS) and WinSCP (Windows).

For scp, the command will be approximately: `scp ubuntu@your-instance-name-or-ip-address:/home/ubuntu/data/transcript.fa .` (the final `.` copies the file into the current local directory).

Load annotation in IGV
File -> Load from File ... -> ChIPSeqintersect.bed

# 5. RNA-Seq Analysis
### 5A. HTSeq (High-Throughput Sequencing)

HTSeq is a Python library that facilitates the rapid development of RNA-Seq analyses. HTSeq offers parsers for many common data formats in high-throughput sequencing (HTS) projects, classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information, and variant calls, and data structures that allow for querying via genomic coordinates. In this tutorial we will use htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

In [None]:
from pandas import read_csv

gtf = "data/Saccharomyces_cerevisiae.R64-1-1.84.gtf"

RNASeqSRARunTableFile='data/RNASeqSRA.tsv'
RNASeqSRATable = read_csv(RNASeqSRARunTableFile, delimiter='\t')
RNASeqoutrun = (RNASeqSRATable["Run"]).astype(list)
RNASeqoutputSortBam = "test/" + RNASeqoutrun + ".sorted.bam"

In [None]:
# Count reads per gene for each sorted BAM file (-s no: library is not strand-specific)
for bam in RNASeqoutputSortBam:
    counts = bam + ".genecount.txt"
    !htseq-count -m intersection-nonempty -s no -f bam $bam $gtf > $counts
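
Each `.genecount.txt` file written by htseq-count is a two-column, tab-separated table of feature ID and read count, followed by special counters whose names begin with `__` (for example `__no_feature` and `__ambiguous`). A minimal sketch for reading such a file and separating the two kinds of rows is shown below. The text in `counts_text` is a hypothetical example with invented counts.

```python
import csv
import io

# Hypothetical htseq-count output; the counts are invented.
counts_text = (
    "YAL001C\t152\n"
    "YAL002W\t87\n"
    "YAL003W\t2310\n"
    "__no_feature\t4021\n"
    "__ambiguous\t57\n"
)

gene_counts = {}
special_counts = {}
for feature, count in csv.reader(io.StringIO(counts_text), delimiter="\t"):
    # Rows starting with "__" are summary counters, not genes.
    target = special_counts if feature.startswith("__") else gene_counts
    target[feature] = int(count)

assigned = sum(gene_counts.values())
print(assigned, special_counts)
```

To read a real file, replace `io.StringIO(counts_text)` with an opened file handle.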

### 5B. DESeq (Differential Expression Sequencing)

DESeq estimates the variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.
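
To give a feel for why a negative binomial model is used rather than a Poisson model, the sketch below compares the two variances at several mean counts. It assumes the common mean-dispersion form, variance = mean + dispersion × mean², which may not be the exact parameterization fitted by `scripts/runDeseq.R`. The dispersion value of 0.1 is hypothetical.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Assumed form: variance grows with the square of the mean."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # hypothetical value for illustration
for mean in [10, 100, 1000]:
    print(mean, poisson_variance(mean), negative_binomial_variance(mean, dispersion))
```

The extra term lets the model absorb the sample-to-sample variability of highly expressed genes, which a Poisson model would understate.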

In [None]:
!Rscript scripts/runDeseq.R

### 5C. Visualization

The following script performs principal component analysis and creates volcano plots and bar graphs of RNA-Seq expression.

In [None]:
!Rscript scripts/loadYeastGeneCounts.R
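
A volcano plot places each gene at x = log2 fold change and y = -log10 of its p-value, so genes with large and statistically significant changes appear in the upper corners. The sketch below computes those coordinates for a few genes. It illustrates the general idea only and is not taken from `scripts/loadYeastGeneCounts.R`; the function `volcano_point`, the `pseudocount` parameter, and all gene values are assumptions made for this example.

```python
import math


def volcano_point(mean_a, mean_b, p_value, pseudocount=1.0):
    """Return (log2 fold change of B over A, -log10 p-value) for one gene."""
    log2_fc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    return log2_fc, -math.log10(p_value)


# Hypothetical genes: (mean count in condition A, mean count in condition B, p-value)
genes = {
    "gene_up": (50.0, 400.0, 1e-8),
    "gene_flat": (200.0, 210.0, 0.6),
    "gene_down": (300.0, 40.0, 1e-5),
}

for name, (mean_a, mean_b, p_value) in genes.items():
    x, y = volcano_point(mean_a, mean_b, p_value)
    print(f"{name}: log2FC={x:.2f}, -log10(p)={y:.2f}")
```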