# RNA-seq Analysis Notebook
## Overview
This notebook is a guided walkthrough for building a simple pipeline to analyze an RNA-seq dataset.

The pipeline described here consists of the following steps:
1. **Download** an RNA-seq dataset (ARCHS4)
2. **Normalize expression** data (Variance Stabilizing Transformation)
3. Perform **Dimensionality Reduction** (PCA and t-SNE)
4. Visualize the dataset as a **clustered heatmap** (Clustergrammer)
5. Perform **Differential Gene Expression Analysis** (limma and CD)
6. Perform **Enrichment analysis** (Enrichr)

## Load Packages

In [7]:
%%capture
# Python packages
import sys
import rpy2
import numpy as np
from plotly.offline import init_notebook_mode

# Initialize Plotly and R magic
%load_ext rpy2.ipython
%R require(DESeq2)
%R require(limma)

# Custom scripts
sys.path.append('scripts')
import archs4
from plots import *
from signature import *

In [2]:
# Activate Plotly
init_notebook_mode()

# Set display option for pandas (None disables column-width truncation)
import pandas as pd
pd.set_option('display.max_colwidth', None)

## 1. Download RNA-seq Dataset
Here we download RNA-seq datasets processed by ARCHS4.

The following datasets are suggested:
* **Nucleotide stress induction of HEXIM1 suppresses melanoma by modulating cancer cell-specific gene transcription** (GSE68053_GPL16791)
* **Potent and targeted activation of HIV-1 using the CRISPR/Cas9 activator Complex** (GSE72259_GPL16791)
* **EZH2 and BCL6 cooperate to assemble CBX8-BCOR Polycomb complex to repress bivalent promoters, mediate germinal center formation and promote lymphomagenesis** (GSE73109_GPL11154)
* **HEB associates with PRC2 and SMAD2/3 to regulate developmental fates** (GSE60285_GPL13112)
* **EGFR Mutation Promotes Glioblastoma Through Epigenome and Transcription Factor Network Remodeling** (GSE72468_GPL11154)
* **OSKM induce extraembryonic endoderm stem (iXEN) cells in parallel to iPS cells** (GSE77550_GPL17021)

A full list of datasets processed by ARCHS4 is available in the *archs4_datasets.txt* file.

In [None]:
# Fetch a dataset from the ARCHS4 server.  Replace the placeholder string with a dataset code
# from the list above (the identifier in parentheses, e.g. 'GSE68053_GPL16791')
rawcount_dataframe, sample_metadata_dataframe = archs4.fetch_dataset('Insert dataset code here.')

In [None]:
# Display the raw readcount dataframe
rawcount_dataframe.head()

In [None]:
# Display the sample metadata dataframe
sample_metadata_dataframe

## 2. Normalization

Before proceeding with the analysis, we normalize the raw readcount dataset using the **Variance Stabilizing Transformation** (VST) method, from the *DESeq2* package in R.

In [None]:
# Push the dataset to R
%Rpush rawcount_dataframe

# Normalize
%R vst_dataframe <- as.data.frame(varianceStabilizingTransformation(as.matrix(rawcount_dataframe)))

# Pull the dataset from R
%Rpull vst_dataframe

# Display
vst_dataframe.head()

## 3. Dimensionality Reduction

### 3.1 PCA
First, we perform a **Principal Component Analysis** (PCA) on the dataset, reducing it to two or three dimensions.  To achieve this, use the PCA class in the Python package *scikit-learn* (imported as *sklearn*); reference documentation is available at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

In [None]:
# Insert code to perform PCA here

In [None]:
# Plot using one of the following functions
# plot_2d_scatter(x, y)
# plot_3d_scatter(x, y, z)
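
As a starting point, here is a minimal sketch of the PCA step. It assumes the expression matrix has genes as rows and samples as columns, and it uses a small synthetic dataframe (`expr`, a hypothetical stand-in for `vst_dataframe`) so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for vst_dataframe: 200 genes (rows) x 6 samples (columns)
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(size=(200, 6)),
    index=[f"gene{i}" for i in range(200)],
    columns=[f"sample{j}" for j in range(6)],
)

# scikit-learn expects one row per observation, so transpose to samples x genes
pca = PCA(n_components=3)
coords = pca.fit_transform(expr.T.values)

# One row per sample, one column per principal component
pca_dataframe = pd.DataFrame(coords, index=expr.columns, columns=["PC1", "PC2", "PC3"])

# Fraction of the total variance captured by each component
explained = pca.explained_variance_ratio_
```

The columns of `pca_dataframe` can then be passed as `x`, `y` (and `z`) to the plotting functions listed above.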

### 3.2 t-SNE
Second, we perform **t-Distributed Stochastic Neighbor Embedding** (t-SNE) on the dataset, reducing it to two or three dimensions.  To achieve this, use the TSNE class in the Python package *scikit-learn* (imported as *sklearn*); reference documentation is available at http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

In [None]:
# Insert code to perform t-SNE here

In [None]:
# Plot using one of the following functions
# plot_2d_scatter(x, y)
# plot_3d_scatter(x, y, z)
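
A minimal sketch of the t-SNE step follows, again on a synthetic genes x samples dataframe (`expr`, a hypothetical stand-in for `vst_dataframe`). The perplexity of 3 is an assumption chosen only because the toy dataset has 12 samples; scikit-learn requires the perplexity to be smaller than the number of samples, so it usually needs tuning for real data.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Synthetic stand-in for vst_dataframe: 200 genes (rows) x 12 samples (columns)
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(size=(200, 12)),
    index=[f"gene{i}" for i in range(200)],
    columns=[f"sample{j}" for j in range(12)],
)

# Transpose so that each sample is one observation; fix the seed for reproducibility
tsne = TSNE(n_components=2, perplexity=3, random_state=0)
coords = tsne.fit_transform(expr.T.values)

# One row per sample, one column per embedding dimension
tsne_dataframe = pd.DataFrame(coords, index=expr.columns, columns=["tSNE1", "tSNE2"])
```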

## 4. Clustergrammer
Next, we generate an **interactive clustered heatmap** to explore the most variable genes in the dataset.  To achieve this, we use the *Clustergrammer* package - reference code is available at http://clustergrammer.readthedocs.io/clustergrammer_widget.html#clustergrammer-widget-workflow-example.

In [None]:
# Insert code to create widget here
# Steps: (1) create Network object, (2) load dataframe
#        (3) filter top 500 genes by variance, (4) Z-score normalize the rows
#        (5) cluster the heatmap, (6) display the widget
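
The filtering and normalization steps can also be prototyped in plain pandas before handing the data to Clustergrammer. The sketch below is not the Clustergrammer API; it is an illustration on synthetic data, and `top_variable_zscored` is a hypothetical helper name. It ranks genes by variance before Z-scoring, because every Z-scored row has unit variance and could no longer be ranked.

```python
import numpy as np
import pandas as pd

# Synthetic genes x samples matrix in which genes have different spreads
rng = np.random.default_rng(0)
scale = rng.uniform(0.1, 3.0, size=(1000, 1))
expr = pd.DataFrame(
    rng.normal(size=(1000, 6)) * scale,
    index=[f"gene{i}" for i in range(1000)],
    columns=[f"sample{j}" for j in range(6)],
)

def top_variable_zscored(df, n_genes=500):
    """Keep the n_genes most variable rows, then Z-score each row."""
    # Rank on the unscaled data: after Z-scoring all rows have variance 1
    top_genes = df.var(axis=1).nlargest(n_genes).index
    subset = df.loc[top_genes]
    # Subtract the row mean and divide by the row standard deviation
    centered = subset.sub(subset.mean(axis=1), axis=0)
    return centered.div(subset.std(axis=1), axis=0)

heatmap_input = top_variable_zscored(expr, n_genes=500)
```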

## 5. Differential Expression Analysis
Here we identify **Differentially Expressed Genes** (DEGs) using two approaches: limma and Characteristic Direction.  

To achieve this, we need to select two sets of samples:
* a group of *experimental / treated samples*
* a second group of *control / untreated samples*

### 5.1 limma
First, we perform the analysis using the *limma* R package.  Reference documentation is available at https://bioconductor.org/packages/release/bioc/html/limma.html.

In [None]:
# Run limma using a Python wrapper
limma_dataframe = compute_signature(rawcount_dataframe,
                                    method='limma',
                                    experimental_samples=[],  # insert list of experimental sample names
                                    control_samples=[],  # insert list of control sample names
                                    )

In [None]:
# Explore results
limma_dataframe.head()

##### Volcano Plot
The volcano plot is a common way to display the results of a differential gene expression analysis.  It displays logFC on the x axis and -log10(P-value) on the y axis, so that the most significant genes appear at the top of the plot.

In [None]:
# Create a volcano plot using the following function:
# plot_2d_scatter(x, y)
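
The two volcano plot coordinates can be derived as in the sketch below. The column names `logFC` and `P.Value` follow limma's usual output naming; whether `limma_dataframe` uses exactly these names is an assumption, so check `limma_dataframe.columns` first. The small `results` table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical results table using limma-style column names
results = pd.DataFrame(
    {"logFC": [2.0, -1.5, 0.1], "P.Value": [1e-6, 1e-3, 0.5]},
    index=["geneA", "geneB", "geneC"],
)

# x axis: log fold change; y axis: negative log10 of the P-value
volcano_x = results["logFC"]
volcano_y = -np.log10(results["P.Value"])
```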

##### MA Plot
The MA plot is another common way to display the results of a differential gene expression analysis.  It displays average normalized expression on the x axis and logFC on the y axis.

In [None]:
# Create a MA plot using the following function:
# plot_2d_scatter(x, y)
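
To make the two MA plot axes concrete, the sketch below computes them directly from a toy log-scale expression table. This is a simplification for illustration: in practice the values would come from the limma results, and the sample names used here are hypothetical.

```python
import pandas as pd

# Toy log-scale expression values: 2 genes x 4 samples
log_expr = pd.DataFrame(
    {
        "ctrl1": [4.0, 8.0],
        "ctrl2": [6.0, 8.0],
        "trt1": [7.0, 6.0],
        "trt2": [9.0, 8.0],
    },
    index=["geneA", "geneB"],
)
control_samples = ["ctrl1", "ctrl2"]
experimental_samples = ["trt1", "trt2"]

# y axis (M): difference of group means on the log scale, i.e. the log fold change
ma_y = log_expr[experimental_samples].mean(axis=1) - log_expr[control_samples].mean(axis=1)

# x axis (A): average expression across all samples
ma_x = log_expr[control_samples + experimental_samples].mean(axis=1)
```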

### 5.2 Characteristic Direction
Second, we calculate a differential gene expression signature using the *Characteristic Direction* method, which has been shown to outperform other methods at identifying DEGs in the context of transcription factor (TF) and drug perturbation responses (Clark et al., 2014, [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-79)).

In [None]:
# Run CD using a Python wrapper
cd_dataframe = compute_signature(rawcount_dataframe,
                                 method='CD',
                                 experimental_samples=[],  # insert list of experimental sample names
                                 control_samples=[],  # insert list of control sample names
                                 )

In [None]:
# Explore results
cd_dataframe.head()

## 6. Enrichment Analysis
We now use the differential gene expression signature computed with CD to perform **enrichment analysis** on the most overexpressed and most underexpressed genes using the *Enrichr* API.

A reference on how to use the API from Python is available at http://amp.pharm.mssm.edu/Enrichr/help#api.

### 6.1 Calculate Upregulated and Downregulated Gene Sets

In [3]:
# Write code to extract gene lists for enrichment analysis
# Steps: (1) sort the genes by the CD value, (2) take the 500 top and the 500 bottom genes
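
One way to carry out these two steps is sketched below. It assumes the signature is a dataframe indexed by gene symbol with a numeric column named `CD`; that column name, the synthetic `signature` table, and the helper `extract_genesets` are illustrative assumptions, so adapt them to the actual layout of `cd_dataframe`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for cd_dataframe: one CD coefficient per gene
rng = np.random.default_rng(0)
signature = pd.DataFrame(
    {"CD": rng.normal(size=2000)},
    index=[f"GENE{i}" for i in range(2000)],
)

def extract_genesets(signature, column="CD", n_genes=500):
    """Return the n_genes most positive and most negative genes."""
    # Sort from most overexpressed to most underexpressed
    ranked = signature[column].sort_values(ascending=False)
    return {
        "upregulated": ranked.index[:n_genes].tolist(),
        "downregulated": ranked.index[-n_genes:].tolist(),
    }

genesets = extract_genesets(signature, column="CD", n_genes=500)
```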

### 6.2 Upload Gene Sets to Enrichr Using the API

In [4]:
# Upload genesets to Enrichr as shown in the manual.

Once the gene list has been submitted, you can view the enrichment results by appending the 'shortId' returned by the API to the end of the following URL: http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=.
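
The upload and the construction of the results URL might look like the sketch below. It follows the `addList` pattern described in the Enrichr API help (a multipart form with `list` and `description` fields), but the helper names are hypothetical and the endpoint behavior should be checked against the current manual. The network call is isolated in `submit_to_enrichr` so the other helpers can be tried offline.

```python
ENRICHR_BASE = "http://amp.pharm.mssm.edu/Enrichr"

def build_enrichr_payload(genes, description):
    """Build the multipart form fields: one gene symbol per line."""
    return {
        "list": (None, "\n".join(genes)),
        "description": (None, description),
    }

def enrichr_result_url(short_id):
    """Append the shortId to the results URL given in the text above."""
    return f"{ENRICHR_BASE}/enrich?dataset={short_id}"

def submit_to_enrichr(genes, description="RNA-seq notebook gene list"):
    """Upload a gene list; the JSON reply is expected to contain 'shortId'."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.post(
        f"{ENRICHR_BASE}/addList",
        files=build_enrichr_payload(genes, description),
    )
    response.raise_for_status()
    return response.json()

# Example (requires network access):
# reply = submit_to_enrichr(["STAT3", "TP53", "MYC"])
# print(enrichr_result_url(reply["shortId"]))
```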

## 7. Small Molecule Query
Finally, we use the differential gene expression signature computed with CD to identify **small molecules which mimic or reverse** the observed pattern using the *L1000CDS<sup>2</sup>* API.

A reference on how to use the API from Python is available at http://amp.pharm.mssm.edu/L1000CDS2/help/#api.

In [5]:
# Upload genesets to L1000CDS2 as shown in the manual (gene-set search example).

Once the gene list has been submitted, you can view the results by appending the 'shareId' returned by the API to the end of the following URL: http://amp.pharm.mssm.edu/L1000CDS2/#/result/.
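
A sketch of the gene-set query is given below. The payload layout (`data`, `config`, `meta`) and the `/query` endpoint follow the gene-set search example in the L1000CDS2 help as best recalled here, so treat the field names as assumptions to verify against the manual; the helper names are hypothetical. In that example, `aggravate` set to true searches for small molecules that mimic the signature and false for those that reverse it.

```python
import json

L1000CDS2_BASE = "http://amp.pharm.mssm.edu/L1000CDS2"

def build_l1000cds2_payload(up_genes, down_genes, mimic=True):
    """Assemble the JSON body for a gene-set search (field names assumed from the manual)."""
    return {
        "data": {
            # Gene symbols are upper-cased, as in the manual's example
            "upGenes": [gene.upper() for gene in up_genes],
            "dnGenes": [gene.upper() for gene in down_genes],
        },
        "config": {
            "aggravate": mimic,  # True: mimic the signature, False: reverse it
            "searchMethod": "geneSet",
            "share": True,  # request a shareId for the results page
            "combination": False,
            "db-version": "latest",
        },
        "meta": [{"key": "Tag", "value": "RNA-seq notebook"}],
    }

def l1000cds2_result_url(share_id):
    """Append the shareId to the results URL given in the text above."""
    return f"{L1000CDS2_BASE}/#/result/{share_id}"

def submit_to_l1000cds2(up_genes, down_genes, mimic=True):
    """Send the query; the JSON reply is expected to contain 'shareId'."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.post(
        f"{L1000CDS2_BASE}/query",
        data=json.dumps(build_l1000cds2_payload(up_genes, down_genes, mimic)),
        headers={"content-type": "application/json"},
    )
    response.raise_for_status()
    return response.json()
```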