# RNA-seq Analysis Notebook
## Overview
This notebook is a guided walkthrough for building a simple pipeline to analyze an RNA-seq dataset.

The pipeline described here consists of the following steps:
1. **Download** an RNA-seq dataset (ARCHS4)
2. **Normalize expression** data (Variance Stabilizing Transformation)
3. Perform **Dimensionality Reduction** (PCA and t-SNE)
4. Visualize the dataset as a **clustered heatmap** (Clustergrammer)
5. Perform **Differential Gene Expression Analysis** (limma and CD)
6. Perform **Enrichment analysis** (Enrichr)
7. Perform a **Small Molecule Query** (L1000CDS<sup>2</sup>)

## Load Packages

In [1]:
%%capture
# Python packages
import sys
import rpy2
import numpy as np
from plotly.offline import init_notebook_mode

# Initialize Plotly and R magic
init_notebook_mode()
%load_ext rpy2.ipython
%R require(DESeq2)
%R require(limma)

# Custom scripts
sys.path.append('scripts')
import archs4
from plots import *
from signature import *

## 1. Download RNA-seq Dataset
Here we download RNA-seq datasets processed by ARCHS4.

The following datasets are suggested:
* *Homo sapiens* datasets:
    * **Nucleotide stress induction of HEXIM1 suppresses melanoma by modulating cancer cell-specific gene transcription** (GSE68053_GPL16791)
    * **Potent and targeted activation of HIV-1 using the CRISPR/Cas9 activator Complex** (GSE72259_GPL16791)
    * **EZH2 and BCL6 cooperate to assemble CBX8-BCOR Polycomb complex to repress bivalent promoters, mediate germinal center formation and promote lymphomagenesis** (GSE73109_GPL11154)
    
   
* *Mus musculus* datasets:
    * **HEB associates with PRC2 and SMAD2/3 to regulate developmental fates** (GSE60285_GPL13112)
    * **Transcriptomic signatures uncover gene expression differences associated with the development of phenotypic differences in serial organs** (GSE76316_GPL13112)
    * **OSKM induce extraembryonic endoderm stem (iXEN) cells in parallel to iPS cells** (GSE77550_GPL17021)

A full list of datasets processed by ARCHS4 is available in the *archs4_datasets.txt* file.

In [3]:
# Fetch a dataset from the ARCHS4 server. Pass the accession code shown in parentheses in the list above to select the dataset
rawcount_dataframe, sample_metadata_dataframe = archs4.fetch_dataset('GSE73109_GPL11154')

In [4]:
# Display the raw readcount dataframe
rawcount_dataframe.head()

ID_REF,GSM1886845,GSM1886846,GSM1886847,GSM1886848,GSM1886849,GSM1886850,GSM1886851,GSM1886852,GSM1886853,GSM1886854,GSM1886855,GSM1886856,GSM1886857,GSM1886858,GSM1886859,GSM1886860,GSM1886861,GSM1886862
A1BG,214,239,96,131,164,63,340,305,320,194,152,212,254,304,224,174,161,157
A1CF,2,2,1,3,2,0,5,4,1,1,4,2,3,0,2,2,6,0
A2M,2,1,0,4,0,0,2,2,0,1,1,1,3,8,2,9,6,5
A2ML1,4,7,2,12,9,6,3,4,4,13,3,12,14,11,8,17,10,6
A2MP1,2,2,2,0,0,0,8,12,5,0,4,1,2,3,1,4,1,2


In [5]:
# Display the sample metadata dataframe
sample_metadata_dataframe

Sample,cell line,treatment
GSM1886845,SUDHL6,treated with DMSO for 12h
GSM1886846,SUDHL6,treated with DMSO for 12h
GSM1886847,SUDHL6,treated with DMSO for 12h
GSM1886848,SUDHL6,treated with 10uM 79-6.1085 for 12h
GSM1886849,SUDHL6,treated with 10uM 79-6.1085 for 12h
GSM1886850,SUDHL6,treated with 10uM 79-6.1085 for 12h
GSM1886851,Farage,treated with DMSO for 12h
GSM1886852,Farage,treated with DMSO for 12h
GSM1886853,Farage,treated with DMSO for 12h
GSM1886854,Farage,treated with 10uM 79-6.1085 for 12h


## 2. Normalization

Before proceeding with the analysis, we normalize the raw readcount dataset using the **Variance Stabilizing Transformation** (VST) method, from the *DESeq2* package in R.

In [6]:
# Push the dataset to R
%Rpush rawcount_dataframe

# Normalize
%R vst_dataframe <- as.data.frame(varianceStabilizingTransformation(as.matrix(rawcount_dataframe)))

# Pull the dataset from R
%Rpull vst_dataframe

# Display
vst_dataframe.head()

Gene,GSM1886845,GSM1886846,GSM1886847,GSM1886848,GSM1886849,GSM1886850,GSM1886851,GSM1886852,GSM1886853,GSM1886854,GSM1886855,GSM1886856,GSM1886857,GSM1886858,GSM1886859,GSM1886860,GSM1886861,GSM1886862
A1BG,7.432296,7.585719,7.906538,7.114824,7.367192,7.515685,8.004349,7.954705,8.058813,7.339749,7.221488,7.49837,7.767352,7.824397,7.482212,7.441895,7.322329,7.60409
A1CF,2.175147,2.173957,2.468948,2.640235,2.314552,0.883706,2.790828,2.669106,1.829498,1.826317,2.815022,2.209561,2.488948,0.883706,2.168899,2.310407,3.193082,0.883706
A2M,2.175147,1.810731,0.883706,2.877175,0.883706,0.883706,2.139604,2.183849,0.883706,1.826317,1.901842,1.837125,2.488948,3.199711,2.168899,3.610595,3.193082,3.233832
A2ML1,2.657766,3.1413,3.035112,3.985136,3.61727,4.364832,2.400114,2.669106,2.689196,3.813224,2.583887,3.751074,3.932969,3.51876,3.261509,4.332729,3.716329,3.414973
A2MP1,2.175147,2.173957,3.035112,0.883706,0.883706,0.883706,3.213892,3.705472,2.874809,0.883706,2.815022,1.837125,2.215275,2.389895,1.806986,2.832915,1.90874,2.460303


## 3. Dimensionality Reduction

### 3.1 PCA
First, we perform a **Principal Component Analysis** (PCA) on the dataset, reducing it to two or three dimensions.  To achieve this, use the `PCA` class in the `sklearn.decomposition` module of the Python package *sklearn*. Reference code is available at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

In [None]:
# Insert code to perform PCA here

In [None]:
# Plot using one of the following functions
# plot_2d_scatter(x, y, sample_names)
# plot_3d_scatter(x, y, z, sample_names)
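One possible way to fill in the PCA cell is sketched below. It is a minimal example on synthetic data, not the notebook's reference solution: `toy_expression` is a hypothetical stand-in for `vst_dataframe` (genes as rows, samples as columns), and the variable names `x`, `y`, `z` and `sample_names` are chosen only to match the arguments listed for the plotting helpers.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for vst_dataframe: 200 genes (rows) x 6 samples (columns).
rng = np.random.default_rng(0)
toy_expression = pd.DataFrame(
    rng.normal(size=(200, 6)),
    index=[f"gene_{i}" for i in range(200)],
    columns=[f"sample_{j}" for j in range(6)],
)

# scikit-learn expects one row per observation, so transpose to samples x genes.
pca = PCA(n_components=3)
pca_coordinates = pca.fit_transform(toy_expression.T.values)

# One coordinate vector per principal component, plus matching sample labels.
x, y, z = pca_coordinates[:, 0], pca_coordinates[:, 1], pca_coordinates[:, 2]
sample_names = toy_expression.columns.tolist()

# Fraction of the total variance captured by each component.
print(pca.explained_variance_ratio_)
```

With the real data, `toy_expression` would be replaced by `vst_dataframe`; the transpose matters because the expression tables in this notebook store genes as rows.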

### 3.2 t-SNE
Second, we perform **t-Distributed Stochastic Neighbor Embedding** (t-SNE) on the dataset, reducing it to two or three dimensions.  To achieve this, use the `TSNE` class in the `sklearn.manifold` module of the Python package *sklearn*. Reference code is available at http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

In [None]:
# Insert code to perform t-SNE here

In [None]:
# Plot using one of the following functions
# plot_2d_scatter(x, y, sample_names)
# plot_3d_scatter(x, y, z, sample_names)
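A minimal t-SNE sketch follows, again on synthetic data. `toy_expression` is a hypothetical stand-in for `vst_dataframe` with 18 samples, and the perplexity of 5 is an assumption chosen for illustration. Recent scikit-learn versions require the perplexity to be smaller than the number of samples, so the default of 30 may not work for a dataset of this size.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

# Synthetic stand-in for vst_dataframe: 50 genes (rows) x 18 samples (columns).
rng = np.random.default_rng(0)
toy_expression = pd.DataFrame(
    rng.normal(size=(50, 18)),
    index=[f"gene_{i}" for i in range(50)],
    columns=[f"sample_{j}" for j in range(18)],
)

# Samples must be rows; perplexity is kept well below the number of samples.
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
tsne_coordinates = tsne.fit_transform(toy_expression.T.values)

x, y = tsne_coordinates[:, 0], tsne_coordinates[:, 1]
sample_names = toy_expression.columns.tolist()
print(tsne_coordinates.shape)
```

Fixing `random_state` makes the embedding reproducible between runs, which helps when comparing plots.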

## 4. Clustergrammer
Next, we generate an **interactive clustered heatmap** to explore the most variable genes in the dataset.  To achieve this, we use the *Clustergrammer* package - reference code is available at http://clustergrammer.readthedocs.io/clustergrammer_widget.html#clustergrammer-widget-workflow-example.

In [None]:
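# Insert code to create widget here
# Steps: (1) create Network object, (2) load dataframe
#        (3) filter top 500 genes by variance, (4) Z-score normalize the rows
#            (filter first: Z-scored rows all have unit variance, so ranking them by variance afterwards is not informative)
#        (5) cluster the heatmap, (6) display the widget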
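The preprocessing behind the heatmap (selecting the most variable genes and Z-scoring each row) can be illustrated with plain pandas. This sketch does not call the Clustergrammer widget, and `toy_expression` and `top_variable_zscores` are hypothetical names. The variance filter is applied before Z-scoring because every Z-scored row has unit variance.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for vst_dataframe: 1000 genes x 6 samples,
# with a different spread per gene so that variances differ.
rng = np.random.default_rng(0)
gene_scale = rng.uniform(0.1, 3.0, size=(1000, 1))
toy_expression = pd.DataFrame(
    rng.normal(size=(1000, 6)) * gene_scale,
    index=[f"gene_{i}" for i in range(1000)],
    columns=[f"sample_{j}" for j in range(6)],
)

def top_variable_zscores(expression, n_genes=500):
    """Keep the n_genes most variable rows, then Z-score each row."""
    # Rank genes by their variance across samples.
    top_genes = expression.var(axis=1).nlargest(n_genes).index
    filtered = expression.loc[top_genes]
    # Z-score: subtract the row mean and divide by the row standard deviation.
    centered = filtered.sub(filtered.mean(axis=1), axis=0)
    return centered.div(filtered.std(axis=1), axis=0)

heatmap_matrix = top_variable_zscores(toy_expression, n_genes=500)
print(heatmap_matrix.shape)
```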

## 5. Differential Expression Analysis
Here we identify **Differentially Expressed Genes** (DEGs) using two approaches: limma and Characteristic Direction.  

To achieve this, we need to select two sets of samples:
* a group of *experimental / treated samples*
* a second group of *control / untreated samples*
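The two groups can be derived from the sample metadata rather than typed by hand. The sketch below is one way to do this, assuming a metadata table shaped like the one displayed in section 1. The toy table copies the six SUDHL6 rows shown there, and `split_samples` is a hypothetical helper, not part of the notebook's scripts.

```python
import pandas as pd

# Toy copy of the SUDHL6 rows of the sample metadata shown in section 1.
sample_metadata = pd.DataFrame(
    {
        "cell line": ["SUDHL6"] * 6,
        "treatment": ["treated with DMSO for 12h"] * 3
        + ["treated with 10uM 79-6.1085 for 12h"] * 3,
    },
    index=["GSM1886845", "GSM1886846", "GSM1886847",
           "GSM1886848", "GSM1886849", "GSM1886850"],
)

def split_samples(metadata, column, control_label):
    """Return (experimental, control) sample-name lists based on one metadata column."""
    is_control = metadata[column] == control_label
    control = metadata.index[is_control].tolist()
    experimental = metadata.index[~is_control].tolist()
    return experimental, control

experimental_samples, control_samples = split_samples(
    sample_metadata, column="treatment", control_label="treated with DMSO for 12h"
)
print(experimental_samples)
print(control_samples)
```

When a dataset contains several cell lines, it is usually safer to restrict the metadata to one cell line before splitting, so that the comparison is not confounded by cell line differences.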

### 5.1 limma
First, we perform the analysis using the *limma* R package.  Reference documentation is available at https://bioconductor.org/packages/release/bioc/html/limma.html.

In [None]:
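# Run limma using a Python wrapper
# Example groups: SUDHL6 samples from the metadata table above; replace with your own selection
limma_dataframe = compute_signature(rawcount_dataframe,
                                    method = 'limma',
                                    experimental_samples = ['GSM1886848', 'GSM1886849', 'GSM1886850'], # list of experimental sample names (10uM 79-6.1085)
                                    control_samples = ['GSM1886845', 'GSM1886846', 'GSM1886847'], # list of control sample names (DMSO)
                                    )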

In [None]:
# Explore results
limma_dataframe.head()

##### Volcano Plot
The volcano plot is a common way to display the results of a differential gene expression analysis.  It displays logFC on the x axis and -log10 of the adjusted P-value on the y axis.

In [None]:
# plot_2d_scatter is available directly through `from plots import *`
plot_2d_scatter(x = limma_dataframe['logFC'],
                y = -np.log10(limma_dataframe['adj.P.Val']),
                text = limma_dataframe.index,
                xlab = 'logFC',
                ylab = '-log10(adj. P)')
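To make the volcano plot easier to read, genes are often labelled as up- or down-regulated using cutoffs on both axes. The sketch below illustrates this on a toy table with the `logFC` and `adj.P.Val` columns used above. The cutoffs (absolute logFC of at least 1, adjusted P-value below 0.05) are common conventions assumed here for illustration, and `classify_genes` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Toy results table with the two limma columns used for the volcano plot.
toy_limma = pd.DataFrame(
    {"logFC": [2.5, -3.0, 0.2, 1.8], "adj.P.Val": [0.001, 0.0001, 0.8, 0.2]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)

def classify_genes(results, logfc_cutoff=1.0, padj_cutoff=0.05):
    """Label each gene as 'up', 'down' or 'not significant'."""
    significant = results["adj.P.Val"] < padj_cutoff
    labels = pd.Series("not significant", index=results.index)
    labels[significant & (results["logFC"] >= logfc_cutoff)] = "up"
    labels[significant & (results["logFC"] <= -logfc_cutoff)] = "down"
    return labels

# y coordinate of the volcano plot: larger values mean smaller adjusted P-values.
volcano_y = -np.log10(toy_limma["adj.P.Val"])
labels = classify_genes(toy_limma)
print(labels)
```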

##### MA Plot
The MA plot is another common way to display the results of a differential gene expression analysis.  It displays average normalized expression on the x axis and logFC on the y axis.

In [None]:
# plot_2d_scatter is available directly through `from plots import *`
plot_2d_scatter(x = limma_dataframe['AveExpr'],
                y = limma_dataframe['logFC'],
                text = limma_dataframe.index,
                xlab = 'AveExpr',
                ylab = 'logFC')

### 5.2 Characteristic Direction
Second, we calculate a differential gene expression signature using the *Characteristic Direction* (CD) method, which has been shown to outperform other methods in identifying DEGs in the context of transcription factor (TF) and drug perturbation responses (Clark et al., 2014, [link](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-79)).

In [None]:
# Run CD using a Python wrapper
# Example groups: SUDHL6 samples from the metadata table above; replace with your own selection
cd_dataframe = compute_signature(rawcount_dataframe,
                                 method = 'CD',
                                 experimental_samples = ['GSM1886848', 'GSM1886849', 'GSM1886850'], # list of experimental sample names (10uM 79-6.1085)
                                 control_samples = ['GSM1886845', 'GSM1886846', 'GSM1886847'], # list of control sample names (DMSO)
                                 )

In [None]:
# Explore results
cd_dataframe.head()

## 6. Enrichment Analysis
We now use the differential gene expression signature computed with CD to perform **enrichment analysis** on the most overexpressed and most underexpressed genes using the *Enrichr* API.

A reference on how to use the API in Python is available at http://amp.pharm.mssm.edu/Enrichr/help#api.

In [None]:
# Write code to upload gene lists to the Enrichr API
# Steps: (1) sort the genes by the CD value, (2) take the top 500 and bottom 500 genes,
# (3) perform a POST request as shown in the manual.
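The three steps can be sketched as follows. This is a hedged outline, not a verified client: the `addList` endpoint and the `list` / `description` form fields follow the Enrichr help page as best recalled and should be checked against it, the base URL is taken from the reference link above, and both helper names are hypothetical. The input is a plain mapping from gene symbol to CD value, which could be built from the relevant column of `cd_dataframe` (the column name depends on `compute_signature` and is not assumed here).

```python
def top_and_bottom_genes(cd_values, n_genes=500):
    """Split a {gene: CD value} mapping into the most up- and down-regulated genes."""
    # Step 1: sort genes from the highest to the lowest CD value.
    ranked = sorted(cd_values, key=cd_values.get, reverse=True)
    # Step 2: take the first n_genes (up) and the last n_genes (down, most negative first).
    up_genes = ranked[:n_genes]
    down_genes = ranked[::-1][:n_genes]
    return up_genes, down_genes

def submit_to_enrichr(genes, description, base_url="http://amp.pharm.mssm.edu/Enrichr"):
    """Step 3: upload one gene list (assumed endpoint and field names, see lead-in)."""
    import requests  # third-party; imported lazily so the helper above works offline
    payload = {"list": (None, "\n".join(genes)), "description": (None, description)}
    response = requests.post(base_url + "/addList", files=payload)
    response.raise_for_status()
    return response.json()

# Toy signature with four genes; n_genes is reduced accordingly.
toy_cd_values = {"GENE_A": 0.5, "GENE_B": -0.4, "GENE_C": 0.1, "GENE_D": -0.05}
up_genes, down_genes = top_and_bottom_genes(toy_cd_values, n_genes=2)
print(up_genes, down_genes)
```

The upload itself is left uncalled so that the example runs offline; with network access, `submit_to_enrichr(up_genes, "upregulated genes")` would be called once per list.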

## 7. Small Molecule Query
Finally, we use the differential gene expression signature computed with CD to identify **small molecules which mimic or reverse** the observed pattern using the *L1000CDS<sup>2</sup>* API.

A reference on how to use the API in Python is available at http://amp.pharm.mssm.edu/L1000CDS2/help/#api.

In [None]:
# Write code to upload the signature to the L1000CDS2 API using the "gene-set search" mode
# Steps: (1) sort the genes by the CD value, (2) take the top 500 and bottom 500 genes,
# (3) perform a POST request as shown in the manual (gene-set search example).
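A gene-set query could be assembled along the following lines. The payload layout (`data`, `config`, `meta`, and the keys inside them) and the `/query` endpoint reflect the gene-set example on the L1000CDS<sup>2</sup> help page as best recalled, so treat them as assumptions to verify against the manual. The helper names and the `mimic` flag are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint, derived from the base URL in the reference link above.
L1000CDS2_URL = "http://amp.pharm.mssm.edu/L1000CDS2/query"

def build_geneset_payload(up_genes, down_genes, mimic=False):
    """Build the JSON body for a gene-set search (assumed field names, see lead-in)."""
    return {
        "data": {
            # Gene symbols are upper-cased before submission.
            "upGenes": [gene.upper() for gene in up_genes],
            "dnGenes": [gene.upper() for gene in down_genes],
        },
        "config": {
            "aggravate": mimic,  # assumed: True looks for mimickers, False for reversers
            "searchMethod": "geneSet",
            "share": False,
            "combination": False,
            "db-version": "latest",
        },
        "meta": [{"key": "Tag", "value": "RNA-seq notebook gene-set query"}],
    }

def query_l1000cds2(payload, url=L1000CDS2_URL):
    """Send the payload as JSON and return the decoded response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Toy gene lists; the real ones come from the top and bottom 500 genes by CD value.
payload = build_geneset_payload(["Stat1", "IRF7"], ["myc"], mimic=False)
print(json.dumps(payload, indent=2))
```

The request function is defined but not called, so the example runs without network access; `query_l1000cds2(payload)` would perform the actual search.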