<a href="https://colab.research.google.com/github/ICBI/BISR_Tutorials/blob/main/RNA_seq/RNAseq_CompareGroups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Group comparison analysis example for VRE
Example using TCGA ESCA (esophageal cancer) data obtained from the Broad Institute TCGA data repository FireBrowse (<http://firebrowse.org>)

### Assumptions for analysis

* The user can program well enough to clean the data and put it in the right format
* Two input data files: gene expression data from RNA-seq (raw counts) and clinical data
* RNA-seq file format: rows are features (gene names), columns are samples. Only tumor samples were chosen for this analysis
* RNA-seq file format: features are assumed to be in the format `GeneName|GeneID`. The example shown here is from public TCGA data obtained from FireBrowse.
* Clinical data: a cleaned file that contains two columns, patient IDs and vital status (Dead/Alive)
* The user knows how to identify the baseline and comparison groups for the analysis
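
The `GeneName|GeneID` feature format assumed above can be split with a few lines of plain Python. This is only a sketch: the helper name `split_feature` is hypothetical, and the two labels are illustrative (the second mimics a row that has no gene symbol).

```python
def split_feature(feature):
    """Split a 'GeneName|GeneID' feature label into (gene_name, gene_id)."""
    name, sep, gene_id = feature.partition("|")
    if not sep or not gene_id:
        raise ValueError(f"expected 'GeneName|GeneID', got {feature!r}")
    return name, gene_id


# Illustrative labels only, not read from the tutorial's input file.
labels = ["TP53|7157", "?|100130426"]
parsed = [split_feature(label) for label in labels]
print(parsed)
```

Raising an error on a malformed label, rather than returning a partial result, makes formatting problems visible before the analysis starts.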

## Goal of this analysis
* Use the gene expression data from RNA-seq (raw counts) obtained from tumor tissue of esophageal cancer patients to compare Dead and Alive patients.
* Baseline group (reference group, better outcome) = Alive
* Comparison group (worse outcome) = Dead

## Analysis Steps

* Read in the gene expression data
* Read in the clinical data file
* Use the clinical data to separate the gene expression data into baseline and comparison groups
* Set labels for the groups
* Call the function that performs the group comparison analysis. This function uses the Bioconductor package `edgeR` <https://bioconductor.org/packages/release/bioc/html/edgeR.html>. The package requires the RNA-seq data to be in the form of raw counts.
* Read in the results of the group comparison analysis
* Select a threshold for shortlisting results. An ideal threshold leaves fewer than roughly 700 to 1000 genes.
* Call the function that performs enrichment analysis. This is done using the `enrichR` R package <https://cran.r-project.org/web/packages/enrichR/index.html>

### Step 0 - Before we start
* Use the terminal (shell) to copy the input files from the Google Drive folder `MyDrive/rna_seq`
* Upload the code files to the Google Drive base folder `MyDrive/rna_seq`


In [None]:
#mount google drive
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [None]:
# Install rpy2 so that R code can run inside this Python notebook
#!pip uninstall rpy2 -y  # uninstall any old version first, if needed
!pip install rpy2==3.5.1
%load_ext rpy2.ipython


In [5]:
%%R
# Test R: the %%R cell magic on the first line marks this cell as R code in a Python notebook

test1 <- 5
print(test1)

[1] 5


### Install packages

In [None]:
%%R
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
install.packages("openxlsx")
BiocManager::install("edgeR")
library(edgeR)
library(openxlsx)
library(stringr)



### Location of input files

In [7]:
%%R
# Use raw.githubusercontent.com URLs: github.com "blob" URLs return an HTML page, not the file contents
enrichRFileLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/20201203-EnrichR-Databases.txt"

geneExpLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/20210318_TCGA_ESCA_RNAseq_GeneExp_RawCounts_Tumor.tsv"

clinFileLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/TCGA_ESCA_ClinDataVitalStatus.tsv"

groupCompOutputLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/output_edgeR_ExactTest.csv"
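
A link copied from a GitHub file page has the form `github.com/<owner>/<repo>/blob/<branch>/<path>` and serves a web page, so a reader function pointed at it receives HTML instead of the data. The sketch below, written as a plain Python cell (no `%%R` magic), rewrites such a link into its `raw.githubusercontent.com` form. The helper name `to_raw_url` is hypothetical, and the code assumes the branch name contains no `/`.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url):
    """Rewrite a github.com 'blob' page URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: owner/repo/blob/branch/path/to/file
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url  # not a blob page URL; leave it unchanged
    owner, repo, _, *rest = segments  # rest = [branch, path, to, file]
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))


blob = "https://github.com/ICBI/BISR_Tutorials/blob/main/RNA_seq/input/TCGA_ESCA_ClinDataVitalStatus.tsv"
print(to_raw_url(blob))
```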

#### Load function R code files

In [6]:
%%R
## Load the function that runs edgeR
source("https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/funcSubsetForEdgeR.R")

## Load the function that subsets gene results and runs enrichR for each option
source("https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/funcSubsetForEnrichR.R")


### **Part 1 - Group comparison analysis**

#### Read in gene expression file

In [8]:
%%R
# 20531 rows (features), 173 columns (tumor samples)
geneExp <- read.csv(file = geneExpLocation,
                    sep = "\t", stringsAsFactors = FALSE, row.names = 1)

# Start of the first 20 rows - no gene names
head(geneExp[1:20, 1:4])

# Start of rows 5000-5010 - gene names are visible
head(geneExp[5000:5010, 1:3])


#### Read in cleaned clinical data

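In [9]:
%%R
# 185 sample IDs
# header = FALSE is assumed here: the file has no header row, so the columns
# are named V1 (patient ID) and V2 (vital status), as used in the cells below
clinData <- read.csv(file = clinFileLocation,
                     sep = "\t", header = FALSE, stringsAsFactors = FALSE)

head(clinData)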




#### Subset clinical data into two data frames - dead and alive

In [None]:
%%R

## Subset clinical data patients into two data frames

clinDataAlive <- subset(clinData, V2 == "alive") #128 patients
clinDataDead <- subset(clinData, V2 == "dead") #57 patients dead


In [None]:
%%R
#checks
dim(clinDataAlive)

[1] 128   2


#### Subset gene expression data into two data frames - dead and alive

In [None]:
%%R
## Subset gene exp data into two data frames - alive and dead
#Make sure data is numerical format, not strings

#Subset alive patients
matchBaseline <- which(colnames(geneExp) %in% clinDataAlive$V1)
geneExpBaseline <- geneExp[,matchBaseline] #124 samples alive

#Subset dead patients
matchComparison <- which(colnames(geneExp) %in% clinDataDead$V1)
geneExpComparison<- geneExp[,matchComparison] # 49 samples dead

#Get number of columns
nColBaseline <- ncol(geneExpBaseline)
nColComparision <- ncol(geneExpComparison)

In [None]:
%%R

#checks
dim(geneExpBaseline)
dim(geneExpComparison)

[1] 20531    49


#### Prep data in the right format for the function for Group Comp

In [None]:
%%R

## Prep data to call function for Group Comp

# RNA-seq data for function
inputForGroupComp <- cbind(geneExpBaseline, geneExpComparison)

labelsForGroupComp <- c(rep("alive", nColBaseline),
                          rep("dead", nColComparision))

In [None]:
%%R
#checks
dim(inputForGroupComp)
head(labelsForGroupComp)

[1] "alive" "alive" "alive" "alive" "alive" "alive"
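
The subsetting and labelling steps above rely on the labels being in the same order as the combined columns, and they silently skip clinical IDs that have no expression column (128 alive patients in the clinical file, but 124 alive samples in the expression data). A minimal Python sketch of the same logic on toy IDs follows. The function name `build_groups` and the sample IDs are hypothetical.

```python
def build_groups(expr_columns, baseline_ids, comparison_ids):
    """Return (ordered_columns, labels, unmatched_ids) for a two-group comparison."""
    baseline_set, comparison_set = set(baseline_ids), set(comparison_ids)
    # Keep the expression file's column order within each group.
    baseline_cols = [c for c in expr_columns if c in baseline_set]
    comparison_cols = [c for c in expr_columns if c in comparison_set]
    ordered = baseline_cols + comparison_cols
    # Labels follow the same order as the combined columns.
    labels = ["alive"] * len(baseline_cols) + ["dead"] * len(comparison_cols)
    # Clinical IDs that have no matching expression column.
    unmatched = sorted((baseline_set | comparison_set) - set(expr_columns))
    return ordered, labels, unmatched


expr_columns = ["S1", "S2", "S3", "S4"]  # toy sample IDs
alive_ids = ["S1", "S3", "S9"]           # S9 has no expression column
dead_ids = ["S2"]
ordered, labels, unmatched = build_groups(expr_columns, alive_ids, dead_ids)
print(ordered)
print(labels)
print(unmatched)
```

Reporting the unmatched IDs makes it easy to confirm that a difference between clinical and expression counts is expected rather than a formatting mismatch in the IDs.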


#### Data is now prepped. Call the function to run the group comparison analysis using the edgeR package




In [None]:
%%R
## Call the function to run edgeR
funcSubsetForEdgeR(inputData = inputForGroupComp,
           groupLabels = labelsForGroupComp,
           baselineGrp = "alive",
           compGrp = "dead",
           outputFileLocation = "/home/jupyter/tutorials/storage/output/edgeR")

Called from: funcSubsetForEdgeR(inputData = inputForGroupComp, groupLabels = labelsForGroupComp, 
    baselineGrp = "alive", compGrp = "dead", outputFileLocation = "/home/jupyter/tutorials/storage/output/edgeR")
debug: inputData[inputData < 0] = 0
debug: group_subset <- factor(as.character(groupLabels))
debug: group_subset <<- relevel(group_subset, ref = baselineGrp)
debug: if (min(inputData) < 0) {
    input2_Subset <<- inputData + 2
} else {
    input2_Subset <<- inputData
}
debug: input2_Subset <<- inputData
debug: funcEdgeR(inputData = input2_Subset, grpData = group_subset, 
    outFile = outputFileLocation)


#### The group comparison analysis is now complete.
### **Part 2 - Read in the results of the group comparison analysis and set 4 different threshold options**

In [None]:
%%R
## Read in edgeR output. Set thresholds. Get gene names. Remove duplicates

# 17962 features. Make sure the columns of the data frame are numeric
edgeRoutput <- read.csv(file = groupCompOutputLocation,
                  header = TRUE, stringsAsFactors = FALSE, row.names = 1)

### Set threshold values
pValueCutOff <- 0.01
logFCCutOff <- 1
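
The four shortlisting options in this part differ only in which column is thresholded (`PValue` or `FDR`) and whether an absolute log fold change cutoff is added. The Python sketch below shows that logic on made-up rows. The helper `shortlist` and all values are hypothetical and are not the tutorial's results.

```python
def shortlist(rows, column, cutoff, min_abs_logfc=None):
    """Keep genes with row[column] <= cutoff and, optionally, |logFC| >= min_abs_logfc."""
    kept = []
    for row in rows:
        if row[column] > cutoff:
            continue
        if min_abs_logfc is not None and abs(row["logFC"]) < min_abs_logfc:
            continue
        kept.append(row["gene"])
    return kept


# Made-up rows using the same column names as the edgeR output.
rows = [
    {"gene": "G1", "logFC": 2.0, "PValue": 0.001, "FDR": 0.005},
    {"gene": "G2", "logFC": -1.2, "PValue": 0.004, "FDR": 0.02},
    {"gene": "G3", "logFC": 0.3, "PValue": 0.009, "FDR": 0.008},
    {"gene": "G4", "logFC": -1.8, "PValue": 0.2, "FDR": 0.4},
]

option1 = shortlist(rows, "PValue", 0.01)       # p-value only
option2 = shortlist(rows, "PValue", 0.01, 1.0)  # p-value and |logFC| >= 1
option3 = shortlist(rows, "FDR", 0.01)          # FDR only
option4 = shortlist(rows, "FDR", 0.01, 1.5)     # FDR and |logFC| >= 1.5
print(option1, option2, option3, option4)
```

Using the absolute value of `logFC` covers both directions of change in one comparison, which is equivalent to the `>= cutoff | <= -cutoff` test used in the R cells.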


#### Option 1 - find out how many features with p-value < 0.01

In [None]:
%%R
### Option 1 - find out how many features with p-value <= 0.01

which1 <- which(as.numeric(edgeRoutput$PValue) <= pValueCutOff)
print("Option 1: Features with p-value < 0.01:")
print(length(which1))
option1 <- edgeRoutput[which1, ] #save results from Option 1


[1] "Option 1: Features with p-value < 0.01:"
[1] 6366


#### Option 2 - find out how many features with p-value < 0.01 and logFoldChange >= 1 or <= -1

In [None]:
%%R
### Option 2 - find out how many features with p-value <= 0.01 and logFoldChange >= 1 or <= -1

which2 <- which((edgeRoutput$PValue <= pValueCutOff) &
                    (edgeRoutput$logFC >= logFCCutOff | edgeRoutput$logFC <= -logFCCutOff))
print("Option 2: features with p-value < 0.01 and Log Fold Change >=1 or <= -1:")
print(length(which2))
option2 <- edgeRoutput[which2, ] #save results from Option 2

[1] "Option 2: features with p-value < 0.01 and Log Fold Change >=1 or <= -1:"
[1] 1274


#### Option 3 - find out how many features with FDR < 0.01

In [None]:
%%R
### Option 3 - find out how many features with FDR <= 0.01

which3 <- which(as.numeric(edgeRoutput$FDR) <= 0.01)
print("Option 3: features with FDR < 0.01:")
print(length(which3))
option3 <- edgeRoutput[which3, ] #save results from Option 3

[1] "Option 3: features with FDR < 0.01:"
[1] 5598


#### Option 4 - find out how many features with FDR < 0.01 and LogFoldChange >= 1.5 or <= -1.5

In [None]:
%%R
### Option 4 - find out how many features with FDR <= 0.01 and LogFoldChange >= 1.5 or <= -1.5

which4 <- which((edgeRoutput$FDR <= 0.01) &
                    (edgeRoutput$logFC >= 1.5 | edgeRoutput$logFC <= -1.5))
print("Option 4: Features with FDR < 0.01, and Log Fold change >=1.5 or <= -1.5:")
print(length(which4))
option4 <- edgeRoutput[which4, ] #save results from Option 4

[1] "Option 4: Features with FDR < 0.01, and Log Fold change >=1.5 or <= -1.5:"
[1] 649
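
With the four counts reported above, the rule of thumb from the analysis steps (fewer than roughly 700 to 1000 genes) can be applied directly. A small Python sketch, using a hypothetical helper `options_within_limit`:

```python
# Feature counts reported by the four options above.
reported_counts = {"option1": 6366, "option2": 1274, "option3": 5598, "option4": 649}


def options_within_limit(counts, max_genes):
    """Return the option names whose shortlist has fewer than max_genes features."""
    return [name for name, n in counts.items() if n < max_genes]


print(options_within_limit(reported_counts, 1000))
print(options_within_limit(reported_counts, 700))
```

Only Option 4 falls under either limit, so it is the shortlist that fits the stated target size.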


### **Part 3**
### Call the function to run enrichment analysis (using the enrichR package) on the 4 options

In [None]:
%%R
#### Call the function to subset gene results and run enrichR for
# Option 1
library(openxlsx)
funcSubsetForEnrichR(shortListResults = option1,
                     filename1 = "/home/jupyter/tutorials/storage/output/option1_shortlist_results.csv",
                     optionName1 = "option1")

Uploading data to Enrichr... Done.
  Querying KEGG_2019_Human... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying WikiPathways_2019_Human... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying KEGG_2019_Mouse... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying GO_Biological_Process_2018... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying Reactome_2016... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying BioPlanet_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying ClinVar_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying TRRUST_Transcription_Factors_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying Transcription_Factor_PPIs... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying TRANSFAC_and_JASPAR_PWMs... Done.
Parsing results

In [None]:
%%R
#### Call the function to subset gene results and run enrichR for
# Option 2
library(openxlsx)
funcSubsetForEnrichR(shortListResults = option2,
                     filename1 = "/home/jupyter/tutorials/storage/output/option2_shortlist_results.csv",
                     optionName1 = "option2")

Uploading data to Enrichr... Done.
  Querying KEGG_2019_Human... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying WikiPathways_2019_Human... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying KEGG_2019_Mouse... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying GO_Biological_Process_2018... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying Reactome_2016... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying BioPlanet_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying ClinVar_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying TRRUST_Transcription_Factors_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying Transcription_Factor_PPIs... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying TRANSFAC_and_JASPAR_PWMs... Done.
Parsing results

In [None]:
%%R
#### Call the function to subset gene results and run enrichR for
# Option 3
library(openxlsx)
funcSubsetForEnrichR(shortListResults = option3,
                     filename1 = "/home/jupyter/tutorials/storage/output/option3_shortlist_results.csv",
                     optionName1 = "option3")

Uploading data to Enrichr... Done.
  Querying KEGG_2019_Human... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying WikiPathways_2019_Human... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying KEGG_2019_Mouse... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying GO_Biological_Process_2018... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying Reactome_2016... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying BioPlanet_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying ClinVar_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying TRRUST_Transcription_Factors_2019... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying Transcription_Factor_PPIs... Done.
Parsing results... Done.
Uploading data to Enrichr... Done.
  Querying TRANSFAC_and_JASPAR_PWMs... Done.
Parsing results

In [None]:
%%R
#### Call the function to subset gene results and run enrichR for
# Option 4
library(openxlsx)
funcSubsetForEnrichR(shortListResults = option4,
                     filename1 = "/home/jupyter/tutorials/storage/output/option4_shortlist_results.csv",
                     optionName1 = "option4")

#### All analysis steps done.
