<a href="https://colab.research.google.com/github/ICBI/BISR_Tutorials/blob/main/RNA_seq/RNAseq_CompareGroups.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Group comparison analysis example using RNA-seq data

## Dataset used in the example
This example uses TCGA ESCA (esophageal cancer) data obtained from the Broad Institute TCGA data repository FireBrowse (<http://firebrowse.org>)

## Assumptions for analysis
* The user has enough programming experience to clean the data and put it in the required format

## Input data files
* Two input data files:
    * (a) Gene expression data from RNA-seq (raw counts)
    * (b) Clinical data
* RNA-seq file format:
    * Rows are features (genes) and columns are samples. Only tumor samples were chosen for this analysis.
    * Feature names are assumed to be in the format `GeneName|GeneID`. The example shown here is from public TCGA data obtained from FireBrowse.
* Clinical data: a cleaned file that contains two columns, patient IDs and vital status (Dead/Alive)
* The user knows how to identify the baseline and comparison groups for the analysis
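
Because the downstream analysis expects raw counts with `GeneName|GeneID` row names, a quick sanity check of the expression matrix can catch formatting problems early. The sketch below is only an illustration: the toy matrix, the sample names, and the helper `checkCountMatrix` are made up for this example and are not part of the tutorial code.

```r
# Toy count matrix in the expected layout:
# rows = "GeneName|GeneID", columns = samples (all names are illustrative)
toyCounts <- data.frame(
  TCGA.XX.0001 = c(3, 32, 7327),
  TCGA.XX.0002 = c(11, 943, 8041),
  row.names = c("DLGAP2|9228", "DLGAP3|58512", "DLGAP4|22839")
)

# Hypothetical helper: TRUE when the matrix looks like a usable count matrix
checkCountMatrix <- function(counts) {
  m <- as.matrix(counts)
  # Row names should look like GeneName|GeneID
  namesOk <- all(grepl("^[^|]+\\|[0-9]+$", rownames(m)))
  # Values should be numeric, non-missing, and non-negative
  valuesOk <- is.numeric(m) && !anyNA(m) && all(m >= 0)
  namesOk && valuesOk
}

checkCountMatrix(toyCounts)
```

A matrix read in as character strings, or one containing negative values (for example log-transformed data), would fail this check.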

## Goal of this analysis
* Use the gene expression data from RNA-seq (raw counts) obtained from tumor tissue of Esophageal cancer patients to compare Dead and Alive patients.
* Baseline Group = Alive
* Comparison Group = Dead

## Analysis Steps

* Mount Google Drive (if Google Colab is used)
* Install packages
* Read in the cleaned gene expression data
* Read in the cleaned clinical data file
* Use the clinical data to separate the gene expression data into baseline and comparison groups
* Set labels for the groups
* Call the function to perform the group comparison analysis. This function uses the Bioconductor package `edgeR` <https://bioconductor.org/packages/release/bioc/html/edgeR.html>. The package requires the RNA-seq data to be in the form of raw counts.
* Read in the results of the group comparison analysis
* Select thresholds for shortlisting results. Ideally, the thresholds should leave fewer than roughly 700 to 1000 genes.
* Call the function to perform enrichment analysis. This is done using the EnrichR R package <https://cran.r-project.org/web/packages/enrichR/index.html>

### Step 0 - Checks, mount Google Drive, install packages

IMPORTANT NOTE - Check that the "runtime type" of this Google Colab notebook is set to "R" and not "Python"

Mount Google Drive - takes a few minutes

In [None]:
install.packages("googledrive")
# Only needed occasionally: install.packages("httpuv")
library("googledrive")
#library("httpuv")

In [None]:
# The Python version in the path below may need updating occasionally
if (file.exists("/usr/local/lib/python3.7/dist-packages/google/colab/_ipython.py")) {
  install.packages("R.utils")
  library("R.utils")
  library("httr")
  my_check <- function() {return(TRUE)}
  reassignInPackage("is_interactive", pkgName = "httr", my_check)
  options(rlang_interactive=TRUE)
}

In [None]:
#### testing R
test1 <- 5
print(test1)

[1] 5


### Install R/Bioconductor packages
NOTE - this takes SEVERAL minutes to run

In [None]:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
install.packages("openxlsx")
library(openxlsx)
library(stringr)

In [None]:
BiocManager::install("edgeR")
library(edgeR)

In [None]:
install.packages("enrichR")
library(enrichR)

### Location of input files in GitHub
Note - When using GitHub, make sure to use the link to the "raw file"

In [61]:
enrichRFileLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/20201203-EnrichR-Databases.txt"

geneExpLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/20210318_TCGA_ESCA_RNAseq_GeneExp_RawCounts_Tumor.tsv"

clinFileLocation <- "https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/input/TCGA_ESCA_ClinDataVitalStatus.tsv"

### Output folder in Google Drive


In [None]:
groupCompOutputLocation <- "/content/sample_data/outputedgeR_ExactTest.csv"

outputFolder <- "/content/sample_data/"

#### Load the R code files - these are helper function files that each perform a specific analysis
Note - When using GitHub, make sure to use the link to the "raw file"

In [64]:
## Load the helper function that runs edgeR
source("https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/funcSubsetForEdgeR.R")

## Load the helper function that subsets gene results and runs EnrichR for each option
source("https://raw.githubusercontent.com/ICBI/BISR_Tutorials/main/RNA_seq/funcEnrichment.R")

### Step 1 - Group comparison analysis

#### Read in gene expression file

In [None]:
# 20531 features (rows), 173 tumor samples (columns)
geneExp <- read.csv(file = geneExpLocation,
                    sep="\t", stringsAsFactors = F, row.names = 1)

# Check the dimensions
dim(geneExp)

# View the first few rows from 5000-5010 (head() shows six) - can see gene names
head(geneExp[5000:5010,1:3])

| | TCGA.2H.A9GF | TCGA.2H.A9GG | TCGA.2H.A9GH |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| DLGAP2\|9228 | 3 | 11 | 10 |
| DLGAP3\|58512 | 32 | 943 | 23 |
| DLGAP4\|22839 | 7327 | 8041 | 5178 |
| DLGAP5\|9787 | 2284 | 650 | 2744 |
| DLK1\|8788 | 1 | 0 | 0 |
| DLK2\|65989 | 7 | 175 | 8 |


#### Read in cleaned clinical data

In [None]:


# 185 sample  IDs
clinData <- read.csv(file = clinFileLocation,
                     sep = "\t", stringsAsFactors = F)
dim(clinData)
head(clinData)


| | Colname | V1 | V2 |
|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` |
| 1 | 1 | TCGA.IC.A6RF | alive |
| 2 | 2 | TCGA.JY.A6FB | alive |
| 3 | 3 | TCGA.JY.A938 | alive |
| 4 | 4 | TCGA.L5.A43I | dead |
| 5 | 5 | TCGA.L5.A43J | dead |
| 6 | 6 | TCGA.L5.A43M | alive |


#### Subset clinical data  into two data frames - dead and alive

In [None]:


## Subset clinical data into two data frames by vital status

clinDataAlive <- subset(clinData, V2 == "alive") # 128 patients alive
clinDataDead <- subset(clinData, V2 == "dead") # 57 patients dead


In [None]:
#checks
dim(clinDataAlive)

#### Subset gene exp data  into two data frames - dead and alive

In [None]:

## Subset gene exp data into two data frames - alive and dead
#Make sure data is numerical format, not strings

#Subset alive patients
matchBaseline <- which(colnames(geneExp) %in% clinDataAlive$V1)
geneExpBaseline <- geneExp[,matchBaseline] #124 samples alive

#Subset dead patients
matchComparison <- which(colnames(geneExp) %in% clinDataDead$V1)
geneExpComparison<- geneExp[,matchComparison] # 49 samples dead

#Get number of columns
nColBaseline <- ncol(geneExpBaseline)
nColComparision <- ncol(geneExpComparison)

In [None]:


#checks
dim(geneExpBaseline)
dim(geneExpComparison)
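
The clinical file lists 185 sample IDs while the expression matrix has 173 columns, so some clinical records have no matching expression data. The following sketch shows one way to inspect such an overlap. It uses made-up sample IDs and a toy clinical table; none of the object names below come from the tutorial code.

```r
# Toy sample IDs (made up for illustration)
exprSamples <- c("S1", "S2", "S3", "S4")
clinToy <- data.frame(V1 = c("S1", "S2", "S3", "S5", "S6"),
                      V2 = c("alive", "dead", "alive", "dead", "alive"),
                      stringsAsFactors = FALSE)

shared   <- intersect(exprSamples, clinToy$V1)  # usable in the comparison
clinOnly <- setdiff(clinToy$V1, exprSamples)    # clinical record, no expression data
exprOnly <- setdiff(exprSamples, clinToy$V1)    # expression data, no clinical record

cat("Shared samples:", length(shared), "\n")
cat("Clinical only: ", length(clinOnly), "\n")
cat("Expression only:", length(exprOnly), "\n")

# Group sizes among the shared samples only
table(clinToy$V2[clinToy$V1 %in% shared])
```

If the overlap is unexpectedly empty, it is worth checking that both files write the IDs the same way. By default `read.csv` converts `-` in column names to `.`, which is why the sample IDs in this tutorial appear as `TCGA.2H.A9GF`.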

#### Prep data in the right format for the function for Group Comp

In [None]:

## Prep data to call function for Group Comp

# RNA-seq data for function
inputForGroupComp <- cbind(geneExpBaseline, geneExpComparison)

labelsForGroupComp <- c(rep("alive", nColBaseline),
                          rep("dead", nColComparision))

In [None]:
#checks
dim(inputForGroupComp)
head(labelsForGroupComp)
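
The labels above rely on the column order of the combined matrix (baseline columns first, then comparison columns). As an alternative sketch, the label for each column can be looked up directly in the clinical table with `match()`, so the labels stay aligned with the columns whatever their order. The toy objects below are made up for illustration.

```r
# Toy expression matrix with columns in arbitrary order
toyExp <- data.frame(S3 = c(5, 0), S1 = c(10, 2), S2 = c(7, 1),
                     row.names = c("GENEA|1", "GENEB|2"))
toyClin <- data.frame(V1 = c("S1", "S2", "S3"),
                      V2 = c("alive", "dead", "alive"),
                      stringsAsFactors = FALSE)

# Look up the vital status of each expression column, in column order
toyLabels <- toyClin$V2[match(colnames(toyExp), toyClin$V1)]

# One label per column, and no sample left without a label
stopifnot(length(toyLabels) == ncol(toyExp), !anyNA(toyLabels))

# Inspect the pairing and the group sizes
data.frame(sample = colnames(toyExp), label = toyLabels)
table(toyLabels)
```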

In [None]:
## Testing if google drive is mounted and whether you can locate this test file
write.csv(clinDataAlive,
    file = paste(outputFolder, "_outputtest12345.csv", sep=""))

#### Data is now prepped. Call function to run the group comparison analysis using the edgeR package

#### Call function to run `edgeR`

The output files will be saved to the Google Drive folder `/content/sample_data/`, from where the user can download the files to a local computer.

Note - this takes several minutes

In [None]:
## call function to run EdgeR
funcSubsetForEdgeR(inputData = inputForGroupComp,
           groupLabels = labelsForGroupComp,
           baselineGrp = "alive",
           compGrp = "dead",
           outputFileLocation = "/content/sample_data/outputedgeR")

The group comparison analysis is now complete. The following outputs are generated:
* Results from the three types of group comparison tests executed using the package: exact test, GLM likelihood ratio test (LRT), and GLM quasi-likelihood F-test (QLF)
* Quality check plot - MDS plot
* Full merged data for reference

List of output files:

```
* outputedgeR_ExactTest.csv
* outputedgeR_GLM_LRT.csv
* outputedgeR_GLM_QLF.csv
* outputedgeR_MDSPlot.pdf
* outputedgeR_mergedData.csv
```
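
The internals of `funcSubsetForEdgeR` are not shown in this notebook. For orientation only, the sketch below runs a typical edgeR exact-test workflow on simulated counts. The simulated data, the object names, and the filtering step are assumptions made for this example and may differ from what the helper function actually does.

```r
library(edgeR)

# Simulated raw counts: 500 genes x 6 samples (made up for illustration)
set.seed(42)
nGenes <- 500
toyCounts <- matrix(rnbinom(nGenes * 6, mu = 100, size = 5), nrow = nGenes,
                    dimnames = list(paste0("GENE", seq_len(nGenes), "|", seq_len(nGenes)),
                                    paste0("S", 1:6)))
toyGroup <- factor(c("alive", "alive", "alive", "dead", "dead", "dead"),
                   levels = c("alive", "dead"))

# Build the edgeR object, drop low-count genes, and normalize library sizes
y <- DGEList(counts = toyCounts, group = toyGroup)
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)

# Estimate dispersion and run the exact test; the first group in 'pair' is the baseline
y <- estimateDisp(y)
et <- exactTest(y, pair = c("alive", "dead"))

# Full results table with logFC, logCPM, PValue, and FDR columns
toyResult <- topTags(et, n = Inf)$table
head(toyResult)
```

With `pair = c("alive", "dead")`, a positive `logFC` indicates higher expression in the `dead` group relative to the `alive` baseline.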

### Step 2 - Read in the results of the group comparison analysis and shortlist based on filtering criteria

Note - the location is the same location where the edgeR output files were saved. The variable `groupCompOutputLocation`, declared at the beginning, contains the full file path including the file name.

In [None]:
## Read in EdgeR output. Set thresholds. Get gene names. Remove duplicates

#17962 features. Make sure the data frame is a numeric data frame
edgeRoutput <- read.csv(file = groupCompOutputLocation,
                  header = T, stringsAsFactors = F, row.names = 1)

### Set threshold values
pValueCutOff <- 0.01  # FDR cutoff
logFCCutOff <- 1.5    # absolute log fold change cutoff

#check dimensions of this matrix object
dim(edgeRoutput)

Find out how many features have FDR <= 0.01 and log fold change >= 1.5 or <= -1.5

In [None]:
### Find out how many features have FDR <= 0.01 and log fold change >= 1.5 or <= -1.5

whichFeatures <- which((edgeRoutput$FDR <= pValueCutOff) &
                    (abs(edgeRoutput$logFC) >= logFCCutOff))
print("Features with FDR < 0.01, and Log Fold change >=1.5 or <= -1.5:")
print(length(whichFeatures))
shortMatrix <- edgeRoutput[whichFeatures, ] # save the short listed results

#check dimensions of the short listed matrix
dim(shortMatrix)

#save this short listed file
write.csv(shortMatrix, file = paste(outputFolder, "output_shortListResults.csv", sep=""))



[1] "Features with FDR < 0.01, and Log Fold change >=1.5 or <= -1.5:"
[1] 649


So we started with 17962 features in the full output, and after filtering on FDR and log fold change, we are left with 649 features.
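
When choosing thresholds, it can help to see how many features survive at several cutoff combinations before settling on one. The sketch below does this on a simulated results table. The data are random, and the helper `countFeatures` is hypothetical, so the numbers it prints say nothing about the real dataset.

```r
# Simulated results table with the two columns used for filtering
set.seed(1)
toyOutput <- data.frame(logFC = rnorm(2000, sd = 1.5),
                        FDR   = runif(2000))

# Hypothetical helper: number of features passing both cutoffs
countFeatures <- function(res, fdrCut, lfcCut) {
  sum(res$FDR <= fdrCut & abs(res$logFC) >= lfcCut)
}

# Try every combination of candidate cutoffs
cutoffGrid <- expand.grid(fdrCut = c(0.01, 0.05), lfcCut = c(1, 1.5, 2))
cutoffGrid$nFeatures <- mapply(countFeatures,
                               fdrCut = cutoffGrid$fdrCut,
                               lfcCut = cutoffGrid$lfcCut,
                               MoreArgs = list(res = toyOutput))
cutoffGrid
```

Applied to the real edgeR output, a table like this makes it easier to pick the strictest combination that still leaves a gene list of workable size for enrichment analysis.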

### Step 3 - Cleaning output for next step
After extracting only the gene names from the short listed matrix, duplicate gene names and NA values (if any) are removed.

In [None]:
#### Function to clean gene names
funcSplit <- function(rep_gene) {
    rep_gene_split <- unlist(strsplit(x = rep_gene, split = "|", fixed = TRUE))
    gene <- rep_gene_split[1]
    return(gene)
    }

# Use the row names of the short listed matrix created in Step 2
geneListSplit <- apply(X = as.matrix(row.names(shortMatrix)),
                           MARGIN = 1, FUN = funcSplit )
head(geneListSplit)
length(geneListSplit)

In [None]:
#remove duplicates, and NA values
geneListSplit1 <- unique(geneListSplit)
geneListSplit2 <- na.omit(geneListSplit1)

head(geneListSplit2)
length(geneListSplit2)

#Save list of unique genes
write.table(x = geneListSplit2,
            file = paste(outputFolder, "_shortListedUniqueGenes.tsv", sep=""),
            quote = F, sep = "\t", row.names = FALSE, col.names = F)
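
The same cleaning can also be written in vectorized form. The sketch below is an alternative illustration on made-up feature names, not a replacement for the cells above. It additionally drops `?` entries, on the assumption that some feature names use `?` as a placeholder gene symbol; check your own file before relying on that.

```r
# Toy feature names in GeneName|GeneID format (made up for illustration)
toyIds <- c("DLGAP2|9228", "DLK1|8788", "?|100130426", "DLK1|8788", NA)

# Keep everything before the first "|"
toySymbols <- sub("\\|.*$", "", toyIds)

# Drop NA values and placeholder symbols, then remove duplicates
toySymbols <- toySymbols[!is.na(toySymbols) & toySymbols != "?"]
toySymbols <- unique(toySymbols)

toySymbols
```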

### Step 4 - Call function to run enrichment analysis

In [67]:
# Read in list of EnrichR databases
dblist1 <- read.csv(file = enrichRFileLocation,
                   header = F, stringsAsFactors = F)
head(dblist1)

| | V1 |
|---|---|
| | `<chr>` |
| 1 | KEGG_2019_Human |
| 2 | WikiPathways_2019_Human |
| 3 | KEGG_2019_Mouse |
| 4 | GO_Biological_Process_2018 |
| 5 | Reactome_2016 |
| 6 | BioPlanet_2019 |



Note - takes a few minutes to run


In [None]:

library(openxlsx)
library(enrichR)

#call function to run Enrichment
funcEnrichment(dblist = dblist1, #list of databases
    genes1 = geneListSplit2,  # unique gene names list
    outputFileName = paste(outputFolder, "output_enrichR.xlsx", sep="")) #output file name



#### What is the output of this function ?
* This code runs the EnrichR package on each database listed in the `dblist1` object. The results are collated into an Excel file `output_enrichR.xlsx`, which is then saved to Google Drive.
* Each tab in the Excel file is a database.
* Each tab contains the enrichment results obtained when the input list of genes was run against that database.
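
To illustrate the one-tab-per-database layout described above, the sketch below writes a named list of data frames to an Excel workbook with `openxlsx`. The list is a made-up stand-in for enrichment results (no call to the EnrichR service is made), and the file is written to a temporary folder. How `funcEnrichment` builds its workbook internally is not shown in this notebook, so this is an assumption about the general approach.

```r
library(openxlsx)

# Stand-in for enrichment results: a named list with one data frame per database
# (terms and p-values are invented for illustration)
toyEnrichment <- list(
  KEGG_2019_Human = data.frame(Term = c("Pathway A", "Pathway B"),
                               Adjusted.P.value = c(0.001, 0.03)),
  Reactome_2016   = data.frame(Term = "Pathway C",
                               Adjusted.P.value = 0.02)
)

# Writing a named list creates one sheet per list element, named after the element
toyFile <- file.path(tempdir(), "toy_enrichR.xlsx")
write.xlsx(toyEnrichment, file = toyFile)

# Check the sheet names and read one tab back in
getSheetNames(toyFile)
readSheet <- read.xlsx(toyFile, sheet = "KEGG_2019_Human")
readSheet
```

Excel limits sheet names to 31 characters, so long database names may need to be shortened before writing.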

### All analysis steps done. List of all output files to download.

You can view the files in the left panel in the `sample_data` folder. The output files listed below all start with the prefix `output`.


**Output files from group comparison analysis**
* outputedgeR_ExactTest.csv
* outputedgeR_GLM_LRT.csv
* outputedgeR_GLM_QLF.csv
* outputedgeR_MDSPlot.pdf
* outputedgeR_mergedData.csv

**Short listed gene exp matrix based on filtering criteria**
* output_shortListResults.csv

**EnrichR output**
* output_enrichR.xlsx