# Loading Packages

In [None]:


#bioconductor
library("GenomicRanges")
library("DESeq2")
library("ACME")
library("GEOquery")
library("EnhancedVolcano")

#R 
library("stringr")
library("dplyr")
library("parallel")
library("rmarkdown")
library("knitr")
library("ggfortify")
library("data.table")
library("ggrepel")

#define the input path
filepath <- c("/public/codelab/omics-workshop/OmicsWorkshopVignettes/03_BulkRNAATAC_KithXiang/data")


# RNA-seq

A typical analysis of RNAseq data involves mapping the reads to a reference genome and quantifying the expression
of the genes in order to determine which are significantly differentially expressed between experimental groups. 

Processing the raw data into a count matrix requires a number of steps, and though it's possible to do it all on a laptop,
it's much preferred to use the HPC.  We're using R for most of the downstream analysis, but the preprocessing is
usually done with other bioinformatics tools, most of which are already installed on the HPC.  

It's important to keep in mind that each of these steps can be done with a different tool.  It's the collection of all
these tools and the order they're applied that makes an analysis *pipeline*.  Be sure to document all your steps when 
working on your own projects.

# Processing from scratch

## Quality control

When you get your raw reads from the sequencer (.fastq files), there is usually some sort of QC report that comes with them that can
flag potential problems with the sequencing such as low read counts, a drop in quality after a certain read length,
or problems in specific spots of the flowcell.

If you don't have a QC report, it's easy enough to generate one yourself with 
[*fastqc*](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Once it's installed, it can be run at the command line like:
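
A minimal sketch of such a call, wrapped in R so it can be run from this notebook. The file name `sample_R1.fastq.gz` and the output folder `fastqc_out` are hypothetical placeholders, and the command is only executed if `fastqc` is found on the PATH and the input file exists.

```r
# hypothetical input file and output folder; replace with your own
fastq_file <- "sample_R1.fastq.gz"
out_dir <- "fastqc_out"

# fastqc takes the output directory with -o, followed by the fastq file(s)
fastqc_args <- c("-o", out_dir, fastq_file)
fastqc_cmd <- paste("fastqc", paste(fastqc_args, collapse = " "))
print(fastqc_cmd)

# only run it if fastqc is installed and the input file exists
if (nzchar(Sys.which("fastqc")) && file.exists(fastq_file)) {
    # fastqc expects the output directory to exist already
    dir.create(out_dir, showWarnings = FALSE)
    system2("fastqc", args = fastqc_args)
}
```

The printed string is what you would type directly at the command line.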

## Trimming adapters

The adapters usually need to be taken off from the fastq reads before they can be aligned to a reference genome.
I like to use a tool called [*trim_galore*](https://github.com/FelixKrueger/TrimGalore) which works for both 
single and paired reads.

If you know the adapter sequence, you can plug this into the trimming tool to chop it off from the reads, or if
you know that a standard adapter was used, it can be removed with default parameters.


## Alignment

This is the most time intensive part of the pipeline, and involves mapping the *small* fastq reads to
the *long* chromosomes of a known reference genome.  The reference needs to be indexed to allow for faster comparisons,
and special care needs to be taken to account for splicing across exons when dealing with RNA.

A popular aligner is [*star*](https://github.com/alexdobin/STAR), which is extremely fast and handles splicing, but
has a very large memory requirement (>20GB), so unless you have a very powerful machine, it needs to be run on the HPC.

Another option with a much lower memory footprint is [*tophat2*](https://ccb.jhu.edu/software/tophat/index.shtml).  This
one is an older program, but still handles splicing and can be run on a low end laptop.

Whichever tool you choose for the alignment, you'll have to get acquainted with its operating procedures and parameter settings. 
I find it easier to start with a working example than reading through a manual, and luckily, most aligners have
pretty good documentation with vignettes and small sample datasets to work through that help you familiarize yourself
with the software.

## Feature counting

This step is all about quantifying gene expression.  Gene regions must be specified in some format, usually as a .gtf
or .gff file, which contains the chromosome, start, and stop of each intron/exon and how they are related to each other.
A program is then used to count the aligned reads over each transcript, accounting for the various ways genes can overlap.
A popular tool that is laptop friendly is [HTSeq](https://htseq.readthedocs.io/en/latest/htseqcount.html).

## Running the pipeline

Here's an example that shows how to process fastq files from start to finish on the Einstein HPC.
The `star` aligner has a special feature that allows for soft clipping at the start or end of a read,
which lets us skip the step of trimming adapters.  There's also a way to count the genes during 
alignment by running it with the `--quantMode GeneCounts` parameter -- another step handled 
without having to run a separate program.

### run1_star.sh

The pipeline script below has a simple calling form, and can be submitted to the HPC's 
[*slurm workload manager*](https://slurm.schedmd.com/overview.html) 
with the `sbatch` command.

`sbatch slurm_star_hg38.sh sampName sampR1.fastq.gz sampR2.fastq.gz`




### bash loop

If you have more than a few files to process, you won't want to type all that
out at the command line.  Even copy/pasting to a separate file and editing 
by hand is prone to error and should be avoided if possible.
A better way is to write a shell script to prepare the calls for you.
There are many ways to do this, but [bash](https://linuxconfig.org/bash-scripting-tutorial) 
is probably the most popular and is available on most computers as the default shell.

In the code below, I'm looping over all fastq files of the first paired ends in a directory,
and working with them through a series of [piped](https://en.wikipedia.org/wiki/Pipeline_(Unix)) programs.

The first `sed` substitution finds the name of the 2nd paired end from the original filename.

The second `sed` substitution finds the name of the sample from the original filename, and adds
a 'star' prefix to the string.

The `echo` command prints out what we want to run.

The final `>|` [redirect](https://en.wikipedia.org/wiki/Redirection_(computing)) command saves it all in a file.

The loop doesn't execute anything; it just writes out the commands we want to run
into a separate file.  To execute that file, you run it at the command line.

The bash loop below is essentially a script that writes another script.  I like to
do it this way so I can look over the code and catch spelling errors and mistakes
before attempting to run it on the HPC.
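
The same script-that-writes-a-script idea can be sketched in R. This is an illustration rather than the original bash loop: the fastq names are made up, the `_R1`/`_R2` suffix convention is an assumption, and `run1_star.sh` is used here as the name of the generated file. `sub()` plays the role of the two `sed` substitutions, `paste()` replaces `echo`, and `writeLines()` replaces the redirect.

```r
# hypothetical fastq names; in practice something like
# list.files(pattern = "_R1.fastq.gz$") would collect them
r1_files <- c("liver_rep1_R1.fastq.gz", "liver_rep2_R1.fastq.gz")

# name of the 2nd paired end: swap R1 for R2 in the original filename
r2_files <- sub("_R1.fastq.gz", "_R2.fastq.gz", r1_files, fixed = TRUE)

# sample name: drop the read suffix and add a 'star' prefix
samp_names <- paste0("star_", sub("_R1.fastq.gz", "", r1_files, fixed = TRUE))

# build one sbatch call per sample, without running anything
cmds <- paste("sbatch slurm_star_hg38.sh", samp_names, r1_files, r2_files)

# save the calls to a file so they can be inspected before running
script_file <- file.path(tempdir(), "run1_star.sh")
writeLines(cmds, script_file)
cat(readLines(script_file), sep = "\n")
```

Once the generated file looks right, it can be run at the command line with `bash run1_star.sh`.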

### slurm_star_hg38.sh

## Assembling the count matrix

What we have now is a set of count files that have the number of reads mapping to each gene, one per sample.
These can be loaded and merged in R quite easily, to be made ready for the downstream analysis described in
the later section of this report.

In [None]:


#get the count files
pfiles = list.files(path = paste0(filepath,"/counts_rna"),
    pat = ".*star.*.tab", full = TRUE, recursive = TRUE)

#load the count files and just use the first 2 columns
pcounts = lapply(pfiles, function(f){
    print(f)
    try({
        x = read.csv(f, sep="\t", header=F)
        #remove bad rows
        x = x[grepl(x[,1], pat="ENSG"),]
        y = data.frame(ensg=x[,1], counts=x[,2])
        y
    }, silent=T)
})






In [None]:
#drop the files that failed to load (they come back as try-error objects)
names(pcounts) = pfiles
ix = unlist(lapply(pcounts, function(a){class(a) != "try-error"}))
pcounts = pcounts[ix]
is.na(pcounts)

In [None]:
#merge into a data.frame
x = do.call(cbind, lapply(pcounts, function(a){
	a$counts
}))
#make the names easier to read
colnames(x) = gsub(names(pcounts), pat=".*/(.*)_ReadsPerGene.out.tab", rep="\\1")
rownames(x) = pcounts[[1]][,1]


#get rid of NA counts
ix2 = which(apply(x, 1, function(a){
    !any(is.na(a))
}))
x=x[ix2,]

#take out the rownames that have __ in them
#only use the ENSG rows
x = x[grepl(rownames(x), pat="ENSG"),]

#strip the dot and everything after it (the Ensembl version suffix)
rownames(x) = gsub(rownames(x), pat="\\..*", rep="")

head(x)

# Loading from GEO

> [Gene Expression Omnibus(GEO)](https://www.ncbi.nlm.nih.gov/geo/) is a public functional 
> genomics data repository supporting MIAME-compliant data submissions. 
> Array- and sequence-based data are accepted. Tools are provided to help users query 
> and download experiments and curated gene expression profiles.   

Though it started as a repository for microarray data, it has grown through the years, and now accepts
next-gen sequencing datasets of all kinds, including RNA, ATAC, OxBS, singlecell, etc.
It's relatively easy to upload your own data, and there are many ways to download from 
the repository, including programmatic interfaces in R and other languages.

In the example below, we'll be downloading a dataset with the accession ID of GSE132040.
Every project in GEO is given a unique identifier, as well as every sample, platform, and dataset in the series.
Searching for a specific dataset can be done through the [website](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi), 
but usually, you'll get the accession ID from the paper you're reading.  Since GEO is so popular,
a [Ctrl-F] search for "GSE" in the pdf usually brings you right to the accession number without having to search
through all the supplementaries.

We'll be using the `GEOquery` Bioconductor package to interface to the GEO repository.  
While this package has very robust methods for working with microarray expression data,
it doesn't have every feature available for all next-gen seq types.  For RNAseq data,
most times you'll have to download the processed gene count matrix manually, then link it with 
the phenotypic data that's stored on GEO.

In [None]:
# #create a folder/directory for where geo downloads will be cached
# if (!dir.exists("geo")){
#     dir.create("geo")
# }

# #download geo data with id GSE132040 and save to "geo" dir
# dat <- getGEO("GSE132040", destdir = "geo")
# head(dat)



# use local GEO files 
dat <- vector("list",length = 1)
names(dat) <- "GSE132040_series_matrix.txt.gz"   
dat[[1]] <-getGEO(filename= paste0(filepath,"/geo/GSE132040_series_matrix.txt.gz"))
head(dat)


## Loading the phenotype data

GEOquery offers a consistent way for accessing phenotype info from the datasets in GEO,
using the `pData` function.

In [None]:
#load the phenotype data from GEO, obtain phenodata
#some GSEs have more than one dataset; getGEO returns them in a list
phenodata <- pData(dat[[1]])
head(phenodata)
#Alternatively, you can do the following 
#phenodata <- dat[["GSE132040_series_matrix.txt.gz"]]@phenoData@data

## Loading the expression data

Accessing expression data is usually handled with the `exprs` function, but not in this case.


In [None]:

#Load the expression data
rnadata.geo <- exprs(dat[[1]])
dim(rnadata.geo) #0 rows?!

#the columns of rnadata.geo should match up to the rows of phenodata
#cbind(colnames(rnadata.geo),rownames(phenodata))
all(colnames(rnadata.geo) == rownames(phenodata))

Notice the ExpressionSet assayData has 0 features, and 947 samples.  This is the matrix that
holds the expression counts for each gene, and it's empty!

The expression data is not accessible from the usual exprs function,
probably because this is bulk RNAseq data and not microarray,
but we can still download the processed data from the GEO as a supplementary table. 

[GEO:  GSE132040](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132040)

[GSE132040_190214_A00111_0269_AHH3J3DSXX_190214_A00111_0270_BHHMFWDSXX.csv.gz](https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132040&format=file&file=GSE132040%5F190214%5FA00111%5F0269%5FAHH3J3DSXX%5F190214%5FA00111%5F0270%5FBHHMFWDSXX%2Ecsv%2Egz)



In [None]:

#you'll need to unzip the file first
#rnadata <- read.csv("GSE132040_190214_A00111_0269_AHH3J3DSXX_190214_A00111_0270_BHHMFWDSXX.csv")

#here's a trick to load large files using fread from the data.table package.  It's much faster than read.csv and even works on gzipped files (with the R.utils package installed)
rnadata <- data.frame(fread(paste0(filepath,"/GSE132040_190214_A00111_0269_AHH3J3DSXX_190214_A00111_0270_BHHMFWDSXX.csv.gz")))
dim(rnadata)

#look at the first 5 rows and first 10 columns to get an idea of the data
#View(rnadata[1:5, 1:10])
#print(rnadata[1:5, 1:10])

rnadata[1:5, 1:10]

Take a look at column 1 in the data.frame above.

The first column is the Gene id, the rest are the rna counts from each sample.

You'll see why later on, but for now, it would be easier to work with if we 
use the first column as the rownames of the data.frame and only use
the counts.

In [None]:
table(duplicated(rnadata$gene)) # check for duplicated gene names


In [None]:


rownames(rnadata) <- rnadata$gene  # Assign the first column to the rownames
rnadata$gene <- NULL #remove the column from the rest of the matrix

class(rnadata)



In [None]:
rnadata[1:5, 1:10]

## Matching phenotype data to expression data

In order to link the two data sources, we need to find an identifier that is common to both.

Notice the columns of `rnadata` all have a naming structure that looks like:

In [None]:
cbind(head(colnames(rnadata)))

If you look through the columns of `phenodata` that we got from GEO, the only thing that comes close is the *title*.

In [None]:
cbind(head(phenodata$title))

The phenodata titles do not match the csv file exactly, so we need to do some string manipulation 
to transform the column names of the csv to match a section of the title in phenodata.

If we chop off the very last part ".gencode.vM19" of the colnames in the csv file,
they will match what's inside the brackets of phenodata$title.

In [None]:
# we can use gsub to replace part of a string; fixed=TRUE treats the pattern literally, so the dots are not wildcards
colnames(rnadata) <- gsub(colnames(rnadata), pattern=".gencode.vM19", replacement="", fixed=TRUE)
                     
cbind(head(colnames(rnadata))) # after

In [None]:




# create a new column in phenodata called "tmp_id" 
# that has the sample identifier extracted from the title
# we use gsub again; note the \\ in front of ( ) and [ ], because they are special characters in regular expressions
# a more general approach is to capture the text inside the brackets with a single regular expression
phenodata$tmp_id <- gsub(phenodata$title, pattern="Tabula Muris Senis \\(bulk RNA-seq\\) \\[", replace="")
phenodata$tmp_id <- gsub(phenodata$tmp_id, pattern="\\]", replace="")

cbind(head(phenodata$tmp_id))
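
As a sketch of the more general regular-expression approach, a single `sub()` call with a capture group can pull out whatever sits inside the square brackets. The two titles below are hypothetical examples written in the same shape as `phenodata$title`; the sample ids inside the brackets are made up.

```r
# hypothetical titles in the same format as phenodata$title
titles <- c("Tabula Muris Senis (bulk RNA-seq) [A1_384Bulk_Plate1_S1]",
            "Tabula Muris Senis (bulk RNA-seq) [B2_384Bulk_Plate2_S26]")

# .*\\[   everything up to and including the opening bracket
# (.*)    the part we want to keep, referred to as \\1
# \\].*   the closing bracket and anything after it
ids <- sub(".*\\[(.*)\\].*", "\\1", titles)
print(ids)
```

This version does not depend on the exact wording before the bracket, so it keeps working if the title prefix changes.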



In [None]:
# let's also get rid of the ":ch1" suffix in some colnames
colnames(phenodata) <- gsub(colnames(phenodata), pattern=":ch1", replace="")

We have to make sure that the sample order of the csv file matches the sample order of the phenotype data.

As of now, this is not the case...you have to reorder the samples in the two matrices so they correspond to each other.

It's very easy to mess up here!

In [None]:
#is every sample accounted for? yes
all(phenodata$tmp_id %in% colnames(rnadata))


In [None]:
#but they don't match
head(cbind(phenodata$tmp_id, colnames(rnadata)))

In [None]:
#better to work with a temporary variable so you can check your work

#reorder the rows of phenodata to match the columns of rnadata
phenodata_reordered <- phenodata[match(colnames(rnadata), phenodata$tmp_id),]

#check it again (now they match)
head(cbind(phenodata_reordered$tmp_id, colnames(rnadata)))
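
To see what `match` is doing, here is a toy sketch with three made-up samples. `match(a, b)` returns, for each element of `a`, its position in `b`, so indexing the rows of the phenotype table with that result puts them in the column order of the counts.

```r
# toy data: the count columns are in a different order than the phenotype rows
counts_cols <- c("s3", "s1", "s2")
pheno <- data.frame(tmp_id = c("s1", "s2", "s3"),
                    tissue = c("Bone", "Brain", "Liver"))

# for each count column, where does it sit in the phenotype table?
idx <- match(counts_cols, pheno$tmp_id)
print(idx)

# reorder the phenotype rows to follow the count columns
pheno_reordered <- pheno[idx, ]

# match returns NA for ids it cannot find, so check for that too
stopifnot(!anyNA(idx))
stopifnot(all(pheno_reordered$tmp_id == counts_cols))
```

Checking for `NA` in the index is worthwhile because a missing id would otherwise produce a silent row of `NA`s.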

In [None]:


#from now on we will replace phenodata with phenodata_reordered
phenodata <- phenodata_reordered
rm(phenodata_reordered)

## Selecting a smaller subset

The full dataset from GSE132040 has `r ncol(rnadata)` samples.  DESeq2 is capable of handling datasets this large, 
but it could take hours to run on a dataset like this.  

For our example, we'll be using a smaller subset of just the bone and brain tissues from 1 month postnatal samples.

In [None]:
#run deseq on a subset of rnadata for demonstration purposes
table(phenodata$tissue, phenodata$age)

In [None]:
#lets just look at bone vs brain
#in 1 months postnatal mice

#extract the relevant rows from phenotype data
phenodata_small <- phenodata[phenodata$tissue %in% c("Bone", "Brain"),]
phenodata_small <- phenodata_small[phenodata_small$age %in% c("1 months postnatal"),]


#extract the relevant columns from count data
rnadata_small <- rnadata[,colnames(rnadata) %in% phenodata_small$tmp_id]

dim(phenodata_small)
dim(rnadata_small)

In [None]:
#make sure the IDs match up
table(phenodata_small$tmp_id == colnames(rnadata_small))

In [None]:


#use the new ids instead of the GSM ids
#deseq requires the rownames from the info match the colnames of the counts
rownames(phenodata_small) <- phenodata_small$tmp_id


#make sure the column names don't have any special characters like ":"
colnames(phenodata_small) <- make.names(colnames(phenodata_small))

# DESeq2

There's a great [tutorial](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) 
by Michael Love, the developer of DESeq2, that is hands down the best resource out there
for learning how to do differential expression analysis in R.  It's a long read,
but it's full of details and explanations of every procedure in the package along with all the
code needed to run them.  If you're using DESeq2 for your own analyses, you should read it over at least once.

To summarize the tutorial, DE is carried out by running a regression of the gene counts on
one or more variables, e.g. gender, age, treatment condition.  If you think of the usual linear regression as `Y ~ X`
on a 2D scatter plot, the dependent variable on the *Y* axis is the gene count, the independent variable 
on the *X* axis is the covariate, and each scatter plot dot is a sample.  

General linear modelling assumes the residuals are normally distributed, 
so it's not the best method to use with our integer gene counts.  Poisson regression works well with
integer counts, but built into the model's distribution is the assumption that the mean is the same as the variance.
Negative binomial regression is like Poisson, but the relationship between the mean and variance is
governed by a dispersion parameter, so the variance is allowed to be larger than the mean (overdispersion), which is typical of RNA-seq counts.  

DESeq2 uses negative binomial regression to look for association
between the X and Y variables and uses info from all genes to estimate a proper dispersion parameter.
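
A small numeric sketch of the mean-variance relationship, using the negative binomial parameterisation `var = mu + alpha * mu^2`, where `alpha` is the dispersion. The value `alpha = 0.1` is an arbitrary illustration, not an estimate from this dataset.

```r
# three example mean counts, from lowly to highly expressed
mu <- c(10, 100, 1000)

# hypothetical dispersion, chosen only for illustration
alpha <- 0.1

# Poisson: the variance equals the mean
var_pois <- mu

# negative binomial: the variance grows with the square of the mean
var_nb <- mu + alpha * mu^2

print(data.frame(mean = mu, var_poisson = var_pois, var_negbin = var_nb))
```

With `alpha = 0` the two models coincide; the larger the dispersion, the more the variance exceeds the Poisson assumption, especially for highly expressed genes.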

The main DESeq2 function requires 3 inputs:

* `colData`:  a data.frame with `nSamples` rows, and `nVariables` columns.  
This holds the phenotypic information of every sample in your study.

* `countData`:  a matrix with `nGenes` rows and `nSamples` columns.  These are your 
raw gene counts.  Note the order of the rows/columns is transposed from the usual R standard of
having one observation per row.

* `design`:  a formula describing the regression of the gene counts on specific columns of `colData`. 

The first two parameters are straightforward, the last can be tricky if you're not used to R's syntax. 

A `formula` has a left hand side and a right hand side separated by a `~`.  
The LHS is the dependent variable, which in our case is the gene counts.
The RHS are the variables you want to regress upon.  You can use more than one by separating them 
by `+`.  If you want to include interaction terms between two variables you use a `:`. 
As shorthand, if you want to include both main effects and interaction terms between variables you use a `*`.

It's easier to understand with examples:

* `Y ~ X1 + X2` :  A model with Y as the dependent var and X1 & X2 as independent vars
* `Y ~ X1 * X2` :  A model with Y as the dependent var and X1 and X2 as independent vars that includes interaction between X1 and X2
* `Y ~ X1 + X2 + X1:X2` :  Same as above
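
One way to check what a formula expands to is to look at the columns of the design matrix that base R's `model.matrix` builds from it. The sketch below uses a made-up data.frame with a two-level `condition` factor and a numeric `age`; only the RHS of the formula is given, as DESeq2 expects.

```r
# toy sample table: two conditions, two ages (made-up values)
df <- data.frame(
    condition = factor(c("ctrl", "ctrl", "drug", "drug"), levels = c("ctrl", "drug")),
    age = c(1, 2, 1, 2)
)

# main effects only
mm_add <- model.matrix(~ condition + age, data = df)

# main effects plus interaction, short and long form
mm_int <- model.matrix(~ condition * age, data = df)
mm_long <- model.matrix(~ condition + age + condition:age, data = df)

print(colnames(mm_add))
print(colnames(mm_int))

# the * shorthand and the spelled-out version give the same columns
stopifnot(identical(colnames(mm_int), colnames(mm_long)))
```

The `conditiondrug` column is the dummy coding of the factor, with `ctrl` as the baseline because it is the first level.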

Most times, samples in an experiment are split by a condition like ctrl vs drug.  In this case you will want to regress on
a *grouping* variable that's either dummy coded to 0 or 1, or a categorical factor with each level of your condition. 

If the RHS of a formula has more than 1 term, by default DESeq2 reports stats only for the very last one. 
This makes it very straightforward to control for confounding effects in your model: put the covariates first and the variable of interest last.

When using DESeq2, we only need to specify the RHS of the formula.

## Running the DE

It doesn't matter where your data comes from, whether it was processed by hand or downloaded from a repository, DESeq2
treats it all the same.  For a basic analysis, you just need to supply those 3 inputs, run 3 functions, and the package will take care
of the rest.

For a more involved analysis that prefilters genes and considers alternative shrinkage estimators,
you should refer to the DESeq2 manual/tutorial for details.

In [None]:

#it's always a good idea to explicitly set a categorical variable as a factor
#so you can control which is the baseline reference.  Otherwise, it sorts 
#alphabetically and chooses the first as the ref.
phenodata_small$tissue <- factor(phenodata_small$tissue, levels = c("Bone", "Brain"))

dds <- DESeqDataSetFromMatrix(countData=rnadata_small, colData=phenodata_small, design = ~tissue)
dds <- DESeq(dds) #takes a minute
res <- results(dds)
res <- as.data.frame(res) #easier to work with data.frames

#write the results to a csv file that can be opened in excel
write.table(res, file="res.csv",sep = ",",quote = F,row.names = T, col.names = NA)

summary(res)



In [None]:
head(res, 30)

The result `res` is a data.frame with 6 columns.

* baseMean:  The mean of the normalized counts for this gene across all samples
* log2FoldChange:  The difference in log2 expression between groups (here Brain relative to the Bone reference level)
* lfcSE:  The standard error of the log2 fold change
* stat:  The statistic used to determine significance (by default the Wald statistic)
* pvalue:  raw p-value
* padj:  The adjusted p-value, corrected for multiple comparisons with FDR 
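
To make the adjustment concrete, here is a sketch of the Benjamini-Hochberg FDR correction using base R's `p.adjust` on five made-up p-values. DESeq2 computes `padj` internally (and may set some values to `NA` through its filtering), so this only illustrates the correction step itself.

```r
# five made-up raw p-values
pvalue <- c(0.001, 0.01, 0.03, 0.04, 0.5)

# Benjamini-Hochberg adjustment
padj <- p.adjust(pvalue, method = "BH")
print(data.frame(pvalue = pvalue, padj = padj))

# how many pass a 5% FDR threshold?
print(sum(padj <= 0.05))
```

Each adjusted value is at least as large as its raw p-value, which is why fewer genes pass a threshold on `padj` than on `pvalue`.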

That's it for the basics of running DESeq2!


## Selecting significant up/down regulated genes

Once you've written out your results, further downstream analysis can be done in R, 
or any other environment you're comfortable with.  Excel is often used
to filter for interesting genes, which are then copy/pasted directly into
an online pathway analysis website such as [Reactome](https://reactome.org/PathwayBrowser/#TOOL=AT).


In [None]:
#remove results with no counts
res <- res[res$baseMean > 0,]
#sort by fold change, from the smallest to the largest
res <- res[order(res$log2FoldChange),] # set decreasing = TRUE to sort from the largest to the smallest
#write the output to a spreadsheet

write.table(res,file="de_boneVsbrain_age1month.csv",sep = ",",quote = F,row.names = T,col.names = NA)

dim(res)

In [None]:
#how many genes show significant DE after adjusting for multiple comparisons?
sum(res$padj < .01, na.rm=T)

In [None]:


#what are the genes that have  a log2FC > 2 and an adjusted pvalue < 0.01?
res_up <- res[!is.na(res$log2FoldChange),]
res_up <- res_up[res_up$log2FoldChange > 2,]
res_up <- res_up[!is.na(res_up$padj),]
res_up <- res_up[res_up$padj < 0.01,]
res_up <- res_up[order(res_up$log2FoldChange,decreasing = TRUE),]

res_down <- res[!is.na(res$log2FoldChange),]
res_down <- res_down[res_down$log2FoldChange < -2,]
res_down <- res_down[!is.na(res_down$padj),]
res_down <- res_down[res_down$padj < 0.01,]


nrow(res_up)
nrow(res_down)



In [None]:
#look at the top 20 in each
head(rownames(res_up), 20)
head(rownames(res_down), 20)

## PCA plot

It's a good idea to visualize the global expression over the first
few principal components to check if there are any outliers or other
interesting patterns in your data.  A PCA plot can be used
to determine if there is a batch effect that needs to be accounted for,
or some other condition that needs to be addressed.

In [None]:
# when the expression data is aligned to the phenotype data
# it's easy to run all the usual informatics procedures 
dim(rnadata_small)

In [None]:

# the prcomp function complains when there are columns with 0 variance
# to fix it, remove the genes that have no variance
gene_var <- apply(rnadata_small, 1, var) > 0 
table(gene_var)

rnadata_filtered <- rnadata_small[gene_var,]
dim(rnadata_filtered)



In [None]:
pca1 <- prcomp(t(rnadata_filtered), scale=T)

#plot the first 2 principal components, colored by tissue type
#pdf("test1.pdf")
autoplot(pca1, data=phenodata_small, colour='tissue', shape="Sex")
#dev.off()

## Volcano plot
Overall results from a DE analysis are usually shown in a *volcano plot* that 
has the log2FC on the x-axis, and a `-log10()` transformation of the p-value on the y-axis.
Genes that have both a large absolute log2FC and a low p-value should be examined further.

It's not difficult to make the plots yourself in base R or ggplot, but a better
option may be to use the 
[EnhancedVolcano](https://bioconductor.org/packages/release/bioc/vignettes/EnhancedVolcano/inst/doc/EnhancedVolcano.html)
package.

In [None]:

vlnPlot1 <- EnhancedVolcano(res,
    lab = rownames(res),
    #subtitle = NULL, #get rid of the subtitle
    #colCustom = keyvals.colour, #give customized color
    x = 'log2FoldChange',
    y = 'padj',
    #xlim = c(-30,30), #x axis range
    #ylim = c(0,300), # y axis range
    title = 'DE_Bone_vs_Brain: 1month', # title label
    xlab= bquote(~Log[2]~ 'fold change'), # x axis label
    ylab= bquote(~-Log[10]~ 'Padj'), # y axis label
    caption = NULL,
    pCutoff = 0.05,
    FCcutoff = 2,
    pointSize = 2.0)

#pdf("test2.pdf")
vlnPlot1
#dev.off()
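
For comparison, a hand-made volcano plot in base R might look like the sketch below. It uses a made-up five-gene table with the same column names as `res`, and the same cutoffs as the `EnhancedVolcano` call (absolute log2FC above 2, adjusted p-value below 0.05); swap in the real `res` to plot the actual results.

```r
# made-up results table with the same column names as res
toy <- data.frame(
    log2FoldChange = c(-3.2, -0.4, 0.1, 2.5, 4.0),
    padj = c(1e-8, 0.6, 0.9, 0.03, 1e-12),
    row.names = paste0("gene", 1:5)
)

# flag the genes that pass both cutoffs
sig <- abs(toy$log2FoldChange) > 2 & toy$padj < 0.05

# log2FC on the x-axis, -log10 of the adjusted p-value on the y-axis
plot(toy$log2FoldChange, -log10(toy$padj),
     col = ifelse(sig, "red", "grey40"), pch = 19,
     xlab = "log2 fold change", ylab = "-log10(padj)")

# dashed lines marking the cutoffs
abline(v = c(-2, 2), h = -log10(0.05), lty = 2)

print(rownames(toy)[sig])
```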