<a href="https://colab.research.google.com/github/GoekeLab/sg-nex-data/blob/update_tutorials/docs/colab/Introduction_Genomics_2_GoogleColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Genomics Workshop 2: Transcript Discovery and Quantification

In this workshop we will learn how to quantify gene and transcript expression from RNA-Seq data, and how to identify new genes and transcripts. We will use long-read Nanopore RNA-Seq data from the Singapore Nanopore Expression Project (SG-NEx). This workshop follows the [online tutorial on Bambu](https://github.com/GoekeLab/sg-nex-data/blob/master/docs/SG-NEx_Bambu_tutorial.md).


### Using Google Colab

This tutorial requires access to a shell (e.g. on Linux, macOS, or the Windows Subsystem for Linux, WSL). If you do not have access to a shell, you can run this tutorial on Google Colab by clicking the badge at the top.

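If you use Google Colab, you have to add `!` before any shell command to execute it in a subshell. A directory change made in a subshell is lost when that command finishes, so to change the working directory use `%` instead (for example `%cd`), which applies the change to the notebook session itself.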
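
The difference between the two prefixes can be reproduced in plain Python. The sketch below is only an illustration and is not needed for the workshop: a `cd` run in a child shell leaves the working directory of the calling process untouched, while changing the directory of the process itself persists. The temporary directory is a throwaway created just for the demonstration.

```python
import os
import shlex
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()  # throwaway directory, used only for this demo

# Comparable to `!cd <dir>`: the change happens inside a child shell
# and disappears as soon as that shell exits.
subprocess.run(["sh", "-c", "cd " + shlex.quote(target_dir)], check=True)
after_subshell = os.getcwd()

# Comparable to `%cd <dir>`: the change applies to the running process itself.
os.chdir(target_dir)
persisted = os.path.samefile(os.getcwd(), target_dir)

print("unchanged after subshell cd:", after_subshell == start_dir)
print("changed after in-process cd:", persisted)

# Restore the original state and clean up.
os.chdir(start_dir)
os.rmdir(target_dir)
```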


To execute R, you can run the following code:

In [None]:
! pip install rpy2==3.4.2
%load_ext rpy2.ipython

You can now access R through Google Colab as illustrated with this small example code:

In [None]:
%%R
x<-2
show(x^2)

## Installation

In [None]:
%%R
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("bambu", update=FALSE)

In [None]:
! sudo apt install -y awscli

Test that bambu can be loaded in R:

In [None]:
%%R
library(bambu)


### Data Download 

The Singapore Nanopore Expression Project (SG-NEx) has generated a comprehensive resource of long-read RNA-Seq data using the Oxford Nanopore third-generation sequencing platform. The data is hosted on the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/) and described in detail here: <https://github.com/GoekeLab/sg-nex-data>

For this workshop we will be using a reduced data set which only includes data from the human chromosome 22. The data can be accessed using the AWS command line interface (or using direct links, which you can find in the online documentation).

In [None]:
! aws s3 ls --no-sign-request s3://sg-nex-data/data/data_tutorial/

In [None]:
! mkdir -p workshop/reference
! mkdir -p workshop/fastq
! mkdir -p workshop/bam
! mkdir -p workshop/bambu

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa workshop/reference/
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa.fai workshop/reference/
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.gtf workshop/reference/


In [None]:
! aws s3 sync --no-sign-request s3://sg-nex-data/data/data_tutorial/bam/ workshop/bam/

### Transcript Discovery and Quantification with Bambu

In [None]:
%%R
library(bambu)
fa.file <- 'workshop/reference/hg38_chr22.fa'
gtf.file <- 'workshop/reference/hg38_chr22.gtf'
# Create the reference annotation object that Bambu uses for transcript discovery and quantification.
annotations <- prepareAnnotations(gtf.file)
# Escape the dot so that the pattern only matches file names ending in ".bam".
samples.bam <- list.files("workshop/bam/", pattern = "\\.bam$", full.names = TRUE)

In [None]:
%%R
se <- bambu(reads = samples.bam, annotations = annotations, genome = fa.file, ncore = 2)  


### The SummarizedExperiment object

Bambu returns a SummarizedExperiment object, which stores the quantification results and transcript annotations for the analysis. In addition to the main data matrix, it stores information that describes the rows (transcripts or genes) and columns (samples). You can find out more [here](https://bioconductor.org/help/course-materials/2019/BSS2019/04_Practical_CoreApproachesInBioconductor.html).
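
To make this layout concrete, here is a toy Python container with the same three parts: assay matrices, row annotations, and column annotations. It is a conceptual sketch only. The class name `MiniExperiment`, its fields, and the example values are all invented for illustration and are not part of Bioconductor or Bambu.

```python
from dataclasses import dataclass


@dataclass
class MiniExperiment:
    """Toy container: assay matrices plus row and column annotations."""
    assays: dict    # assay name -> list of rows (one row per transcript)
    row_data: list  # one dict per row (transcript)
    col_data: list  # one dict per column (sample)

    def __post_init__(self):
        # Each assay needs one row per row_data entry and one column per sample.
        for name, matrix in self.assays.items():
            if len(matrix) != len(self.row_data):
                raise ValueError(f"assay {name!r}: row count does not match row_data")
            if any(len(row) != len(self.col_data) for row in matrix):
                raise ValueError(f"assay {name!r}: column count does not match col_data")

    @property
    def shape(self):
        """Return (number of transcripts, number of samples)."""
        return (len(self.row_data), len(self.col_data))


# Invented example: three transcripts measured in two samples.
toy = MiniExperiment(
    assays={"counts": [[10.0, 4.0], [0.0, 2.0], [5.0, 5.0]]},
    row_data=[
        {"TXNAME": "tx1", "GENEID": "geneA"},
        {"TXNAME": "tx2", "GENEID": "geneA"},
        {"TXNAME": "tx3", "GENEID": "geneB"},
    ],
    col_data=[{"name": "sample1"}, {"name": "sample2"}],
)
print(toy.shape)
```

Keeping the annotations next to the matrices, with the dimensions checked on construction, is what lets rows and columns be subset without the metadata falling out of step with the data.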


In [None]:
%%R
se

In [None]:
%%R
colData(se)

In [None]:
%%R
rowRanges(se) #returns a GRangesList (with genomic coordinates) with all annotated and newly discovered transcripts.

In [None]:
%%R
rowData(se) #returns additional information about each transcript such as the gene name and the class of the newly discovered transcript.

In [None]:
%%R
assays(se) #returns the transcript abundance estimates as counts or CPM.

In [None]:
%%R
head(assays(se)$CPM) #returns the first 6 rows of the CPM matrix.
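
CPM stands for counts per million: each sample's counts are rescaled so that they sum to one million, which makes samples of different sequencing depth comparable. The Python sketch below shows this standard definition on invented counts. It illustrates the unit only and is not a reproduction of Bambu's internal code; the function name `counts_to_cpm` is hypothetical.

```python
def counts_to_cpm(counts):
    """Rescale one sample's transcript counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for a sample without any assigned reads.
        return {tx: 0.0 for tx in counts}
    return {tx: count * 1_000_000 / total for tx, count in counts.items()}


# Invented counts for three transcripts in a single sample.
example_counts = {"tx1": 50.0, "tx2": 30.0, "tx3": 20.0}
example_cpm = counts_to_cpm(example_counts)
print(example_cpm)
```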

### Novel transcripts and genes

In [None]:
%%R
show(table(mcols(se)$novelTranscript))
show(which(mcols(se)$novelGene)) # lists new gene candidates
show(rowRanges(se)[which(mcols(se)$novelGene)[1]]) # shows the ranges of a transcript from a novel gene candidate

### Which BCR transcript is expressed?

>Exercise: Which transcript from the BCR gene (ENSG00000186716) is most highly expressed? How many full-length reads support this transcript? Visualise the data in the [UCSC Genome Browser](https://genome.ucsc.edu/cgi-bin/hgGateway) using these custom tracks for the HepG2 direct RNA-Seq data set:

```
track type=bigWig name="SGNex_HepG2_directRNA_replicate1_run3.bigwig" description="SGNex_HepG2_directRNA_replicate1_run3.bigwig" bigDataUrl=http://sg-nex-data.s3.amazonaws.com/data/sequencing_data_ont/genome_browser_data/bigwig/SGNex_HepG2_directRNA_replicate1_run3.bigwig

track type=bigBed name="SGNex_HepG2_directRNA_replicate1_run3.bigbed" description="SGNex_HepG2_directRNA_replicate1_run3.bigbed" bigDataUrl=http://sg-nex-data.s3.amazonaws.com/data/sequencing_data_ont/genome_browser_data/bigbed/SGNex_HepG2_directRNA_replicate1_run3.bigbed
```

In [None]:
%%R
round(assays(se)$counts[grep('ENSG00000186716', rowData(se)$GENEID),],4)

In [None]:
%%R
round(assays(se)$fullLengthCounts[grep('ENSG00000186716', rowData(se)$GENEID),],4)

### Transcript and Gene Expression

In [None]:
%%R
se_gene <- transcriptToGeneExpression(se)
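
Conceptually, gene-level expression is obtained by combining the estimates of all transcripts that belong to the same gene. The Python sketch below shows the idea by summing transcript counts per gene. The identifiers and counts are invented, the function name `sum_counts_by_gene` is hypothetical, and `transcriptToGeneExpression` may do more than this simple aggregation.

```python
from collections import defaultdict


def sum_counts_by_gene(transcript_counts, transcript_to_gene):
    """Add up the counts of all transcripts that map to the same gene."""
    gene_counts = defaultdict(float)
    for tx, count in transcript_counts.items():
        gene_counts[transcript_to_gene[tx]] += count
    return dict(gene_counts)


# Invented example: two transcripts of geneA and one transcript of geneB.
transcript_counts = {"txA1": 12.0, "txA2": 3.5, "txB1": 7.0}
transcript_to_gene = {"txA1": "geneA", "txA2": "geneA", "txB1": "geneB"}

gene_counts = sum_counts_by_gene(transcript_counts, transcript_to_gene)
print(gene_counts)
```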

### Export the output

In [None]:
%%R
writeBambuOutput(se, path = "./workshop/bambu/")

In [None]:
! ls workshop/bambu/
! head workshop/bambu/*