<a href="https://colab.research.google.com/github/GoekeLab/sg-nex-data/blob/master/docs/colab/Introduction_genomics_3_GoogleColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Genomics Workshop 3: A long read RNA-Seq pipeline in Nextflow

Bioinformatics pipelines often consist of multiple tools that are run in sequence to generate the final output. In this workshop we will use a workflow manager (Nextflow) to automatically execute the long read RNA-Seq workflow. We will be using long read Nanopore RNA-Seq data from the Singapore Nanopore Expression Project (SG-NEx).


### Using Google Colab

This tutorial requires access to a shell (e.g. Linux, MacOS, or the Windows Subsystem for Linux/WSL). If you do not have access to a shell, you can run this tutorial on Google Colab by clicking the badge at the top of this page.

If you use Google Colab, you have to add `!` before any shell command to execute it in a subshell. To change the working directory, add `%` instead (for example `%cd`), which applies the change to the whole notebook session rather than to a temporary subshell.
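The difference between `!` and `%` follows from how processes work: a directory change made inside a child shell is lost as soon as that shell exits. The following is a minimal Python sketch of the same behaviour, assuming a POSIX shell is available. It illustrates the idea only and is not how Colab implements these prefixes.

```python
import os
import shlex
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()  # scratch directory to change into

# Comparable to `! cd <dir>`: the cd runs in a child shell that exits immediately,
# so the working directory of this Python process is not affected.
subprocess.run(f"cd {shlex.quote(target_dir)}", shell=True, check=True)
dir_after_subshell = os.getcwd()

# Comparable to `%cd <dir>`: the directory of the current process itself changes.
os.chdir(target_dir)
reached_target = os.path.samefile(os.getcwd(), target_dir)

# Restore the original state and remove the scratch directory.
os.chdir(start_dir)
os.rmdir(target_dir)

print("Unchanged after subshell cd:", dir_after_subshell == start_dir)
print("Changed after os.chdir:", reached_target)
```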

## Installation


We will use the AWS command line interface to access and download the SG-NEx data. We will use minimap2 for read alignment, samtools for sam to bam file conversion, and Bambu for quantification and transcript discovery. We will use Nextflow to run the workflow.

The following command enables the execution of R commands from Google Colab (using the Python template):

In [None]:
%load_ext rpy2.ipython

First we will create a folder `software`, which will be used to download software that we want to install:

In [None]:
! mkdir software 
%cd software


**Nextflow** is a workflow management system that enables users to write reproducible and portable pipelines. Nextflow provides many features that can be very helpful when developing complex pipelines, such as the ability to restart a disrupted workflow run, and the generation of a workflow execution report. You can read more about Nextflow here: https://www.nextflow.io/. 

In [None]:
! curl -s https://get.nextflow.io | bash
! sudo ln -s /content/software/nextflow /usr/bin/nextflow


The **AWS Command Line Interface** can be used to access data stored as objects in Amazon S3 on the AWS cloud.

You can install `awscli` using the following command:

In [None]:
! python -m pip install awscli

**Minimap2** is a tool that aligns sequencing reads (fastq files) to a reference genome. Here we will use pre-compiled binaries. For detailed installation instructions you can refer to the [Minimap2 website](https://github.com/lh3/minimap2).

In [None]:
! curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
! sudo ln -s /content/software/minimap2-2.26_x64-linux/minimap2 /usr/bin/minimap2

Reads which are aligned to a reference genome (for example using Minimap2) are stored in sam files (or bam files, which are compressed sam files). **Samtools** is a collection of tools to handle sam and bam files. You can install the latest version of Samtools as described online (http://www.htslib.org/download/). Depending on the operating system, you can also install Samtools using the following command (which might not be the latest version):

In [None]:
! sudo apt install -y samtools

**Bambu** is an R package which requires a recent version of R (>4.0). Installation guidelines for R can be found online: <https://www.r-project.org/>

R is already installed on Google Colab. [Bambu](https://github.com/GoekeLab/bambu) can be installed either through GitHub or through Bioconductor (recommended). This step might take 30 minutes.

In [None]:
%%R
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("bambu", update=FALSE)


After all software is installed, we change back to the parent directory:

In [None]:
%cd ..
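Before moving on, it can be useful to confirm that the command-line tools are actually reachable. The sketch below uses Python's `shutil.which` to look up each executable on the `PATH`. The helper name `find_missing_tools` is our own, and Bambu is not checked here because it is an R package rather than an executable.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Executables installed in the steps above (`aws` is provided by awscli).
required = ["nextflow", "aws", "minimap2", "samtools"]
missing = find_missing_tools(required)

if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All command-line tools were found.")
```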

### Data Download 

For this workshop, we will create a `workshop` directory and three sub-directories (`reference`, `fastq`, and `nextflow`) that will store the human genome sequence and annotations (`reference`), the sequencing reads (`fastq`), and the Nextflow script that describes the complete workflow (`nextflow`).



In [None]:
! mkdir -p workshop/reference
! mkdir workshop/fastq
! mkdir workshop/nextflow


The Singapore Nanopore Expression Project (SG-NEx) has generated a comprehensive resource of long read RNA-Sequencing data using the Oxford Nanopore third-generation sequencing platform. The data is hosted on the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/) and described in detail here: <https://github.com/GoekeLab/sg-nex-data>

**Downloading the human genome sequence and annotations (fa, fa.fai, and gtf)**

For this workshop we will be using a reduced data set which only includes data from the human chromosome 22. The data can be accessed using the AWS command line interface (or using direct links, which you can find in the online documentation).


In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa workshop/reference/
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa.fai workshop/reference/
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.gtf workshop/reference/

**Downloading the sequencing reads**

Here we will use the fastq files from the SG-NEx project that contain reads from chromosome 22:


In [None]:
! aws s3 sync --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/ workshop/fastq/
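A quick check that the downloads arrived can save a failed workflow run later. The following sketch looks for the three reference files named above and lists any `*.fastq.gz` files in the `fastq` directory. The helper `check_downloads` is our own, and it assumes it is run from the directory that contains `workshop`.

```python
from pathlib import Path

# Reference files downloaded in the previous step.
REFERENCE_FILES = ["hg38_chr22.fa", "hg38_chr22.fa.fai", "hg38_chr22.gtf"]


def check_downloads(workshop_dir):
    """Return (missing reference file names, sorted fastq.gz file names)."""
    workshop = Path(workshop_dir)
    missing = [
        name for name in REFERENCE_FILES
        if not (workshop / "reference" / name).is_file()
    ]
    fastq = sorted(p.name for p in (workshop / "fastq").glob("*.fastq.gz"))
    return missing, fastq


missing_refs, fastq_files = check_downloads("workshop")
print("Missing reference files:", missing_refs or "none")
print("Fastq files found:", len(fastq_files))
```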

### Workflow execution


Here we will run the workflow that combines workshop 1 (read alignment and sam to bam conversion) with workshop 2 (transcript discovery and quantification with Bambu). 

The workflow will be executed using Nextflow (please refer to the lecture slides for additional details). Workflow results are cached in a directory `$PWD/work`, where `$PWD` is the path to the current directory. Here we will execute the workflow from the workshop directory:


In [None]:
%cd workshop

The following command downloads the nextflow script into the `nextflow/` directory:

In [None]:
! wget "https://raw.githubusercontent.com/GoekeLab/sg-nex-data/master/docs/colab/workflow_longReadRNASeq.nf"  -P nextflow/

In [None]:
! ls nextflow/

To run the workflow we use the nextflow command with the option `-with-report` to provide a summary report about the workflow run, and the option `-resume` that allows us to resume a run with existing intermediate results if it was disrupted or modified. The other arguments are defined in the workflow script, and provide the path to the reads, reference genome files, and the output directory:


In [None]:
! nextflow run nextflow/workflow_longReadRNASeq.nf -with-report -resume \
      --reads $PWD/fastq/A549_directRNA_sample2.fastq.gz \
      --refFa $PWD/reference/hg38_chr22.fa \
      --refGtf $PWD/reference/hg38_chr22.gtf \
      --outdir $PWD/results/
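Outside of a notebook, the same command can be assembled in Python and passed to `subprocess.run`. The sketch below only builds and prints the argument list, using the options shown above. The function name `build_nextflow_command` is our own, and the command is not executed here.

```python
import os
import shlex


def build_nextflow_command(script, reads, ref_fa, ref_gtf, outdir):
    """Assemble the nextflow call used in this workshop as an argument list."""
    return [
        "nextflow", "run", script, "-with-report", "-resume",
        "--reads", reads,
        "--refFa", ref_fa,
        "--refGtf", ref_gtf,
        "--outdir", outdir,
    ]


cwd = os.getcwd()  # plays the role of $PWD in the shell command
command = build_nextflow_command(
    script="nextflow/workflow_longReadRNASeq.nf",
    reads=os.path.join(cwd, "fastq", "A549_directRNA_sample2.fastq.gz"),
    ref_fa=os.path.join(cwd, "reference", "hg38_chr22.fa"),
    ref_gtf=os.path.join(cwd, "reference", "hg38_chr22.gtf"),
    outdir=os.path.join(cwd, "results"),
)
print(shlex.join(command))
# To run it: subprocess.run(command, check=True)
```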

Once the run is complete, you can list all results that were generated and stored in the output directory using:

In [None]:
! ls -lh results/

You can see the results from transcript discovery (stored in the `*.gtf` file), and the results from transcript and gene expression quantification.

With the following command you can view the read count for some of the transcripts:

In [None]:
! head results/counts_transcript.txt

The file shows the transcript id, the corresponding gene id, and the number of aligned reads. The prefix "Bambu" indicates that a gene or transcript is newly discovered, whereas the prefix "ENS" corresponds to gene and transcript IDs from the annotations.
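The prefixes make it easy to count how many transcripts are novel and how many come from the annotations. The sketch below assumes the file is tab-separated with one header line and the transcript id in the first column. The column names and ids in the example data are made up for illustration and are not real results.

```python
import csv
import io


def count_by_origin(handle):
    """Count novel ("Bambu" prefix) and annotated ("ENS" prefix) transcripts."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    tally = {"novel": 0, "annotated": 0, "other": 0}
    for row in reader:
        if not row:
            continue  # ignore empty lines
        transcript_id = row[0]
        if transcript_id.startswith("Bambu"):
            tally["novel"] += 1
        elif transcript_id.startswith("ENS"):
            tally["annotated"] += 1
        else:
            tally["other"] += 1
    return tally


# Made-up example rows (hypothetical header and ids, not real output).
example = io.StringIO(
    "TXNAME\tGENEID\tsample\n"
    "BambuTx1\tENSG00000000001\t12\n"
    "ENST00000000001\tENSG00000000001\t30\n"
    "ENST00000000002\tENSG00000000002\t0\n"
)
tally = count_by_origin(example)
print(tally)
```

To apply this to the workflow output, open `results/counts_transcript.txt` and pass the file handle to `count_by_origin`.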

### Cache and resume

Nextflow stores the results of each process in the `$PWD/work` directory, which allows us to modify parts of the workflow and resume the run without recomputing results from processes which were not changed.
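The idea behind resuming can be shown with a toy cache in which each task is identified by a hash of its script and its inputs. This is a simplified sketch of the concept, not Nextflow's actual implementation, which takes more information into account. The task names, script strings, and file names are placeholders.

```python
import hashlib


def task_key(script, inputs):
    """Hash the task script together with its inputs into a cache key."""
    digest = hashlib.sha256()
    digest.update(script.encode())
    for item in inputs:
        digest.update(b"\0")  # separator so adjacent inputs cannot merge
        digest.update(item.encode())
    return digest.hexdigest()


cache = {}     # cache key -> stored result
executed = []  # names of tasks that really had to run


def run_task(name, script, inputs):
    """Run a task only if no result is cached for the same script and inputs."""
    key = task_key(script, inputs)
    if key not in cache:
        executed.append(name)
        cache[key] = f"output of {name}"
    return cache[key]


# First run: both tasks are executed.
run_task("align", "align reads", ["reads.fastq.gz", "genome.fa"])
run_task("quantify", "quantify with NDR=1", ["aligned.bam", "genome.gtf"])

# Resumed run after editing only the quantification script.
run_task("align", "align reads", ["reads.fastq.gz", "genome.fa"])
run_task("quantify", "quantify with NDR=0", ["aligned.bam", "genome.gtf"])

print(executed)
```

In the resumed run only the task whose script changed is executed again, because the key of the unchanged task still matches an entry in the cache.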

Here we will change the transcript discovery argument in Bambu from `NDR=1` to `NDR=0` (no transcript discovery).


In [None]:
! cp nextflow/workflow_longReadRNASeq.nf nextflow/workflow_longReadRNASeq_original.nf
! sed -i 's/NDR=1/NDR=0/g' nextflow/workflow_longReadRNASeq.nf
! diff nextflow/*
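The substitution and the comparison can also be reproduced in Python, which may help when reading the `diff` output. In the sketch below the two `original` lines are hypothetical stand-ins, since the actual content of the workflow script is not shown here. Only the `NDR=1` to `NDR=0` replacement is taken from the command above.

```python
import difflib

# Hypothetical stand-in lines, not the real workflow script.
original = [
    "// example line that stays the same",
    "example_option = 'NDR=1'",
]

# Same replacement as the sed command: every NDR=1 becomes NDR=0.
modified = [line.replace("NDR=1", "NDR=0") for line in original]

diff = list(
    difflib.unified_diff(original, modified, "original", "modified", lineterm="")
)

# Keep only the changed lines, dropping the ---/+++ file headers.
changed = [
    line for line in diff
    if line.startswith(("-", "+")) and not line.startswith(("---", "+++"))
]
print("\n".join(changed))
```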

This change will modify the Bambu process, but not the process for alignment or sam to bam conversion. Using the `-resume` option, we can now execute the workflow using the cached results:

In [None]:
! nextflow run nextflow/workflow_longReadRNASeq.nf -with-report -resume \
      --reads $PWD/fastq/A549_directRNA_sample2.fastq.gz \
      --refFa $PWD/reference/hg38_chr22.fa \
      --refGtf $PWD/reference/hg38_chr22.gtf \
      --outdir $PWD/results/

The results directory now includes results from the modified Bambu process, where only annotated transcripts and genes will be quantified:

In [None]:
! head results/counts_transcript.txt

### Clean working directories when the run is complete


Once the run is completed and the results are obtained, the work directories should be cleaned. This can be done manually, or using the `nextflow clean` command. First, we will list the nextflow runs:

In [None]:
! nextflow log -q

We can now either remove the cached data from a specific run, or delete the data from the last run (the default option). The `-n` argument indicates a dry run:

In [None]:
! nextflow clean -n

With the `-f` argument, the files will be removed:

In [None]:
! nextflow clean -f

We can again list the nextflow runs after this clean step:

In [None]:
! nextflow log -q

We can then repeat the steps to list the files that will be deleted when we run `nextflow clean` (`-n` option for a dry run), and finally delete these files (`-f`):

In [None]:
! nextflow clean -n

In [None]:
! nextflow clean -f


### The execution report

The `-with-report` option generates an execution report that contains useful information about the resources that were used by each process. The report is stored in the execution directory as an HTML file.
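To locate the report files, the execution directory can be searched for HTML files. The sketch below lists them with the newest first. It assumes that the reports are the only `.html` files in that directory, and the helper name `find_reports` is our own. The exact report file name may differ between Nextflow versions, which is why the search matches any `.html` file.

```python
from pathlib import Path


def find_reports(run_dir="."):
    """Return the HTML files in run_dir, sorted from newest to oldest."""
    return sorted(
        Path(run_dir).glob("*.html"),
        key=lambda path: path.stat().st_mtime,
        reverse=True,
    )


reports = find_reports(".")
for report in reports:
    print(report.name)
if not reports:
    print("No HTML report found in the current directory.")
```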

>**Exercise:** Download and view the reports that were generated. How many processes were executed? How many CPUs were used by them, and how much memory? Which process took the longest to complete?

