<a href="https://colab.research.google.com/github/GoekeLab/sg-nex-data/blob/master/docs/colab/Introduction_Genomics_3_GoogleColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Genomics Workshop 3: A long read RNA-Seq pipeline in Nextflow

Bioinformatics pipelines often chain multiple tools to generate the final output. In this workshop we will use a workflow manager (Nextflow) to automatically execute a long read RNA-Seq workflow. We will be using long read Nanopore RNA-Seq data from the Singapore Nanopore Expression Project (SG-NEx).


### Using Google Colab

This tutorial requires access to a shell (e.g. on Linux, macOS, or the Windows Subsystem for Linux/WSL). If you do not have access to a shell, you can run this tutorial on Google Colab by clicking the badge at the top of this page.

If you use Google Colab, prefix any shell command with `!` to execute it in a subshell. To change the working directory, use `%cd` instead, which applies the change to the whole notebook session.
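
The difference between the two prefixes can be reproduced in plain Python. The sketch below is only an analogy, not how Colab implements `!` and `%cd`: a `cd` run in a child shell is lost when that shell exits, while `os.chdir` changes the directory of the current process.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    # Analogous to `! cd <dir>`: the change happens in a child shell
    # and is gone as soon as that shell exits.
    subprocess.run(f'cd "{tmp}"', shell=True, check=True)
    after_subshell = os.getcwd()

    # Analogous to `%cd <dir>`: the change applies to the current process.
    os.chdir(tmp)
    changed = os.path.realpath(os.getcwd()) == os.path.realpath(tmp)

    # Go back before the temporary directory is deleted.
    os.chdir(start_dir)

print("subshell cd persisted:", after_subshell != start_dir)
print("os.chdir persisted:", changed)
```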

### Installation

This cell loads the rpy2 extension, which enables the execution of R commands from Google Colab (using the Python template):

In [None]:
%load_ext rpy2.ipython

Software will be downloaded into the `software` directory:

In [None]:
! mkdir software 
%cd software


Installation of Nextflow:

In [None]:
! curl -s https://get.nextflow.io | bash
! sudo ln -s /content/software/nextflow /usr/bin/nextflow


Installation of the AWS CLI:

In [None]:
! python -m pip install awscli

Installation of Minimap2:

In [None]:
! curl -L https://github.com/lh3/minimap2/releases/download/v2.26/minimap2-2.26_x64-linux.tar.bz2 | tar -jxvf -
! sudo ln -s /content/software/minimap2-2.26_x64-linux/minimap2 /usr/bin/minimap2

Installation of samtools:

In [None]:
! sudo apt install -y samtools

Installation of Bambu (can take 30 minutes):

In [None]:
%%R
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("bambu", update=FALSE)


In [None]:
%cd ..
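
Before moving on, it can help to confirm that the command line tools installed above are on the `PATH`. The following is a minimal sketch using only the Python standard library. The helper name `find_tools` is made up for this example, and the check does not cover the Bambu R package.

```python
import shutil

# Command names as installed in the cells above.
TOOLS = ["nextflow", "aws", "minimap2", "samtools"]

def find_tools(names):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in find_tools(TOOLS).items():
    print(f"{name:10s} {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` needs its installation cell to be rerun before the workflow is started.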

### Data Download 

The Singapore Nanopore Expression Project (SG-NEx) has generated a comprehensive resource of long read RNA-Seq data using the Oxford Nanopore third-generation sequencing platform. The data is hosted on the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/) and described in detail here: <https://github.com/GoekeLab/sg-nex-data>

For this workshop we will be using a reduced data set that only includes data from human chromosome 22. The data can be accessed using the AWS command line interface (or using direct links, which you can find in the online documentation).

In [None]:
! mkdir -p workshop/reference
! mkdir workshop/fastq
! mkdir workshop/nextflow


In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa workshop/reference/
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa.fai workshop/reference/
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.gtf workshop/reference/

In [None]:
! aws s3 sync --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/ workshop/fastq/
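
If the AWS CLI is not available, public S3 objects can usually also be fetched over HTTPS. The sketch below converts an `s3://` URI into the generic virtual-hosted-style S3 URL. It assumes the bucket is reachable through the global `s3.amazonaws.com` endpoint, which has not been verified here, and the helper name `s3_uri_to_https` is made up for this example. The direct links in the SG-NEx documentation remain the authoritative source.

```python
from urllib.parse import quote, urlparse

def s3_uri_to_https(uri):
    """Convert s3://bucket/key to a virtual-hosted-style HTTPS URL (assumed endpoint)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Percent-encode the object key but keep the path separators.
    key = quote(parsed.path.lstrip("/"))
    return f"https://{parsed.netloc}.s3.amazonaws.com/{key}"

print(s3_uri_to_https(
    "s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.gtf"
))
```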

### Workflow execution

In [None]:
%cd workshop

In [None]:
! wget "https://raw.githubusercontent.com/GoekeLab/sg-nex-data/master/docs/colab/workflow_longReadRNASeq.nf"  -P nextflow/

In [None]:
! ls nextflow/

In [None]:
! nextflow run nextflow/workflow_longReadRNASeq.nf -with-report -resume \
      --reads $PWD/fastq/A549_directRNA_sample2.fastq.gz \
      --refFa $PWD/reference/hg38_chr22.fa \
      --refGtf $PWD/reference/hg38_chr22.gtf \
      --outdir $PWD/results/
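
In the command above, single-dash options (`-with-report`, `-resume`) are handled by Nextflow itself, while double-dash options (`--reads`, `--refFa`, `--refGtf`, `--outdir`) are passed to the workflow script as parameters. The sketch below assembles the same command in Python and lists input files that are missing, which is a common cause of a failed first run. The helper names `build_command` and `missing_inputs` are made up for this example.

```python
import shlex
from pathlib import Path

def build_command(workdir):
    """Assemble the nextflow call with absolute paths, mirroring the $PWD usage above."""
    workdir = Path(workdir).resolve()
    return [
        "nextflow", "run", "nextflow/workflow_longReadRNASeq.nf",
        "-with-report", "-resume",          # Nextflow options
        "--reads", str(workdir / "fastq" / "A549_directRNA_sample2.fastq.gz"),
        "--refFa", str(workdir / "reference" / "hg38_chr22.fa"),
        "--refGtf", str(workdir / "reference" / "hg38_chr22.gtf"),
        "--outdir", str(workdir / "results"),
    ]

def missing_inputs(command):
    """Return the values of the input parameters that do not exist on disk."""
    input_flags = {"--reads", "--refFa", "--refGtf"}
    return [value for flag, value in zip(command, command[1:])
            if flag in input_flags and not Path(value).exists()]

command = build_command(".")
print(shlex.join(command))
print("missing inputs:", missing_inputs(command))
```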

In [None]:
! ls -lh results/

In [None]:
! head results/counts_transcript.txt
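
For a closer look than `head` allows, the count table can be parsed in Python. The sketch below assumes a tab-separated file with one header line, transcript and gene identifiers in the first two columns, and per-sample counts from the third column on. Check this against the `head` output before relying on it. The column names and rows in `EXAMPLE` are made up for illustration and are not SG-NEx results.

```python
import csv
import io

# Made-up rows for illustration only; these are not SG-NEx results.
EXAMPLE = (
    "TXNAME\tGENEID\tsample_1\n"
    "tx_a\tgene_1\t12.0\n"
    "tx_b\tgene_1\t0.0\n"
    "tx_c\tgene_2\t3.5\n"
)

def top_transcripts(handle, n=2):
    """Return the n rows with the highest count in the first sample column."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    rows = [(row[0], row[1], float(row[2])) for row in reader if row]
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]

for tx, gene, count in top_transcripts(io.StringIO(EXAMPLE)):
    print(tx, gene, count)
```

To apply this to the workflow output, pass an open file handle instead, for example `open("results/counts_transcript.txt")`.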

### Cache and resume

In [None]:
! cp nextflow/workflow_longReadRNASeq.nf nextflow/workflow_longReadRNASeq_original.nf
! sed -i 's/NDR=1/NDR=0/g' nextflow/workflow_longReadRNASeq.nf
! diff nextflow/*

In [None]:
! nextflow run nextflow/workflow_longReadRNASeq.nf -with-report -resume \
      --reads $PWD/fastq/A549_directRNA_sample2.fastq.gz \
      --refFa $PWD/reference/hg38_chr22.fa \
      --refGtf $PWD/reference/hg38_chr22.gtf \
      --outdir $PWD/results/

In [None]:
! head results/counts_transcript.txt
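
With `-resume`, Nextflow reuses cached results for tasks whose script and inputs are unchanged and reruns only the tasks affected by an edit. The sketch below is a conceptual model of that idea, not Nextflow's actual implementation. The task names `align` and `quantify` and the command strings are hypothetical placeholders, not taken from `workflow_longReadRNASeq.nf`.

```python
import hashlib

cache = {}      # task hash -> stored result
executed = []   # names of tasks that actually ran

def task_key(script, inputs):
    """Hash a task's script text together with its inputs."""
    digest = hashlib.sha256(script.encode())
    for item in inputs:
        digest.update(b"\0" + item.encode())
    return digest.hexdigest()

def run_task(name, script, inputs):
    """Run a task only if no cached result exists for the same script and inputs."""
    key = task_key(script, inputs)
    if key not in cache:
        executed.append(name)
        cache[key] = f"output of {name} [{key[:8]}]"
    return cache[key]

def pipeline(ndr):
    # Hypothetical two-step workflow: the second step depends on the first.
    aligned = run_task("align", "align reads", ["reads.fastq.gz", "ref.fa"])
    return run_task("quantify", f"quantify with NDR={ndr}", [aligned])

pipeline(ndr=1)   # first run: both tasks execute
pipeline(ndr=0)   # edited script: only the changed task executes again
print(executed)
```

In this model the alignment step is served from the cache on the second run because neither its script nor its inputs changed.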

### Clean working directories when the run is complete

In [None]:
! nextflow log -q

In [None]:
! nextflow clean -n

In [None]:
! nextflow clean -f

In [None]:
! nextflow log -q

In [None]:
! nextflow clean -n

In [None]:
! nextflow clean -f