[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GoekeLab/sg-nex-data/blob/master/docs/SG_NEx_FLAIR_tutorial_notebook.ipynb)



## **Isoform discovery and Quantification with FLAIR**

In this tutorial, you will learn how to use FLAIR to identify novel and known transcripts and create a custom transcriptome for your samples. We will then quantify the transcripts in each sample and compare the cell lines.

### **Using Google Colab**

This tutorial requires access to a shell (e.g. on Linux, macOS, or the Windows Subsystem for Linux/WSL). If you do not have access to a shell, you can run this tutorial on Google Colab by clicking the badge at the top.

If you use Google Colab, you have to add `!` before any shell command to execute it in a subshell. To change the working directory, use `%` instead (e.g. `%cd`), which applies the change to the whole notebook session.
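
The difference between `!` and `%` comes down to which process runs the command. The sketch below illustrates the same behavior in plain Python, outside of Colab: a `cd` executed in a child shell does not change the working directory of the parent process, while `os.chdir` (roughly what `%cd` does for the notebook) does. It assumes a POSIX `sh` is available; the temporary directory is only there for the demonstration.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.mkdtemp()

# Like `!cd target_dir`: the directory change happens in a child shell
# and is lost as soon as that shell exits.
subprocess.run(["sh", "-c", 'cd "$1"', "sh", target_dir], check=True)
after_subshell = os.getcwd()

# Like `%cd target_dir`: the directory change applies to the current process.
os.chdir(target_dir)
after_chdir = os.getcwd()

# Return to where we started so later commands are unaffected.
os.chdir(start_dir)

print("unchanged after subshell cd:", after_subshell == start_dir)
print("changed after os.chdir:", os.path.samefile(after_chdir, target_dir))
```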

## **Content**

- [Installation](#installation)
- [Data download](#data-download)
- [Running software](#running-software)
- [Reference](#reference)


## **Installation**

To download the SG-NEx data on Google Colab, we first need to install the AWS command line interface:

In [None]:
!pip install awscli

In [None]:
! aws --version

We also need to install minimap2, which is required to align the reads:

In [None]:
! curl -L https://github.com/lh3/minimap2/releases/download/v2.24/minimap2-2.24_x64-linux.tar.bz2 | tar -jxvf -
! ./minimap2-2.24_x64-linux/minimap2

In [None]:
! sudo ln -s /content/minimap2-2.24_x64-linux/minimap2 /usr/bin/minimap2

In [None]:
! minimap2
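
Before moving on, it can help to confirm that the tools installed so far are actually on the `PATH`. The following is a minimal sketch using only the Python standard library; `find_missing_tools` is a hypothetical helper name, not part of any of the tools used here.

```python
import shutil

def find_missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# Tools installed so far in this tutorial.
missing = find_missing_tools(["aws", "minimap2"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All tools found.")
```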

We recommend installing FLAIR through conda. To do this in Google Colab, we first need to install conda as shown below:

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

In [None]:
## This installs FLAIR into its own environment (note that the installation takes about 7.5 mins)
## -y answers the confirmation prompt automatically
! conda create -y -n flair -c conda-forge -c bioconda flair

Normally, we would next run `conda activate flair` (or `!source activate flair` in a notebook). However, since activating a conda environment does not carry across cells in Google Colab, we will activate the flair environment each time we run a FLAIR command.
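
Because the activation has to be repeated for every command, one option is to build the prefixed command string programmatically. The sketch below only constructs the string and does not run anything; `in_conda_env` is a hypothetical helper, and the example arguments are the `flair align` call used later in this tutorial.

```python
import shlex

def in_conda_env(command, env="flair"):
    """Build a shell command string that activates `env` before running `command`.

    `command` is a list of arguments; each one is quoted for the shell.
    """
    quoted = " ".join(shlex.quote(arg) for arg in command)
    return f"source activate {shlex.quote(env)} && {quoted}"

cmd = in_conda_env([
    "flair", "align",
    "-g", "./reference/hg38_chr22.fa",
    "-r", "combined_samples.fastq",
    "-o", "combined_samples.flair.aligned",
])
print(cmd)
```

The resulting string could then be passed to a shell, for example with `subprocess.run(cmd, shell=True, executable="/bin/bash")`, since `source` is a bash builtin and may not exist in other shells.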

## **Data download**

The Singapore Nanopore Expression Project (SG-NEx) has generated a comprehensive resource of long-read RNA sequencing data using the Oxford Nanopore third-generation sequencing platform. The data is hosted on the [AWS Open Data Registry](https://registry.opendata.aws/sgnex/) and described in detail here: <https://github.com/GoekeLab/sg-nex-data>

For this workshop we will be using a reduced data set which only includes data from the human chromosome 22. The data can be accessed using the AWS command line interface (or using direct links, which you can find in the online documentation).

In [None]:
! aws s3 ls --no-sign-request s3://sg-nex-data/data/data_tutorial/

In [None]:
! aws s3 ls --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/

In [None]:
! mkdir tutorial
! mkdir tutorial/reference
! mkdir tutorial/fastq

### **Download the reference genome and annotations**

In [None]:
%cd tutorial/reference/

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.fa.fai .

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/annotations/hg38_chr22.gtf .


### **Download fastq files**

In [None]:
%cd /content/tutorial/fastq/

In [None]:
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/HepG2_directRNA_sample1.fastq.gz .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/HepG2_directRNA_sample2.fastq.gz .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/HepG2_directRNA_sample3.fastq.gz .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/A549_directRNA_sample1.fastq.gz .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/A549_directRNA_sample2.fastq.gz .
! aws s3 cp --no-sign-request s3://sg-nex-data/data/data_tutorial/fastq/A549_directRNA_sample3.fastq.gz .
! gzip -d HepG2_directRNA_sample1.fastq.gz
! gzip -d HepG2_directRNA_sample2.fastq.gz
! gzip -d HepG2_directRNA_sample3.fastq.gz
! gzip -d A549_directRNA_sample1.fastq.gz
! gzip -d A549_directRNA_sample2.fastq.gz
! gzip -d A549_directRNA_sample3.fastq.gz
! ls -lh
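
The download and decompression commands above follow a single pattern over two cell lines and three replicates, so they can also be generated in a loop. The sketch below builds the same `aws s3 cp` commands as strings (without executing them) and shows a Python equivalent of `gzip -d`. The helper names are hypothetical, and `gunzip` is demonstrated on a tiny made-up FASTQ record rather than on the real data.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

S3_FASTQ_PREFIX = "s3://sg-nex-data/data/data_tutorial/fastq/"

def tutorial_fastq_names(cell_lines=("HepG2", "A549"), replicates=(1, 2, 3)):
    """List the compressed FASTQ file names used in this tutorial."""
    return [f"{cell}_directRNA_sample{rep}.fastq.gz"
            for cell in cell_lines for rep in replicates]

def download_command(name, prefix=S3_FASTQ_PREFIX):
    """Build the `aws s3 cp` command for one file (the command is not executed here)."""
    return f"aws s3 cp --no-sign-request {prefix}{name} ."

def gunzip(path):
    """Decompress a .gz file next to the original and delete the archive, like `gzip -d`."""
    src = Path(path)
    dst = src.with_suffix("")  # "x.fastq.gz" -> "x.fastq"
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    src.unlink()
    return dst

names = tutorial_fastq_names()
commands = [download_command(name) for name in names]
print("\n".join(commands))

# Demonstrate gunzip on a made-up record in a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
demo_gz = demo_dir / "demo.fastq.gz"
with gzip.open(demo_gz, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n")
demo_fastq = gunzip(demo_gz)
print(demo_fastq.name)
```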

In [None]:
%cd /content/tutorial/

For FLAIR to create a comprehensive transcriptome that contains the transcripts from all of your samples and allows you to compare between them, we need to start by combining all of our reads into a single file.

In [None]:
! cat fastq/*.fastq > combined_samples.fastq
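
A quick sanity check after combining is that the number of reads in the combined file equals the sum over the individual files. The sketch below mirrors the `cat` step in Python and counts reads, assuming uncompressed FASTQ with exactly four lines per read. The function names are hypothetical, and the demonstration uses two tiny made-up files instead of the tutorial data.

```python
import tempfile
from pathlib import Path

def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per read."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

def combine_fastq(paths, out_path):
    """Concatenate FASTQ files into `out_path` (like `cat`) and return the read count."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    out.write(line)
    return count_fastq_reads(out_path)

# Made-up example data: one file with two reads, one file with one read.
demo_dir = Path(tempfile.mkdtemp())
file_a = demo_dir / "a.fastq"
file_b = demo_dir / "b.fastq"
file_a.write_text("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
file_b.write_text("@r3\nTTAA\n+\nIIII\n")

total_reads = combine_fastq([file_a, file_b], demo_dir / "combined.fastq")
per_file_reads = count_fastq_reads(file_a) + count_fastq_reads(file_b)
print(total_reads, per_file_reads)
```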

## **Running software**

Now that we have combined our reads, we can use FLAIR align to align them. If you prefer, you can align them yourself using minimap2 and then use bedtools to convert the aligned BAM file to a BED file. FLAIR align simply applies what we consider to be optimal parameters for long-read RNA transcript alignment.

In [None]:
! source activate flair && flair align -g ./reference/hg38_chr22.fa -r combined_samples.fastq -o combined_samples.flair.aligned
! ls -lh

Next we run FLAIR correct, which takes the BED file of aligned reads and corrects the alignments to the annotated splice sites. If you have short-read junctions, you can pass those to FLAIR with --shortread for even more precision. You can run FLAIR without the correct step, but you may find more novel transcripts that differ by only a couple of bases at the splice site, likely due to initial misalignment of reads.
Note that this step takes about 7.5 mins.
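
To make the idea of splice site correction more concrete, the sketch below extracts intron coordinates from a BED12 line and moves a position to the nearest annotated site within a small window. This is an illustration of the concept only, not FLAIR's implementation: `bed12_introns`, `snap_to_annotation`, the `window` value, and the example line and annotated sites are all made up for this example.

```python
def bed12_introns(line):
    """Return (chrom, strand, introns) for one BED12 line.

    Introns are (start, end) tuples in 0-based, half-open BED coordinates,
    derived from the gaps between consecutive alignment blocks.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, start, strand = fields[0], int(fields[1]), fields[5]
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    introns = []
    for i in range(len(sizes) - 1):
        intron_start = start + starts[i] + sizes[i]  # end of block i
        intron_end = start + starts[i + 1]           # start of block i + 1
        introns.append((intron_start, intron_end))
    return chrom, strand, introns

def snap_to_annotation(position, annotated_sites, window=10):
    """Move `position` to the nearest annotated site within `window` bases.

    If no annotated site is close enough, the position is returned unchanged.
    """
    candidates = [s for s in annotated_sites if abs(s - position) <= window]
    if not candidates:
        return position
    return min(candidates, key=lambda s: abs(s - position))

# Made-up read with three blocks: [1000,1100), [1400,1600), [1700,2000).
example = "chr22\t1000\t2000\tread1\t60\t+\t1000\t2000\t0\t3\t100,200,300,\t0,400,700,"
chrom, strand, introns = bed12_introns(example)

# Made-up annotated intron starts; the first intron is 2 bases off.
annotated_starts = [1102, 1600]
corrected_starts = [snap_to_annotation(s, annotated_starts) for s, _ in introns]
print(introns, corrected_starts)
```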

In [None]:
! source activate flair && flair correct -g ./reference/hg38_chr22.fa --gtf ./reference/hg38_chr22.gtf -q combined_samples.flair.aligned.bed -o combined_samples.flair
! ls -lh

FLAIR collapse is next, which does the actual isoform identification. We will run it with several non-default options, which are recommended:

- `--generate_map` generates a file showing which reads support which isoforms.
- `--stringent` requires that all reads supporting an isoform cover >= 80% of the isoform. This is best if you are confident that you don't have too many truncated reads.
- `--annotation_reliant` first checks for reads that match well to the reference annotation, then identifies novel transcripts from the rest. This reduces novel isoform discovery.
- `--check_splice` might be the most important one: it requires good coverage of the bases at the splice sites and ensures that reads confidently support the correct intron chain.
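
The 80% coverage requirement behind `--stringent` can be pictured as follows. This is a simplified sketch of the idea, not FLAIR's actual code: it assumes that reads and isoforms are given as lists of non-overlapping (start, end) exon intervals, and the function names and coordinates are invented for the example.

```python
def overlap(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def covered_fraction(read_blocks, isoform_exons):
    """Fraction of the isoform's exonic bases that the read's blocks cover."""
    isoform_length = sum(end - start for start, end in isoform_exons)
    covered = sum(overlap(block, exon)
                  for block in read_blocks for exon in isoform_exons)
    return covered / isoform_length

def passes_stringent(read_blocks, isoform_exons, min_fraction=0.8):
    """True if the read covers at least `min_fraction` of the isoform."""
    return covered_fraction(read_blocks, isoform_exons) >= min_fraction

# Made-up isoform with 400 exonic bases.
isoform = [(100, 200), (300, 500), (600, 700)]
nearly_full_read = [(150, 200), (300, 500), (600, 700)]  # covers 350 of 400 bases
truncated_read = [(350, 500), (600, 700)]                # covers 250 of 400 bases

print(covered_fraction(nearly_full_read, isoform), passes_stringent(nearly_full_read, isoform))
print(covered_fraction(truncated_read, isoform), passes_stringent(truncated_read, isoform))
```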


Your output from FLAIR collapse will consist of a number of files, but the most important are:

- `prefix.isoforms.gtf`: your custom transcriptome, which you can align to if you want.
- `prefix.isoforms.bed`: the easiest way to visualize your isoforms on the UCSC genome browser or IGV; it can also be useful for FLAIR quantify.
- `prefix.combined.isoform.read.map.txt`: all detected isoforms, associated with the reads that support them.
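
The read map is useful for checking how many reads support each isoform. The sketch below assumes that each line of the file holds an isoform identifier, a tab, and a comma-separated list of read identifiers; please check this against your own output file before relying on it. `parse_read_map` is a hypothetical helper, and the example lines are made up.

```python
def parse_read_map(lines):
    """Map each isoform identifier to the list of reads that support it.

    Assumed line format: "<isoform_id>\t<read1>,<read2>,...".
    """
    support = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        isoform, reads = line.split("\t", 1)
        support[isoform] = [read for read in reads.split(",") if read]
    return support

# Made-up example lines (not real tutorial output).
example_lines = [
    "isoformA_geneX\tread1,read2,read3\n",
    "isoformB_geneX\tread4\n",
]
support = parse_read_map(example_lines)
read_counts = {isoform: len(reads) for isoform, reads in support.items()}
print(read_counts)
```

With a real file, the same function could be applied to an open file handle, for example `parse_read_map(open(path))`.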


In [None]:
! source activate flair && flair collapse -g ./reference/hg38_chr22.fa --gtf ./reference/hg38_chr22.gtf -q combined_samples.flair_all_corrected.bed -r combined_samples.fastq --annotation_reliant generate --generate_map --check_splice --stringent --output combined_samples.flair.collapse

! ls -lh

Here we build a sample manifest file that contains the locations of the reads belonging to our individual samples. If you are planning on running diffExp or diffSplice, make sure your second column (the condition column) matches the different conditions you want to test. Also make sure your sample identifiers (first column) don't contain underscores.

In [None]:
samples = ['HepG2_directRNA_sample1.fastq', 'HepG2_directRNA_sample2.fastq', 'HepG2_directRNA_sample3.fastq', 'A549_directRNA_sample1.fastq', 'A549_directRNA_sample2.fastq', 'A549_directRNA_sample3.fastq']

path = "sample_manifest.tsv"
with open(path, 'w') as f:
  for s in samples:
    cell_line = s.split('_')[0]                     # e.g. HepG2
    replicate = s.split('.')[0].split('sample')[1]  # e.g. 1
    # Columns: sample id (no underscores), condition, batch, path to the reads
    f.write('\t'.join([cell_line + '-' + replicate, cell_line, 'batch1', '/content/tutorial/fastq/' + s]) + '\n')

! cat sample_manifest.tsv
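
Since a malformed manifest is an easy mistake to make, a small check before running the next step can save time. The sketch below tests only the rules stated above (four tab-separated columns, no underscores in the sample identifier) plus uniqueness of the identifiers; `validate_manifest` is a hypothetical helper and the example lines are for illustration.

```python
def validate_manifest(lines):
    """Return a list of human-readable problems found in manifest lines."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            problems.append(f"line {number}: expected 4 tab-separated columns, found {len(fields)}")
            continue
        sample_id, condition, batch, reads_path = fields
        if "_" in sample_id:
            problems.append(f"line {number}: sample id '{sample_id}' contains an underscore")
        if sample_id in seen_ids:
            problems.append(f"line {number}: duplicate sample id '{sample_id}'")
        seen_ids.add(sample_id)
    return problems

good_lines = [
    "HepG2-1\tHepG2\tbatch1\t/content/tutorial/fastq/HepG2_directRNA_sample1.fastq\n",
    "A549-1\tA549\tbatch1\t/content/tutorial/fastq/A549_directRNA_sample1.fastq\n",
]
bad_lines = [
    "HepG2_1\tHepG2\tbatch1\t/content/tutorial/fastq/HepG2_directRNA_sample1.fastq\n",
    "A549-1\tA549\tbatch1\n",
]
good_problems = validate_manifest(good_lines)
bad_problems = validate_manifest(bad_lines)
print(good_problems)
print(bad_problems)
```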

FLAIR quantify will realign the reads from each sample to your custom transcriptome that you created with FLAIR-collapse and quantify them accordingly. --stringent and --check_splice work as described above - if you have truncated or shorter reads, these may be more stringent than you'd like.

In [None]:
! source activate flair && flair quantify -r sample_manifest.tsv -i combined_samples.flair.collapse.isoforms.fa --generate_map --isoform_bed combined_samples.flair.collapse.isoforms.bed --stringent --check_splice
! ls -lht | head -n 2

In [None]:
! head flair.quantify.counts.tsv
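
The rule about underscores in sample identifiers becomes clearer when looking at the column names of the counts table. The sketch below assumes that each sample column is named by joining the first three manifest fields with underscores (for example `HepG2-1_HepG2_batch1`); compare this with the header printed above before relying on it. `split_sample_column` is a hypothetical helper.

```python
def split_sample_column(name):
    """Split an assumed "<sample id>_<condition>_<batch>" column name into its parts."""
    parts = name.split("_")
    if len(parts) != 3:
        raise ValueError(f"cannot split '{name}' into id, condition and batch")
    return dict(zip(("sample_id", "condition", "batch"), parts))

parsed = split_sample_column("HepG2-1_HepG2_batch1")
print(parsed)

# An underscore inside the sample id makes the column name ambiguous.
try:
    split_sample_column("HepG2_1_HepG2_batch1")
    ambiguous_name_rejected = False
except ValueError:
    ambiguous_name_rejected = True
print(ambiguous_name_rejected)
```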

Normally after FLAIR quantify, we would run FLAIR diffExp and/or diffSplice to identify transcripts and splice sites with differential usage. However, since we are currently running on small subsets, these modules won't work. Therefore, I will quickly show a comparison between these two cell lines using t-tests. Please keep in mind that you will only be seeing a small subset of the differences between these samples.

In [None]:
! pip install pandas
! pip install scipy

In [None]:
import pandas as pd
from scipy import stats

df = pd.read_csv("flair.quantify.counts.tsv", sep="\t")
df

In [None]:
hepcol = df.columns[df.columns.str.startswith('HepG2')]
a549col = df.columns[df.columns.str.startswith('A549')]

In [None]:
# For each isoform (row), compare the HepG2 counts with the A549 counts using a two-sample t-test
df['ttest_pval'] = df.apply(lambda row: stats.ttest_ind(list(row[hepcol]), list(row[a549col])).pvalue, axis=1)
df
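
With its default settings, `stats.ttest_ind` performs a two-sample t-test that assumes equal variances in both groups. To show what is computed for each isoform, the sketch below calculates the corresponding t statistic and degrees of freedom by hand using only the standard library. The counts are made up, and the p-value (which additionally requires the t distribution) is left to SciPy.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a two-sample t-test with pooled variance.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    na, nb = len(a), len(b)
    dof = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / dof
    standard_error = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(a) - mean(b)) / standard_error, dof

# Made-up counts for one isoform in three replicates per cell line.
hepg2_counts = [10, 12, 14]
a549_counts = [20, 22, 24]
t_stat, dof = pooled_t_statistic(hepg2_counts, a549_counts)
print(round(t_stat, 4), dof)
```

With only three replicates per group, such a test has little power, which is one more reason to treat the results in this tutorial as an illustration.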

Below we identify isoforms that are differentially expressed (p<0.05) according to our t-test. We can see that this reduces the list considerably. I hope that after going through this tutorial, you feel ready to run FLAIR on bigger datasets and discover novel insights about transcriptomic changes and diversity!

In [None]:
df_filtered = df.loc[df['ttest_pval'] < 0.05].copy()
# Split on the last underscore only, in case an isoform name itself contains underscores
df_filtered[['IsoName', 'GeneName']] = df_filtered.ids.str.rsplit("_", n=1, expand=True)
df_filtered = df_filtered.sort_values(by=['GeneName'])
df_filtered

## **Reference**

The paper describing FLAIR can be found at: https://rdcu.be/djHvm

The FLAIR documentation can be found at: https://flair.readthedocs.io/en/latest/index.html

If you use the dataset from SG-NEx in your work, please cite the following paper.

Chen, Ying, et al. “A systematic benchmark of Nanopore long read RNA
sequencing for transcript level analysis in human cell lines.” bioRxiv
(2021). doi: <https://doi.org/10.1101/2021.04.21.440736>