IGDRion/ANNEXA: Nextflow pipeline to extend a reference annotation with nanopore reads and classify novel genes (mRNAs vs lncRNAs).

Introduction

ANNEXA is an all-in-one reproducible pipeline, written with the Nextflow workflow system, that lets users analyze LR-RNAseq (Long-Read RNA-seq) data and reconstruct and quantify known and novel genes and transcript isoforms.

Pipeline summary

ANNEXA takes only three inputs (a reference genome, a reference annotation and mapping files) and provides users with an extended annotation that distinguishes novel protein-coding (mRNA) genes from novel long non-coding RNA (lncRNA) genes. All known and novel gene/transcript models are further characterized through multiple features (length, number of spliced transcripts, normalized expression levels, ...), available as graphical outputs.

Check if the input annotation contains all the information needed.
Transcriptome reconstruction and quantification with bambu or StringTie.
Classify novel genes as protein-coding or lncRNA with FEELnc.
Retrieve information from the input annotation and format the final GTF with a three-level structure: gene -> transcript -> exon.
Predict the CDS of novel protein-coding transcripts with TransDecoder.
Classify novel transcripts with class codes from GffCompare.
Filter novel transcripts based on bambu NDR (Novel Discovery Rate) and/or TransforKmers TSS validation to assess full-length transcripts.
Perform a quality control of both the full and filtered extended annotations (see example).
Optional: Check gene body coverage with RSeQC.

This pipeline has been tested with reference annotation from Ensembl and NCBI-RefSeq.
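
The three-level structure mentioned above (gene -> transcript -> exon) can be checked on an annotation before running the pipeline. The following is a minimal sketch, not ANNEXA's own validation code: the function name `check_gtf_hierarchy` is hypothetical, and it assumes standard 9-column GTF records with quoted `gene_id` and `transcript_id` attributes.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def check_gtf_hierarchy(lines):
    """Return a list of problems in the gene -> transcript -> exon hierarchy.

    Hypothetical helper for illustration; it only looks at gene_id and
    transcript_id, not at the other fields a pipeline may require.
    """
    genes = set()
    transcripts = {}  # transcript_id -> gene_id
    exon_parents = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a GTF record; a stricter checker would report it
        feature = fields[2]
        attrs = dict(ATTR_RE.findall(fields[8]))
        if feature == "gene":
            genes.add(attrs.get("gene_id"))
        elif feature == "transcript":
            transcripts[attrs.get("transcript_id")] = attrs.get("gene_id")
        elif feature == "exon":
            exon_parents.append(attrs.get("transcript_id"))

    problems = []
    for tx_id, gene_id in transcripts.items():
        if gene_id not in genes:
            problems.append(f"transcript {tx_id} has no gene record ({gene_id})")
    for tx_id in sorted(set(exon_parents) - set(transcripts)):
        problems.append(f"exon refers to missing transcript {tx_id}")
    return problems
```

An empty list means every exon has a transcript record and every transcript has a gene record; anything else points at records that would need to be added or repaired.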

Usage

Install Nextflow
Test the pipeline on a small dataset

nextflow run IGDRion/ANNEXA \
    -profile test,singularity

Run ANNEXA on your own data (replace the input, gtf and fa values with the paths to your files).

nextflow run IGDRion/ANNEXA \
    -profile {test,docker,singularity,conda,slurm} \
    --input samples.txt \
    --gtf /path/to/ref.gtf \
    --fa /path/to/ref.fa

The input parameter takes a text file listing the paths of the BAM files to analyze, one per line (see example below):

/path/to/1.bam
/path/to/2.bam
/path/to/3.bam
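
A file like the one above can be generated rather than typed by hand. This is a minimal sketch, assuming all BAM files sit in a single directory; the helper name `write_sample_list` is hypothetical and not part of ANNEXA.

```python
from pathlib import Path


def write_sample_list(bam_dir, out_path="samples.txt"):
    """Write the absolute path of every .bam file in bam_dir, one per line."""
    # Resolve first so the listed paths are absolute, as in the example above.
    bam_paths = sorted(Path(bam_dir).resolve().glob("*.bam"))
    if not bam_paths:
        raise FileNotFoundError(f"no .bam files found in {bam_dir}")
    Path(out_path).write_text("".join(f"{p}\n" for p in bam_paths))
    return bam_paths
```

The resulting file can then be passed to the pipeline with `--input samples.txt`.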

Options

Required options
  --input               [string]  Path to file listing paths to bam files.
  --fa                  [string]  Path to reference genome.
  --gtf                 [string]  Path to reference annotation.

Profile options
  -profile test         [string]  Run annexa on toy dataset.
  -profile slurm        [string]  Run annexa on slurm executor.
  -profile singularity  [string]  Run annexa in singularity container.
  -profile conda        [string]  Run annexa in conda environment.
  -profile docker       [string]  Run annexa in docker container.

Main options
  --tx_discovery        [string]  Specify which transcriptome reconstruction tool to use. (accepted: bambu, stringtie2) [default: bambu]
  --filter              [boolean] Whether to perform the filtering step. [default: true]
  --withGeneCoverage    [boolean] Run RSeQC (can be slow depending on annotation and BAM sizes). [default: false]

Bambu options
  --bambu_strand        [boolean] Run bambu with stranded data [default: true]
  --bambu_singleexon    [boolean] Whether to include single-exon transcripts in bambu output. These are known to have a high frequency of false positives.
                                  [default: true]
  --bambu_threshold     [number]  bambu NDR threshold below which new transcripts are retained. [default: 0.2]
  --bambu_rec_ndr       [boolean] Use NDR threshold recommended by Bambu instead of preset threshold. [default: false]

Filtering options
  --tfkmers_tokenizer   [string]  Path to TransforKmers tokenizer. Required if filter option is activated.
  --tfkmers_model       [string]  Path to TransforKmers model. Required if filter option is activated.
  --tfkmers_threshold   [number]  TransforKmers prediction threshold below which new transcripts are retained. [default: 0.2]
  --operation           [string]  Operation used to retain novel transcripts. 'union' retains tx validated by either bambu or TransforKmers, 'intersection' retains
                                  tx validated by both. (accepted: union, intersection) [default: intersection]

Performance options
  --maxCpu              [integer] Max cpu threads used by ANNEXA. [default: 8]
  --maxMemory           [integer] Max memory (in GB) used by ANNEXA. [default: 40]

Nextflow options
  -resume               [null]    Resume task from cached work (useful for recovering from errors when using singularity).
  -with-report          [null]    Create an HTML execution report with metrics such as resource usage for each workflow process.

If the filter argument is set to true, the TransforKmers model and tokenizer paths must be given. They can either be downloaded from the official TransforKmers repository or trained in advance on your own data.

Filtering step

When the filtering step is activated (--filter), ANNEXA filters the generated extended annotation using two methods:

By using the NDR proposed by bambu. This score combines several pieces of information, such as sequence profile, structure (mono-exonic, etc.) and quantification (number of samples, expression). Each transcript with an NDR below the classification threshold is retained by ANNEXA (default: 0.2).
By analysing the Transcription Start Site (TSS) of each new transcript with TransforKmers, a deep-learning-based tool. Each transcript whose TSS is validated below the threshold is retained (default: 0.2). Two trained models are already provided for filtering TSS with TransforKmers.

To use them, extract the zip, and point --tfkmers_model and --tfkmers_tokenizer to the subdirectories.

The filtered annotation can be the union of the results of these two tools, i.e. all the transcripts validated by one or both of them, or the intersection, i.e. only the transcripts validated by both tools (the latter being the default). Please feel free to see the dedicated wiki page.

At the end, the QC steps are performed both on the full and filtered extended annotations.
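
The union and intersection behaviour described above can be sketched as follows. This is an illustration of the logic only, not ANNEXA's implementation: the function name `select_novel_transcripts` and the input layout are hypothetical, and the strict "below the threshold" comparison for both scores follows the wording of this README and is an assumption.

```python
def select_novel_transcripts(scores, ndr_threshold=0.2, tfkmers_threshold=0.2,
                             operation="intersection"):
    """Return the ids of the novel transcripts that pass the filter.

    scores maps a transcript id to (bambu NDR, TransforKmers TSS score).
    Hypothetical helper; thresholds default to the documented 0.2.
    """
    if operation not in ("union", "intersection"):
        raise ValueError(f"unknown operation: {operation!r}")
    kept = []
    for tx_id, (ndr, tss_score) in scores.items():
        # Assumption: a transcript is validated when its score is strictly
        # below the corresponding threshold.
        bambu_ok = ndr < ndr_threshold
        tfkmers_ok = tss_score < tfkmers_threshold
        if operation == "union":
            keep = bambu_ok or tfkmers_ok
        else:
            keep = bambu_ok and tfkmers_ok
        if keep:
            kept.append(tx_id)
    return kept
```

With `operation="intersection"` a transcript must pass both checks, so the result is always a subset of what `operation="union"` returns for the same scores and thresholds.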
