TAGADA: Transcript And Gene Assembly, Deconvolution, Analysis

TAGADA is a Nextflow pipeline that processes RNA-Seq data. It runs tasks in parallel to check read quality, align reads to a reference genome, assemble new transcripts into a novel annotation, and quantify genes and transcripts.

Dependencies

To use this pipeline you will need:

Nextflow >= 21.04.1
Docker >= 19.03.2 or Singularity >= 3.7.3

Usage

A small dataset is provided to test this pipeline. To try it out, use this command:

nextflow run FAANG/analysis-TAGADA -profile test,docker -revision 2.1.2 --output directory

Nextflow options

The pipeline is written in Nextflow, which provides the following default options:

Option	Example	Description	Required
`-profile`	`profile1,profile2,etc.`	Profile(s) to use when running the pipeline. Specify the profiles that fit your infrastructure among `singularity`, `docker`, `kubernetes`, `slurm`.	Required
`-config`	`custom.config`	Configuration file tailored to your infrastructure and dataset. To find a configuration file for your infrastructure, browse nf-core configs. Some large datasets require more computing resources than the pipeline defaults. To specify custom resources for specific processes, see the custom resources section.	Optional
`-revision`	`version`	Version of the pipeline to launch.	Optional
`-work-dir`	`directory`	Work directory where all temporary files are written.	Optional
`-resume`		Resume the pipeline from the last completed process.	Optional

For more Nextflow options, see Nextflow's documentation.

Input and output options

Option	Example	Description	Required
`--output`	`directory`	Output directory where all results are written.	Required
`--reads`	`'path/to/reads/*'`	Input `fastq` file(s) and/or `bam` file(s). Single-end read files must end with `.fq[.gz]`, paired-end read files with `_[R]{1,2}.fq[.gz]`, and aligned read files with `.bam`. If the path includes a wildcard character such as `*`, enclose it in quotes to prevent Bash glob expansion, as Nextflow requires. If there are many files, you may instead provide a `.txt` sheet with one path or URL per line.	Required
`--annotation`	`annotation.gtf`	Input reference annotation file or URL. Note that this file should contain both exon and transcript rows, with `gene_id` and `transcript_id` in the 9th field.	Required
`--genome`	`genome.fa`	Input genome sequence file or url.	Required
`--index`	`directory`	Input genome index directory or url.	Optional, to skip genome indexing
`--metadata`	`metadata.tsv`	Input tabulated metadata file or URL.	Required if `--assemble-by` or `--quantify-by` is provided
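
The requirements on the `--annotation` file can be checked before launching a run. The following is a minimal sketch, not part of TAGADA: the function name `check_annotation` is hypothetical, and it assumes a tab-separated GTF file whose 9th field uses the usual `key "value";` attribute syntax.

```shell
# check_annotation FILE (hypothetical helper, not shipped with TAGADA):
# report whether a GTF file has both exon and transcript rows, and whether
# every exon or transcript row carries gene_id and transcript_id in field 9.
check_annotation() {
  awk -F '\t' '
    /^#/ { next }                               # skip comment lines
    $3 == "exon" || $3 == "transcript" {
      seen[$3]++
      if ($9 !~ /(^|[; ])gene_id "[^"]+"/ || $9 !~ /(^|[; ])transcript_id "[^"]+"/) missing++
    }
    END {
      status = 0
      if (!seen["exon"])       { print "no exon rows"; status = 1 }
      if (!seen["transcript"]) { print "no transcript rows"; status = 1 }
      if (missing)             { print missing " rows lack gene_id or transcript_id"; status = 1 }
      if (status == 0) print "ok"
      exit status
    }
  ' "$1"
}

# Usage: check_annotation annotation.gtf
```

The function prints `ok` and returns 0 when all checks pass; otherwise it prints one line per problem and returns 1.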

Merge options

Option	Example	Description	Required
`--assemble-by`	`factor1,factor2,etc.`	Factor(s) defining groups in which transcripts are assembled. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the merging inputs section for details.	Optional
`--quantify-by`	`factor1,factor2,etc.`	Factor(s) defining groups in which transcripts are quantified. Aligned reads of identical factors are merged and each resulting merge group is processed individually. See the merging inputs section for details.	Optional

Assembly options

Option	Example	Description	Required
`--min-transcript-occurrence`	`2`	After transcripts assembly, rare novel transcripts that appear in few assembly groups are removed from the final novel annotation. By default, if a transcript occurs in fewer than `2` assembly groups, it is removed. If there is only one assembly group, this option defaults to `1`.	Optional
`--min-monoexonic-occurrence`	`2`	If specified, rare novel monoexonic transcripts are filtered according to the provided threshold. Otherwise, this option takes the value of `--min-transcript-occurrence`.	Optional
`--min-transcript-tpm`	`0.1`	After transcripts assembly, novel transcripts with low TPM values in every assembly group are removed from the final novel annotation. By default, if a transcript's TPM value is lower than `0.1` in every assembly group, it is removed.	Optional
`--min-monoexonic-tpm`	`1`	If specified, novel monoexonic transcripts with low TPM values are filtered according to the provided threshold. Otherwise, this option takes the value of `--min-transcript-tpm * 10`.	Optional
`--coalesce-transcripts-with`	`tmerge`	Tool used to coalesce transcripts assemblies into a non-redundant set of transcripts for the novel annotation. Can be `tmerge` or `stringtie`. Defaults to `tmerge`.	Optional
`--tmerge-args`	`'--endFuzz 10000'`	Custom arguments to pass to tmerge when coalescing transcripts.	Optional
`--feelnc-filter-args`	`'--size 200'`	Custom arguments to pass to FEELnc's filter script when detecting long non-coding transcripts.	Optional
`--feelnc-codpot-args`	`'--mode shuffle'`	Custom arguments to pass to FEELnc's coding potential script when detecting long non-coding transcripts.	Optional
`--feelnc-classifier-args`	`'--window 10000'`	Custom arguments to pass to FEELnc's classifier script when detecting long non-coding transcripts.	Optional
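
The way the four filtering thresholds and their defaults interact can be illustrated with a small sketch. This is not the pipeline's own filtering code. It assumes a hypothetical tab-separated table with a header row, a transcript identifier, an exon count, and one TPM column per assembly group, where `NA` marks a transcript that is absent from a group. It also assumes that a transcript is kept only when it passes both the occurrence and the TPM threshold.

```shell
# filter_transcripts FILE [MIN_OCC] [MIN_TPM] [MONO_OCC] [MONO_TPM]
# Hypothetical helper illustrating the documented defaults:
#   MIN_OCC  defaults to 2, or 1 when there is a single assembly group
#   MIN_TPM  defaults to 0.1
#   MONO_OCC defaults to MIN_OCC
#   MONO_TPM defaults to MIN_TPM * 10
# Prints the identifiers of the transcripts that pass both filters.
filter_transcripts() {
  awk -F '\t' -v occ="${2:-}" -v tpm="${3:-0.1}" -v mono_occ="${4:-}" -v mono_tpm="${5:-}" '
    NR == 1 {                        # header row: resolve the defaults
      groups = NF - 2
      if (occ == "") occ = (groups > 1) ? 2 : 1
      if (mono_occ == "") mono_occ = occ
      if (mono_tpm == "") mono_tpm = tpm * 10
      next
    }
    {
      n = 0; best = 0
      for (i = 3; i <= NF; i++) {
        if ($i == "NA") continue     # transcript absent from this group
        n++
        if ($i + 0 > best) best = $i + 0
      }
      need_occ = ($2 == 1) ? mono_occ : occ   # monoexonic thresholds apply
      need_tpm = ($2 == 1) ? mono_tpm : tpm   # when the exon count is 1
      if (n >= need_occ && best >= need_tpm) print $1
    }
  ' "$1"
}

# Usage: filter_transcripts transcripts.tsv            (all defaults)
#        filter_transcripts transcripts.tsv "" "" "" 0.5   (custom MONO_TPM)
```

Passing an empty string for a positional argument leaves that threshold at its default, which mirrors omitting the corresponding option on the command line.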

Skip options

Option	Example	Description	Required
`--skip-assembly`		Skip transcripts assembly with StringTie and skip all subsequent processes working with a novel annotation.	Optional
`--skip-lnc-detection`		Skip detection of long non-coding transcripts in the novel annotation with FEELnc.	Optional

Resources options

Option	Example	Description	Required
`--max-cpus`	`16`	Maximum number of CPU cores that can be used for each process. This is a limit, not the actual number of requested CPU cores.	Optional
`--max-memory`	`64GB`	Maximum memory that can be used for each process. This is a limit, not the actual amount of allotted memory.	Optional
`--max-time`	`24h`	Maximum time that can be spent on each process. This is a limit and has no effect on the duration of each process.	Optional

Custom resources

With large datasets, some workflow processes may require more computing resources than the pipeline defaults. To customize the amount of resources allotted to specific processes, add a process scope to your configuration file. Resources provided in the configuration file override the resources options.

Example configuration

-config custom.config

custom.config

process {

  withName: TRIMGALORE_trim_adapters {
    cpus = 8
    memory = 18.GB
    time = 36.h
  }

  withName: STAR_align_reads {
    cpus = 16
    memory = 64.GB
    time = 2.d
  }

}

Metadata

Using --metadata, you may provide a file describing your inputs with tab-separated factors. The first column must contain file names without file type extensions or paired-end suffixes. There are no constraints on column names or number of columns.

Example metadata

--reads reads.txt --metadata metadata.tsv

reads.txt

path/to/A_R1.fq
path/to/A_R2.fq
path/to/B.fq.gz
path/to/C.bam
path/to/D.fq

metadata.tsv

input    tissue     stage
A        liver      30 days
B        liver      30 days
C        liver      60 days
D        muscle     60 days
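
To verify that every input listed in `reads.txt` has a row in `metadata.tsv`, the naming rule above can be sketched as follows. The helper names `input_name` and `missing_from_metadata` are hypothetical, and the suffix handling reflects one reading of the documented file name patterns (`_1`, `_2`, `_R1`, or `_R2` before `.fq` or `.fq.gz`); the pipeline itself may treat edge cases differently.

```shell
# input_name PATH (hypothetical helper): strip the directory, the file type
# extension, and for fastq files the paired-end suffix.
# For example, path/to/A_R1.fq gives A.
input_name() {
  case "$1" in
    *.bam) basename "$1" .bam ;;
    *)     basename "$1" | sed -E 's/\.fq(\.gz)?$//; s/_R?[12]$//' ;;
  esac
}

# missing_from_metadata READS_TXT METADATA_TSV (hypothetical helper):
# print each input name that has no matching value in the first column
# of the metadata file (the header row is skipped).
missing_from_metadata() {
  while IFS= read -r path; do
    [ -n "$path" ] && input_name "$path"
  done < "$1" | sort -u | while IFS= read -r name; do
    cut -f 1 "$2" | tail -n +2 | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Usage: missing_from_metadata reads.txt metadata.tsv
```

An empty output means that every input has a metadata row.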

Merging inputs

When using --assemble-by and/or --quantify-by, your inputs are merged into experiment groups that share common factors. With --assemble-by, transcripts assembly is done individually for each assembly group, and consensus transcripts are kept to generate a novel annotation. With --quantify-by, quantification values are given individually for each quantification group.

Merging inputs by a single factor

--assemble-by tissue --quantify-by stage

input	tissue	stage	Transcripts assembly by tissue	Annotation	Quantification by stage
A	liver	30 days	A, B, C → liver	liver, muscle → novel annotation	A, B → 30 days
B	liver	30 days	A, B, C → liver	liver, muscle → novel annotation	A, B → 30 days
C	liver	60 days	A, B, C → liver	liver, muscle → novel annotation	C, D → 60 days
D	muscle	60 days	D → muscle	liver, muscle → novel annotation	C, D → 60 days

Merging inputs by an intersection of factors

--assemble-by tissue,stage

input	tissue	stage	Transcripts assembly by tissue and stage	Annotation	Quantification by input
A	liver	30 days	A, B → liver at 30 days	liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation	A
B	liver	30 days	A, B → liver at 30 days	liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation	B
C	liver	60 days	C → liver at 60 days	liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation	C
D	muscle	60 days	D → muscle at 60 days	liver at 30 days, liver at 60 days, muscle at 60 days → novel annotation	D
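
To preview which inputs fall into which merge group before launching a run, the grouping shown in the two tables can be reproduced from the metadata file alone. The sketch below is an illustration, not the pipeline's implementation: `group_inputs` is a hypothetical name, and the ` / ` separator between factor values is chosen only for display.

```shell
# group_inputs METADATA_TSV FACTORS (hypothetical helper): print one line per
# merge group as "<factor values><TAB><comma-separated inputs>", in order of
# first appearance. FACTORS is a comma-separated list of column names.
group_inputs() {
  awk -F '\t' -v factors="$2" '
    NR == 1 {                        # header row: locate the factor columns
      n = split(factors, wanted, ",")
      for (i = 1; i <= n; i++) {
        col[i] = 0
        for (j = 1; j <= NF; j++) if ($j == wanted[i]) col[i] = j
        if (!col[i]) { print "unknown factor: " wanted[i] > "/dev/stderr"; exit 1 }
      }
      next
    }
    {
      key = $(col[1])
      for (i = 2; i <= n; i++) key = key " / " $(col[i])
      if (key in members) {
        members[key] = members[key] "," $1
      } else {
        order[++count] = key         # remember first appearance
        members[key] = $1
      }
    }
    END { for (k = 1; k <= count; k++) print order[k] "\t" members[order[k]] }
  ' "$1"
}

# Usage: group_inputs metadata.tsv tissue
#        group_inputs metadata.tsv tissue,stage
```

With the example metadata, grouping by `tissue` yields one group for liver (A, B, C) and one for muscle (D), and grouping by `tissue,stage` yields the three groups listed in the second table.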

Workflow and results

The pipeline executes the following processes:

FASTQC_control_reads
Control reads quality with FastQC.
TRIMGALORE_trim_adapters
Trim adapters with Trim Galore.
STAR_index_genome
Index genome with STAR.
The indexed genome is saved to output/index.
STAR_align_reads
Align reads to the indexed genome with STAR.
Aligned reads are saved to output/alignment in .bam files.
BEDTOOLS_compute_coverage
Compute genome coverage with Bedtools.
Coverage information is saved to output/coverage in .bed files.
SAMTOOLS_merge_reads
Merge aligned reads by factors with Samtools.
See the merging inputs section for details.
STRINGTIE_assemble_transcripts
Assemble transcripts in each individual assembly group with StringTie.
TAGADA_filter_transcripts
Filter rare transcripts that appear in few assembly groups and poorly expressed transcripts with low TPM values.
STRINGTIE_coalesce_transcripts or TMERGE_coalesce_transcripts
Create a novel annotation with StringTie or Tmerge.
The novel annotation is saved to output/annotation in a .gtf file.
FEELNC_classify_transcripts
Detect long non-coding transcripts with FEELnc.
The annotation saved to output/annotation is updated with the results.
STRINGTIE_quantify_expression
Quantify genes and transcripts with StringTie.
Counts and TPM matrices are saved to output/quantification in .tsv files.
MULTIQC_generate_report
Aggregate quality controls into a report with MultiQC.
The report is saved to output/control in a .html file.
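
After a run, a quick way to see which of the result directories named above were produced is sketched below. `list_results` is a hypothetical helper. Some directories may legitimately be absent, for instance when `--index` is provided or a skip option is used.

```shell
# list_results OUTPUT_DIR (hypothetical helper): report which of the result
# directories named in the process list exist under the --output directory.
list_results() {
  for sub in index alignment coverage annotation quantification control; do
    if [ -d "$1/$sub" ]; then
      printf '%s\tpresent\n' "$sub"
    else
      printf '%s\tabsent\n' "$sub"
    fi
  done
}

# Usage: list_results directory
```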

Novel annotation

The novel annotation contains information from StringTie, Tmerge, and FEELnc. It is provided in gtf format with exon, transcript and gene rows. Row attributes vary depending on which tool was used to coalesce transcripts.

--coalesce-transcripts-with tmerge

gene_id
All rows. The Tmerge gene_id starting with LOC.
ref_gene_id
All rows. A comma-separated list of reference annotation gene_id when a Tmerge transcript is made of at least one reference transcript, otherwise a dot.
transcript_id
Exon and transcript rows. The Tmerge transcript_id starting with TM, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation transcript_id is provided.
tmerge_tr_id
Exon and transcript rows. Optional. A comma-separated list of Tmerge transcript_id if the current transcript_id is from the reference annotation, to list which initial Tmerge transcripts it is made of.
transcript_biotype
Exon and transcript rows. Optional. The reference annotation transcript_biotype of the transcript_id.
feelnc_biotype
Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.
contains, contains_count, 3p_dists_to_3p, 5p_dists_to_5p, flrpm, longest, longest_FL_supporters, longest_FL_supporters_count, mature_RNA_length, meta_3p_dists_to_5p, meta_5p_dists_to_5p, rpm, spliced
Transcript rows. Attributes provided by Tmerge.

--coalesce-transcripts-with stringtie

gene_id
All rows. The StringTie gene_id starting with MSTRG.
ref_gene_id
All rows. Optional. The reference annotation gene_id.
ref_gene_name
All rows. Optional. The reference annotation gene_name.
transcript_id
Exon and transcript rows. The StringTie transcript_id starting with MSTRG, unless the transcript is exactly identical to a reference transcript, in which case the reference annotation transcript_id is provided.
transcript_biotype
Exon and transcript rows. Optional. The reference annotation transcript_biotype of the transcript_id.
feelnc_biotype
Exon and transcript rows. Optional. The transcript biotype determined by FEELnc (lncRNA, mRNA, noORF, or TUCp) if the transcript has been classified.
exon_number
Exon rows. The StringTie exon_number starting from 1 within a given transcript.
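
As an example of working with these attributes, the sketch below counts transcript rows per `feelnc_biotype` value in a novel annotation. It is an illustration under assumptions: `count_feelnc_biotypes` is a hypothetical name, the file is assumed to be tab-separated with attributes written as `key "value";`, and the label `unclassified` is introduced here for transcripts that lack the optional attribute.

```shell
# count_feelnc_biotypes GTF (hypothetical helper): count transcript rows per
# feelnc_biotype value. Transcripts without the attribute are reported under
# the label "unclassified", which is chosen here for illustration.
count_feelnc_biotypes() {
  awk -F '\t' '
    $3 == "transcript" {
      biotype = "unclassified"
      if (match($9, /feelnc_biotype "[^"]+"/)) {
        # skip the 16 characters of: feelnc_biotype "  and drop the closing quote
        biotype = substr($9, RSTART + 16, RLENGTH - 17)
      }
      count[biotype]++
    }
    END { for (b in count) print b "\t" count[b] }
  ' "$1" | sort
}

# Usage: count_feelnc_biotypes novel.gtf
```

Here `novel.gtf` stands for the `.gtf` file saved to output/annotation; the actual file name is not stated above.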

Funding

The GENE-SWitCH project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No 817998.

This repository reflects only the listed contributors' views. Neither the European Commission nor its Agency REA are responsible for any use that may be made of the information it contains.

Citing

If you use TAGADA in a publication, please cite this:

Kurylo C, Guyomar C, Foissac S, Djebali S. TAGADA: a scalable pipeline to improve genome annotations with RNA-seq data. NAR Genomics and Bioinformatics. 2023 Dec 1;5(4):lqad089.
