
A Snakemake pipeline to go from raw .subreads.bam PacBio Iso-Seq to assembled mRNA isoforms (FASTA format)

SilkeAllmannLab/pacbio_snakemake

Snakemake workflow: PacBio Iso-Seq processing pipeline

Introduction

A Snakemake workflow for processing PacBio raw subreads.bam into polished mRNA isoforms in FASTA format.
Optionally, the assembled long mRNAs can be aligned against a genomic reference to generate a genome annotation in GFF3 format.

Steps

The workflow follows the standard Iso-Seq analysis, which consists of the following steps:

  1. Get Circular Consensus Sequence (CCS) reads.
  2. Get Full Length (FL) reads.
  3. Get refined Full-Length, Non-Concatemer (FLNC) reads.
  4. Get transcript isoforms from (refined and clustered) FLNC reads.
  5. Optionally, align these transcript isoforms to a genome reference and create a GFF3 annotation file.
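
The steps above map onto a chain of command-line calls. The sketch below only builds the argument lists and does not run anything. It assumes the standard PacBio tools `ccs`, `lima` and `isoseq3` distributed through pbbioconda; the function name, the file names and the `primers.fasta` path are hypothetical, and the rules in this repository may use different options. Polishing and the optional alignment step are left out.

```python
from typing import List


def isoseq_commands(sample: str, primers: str = "primers.fasta") -> List[List[str]]:
    """Build (but do not run) one command per Iso-Seq step for a sample.

    All file names are illustrative assumptions, not the names used by
    the workflow's own rules.
    """
    subreads = f"{sample}.subreads.bam"
    ccs_bam = f"{sample}.ccs.bam"
    fl_bam = f"{sample}.fl.bam"
    flnc_bam = f"{sample}.flnc.bam"
    clustered_bam = f"{sample}.clustered.bam"
    return [
        # Step 1: circular consensus sequence (CCS) reads from subreads.
        ["ccs", subreads, ccs_bam],
        # Step 2: remove 5' and 3' cDNA primers to get full-length (FL) reads.
        ["lima", ccs_bam, primers, fl_bam, "--isoseq"],
        # Step 3: trim polyA tails and remove concatemers (FLNC reads).
        # lima adds the detected primer pair to its real output file name,
        # so the input path used here may need adjusting in practice.
        ["isoseq3", "refine", fl_bam, primers, flnc_bam, "--require-polya"],
        # Step 4: cluster FLNC reads into transcript isoforms.
        ["isoseq3", "cluster", flnc_bam, clustered_bam],
    ]


if __name__ == "__main__":
    for cmd in isoseq_commands("sample1"):
        print(" ".join(cmd))
```

Each step consumes the output of the previous one, which is the same dependency chain that Snakemake resolves from the input and output files of its rules.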

PacBio Iso-Seq terminology

| name | abbreviation | explanation |
|---|---|---|
| Full-Length Reads | FL reads | CCS reads with 5’ and 3’ cDNA primers removed. |
| Full-Length, Non-Concatemer Reads | FLNC reads | CCS reads with 5’ and 3’ cDNA primers, polyA tail, and concatemers removed. |
| High-Quality Isoforms | HQ isoforms | Polished transcript sequences with predicted accuracy ≥99% and ≥2 FLNC reads. |
| Low-Quality Isoforms | LQ isoforms | Polished transcript sequences with predicted accuracy <99% and ≥2 FLNC reads. |
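
The HQ/LQ distinction in the table comes down to two thresholds. As a minimal sketch, the hypothetical helper below applies them to a single transcript; the function name and the choice to return `None` for transcripts supported by fewer than two FLNC reads are assumptions made for illustration, not part of the workflow.

```python
from typing import Optional


def classify_isoform(predicted_accuracy: float, n_flnc: int) -> Optional[str]:
    """Return "HQ", "LQ", or None using the thresholds from the table.

    predicted_accuracy is a fraction between 0 and 1 (0.99 means 99%).
    n_flnc is the number of FLNC reads supporting the transcript.
    """
    if n_flnc < 2:
        # Fewer than two supporting FLNC reads: neither HQ nor LQ.
        return None
    return "HQ" if predicted_accuracy >= 0.99 else "LQ"
```

For example, `classify_isoform(0.995, 3)` gives `"HQ"`, while `classify_isoform(0.98, 2)` gives `"LQ"`.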

Usage

The usage of this workflow is described in the Snakemake Workflow Catalog and also here.

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and its DOI (see above).

Install conda and mamba

Each rule runs in its own dedicated Conda/Mamba environment, so both tools must be available on the machine where the workflow runs (for example, the crunchomics cluster).

To install the 'conda' package manager from the lightweight miniconda distribution, follow instructions here.

To install the mamba package manager, follow the instructions here.

Create a 'snakemake' environment

This will be your starting environment.

To create it, run `mamba env create -f config/environment.yaml`, which installs the three Python dependencies listed in that file.

Run Snakemake with conda

Snakemake will use the conda environments defined in envs/ for each rule. It installs these environments using mamba, so make sure mamba is available by running `which mamba`.
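
The same availability check can be scripted before a run is launched. This is a minimal sketch using only the Python standard library; the helper name `find_tool` is hypothetical.

```python
import shutil
from typing import Optional


def find_tool(name: str = "mamba") -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("mamba")
    if path:
        print(f"mamba found at {path}")
    else:
        print("mamba not found on PATH; install it before running Snakemake")
```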

If using Snakemake interactively, execute `snakemake --use-conda -j X`, where X is the number of cores to use.
Otherwise, submit your jobs through the SLURM job manager with `sbatch pacbio_snakemake_sbatch.sh`.
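
To avoid hard-coding X, the interactive call can be assembled from the number of cores on the local machine. The sketch below is one way to do that; the helper name `snakemake_command` is hypothetical, and defaulting to every available core is an assumption that may not suit a shared login node.

```python
import os
from typing import List, Optional


def snakemake_command(cores: Optional[int] = None) -> List[str]:
    """Build the interactive Snakemake call, defaulting -j to all local cores."""
    if cores is None:
        # os.cpu_count() can return None when the count is undetermined.
        cores = os.cpu_count() or 1
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    return ["snakemake", "--use-conda", "-j", str(cores)]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(snakemake_command()))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root to start the workflow.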

Pipeline maintainers

  • Tijs Bliek, technician, Plant Development and Epigenetics, SILS, University of Amsterdam.
  • Marc Galland, support data scientist, Plant Physiology, SILS, University of Amsterdam.

References

PacBio conda tools

https://github.com/PacificBiosciences/pbbioconda

PacBio Iso-Seq workflow

TODO

  • Replace <owner> and <repo> everywhere in the template (also under .github/workflows) with the correct <repo> name and owning user or organization.
  • Replace <name> with the workflow name (can be the same as <repo>).
  • Replace <description> with a description of what the workflow does.
  • The workflow will appear in the snakemake-workflow-catalog once it has been made public. The link under "Usage" will then point to the usage instructions, provided <owner> and <repo> were set correctly.
