Scalable De Novo Isoform Discovery
IsoSeq3 contains the newest tools to identify transcripts in PacBio single-molecule sequencing data. Starting in SMRT Link v6.0.0, those tools power the IsoSeq3 GUI-based analysis application. A composable workflow of existing tools and algorithms, combined with a new clustering technique, allows to process the ever-increasing yield of PacBio machines with similar performance to IsoSeq1 and IsoSeq2.
Latest version can be installed via bioconda package
Please refer to our official pbbioconda page for information on Installation, Support, License, Copyright, and Disclaimer.
- 3.0.0: Initial release, included in SMRT Link 6.0.0
PacBio supports three different SMRTbell designs for IsoSeq libraries. In all designs, transcripts are labelled with asymmetric primers, whereas a polyA tail is optional. Barcodes may be optionally added.
For each cell, the
are needed for processing.
Circular Consensus Sequence calling
Each sequencing run is processed by ccs to generate one representative circular consensus sequence (CCS) for each ZMW. Only ZMWs with at least one full pass (at least once subread with SMRT adapter on both ends) are used for the subsequent analysis. Polishing is not necessary in this step and is by default deactivated through `.
ccs movie.subreads.bam ccs.bam --noPolish --minPasses 1
Primer removal and demultiplexing
Removal of cDNA primers and identification of barcodes (if given) is performed using lima,
which offers a specialized
More information about how to name input primer(+barcode) sequences in this FAQ.
lima --isoseq --dump-clips --no-pbi ccs.bam primers.fasta demux.bam
The following is the
primer.fasta for the Clontech SMARTer cDNA library prep, which is the officially recommended protocol:
>primer_5p AAGCAGTGGTATCAACGCAGAGTACATGGG >primer_3p GTACTCTGCGTTGATACCACTGCTT
The following are examples for barcoded samples using a 16bp barcode followed by Clontech primer:
>primer_5p AAGCAGTGGTATCAACGCAGAGTACATGGGG >brain_3p CGCACTCTGATATGTGGTACTCTGCGTTGATACCACTGCTT >liver_3p CTCACAGTCTGTGTGTGTACTCTGCGTTGATACCACTGCTT
lima will remove unwanted combinations and orient sequences to 5' -> 3' orientation.
From here on, execute the following steps for each output BAM file.
Clustering and polishing
IsoSeq3 wraps all tools into one fat binary.
$ isoseq3 isoseq3 - De Novo Transcript Reconstruction Tools: cluster - Cluster CCS reads to transcripts polish - Polish the clustering output summarize - Create a barcode overview CSV file Examples: isoseq3 cluster movie.consensusreadset.xml unpolished.bam isoseq3 polish unpolished.bam movie.subreadset.xml polished.bam isoseq3 summarize polished.bam summary.csv
Clustering and transcript clean up
Compared to previous IsoSeq approaches, IsoSeq3 performs a single clustering technique. Due to the nature of the algorithm, it can't be efficiently parallelized. It is advised to give this step as many cores as possible. The individual steps of cluster are as following:
- Trimming of polyA tails
- Rapid concatmer identification and removal
- Clustering using hierarchical n*log(n) alignment and iterative cluster merging
- Unpolished POA sequence generation
The input file for cluster is one demultiplexed CCS file:
The following output files of cluster contain unpolished isoforms:
<prefix>.bam.pbi<- Only generated with
<prefix>.transcriptset.xml<- Only relevant for pbsmrtpipe
<prefix>.consensusreadset.xml<- Only relevant for pbsmrtpipe
isoseq3 cluster demux.P5--P3.bam unpolished.bam -j 32 [--split-bam 24]
The algorithm behind polish is the arrow model that also used for CCS
generation and polishing of de-novo assemblies. This step can be massively
parallelized by splitting the
unpolished.bam file. Split BAM files can be
generated by cluster.
The input files for polish are:
The following output files of polish contain polished isoforms:
<prefix>.bam.pbi<- Only generated with
<prefix>.hq.fasta.gzwith predicted accuracy ≥ 0.99
<prefix>.lq.fasta.gzwith predicted accuracy < 0.99
<prefix>.hq.fastq.gzwith predicted accuracy ≥ 0.99
<prefix>.lq.fastq.gzwith predicted accuracy < 0.99
isoseq3 polish unpolished.bam m54020_171110_2301211.subreads.bam polished.bam
This is an example of an end-to-end cmd-line-only workflow to get from subreads to polished isoforms.
$ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam $ wget https://downloads.pacbcloud.com/public/dataset/RC0_1cell_2017/m54086_170204_081430.subreads.bam.pbi $ ccs --version ccs 3.1.0 (commit v3.1.0) $ time ccs m54086_170204_081430.subreads.bam m54086_170204_081430.ccs.bam \ --noPolish --minPasses 1 real 50m43.090s user 3531m35.620s sys 24m36.884s $ cat primers.fasta >primer_5p AAGCAGTGGTATCAACGCAGAGTACATGGGG >primer_3p AAGCAGTGGTATCAACGCAGAGTAC $ lima --version lima 1.7.1 (commit v1.7.1) $ time lima m54086_170204_081430.ccs.bam primers.fasta demux.bam \ --isoseq --no-pbi --dump-clips real 0m6.543s user 0m51.170s $ ls demux* demux.json demux.lima.counts demux.lima.report demux.lima.summary demux.primer_5p--primer_3p.bam demux.primer_5p--primer_3p.subreadset.xml $ time isoseq3 cluster demux.primer_5p--primer_3p.bam unpolished.bam --verbose Read BAM : (200740) 8s 313ms India : (197869) 9s 204ms Save flnc file : 35s 366ms Convert to reads : 36s 967ms Sort Reads : 69ms 756us Aligning Linear : 42s 620ms Read to clusters : 7s 506ms Aligning Linear : 37s 595ms Merge by mapping : 37s 645ms Consensus : 1m 47s Merge by mapping : 8s 861ms Consensus : 12s 633ms Write output : 3s 265ms Complete run time : 5m 12s real 5m12.888s user 58m35.243s $ ls unpolished* unpolished.bam unpolished.bam.pbi unpolished.cluster unpolished.fasta unpolished.flnc.bam unpolished.flnc.bam.pbi unpolished.flnc.consensusreadset.xml unpolished.transcriptset.xml $ time isoseq3 polish unpolished.bam m54086_170204_081430.subreads.bam polished.bam --verbose 14561 real 60m37.564s user 2832m8.382s $ ls polished* polished.bam polished.bam.pbi polished.hq.fasta.gz polished.hq.fastq.gz polished.lq.fasta.gz polished.lq.fastq.gz polished.transcriptset.xml
If you have multiple cells, you should run
--split-bam in the cluster step which will produce chunked cluster results. Each chunked cluster result can be run as a parallel polish job and merged at the end. The following example splits into 24 chunks.
sample.subreadset.xml is the dataset containing all the input cells. The
isoseq3 polish jobs can be run in parallel.
$ isoseq3 cluster demux.primer_5p--primer_3p.bam unpolished.bam --split-bam 24 $ isoseq3 polish unpolished.0.bam sample.subreadset.xml polished.0.bam $ isoseq3 polish unpolished.1.bam sample.subreadset.xml polished.1.bam $ ...
Why IsoSeq3 and not the established IsoSeq1 or IsoSeq2?
The ever-increasing throughput of the Sequel system gave rise to the need for a scalable software solution that can handle millions of CCS reads, while maintaining sensitivity and accuracy. Internal benchmarks have shown that IsoSeq3 is orders of magnitude faster than currently employed solutions and SQANTI attributes IsoSeq3 a higher number of perfectly annotated isoforms. Additional benefit, single linux binary that requires no dependencies.
Why is the number of transcripts much lower with IsoSeq3?
Even though we also observe fewer polished transcripts with IsoSeq3, the overall quality is much higher. Most of the low-quality transcripts are lost in the demultiplexing step. Isoseq1/2 classify is too relaxed and is not filtering junk molecules to a satifactory level. In fact, lima calls are spot on and effectively removes most molecules that are wrongly tagged, as in two 5' or two 3' primers. Only a proper 5' and 3' primer pair allows to identify a full-length transcript and its orientation.
I can't find the classify step
Classify has been replaced with PacBio's standard demultiplexing tool lima. Lima does not remove polyA tails, nor detects concatmers. See the next Q.
Can I perform "classify only" to get FLNC reads?
One of the early outputs of the
cluster step is a
*.flnc.bam file. Feel
free to abort after this file has been written. This will be addressed in the
How long will it take until my data has been processed?
There is no ETA feature. Depending on the sample type, whole transcriptome or targeted amplification, run time varies. The same number of reads from a whole transcriptome sample can finish clustering in minutes, whereas a single gene amplification of 10kb transcripts can take a couple of hours.
Which clustering algorithm is used?
In contrast to its predecessors, IsoSeq3 does not rely on NP-hard clique
finding, but uses a hierarchical alignment strategy with
Recent advances in rapid alignment of long reads make this this approach
How many CCS reads are used for the unpolished cluster sequence representation?
Cluster uses up to 10 CCS reads to generate the unpolished cluster consensus.
How many subreads are used for polishing?
Polish uses up to 60 subreads to polish the cluster consensus.
When are two reads clustered?
IsoSeq3 deems two reads to stem from the same transcript, if they meet following criteria:
There is no upper limit on the number of gaps.
My sample has poly(A) tails, how can I remove them?
--require-polya. This filters for FL reads that have a poly(A) tail
with at least 20 base pairs and removes identified tail.
BAM tags explained
Following BAM tags are being used:
ibBarcode summary: triplets delimited by semicolons, each triplet contains two barcode indices and the ZMW counts, delimited by comma. Example:
imZMW names associated with this isoform
isNumber of ZMWs associated with this isoform
izMaximum number of subreads used for polishing
rqPredicted accuracy for polished isoform
Quality values are capped at
The binary does not work on my linux system!
Binaries require SSE4.1 CPU support; CPUs after 2008 (Penryn) include it.
THIS WEBSITE AND CONTENT AND ALL SITE-RELATED SERVICES, INCLUDING ANY DATA, ARE PROVIDED "AS IS," WITH ALL FAULTS, WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE. YOU ASSUME TOTAL RESPONSIBILITY AND RISK FOR YOUR USE OF THIS SITE, ALL SITE-RELATED SERVICES, AND ANY THIRD PARTY WEBSITES OR APPLICATIONS. NO ORAL OR WRITTEN INFORMATION OR ADVICE SHALL CREATE A WARRANTY OF ANY KIND. ANY REFERENCES TO SPECIFIC PRODUCTS OR SERVICES ON THE WEBSITES DO NOT CONSTITUTE OR IMPLY A RECOMMENDATION OR ENDORSEMENT BY PACIFIC BIOSCIENCES.