Understanding PacBio transcriptome data

Magdoll edited this page Apr 9, 2015 · 81 revisions
Clone this wiki locally

Last Updated: 02/02/2015

About the Iso-Seq™ Application

Iso-Seq™ (Isoform Sequencing) is the PacBio trademark name for its developed applications for transcriptome sequencing. Please refer to the resources below as well as the wiki for more information:

Reads Of Insert Explained

Starting with SMRT Pipe v2.1.0, CCS (Circular Consensus Sequence) is replaced by the more flexible Reads Of Insert protocol. To explain what Reads of Insert (RoIs) are, first consider what the raw sequence from a single ZMW looks like:

roi_explained

The "number of full passes" for a raw read is the number of times both ends of the SMRTbell™ adapter (black) are observed. In this example, the number of full passes is 2.

Previously, CCS reads were generated if the raw sequence had a minimum of two full passes. If a ZMW outputs a raw read with only one full pass, it would not generate a CCS. Instead, the data would be available through the subreads - that is, the sequences in between adapters.

With the new Reads Of Insert protocol (which is part of the module for the RS_IsoSeq protocol), users can define the minimum required number of full passes (an integer between 0 and 10) and predicted consensus accuracy (a number between 70 and 100). Then every raw sequence that passes the criterion generates a Reads Of Insert consensus sequence.

roi_explained2

The result is a cleaner, user-tunable output where each ZMW produces exactly one Reads Of Insert.

The Reads Of Insert sequence IDs follow the same format as the old CCS:

<movieName>/<ZMW number>

Note that, a ReadsOfInsert sequence could either be a full-length transcript (as defined by the presence of 5' primer, 3' primer, and the polyA tail if applicable) or a non-full-length transcript. The RS_IsoSeq protocol contains a submodule (classify) that automatically divides ReadsOfInsert into full-length and non-full-length bins. For more information, please see the RS_IsoSeq tutorials.

CCS and Subreads explained

WARNING: INFORMATION IN THIS SECTION IS NO LONGER BEING UPDATED. Subreads/CCS ARE NO LONGER BEING GENERATED BY THE MOST RECENT SMRT ANALYSIS SOFTWARE. INFORMATION IN THIS SECTION MAY BE OBSOLETE.

If you are running SMRT Pipe v2.1.0 and later, skip this section as CCS and subreads are replaced by ReadsOfInsert. However, we highly recommend that you upgrade to the latest SMRT Pipe versions! Transcriptome analysis using subreads and CCS are no longer being actively developed at Pacific Biosciences.

At the end of a sequencing run, you'll get .bas.h5 (and maybe .bax.h5) files in the Analysis_Results/ subfolder.

bash5_folder.png

NOTE: Prior to the PacBio RS II, only a single .bas.h5 file was produced per movie. Since then, a single .bas.h5 and three .bax.h5 files are produced per movie, with the .bas.h5 being a virtual encapsulation of the three .bax.h5 files. Read here for a detailed explanation.

The .bas.h5 (or .bax.h5) file contains the raw base calls along with quality metrics (QVs). Unless you need access to quality values (Phred scores indicating how likely the base is a mismatch/substitution, insertion, deletion, and so on), you should not have to deal with these files directly.

m121208_091821_42175_c000441702559900001500000112311424_s1_p0 is the movie name. If there are multiple movies in the same run, they are distinguished by the prefix s, so the second and third movies would have the prefix:

m121208_091821_42175_c000441702559900001500000112311424_s2_p0
m121208_091821_42175_c000441702559900001500000112311424_s3_p0

and so on.

With older versions of the PacBio RS, you may also see the file m121208_091821_42175_c000441702559900001500000112311424_s1_p0.fasta, which contains the raw reads. If there are 150k ZMWs in a SMRT Cell, then there would be about 150k sequences in the raw read FASTA file. The raw read FASTA file should not be used directly for analysis as it does not represent the single sequence of a transcript. Rather, it is the sequencing result of going around and around the circular SMRTbell™ structure. Below is an example of a raw sequence:

raw sequence unrolled

Secondary Analysis removes the SMRTbell™ adapters, trims away low quality regions, and breaks the raw read up into (in this case, three) separate subreads.

You can run Secondary Analysis either through SMRT Portal or through the command-line (smrtpipe.py). Either way, in the output data/ subdirectory:

smrtpipe output data subfolder

filtered_subreads.fasta is the filtered subreads. The sequence ID are named by:

<movieName>/<ZMW number>/<subread start_subread end> (for subreads)

The subread start is the 0-based, end is 1-based (same as used by UCSC) offset on the raw read.

Suppose the raw read was 2,850 bp. The first "pass" of the sequencing (before reaching the adapter the first time) went on for 1000 bp. The next 50bp is the adapter sequence. Then it goes to the other side of the SMRTbell™ and sequences the transcript again, this time in reverse complement. Then, at 2,100, it hits the adapter again, and goes on to its third pass of sequencing, but time ran out or the polymerase got tired and stopped there.

The corresponding subreads would have the sequence IDs:

m120530_022750_42142_c100391630010000001523040611021240_s2_p0/14/0_1000 m120530_022750_42142_c100391630010000001523040611021240_s2_p0/14/1050_2100 m120530_022750_42142_c100391630010000001523040611021240_s2_p0/14/2150_2850

If you are counting transcripts, it is important to count by ZMW, and not by subread. All subreads from the same ZMW come from the same transcript.

ccs_reads.fasta (or filtered_CCS_subreads.fasta) contains the filtered CCS (circular consensus sequence) reads. The sequence ID naming convention is:

<movieName>/<ZMW number>

Which is only the movie name and ZMW number. As every subread from the same ZMW is from the same molecule (but with random sequencing errors), we can align them and output one consensus sequence which would give us higher base accuracy than individual subreads.

Not all ZMWs get a corresponding CCS read. Currently, the criterion is that a ZMW has to have at least two "full-pass" subreads, defined by being flanked on both sides by adapters. In the example above, because only the second pass is a full-pass subread, it does not get a CCS.

If your transcript lengths are short (1-2k), chances are a good number (if not most) of the ZMWs will get a CCS. In that case, you can probably just work on the ccs_reads.fasta (or filtered_CCS_subreads.fasta) file. CCS has higher accuracy (> 95%) than raw sequences (~85%) and can be easily mapped to known transcripts or genomes using alignment tools such as BLAST, BLAT, GMAP, or PacBio's own BLASR (see NOTE 2). In my experience, at ~85% single pass accuracy, subreads can be mapped with BLAST, BLAT, and GMAP, but might require a bit more post-processing. BLASR is designed to work with PacBio's read lengths and single- and multi-pass accuracy, and works well on both CCS and subreads.

NOTES:

  1. Currently, if a ZMW gets a CCS read, it is output both as single CCS read in ccs_reads.fasta (or filtered_CCS_subreads.fasta) and as separate subreads in filtered_subreads.fasta. To create a subread file that is exclusively non-CCS, you can use this script from the cDNA project on GitHub.

  2. BLASR is recommended only for mapping subreads or CCS reads to transcript sequences, not to a genomic reference. This is because BLASR is not splicing-aware; it simply treats short introns as long deletions and will fail to detect distant junctions. GMAP is designed to work with both spliced and non-spliced transcripts. BLAT, although not originally designed for splice mapping, has shown in practice to be capable of mapping spliced transcripts back to the genome, though it usually requires extra output processing.

How to determine if a subread/CCS/ReadsOfInsert is a full-length transcript

With PacBio's long read length, most transcripts have a chance of being sequenced in full-length, with no assembly required.

Currently, we determine transcripts to be full-length using the 5'/3' primer information from the cDNA kit. Seeing both the 5' and 3' primer and maybe even a poly A tail, is the best guarantee of an FL transcript.

A side benefit of using 5' and 3' primer for FL identification is that, if the primers are not symmetric, the orientation of the sequence (in the absence of poly A tail) can be determined. In the example, by seeing the reverse complement of the 3' primer, followed by poly Ts, and ending with the reverse complement of the 5' primer, we can tell this is the reverse complement of the actual transcript.

Both the official release of the RS_IsoSeq protocol (SMRT Analysis v2.2.0 and later) and the GitHub code (pbtranscript-tofu) support identification of full-length reads.


For Research Use Only. Not for use in diagnostic procedures. © Copyright 2010 - 2015, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html.