Barcoding - RSII and SMRT Analysis 2.3.0 or older
This page describes general guidelines and suggestions for using the barcoding protocol in SMRT Analysis v1.4.0 and later (currently: v2.3.0). The emphasis here is on the bioinformatics aspects of the barcoding protocol, including some considerations for sample preparation of barcoded insert sequences. More detail on sample preparation can be found in the shared protocol.
SMRT Analysis v1.4.0 and later include barcoding functionality that detects and identifies unique barcodes and then splits reads into separate files per barcode.
See the Secondary Analysis links below for information on the SMRT Portal barcoding protocols.
New Sample Prep Protocols
In April 2015, Pacific Biosciences introduced 96-well barcoding kits for enhanced multiplexed sample preparation. Two kits are available, both of which use 16 bp barcodes lbc001 through lbc096:
- 96-well Barcoded Adapters
SMRTbell adapters with barcodes for appending during sample prep ligation.
- 96-well Barcode Universal Primers
PCR primers with a 30 bp universal sequence for appending during PCR amplification.
FORWARD (U1) 5’-GCAGTCGAACATGTAGCTGACTCAGGTCAC-3’
REVERSE (U2) 5’-TGGATCACTTGTGCAAGCATCACATCGTAG-3’
These barcoding kits both generate data sets with a symmetric barcode design, and should be demultiplexed in the symmetric mode in the SMRT Analysis barcoding protocols.
For more information, see the following:
There is a known bug in SMRT Analysis v2.0.0 when using sequence data from the 150k RS II. Unordered `barcode.fofn` files will sometimes cause the barcoded FASTQ files to have no data (SMRT Portal fails) or partial data (no failure warning). Additionally, only about one third of barcoded reads will be labelled with a barcode in the aligned reads.
Please update to SMRT Analysis v2.0.1 or later to use barcoding with the PacBio RS II.
A set of 384 16 bp barcodes, custom designed for the Pacific Biosciences error mode, is recommended for use in SMRT barcoding protocols. This barcode set was designed to be used symmetrically (same barcode on either end of the insert) or asymmetrically (different barcodes on either end).
Links to downloadable files for ordering primers with the recommended barcodes can be found on samplenet. Corresponding barcode FASTA files for analysis in SMRT Portal can be found here.
- To use only a subset of the barcodes provided, simply select the desired barcodes and also truncate the FASTA file to reflect the barcodes present in the sample set.
- Additionally, an HTML tool for generating FASTA files with asymmetric barcode pairs can be found here.
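Truncating the FASTA file to a subset of barcodes can be scripted. The following is a minimal sketch, not part of SMRT Analysis: the function name `subset_barcode_fasta` is hypothetical, and it assumes each FASTA header starts with the barcode name as its first token (for example `>lbc1`).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SMRT Analysis): keep only the named
# records from a barcode FASTA file.
# Usage: subset_barcode_fasta all_barcodes.fasta lbc1 lbc17 lbc29 > subset.fasta
subset_barcode_fasta() {
    local fasta="$1"
    shift
    # Pass the wanted barcode names to awk as one space-separated string.
    awk -v names="$*" '
        BEGIN {
            n = split(names, list, " ")
            for (i = 1; i <= n; i++) keep[list[i]] = 1
        }
        /^>/ {
            # Assumption: record name is the first token after ">".
            name = substr($1, 2)
            printing = (name in keep)
        }
        # Print header and sequence lines of kept records only.
        printing { print }
    ' "$fasta"
}
```

Records are written in the order they appear in the input file, so the output stays consistent with the original barcode FASTA regardless of the order in which names are given.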
###Low Multiplex Barcode Subset
For customers who want to perform experiments at lower plex, PacBio has performed a limited number of experiments to recommend the best 12 or 24 barcodes. These experiments used Barcoded Adapters with a fixed 2 kb insert. Two criteria were used to select the best barcodes:
- We looked at a 'scoreRatio' metric which measures the ratio of best scoring barcode to second best barcode for each read. Barcodes were selected which have a high mean scoreRatio, indicating better separation from other barcodes when scored together.
- The second metric identified barcodes with a high number of subreads, favoring the group with the lowest standard deviation in number of subreads.
This means that we may not have chosen the barcodes with the highest number of subreads or the highest scoreRatio. The best 24 barcodes, as defined by this experimental system and these metrics, are:
lbc1--lbc1 lbc32--lbc32 lbc46--lbc46 lbc56--lbc56
lbc10--lbc10 lbc34--lbc34 lbc48--lbc48 lbc62--lbc62
lbc17--lbc17 lbc35--lbc35 lbc51--lbc51 lbc70--lbc70
lbc19--lbc19 lbc38--lbc38 lbc52--lbc52 lbc75--lbc75
lbc21--lbc21 lbc40--lbc40 lbc54--lbc54 lbc82--lbc82
lbc29--lbc29 lbc41--lbc41 lbc55--lbc55 lbc9--lbc9
Following this same logic, the best 12 from these 24 are:
lbc1--lbc1 lbc48--lbc48
lbc17--lbc17 lbc52--lbc52
lbc29--lbc29 lbc54--lbc54
lbc34--lbc34 lbc62--lbc62
lbc38--lbc38 lbc70--lbc70
lbc40--lbc40 lbc9--lbc9
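To illustrate the two selection metrics described above, here is a rough sketch of how a mean scoreRatio and a read count per barcode could be tabulated. The input format is an assumption made for this example only: a headerless tab-separated file with one line per read, holding the best-scoring barcode, its score, and the second-best score. It is not the output of any SMRT Analysis tool, and `mean_score_ratio` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize per-barcode scoreRatio from a headerless TSV.
# Assumed columns: best_barcode <TAB> best_score <TAB> second_best_score
# Output columns:  barcode <TAB> mean scoreRatio <TAB> number of reads
mean_score_ratio() {
    awk -F'\t' '
        $3 > 0 {
            # scoreRatio = best score / second-best score for this read.
            sum[$1] += $2 / $3
            n[$1]++
        }
        END {
            for (b in sum) printf "%s\t%.2f\t%d\n", b, sum[b] / n[b], n[b]
        }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr   # highest mean ratio first
}
```

Reads whose second-best score is zero or negative are skipped to avoid dividing by zero. Whether that is the right treatment depends on how scores are defined in the actual data.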
###Custom Barcode Sequences
Users designing their own barcodes should be aware of the Pacific Biosciences error mode - specifically, homopolymer sequences should be avoided. For correct identification of barcodes during analysis, be sure to include all sequences between the SMRTbell adapter and the insert side end of the barcode sequence in the FASTA file which is passed to SMRT Portal.
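A quick screen for homopolymer runs in a custom barcode FASTA might look like the sketch below. The function name `find_homopolymers` is hypothetical and the default threshold of 3 identical bases is an assumption chosen for illustration; the text above only says that homopolymers should be avoided. The sketch also assumes each barcode sequence sits on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list barcodes whose sequence contains a run of at
# least MIN_RUN identical bases (default 3, an illustrative threshold).
# Usage: find_homopolymers custom_barcodes.fasta [MIN_RUN]
find_homopolymers() {
    local fasta="$1" min_run="${2:-3}"
    awk -v min="$min_run" '
        /^>/ { name = substr($1, 2); next }   # remember the current record name
        {
            seq = toupper($0)
            run = 1
            for (i = 2; i <= length(seq); i++) {
                if (substr(seq, i, 1) == substr(seq, i - 1, 1)) {
                    run++
                    # Report the barcode once, then move to the next line.
                    if (run == min) { print name; next }
                } else {
                    run = 1
                }
            }
        }
    ' "$fasta"
}
```

An empty result means no barcode in the file reaches the chosen run length.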
Secondary Analysis with Barcodes
Please click on links below for specific instructions for demultiplexing with your installed version of SMRT Analysis.
##Additional command-line tools for accessing barcoded data
Other useful tools for inspecting barcoded results are the latest `cmph5tools.py stats` and `cmph5tools.py select` scripts in your SMRT Analysis v2.0.0 or later installation. `cmph5tools.py` provides an "SQL-like" query language for PacBio native `cmp.h5` alignment files. For example, we can get a per-barcode count of alignments with the following commands:

    ALN=aligned_reads.cmp.h5
    cmph5tools.py stats --what "Count(Barcode)" \
        --where "Barcode != '--'" \
        --groupBy Barcode \
        $ALN

    Group         Count(Barcode)
    F_14--R_14    11573
    F_15--R_15    13699
    F_16--R_16    13930
    ...           ...

Many other metrics are also possible (see `cmph5tools.py listMetrics` for a list), including compound clauses, e.g.:

    cmph5tools.py stats --what "Tbl( ml=Mean(ReadLength), c=Count(Barcode))" \
        --where "((Barcode == 'F_14--R_14') | (Barcode == 'F_44--R_44')) \
        & (Reference == 'EGFR_600_t11_1')" \
        --groupBy "Barcode*Reference" \
        $ALN

    Group                        ml        c
    F_14--R_14:EGFR_600_t11_1    488.09    34
    F_44--R_44:EGFR_600_t11_1    429.77    73

The results from `cmph5tools.py stats` can also be written to a .csv text file (see `cmph5tools.py stats --help` for the relevant option).

`cmph5tools.py select` will split a cmp.h5 file into arbitrary cmp.h5 subset file(s) based on the `--groupBy` clauses. For example, the following command would split a barcoded cmp.h5 dataset into separate cmp.h5 files named by barcode:

    cmph5tools.py select --where "Barcode != '--'" --groupBy Barcode $ALN

Barcode Variant Analysis
Once barcoded subsets are split into independent cmp.h5 files using the above command, variant calls can be made with Quiver or the MinorVariants tool in two additional steps. The first step is to sort each .cmp.h5 file in place using the `sort` subtool:

    cmph5tools.py sort <barcode>.cmp.h5

Quiver can then be called on each file using the `quiver` wrapper from your SMRT Analysis environment:

    quiver -j8 <barcode>.cmp.h5 \
        -r <reference.fasta> \
        -o <barcode>_variants.gff -o <barcode>_consensus.fasta

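With many barcodes, the sort and Quiver steps are usually run in a loop. The sketch below strings together only the two commands shown above. The function name `call_variants_per_barcode` is hypothetical, and it assumes the per-barcode files all sit in one directory and are named `<barcode>.cmp.h5`. Passing `echo` as the third argument prints the commands instead of running them, which is a convenient way to check the file names first.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: sort each per-barcode cmp.h5 file and call Quiver on it.
# Usage: call_variants_per_barcode <dir> <reference.fasta> [echo]
# Assumption: <dir> holds files named <barcode>.cmp.h5.
call_variants_per_barcode() {
    local dir="$1" reference="$2" run="${3:-}"
    local f barcode
    for f in "$dir"/*.cmp.h5; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        barcode="$(basename "$f" .cmp.h5)"
        # Step 1: in-place sort, as described above.
        $run cmph5tools.py sort "$f"
        # Step 2: Quiver variant calls and consensus for this barcode.
        $run quiver -j8 "$f" -r "$reference" \
            -o "$dir/${barcode}_variants.gff" \
            -o "$dir/${barcode}_consensus.fasta"
    done
}
```

For example, `call_variants_per_barcode ./by_barcode reference.fasta echo` lists the commands that would be run for every file in a (hypothetical) `./by_barcode` directory.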
Before running the MinorVariants tool, make sure you are starting with an `alignedCCS.cmp.h5` file, which contains alignments of single-molecule (CCS or Reads of Insert) consensus sequences. This can be generated using the `RS_ReadsOfInsert_Mapping` protocol in SMRT Portal.
Also be sure to label the alignment file before breaking it up with `cmph5tools.py` as in the previous example:

    pbbarcode labelAlignments barcode.fofn alignedCCS.cmp.h5

Finally, run the MinorVariants tool on each sorted .cmp.h5 file:

    ConsensusTools.sh MinorVariants -r <reference.fasta> [options] <barcode>.cmp.h5

More information on the MinorVariants tool can be found here.
A sample dataset containing 384-barcoded randomly mutated constructs of the Phi29 DNA Polymerase is described [here](https://github.com/PacificBiosciences/DevNet/wiki/P4-C2%20384-barcoded%20dataset).