Barcoding

John Harting edited this page Nov 13, 2015 · 85 revisions

This page describes general guidelines and suggestions for using the barcoding protocol in SMRT Analysis v1.4.0 and later (currently: v2.3.0). The emphasis here is on the bioinformatics aspects of the barcoding protocol, including some considerations for sample preparation of barcoded insert sequences. More detail on sample preparation can be found in the shared protocol.

Introduction

SMRT Analysis v1.4.0 and later include barcoding functionality that detects and identifies unique barcodes and then splits reads into separate files per barcode.

See the Secondary Analysis links below for information on the SMRT Portal barcoding protocols.

New Sample Prep Protocols

In April 2015, Pacific Biosciences introduced 96-well barcoding kits for enhanced multiplexed sample preparation. Two kits are available, both of which use 16 bp barcodes lbc001 through lbc096:

  • 96-well Barcoded Adapters
    SMRTbell adapters with barcodes for appending during sample prep ligation.
  • 96-well Barcode Universal Primers
    PCR primers with a 30bp universal sequence for appending during PCR amplification.
    FORWARD (U1) 5’-GCAGTCGAACATGTAGCTGACTCAGGTCAC-3’
    REVERSE (U2) 5’-TGGATCACTTGTGCAAGCATCACATCGTAG-3’

These barcoding kits both generate data sets with a symmetric barcode design, and should be demultiplexed in the symmetric mode in the SMRT Analysis barcoding protocols.

For more information, see the following:

Known Issues

There is a known bug in SMRT Analysis v2.0.0 when using sequence data from the 150k RS II. Unordered barcode.fofn files will sometimes cause the barcoded fastq files to have no data (SMRT Portal fails) or partial data (no failure warning). Additionally, only about one third of barcoded reads will be labelled with a barcode in the aligned reads.

Please update to SMRT Analysis v2.0.1 or later to use barcoding with the PacBio RS II.

Barcode Sequences

A set of 384 16 bp barcodes, custom designed for the Pacific Biosciences error mode, is recommended for use in SMRT barcoding protocols. This barcode set was designed to be used symmetrically (same barcode on either end of the insert) or asymmetrically (different barcodes on either end).

Click here for FASTA of PacBio 384 barcodes

Links to downloadable files for ordering primers with the recommended barcodes can be found on samplenet. Corresponding barcode FASTA files for analysis in SMRT Portal can be found here.

  • To use only a subset of the barcodes provided, simply select the desired barcodes and also truncate the FASTA file to reflect the barcodes present in the sample set.
  • Additionally, an HTML tool for generating FASTA files with asymmetric barcode pairs can be found here.
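
Truncating the FASTA file to a barcode subset can be scripted. The following is a minimal Python sketch, not part of SMRT Analysis. The function name subset_fasta and the record names and sequences in the example are hypothetical placeholders, and the sketch assumes that the first whitespace-delimited token of each FASTA header is the barcode name.

```python
def subset_fasta(fasta_text, keep_names):
    """Return FASTA text containing only records whose name is in keep_names.

    Records keep their original order. The record name is assumed to be the
    first whitespace-delimited token after '>'.
    """
    keep = set(keep_names)
    out = []
    writing = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            writing = name in keep
        if writing and line.strip():
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


# Hypothetical three-record file; placeholder sequences, not real barcodes.
example = (
    ">bc_A\nACGTACGTACGTACGT\n"
    ">bc_B\nTGCATGCATGCATGCA\n"
    ">bc_C\nGATCGATCGATCGATC\n"
)
print(subset_fasta(example, ["bc_A", "bc_C"]), end="")
```

The same list of names should match the barcodes actually present in the sample set, so that the FASTA passed to SMRT Portal and the wet-lab design stay in agreement.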

Low Multiplex Barcode Subset

For customers who want to perform experiments at lower plex, PacBio has performed a limited number of experiments to recommend the best 12 or 24 barcodes. These experiments used Barcoded Adapters with a fixed 2 kb insert. Two criteria were used to select the best barcodes:

  • We looked at a 'scoreRatio' metric, which measures the ratio of the best-scoring barcode to the second-best barcode for each read. Barcodes with a high mean scoreRatio were selected, since this indicates better separation from other barcodes when scored together.
  • The second metric identified barcodes with a high number of subreads, favoring the group with the lowest standard deviation in the number of subreads.

This means that we may not have chosen the barcodes with the highest number of subreads or the highest scoreRatio. The best 24 barcodes, as defined by this experimental system and these metrics, are:

lbc1--lbc1    lbc32--lbc32  lbc46--lbc46  lbc56--lbc56 
lbc10--lbc10  lbc34--lbc34  lbc48--lbc48  lbc62--lbc62 
lbc17--lbc17  lbc35--lbc35  lbc51--lbc51  lbc70--lbc70
lbc19--lbc19  lbc38--lbc38  lbc52--lbc52  lbc75--lbc75
lbc21--lbc21  lbc40--lbc40  lbc54--lbc54  lbc82--lbc82           
lbc29--lbc29  lbc41--lbc41  lbc55--lbc55  lbc9--lbc9

Following this same logic, the best 12 from these 24 are:

lbc1--lbc1    lbc48--lbc48
lbc17--lbc17  lbc52--lbc52
lbc29--lbc29  lbc54--lbc54
lbc34--lbc34  lbc62--lbc62
lbc38--lbc38  lbc70--lbc70
lbc40--lbc40  lbc9--lbc9
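
The scoreRatio criterion described above can be illustrated with a small Python sketch. This is not the SMRT Analysis implementation: the function names, the input layout (one dictionary of barcode scores per read) and the scores themselves are assumptions made for illustration. It only shows the idea of dividing the best score by the second-best score for each read and averaging per barcode.

```python
from collections import defaultdict
from statistics import mean


def score_ratio(scores):
    """Ratio of the best barcode score to the second-best score for one read.

    `scores` maps barcode name -> score (higher is better).
    Returns (best_barcode, ratio).
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two barcodes")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_name, best), (_, second) = ranked[0], ranked[1]
    if second <= 0:
        raise ValueError("second-best score must be positive")
    return best_name, best / second


def mean_score_ratio_by_barcode(reads):
    """Mean scoreRatio per barcode, grouping each read under its best barcode."""
    ratios = defaultdict(list)
    for scores in reads:
        name, ratio = score_ratio(scores)
        ratios[name].append(ratio)
    return {name: mean(vals) for name, vals in ratios.items()}


# Made-up scores for three reads against three barcodes.
reads = [
    {"lbc1": 30, "lbc9": 15, "lbc17": 10},
    {"lbc1": 24, "lbc9": 12, "lbc17": 8},
    {"lbc9": 27, "lbc1": 18, "lbc17": 9},
]
print(mean_score_ratio_by_barcode(reads))
```

A higher mean ratio for a barcode means its reads were, on average, further from the runner-up barcode, which is the separation property the selection aimed for.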

Custom Barcode Sequences

Users designing their own barcodes should be aware of the Pacific Biosciences error mode; specifically, homopolymer sequences should be avoided. For correct identification of barcodes during analysis, be sure to include all sequences between the SMRTbell adapter and the insert-side end of the barcode sequence in the FASTA file that is passed to SMRT Portal.
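
Candidate barcodes can be screened for homopolymer runs before ordering. The Python sketch below is one hedged way to do this. The function names, the candidate sequences and the max_run threshold are all illustrative assumptions, not PacBio recommendations.

```python
import re


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base in seq."""
    runs = re.finditer(r"(.)\1*", seq.upper())
    return max((len(m.group(0)) for m in runs), default=0)


def flag_homopolymers(barcodes, max_run=1):
    """Return names of barcodes whose longest homopolymer run exceeds max_run.

    max_run=1 rejects any two identical adjacent bases; this threshold is an
    illustrative assumption.
    """
    return [name for name, seq in barcodes.items()
            if longest_homopolymer(seq) > max_run]


# Hypothetical candidate barcodes (placeholder sequences).
candidates = {
    "cand_1": "ACGTACGTACGTACGT",  # no repeated adjacent bases
    "cand_2": "ACGGGTACGTACGTAC",  # contains a GGG run
}
print(flag_homopolymers(candidates))
```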

Secondary Analysis with Barcodes

Please click on links below for specific instructions for demultiplexing with your installed version of SMRT Analysis.

Additional command-line tools for accessing barcoded data

cmph5tools.py

Other useful tools for inspecting barcoded results are the cmph5tools.py stats and cmph5tools.py select commands in your SMRT Analysis v2.0.0 or later installation. cmph5tools.py provides an "SQL-like" query language for PacBio native cmp.h5 alignment files. For example, we can get a per-barcode count of alignments with the following commands:

ALN=aligned_reads.cmp.h5

cmph5tools.py stats --what "Count(Barcode)" \
                    --where "Barcode != '--'" \
                    --groupBy Barcode \
                    $ALN

Group             Count(Barcode)
F_14--R_14                 11573
F_15--R_15                 13699
F_16--R_16                 13930
...                        ...

Many other metrics are also possible (see cmph5tools.py listMetrics for a list), including compound clauses, e.g.:

cmph5tools.py stats --what "Tbl( ml=Mean(ReadLength), c=Count(Barcode))" \
                    --where "((Barcode == 'F_14--R_14') | (Barcode == 'F_44--R_44')) \
                      & (Reference == 'EGFR_600_t11_1')" \
                    --groupBy "Barcode*Reference" \
                    $ALN

Group                           ml              c
F_14--R_14:EGFR_600_t11_1       488.09          34
F_44--R_44:EGFR_600_t11_1       429.77          73

The results from cmph5tools.py stats can also be written to a .csv text file using the --outFile option.
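
The per-barcode counts printed by the first stats example can be post-processed, for instance to see how evenly reads are spread across barcodes. The Python sketch below assumes the whitespace-delimited layout shown in that example (a header line followed by name and count columns). It does not cover the .csv layout written by --outFile, and the function name parse_stats_table is hypothetical.

```python
def parse_stats_table(text):
    """Parse whitespace-delimited stats output into {group: count}.

    Assumes one header line, then one row per group with the group name in
    the first column and an integer count in the second. Other rows (such
    as a '...' placeholder) are skipped.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    counts = {}
    for fields in rows[1:]:  # skip the header row
        if len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


table = """Group             Count(Barcode)
F_14--R_14                 11573
F_15--R_15                 13699
F_16--R_16                 13930
"""
counts = parse_stats_table(table)
total = sum(counts.values())
for name, n in sorted(counts.items()):
    # Print each barcode with its count and its share of the total.
    print(f"{name}\t{n}\t{n / total:.1%}")
```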

The tool cmph5tools.py select will split a cmp.h5 file into arbitrary cmp.h5 subset file(s) based on the --where and --groupBy clauses. For example, the following command would split a barcoded cmp.h5 dataset into separate cmp.h5 files named by barcode:

cmph5tools.py select --where "Barcode != '--'" --groupBy Barcode $ALN

Barcode Variant Analysis

Once barcoded subsets are split into independent cmp.h5 files using the above command, making variant calls using Quiver or the MinorVariants tool can be done with two additional steps. The first step is to sort each .cmp.h5 file using the sort subtool (in-place):

cmph5tools.py sort <barcode>.cmp.h5

Quiver

Quiver can then be called on each file using the quiver wrapper from your SMRT Analysis environment:

quiver -j8 <barcode>.cmp.h5 \
       -r <reference.fasta> \
       -o <barcode>_variants.gff \
       -o <barcode>_consensus.fasta

Additional information on how to use Quiver can be found in the Quiver Howto wiki and Quiver FAQ.
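
With many barcodes, the sort and quiver steps above are usually run in a loop. The Python sketch below builds the two command lines for each per-barcode file, using only the options shown above. It assumes the split files are named <barcode>.cmp.h5 and that output is written to the current directory. The helper names build_commands and run_all are hypothetical, and by default the commands are only printed, not executed.

```python
import subprocess
from pathlib import Path


def build_commands(cmp_h5, reference, jobs=8):
    """Return the sort and quiver command lines for one per-barcode cmp.h5 file."""
    cmp_h5 = Path(cmp_h5)
    suffix = ".cmp.h5"
    name = cmp_h5.name
    # "F_14--R_14.cmp.h5" -> "F_14--R_14"
    barcode = name[: -len(suffix)] if name.endswith(suffix) else cmp_h5.stem
    sort_cmd = ["cmph5tools.py", "sort", str(cmp_h5)]
    quiver_cmd = [
        "quiver", f"-j{jobs}", str(cmp_h5),
        "-r", str(reference),
        "-o", f"{barcode}_variants.gff",
        "-o", f"{barcode}_consensus.fasta",
    ]
    return sort_cmd, quiver_cmd


def run_all(files, reference, dry_run=True):
    """Sort each file, then call quiver on it; with dry_run, only print the commands."""
    for path in files:
        for cmd in build_commands(path, reference):
            if dry_run:
                print(" ".join(cmd))
            else:
                subprocess.run(cmd, check=True)  # stop at the first failure


run_all(["F_14--R_14.cmp.h5", "F_15--R_15.cmp.h5"], "reference.fasta")
```

Passing dry_run=False runs the commands in order and raises an error if any of them exits with a non-zero status.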

Minor Variants

Before running the MinorVariants tool, make sure you are starting with an alignedCCS.cmp.h5 file, which contains alignments of single-molecule (CCS or Reads of Insert) consensus sequences. This file can be generated using the RS_ReadsOfInsert_Mapping protocol in SMRT Portal.

Also be sure to label the alignment file before breaking it up with cmph5tools.py as in the previous example.

pbbarcode labelAlignments barcode.fofn alignedCCS.cmp.h5

Finally, run the protocol on each sorted .cmp.h5 file.

ConsensusTools.sh MinorVariants -r <reference.fasta> [options] <barcode>.cmp.h5

More information on the MinorVariants tool can be found here.

384-barcoded dataset

A sample dataset containing 384-barcoded randomly mutated constructs of the Phi29 DNA Polymerase is described here.