Tutorial: Installing and Running Iso-Seq 3 using Conda

Elizabeth Tseng edited this page Sep 18, 2019 · 39 revisions

Last Updated: 09/18/2019

Latest BioConda Iso-Seq3 version: v3.2.2

This tutorial describes how to use the Linux developer's version of Iso-Seq 3 and its related downstream analysis in an Anaconda environment. Please refer to our official pbbioconda page for further information on Support, License, Copyright, and Disclaimer. Please report any issues using PBBioconda Issues.

See here for: Iso-Seq 3 presentation.

Who is this tutorial for?

  • Existing Iso-Seq users who are comfortable using the command line in a Linux environment
  • Users who are currently using ToFU1 or ToFU2, both of which are deprecated.
  • Advanced users with previous experience using conda packages.

Who is this tutorial NOT for?

Users who need an officially supported release should use SMRT Analysis, where Iso-Seq 3 is officially available.



Installing Iso-Seq 3 using Anaconda

(1) Download the latest version of Anaconda. (2) Install Anaconda according to the tutorial.

bash ~/Downloads/Anaconda2-5.2.0-Linux-x86_64.sh
export PATH=$HOME/anaconda5.2/bin:$PATH

Add the line export PATH=$HOME/anaconda5.2/bin:$PATH to .bashrc or .bash_profile in your home directory; otherwise you will need to type it every time you log in.
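To make this step repeatable, the line can be appended only when it is not already present. The following is a minimal sketch rather than part of the official instructions; add_path_line is a hypothetical helper name, and the rc file to edit is passed as its argument.

```shell
# Hypothetical helper: append the PATH line to an rc file only if it is missing.
add_path_line() {
    rc_file="$1"
    # Single quotes keep $HOME and $PATH unexpanded so they are written literally.
    line='export PATH=$HOME/anaconda5.2/bin:$PATH'
    touch "$rc_file"
    # -F fixed string, -x whole-line match, -q quiet
    if ! grep -Fxq "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

For example, add_path_line ~/.bashrc can be run any number of times and will leave a single copy of the line in the file.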

(3) Confirm that conda is installed and update conda:

conda -V
conda update conda

(4) Create a virtual environment (tutorial). I will call it anaCogent5.2. Type y to agree to the interactive questions.

conda create -n anaCogent5.2 python=2.7 anaconda
source activate anaCogent5.2

Once you have activated the virtual environment, your prompt should change to something like this:

(anaCogent5.2)-bash-4.1$
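In scripts, where there is no prompt to look at, the active environment can be checked programmatically. The sketch below assumes that conda exports the CONDA_DEFAULT_ENV variable on activation; env_is_active is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the named conda environment is active.
# Assumes conda sets CONDA_DEFAULT_ENV when an environment is activated.
env_is_active() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}
```

A script could then start with env_is_active anaCogent5.2 || exit 1 to stop early when the environment has not been activated.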

(5) Install additional required libraries:

conda install -n anaCogent5.2 biopython
conda install -n anaCogent5.2 -c http://conda.anaconda.org/cgat bx-python

(6) Install Iso-Seq 3 using bioconda. This will also install LIMA, PacBio's demultiplexing tool, as a dependency. Note that Iso-Seq 3 works only on Linux (Mac OS is not supported).

conda install -n anaCogent5.2 -c bioconda isoseq3=3.2
conda install -n anaCogent5.2 -c bioconda pbccs=4.0

Specifying the version as isoseq3=3.2 lets conda install the latest available patch release, that is, any version of the form 3.2.x.
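The matching rule can be illustrated with a small shell function. This is only a sketch of the behavior described above (a pin of 3.2 accepts 3.2 and any 3.2.x), not conda's actual resolver; matches_pin is a hypothetical name.

```shell
# Hypothetical illustration of the pin rule: VERSION matches PIN when it is
# exactly PIN or PIN followed by a dot and further components.
matches_pin() {
    version="$1"
    pin="$2"
    case "$version" in
        "$pin"|"$pin".*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this rule, matches_pin 3.2.2 3.2 succeeds, while 3.3.0 and 3.20.0 do not match the 3.2 pin.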

The packages below are optional:

conda install -n anaCogent5.2 -c bioconda pbcoretools # for manipulating PacBio datasets
conda install -n anaCogent5.2 -c bioconda bamtools    # for converting BAM to fasta
conda install -n anaCogent5.2 -c bioconda pysam       # for making CSV reports

Check your isoseq3 version:

$ isoseq3 --version
isoseq3 3.2.x (commit v3.2.x)
$ ccs --version
ccs 4.0.x (commit v4.0.x)
$ lima --version
lima 1.9.0 (commit v1.9.0)
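Before checking versions, it can help to confirm that all three executables are on your PATH. The helper below is a minimal sketch; missing_tools is a hypothetical name, and it only tests whether each command can be found, not which version it is.

```shell
# Hypothetical helper: print the name of every tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools isoseq3 ccs lima inside the activated environment should print nothing if the installation succeeded.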

Running Iso-Seq 3

[Figure: Iso-Seq 3 workflow]

Please follow the Iso-Seq 3 tutorial. Here we list each step as described in the tutorial and explain the output.

0. Generate CCS

If you don't already have CCS reads, run:

ccs [movie].subreads.bam [movie].ccs.bam --min-rq 0.9

Note that starting with isoseq3 version 3.2, polishing is run as part of CCS!

1. Classify full-length reads:

Command:

lima --isoseq --dump-clips --no-pbi --peek-guess -j 24 ccs.bam primers.fasta demux.bam       

lima identifies and removes the 5' and 3' cDNA primers. If the sample is barcoded, include the barcode as part of the primer. See Iso-Seq 3: Primer removal and demultiplexing.

Use --peek-guess to remove spurious matches (only applicable if you supply multiple primer pairs).

The dumped clips (via --dump-clips) show the clipped primers. bq is the barcode score and bc is the primer index. Here, bc:0 is the Clontech 5' primer including the ATGGG overhang and bc:1 is the Clontech 3' primer. Note that the clips could be in either orientation, but the lima output will orient the output FL read to 5' -> 3'.

>m54254_171121_005529/73335088/0_30 bq:100 bc:0
AAGCAGTGGTATCAACGCAGAGTACATGGG
>m54254_171121_005529/73335088/1953_1978 bq:100 bc:1
GTACTCTGCGTTGATACCACTGCTT
>m54254_171121_005529/73335094/0_24 bq:88 bc:1
AAGCAGTGTATCAACGCAGAGTAC
>m54254_171121_005529/73335094/3386_3415 bq:80 bc:0
CCCATGTACGCTGCGTTGATACACTGCTT
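A quick way to sanity-check the clips is to tally how many were assigned to each primer index. The sketch below assumes the header layout shown above (a bc:<index> tag on each header line); count_clips_by_primer is a hypothetical helper, and the clips file name is passed as an argument rather than assumed.

```shell
# Hypothetical helper: count dumped clips per primer index (the bc: tag).
# Prints "<primer_index><TAB><count>", sorted by index.
count_clips_by_primer() {
    awk '/^>/ {
            # Header fields look like: >movie/zmw/start_end bq:100 bc:0
            for (i = 2; i <= NF; i++)
                if ($i ~ /^bc:/) n[substr($i, 4)]++
         }
         END { for (k in n) print k "\t" n[k] }' "$1" | sort -n
}
```

For a standard single primer pair, the counts for index 0 and index 1 should be roughly equal, since each full-length read contributes one 5' clip and one 3' clip.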

If multiple 5'/3' pairs of primers are given, lima will output one <prefix>.<5p>--<3p>.bam for each pair. If you want to analyze all the demultiplexed FL reads together to increase transcript recovery (Example: Same species, different tissues), you must make a combined data set:

dataset create --type ConsensusReadSet combined_demux.consensusreadset.xml \
    prefix.5p--barcode1_3p.bam \
    prefix.5p--barcode2_3p.bam \
    prefix.5p--barcode3_3p.bam ...

To remove polyA tails and artificial concatemers, run isoseq3 refine next.

isoseq3 refine --require-polya combined_demux.consensusreadset.xml primers.fasta flnc.bam

Use --require-polya if your transcripts have a polyA tail.

An intermediate flnc.bam file is produced which contains the FLNC reads. To convert it to FASTQ format, run:

bamtools convert -format fastq -in flnc.bam > flnc.fastq
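After the conversion, you may want to confirm how many FLNC reads were written. The sketch below assumes the FASTQ file uses exactly four lines per record (no wrapped sequence lines), which should be verified for your output; count_fastq_records is a hypothetical helper name.

```shell
# Hypothetical helper: count records in a FASTQ file, assuming 4 lines per record.
# Exits non-zero if the line count is not a multiple of 4.
count_fastq_records() {
    awk 'END { print int(NR / 4); exit (NR % 4 != 0) }' "$1"
}
```

For example, count_fastq_records flnc.fastq prints the number of reads, which can be compared against the number of rows in flnc.report.csv.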

Special: What to do for TeloPrime primers

The next version of isoseq3 supports variable polyA length. For TeloPrime, we recommend running lima without the As, then running isoseq3 refine with a smaller than default polyA length.

lima --isoseq --dump-clips ccs.bam primers.fasta output.bam

isoseq3 refine --require-polya --min-polya-length 12 output.5p--3p.bam primers.fasta flnc.bam

where primers.fasta is

>5p
TGGATTGATATGTAATACGACTCACTATAG
>3p
CGCCTGAGA

2. Cluster FLNC reads:

Command:

isoseq3 refine --require-polya demux.P5--P3.bam barcodes.fasta flnc.bam
isoseq3 cluster flnc.bam polished.bam --verbose --use-qvs

Note: Because ccs was run with polishing, the isoseq3 cluster output is already polished! No additional polishing step is required.

After completion, you will see the following files:

polished.bam       
polished.bam.pbi   
polished.hq.fasta.gz
polished.lq.fasta.gz
polished.cluster   
polished.transcriptset.xml

NOTE: QVs will not be available for the polished HQ fasta coming out of isoseq3 cluster. If you wish to have per-base QV, you will need to run the much slower full isoseq3 polish step that involves the subreads.bam files. See the isoseq3 v3.1 tutorial on serial polish here.

isoseq3 polish polished.bam [movie.]subreads.bam polished_with_qv.bam

3. Understanding flnc.report.csv and polished.cluster_report.csv

flnc.report.csv is a CSV file showing which barcode/primers each FLNC read belongs to. It is the output of the isoseq3 refine step.

polished.cluster_report.csv is a CSV file showing which clusters each FLNC read belongs to. It is the output of the isoseq3 cluster step.

Having these two CSV files enables you to run Cupcake scripts, such as collapsing, getting FL counts for each transcript, and demultiplexing, the same way you did for Iso-Seq 1 and 2 output.

An example for flnc.report.csv (previously named classify_report.csv) is below:

id,strand,fivelen,threelen,polyAlen,insertlen,primer_index,primer
m54020_170625_150952/69664969/ccs,-,31,39,57,2627,0--7,Clontech--bc7
m54020_170625_150952/69664996/ccs,-,31,40,59,990,0--6,Clontech--bc6
m54020_170625_150952/69664999/ccs,+,30,38,59,1724,0--1,Clontech--bc1

And for polished.cluster_report.csv:

cluster_id,read_id,read_type
transcript/0,m54020_170625_150952/23200299/ccs,FL
transcript/0,m54020_170625_150952/21562272/ccs,FL
transcript/2690,m54020_170625_150952/51708466/ccs,FL
transcript/2690,m54020_170625_150952/22151725/ccs,FL
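The two reports share the read ID, so they can be joined to get FL counts per transcript and primer. The sketch below only illustrates that idea; it is not how the Cupcake scripts are implemented. It assumes the column layouts shown in the two examples, and fl_counts_by_primer is a hypothetical helper name.

```shell
# Hypothetical helper: join flnc.report.csv ($1) with the cluster report ($2)
# and count FL reads per (transcript, primer) pair.
fl_counts_by_primer() {
    awk -F, '
        FNR == 1  { next }                     # skip the header line of each file
        NR == FNR { primer[$1] = $8; next }    # first file: map read id to primer name
        $3 == "FL" {                           # second file: cluster_id,read_id,read_type
            p = ($2 in primer) ? primer[$2] : "NA"
            n[$1 "\t" p]++
        }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" "$2" | LC_ALL=C sort
}
```

Each output line has the form transcript ID, primer name, FL count. Reads that appear in the cluster report but not in the FLNC report are counted under NA.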

4. Which parts of Iso-Seq 3 can be parallelized for speedup?

The following parts can be done in parallel:

  • ccs
  • lima

The following step cannot be done in parallel:

  • isoseq3 cluster

As an example, say you have three movies. You can run CCS on them in parallel to get three outputs: movie1.ccs.bam, movie2.ccs.bam, movie3.ccs.bam.

Now you can run lima and isoseq3 refine on each separately:

lima --isoseq --dump-clips -j 24 movie1.ccs.bam primers.fasta demux1.bam
lima --isoseq --dump-clips -j 24 movie2.ccs.bam primers.fasta demux2.bam 
lima --isoseq --dump-clips -j 24 movie3.ccs.bam primers.fasta demux3.bam 
isoseq3 refine --require-polya demux1.5p--3p.bam primers.fasta flnc1.bam
isoseq3 refine --require-polya demux2.5p--3p.bam primers.fasta flnc2.bam
isoseq3 refine --require-polya demux3.5p--3p.bam primers.fasta flnc3.bam
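On a single machine, the per-movie commands can be launched as background jobs so that the movies are processed concurrently. The following is a minimal sketch, assuming the primers are named 5p and 3p (so lima writes demuxN.5p--3p.bam) and that primers.fasta is in the current directory; run_movie and run_all_movies are hypothetical helper names.

```shell
# Hypothetical helper: run lima, then isoseq3 refine, for one movie.
# $1 = CCS BAM, $2 = index used to name the outputs (demuxN, flncN).
run_movie() {
    ccs_bam="$1"
    idx="$2"
    lima --isoseq --dump-clips -j 24 "$ccs_bam" primers.fasta "demux${idx}.bam" &&
    isoseq3 refine --require-polya "demux${idx}.5p--3p.bam" primers.fasta "flnc${idx}.bam"
}

# Hypothetical helper: process all given CCS BAMs concurrently, then wait.
# Returns non-zero if any movie failed.
run_all_movies() {
    pids=""
    i=1
    for bam in "$@"; do
        run_movie "$bam" "$i" &
        pids="$pids $!"
        i=$((i + 1))
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

For example, run_all_movies movie1.ccs.bam movie2.ccs.bam movie3.ccs.bam produces flnc1.bam, flnc2.bam, and flnc3.bam. Note that each lima job here requests 24 threads, so you may want to lower -j when running several movies at once on one node.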

Now you can create a dataset XML that references the three outputs:

dataset create --type ConsensusReadSet combined.flnc.xml \
  flnc1.bam flnc2.bam flnc3.bam

Then run the cluster step using as many cores as possible:

isoseq3 cluster combined.flnc.xml polished.bam --verbose --use-qvs

The split cluster output can then be run in parallel again and combined later.

What to do after Iso-Seq3?

If you have a reference genome, you can follow this tutorial to map the transcripts back to the genome, remove redundancy, and generate GFF output.

If you do not have a reference genome, you may be interested in using Cogent to create gene families and reconstruct the coding portions of the genome.
