Improved Phased Assembler

Zev Kronenberg edited this page Jun 30, 2020 · 36 revisions

Authors: Ivan Sovic, Zev Kronenberg, Christopher Dunn, Derek Barnett, Sarah Kingan, James Drake

Abstract: Improved Phased Assembler (IPA) is the official PacBio software for HiFi genome assembly. IPA was designed to utilize the accuracy of PacBio HiFi reads to produce high-quality phased genome assemblies. IPA is an end-to-end solution, starting with input reads and resulting in a polished assembly. IPA is fast, providing an easy-to-use local run mode and a distributed pipeline for clusters.

Quick Start

Setup environment:

conda create -n ipa -c bioconda -c conda-forge -c defaults
conda activate ipa
conda install pbipa

Verify installation:

ipa validate

Run IPA!

ipa local --nthreads 8 --njobs 1 -i {myreads.fasta}

The resulting polished primary and associate contigs are in the 14-final folder.

Note:

  • --nthreads specifies the number of threads to use per job.
  • --njobs specifies how many jobs are launched at once for the parallel tasks (overlapping, phasing, and polishing). This parameter applies to both the ipa local and ipa dist run modes. Be mindful that for ipa local, a maximum of nthreads * njobs cores will be utilized by parallel jobs. For ipa dist, this specifies the number of jobs submitted to the cluster.

For example, if you're running locally on a 24-core machine, this might be a good choice of threads/jobs:

ipa local --nthreads 24 --njobs 1 -i {myreads.fasta}

And if you are running locally on an 80-core machine, this might be a better choice:

ipa local --nthreads 20 --njobs 4 -i {myreads.fasta}

Allowing multiple jobs to run at once can make the workflow more efficient.
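The nthreads * njobs core budget above can be sketched as plain shell arithmetic. This is only an illustration, not an IPA command; the core count is an assumed example (in practice, check it with nproc):

```shell
# Illustration of the core budget: in local mode, parallel steps use
# at most nthreads * njobs cores in total.
cores=64                      # assumed machine size; use `nproc` in practice
njobs=4                       # parallel jobs (overlapping, phasing, polishing)
nthreads=$((cores / njobs))   # threads per job so the budget fits the machine
echo "--nthreads ${nthreads} --njobs ${njobs}"
```

With cores=64 this prints --nthreads 16 --njobs 4, i.e. the full machine split across four parallel jobs.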

IPA run modes

IPA can be run locally or distributed on a cluster. Local mode is the easiest way for new users to get started.
Local mode is appropriate for genomes up to human size. Multi-Gb or high-coverage genomes should be run with a higher CPU count for efficiency (>=64 cores, if possible).

The IPA workflow is implemented using Snakemake. When IPA is run in the cluster mode (ipa dist), the cluster submission parameters are directly passed to Snakemake for running. Below we provide an example command line to submit jobs to an SGE cluster. For other cluster types, please consult the Snakemake documentation.

If the assembly is run on a single local node with a high CPU count, such as 64 cores, we recommend a parameter combination like this:

--nthreads 16 --njobs 4

This will utilize 16 threads per workflow step and allow parallel steps to run up to 4 jobs simultaneously on the node. We found this to be more efficient than specifying --nthreads 64 --njobs 1 because many steps are heavily I/O bound.
The same principle applies to machines with more or fewer cores (e.g. for an 80-core machine, one can use --nthreads 20 --njobs 4).

Assembly is very I/O heavy: many files are read and written. In local mode, it is best to run on a local drive, ideally a solid-state drive (SSD) if available.

Examples

A. Phased run with polishing

IPA accepts many input formats:

ipa local --nthreads 24 --njobs 1 -i <myreads.fasta/fastq/bam/xml/fofn/fasta.gz/fastq.gz>
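One of the accepted inputs, a FOFN ("file of file names"), is simply a plain-text file listing one input file path per line. A minimal sketch, with hypothetical paths:

```shell
# Build a FOFN listing the input read files, one path per line.
cat > myreads.fofn <<'EOF'
/data/run1/movie1.hifi_reads.fastq.gz
/data/run2/movie2.hifi_reads.bam
EOF
# The FOFN is then passed like any other input:
#   ipa local --nthreads 24 --njobs 1 -i myreads.fofn
```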

B. Phased run with polishing and multiple input files

ipa local --nthreads 24 --njobs 1 -i <myreads1.fastq> -i <myreads2.fasta>

C. "Dry run" to test environment and command, specify run directory name

ipa local --dry-run --nthreads 24 --njobs 1 --run-dir <myoutdir> -i <myreads.fastq>

D. Fast draft assembly (no polishing or phasing)

ipa local --no-polish --no-phase --nthreads 24 --njobs 1 -i <myreads.fastq>

E. Haploid assembly with polishing

ipa local --no-phase -i <myreads.fastq>

F. Phased run in distributed mode (on cluster)

The "--cluster-args" string may vary on your system. The following example works on SGE clusters. For other cluster submission strings, please consult the Snakemake documentation.

mkdir -p <myoutdir>/qsub_log
ipa dist -i <myreads.fastq> --nthreads 24 --njobs 20 --run-dir <myoutdir> --cluster-args "qsub -S /bin/bash -N ipa.{rule} -cwd -q default -pe smp {params.num_threads} -e qsub_log/ -o qsub_log/ -V"
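The string above is SGE-specific. For a SLURM cluster, an analogous submission string might look like the following sketch; the sbatch flags, partition name, and log directory are assumptions, and the exact string should be checked against the Snakemake documentation for your site:

```shell
mkdir -p <myoutdir>/slurm_log
ipa dist -i <myreads.fastq> --nthreads 24 --njobs 20 --run-dir <myoutdir> \
  --cluster-args "sbatch --job-name=ipa.{rule} --cpus-per-task={params.num_threads} --partition=default --output=slurm_log/%j.out --error=slurm_log/%j.err"
```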

G. Downsampling the input dataset

IPA provides an option to automatically downsample the input dataset. To do so, specify both of these parameters:

--genome-size <my_genome_size> --coverage <desired_coverage>

Either both parameters must be specified or neither. If they are not specified, the full input coverage is used.
Example run:

ipa local --genome-size <my_genome_size_in_bp> --coverage <desired_coverage> --no-phase -i <myreads.fastq>
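Roughly speaking, downsampling to a target coverage keeps about genome_size * coverage bases of input reads. A small worked example of that arithmetic (the numbers are assumed example values, not defaults):

```shell
# Rough downsampling target: genome size (bp) times desired fold coverage.
genome_size=3100000000   # example: a human-sized genome, in bp
coverage=30              # example: desired fold coverage of HiFi reads
target_bases=$((genome_size * coverage))
echo "IPA would keep roughly ${target_bases} bp of reads"
```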

Common Solutions

A. Bioconda installation or activation issues

IPA requires Python 3.7+. Please use Miniconda3, not Miniconda2. We currently only support 64-bit Linux, not macOS or 32-bit Linux.

B. Specify run directory and tmp directory (in case of disk space issues)

ipa local --nthreads 24 --njobs 1 --run-dir <myoutdir> --tmp-dir ${PWD} -i <myreads.fastq>

C. Resume run in local mode

ipa local --nthreads 24 --njobs 1 --run-dir <myoutdir> -i <myreads.fastq> --resume

Note that to resume an existing run, you need to specify all the inputs and parameters as in the initial run.

D. Resume run in dist mode

After killing the Snakemake job and removing any processes still running on the cluster, run the first command to unlock the run directory, then the second to relaunch IPA.

ipa dist -i <myreads.fastq> --run-dir <myoutdir> --cluster-args "qsub -S /bin/bash -N ipa.{rule} -cwd -q default -pe smp {params.num_threads} -e qsub_log/ -o qsub_log/ -V" --unlock

ipa dist -i <myreads.fastq> --run-dir <myoutdir> --cluster-args "qsub -S /bin/bash -N ipa.{rule} -cwd -q default -pe smp {params.num_threads} -e qsub_log/ -o qsub_log/ -V"

Note that to resume an existing run, you need to specify all the inputs and parameters as in the initial run.

E. Samtools OpenSSL issue

Some users may run into an issue with Samtools after installing IPA. This can be checked by running ipa validate. If the problem occurs, you will see an error like this:

Checking dependencies ...
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
...

Solution 1 Try updating your Samtools version to 1.9 and then rerun ipa validate:

conda update samtools

Solution 2 If the above solution doesn't work, try installing Samtools with OpenSSL 1.0 like this:

conda install -c bioconda samtools openssl=1.0

Solution 3 Other users were successful by specifying this exact order of channels to search for the update, and then install Samtools:

conda update -c conda-forge -c bioconda -c defaults samtools