Improved Phased Assembler
Authors: Ivan Sovic, Zev Kronenberg, Christopher Dunn, Derek Barnett, Sarah Kingan, James Drake
Abstract: Improved Phased Assembler (IPA) is the official PacBio software for HiFi genome assembly. IPA was designed to utilize the accuracy of PacBio HiFi reads to produce high-quality phased genome assemblies. IPA is an end-to-end solution, starting with input reads and resulting in a polished assembly. IPA is fast, providing an easy-to-use local run mode or a distributed pipeline for a cluster.
Setup environment:
conda create -n ipa -c bioconda -c conda-forge -c defaults
conda activate ipa
conda install pbipa
Verify installation:
ipa validate
Run IPA!
ipa local --nthreads 8 --njobs 1 -i {myreads.fasta}
The resulting polished primary and associate contigs are in the 14-final folder.
Note:
- --nthreads specifies the number of threads to use per job.
- --njobs specifies how many jobs should be launched at once for the parallel tasks (overlapping, phasing and polishing). This parameter applies to both the ipa local and ipa dist run modes. Be mindful that for ipa local there will be a maximum of nthreads * njobs cores utilized for parallel jobs. For ipa dist, this specifies the number of jobs submitted to the cluster.
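As a quick sanity check, the peak local core usage follows directly from those two flags. A minimal sketch (the values below are example settings, not recommendations):

```shell
# Peak core usage for `ipa local` is nthreads * njobs.
# Example values only; pick settings that fit your machine.
NTHREADS=20
NJOBS=4
echo "peak cores: $(( NTHREADS * NJOBS ))"
```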
For example, if you're running locally on a 24-core machine, this might be a good choice of threads/jobs:
ipa local --nthreads 24 --njobs 1 -i {myreads.fasta}
And if you are running locally on an 80-core machine, this might be better:
ipa local --nthreads 20 --njobs 4 -i {myreads.fasta}
Allowing multiple jobs to run at once can make the workflow more efficient.
IPA can be run locally, or distributed on a cluster. The local mode is the easiest for new users to get started.
Local mode is appropriate for genomes up to human size. Multi-Gb or high-coverage genomes should be run with a higher CPU count for efficiency (>=64 cores, if possible).
The IPA workflow is implemented using Snakemake.
When IPA is run in cluster mode (ipa dist), the cluster submission parameters are passed directly to Snakemake. Below we provide an example command line to submit jobs to an SGE cluster. For other cluster types, please consult the Snakemake documentation.
If the assembly is run on a single local node with a high CPU count, such as 64 cores, we recommend a combination of parameters like this:
--nthreads 16 --njobs 4
This will utilize 16 threads per workflow step, and allow up to 4 parallel jobs to run simultaneously on the node. We found this to be more efficient than specifying --nthreads 64 --njobs 1, because many steps are heavily I/O dependent.
An analogous principle applies to machines with more or fewer cores (e.g. on a machine with 80 cores, one can use --nthreads 20 --njobs 4, and similar).
Assembly is very I/O heavy: many files are read and written. In local mode, it is best to run on a local drive, ideally a solid-state drive (SSD), if available.
IPA accepts many input formats:
ipa local --nthreads 24 --njobs 1 -i <myreads.fasta/fastq/bam/xml/fofn/fasta.gz/fastq.gz>
ipa local --nthreads 24 --njobs 1 -i <myreads1.fastq> -i <myreads2.fasta>
ipa local --dry-run --nthreads 24 --njobs 1 --run-dir <myoutdir> -i <myreads.fastq>
ipa local --no-polish --no-phase --nthreads 24 --njobs 1 -i <myreads.fastq>
ipa local --no-phase -i <myreads.fastq>
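One of the accepted input formats above is a FOFN (file of file names): a plain-text file listing one read file per line. A minimal sketch of building and using one (the movie*.fastq.gz names are placeholders, not real files):

```shell
# Write one read-file path per line into input.fofn.
# movie1.fastq.gz / movie2.fastq.gz are placeholder names.
printf '%s\n' movie1.fastq.gz movie2.fastq.gz > input.fofn

# The FOFN is then passed to IPA like any other input:
# ipa local --nthreads 24 --njobs 1 -i input.fofn
```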
The "--cluster-args" string may vary on your system. The following example works on an SGE cluster. For other cluster submission strings, please consult the Snakemake documentation.
mkdir -p <myoutdir>/qsub_log
ipa dist -i <myreads.fastq> --nthreads 24 --njobs 20 --run-dir <myoutdir> --cluster-args "qsub -S /bin/bash -N ipa.{rule} -cwd -q default -pe smp {params.num_threads} -e qsub_log/ -o qsub_log/ -V"
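In the SGE command above, Snakemake substitutes the {rule} and {params.num_threads} placeholders per job before submission. As an unverified sketch only, a SLURM equivalent might look like the following; the sbatch flag choices are assumptions on our part, not official IPA documentation, so check them against your site's Snakemake and SLURM setup:

```shell
# Hypothetical SLURM variant of the SGE submission string above.
# Flag choices are assumptions; adjust queue/partition names for your site.
mkdir -p <myoutdir>/slurm_log
ipa dist -i <myreads.fastq> --nthreads 24 --njobs 20 --run-dir <myoutdir> --cluster-args "sbatch --job-name=ipa.{rule} --cpus-per-task={params.num_threads} --output=slurm_log/%j.out --error=slurm_log/%j.err"
```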
IPA provides options to automatically downsample the input dataset. To do so, specify these parameters:
--genome-size <my_genome_size> --coverage <desired_coverage>
Either both parameters need to be specified, or neither. If they are not specified, full coverage is used.
Example run:
ipa local --genome-size <my_genome_size_in_bp> --coverage <desired_coverage> --no-phase -i <myreads.fastq>
IPA requires Python 3.7+. Please use Miniconda3, not Miniconda2. We currently only support Linux 64-bit, not MacOS or Linux 32-bit.
ipa local --nthreads 24 --njobs 1 --run-dir <myoutdir> --tmp-dir ${PWD} -i <myreads.fastq>
ipa local --nthreads 24 --njobs 1 --run-dir <myoutdir> -i <myreads.fastq> --resume
Note that to resume an existing run, you need to specify all the inputs and parameters as in the initial run.
After killing the Snakemake job and removing any processes still running on the cluster, the first command below unlocks the run directory and the second relaunches ipa.
ipa dist -i <myreads.fastq> --run-dir <myoutdir> --cluster-args "qsub -S /bin/bash -N ipa.{rule} -cwd -q default -pe smp {params.num_threads} -e qsub_log/ -o qsub_log/ -V" --unlock
ipa dist -i <myreads.fastq> --run-dir <myoutdir> --cluster-args "qsub -S /bin/bash -N ipa.{rule} -cwd -q default -pe smp {params.num_threads} -e qsub_log/ -o qsub_log/ -V"
Note that to resume an existing run, you need to specify all the inputs and parameters as in the initial run.
Some users may run into an issue with Samtools after their ipa installation. This can be checked by running ipa validate.
If this happens, you will see an error like this:
Checking dependencies ...
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory
...
Solution 1
Try updating your Samtools version to 1.9 and then rerun ipa validate:
conda update samtools
Solution 2 If the above solution doesn't work, try installing Samtools with OpenSSL 1.0 like this:
conda install -c bioconda samtools openssl=1.0
Solution 3 Other users were successful by specifying this exact order of channels to search, and then installing Samtools:
conda update -c conda-forge -c bioconda -c defaults samtools