NextDenovo requires at least one read file (option: input_fofn
) as input, it works with gzip'd FASTA and FASTQ formats and uses a config file
to pass options.
input_fofn
(one file one line)ls reads1.fasta reads2.fastq reads3.fasta.gz reads4.fastq.gz ... > input.fofn
config file
A config file is a text file that contains a set of parameters (key=value pairs) to set runtime parameters for NextDenovo. The following is a typical config file, which is also located in
doc/run.cfg
.[General] job_type = local job_prefix = nextDenovo task = all rewrite = yes deltmp = yes parallel_jobs = 20 input_type = raw read_type = clr # clr, ont, hifi input_fofn = input.fofn workdir = 01_rundir [correct_option] read_cutoff = 1k genome_size = 1g # estimated genome size sort_options = -m 20g -t 15 minimap2_options_raw = -t 8 pa_correction = 3 correction_options = -p 15 [assemble_option] minimap2_options_cns = -t 8 nextgraph_options = -a 1
workdir/03.ctg_graph/nd.asm.fasta
Contigs with fasta format, the fasta header includes ID, type, length, node count, a consecutive lowercase region in the sequence implies a weak connection, and a low quality base is marked with a single lowercase base.
workdir/03.ctg_graph/nd.asm.fasta.stat
Some basic statistical information (N10-N90, Total size et al.).
job_type = sge
local, sge, pbs, lsf, slurm... (default: sge)
job_prefix = nextDenovo
prefix tag for jobs. (default: nextDenovo)
task = <all, correct, assemble>
task need to run, correct = only do the correction step, assemble = only do the assembly step (only work if
input_type
= corrected orread_type
= hifi), all = correct + assemble. (default: all)rewrite = no
overwrite existed directory [yes, no]. (default: no)
deltmp = yes
delete intermediate results. (default: yes)
rerun = 3
re-run unfinished jobs untill finished or reached
rerun
loops, 0=no. (default: 3)parallel_jobs = 10
number of tasks used to run in parallel. (default: 10)
input_type = raw
input reads type [raw, corrected]. (default: raw)
input_fofn = input.fofn
input file, one line one file. (required)
read_type = {clr, hifi, ont}
reads type, clr=PacBio continuous long read, hifi=PacBio highly accurate long reads, ont=NanoPore 1D reads. (required)
workdir = 01.workdir
work directory. (default: ./)
usetempdir = /tmp/test
temporary directory in compute nodes to avoid high IO wait. (default: None)
nodelist = avanode.list.fofn
a list of hostnames of available nodes, one node one line, used with usetempdir for non-sge job_type.
submit = auto
command to submit a job, auto = automatically set by Paralleltask.
kill = auto
command to kill a job, auto = automatically set by Paralleltask.
check_alive = auto
command to check a job status, auto = automatically set by Paralleltask.
job_id_regex = auto
the job-id-regex to parse the job id from the out of
submit
, auto = automatically set by Paralleltask.use_drmaa = no
use drmaa to submit and control jobs.
read_cutoff = 1k
filter reads with length <
read_cutoff
. (default: 1k)
genome_size = 1g
estimated genome size, suffix K/M/G recognized, used to calculate
seed_cutoff
/seed_cutfiles
/blocksize
and average depth, it can be omitted when manually settingseed_cutoff
.seed_depth = 45
expected seed depth, used to calculate
seed_cutoff
, co-use withgenome_size
, you can try to set it 30-45 to get a better assembly result. (default: 45)seed_cutoff = 0
minimum seed length, <=0 means calculate it automatically using
bin/seq_stat <seq_stat>
.seed_cutfiles = 5
split seed reads into
seed_cutfiles
subfiles. (default:pa_correction
)blocksize = 10g
block size for parallel running, split non-seed reads into small files, the maximum size of each file is
blocksize
. (default: 10g)pa_correction = 3
number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage, overwrite
parallel_jobs
only for this step. (default: 3)minimap2_options_raw = -t 10
minimap2 options, used to find overlaps between raw reads, see
minimap2-nd <minimap2-nd>
for details.sort_options = -m 40g -t 10
sort options, see
ovl_sort <ovl_sort>
for details.correction_options = -p 10
correction options, see following:
-p, --process, set the number of processes used for correcting. (default: 10) -b, --blacklist, disable the filter step and increase more corrected data. -s, --split, split the corrected seed with un-corrected regions. (default: False) -fast, 0.5-1 times faster mode with a little lower accuracy. (default: False) -dbuf, disable caching 2bit files and reduce ~TOTAL_INPUT_BASES/4 bytes of memory usage. (default:False) -max_lq_length, maximum length of a continuous low quality region in a corrected seed, larger max_lq_length will produce more corrected data with lower accuracy. (default: auto [pb/1k, ont/10k])
minimap2_options_cns = -t 8 -k17 -w17
minimap2 options, used to find overlaps between corrected reads.
minimap2_options_map = -t 10
minimap2 options, used to map reads back to the assembly.
nextgraph_options = -a 1
nextgraph options, see
nextgraph <nextgraph>
for details.