Skip to content
Philipp Rescheneder edited this page May 7, 2016 · 27 revisions

Command line interface

Usage:  ngm [-c <path>] -q <reads> -r <reference> [-o <output>] [parameter]

Input/Output:


-c/--config Path to the config file. The config file contains all advanced options. If this parameter is omitted, default values will be used.
-r/ --ref Path to the reference genome (format: (gzipped) FASTA). NGM will preprocess the reference file if no other option is given.
-q/ --qry Path to the read file. Path to the read file. If the file contains interleaved mates use -p/--paired. Valid input formats are: (gzipped) FASTA/Q, SAM/BAM.
-1/--qry1 Path to the read file containing mates 1. Valid input formats are: FASTA/Q (gzipped), SAM/BAM
-2/--qry2 Path to the read file containing mates 2. Valid input formats are: FASTA/Q (gzipped), SAM/BAM
-p/ --paired Input data is paired end. (default: single end)
-I/ --min-insert-size The minimum insert size for paired end reads (default: 0)
-X/ --max-insert-size The maximum insert size for paired end reads (default: 1000)

Output:

-o/--output Path to output file.
-b/--bam Output BAM instead of SAM. (default: sam)
--hard-clip Use hard instead of soft clipping for BAM/SAM output
--silent-clip Hard clip reads but don’t add clipping information to CIGAR string

General:

-t/--threads <int> Number of threads used for candidate search
-s/--sensitivity <float> A threshold (0 (sensitive)-1 (fast)) that triggers the alignment computation of a genomic region given the seed word frequency. (default: estimated from input data)
-n/--topn Prints the best alignments sorted by alignment score. Available ONLY for single-end data (default: 1)
--strata Only output the highest scoring mappings for any given read, up to mappings per read. NOTE: If a read has more than mappings with the same score, it is discarded and reported as unmapped.
-i/--min-identity <0-1> All reads mapped with an identity lower than this threshold will be reported as unmapped (default: disabled)
-R/--min-residues <int> All reads mapped with a lower number than n residues will be reported as unmapped (default: disabled))
-g/--gpu [int,...] Use GPU for alignment computation (GPU Ids are zero-bassed!).
With -g or --gpu, GPU 0 will be used.
With -g 1 or --gpu, 1 GPU 1 will be used.
With -g 0,1 or --gpu, 0,1 GPU 0 and 1 will be used.
If -g/--gpu is ommitted, alignments will be computed on the CPU (not supported for bs-mapping!)
--bs-mapping Enables bisulfite mapping.
--bs-cutoff <int> Max. number of Ts in a k-mer. All k-mers were the number of Ts is higher than <int> are ignored (default: 8)
-h/--help Prints help

Advanced settings:

-k/--kmer [10-15] Seed length in bases. (default: 13)
--kmer-skip <int> Number of seeds to skip when building the lookup table from the reference (default: 2)
--kmer-min <int> Minimal number of k-mer hits required to consider a region a CMR. For data sets that show a very high divergence to the reference sequence this can be set to 1 to speed up the mapping process. (default: 0)
--max-cmrs <int> Reads that have more than <int> CMRs are ignored. (default: infinite)
-m/mode [0,1] Alignment mode: 0 = local, 1 = semi-global. (default: 0)
-C/--max-consec-indels <int> Maximum number of consecutive indels allowed. (default: estimated from input)
--fast-pairing Mates are mapped individually. If the best alignments for the mates give a proper pair they are marked as paired in the output. If not they are reported as broken pair.
--pair-score-cutoff <0-1> All pairs with a score in the range [top score; top score * pair-score-cutoff] are considered equal scoring pairs. For equal scoring pairs insert-size is used to break the ties (default: 0.9).
--match-bonus <int> Match score (default: 10, affine: 10, bs-mapping: 4)
--mismatch-penatly <int> Mismatch penalty (default: 15, affine: 15, bs-mapping: 2)
--gap-read-penalty <int> Penalty for a gap in the read (default: 20, affine: 33, bs-mapping: 10)
--gap-ref-penalty <int> Penalty for a gap in the reference (default: 20, affine: 33, bs-mapping: 10)
--match-bonus-tt <int> Only for bs-mapping (default: 4)
--match-bonus-tc <int> Only for bs-mapping (default: 4)
--color Colored text output (default: off)

Understanding the command line output:

[Progress] Mapped: 18000, CMR/R: 18, CS: 16 (0), R/S: 12248, Time: 1.30 1.24 0.40, Pairs: 98.24 400.64
Mapped: Number of already mapped reads
CMR/R: Average number of CMRs found for one read
CS: Number of bits used for hashing positions (first),
Number of overflows in the last candidate search (CS) batch (second)
Time: Time spent on identifying CMRs (first), computing scores (second) and computing alignments and printing results (third) for the most recent batch of reads
Pairs: Percentage of pairs NextGenMap was able to align with correct orientation and distance (first),
average insert size (second)

General notes

NextGenMap consists of four main steps:

  • Step 1: Indexing the reference genome (only once)
  • Step 2: Identification of possible mapping regions on the reference genome (candidate mapping region (CMR) search)
  • Step 3: Computation of alignment scores for all CMRs found in step 2
  • Step 4: Computation of the full alignment for the best scoring CMR from step 3

NextGenMap is designed to align short and long Next Generation Sequencing reads. NextGenMap can use a GPU to reduce runtime. As a consequence the overall architecture of NextGenMap is different from the architecture of CPU only read mapping programs. In order to fully utilize a GPU, a large number (>500,000) alignment (score) computations are required. Therefore, NextGenMap employs two buffers. One for the score computation and another for the alignment calculation. Both buffers are used to “collect” sequences before submitting them to the GPU. Another important point is to avoid idle time on the CPU while sequence alignments are computed on the GPU. These two characteristics directly influence the memory consumption and the CPU usage of NextGenMap. From version 0.4.5 NextGenMap use a different approach that scales better with different kinds of input data.

Memory consumption

The average memory consumption of NextGenMap (using 4 threads) is 5-7 GB for human Illumina data. However, the memory usage also depends on how many reads are stored in the two buffers. Typically the buffer load is between 0 and 50%. However, if throughout the program, the number of possible mapping locations is higher than the number of alignments that can be computed, the load of the buffer can reach 100%. In such a case memory usage might be increased by 20% to 30%. If this poses a problem, the best solution is to reduce the number of CPU threads.

CPU usage

The average number of threads active during runtime is equal to the number specified threads (-t) by the user. Still, in rare cases NextGenMap may exceed this number due to threads that are responsible for managing the data during runtime.

For example, running NextGenMap using 4 threads (-t 4) on data sets that requires a very thorough search the number of used CPU threads might increase to 6 or above. Please note, if a computer supports the execution of only 4 threads, NextGenMap uses only 4 threads. From version 0.4.5 on NextGenMap uses a different parallelization approach. Thus, NextGenMap doesn’t use more CPU threads than specified by -t anymore. Please note that the number of CPU threads used still influences the memory consumption of NextGenMap.

Parameter estimation

NextGenMap is able to estimate the most important parameters, like the sensitivity parameter, based on the input data. Currently, NextGenMap uses the first one million reads in the data set for this purpose. Note that a increased number of reads having a high proportion of N’s in their sequence may hinder the correct estimation of the parameters. Therefore, if possible, this should be avoided.

Input

NextGenMap supports FASTA, FASTQ, SAM and BAM files as input. NextGenMap can also read compressed data (gzipped). It is not necessary to specify the format of the input file explicitly. NextGenMap will automatically determine the format.

Paired end data

Currently, NextGenMap only supports paired end data in a single file. From version 0.4.5 on NextGenMap supports paired-end data either in two seperate files (use -1 and -2) or interleaved in one file (use =-q= and =-p=)

IMPORTANT: NextGenMap does not check/compare read names. It assumes that the reads are in the correct order. Thus, the input file must only contain pairs of reads. Mixing single end and paired end reads is not supported and will corrupt the mapping results. Please see link for more information. If you still want to mix single and paired end data, we suggest concatenating the mapped SAM files together.

The interleaved format looks as following:


>Read_1/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Read_1/2
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Read_2/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
>Read_2/2
.
.
.

If you want to convert your data to one interleaved file, please use “ngm-utils interleave” to convert them:

ngm-utils interleave -1 forward_reads.fq -2 reverse_reads.fq -o suffled_reads.fq

The benefit of using ngm-utils interleave is that the read names of the pairs are checked. If there are reads without a mate present, they will be filtered.

Output

Currently NextGenMap output is, by default in SAM format. For your convenience BAM output is also supported. Please note that currently BAM output is considerably slower than SAM. For small genomes mapping can take twice as long when using BAM output. We will address this issue in a later release.
NextGenMap, in contrast to many other methods, always reports the MD string.

Mapping Quality

NextGenMap estimates mapping quality as follows:

MQ = 60* (ASb – ASs) / ASb

ASb … alignment score of best alignment
ASs … alignment score of second best alignment

Custom SAM Tags

AS Alignment score
NM Number of mismatches in the alignment (does not include insertions and deletions)
XI Identity of the alignment
X0 Number of equal scoring hits
X1 Number of suboptimal hits found (not available in version 0.4.5)
XE Number of supported seeds
XR Number of aligned residues
MD Mismatched and deleted positions/bases

Equal scoring mapping positions

If a read aligns to two or more regions on the genome having the same alignment score, NextGenMap randomly pics one position that is reported and sets the mapping quality to zero.

Please note that the possibility to print all equal scoring locations as distinct mappings (one line per location) will be available in the next release. From version 0.4.5 for single-end data on -n/--top can be used to specify how many alignments per reads should be reported. Per default, the top n alignments found are reported also including alignments with a lower score. Note that in many cases (e.g. reads with mapping quality 60) NextGenMap only finds one good alignment.
When using --strata NextGenMap will only report the top-scoring alignments (up to n). Reads that have more than n top-scoring alignments will be reported as unmapped. Consequently, using --top 1 and --strata will cause NextGenMap to output only uniquely mapped reads.

For paired-end data -n/--topn is not supported yet.

Clipping

Per default NextGenMap uses a local alignment algorithm. Thus it discards parts at the beginning/end of the reads that contain a high number of errors (clipping). The SAM format distinguishes between hard and soft clipping.

For soft clipping the SAM file contains the full unclipped reads. The regions of the read that were discarded during the alignment are indicated with the S character in the cigar string (e.g. 10S85M5S).
For hard clipping the SAM file contains the clipped reads and in the CIGAR string clipping is indicated with the H character (e.g. 10H85M5H).
Besides these two approaches, NextGenMap also support “silent clipping”. If NextGenMap is set to silent clipping, it outputs the clipped reads to the SAM file and omits the S/H character in the cigar string (e.g. 85M). Although this is not officially part of the SAM format it is sometimes required if down stream analysis programs/scripts don’t support clipping.

Parameters: --hard-clip, --silent-clip

Indexing

NextGenMap uses a memory efficient index structure (hash table) to store the positions of every third 13-mers (defaul) present in the reference genome. With these settings the index for the human genome uses about 3.3 GB of memory. The size of the seed word and the number of bases between two adjacent seed words can be modified with the —kmer and the —kmer-skip parameter. Please note that changing these parameters directly influences the memory consumption and the runtime of NextGenMap. Simulations have shown that indexing every third 13-mer is sufficient for high sensitive mapping. Thus, it is not recommended to change these parameters.

Parameters: --kmer, --kmer-skip

Candidate search

During the candidate search NextGenMap retrieves the genomic positions of all 13-mers of a given read from the index. A small hash-table finds short regions on the genome where the number of matching 13-mers from the read exceeds a threshold t. In contrast, NextGenMap estimates t individually for each read based on the number of 13-mers b that match to the best matching position (the position with the highest number of 13-mer matches) on the genome: t = b * s, 0 <= s <= 1 For other programs (e.g. SHRiMP2) t is fixed for all reads.

All regions on the genome that match more than t 13-mers from the read are considered possible mapping locations (candidates) and will be evaluated using a sequence alignment.

If s is set to 0, NextGenMap will evaluate all regions that match 1 or more 13-mers from the read. If s is set to 1 NextGenMap will only evaluate the regions of the genome that match b 13-mers from the read. Thus, increasing s reduces the mapping sensitivity but also greatly reduces the runtime. In general, values between 0.5 and 0.8 give the best trade-off between speed and sensitivity.

If s is not specified, NextGenMap estimates s from the input data (recommended).

Parameters: --sensitivity

Score and alignment computation

NextGenMap uses a banded alignment algorithm to evaluate possible mapping location. So instead of the full alignment matrix only the corridor surrounding the main diagonal is computed.

Note that the corridor does not limit the number of insertion and deletions within a read. Only the number of consecutive insertions or deletions are limited. For example, a corridor width of 10 allows for 10 consecutive insertions or deletions.

The width of the corridor is computed based on the read length. However, it can also be changed manually by using the —corridor parameter.
In addition, the —mode parameter can be used to switch between the local (default) and the end-free alignment mode.

Alignment computations are done either on the CPU or on the GPU. Please see Documentation#General for more details.

Parameter: --mode, --corridor, --gpu

Paired end mapping

NextGenMap uses paired end information to improve the mapping sensitivity. Currently, NextGenMap only accepts paired-end data in a single file (see link)
To activate paired-end mapping use the -p/--paired parameter or use -1 and -2 to specify the input files. The parameters --min-insert-size and --max-insert size can be used to specify the expected range of insert size.

If NextGenMap is unable to map a pair of reads with the correct orientation and distance both mates will be mapped in single-end mode.

Parameters: --min-insert-size, --max-insert-size, --paired

This documentation is still in an early stage. If you have any problems/questions/comments please don’t hesitate to contact fritz.sedlazeck@univie.ac.at or philipp.rescheneder@univie.ac.at!