MetaMaps is tool specifically developed for the analysis of long-read (PacBio/Oxford Nanopore) metagenomic datasets.
It simultaenously carries out read assignment and sample composition estimation.
It is faster than classical exact alignment-based approaches, and its output is more information-rich than that of kmer-spectra-based methods. For example, each MetaMaps alignment comes with an approximate alignment location, an estimated alignment identity and a mapping quality.
The approximate mapping algorithm employed by MetaMaps is based on MashMap. MetaMaps adds a mapping quality model and EM-based estimation of sample composition.
(28 August 2018) We're adding more flexible tools to construct your own databases - see here for details. Feedback welcome!
INSTALL.txt to compile and install MetaMaps.
Then download a database, e.g. miniSeq+H (~8G compressed, microbial genomes and the human reference genome). Extract the downloaded database into the
Analysis of a dataset with MetaMaps consists of two steps: mapping and classification:
./metamaps mapDirectly --all -r databases/miniSeq+H/DB.fa -q input.fastq -o classification_results ./metamaps classify --mappings classification_results --DB databases/miniSeq+H
You can use the '--maxmemory' parameter to specify a target for maximum memory consumption (in gigabytes). Note that this feature is implemented heuristically; actual memory usage will fluctuate and execeed the target. We recommend using around 70% of the available memory as a target amount (for example, 20G when you have a 32G machine).
./metamaps mapDirectly --all -r databases/miniSeq+H/DB.fa -q input.fastq -o classification_results --maxmemory 20 ./metamaps classify --mappings classification_results --DB databases/miniSeq+H
You can use the parameter
-t to speed up mapping and classification.
./metamaps mapDirectly -t 5 --all -r databases/miniSeq+H/DB.fa -q input.fastq -o classification_results ./metamaps classify -t 5 --mappings classification_results --DB databases/miniSeq+H
Important: if you encounter problems with multithreading efficiency, try
unset MALLOC_ARENA_MAX immediately before calling MetaMaps.
MetaMaps outputs both an overall compositional assignment and per-read taxonomic assignments. Specifically, it will (for
-o classification_results) produce the following files:
classification_results.EM.WIMP: Sample composition at different taxonomic levels (WIMP = "What's in my pot"). The level "definedGenomes" represents strain-level resolution (i.e., the defined genomes in the classification database). The EM algorithm is carried out at this level.
Absolutespecifies the number of reads assigned (by their maximum likelihood mapping estimate) to the taxonomic entity;
EMFrequencyspecifies the estimated frequency of the taxonomic entity prior to taking into account unmapped reads;
PotFrequencyspecifies the estimated final frequency of the taxonomic entity (i.e. after correcting for unmapped reads).
classification_results.EM.reads2Taxon: One line per read, and each line consists of the read ID and the taxon ID of the genome that the read was assigned to. Taxon IDs beginning with an 'x' represent MetaMaps-internal taxon IDs that disambiguate between source genomes attached to the same 'species' node. These can be interpreted using the extended database taxonomy (sub-directory
taxonomyin the directory of the utilized database).
classification_results.EM.reads2Taxon, but in Krona format. Each line has an additional quality value, and only taxon IDs from the standard NCBI taxonomy are used.
classification_results.EM.contigCoverage: Read coverage for contigs that appear in the final set of maximum-likelihood mappings. Contigs are split into windows of 1000 base pairs. Each line represents one window and specifies the MetaMaps taxonID of the contig (
taxonID), a species/strain label (
equalCoverageUnitLabel), the ID of the contig in the classification database FASTA file (
contigID), start and stop coordinates of the window (
stop), the number of read bases overlapping the window (
nBases), and the number of read bases overlapping the window divided by window length, i.e. coverage (
classification_results.EM.lengthAndIdentitiesPerMappingUnit: Read length and estimated identity for all reads, sorted by the contig they are mapped to. Each line represents one read and has the fields
AnalysisLevel, which is always equal to
ID, which is the ID of the contig in the classification database FASTA file;
readI, which is a simple numerical read ID;
Identity, which is the estimated alignment identity; and
Length, specifying the length of the read.
classification_results.EM: The final and complete set of approximate read mappings. Each line represents one mapping in a simple space-delimited format. Fields: read ID, read length, beginning of the mapping in the read, end of the mapping in the read, strand, contig ID, contig length, beginning of the mapping in the contig, end of the mapping in the contig, estimated alignment identity using a Poisson model, MinHash intersection size, MinHash union size, estimated alignment identity using a binomial approximation, mapping quality. The mapping qualities of all mappings for one read sum up to 1.
classification_results.EM.evidenceUnknownSpecies: Various statistics on read identities and zero-coverage regions of identified genomes. These are not particularly useful in their current state; visual inspection of coverage and identity patterns should take precedence.
You can download example output files.
The 'miniSeq+H' database is a good place to start. It contains >12000 microbial genomes and the human reference genome. We provide miniSeq+H as a download (see above for the link).
You can also download and construct your own reference databases. For example, this is how to construct the miniSeq+H database:
Download the genomes you want to include. The easiest way to do this is by copying the RefSeq/Genbank directory structure of the taxonomic branches you're interested in. This can be done with the
downloadRefSeq.plscript, which is easily customizable (e.g.,
--targetBranches archaea,bacteria,fungito download these three branches). Example:
mkdir download perl downloadRefSeq.pl --seqencesOutDirectory download/refseq --taxonomyOutDirectory download/taxonomy
We need to make sure that each contig ID is annotated with a correct and unique taxon ID and we want the whole database as one file.
perl annotateRefSeqSequencesWithUniqueTaxonIDs.pl --refSeqDirectory download/refseq --taxonomyInDirectory download/taxonomy --taxonomyOutDirectory download/taxonomy_uniqueIDs
By default this script will only process complete (finished) assemblies - if you want to modify this behaviour, uncomment the line
next unless($assembly_level eq 'Complete Genome');.
We might also manually want to include additional genomes, for example the human reference genome. Obtain the genome in one file (e.g.
hg38.primary.fna) and add taxon IDs:
perl util/addTaxonIDToFasta.pl --inputFA hg38.primary.fna --outputFA hg38.primary.fna.with9606 --taxonID 9606
Finally, construct the MetaMaps databasen (here
perl buildDB.pl --DB databases/myDB --FASTAs download/refseq,hg38.primary.fna.with9606 --taxonomy download/taxonomy_uniqueIDs
The NCBI taxonomy changes on a regular basis, and you might not want to repeat the complete database construction process every time that happens. You can update the utilized taxonomy as part of
buildDB.pl, by specifying the "old" taxonomy (used for
--updateTaxonomy 1, and the path to a download of the new taxonomy (e.g. ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz). Example:
perl buildDB.pl --DB databases/myDB --FASTAs download/refseq/ref.fa,hg38.primary.fna.with9606 --taxonomy download/new_taxonomy --oldTaxonomy download/taxonomy_uniqueIDs --updateTaxonomy 1
Plotting spatial genome coverage and alignment identities
MetaMaps comes with a small R script (
plotIdentities_EM.R) that helps visalize spatial genome coverage and alignment identity per genome. You can assess these metrics to identify mismatches between the sample and the database.
Parameters: the classification prefix (i.e. whatever input your provided to
metamaps --classify) of the output.
Rscript plotIdentities_EM.R classification_results
Filtering out WIMP entries with low median identity
If you suspect that your sample contains many genomes not represented in the database (one way to adjudicate this is to examine the identity histograms of the maximum likelihood alignments, e.g. with
plotIdentities_EM.R), the produced WIMP files may contain many false-positive hits.
You can filter your WIMP output with the script
--DB: Database name.
--mappings: Path to main MetaMaps mappings file.
--identityThreshold:: Median identity threshold (between 0 and 1). Default: 0.8, but check the identity histograms to make sure you use a sensible value.
perl filterLowIdentityEntities.pl --DB miniSeq+H --mappings classification_results --identityThreshold 0.8
COG group analysis
You can use MetaMaps to carry out a COG group analysis of your metagenomic sample. Whenever a read overlaps with an annotated gene location, it is counted towards the COG groups (and other annotations) associated with the corresponding gene. Note that a single read can overlap with multiple genes, and that a single gene can be associated with multiple COG groups (or other annotations). Gene locations and amino acid sequences come from GenBank/Refseq, and the annotations are produced with eggnog-mapper.
Download an annotated database.
To carry out this type of analysis, you need an "annotated" database that contains, in addition to the reference FASTA files, the locations and amino acid sequences of encoded genes, as well as a (gene) -> (annotation) mapping file. The refseq_with_annotations database (18.4 GB compressed) is a good place to start. You can untar this file from the main MetaMaps directory. Afterwards, you should see a
Carry out mapping and classification.
Mapping and classification work as before - just make sure to map against the annotated database.
./metamaps mapDirectly --all -r databases/refseq_with_annotations/DB.fa -q input.fastq -o classification_results ./metamaps classify --mappings classification_results --DB databases/refseq_with_annotations
Carry out the gene- and annotation-level analysis.
A gene- and annotation-level analysis is carried out with the script
perl geneLevelAnalysis.pl --DB databases/refseq_with_annotations --mappings classification_results
This command will produce the following files:
classification_results.EM.geneLevelAnalysis: This file contains the names and (for some genes) protein IDs of genes hit by overlapping reads from the input dataset. It also contains the number of overlapping reads and their median identity.
classification_results.EM.geneLevelAnalysis, but agglomerated according to eggnog-mapper-produced gene annotations. For example,
classification_results.EM.proteins.COGwill contain a COG-level analysis of the input data. Note that features are not mutually exclusive, i.e. a single read can overlap with multiple genes, and a single gene can carry multiple annotations.