Skip to content

User manual

newbie edited this page Jan 14, 2019 · 40 revisions

Contents

Introduction

With CAMISIM you can create simulated metagenome data sets out of taxonomic profiles or de novo from a list of genomes. If a taxonomic profile is used as input, the output data set is created from the NCBI complete genomes, reflecting the input profile as closely as possible and will contain the same number of samples as the input profile, if not specified otherwise. If the community is desgined de novo, a user-defined number of the complete genomes is used for creating a community which maximizes genome novelty as well as phylogenetic spread.

CAMISIM can generate four types of simulated metagenome samples de novo:

  • Individual simulate metagenome sample.
    Uses a taxonomic profile sampled from a log-normal distribution
  • A time series of simulated metagenome samples.
    Uses a taxonomic profile sampled from a log-normal distribution with Gaussian noise, a normal distribution added to successively generated samples.
  • A set of replicate simulated metagenome samples.
    Uses a taxonomic profile sampled from a log-normal distribution, with Gaussian noise repeatedly added to the original log-normal values.
  • Differential abundance metagenome samples.
    Uses a taxonomic profile sampled from a log-normal distribution.

Output:

The output of CAMISIM is the read files in fastq format, a mapping of these reads to the used genomes in BAM format, "gold standards" for assembly (fasta), binning (CAMI format) and profiling (CAMI format). All files are described here.

Installation:

From Source

Download CAMISIM from the github repository into any directory and make sure the following Dependencies are fullfilled. The metagenome simulation pipeline can be downloaded using git:

git clone https://github.com/CAMI-challenge/CAMISIM

Please make sure that a software is of the specified version, as outdated (or sadly sometimes too new) software can cause errors. The path to the binaries of those tools can be set in the configuration file.

Docker

You can also use Docker image with necessary tools installed:

docker pull cami/camisim:latest

Please see usage section for usage instructions.

Dependencies

Operation system

The pipeline has only been tested on UNIX systems and might fail to run on any other systems (and no support for these will be provided).

Software

The following software is required for the simulation pipeline to run:

python 2.7.10

Python is the programming language most scripts were written in.

Biopython

Python package used in the Pipeline for reading/writing sequence files.

BIOM

Python package for the handling of BIOM files

NumPy

Python package offering methods for scientific computing.

Matplotlib

Python package offering methods for plotting graphs.

Perl 5 and the library XML::Simple

Perl 5 is a programming language, used in the script that generates a perfect assembly from bam files. The perl library XML::Simple can be installed from the unix package 'libxml-simple-perl'

wgsim

Read simulator which offers error-free and uniform error rates. wgsim is also shipped within CAMISIM and does not have to be installed manually.

NanoSim

Read simulator for the generation of Oxford Nanopore Technologies (ONT) reads. Does not have a random seed in the original version, so cloning and installing the fork by abremges is advised if NanoPore reads should be simulated.

PBsim

Read simulator for generating Pacific Biosciences (PacBio) reads. Has to be cloned/installed manually if PacBio reads should be simulated.

SAMtools 1.0

Microbial ecology software.

Hardware

The simulation will be conducted primarily in the temporary folder declared in the configuration file or system tmp location by default. The results will then be moved to the output directory. Be sure to have enough space at both locations. Required hard drive space and RAM can vary a lot. Very small simulations can be run on a laptop with 4GB RAM, while realistic simulations of several hundred genomes, given a realistic metagenome size, can easily require several hundreds of gigabyte of both RAM and HD space, the chosen metagenome size being the relevant factor.

Resources

A database dump of the NCBI taxonomy is included, current versions can be downloaded from the NCBI FTP-Server.

Genomes

If the community design should be performed de novo, genomes in fasta format to be sampled from are needed. Otherwise they will be downloaded from the NCBI complete genomes.

The de novo community design needs three files to run:

  • A file containing, tab separated, a genome identifier and that path to the file of the genome.

  • A file containing, tab separated, a genome identifier and that path to the gen annotation of genome. This one is uses in case strains are simulated based on a genome

  • A [[meta data file|meta-data-file-format] that contains, tab separated and with header, genome identifier, novelty categorization, otu assignment and a taxonomic classification.

NCBI taxdump

At minimum the following files are required: "nodes.dmp", "merged.dmp", "names.dmp"

USAGE:

>> python metagenome_from_profile.py -p path/to/16Sprofile

>> python metagenomesimulation.py configuration/metagenome_simulation

>> python genomeannotation.py configuration/genome_annotation

Docker

You can also use the camisim docker image

Example call:

docker run -it -v "/path/to/input/directory:/input:rw" -v "/path/to/output/directory:/output:rw"  cami/camisim:latest  metagenome_from_profile.py -p /input/mini.biom -o /output

where

  • /path/to/input/directory contains a biom file.

  • /path/to/output/directory is the output directory.

Note! You can replace the latest tag with any release number listed on the releases page.

CAMISIM proceeds in the following four steps:

Metagenome simulation

Multiple kinds of simulated metagenome data sets can be created, either using an existing 16S rRNA profile or de novo: Single sample, differential abundance or time series data sets with different insert sizes and complexities. All data sets can be simulated as paired-end sequencing from Illumina HiSeq or other machines as well as other technologies (like ONT or PacBio) and error rates. For generation of data sets and their samples, the following steps are undertaken:
After genome data validation (step 1), the community composition is designed according to specified criteria or the given taxonomic profile (step 2) and the metagenome data sets simulated (step 3). Creation of the gold standards (step 4) represents the last part of the pipeline.

Step 1. Data preprocessing and validation

For the de novo metagenome simulation, all sequences shorter than 1000 base pairs are removed from the provided genome assemblies and sequences validated to contain only valid characters. The input genomes can be draft genomes in fasta format and as such contain beside the bases ACGT also ambiguous DNA characters such as 'RYWSMKHBVDN'.

Step 2. Community Design

Genome selection

If the input is a taxonomic profile in CAMI or BIOM format, the taxa appearing in it, are matched to closely related, complete genomes from the NCBI RefSeq or additional reference files with e.g. new genoems, such that the simulated metagenome (genomes and their abundances) reflects the input profile as closely as possible. This is done via the matching of scientific names in the input profile. Given the taxonomic names of the OTUs in the BIOM profile (this is required), a NCBI taxonomy ID is inferred. For all reference genomes, a taxonomy ID is also required. In the next step, the lineages between the OTUs and the genomes are compared and an OTU is mapped to a genome which has the smallest path between OTU taxonomy ID and genome taxonomy ID. Since the shortest path might be really long, a threshold is set to family level: An OTU will not get mapped to a genome, which is not in the same family, i.e. for every mapped genome it is guaranteed that they are at least of the same family. Usually this leaves some OTUs unassigned, either because their scientific name did not yield a NCBI taxonomy ID or because for the taxonomy ID no complete genomes are available. It is possible to assign "random" genomes from the reference collection(s) to the unmapped OTUs to increase complexity via an extra option.

For the de novo community design, a specified number of sequences (either circular elements or genomes) is selected according to certain criteria, see below, from all input sequences (e.g. circular elements or genomes) to generate a specific data set. The random selection of genomes for a particular data set is guided based on their membership in novelty category (Taxonomic annotation step 4.) and OTUs: To increase taxonomic spread, for a specified overall number of genomes to be included in a particular data set, the selection algorithm attempts to draw the same number of genomes from each novelty category (new strain, new species, new genus, new family, new order). If a category does not have enough genomes to thus enable coming up with the specified genome number, more genomes are drawn from the other categories, in equal amounts, if possible. This is repeated until the specified number of genomes has been sampled. To model strain level diversity within a species, for every OTU, all available genomes are included in one particular data set up to five genomes at most. More than five genomes per OTU are only included, if the specified data set size requires more genomes to be added, that are not available otherwise.

To increase strain level diversity in the data sets, a number of additional genomes can be simulated with sgEvolver12, representing related strains to the genomes selected above (see below). To this end, a genome is randomly chosen, and a strain number is drawn from a geometric distribution (p=0.3), which represented the number of related strains to be created for that particular genome. This is repeated, until the specified overall number strains to be generated is reached. For each of these genomes, sgEvolver simulates 40 strains, with an increasing genetic distance (in steps of 0.1%) to the preceding ones, up to a total distance of 4%. From the 40 strains, the specified strain number is randomly selected and added to the genome set.

Creating the community abundances

For data set simulation, genome and circular element abundance distributions are required for every data set. For a single sample data set, the abundance distribution is created by sampling from a log-normal distribution with mu set to 1 and sigma to 2 by default. To reflect a differential abundance experiment, two or more abundance samples can be drawn independently from a log-normal distribution with the same parameter settings. For consecutive samples of a time series, abundances for the next sample are calculated by sampling new values from the log-normal distribution, adding these to the previous value and dividing by two. The absolute abundance of the circular elements can be set to be like 15 times as high as the abundance of the microbial genomes, to replicate high natural plasmid abundances. For each sample, a taxonomic profile for higher ranking taxa is calculated based on the genome abundance values and their taxonomic IDs.

Step 3. Metagenome sample simulation

Based on the genome abundance distribution, the genome sizes and the specified metagenome sample output size, for every genome, the coverage in the simulated samples is calculated. The chosen read simulator then generates reads, using the technology and properties (like read length, insert size or error rates if applicable) as specified by the user, sampling each genome proportionally to the specified abundance in the community. A read simulation of a dataset can be run a second time using a different mean insert size using the same genome abundances as before to replicate the use of a mix of technologies on the same samples. The size of every sample of each dataset can be set to any size.

Outputs of this step are for every sample of a data set, a FASTQ and a BAM file, which specifies the location of every read in the input sequences.

Step 4. Creation of the gold standards and anonymization

A gold standard assembly is made with SAMtools for each sample individually and for all samples of each dataset together (pooled gold standard), which included every sequence position with a coverage of at least one, respectively. Specifically, the SAM file obtained from the read simulation specifying the location of each read on the reference genomes is used with SAMtools for calculation of the per-site coverage. Sequences are then broken into gold standard contig sequences at zero coverage sites. The contig sequences are then labeled with a unique dataset and sample identifying prefix and an increasing index as suffix. In the last step, sequences of the simulated data sets are shuffled using the unix command 'shuf' and anonymized, thus creating the challenge data sets.

#Output

{out}/distributions/distribution_{i}.txt

'{i}' is the index for each sample that is to be generated.

  • Column 1: genome_ID
  • Column 2: abundance

'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. 'abundance' does not reflect the amount of genetic data of a genome, but the amount of genomes.
In a set of two genomes, with both having a abundance of 0.5 but one genome is double the size of the other, the bigger genome will be 66% of the genetic data in the simulated metagenome.

{out}/source_genomes/*.fna

All given genomes will be copied and placed in this folder. Doing this, sequence names are made sure to be unique and renamed if required. Comments and descriptions of sequences are removed.

{out}/source_genomes/sequence_id_map.txt

This file contains a list of replaced sequence ids.

  • Column 1: genome_ID
  • Column 2: original sequence id
  • Column 3: new sequence id

{out}/internal/genome_locations.tsv

List of genomes paths to the copies in the output directory in the 'source_genomes' folder.

  • Column 1: genome_ID
  • Column 2: file path

{out}/internal/meta_data.tsv

Merged meta data of genomes of each community that are actually used for the simulation.

{out}/internal/unused_c{i}_{original_name}.tsv

Unused meta data of genomes of every community.

{out}/sample_{i}/bam/{genome_id}.bam

bam files generated based on reads generated from the read simulator

{out}/sample_{i}/fastq/*.fq

If no anonymization is not done in which case the original fastq files will be here.

{out}/sample_{i}/fastq/anonymous_reads.fq

If anonymization is done, this will be the only fastq file.

{out}/sample_{i}/fastq/reads_mapping.tsv

Mapping of reads for evaluation

  • Column 1: anonymous read id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: read id

{out}/sample_{i}/anonymous_gsa.fasta

Fasta file with perfect assembly of reads of this sample

{out}/sample_{i}/gsa_mapping.tsv

Mapping of contigs for evaluation

  • Column 1: anonymous contig id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

{out}/anonymous_gsa_pooled.fasta

Fasta file with perfect assembly of reads from all samples

{out}/gsa_pooled_mapping.tsv

Mapping of contigs from pooled reads for evaluation.

  • Column 1: anonymous_contig_id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

{out}/taxonomic_profile_{i}.txt

Taxonomic profile for each sample

{out}/taxonomic_profile_{i}.txt

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.