Skip to content
joyceFunk123 edited this page Dec 11, 2023 · 3 revisions

USER MANUAL

Contents

Introduction

With CAMISIM you can create simulated metagenome data sets out of taxonomic profiles or de novo from a list of genomes. If a taxonomic profile is used as input, the output data set is created from the NCBI complete genomes, reflecting the input profile as closely as possible and will contain the same number of samples as the input profile, if not specified otherwise. If the community is designed de novo, a user-defined number of the complete genomes is used for creating a community which maximizes genome novelty as well as phylogenetic spread.

CAMISIM can generate four types of simulated metagenome samples de novo:

  • Individual simulate metagenome sample.
    Uses a taxonomic profile sampled from a log-normal distribution
  • A time series of simulated metagenome samples.
    Uses a taxonomic profile sampled from a log-normal distribution with Gaussian noise, a normal distribution added to successively generated samples.
  • A set of replicate simulated metagenome samples.
    Uses a taxonomic profile sampled from a log-normal distribution, with Gaussian noise repeatedly added to the original log-normal values.
  • Differential abundance metagenome samples.
    Uses a taxonomic profile sampled from a log-normal distribution.

Output:

The output of CAMISIM is the read files in fastq format, a mapping of these reads to the used genomes in BAM format, "gold standards" for assembly (fasta), binning (CAMI format) and profiling (CAMI format). All files are described here.

Installation:

From Source

Download CAMISIM from the github repository into any directory and make sure the following Dependencies are fullfilled. The metagenome simulation pipeline can be downloaded using git:

git clone https://github.com/CAMI-challenge/CAMISIM

Please make sure that a software is of the specified version, as outdated (or sadly sometimes too new) software can cause errors. The path to the binaries of those tools can be set in the configuration file.

Docker

You can also use Docker image with necessary tools installed:

docker pull cami/camisim:latest

Please see usage section for usage instructions.

Dependencies

Operation system

The pipeline has only been tested on UNIX systems and might fail to run on any other systems (and no support for these will be provided).

Software

The following software is required for the simulation pipeline to run:

Python is the programming language most scripts were written in.

Python package used in the Pipeline for reading/writing sequence files.

Python package for the handling of BIOM files

Python package offering methods for scientific computing.

Python package offering methods for plotting graphs.

Perl 5 and the library XML::Simple

Perl 5 is a programming language, used in the script that generates a perfect assembly from bam files. The perl library XML::Simple can be installed from the unix package 'libxml-simple-perl'

Read simulator which offers error-free and uniform error rates. wgsim is also shipped within CAMISIM and does not have to be installed manually.

Read simulator for the generation of Oxford Nanopore Technologies (ONT) reads. Does not have a random seed in the original version, so cloning and installing the fork by abremges is advised if NanoPore reads should be simulated.

Read simulator for generating Pacific Biosciences (PacBio) reads. Has to be cloned/installed manually if PacBio reads should be simulated.

Microbial ecology software.

Hardware

The simulation will be conducted primarily in the temporary folder declared in the configuration file or system tmp location by default. The results will then be moved to the output directory. Be sure to have enough space at both locations. Required hard drive space and RAM can vary a lot. Very small simulations can be run on a laptop with 4GB RAM, while realistic simulations of several hundred genomes, given a realistic metagenome size, can easily require several hundreds of gigabyte of both RAM and HD space, the chosen metagenome size being the relevant factor.

Resources

A database dump of the NCBI taxonomy is included, current versions can be downloaded from the NCBI FTP-Server.

Genomes

If the community design should be performed de novo, genomes in fasta format to be sampled from are needed. Otherwise they will be downloaded from the NCBI complete genomes.

The de novo community design needs three files to run:

  • A file containing, tab separated, a genome identifier and that path to the file of the genome.

  • A file containing, tab separated, a genome identifier and that path to the gen annotation of genome. This one is uses in case strains are simulated based on a genome

  • A [[meta data file|meta-data-file-format] that contains, tab separated and with header, genome identifier, novelty categorization, otu assignment and a taxonomic classification.

At minimum the following files are required: "nodes.dmp", "merged.dmp", "names.dmp"

USAGE:

from_profile:

>> python metagenome_from_profile.py -p defaults/mini.biom

or de novo:

>> python metagenomesimulation.py defaults/mini_config.ini

To check CAMISIM is working properly, you can perform a test run using the second command above:

It takes about one hour and creates a ~2.4GB folder "out/" in your CAMISIM directory of a small 10 samples, 24 genomes data set. The configuration file as well as the used mapping files genome_to_id and metadata are available in the defaults directory.

Docker

You can also use the camisim docker image

Example call:

docker run -it -v "/path/to/input/directory:/input:rw" -v "/path/to/output/directory:/output:rw"  cami/camisim:latest  metagenome_from_profile.py -p /input/mini.biom -o /output

where

  • /path/to/input/directory contains a biom file.

  • /path/to/output/directory is the output directory.

Note! You can replace the latest tag with any release number listed on the releases page.

CAMISIM proceeds in the following four steps:

Metagenome simulation

Multiple kinds of simulated metagenome data sets can be created, either using an existing 16S rRNA profile or de novo: Single sample, differential abundance or time series data sets with different insert sizes and complexities. All data sets can be simulated as paired-end sequencing from Illumina HiSeq or other machines as well as other technologies (like ONT or PacBio) and error rates. For generation of data sets and their samples, the following steps are undertaken:
After genome data validation (step 1), the community composition is designed according to specified criteria or the given taxonomic profile (step 2) and the metagenome data sets simulated (step 3). Creation of the gold standards (step 4) represents the last part of the pipeline.

Step 1. Data preprocessing and validation

For the de novo metagenome simulation, all sequences shorter than 1000 base pairs are removed from the provided genome assemblies and sequences validated to contain only valid characters. The input genomes can be draft genomes in fasta format and as such contain beside the bases ACGT also ambiguous DNA characters such as 'RYWSMKHBVDN'.

Step 2. Community Design

Genome selection

If the input is a taxonomic profile in CAMI or BIOM format, the taxa appearing in it, are matched to closely related, complete genomes from the NCBI RefSeq or additional reference files with e.g. new genoems, such that the simulated metagenome (genomes and their abundances) reflects the input profile as closely as possible. This is done via the matching of scientific names in the input profile. Given the taxonomic names of the OTUs in the BIOM profile (this is required), a NCBI taxonomy ID is inferred. For all reference genomes, a taxonomy ID is also required. In the next step, the lineages between the OTUs and the genomes are compared and an OTU is mapped to a genome which has the smallest path between OTU taxonomy ID and genome taxonomy ID. Since the shortest path might be really long, a threshold is set to family level: An OTU will not get mapped to a genome, which is not in the same family, i.e. for every mapped genome it is guaranteed that they are at least of the same family. Usually this leaves some OTUs unassigned, either because their scientific name did not yield a NCBI taxonomy ID or because for the taxonomy ID no complete genomes are available. It is possible to assign "random" genomes from the reference collection(s) to the unmapped OTUs to increase complexity via an extra option.

For the de novo community design, a specified number of sequences (either circular elements or genomes) is selected according to certain criteria, see below, from all input sequences (e.g. circular elements or genomes) to generate a specific data set. The random selection of genomes for a particular data set is guided based on their membership in novelty category (Taxonomic annotation step 4.) and OTUs: To increase taxonomic spread, for a specified overall number of genomes to be included in a particular data set, the selection algorithm attempts to draw the same number of genomes from each novelty category (new strain, new species, new genus, new family, new order). If a category does not have enough genomes to thus enable coming up with the specified genome number, more genomes are drawn from the other categories, in equal amounts, if possible. This is repeated until the specified number of genomes has been sampled. To model strain level diversity within a species, for every OTU, all available genomes are included in one particular data set up to five genomes at most. More than five genomes per OTU are only included, if the specified data set size requires more genomes to be added, that are not available otherwise.

To increase strain level diversity in the data sets, a number of additional genomes can be simulated with sgEvolver12, representing related strains to the genomes selected above (see below). To this end, a genome is randomly chosen, and a strain number is drawn from a geometric distribution (p=0.3), which represented the number of related strains to be created for that particular genome. This is repeated, until the specified overall number strains to be generated is reached. For each of these genomes, sgEvolver simulates 40 strains, with an increasing genetic distance (in steps of 0.1%) to the preceding ones, up to a total distance of 4%. From the 40 strains, the specified strain number is randomly selected and added to the genome set.

Creating the community abundances

For data set simulation, genome and circular element abundance distributions are required for every data set. For a single sample data set, the abundance distribution is created by sampling from a log-normal distribution with mu set to 1 and sigma to 2 by default. To reflect a differential abundance experiment, two or more abundance samples can be drawn independently from a log-normal distribution with the same parameter settings. For consecutive samples of a time series, abundances for the next sample are calculated by sampling new values from the log-normal distribution, adding these to the previous value and dividing by two. The absolute abundance of the circular elements can be set to be like 15 times as high as the abundance of the microbial genomes, to replicate high natural plasmid abundances. For each sample, a taxonomic profile for higher ranking taxa is calculated based on the genome abundance values and their taxonomic IDs.

Step 3. Metagenome sample simulation

Based on the genome abundance distribution, the genome sizes and the specified metagenome sample output size, for every genome, the coverage in the simulated samples is calculated. The chosen read simulator then generates reads, using the technology and properties (like read length, insert size or error rates if applicable) as specified by the user, sampling each genome proportionally to the specified abundance in the community. A read simulation of a dataset can be run a second time using a different mean insert size using the same genome abundances as before to replicate the use of a mix of technologies on the same samples. The size of every sample of each dataset can be set to any size.

Outputs of this step are for every sample of a data set, a FASTQ and a BAM file, which specifies the location of every read in the input sequences.

Step 4. Creation of the gold standards and anonymization

A gold standard assembly is made with SAMtools for each sample individually and for all samples of each dataset together (pooled gold standard), which included every sequence position with a coverage of at least one, respectively. Specifically, the SAM file obtained from the read simulation specifying the location of each read on the reference genomes is used with SAMtools for calculation of the per-site coverage. Sequences are then broken into gold standard contig sequences at zero coverage sites. The contig sequences are then labeled with a unique dataset and sample identifying prefix and an increasing index as suffix. In the last step, sequences of the simulated data sets are shuffled using the unix command 'shuf' and anonymized, thus creating the challenge data sets.

File Formats

The following file format definitions are used in data exchange between stages of the CAMI pipeline.

Input

  • BIOM file
    For running the from_profile mode, a BIOM file is required. For details on this format, please consult the link given above. Also make sure that the id column of your OTUs do not contain special characters (for example &, | or ;) and that the taxonomy field is in the "greengenes taxonomy format", e.g. ["g__Escherichia"," s__Escherichia coli"]

  • [[Configuration File|CAMISIM-1.3#Configuration-File-Options]
    This contains the options and values required for the pipeline to run and a path to this file is a required program argument for the de novo simulation and, if not the default config is supposed to be used, also for the from profile mode. The following arguments from the config file expect a file path:

  • id_to_genome_file
    Tab separated data table. It maps genome ids with the file path to genomes. If strain simulation is used, the paths have to be absolute

    • Column 1: Genome id
    • Column 2: file path
  • metadata Tab separated data table It maps genome ids with additional information of their classification

    • Row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category
    • Column 1: Genome id
    • Column 2: Operational taxonomic unit (OTU) - membership in some taxonomic unit
    • Column 3: NCBI taxonomy identifier
    • Column 4: novelty category - if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (new_strain, new_species, new_genus, ...)
      Also see more information on this file here: Genome selection
  • id_to_gff_file
    Optional, tab separated data table. It maps genome ids with the file path to the gene annotation of a genome. Absolute paths are required.

    • Column 1: Genome id
    • Column 2: file path
  • distributions_file_path Path to optional, tab separated data tables. If this option is not blank, no distribution will be drawn, but the abundance values provided within this file are used. It maps genome ids to their abundance in a certain sample, one file for each sample is required. The individual files per sample should be comma-separated

    • Column 1: Genome id
    • Column 2: Abundance (float)

Output

{out}/distributions/distribution_{i}.txt

'{i}' is the index for each sample that is to be generated.

  • Column 1: genome_ID
  • Column 2: abundance

'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. 'abundance' does not reflect the amount of genetic data of a genome, but the amount of genomes.
In a set of two genomes, with both having a abundance of 0.5 but one genome is double the size of the other, the bigger genome will be 66% of the genetic data in the simulated metagenome.

{out}/source_genomes/*.fa

All given genomes will be copied and placed in this folder. Doing this, sequence names are made sure to be unique and renamed if required. Comments and descriptions of sequences are removed. Also index files *.fai will be present after simulation

{out}/source_genomes/sequence_id_map.txt

This file contains a list of replaced sequence ids, may be empty if none where replaced.

  • Column 1: genome_ID
  • Column 2: original sequence id
  • Column 3: new sequence id

{out}/internal/genome_locations.tsv

List of genomes paths to the copies in the output directory in the 'source_genomes' folder.

  • Column 1: genome_ID
  • Column 2: file path

{out}/internal/meta_data.tsv

Merged meta data of genomes of each community that are actually used for the simulation.

{out}/internal/unused_c{i}_{original_name}.tsv

Unused meta data of genomes of every community.

{out}/sample_{i}/bam/{genome_id}.bam

bam files generated based on reads generated from the read simulator

{out}/sample_{i}/reads/*.fq

If no anonymization is not done in which case the original fastq files will be here.

{out}/sample_{i}/reads/anonymous_reads.fq

If anonymization is done, this will be the only fastq file.

{out}/sample_{i}/reads/reads_mapping.tsv

Mapping of reads for evaluation

  • Column 1: anonymous read id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: read id

{out}/sample_{i}/contigs/anonymous_gsa.fasta

Fasta file with perfect assembly of reads of this sample

{out}/sample_{i}/contigs/gsa_mapping.tsv

Mapping of contigs for evaluation

  • Column 1: anonymous contig id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

{out}/anonymous_gsa_pooled.fasta

Fasta file with perfect assembly of reads from all samples

{out}/gsa_pooled_mapping.tsv

Mapping of contigs from pooled reads for evaluation.

  • Column 1: anonymous_contig_id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

{out}/taxonomic_profile_{i}.txt

Taxonomic profile for each sample

Configuration File Options

These is a list of all configuration options that are required for CAMISIM to run. Optional arguments are written in italic. The sections, denoted by the words in [brackets] are also required.

[Main]
seed=

  • An optional seed to get consistent results, if None is used, random seed is chosen

phase=

  • Starting point of the simulation
    0: Full run
    1: Only community design
    2: Start with read simulation

max_processors=

  • Maximum number of processors used

dataset_id=

  • Name of the created sample

output_directory=

  • Output directory
  • Will be created if it does not exist.
  • Make sure directory is empty before running CAMISIM

temp_directory=

  • Path for storing temporary files of CAMISIM
  • Make sure enough space is available

gsa=

  • Whether a gold standard assembly should be created
  • May be True/False or Yes/No

pooled_gsa=

  • Whether a pooled gold standard over all samples is created
  • True/False/Yes/No

anonymous=

  • Whether the output is anonymized
  • Also True/False/Yes/No
  • Can be computationally expensive

compress=

  • Whether the data should be compressed or not
  • 0 is for no comrepssion, 9 is maximum comporession
  • recommended is using compression level 1 at least

[ReadSimulator]
readsim=tools/art_illumina-2.3.6/art_illumina

  • path to the executable of the chosen read simulator
  • ART is shipped within CAMISIM at the above relative path

error_profiles=tools/art_illumina-2.3.6/profiles

  • Folder containing error profiles for the read simulators
  • Might be emoty for some simulators (e.g. NanoSim)

samtools=tools/samtools-1.3/samtools

  • Path to samtools executable
  • Version 1.3 (recommended) is shipped within CAMISIM

profile=mbarc

  • used read simulator error profile
  • mbarc is recommended for bacterial communities
  • choose for ART: mi/hi/hi150/mbarc
  • ignored by other read simulators

size=

  • size of a single sample in Gigabasepairs (Gbp)
  • actual size, including mapping files, might be larger

type=

  • type of the used read simulator
  • choose from art/wgsim/nanosim/pbsim

fragments_size_mean=

  • mean size of the fragments to be created
  • will be ignored by NanoSim

fragment_size_standard_deviation=

  • standard deviation of fragments to be created
  • will be ignored by NanoSim

[CommunityDesign]
distribution_file_paths=

  • optional path to input abundance files
  • files are tsv files of the format genomeTababundance
  • One per sample to be created

ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz

  • Taxonomy dump form the NCBI
  • Working version can be found at the relative path above

strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir

  • Path to a template.tree for the sgEvolver from the mauve suite
  • Example tree is shipped along the sgEvolver itself within CAMISIM

number_of_samples=

  • Number of samples to be created

[community0]
metadata=

  • Path of the input metadata tsv file
  • file has to be in the format: genome_IDTabOTUTabNCBI_IDTabnovelty_category
  • And this line as a header

id_to_genome_file=

  • Path to the input genome_to_id tsv file
  • format is: genome_IDTabgenome path
  • No header

id_to_gff_file=

  • Optional file used by the sgEvolver, mapping togene annotations of the input genomes

genomes_total=

  • Total number of simulated genomes
  • Difference between genomes_total and genomes_real are simulated by sgEvolver

genomes_real=

  • Number of genomes used from the input genomes

max_strains_per_otu=

  • Maximum number of strains drawn from genomes belonging to a single OTU
  • OTU is taken from the metadata file

ratio=

  • ratio between different communities
  • default: 1, i.e. if only one community is present

mode=

  • mode for changing the abundances in different samples
  • one of replicates/timeseries_lognormal/timeseries_normal/differential

log_mu=

  • Mean of the used log-normal distribution
  • 1 is an empirically good mean

log_sigma=

  • standard deviation of the used log-normal distribution
  • 2 is an empirically good sd

gauss_mu=

  • mean of the used normal distribution

gauss_sigma=

  • standard deviation of the used normal distribution

view=

  • Show the used distribution of genomes before simulating
  • default: False