Can I use CAMISIM to simulate a sample with multiple strains of the same species? #86

Closed
NienkeMekkes opened this issue Sep 8, 2020 · 4 comments

Comments

@NienkeMekkes

Hello,

I am working on a classifier that is capable of giving a taxonomic, strain-level ID to WGS reads. I know that CAMISIM can be used to generate simulated strains based on 1 genome, but can I also use CAMISIM to simulate a sample consisting of different, but real(!), strains?

I have a number of genomes (fasta) of multiple strains belonging to different species. These strains are already present in the NCBI RefSeq database. Imagine 4 strains for species A, 10 strains for species B, 2 strains for species C, etc. I also know the taxonomy ID of these strains. I'd like to generate simulated reads based on this community, to see if my classifier can detect those individual strains correctly.

If the answer is yes, could you give me some guidance on how best to go about this? For instance, would you advise generating a .biom profile, or shall I use the de novo option? Would you advise setting genomes_total equal to genomes_real in this case?

Regards,
Nienke

@AlphaSquad
Collaborator

Hi,
this is absolutely possible; I would even go as far as saying that this is one of the main ideas we had in mind when developing CAMISIM.
If you have the genomes but no BIOM profile, I would recommend the de novo mode. Then you can either let CAMISIM choose the abundances based on a log-normal distribution or, if you want to fine-tune the individual abundances of the genomes, provide all the abundances yourself (using the distribution_file_paths option). As you described, if you have all the genomes you want to simulate reads from, then setting genomes_total = genomes_real is the way to go.
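
For the second route (supplying the abundances yourself), a minimal sketch of generating such a file could look like the following. It assumes the file passed via distribution_file_paths holds one `genome_id<TAB>abundance` pair per line; that layout, the helper name `write_abundances`, and the genome IDs are assumptions for illustration, so check them against a distribution_*.txt file produced by your own CAMISIM run.

```python
import csv
import io


def write_abundances(abundances, handle):
    """Write one 'genome_id<TAB>abundance' line per genome (assumed layout)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for genome_id, abundance in abundances.items():
        if abundance <= 0:
            raise ValueError(f"abundance for {genome_id} must be positive")
        writer.writerow([genome_id, abundance])


# Hypothetical genome IDs; they would need to match the IDs used in the
# genome-to-file mapping given to CAMISIM.
abundances = {
    "speciesA_strain1": 4.0,
    "speciesA_strain2": 1.0,
    "speciesB_strain1": 0.5,
}

buffer = io.StringIO()
write_abundances(abundances, buffer)
distribution_text = buffer.getvalue()
print(distribution_text, end="")
```

To write to disk instead, pass an open file handle (for example `open("my_distribution.txt", "w", newline="")`, a hypothetical file name) in place of the `StringIO` buffer.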

@NienkeMekkes
Author

Perfect!
Thanks for your explanation. I have just run CAMISIM on some closely related strains without errors. I do have a question about the output files that I didn't understand:

A distribution is generated after running the simulation with default parameters (distribution_*.txt). For instance in my case, one of my genomes ("Genome_4") has an abundance value of 0.50, meaning 50% of the genetic data in the simulated metagenome originates from this genome (if all genomes are of equal size at least).

In addition, there is also a reads_mapping.tsv file. In this file I only see 1 read with genome_id Genome_4, and many other reads for the other genomes in my sample. Why is this? Should the number of reads in the reads_mapping.tsv file reflect the distribution in the distribution_*.txt file, or not? I think I might be mixing some things up.

Clarification:
My desired output would be a big set of closely related reads of which the taxonomic ID is known, so I can see how well my taxonomic read classifier works.

Regards,
Nienke

@AlphaSquad
Collaborator

The distribution files are not percentage-based; the values are only meaningful relative to each other. I would assume that if Genome_4 has an abundance of 0.5, the other genomes have a much higher abundance. Apart from that, every read appears in reads_mapping.tsv and should be uniquely mapped to a genome.
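
To see what a relative value of 0.5 means in practice, the abundances can be normalised by their sum and compared with the per-genome read share in reads_mapping.tsv. The sketch below uses made-up numbers and assumes the genome ID sits in the second tab-separated column of the mapping file; both are assumptions, not taken from this thread. Read share can also depend on genome size, as noted earlier, so an exact match should not be expected.

```python
from collections import Counter


def normalize(abundances):
    """Turn relative abundances into fractions that sum to 1."""
    total = sum(abundances.values())
    return {genome: value / total for genome, value in abundances.items()}


def read_fractions(mapping_lines, genome_column=1):
    """Fraction of reads per genome in a tab-separated mapping file.

    genome_column is the assumed 0-based index of the genome ID column.
    """
    counts = Counter()
    for line in mapping_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        counts[fields[genome_column]] += 1
    total = sum(counts.values())
    return {genome: n / total for genome, n in counts.items()}


# Made-up relative abundances: Genome_4 has 0.5, the others are much higher.
relative = {"Genome_1": 20.0, "Genome_2": 29.5, "Genome_4": 0.5}
shares = normalize(relative)  # Genome_4 is 0.5 / 50, i.e. 1% of the total

# Made-up mapping lines in the assumed layout.
sample_lines = [
    "#read_id\tgenome_id\n",
    "r1\tGenome_1\n",
    "r2\tGenome_1\n",
    "r3\tGenome_2\n",
    "r4\tGenome_4\n",
]
observed = read_fractions(sample_lines)
```

With real output, `read_fractions(open(path))` could be run on the actual reads_mapping.tsv, where `path` is wherever your run wrote that file.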

@NienkeMekkes
Author

Perfect, many thanks
