Is profile-based design with custom genomes possible? #177

Closed
CassandraHjo opened this issue Dec 4, 2023 · 11 comments

@CassandraHjo

I want to use my own FASTA files as input genomes for the simulation, rather than genomes from NCBI. Is profile-based community design possible with custom genomes? That is, can I provide my own genome sequence collection in the profile-based design, as I can in the de novo design?

@AlphaSquad
Collaborator

Actually this is possible, yes! You will need to use the -ar/--additional-references option. It takes a tab-separated file without a header and with four columns: NCBI_ID, Scientific_name, genome_path, novelty_category. The NCBI ID is required for mapping the scientific name from the profile to your genome; for the novelty category you can just use known_strain.
If you do not want to use genomes from NCBI at all, you will additionally need to use the -ref/--reference-genomes option and point it to an empty file, so that CAMISIM does not use the default reference list in addition to the one you provide. The command could look like this:
./metagenome_from_profile --additional-references /your/reference/file.tsv --reference-genomes /path/to/empty/file.tsv -p /your/profile.biom
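
A minimal sketch of preparing the two input files described above, in Python. The helper names, the taxonomy ID, the scientific name, and the paths are placeholders chosen for illustration (562 is the NCBI taxonomy ID of Escherichia coli); substitute the values that apply to your own genomes.

```python
from pathlib import Path


def reference_line(ncbi_id, scientific_name, genome_path, novelty="known_strain"):
    """Return one headerless, tab-separated line with the four expected columns."""
    # Resolve to an absolute path so the entry does not depend on the working directory.
    absolute = Path(genome_path).resolve()
    return "\t".join([str(ncbi_id), scientific_name, str(absolute), novelty])


def write_reference_files(entries, reference_tsv, empty_tsv):
    """Write the additional-references TSV and an empty reference-genomes file.

    entries: iterable of (ncbi_id, scientific_name, genome_path, novelty) tuples.
    """
    lines = [reference_line(*entry) for entry in entries]
    Path(reference_tsv).write_text("\n".join(lines) + "\n")
    # Empty file to pass to --reference-genomes.
    Path(empty_tsv).write_text("")
```

Calling `write_reference_files([(562, "Escherichia coli", "genomes/MAG0001.fasta", "known_strain")], "file.tsv", "empty.tsv")` would produce the two files that the command above points to.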

@CassandraHjo
Author

I am not interested in gsa or pooled_gsa. Do I still need to find a unique NCBI ID or can I just use 2?

For example in the reference_file.tsv can an entry look like this?
2 MAG0001 genomes/MAG0001.fasta known_strain

@AlphaSquad
Collaborator

You don't need to provide these NCBI IDs then, so yes it could look like this, though just to be safe I'd advise using absolute paths to your genomes.

@CassandraHjo
Author

What does the biom file need to contain if an entry in the reference file looks like the one above?

@AlphaSquad
Collaborator

Actually, looking at the code right now, I was mistaken. For CAMISIM to work, every entry in the reference file needs a "correct" NCBI ID and scientific name. If you choose 2 as your taxonomy ID, CAMISIM assumes that all your input genomes are on the taxonomic level of superkingdom, and the mapping will not work. Additionally, the mapping from the BIOM profile to your genome is performed via the scientific name, so using MAG0001 will not work, as it will not be recognised as a scientific name.
The format of your BIOM profile should be similar to the mini.biom profile provided. The abundances are stored under data, and the taxonomy uses the same format as in mini.biom, i.e. it needs the metadata and taxonomy keywords. QIIME usually produces these files in the correct format already.
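
As a rough sanity check of the keywords mentioned above, a sketch like the following could be used. It assumes the profile is a JSON-format BIOM file in which each entry under rows carries a metadata object with a taxonomy key; that layout and the function name are assumptions, not something verified against mini.biom, and an HDF5-format BIOM file would not load this way.

```python
import json


def check_biom_profile(path):
    """Return a list of problems found in a JSON-format BIOM profile."""
    with open(path) as handle:
        table = json.load(handle)
    problems = []
    # The abundances are expected under the "data" key.
    if "data" not in table:
        problems.append("missing 'data' (abundances)")
    # Assumed layout: each row has a "metadata" object containing "taxonomy".
    for row in table.get("rows", []):
        metadata = row.get("metadata") or {}
        if "taxonomy" not in metadata:
            problems.append(f"row {row.get('id')!r} has no metadata/taxonomy")
    return problems
```

An empty list means the keys were found; it does not confirm that the taxonomy strings themselves will map to scientific names.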

@CassandraHjo
Author

I do not have NCBI IDs for all my custom genomes. Is there another way to make the profile-based design work, or is the de novo design (which I am able to run) the best option?

@AlphaSquad
Collaborator

Since you do not use CAMISIM's option to download genomes, the de novo design might actually be best (and more accurate). To use the abundances from the input profile, you would need to provide them for your genomes via the distribution_file_paths option, as tab-separated files with the genome ID and the abundance from the BIOM file. Note that for the de novo design to work you will still need to provide NCBI taxonomy IDs, but if you do not plan on using the taxonomic profile gold standard, any valid NCBI ID should work (e.g. 2 for Bacteria).

@CassandraHjo
Author

Do I need to change the phase in the config file if I am using the distribution_file_paths option?

@AlphaSquad
Collaborator

No, you should not need to change the phase; CAMISIM will automatically use the files if they are provided. Note that for multiple samples these need to be absolute paths, comma-separated without whitespace:
distribution_file_paths=/path/to/sample1.tsv,/path/to/sample2.tsv
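
A small sketch of building that config line from a list of per-sample files, so that the paths are absolute and joined without whitespace. The function name is hypothetical, and rejecting paths that contain a comma is an added precaution rather than documented CAMISIM behaviour.

```python
from pathlib import Path


def distribution_file_paths_line(paths):
    """Build the distribution_file_paths config line from per-sample TSV paths."""
    absolute = [str(Path(p).resolve()) for p in paths]
    for p in absolute:
        # A comma inside a path would be indistinguishable from the separator.
        if "," in p:
            raise ValueError(f"path contains a comma: {p}")
    return "distribution_file_paths=" + ",".join(absolute)
```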

@CassandraHjo
Author

Should the tsv files include headers?

@AlphaSquad
Collaborator

No, these do not need a header, just genome_ID and abundance, tab-separated.
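
For illustration, a per-sample distribution file in this shape could be generated as follows. The genome IDs and abundance values are made up, and the IDs are assumed to match the genome IDs used elsewhere in your de novo configuration.

```python
def distribution_tsv(abundances):
    """Return headerless 'genome_ID<TAB>abundance' lines as a single string.

    abundances: mapping of genome ID to abundance, written in insertion order.
    """
    return "".join(
        f"{genome_id}\t{abundance}\n" for genome_id, abundance in abundances.items()
    )


if __name__ == "__main__":
    # Hypothetical genome IDs and abundances for one sample.
    print(distribution_tsv({"MAG0001": 0.7, "MAG0002": 0.3}), end="")
```

Writing the returned string to one file per sample gives the files that distribution_file_paths refers to.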
