Guidance on abundance.tsv file and output #109

NienkeMekkes · 2021-04-23T08:51:40Z

Dear authors,

I remember asking this question before, but when looking back, I still am having some last doubts I was hoping you could help me with.

This is a summary of organism, abundance.tsv, and genome size
Salmonella bongori | 0.10 | 4487548
Salmonella enterica | 0.4 | 5028552
escherichia coli | 0.5 | 4643559

When I use CAMISIM to simulate my sample, I use the readmappings.tsv output file with grep -c to check my abundance.
In this output file, I have exactly 10% of reads belonging o S bongori, 40% to S enterica, and 50 to e.coli

Is this expected behaviour? I read that the simulation is supposed to take genome length into consideration, so I expected that the amount of reads belonging to S.enterica would be inflated, since the genome is bigger. Why is this not the case?

AlphaSquad · 2021-04-23T10:10:40Z

Hm, that is strange, are you sure that these values fit exactly?
The difference between S. bongori and S. enterica in genome size is only ~10%, so instead of 10% and 40% I would expect CAMISIM to simulated something like 9% and 41% of reads for both:
If the coverage is ratio 4:1 between the two genomes and the genome size ratio is 1.1:1, then we expect a read ratio of 4.4:1 in reads between the two - which translates to ~40.75%:9.25% (here neglecting the third genome which is not exactly halfway between the two Salmonella)

NienkeMekkes · 2021-04-23T10:24:10Z

Hi and thank you for your quick reply. Your calculations are also what I expected to happen.

When I use grep -c on my readmappings.tsv file, I find this file has a total of 1333338 lines, with the counts of lines and percentage:
bongori | 133334 | 0.10000015
enterica | 666666 | 0.4000021
e.coli | 533338 | 0.49999775

So while not exact, my readmappings file does fit closer to the abundance I put in the abundance.tsv file, then what I would expact from making these caclulations myself. Is there a better way to check the fraction of reads simulated besides grep -c on the readmappings.tsv?

AlphaSquad · 2021-04-23T11:08:32Z

I think grep -c on the readmappings file should be fine. I tried the same on a few of my datasets and the results were as I expected it (i.e. the larger genomes having far more reads than the smaller ones, even with lower abundance).
Which read simulator did you use?

NienkeMekkes · 2021-04-23T11:10:40Z

wgsim, so error free simulation. Could this cause the behaviour?

AlphaSquad · 2021-04-23T11:25:09Z

I will have to look into it, but I have the strong suspicion that this is a bug in CAMISIM when using wgsim. Could you run the same data set using ART and look if the distribution of reads is as you (and I 😉 ) would expect?

NienkeMekkes · 2021-04-23T11:26:08Z

sure!

NienkeMekkes · 2021-04-23T12:27:22Z

Hi again,

I ran the simulation again with:

readsim=tools/art_illumina-2.3.6/art_illumina
error_profiles=tools/art_illumina-2.3.6/profiles
samtools=tools/samtools-1.3/samtools
profile=mbarc
size=0.1
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

This resulted in:

	grep -c %	grep -c count	expected
S bongori	0.094	62884	~0.09
E coli	0.490	326554	~0.49
S enterica	0.416	277222	~0.42
		666660

The expected row is what I (and you :) )expected my distribution to look like. It is a very good match when using art!
For error free simulation, I think I can still use my simulated reads (my only goal is to use them together with kraken/bracken, and check if I can find the abundance simulated with bracken), since the simulated numbers are what I expected.

AlphaSquad · 2021-04-23T13:06:36Z

Okay, that is good to hear. It is still a bug/inconsistency for wgsim (and likely for Nanosim as well) which I will fix.
If you can use your data regardless that is great.

AlphaSquad self-assigned this Apr 23, 2021

AlphaSquad added the bug label Apr 23, 2021

AlphaSquad added this to the v2.0 milestone Aug 25, 2021

This was referenced Mar 12, 2023

Grasping CAMISIM ndreey/ghost-magnet#13

Open

CAMISIM: Generate the abundance profiles ndreey/ghost-magnet#19

Open

AlphaSquad mentioned this issue Jan 19, 2024

Generation from different genome for same specie #186

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guidance on abundance.tsv file and output #109

Guidance on abundance.tsv file and output #109

NienkeMekkes commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021 •

edited

Loading

NienkeMekkes commented Apr 23, 2021 •

edited

Loading

AlphaSquad commented Apr 23, 2021

NienkeMekkes commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021

NienkeMekkes commented Apr 23, 2021

NienkeMekkes commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021

Guidance on abundance.tsv file and output #109

Guidance on abundance.tsv file and output #109

Comments

NienkeMekkes commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021 • edited Loading

NienkeMekkes commented Apr 23, 2021 • edited Loading

AlphaSquad commented Apr 23, 2021

NienkeMekkes commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021

NienkeMekkes commented Apr 23, 2021

NienkeMekkes commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021

AlphaSquad commented Apr 23, 2021 •

edited

Loading

NienkeMekkes commented Apr 23, 2021 •

edited

Loading