Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guidance on abundance.tsv file and output #109

Open
NienkeMekkes opened this issue Apr 23, 2021 · 8 comments
Open

Guidance on abundance.tsv file and output #109

NienkeMekkes opened this issue Apr 23, 2021 · 8 comments
Assignees
Labels
Milestone

Comments

@NienkeMekkes
Copy link

Dear authors,

I remember asking this question before, but when looking back, I still am having some last doubts I was hoping you could help me with.

This is a summary of organism, abundance.tsv, and genome size
Salmonella bongori | 0.10 | 4487548
Salmonella enterica | 0.4 | 5028552
escherichia coli | 0.5 | 4643559

When I use CAMISIM to simulate my sample, I use the readmappings.tsv output file with grep -c to check my abundance.
In this output file, I have exactly 10% of reads belonging o S bongori, 40% to S enterica, and 50 to e.coli

Is this expected behaviour? I read that the simulation is supposed to take genome length into consideration, so I expected that the amount of reads belonging to S.enterica would be inflated, since the genome is bigger. Why is this not the case?

@AlphaSquad
Copy link
Collaborator

AlphaSquad commented Apr 23, 2021

Hm, that is strange, are you sure that these values fit exactly?
The difference between S. bongori and S. enterica in genome size is only ~10%, so instead of 10% and 40% I would expect CAMISIM to simulated something like 9% and 41% of reads for both:
If the coverage is ratio 4:1 between the two genomes and the genome size ratio is 1.1:1, then we expect a read ratio of 4.4:1 in reads between the two - which translates to ~40.75%:9.25% (here neglecting the third genome which is not exactly halfway between the two Salmonella)

@NienkeMekkes
Copy link
Author

NienkeMekkes commented Apr 23, 2021

Hi and thank you for your quick reply. Your calculations are also what I expected to happen.

When I use grep -c on my readmappings.tsv file, I find this file has a total of 1333338 lines, with the counts of lines and percentage:
bongori | 133334 | 0.10000015
enterica | 666666 | 0.4000021
e.coli | 533338 | 0.49999775

So while not exact, my readmappings file does fit closer to the abundance I put in the abundance.tsv file, then what I would expact from making these caclulations myself. Is there a better way to check the fraction of reads simulated besides grep -c on the readmappings.tsv?

@AlphaSquad
Copy link
Collaborator

I think grep -c on the readmappings file should be fine. I tried the same on a few of my datasets and the results were as I expected it (i.e. the larger genomes having far more reads than the smaller ones, even with lower abundance).
Which read simulator did you use?

@NienkeMekkes
Copy link
Author

wgsim, so error free simulation. Could this cause the behaviour?

@AlphaSquad AlphaSquad self-assigned this Apr 23, 2021
@AlphaSquad AlphaSquad added the bug label Apr 23, 2021
@AlphaSquad
Copy link
Collaborator

I will have to look into it, but I have the strong suspicion that this is a bug in CAMISIM when using wgsim. Could you run the same data set using ART and look if the distribution of reads is as you (and I 😉 ) would expect?

@NienkeMekkes
Copy link
Author

sure!

@NienkeMekkes
Copy link
Author

Hi again,

I ran the simulation again with:

readsim=tools/art_illumina-2.3.6/art_illumina
error_profiles=tools/art_illumina-2.3.6/profiles
samtools=tools/samtools-1.3/samtools
profile=mbarc
size=0.1
type=art
fragments_size_mean=270
fragment_size_standard_deviation=27

This resulted in:

  grep -c % grep -c count expected
S bongori 0.094 62884 ~0.09
E coli 0.490 326554 ~0.49
S enterica 0.416 277222 ~0.42
    666660  

The expected row is what I (and you :) )expected my distribution to look like. It is a very good match when using art!
For error free simulation, I think I can still use my simulated reads (my only goal is to use them together with kraken/bracken, and check if I can find the abundance simulated with bracken), since the simulated numbers are what I expected.

@AlphaSquad
Copy link
Collaborator

Okay, that is good to hear. It is still a bug/inconsistency for wgsim (and likely for Nanosim as well) which I will fix.
If you can use your data regardless that is great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants