For many microbes, it is not clear which transcriptome indices are best for building compendia.
Problem or idea
The first hurdle is that these genome assemblies are not in the main Ensembl database. The second is that many microbes have many genome assemblies, with no standard naming scheme to point out the 'best' reference. Part of this comes from the multitude of strains for each microbial species and, in turn, the multitude of assemblies for each strain.
Solution or next step
./foreman/run_surveyor.sh survey_all --accession "Pseudomonas aeruginosa, EnsemblBacteria"
./foreman/run_surveyor.sh survey_all --accession "Pseudomonas aeruginosa PAO1, EnsemblBacteria"
In summary, if there is a way to specify the strain and assembly in the accession call for a given microbial organism, and if we curated a list of preferred strains and assemblies, we could begin to tackle this.
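To make the idea concrete, here is a minimal sketch of how a curated list could drive the accession string. Everything in it is an assumption for illustration: the `PREFERRED` table, the `PreferredReference` type, and the `build_accession` helper are hypothetical names, not part of the surveyor. The "organism strain, division" format simply mirrors the second command above, and whether the surveyor accepts it is the open question. Assemblies are left unset rather than guessed.

```python
from typing import NamedTuple, Optional


class PreferredReference(NamedTuple):
    """One curated row: the strain (and, eventually, assembly) to align against."""
    strain: str
    assembly: Optional[str]  # intentionally unset; to be filled in from the curated sheet


# Hypothetical curated table; the two strains are the ones named in this thread.
PREFERRED = {
    "Pseudomonas aeruginosa": PreferredReference(strain="PAO1", assembly=None),
    "Candida albicans": PreferredReference(strain="SC5314", assembly=None),
}


def build_accession(organism: str, division: str = "EnsemblBacteria") -> str:
    """Build a surveyor-style accession string for an organism.

    Assumes the "<organism> <strain>, <division>" format from the example
    command; organisms with no curated entry fall back to the species-level
    form "<organism>, <division>". Non-bacterial organisms would need a
    different division passed in.
    """
    ref = PREFERRED.get(organism)
    if ref is None:
        return f"{organism}, {division}"
    return f"{organism} {ref.strain}, {division}"


print(build_accession("Pseudomonas aeruginosa"))
```

The output of `build_accession` would then be passed as the `--accession` argument, so the choice of strain lives in one curated table rather than in each survey call.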
Below is a link to a working gsheet of organisms, strains and assemblies that I will curate, in case it is useful.
Hi @georgiadoing! I've finally gotten a chance to dig into this a little. I've come up with a few questions for you.
When you say "Next we would need to decide and specify which strain to use," it makes me think you're suggesting that we use a single strain per organism, but for Candida albicans the spreadsheet has two rows with different strains. Are you thinking that we should handle multiple strains per organism, or just one per organism?
If we go with multiple strains per organism, what should we do when building compendia? Would we want a compendium to be strain-specific, or to combine all strains for a given organism?
Thanks a bunch!!
@georgiadoing : I think there is a tradeoff here and I want to make sure we are on the same page with it. Can you let me know if I have it right?
I don't understand how we would map the various Candida experiments to one of those two strains. I think you're proposing that, for the other organisms, all experiments would get mapped to a single strain (the one in your spreadsheet). Is that correct?
I think we should start by choosing one strain per organism. Sorry for the confusion with Candida albicans. I've amended the Google Sheet to include only SC5314, which is a common laboratory strain.
You are right, there is a trade-off here. In the long term, ways of accounting for differences in strain background are interesting and important to keep in mind. For the sake of having the largest possible compendia now, though, I think aligning all experiments for a given organism to a single strain background is the most straightforward start.
Does this address your question? Sorry again!
P.S. Deb and I are just reaching out to experts about S. aureus, E. coli, and B. subtilis strains, and will update soon.