Microbial transcriptome indices #1722

Closed
georgiadoing opened this issue Oct 2, 2019 · 4 comments
@georgiadoing

georgiadoing commented Oct 2, 2019

Context

For many microbes, it is not clear which transcriptome indices are best for building compendia.

Problem or idea

The first hurdle is that microbial genome assemblies are not in the main Ensembl database. The second is that many microbes have many genome assemblies, with no standard naming scheme to identify the 'best' reference. Part of this comes from the multitude of strains for each microbial species and, in turn, the multitude of assemblies for each strain.

Solution or next step

  1. A first step would be to allow transcriptome indices to be requested from EnsemblBacteria, in a manner similar to how the README instructs for Surveyor Jobs, like so:

./foreman/run_surveyor.sh survey_all --accession "Pseudomonas aeruginosa, EnsemblBacteria"

  2. Next we would need to decide and specify which strain to use. This is harder to solve, and my only short-term solution is to manually specify a strain for each microbe of interest based on expert knowledge. For example, our preferred reference strain, Pseudomonas aeruginosa PAO1, could perhaps be specified like so:

./foreman/run_surveyor.sh survey_all --accession "Pseudomonas aeruginosa PAO1, EnsemblBacteria"

  3. Third, we would need to make sure we are getting the best assembly for that strain. For example, for Pseudomonas aeruginosa PAO1 there are 7 genomes in EnsemblBacteria, and we would like to specifically get ASM676v1, the most complete reference genome and the best one for building a compendium.

In summary, if there were a way to specify the strain and assembly for a given microbial organism in the accession call, and if we curated a list of preferred strains and assemblies, we could begin to tackle this.
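To make the idea concrete, here is a minimal sketch of how a curated list could drive the accession call. Everything in it is an assumption for illustration: the `Reference` type, the `PREFERRED_REFERENCES` table and the `build_survey_request` helper are hypothetical names, not part of the existing codebase, and the strain-qualified accession string is the format proposed above rather than something the surveyor is known to accept today. The only curated entry shown is the PAO1 / ASM676v1 example from this issue.

```python
from typing import NamedTuple, Tuple


class Reference(NamedTuple):
    """Preferred reference for one organism (hypothetical structure)."""
    strain: str
    assembly: str


# Hand-curated from expert knowledge. Only the Pseudomonas aeruginosa
# entry comes from this issue; other organisms would be added as curated.
PREFERRED_REFERENCES = {
    "Pseudomonas aeruginosa": Reference(strain="PAO1", assembly="ASM676v1"),
}


def build_survey_request(
    organism: str, division: str = "EnsemblBacteria"
) -> Tuple[str, str]:
    """Return (accession string, preferred assembly) for an organism.

    The accession string follows the strain-qualified format proposed in
    this issue. The assembly is returned separately because no syntax for
    passing it to the surveyor has been decided.
    """
    if organism not in PREFERRED_REFERENCES:
        raise KeyError(f"No curated reference for {organism!r}")
    ref = PREFERRED_REFERENCES[organism]
    accession = f"{organism} {ref.strain}, {division}"
    return accession, ref.assembly


accession, assembly = build_survey_request("Pseudomonas aeruginosa")
print(accession)
print(assembly)
```

Raising an error for uncurated organisms, rather than falling back to an unqualified organism name, keeps an arbitrary assembly from being picked silently.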

Below is a link to a working Google Sheet of organisms, strains and assemblies that I will curate, in case it is useful.

https://docs.google.com/spreadsheets/d/1Lbi68UP2dQtfp-KoxtXpE7jhCOxgP_FweGznbbOiMkw/edit?usp=sharing

@dvenprasad
Member

Got a request for Bacillus subtilis via Hotjar.

@kurtwheeler
Contributor

Hi @georgiadoing! I've finally gotten a chance to dig into this a little. I've come up with a few questions for you.

When you say "Next we would need to decide and specify which strain to use." it makes me think that you're suggesting that we use a single strain per organism, but then for Candida albicans the spreadsheet has two rows with different strains. Are you thinking that we should handle multiple strains per organism, or just one per organism?

Should we go with multiple strains per organism, what should we do when building compendia? Would we want a compendium to be strain-specific, or to combine all strains for a given organism?

Thanks a bunch!!

@cgreene
Contributor

cgreene commented Dec 16, 2019

@georgiadoing: I think there is a tradeoff here and I want to make sure we are on the same page about it. Can you let me know if I have it right?

  1. Many organisms have strains.
  2. We could treat those strains as separate entities (higher quality results within strain experiments). However, we don't yet have the ability to combine data across entities (this would be a substantial research question, so imagine results in 3 years as opposed to now). This means that if we treat them as separate entities there will be fewer samples in each compendium.
  3. Therefore, we need to select some level to combine at.
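A toy sketch of this tradeoff, assuming each sample carries an (organism, strain) annotation. The sample list is made up purely for illustration (it is not real data, and `other_strain` is a placeholder), and `compendium_sizes` is a hypothetical helper, not an existing function.

```python
from collections import Counter

# Hypothetical (organism, strain) annotations; not real data.
samples = [
    ("Candida albicans", "SC5314"),
    ("Candida albicans", "SC5314"),
    ("Candida albicans", "SC5314"),
    ("Candida albicans", "other_strain"),
    ("Candida albicans", "other_strain"),
]


def compendium_sizes(samples, by_strain):
    """Count samples per compendium at the chosen grouping level.

    by_strain=True  -> one compendium per (organism, strain) pair
    by_strain=False -> one compendium per organism, strains combined
    """
    keys = (sample if by_strain else sample[0] for sample in samples)
    return Counter(keys)


per_strain = compendium_sizes(samples, by_strain=True)
per_organism = compendium_sizes(samples, by_strain=False)
print(dict(per_strain))
print(dict(per_organism))
```

Grouping by strain splits the same five samples into compendia of three and two, while grouping by organism yields a single compendium of five.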

I don't understand how we would map the various Candida experiments to one of those two strains. I think you're proposing that for the other organisms they'd all get mapped to a single strain (the one in your spreadsheet). Is that correct?

@georgiadoing
Author

Hi Kurt,

I think we should start by choosing one strain per organism. Sorry for the confusion with Candida albicans; I've amended the Google Sheet to include only SC5314, which is a common laboratory strain.

You are right, there is a trade-off here. Accounting for differences in strain background is an interesting and important long-term goal, but for the sake of having the largest possible compendia now, I think aligning all experiments for a given organism to a single strain background is the most straightforward start.

Does this address your question? Sorry again!

Thanks!

P.S. Deb and I are reaching out to experts about S. aureus, E. coli and B. subtilis strains and will update soon.
