Context
For many microbes, it is not clear which transcriptome indices are best for building compendia.
Problem or idea
The first hurdle is that microbial genome assemblies are not in the main Ensembl database. The second is that many microbes have many genome assemblies, with no standard naming scheme to indicate the 'best' reference. Part of this comes from the multitude of strains for each microbial species and, in turn, the multitude of assemblies for each strain.
Solution or next step
- A first step would be to allow transcriptome indices to be requested from EnsemblBacteria in a manner similar to how the README instructs for Surveyor jobs, like so:
./foreman/run_surveyor.sh survey_all --accession "Pseudomonas aeruginosa, EnsemblBacteria"
- Next, we would need to decide on and specify which strain to use. This is harder to solve, and my only short-term solution is to specify the strain manually for each microbe of interest, based on expert knowledge. For example, our preferred reference strain, Pseudomonas aeruginosa PAO1, could perhaps be specified like so:
./foreman/run_surveyor.sh survey_all --accession "Pseudomonas aeruginosa PAO1, EnsemblBacteria"
- Third, we would need to make sure we are getting the best assembly for that strain. For example, there are 7 genomes for Pseudomonas aeruginosa PAO1 in EnsemblBacteria, and we would like to get ASM676v1 specifically, the most complete reference genome and the best one for building a compendium.
In summary, if there were a way to specify the strain and assembly in the accession call for a given microbial organism, and if we curated a list of preferred strains and assemblies, we could begin to tackle this.
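To make the idea concrete, here is a rough Python sketch of how an extended accession string might be parsed. Everything in it is an assumption for illustration: the three-part format "Genus species strain, Division, Assembly", the `MicrobialAccession` class, and the `parse_accession` function are hypothetical and are not part of the existing surveyor. It also assumes the species is always the first two words (a binomial name) and that anything after that is the strain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MicrobialAccession:
    """Hypothetical container for the pieces of an extended accession."""
    species: str
    strain: Optional[str]
    division: str
    assembly: Optional[str]


def parse_accession(accession: str) -> MicrobialAccession:
    """Parse 'Genus species [strain], Division[, Assembly]'.

    The optional third field (assembly) is a proposed extension, not
    something the current surveyor is known to accept.
    """
    parts = [part.strip() for part in accession.split(",")]
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"Unrecognised accession: {accession!r}")

    words = parts[0].split()
    if len(words) < 2:
        raise ValueError(f"Expected a binomial species name in: {accession!r}")

    # Assumption: the first two words are the species, the rest is the strain.
    species = " ".join(words[:2])
    strain = " ".join(words[2:]) or None
    assembly = parts[2] if len(parts) == 3 else None
    return MicrobialAccession(species, strain, parts[1], assembly)
```

With this, "Pseudomonas aeruginosa PAO1, EnsemblBacteria, ASM676v1" would split into the species, the PAO1 strain, the division, and the ASM676v1 assembly, while the existing two-part form would still parse with no strain or assembly set.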
Below is a link to a working Google Sheet of organisms, strains, and assemblies that I will curate, in case it is useful.
https://docs.google.com/spreadsheets/d/1Lbi68UP2dQtfp-KoxtXpE7jhCOxgP_FweGznbbOiMkw/edit?usp=sharing
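As a possible way to use such a curated list, the sketch below turns a CSV export of the sheet into one surveyor invocation per organism. The column names (`organism`, `strain`, `assembly`), the `surveyor_commands` function, and the three-part accession with a trailing assembly name are all assumptions on my part; the surveyor would need to be extended before it could accept accessions in this form. The only row included is the PAO1 example from above.

```python
import csv
import io
import shlex

# Hypothetical CSV export of the curated sheet; column names are assumed.
CURATED_CSV = """organism,strain,assembly
Pseudomonas aeruginosa,PAO1,ASM676v1
"""


def surveyor_commands(csv_text, division="EnsemblBacteria"):
    """Build one run_surveyor.sh command line per curated row.

    The 'organism strain, division, assembly' accession is a proposed
    format, not one the current surveyor is known to support.
    """
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        accession = f"{row['organism']} {row['strain']}, {division}, {row['assembly']}"
        # shlex.quote keeps the spaces and commas inside a single shell argument.
        commands.append(
            "./foreman/run_surveyor.sh survey_all --accession " + shlex.quote(accession)
        )
    return commands


if __name__ == "__main__":
    for command in surveyor_commands(CURATED_CSV):
        print(command)
```

Keeping the preferred strain and assembly in a table like this would mean the expert knowledge lives in one place, and the accession calls can be regenerated whenever the curation changes.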