All WGS isolate bacterial INSDC data to June 2023, uniformly assembled, QC-ed, annotated, searchable.
Follow-up to Grace Blackwell's 661k dataset (which covered everything to Nov 2018).
Full details here: https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1
Data here: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.1/
First release contains
- About 2 million Shovill assemblies, identified by ENA sample id
- summary of assembly statistics
- File(s) summarising taxonomic and contamination statistics based on sylph taxonomic abundance estimation (GTDB r214), and CheckM2
- A filelist specifying "high quality" assemblies
- A README describing all this.
The assembly workflow is on GitHub, but we don't have a distributable container for it yet.
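As an illustration of how the "high quality" filelist and the assembly statistics summary might be combined, here is a minimal Python sketch. It assumes the filelist holds one sample id per line and that the statistics file is tab-separated with a `sample` column; neither the layout nor the column names are taken from the release, so check the README for the real ones. The sample ids and numbers below are made up.

```python
import csv
import io


def load_high_quality_ids(filelist_handle):
    """Read one sample id per line, skipping blank lines (assumed layout)."""
    return {line.strip() for line in filelist_handle if line.strip()}


def filter_stats(stats_handle, keep_ids, id_column="sample"):
    """Yield rows of a tab-separated stats table whose id is in keep_ids.

    The column name "sample" is a hypothetical placeholder.
    """
    reader = csv.DictReader(stats_handle, delimiter="\t")
    for row in reader:
        if row[id_column] in keep_ids:
            yield row


# Tiny in-memory stand-ins for the real files; all values are invented.
filelist = io.StringIO("SAMEA0000001\nSAMEA0000003\n")
stats = io.StringIO(
    "sample\ttotal_length\tN50\n"
    "SAMEA0000001\t5000000\t200000\n"
    "SAMEA0000002\t4100000\t1500\n"
    "SAMEA0000003\t2900000\t350000\n"
)

high_quality = load_high_quality_ids(filelist)
selected = list(filter_stats(stats, high_quality))
```

With real data, the two `io.StringIO` objects would be replaced by open file handles on the downloaded files.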
Future releases will include
- Removal of human contamination. Mapping all contigs against the human genome to identify contamination is taking some time; once it finishes, a new release will remove a small number of contigs from a small proportion of the genomes.
- More search indexes to come.
- Annotation (bakta at least)
- Pan-genomes and harmonised gene names within species (for the top N species) for representative genomes chosen using poppunk clusters and QC metrics.
- MLST, various species specific typing, AMR
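The human-contamination item above amounts to dropping flagged contigs from an assembly. A minimal Python sketch of that filtering step follows, assuming the set of flagged contig ids has already been produced by some mapping step (not shown) and that a contig id is the first whitespace-delimited token of its FASTA header. This is an illustration only, not the project's workflow, and the contig names are made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_contigs(records, flagged_ids):
    """Yield only the records whose contig id is not in flagged_ids."""
    for header, seq in records:
        # Contig id = first token of the header (assumption).
        contig_id = (header.split() or [""])[0]
        if contig_id not in flagged_ids:
            yield header, seq


# Invented example assembly with three contigs, one flagged as human.
fasta = [
    ">contig00001 len=8\n", "ACGTACGT\n",
    ">contig00002 len=4\n", "GGCC\n",
    ">contig00003 len=6\n", "TTAA\n", "GG\n",
]
flagged = {"contig00002"}
kept = list(drop_contigs(read_fasta(fasta), flagged))
```

Because both functions are generators, the same code can stream a large assembly file without loading it fully into memory.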
Data will be distributed at least by
- EBI ftp, which is simultaneously accessible by Globus and Aspera.
- Zenodo would be good to add
Once Release 0.1 is out, anyone/everyone is welcome to use the data and publish with it. There is no expectation that the people who made the release/data should be co-authors on these publications, but we would appreciate citation of the preprint (https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1).
All welcome: contact us via GitHub, Slack or the monthly zoom calls. Anyone who contributes to the project, through analysis, project management or any other means, ought to be an author of the paper.
22nd March 2024, 9am and 4pm GMT
- What happens if two people want to run their competing methods (bad example: prokka versus bakta, or one AMR tool versus another)? First, anyone can do anything they like, but to get into the releases, we should discuss on a zoom call and make a decision. We shall tend towards allowing multiple analyses (e.g. we intend to run bakta on everything, but if someone wants to run prokka too, we should be OK to add that to the release too). However, if it starts to get silly, with people wanting 4 tools each run with 3 parameters, then I think we get a lot stricter - this compute isn't free (in terms of carbon, or money), so we'll make a decision and do something limited.