AllTheBacteria

All WGS isolate bacterial INSDC data to June 2023, uniformly assembled, QC-ed, annotated, searchable.

Follow up to Grace Blackwell's 661k dataset (which covered everything to Nov 2018).

Release 0.1

Full details here: https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1

Data here: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.1/

First release contains

About 2 million Shovill assemblies, identified by ENA sample ID
A summary of assembly statistics
File(s) summarising taxonomic and contamination statistics, based on sylph taxonomic abundance estimation (GTDB r214) and CheckM2
A filelist specifying "high quality" assemblies
A README describing all this

The assembly workflow is on GitHub, but we don't have a distributable container for it yet.
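As a rough illustration of how the "high quality" filelist might be used, the sketch below filters a list of sample ids against it. The layout of the filelist is an assumption here (plain text, one entry per line, identifier in the first whitespace-separated field); check the release README for the real format. The function names and the accessions are made up for the example.

```python
from typing import Iterable, List, Set


def parse_filelist(lines: Iterable[str]) -> Set[str]:
    """Collect identifiers from a filelist, skipping blank lines and '#' comments."""
    ids: Set[str] = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            # Assumption: the identifier is the first whitespace-separated field.
            ids.add(entry.split()[0])
    return ids


def keep_high_quality(sample_ids: Iterable[str], high_quality: Set[str]) -> List[str]:
    """Return the sample ids present in the high-quality set, preserving input order."""
    return [s for s in sample_ids if s in high_quality]


if __name__ == "__main__":
    # Made-up identifiers in the style of ENA sample accessions, for illustration only.
    example_filelist = ["SAMEA0000001", "", "SAMEA0000003"]
    hq = parse_filelist(example_filelist)
    print(keep_high_quality(["SAMEA0000001", "SAMEA0000002", "SAMEA0000003"], hq))
```

With a real filelist, `parse_filelist` can be given an open file handle directly, since it only iterates over lines.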

Further releases

Future releases will include

Removal of human contamination. Mapping all contigs against the human genome to identify contamination is taking some time, so a new release will remove a small number of contigs from a small proportion of the genomes.
More search indexes
Annotation (bakta at least)
Pan-genomes and harmonised gene names within species (for the top N species), for representative genomes chosen using PopPUNK clusters and QC metrics
MLST, various species specific typing, AMR

Distribution

Data will be distributed at least by

EBI FTP, which is simultaneously accessible by Globus and Aspera
Zenodo would be good to add

Rules of Engagement with the data

Once Release 0.1 is out, anyone/everyone is welcome to use the data and publish with it. There is no expectation that the people who made the release/data should be co-authors on these publications, but we would appreciate citation of the preprint (https://www.biorxiv.org/content/10.1101/2024.03.08.584059v1).

Rules of Involvement with the project

All are welcome; contact us via GitHub, Slack or the monthly Zoom calls. Anyone who contributes to the project, through analysis, project management or any other means, ought to be an author of the paper.

Next zoom calls

22nd March 2024, 9am and 4pm GMT

FAQ

What happens if two people want to run their competing methods (bad example: prokka versus bakta, or one AMR tool versus another)?

First, anyone can do anything they like, but to get into the releases, we should discuss it on a zoom call and make a decision. We shall tend towards allowing multiple analyses (e.g. we intend to run bakta on everything, but if someone wants to run prokka too, we should be OK to add that to the release as well). However, if it starts to get silly, with people wanting 4 tools each run with 3 parameters, then I think we get a lot stricter: this compute isn't free (in terms of carbon or money), so we'll make a decision and do something limited.
