Welcome to the MGBC repository! Here you will find the genomes and data files for analysing mouse gut metagenome samples as well as the pipelines and scripts used to build this catalogue.
If you use this resource or the linked MGBC-Toolkit, please cite our paper:
- Beresford-Jones, B.S., Forster, S.C., Stares, M.D., Notley, G., Viciani, E., Browne, H.P., Boehmler, D.J., Soderholm, A.T., Kumar, N., Vervier, K., Cross, J.R., Almeida, A., Lawley, T.D., Pedicord, V.A., 2021. The Mouse Gastrointestinal Bacteria Catalogue enables translation between the mouse and human gut microbiotas via functional mapping. Cell Host Microbe. https://doi.org/10.1016/j.chom.2021.12.003
Genomes:
- MGBC collection:
  - 26,640 high-quality, non-redundant genomes: MGBC-hqnr_26640.tar.gz
  - Genome metadata: MGBC_md_26640.tar.gz
  - Genome protein coding sequences (amino acid sequences; .faa): MGBC-faa_26640.tar.gz
  - Genome protein coding sequences (nucleotide sequences; .ffn): MGBC-ffn_26640.tar.gz
  - Genome annotations (GenBank flat file format):
    - Part 1 (MGBC000001-MGBC129999): MGBC-gbk_26640-d1.tar.gz
    - Part 2 (MGBC130000-MGBC167528): MGBC-gbk_26640-d2.tar.gz
  - This is the final genome set following dereplication that is used for the study's analyses.
- Full genome collection:
  - 35,925 high-quality genomes: MGBC-hq_35925.tar.gz
  - 29,129 medium-plus quality genomes: MGBC-mq_29129.tar.gz
  - Genome metadata: MGBC_md_65097.tar.gz
  - This is the complete collection of genomes generated/curated for this study.
- Mouse Culture Collection:
  - Genome assemblies for the 276 sequenced isolates (post-QC) are available from BioProject PRJEB45232
  - Genome annotations (GenBank flat file format): MCC-gbk_276.tar.gz
  - Deposition of cultured isolates from the paper to DSMZ is ongoing, and accessions of available isolates are being actively updated on this GitHub.
Protein catalogues:
- MGBC-UHGG combined catalogue - 100% sequence identity clusters: mgbc-uhgg_clus100.tar.gz
- MGBC-UHGG combined catalogue - 90% sequence identity clusters: mgbc-uhgg_clus90.tar.gz
- MGBC-UHGG combined catalogue - 80% sequence identity clusters: mgbc-uhgg_clus80.tar.gz
- MGBC-UHGG combined catalogue - 50% sequence identity clusters: mgbc-uhgg_clus50.tar.gz
- MGBC protein catalogue - 100% sequence identity clusters: mgbc_hq-mq_clus100.tar.gz
  - This catalogue contains the gene clusters from all non-redundant high and medium plus quality genomes of the MGBC.
Kraken2/Bracken database:
- MGBC Kraken2/Bracken database: MGBC-26640_KrakenBracken.tar.gz
  - This custom database leverages the 26,640 high quality genomes of the MGBC to achieve ~90% average read classification for mouse gut metagenome samples.
The global mouse metagenome compilation:
- Bracken output for 2,446 mouse gut metagenomes: bracken-out_2664.tar.gz
  - Species-level data on the microbiome composition for 2,446 mouse gut metagenome samples.
- Sample metadata for these mouse gut metagenomes: sample-metadata_2446.tar.gz
This repository is structured as follows:
- the `src` directory contains the source materials and pipelines to be able to reproduce our analyses. These pipelines include:
  - metagenome binning and MAG synthesis
  - construction of protein catalogues
  - assembly and functional annotation of species pangenomes
  - species-level functional analyses
- the `data` directory contains the reference datasets for the functional schemes and some example intermediate output files for `src`
- the `figures` directory contains the scripts used to build the figures for the manuscript
- the `supp` directory contains a description of the Supplementary Data Tables from the paper
Please read on for a detailed overview of the `src` directory. Specific information on the other directories can be found in the relevant README files.
The `src` directory contains four sub-directories organised to reflect the main workflows of this project.
This directory includes the custom MAG building pipeline that leverages MetaWRAP to get the best quality bins out of single samples. It also contains QC and taxonomy pipelines/scripts.
Overview of workflow:
- `MAG_pipeline.sh`: runs metagenome QC, assembly, binning and refinement for MAG synthesis.
- `QC_TAX_pipeline.sh`: wrapper for running CheckM and GTDB-Tk on refined bins.
- `get.RNA_profile.sh`: generate tRNA and rRNA profiles for a genome; designed to be run in parallel.
Build metagenome-assembled genomes quickly and easily from shotgun metagenomes.
Requirements:
- KneadData (tested v0.7.3)
- MetaWRAP (tested v1.2.3)
- GTDB-Tk v1.3.0 r95
- bsub.py v0.42.1
This pipeline was coded for running within LSF cluster environments, and runs multiple parallel job submissions to rapidly generate high-quality bins.
Usage:
MAG_pipeline.sh -i path/to/sample_ids -s path/to/study_directory -t threads -e REFINE
Arguments:
- `-i`: path to input file listing the sample file names without any file suffix.
- `-s`: path to the metagenome study directory (see below).
- `-t`: number of threads.
- `-S`: do not run pipeline, just generate the scripts.
- `-e`: early end - end after running ASSEMBLY or REFINE.
- `-f`: file count - keep track of the number of files that are being produced by each job for troubleshooting purposes.
Notes:
- the `1-build-MAGs/` directory needs to be part of your `$PATH` system variable
- this pipeline requires a specific file structure for the metagenome samples:
  - STUDY_NAME/
    - Metagenomes/
      - metagenome sample files, e.g. `SRR6051702.fastq` (single read) or `SRR11404551_1.fastq` and `SRR11404551_2.fastq` (paired end)
    - SAMPLE_IDs.txt: file listing the names of the metagenome samples in the `Metagenomes/` directory without a suffix or paired-end index. For example: `SRR6051702`, `SRR11404551`
- the pipeline runs: QC - ASSEMBLY - BINNING - REFINE - REASSEMBLY
- output MAGs will be found in the REFINE or REASSEMBLY directory (if run)
- reassembling bins is a computationally expensive and resource-intensive process, potentially generating hundreds of thousands of temporary files. It is therefore recommended to use the `-e REFINE` option if running on many samples.
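For many samples, the sample ID file can be generated from the FASTQ file names rather than written by hand. The Python sketch below is not part of this repository. It assumes the naming scheme shown in the examples above (`<ID>.fastq` for single reads, `<ID>_1.fastq` and `<ID>_2.fastq` for paired ends), uncompressed files, and one ID per line in the output file. The function names `sample_ids` and `write_sample_ids` are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme (taken from the examples in the notes above):
#   single read -> <ID>.fastq
#   paired end  -> <ID>_1.fastq and <ID>_2.fastq
# Compressed files (.fastq.gz) are not handled in this sketch.
_FASTQ = re.compile(r"^(?P<id>.+?)(?:_[12])?\.fastq$")


def sample_ids(file_names):
    """Return sorted, de-duplicated sample IDs from FASTQ file names."""
    ids = set()
    for name in file_names:
        match = _FASTQ.match(name)
        if match:  # non-FASTQ files are ignored
            ids.add(match.group("id"))
    return sorted(ids)


def write_sample_ids(study_dir, out_name="SAMPLE_IDs.txt"):
    """Write one sample ID per line to <study_dir>/<out_name> (assumed layout)."""
    study = Path(study_dir)
    names = [p.name for p in (study / "Metagenomes").iterdir() if p.is_file()]
    ids = sample_ids(names)
    (study / out_name).write_text("".join(f"{i}\n" for i in ids))
    return ids
```

Note that a single-read file whose ID itself ends in `_1` or `_2` would be treated as paired-end by this pattern, so check the output before passing it to `-i`.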
Runs CheckM on genomes and returns QC outcomes, before running GTDB-Tk classifier.
Requirements:
- CheckM v1.1.2
- GTDB-Tk v1.3.0 r95
Usage:
QC_TAX_pipeline.sh -i path/to/genome_directory -t threads -o path/to/output -x fa
Arguments:
- `-i`: path to directory containing genomes on which to run pipeline.
- `-o`: directory to write output to.
- `-t`: number of threads.
- `-x`: genome suffix, default = fna.
Notes:
- runs CheckM and GTDB-Tk on genomes in the genome directory
- QC specifications are:
  - Completeness ≥90%
  - Contamination ≤5%
  - Genome size ≤8Mb
  - Number of contigs ≤500
  - N50 ≥10,000
  - Mean contig length ≥5,000
- Genome ids that pass QC can be found in `<-o>/CheckM/Validated_genomes.txt`
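The QC specifications above amount to six threshold checks that must all hold. As a minimal sketch, not the pipeline's own code, they can be written in Python as follows. The `GenomeStats` container and `passes_qc` function are hypothetical names, and the sketch assumes that 8Mb means 8,000,000 bp and that N50 and mean contig length are given in bp.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Per-genome summary values (hypothetical container for this sketch)."""
    completeness: float        # percent
    contamination: float       # percent
    genome_size: int           # bp
    n_contigs: int
    n50: int                   # bp (assumed unit)
    mean_contig_length: float  # bp (assumed unit)


def passes_qc(g):
    """Return True only if every QC specification listed above is met."""
    return (
        g.completeness >= 90
        and g.contamination <= 5
        and g.genome_size <= 8_000_000  # assumes 8Mb = 8,000,000 bp
        and g.n_contigs <= 500
        and g.n50 >= 10_000
        and g.mean_contig_length >= 5_000
    )
```

A genome failing any single check is rejected, which is why a highly complete but fragmented assembly (for example, more than 500 contigs) would not appear in the validated list under these rules.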
Other scripts in this directory:
- `GTDBTK_CLASSIFY_EFFICIENT.sh`: the same as GTDB-Tk's `classify_wf` except with a smaller temporary file footprint.
- `get_lowest_taxonomy_v1.0.R`: takes GTDB-Tk output (`gtdbtk.bac120.summary.tsv`) and summarises the lowest taxonomy obtained - output is used in other pipelines.
- `get.RNA_profile.sh`: generate tRNA and rRNA analyses for a genome. Automatically tries to tar archive the RNA sequences for later use.
- `get.coverage.sh`: uses samtools and bowtie to generate bam alignments for MAGs and isolate genomes from their fastq files. Facilitates getting coverage for these genomes (to be added).
- the remaining files are part of the `MAG_pipeline.sh` pipeline
This directory includes the scripts to build the protein catalogues. As a prerequisite, CDS predictions should be generated for genomes (e.g. using prokka) and the .faa files for each genome concatenated (using `cat *.faa`). The resulting .faa file serves as input for this pipeline.
Overview of workflow:
- `mmseqs_wf_bsub.sh`: pipeline for running mmseqs2 (builds database and clusters sequences).
- `CLUSTER_STATS.sh`: generate human vs mouse analyses for clusters.
Build protein cluster databases from a concatenated protein sequence file.
Requirements:
- mmseqs2 (tested with v10.6d92c--h2d02072_0)
- bsub.py v0.42.1
This pipeline was coded for running within LSF cluster environments.
Usage:
mmseqs_wf_bsub.sh -i <INFILE> -o <OUTDIR> -t <THREADS> -T <TMPDIR> -FENH -m <MEMORY>
Arguments:
- `-i`: path to input file (concatenated protein sequences e.g. .faa to be clustered) [REQUIRED]
- `-o`: output directory [default: .]
- `-T`: directory to use to build the MMseqs database [default: .]
- `-F`: cluster at 50% sequence identity (orthologue level)
- `-E`: cluster at 80% sequence identity (genus level)
- `-N`: cluster at 90% sequence identity (species level)
- `-H`: cluster at 100% sequence identity
- `-t`: number of threads to submit jobs with [default: 1]
- `-q`: queue to submit jobs to [default: normal]
- `-m`: memory to submit jobs with, 120 Gb is recommended [REQUIRED]
Notes:
- will skip building the MMseqs database if one already exists in `<TMPDIR>`
- the `2-build-protein-catalogues/` directory needs to be part of your `$PATH` system variable to access `linclust.sh`
- output files are written to `<-o>/CLUS_X/`, where X represents the chosen sequence identity threshold(s)
  - `mmseqs_cluster_rep.fa`: fasta file containing sequence representatives
  - `mmseqs_cluster.tsv`: cluster membership file
`CLUSTER_STATS.sh`: run in the `CLUS_X` directory to generate human vs mouse statistics for comparing cluster membership. Output is written to `CLUS_X/tmp/cluster_stats.out`
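To illustrate what comparing cluster membership between hosts can look like, here is a minimal Python sketch. It is not a reimplementation of `CLUSTER_STATS.sh`, whose exact output is not described here. It assumes the cluster membership file has two tab-separated columns (representative sequence ID, member sequence ID), and it takes a caller-supplied `host_of` function, because how a sequence ID maps to a host depends on your own ID scheme. `read_clusters` and `host_breakdown` are hypothetical names.

```python
from collections import defaultdict


def read_clusters(lines):
    """Map each representative ID to the list of its member IDs.

    Assumes two tab-separated columns per line: representative, member.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        clusters[rep].append(member)
    return dict(clusters)


def host_breakdown(clusters, host_of):
    """Count clusters with members from one host only or from both hosts.

    host_of is expected to return either "MOUSE" or "HUMAN" for a member ID.
    """
    counts = {"mouse_only": 0, "human_only": 0, "shared": 0}
    for members in clusters.values():
        hosts = {host_of(m) for m in members}
        if hosts == {"MOUSE"}:
            counts["mouse_only"] += 1
        elif hosts == {"HUMAN"}:
            counts["human_only"] += 1
        else:
            counts["shared"] += 1
    return counts
```

In practice `host_of` would look up the genome that each protein came from. Passing it in as a parameter keeps that choice out of the parsing logic.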
This directory includes the scripts for building and functionally annotating species pangenomes.
Overview of workflow:
- `get.pangenome_MGBC1094.sh`: build and functionally annotate a species pangenome from the clustered protein catalogue from phase 2. eggNOG v2 and InterProScan v5 are run in parallel to generate functional annotations.
Build a host-specific pangenome for a species using a clustered protein catalogue.
Requirements:
- eggNOG-mapper v2.0.1
- InterProScan v5.39-77.0-W01
- bsub.py v0.42.1
This pipeline was coded for running within LSF cluster environments.
Usage:
get.pangenome_MGBC1094.sh -i <GENOME_REP_ID> -t <THREADS> -q <QUEUE> -H <HOST> -p <OUT_DIR> -CEI -l <SEQID>
Arguments:
- Input [REQUIRED]:
  - `-i`: Representative genome id without file suffix (i.e. .fna, .fa)
  - `-t`: Number of threads with which to run analyses.
  - `-q`: Queue to submit jobs to, for use with cluster analysis [default: normal]
  - `-H`: Specify host - either HUMAN or MOUSE.
- Output - pick one of the following options:
  - `-o`: Output directory in which to generate the results, mutually exclusive with -p [-p flag is default option].
  - `-p`: Path to directory to build a unique output directory (e.g. REP_ID.TAX.HOST) [default: .]
  - NB: For smooth integration with downstream pipelines, I recommend using `-p HUMAN` or `-p MOUSE` for human and mouse pangenomes respectively, run from the same directory.
- Action:
  - `-C`: Build pangenome using mmseqs gene clusters.
  - `-l`: Protein cluster sequence identity threshold to use with `-C`. Can be one of 50, 80, 90 or 100 [default: 90]
  - `-E`: Run eggNOG v2 on pangenome.
  - `-I`: Run InterProScan on pangenome.
Notes:
- need to update the path variables to the required data inside the file:
  - `$LINCLUST_DB`: path to directory containing CLUS_X directories
    - this will be the same directory as supplied to `mmseqs_wf_bsub.sh` with the `<-o>` flag
  - `$M_REPMEMS` and `$H_REPMEMS`: tab-separated representative genome index files for each host, where
    - column 1 contains the genome id,
    - column 2 contains the representative genome id for the species cluster,
    - column 3 indicates the lowest taxonomy as determined by GTDB-Tk and `get_lowest_taxonomy_v1.0.R`
    - `mgbc_rep_index_26640.tsv` and `uhgg_rep_index_100456.tsv` are given as examples in the `data/` directory
    - e.g.
      MGBC000001 MGBC000001 g__Schaedlerella
      MGBC000002 MGBC129157 s__CAG-485 sp002362485
      MGBC000003 MGBC000003 g__Schaedlerella
      MGBC000005 MGBC000328 s__Phocaeicola vulgatus
      MGBC000006 MGBC000320 s__Schaedlerella sp000364245
- the `3-build-species-pangenomes/` directory needs to be part of your `$PATH` system variable to access `GET_FASTA_FROM_CONTIGS_v4.py`
- output files are written to `$OUTDIR/cluster_"<-l>".out`
  - eggNOG output --> `<-i>.dmnd.emapper.annotations`
  - InterProScan output --> `ips_out.gff`
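The representative genome index described in the notes above is a three-column, tab-separated file. As an illustration of how it can be read, the following Python sketch (not part of the repository; `read_rep_index` is a hypothetical name) builds two lookups: genome to representative and taxonomy, and representative to the genomes in its species cluster. It splits on tabs only, because the taxonomy column can contain spaces.

```python
from collections import defaultdict


def read_rep_index(lines):
    """Parse a representative genome index (3 tab-separated columns).

    Returns (by_genome, by_rep):
      by_genome: genome id -> (representative genome id, lowest taxonomy)
      by_rep:    representative genome id -> list of genome ids in its cluster
    """
    by_genome = {}
    by_rep = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on tabs only: taxonomy strings may contain spaces.
        genome_id, rep_id, taxonomy = line.split("\t", 2)
        by_genome[genome_id] = (rep_id, taxonomy)
        by_rep[rep_id].append(genome_id)
    return by_genome, dict(by_rep)
```

With a file on disk, `read_rep_index(open(path))` works directly, since the function accepts any iterable of lines.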
The other scripts in this directory are used by the pipelines discussed above.
This directory includes the scripts for comparing the functional profiles of bacterial species of the human and mouse gut microbiota.
Overview of workflow:
- `summarise.eggnog_annotations.sh`: summarise eggNOG v2 annotations generated with `get.pangenome_MGBC1094.sh`.
- `summarise.ips_gff.sh`: summarise InterProScan v5 annotations generated with `get.pangenome_MGBC1094.sh`.
- `summarise.all_functions.MGBC120421.sh`: compile data for all pangenomes and generate functional profiles for each species.
- `build.function_presence_absence.sh`: generate presence-absence matrices for each functional scheme.
- `analyse.pangenome-distance_MGBC.R`: generate distance matrices for each functional scheme.
Summarise the eggNOG v2 output annotation file for a pangenome, returning feature-gene and gene-genome indices. Additionally calculates annotation efficiency data.
This pipeline was coded for running within LSF cluster environments.
Usage:
summarise.eggnog_annotations.sh -i <EGGNOG_OUT> -a <FAA> -o <OUTDIR> -g <GENOME_ID> -H <HOST>
Arguments:
- `-i`: Path to emapper v2 output file from `get.pangenome_MGBC1094.sh`, i.e. `<-i>.dmnd.emapper.annotations`
- `-a`: Path to original fasta file used for eggNOG annotation, i.e. `$OUTDIR/Cluster_"<-l>"/extracted_seqs.faa`
- `-o`: Directory to write to [REQUIRED]
- `-g`: Path to `genome_ids.txt` file in pangenome `$OUTDIR`
- `-H`: Host organism: either HUMAN or MOUSE
Notes:
- need to update the path variables to the required data inside the file:
  - `$CLUS_MEM`: path to cluster membership file for the protein cluster catalogue used to generate the pangenome
Summarise the InterProScan v5 output annotation file for a pangenome, returning feature-gene and gene-genome indices. Additionally calculates annotation efficiency data.
This pipeline was coded for running within LSF cluster environments.
Usage:
summarise.ips_gff.sh -i <IPS_OUT> -a <FAA> -o <OUTDIR> -g <GENOME_ID> -H <HOST>
Arguments:
- `-i`: Path to IPS output file from `get.pangenome_MGBC1094.sh`, i.e. `ips_out.gff`
- `-a`: Path to original fasta file used for IPS, i.e. `$OUTDIR/Cluster_"<-l>"/extracted_seqs.faa`
- `-o`: Directory to write to [REQUIRED]
- `-g`: Path to genome ids file
- `-H`: Host organism: either HUMAN or MOUSE
Notes:
- need to update the path variables to the required data inside the file:
  - `$CLUS_MEM`: path to cluster membership file for the protein cluster catalogue used to generate the pangenome
  - `$IPS_DATA`: path to the InterPro database, i.e. `data/InterPro_DBs/`
Summarises the functional annotations generated for all pangenomes using the `summarise.ips_gff.sh` and `summarise.eggnog_annotations.sh` scripts described above. Generates analyses of human- and mouse-specific functional features as well as total feature-level analyses. Automatically runs `summarise.pangenome_function.MGBC120421.sh` on the pangenomes to build feature-genome indexes ready for downstream distance matrix calculation.
Requirements:
- Requires `summarise.ips_gff.sh` and `summarise.eggnog_annotations.sh` to have already been run on all pangenomes, and their temporary files to still be available.
This pipeline was coded for running within LSF cluster environments, and runs jobs (via a bsub array) for parallelising analyses.
Usage:
summarise.all_functions.MGBC120421.sh <OUTDIR>
Arguments:
The script only needs to be run with the output directory specified. The script builds this directory if it does not already exist.
Notes:
- need to update the path variables to the required data inside the file:
  - `$PANGENOMES`: path to the directory where `HUMAN` and `MOUSE` directories exist, containing the pangenomes for each host organism.
Compiles species functional profiles to generate genome-function presence-absence matrices for each InterProScan and eggNOG functional scheme.
Usage:
build.function_presence_absence.sh <OUTDIR>
Arguments:
The script only needs to be run with the output directory (e.g. DISTANCE_MATRICES) specified. The script builds this directory if it does not already exist.
Notes:
- need to update the path variables to the required data inside the file:
  - `$HUMAN` and `$MOUSE`: paths to the `HUMAN` and `MOUSE` pangenome directories, containing the species pangenomes for each host organism.
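A presence-absence matrix records, for every species, whether each functional feature occurs in its pangenome. The Python sketch below shows the idea under stated assumptions. It is not the script's implementation: the orientation (species as rows, features as columns), the 0/1 encoding, and the feature identifiers in the usage example are all illustrative choices, and `presence_absence` is a hypothetical name.

```python
def presence_absence(profiles):
    """Build a 0/1 matrix from {species: set of feature ids}.

    Returns (features, rows), where features is the sorted list of all
    feature ids seen, and rows maps each species to a list of 0/1 values
    in the same order as features.
    """
    features = sorted(set().union(*profiles.values())) if profiles else []
    rows = {
        species: [1 if f in present else 0 for f in features]
        for species, present in profiles.items()
    }
    return features, rows


# Usage with made-up feature ids:
features, rows = presence_absence({
    "species_A": {"F1", "F2"},
    "species_B": {"F2", "F3"},
})
```

Sorting the feature ids gives every species the same column order, which is what allows the rows to be compared directly in a later distance calculation.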
Takes presence-absence matrix as input and produces a distance matrix from the functional profiles of each species.
Requirements:
- R v3.6.0
Usage:
analyse.pangenome-distance_MGBC.R -i <INFILE> -m <DIST_METHOD> -p <PREFIX> -o <OUTDIR>
Arguments:
- `-i`: Path to tsv file containing data for feature distribution across a core or pangenome (e.g. output from `build.function_presence_absence.sh`).
- `-m`: Which METHOD to use for distance matrix calculation. Any of the distance methods supported by Vegan's 'vegdist' function are allowed.
- `-b`: Flag to use BINARY distance analyses. [default: FALSE]
- `-p`: Prefix to give files that are being written.
- `-o`: Directory to write output files to.
Notes:
- need to update the path variables to the required data inside the file:
  - `$HUMAN` and `$MOUSE`: paths to the `HUMAN` and `MOUSE` pangenome directories, containing the species pangenomes for each host organism.
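To make the distance calculation concrete, here is a minimal Python sketch of one possible choice: the binary Jaccard dissimilarity, 1 - |A ∩ B| / |A ∪ B|, computed between the 0/1 functional profiles of each pair of species. The R script itself relies on vegdist and the method passed with `-m`; this sketch only illustrates the arithmetic for a single method. The function names are hypothetical, and returning 0.0 for two empty profiles is a convention chosen here.

```python
def jaccard_distance(a, b):
    """Binary Jaccard dissimilarity between two 0/1 vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # features in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # features in at least one
    # Convention chosen for this sketch: two empty profiles count as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def distance_matrix(rows):
    """Pairwise distances for {species: 0/1 vector}; returns (names, matrix)."""
    names = sorted(rows)
    matrix = [[jaccard_distance(rows[p], rows[q]) for q in names] for p in names]
    return names, matrix
```

For example, two species sharing one feature out of three observed in total are at distance 1 - 1/3, while identical profiles are at distance 0.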
The other scripts in this directory are used by the pipelines discussed above.