VMGC - Human Vaginal Microbiome Genome Collection

The VMGC is a large-scale reference genome resource of the human vaginal microbiome, including over 33,000 genomes derived from 786 prokaryotes, 11 fungi, and 4,263 viruses associated with the human vagina. The collection captures most vaginal microbial sequences: the median mapping rate is 91.7% across 4,472 vaginal metagenomic samples obtained from 14 countries.

See our paper for details:

Huang L, Guo R, Li S, Wu X, Zhang Y, Guo S, Lv Y, Xiao Z, Kang J, Meng J, Zhou P, Ma J, You W, Zhang Y, Yu H, Zhao J, Huang G, Duan Z, Yan Q, Sun W. 2024. A multi-kingdom collection of 33,804 reference genomes for the human vaginal microbiome. Nature Microbiology doi:10.1038/s41564-024-01751-5.

The microbial sequence data for VMGC is stored at https://zenodo.org/records/10457006, including the data presented in the table below.

Description | Size | Filename
Full prokaryotic genomes (n=19,542) | 8.1 GB | VMGC_prokaryote_MAG.tar.gz
Annotations of prokaryotic genomes (Source/Quality/Clustering...) | 2.3 MB | VMGC_prokaryote_MAG.info
Nonredundant prokaryotic genomes (n=786) | 440 MB | VMGC_prokaryote_SGB.tar.gz
Annotations of nonredundant prokaryotic genomes (Taxonomy/...) | 120 KB | VMGC_prokaryote_SGB.info
Full eukaryotic genomes (n=42, including 4 from parasites) | 192 MB | VMGC_eukaryocyte.tar.gz
Annotations of eukaryotic genomes (Taxonomy/Quality/Clustering/...) | 11 KB | VMGC_eukaryocyte.info
Full viral genomes (n=14,224) and vOTUs (n=4,263) | 197 MB | VMGC_virus.tar.gz
Annotations of viral genomes (Taxonomy/Quality/Clustering/Host...) | 969 KB | VMGC_virus.info
Kraken & Bracken database | 2.1 GB | VMGC_prokaryote_SGB_KrakenDB.tar.gz


Single-coverage and Mash-based multiple-coverage binning

Metagenome-assembled genomes in the VMGC were obtained through a process that integrated both single-coverage binning and Mash-based multiple-coverage binning methods. The specific operational steps are outlined below.

  • Required dependencies

    Mash 2.3
    GNU parallel 20201122
    bwa-mem2 2.2.1
    MetaBAT v2
    perl 5.16.3
    combine.pl

  • Step 1: Calculate the Mash distances between the assembled files, and obtain the top 20 closest samples for each assembled sample.

    mash sketch -p 50 -s 100000 -k 32 -o mash.sketch contigs/*.fasta

    find contigs/*.fasta | parallel -k -j 10 mash dist mash.sketch.msh {} \| sort -nk3 \| head -n 20 \| sed -e "s/contigs\\\///g" -e "s/.fasta//g" > mash.sketch.msh.top20

  • Step 2: For each assembly file, use the clean reads from its top 20 closest samples by Mash distance (including its own reads) to calculate the sequencing depth of its contigs.

    find contigs/*.fasta | parallel --colsep '\t' -j 20 bwa-mem2 index -p {} {}

    cat mash.sketch.msh.top20 | parallel --colsep '\t' -j 10 mkdir -p depth/{2} \; bwa-mem2 mem -t 10 contigs/{2}.fasta clean_reads/{1}.1.fq.gz clean_reads/{1}.2.fq.gz \| samtools view -bS - -@ 10 \| samtools sort -@ 10 -o depth/{2}/{1}.sort.bam \&\& jgi_summarize_bam_contig_depths --outputDepth depth/{2}/{1}.sort.bam.depth depth/{2}/{1}.sort.bam

  • Step 3: Integrate depth calculation files using the public script combine.pl, and perform multiple-coverage binning.

    find depth/* -type d | parallel -j 10 combine.pl {}/*.sort.bam.depth \> {}.depth

    find depth/*.depth | parallel -j 10 mkdir -p bins/{/.}/ \; metabat2 -i contigs/{/.}.fasta -a {} -o bins/{/.}/{/.}.mbin -m 2000 -s 200000 --saveCls --unbinned --seed 2020

  • Step 4: Perform single-coverage binning.

    find depth/* -type d | parallel -j 10 metabat2 -i contigs/{/.}.fasta -a depth/{/.}/{/.}.sort.bam.depth -o bins/{/.}/{/.}.sbin -m 2000 -s 200000 --saveCls --unbinned --seed 2020
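
Before running Step 2, it can help to confirm that the top-20 table has the shape the later commands rely on. The sketch below is only an illustration on toy data: it assumes the table is tab-separated, with the read-source sample in column 1 and the assembly in column 2, as Step 2 reads them. The sample names S1 and S2 and all values are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-in for mash.sketch.msh.top20 (tab-separated, hypothetical values):
# read-source sample, assembly, Mash distance, p-value, shared hashes.
workdir=$(mktemp -d)
top20="$workdir/mash.sketch.msh.top20"
printf 'S1\tS1\t0\t0\t100000/100000\n'     >  "$top20"
printf 'S2\tS1\t0.0123\t0\t61000/100000\n' >> "$top20"
printf 'S2\tS2\t0\t0\t100000/100000\n'     >> "$top20"
printf 'S1\tS2\t0.0123\t0\t61000/100000\n' >> "$top20"

# For each assembly (column 2), count its neighbours and check that the
# assembly's own reads (column 1 equal to column 2) are among them.
summary=$(awk -F'\t' '
    { n[$2]++ }
    $1 == $2 { self[$2] = 1 }
    END {
        for (s in n)
            printf "%s\t%d\t%s\n", s, n[s], ((s in self) ? "self_ok" : "self_missing")
    }' "$top20" | sort)

echo "$summary"
```

On a real table, any assembly reported with more than 20 neighbours or with `self_missing` would be worth inspecting before mapping reads.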



Taxonomic profiling

Based on the 786 species-level genome bins (SGBs) in the VMGC, we reconstructed the prokaryotic composition of the vagina using Kraken2 and Bracken tools.

  • Required dependencies

    Kraken 2.1.3
    Bracken 2.8
    Python 3.8.16
    R 4.2.3
    GNU parallel 20201122
    kraken_MAGdb.sh

  • Step 1: Create customized Kraken2 and Bracken databases.

    kraken_MAGdb.sh sgb.info sgb.seq KBdb

      1. "sgb.info" contains the GTDB-Tk annotation results for the SGBs.
      2. "sgb.seq" lists the file paths of the SGB genomes.
      3. "KBdb" is the output folder path, and the generated data includes the following files:
          >tree KBdb/
          KBdb/
          ├── database150mers.kmer_distrib
          ├── database150mers.kraken
          ├── database.kraken
          ├── hash.k2d
          ├── library
          │   └── added
          │       ├── nFPH00iyWZ.fna
          │       ├── nFPH00iyWZ.fna.masked
          │       ├── prelim_map.txt
          │       └── prelim_map_ZLPtIxoxoZ.txt
          ├── opts.k2d
          ├── seqid2taxid.map
          ├── taxo.k2d
          └── taxonomy
              ├── db.accession2taxid
              ├── names.dmp
              ├── nodes.dmp
              └── prelim_map.txt
    
  • Step 2: The clean reads from each sample are mapped to the database, generating compositions at various taxonomic levels.

    find clean_reads/*.1.fq.gz | sed 's/\.1\.fq\.gz$//' | parallel -j 5 kraken2 --threads 10 --confidence 0.1 --db KBdb --report prof/{/}.report --report-minimizer-data --output prof/{/}.output {}.1.fq.gz {}.2.fq.gz

    find prof/*.report | parallel -j 5 bracken -d KBdb -i {} -o {}.bracken -r 150 -l S -t 1
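
Step 2 leaves one Bracken table per sample, and downstream analysis usually needs them combined. The following is a minimal sketch of that merge, not part of the VMGC pipeline. It assumes the standard seven-column Bracken output (name, taxonomy_id, taxonomy_lvl, kraken_assigned_reads, added_reads, new_est_reads, fraction_total_reads) and the `prof/<sample>.report.bracken` naming produced above. The samples A and B, the taxonomy IDs, and the read counts are hypothetical toy data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build two toy Bracken outputs (hypothetical samples A and B).
workdir=$(mktemp -d)
mkdir -p "$workdir/prof"
hdr='name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads\n'
{
    printf "$hdr"
    printf 'Lactobacillus crispatus\t1001\tS\t900\t50\t950\t0.95000\n'
    printf 'Gardnerella vaginalis\t1002\tS\t40\t10\t50\t0.05000\n'
} > "$workdir/prof/A.report.bracken"
{
    printf "$hdr"
    printf 'Lactobacillus iners\t1003\tS\t600\t0\t600\t0.60000\n'
    printf 'Gardnerella vaginalis\t1002\tS\t380\t20\t400\t0.40000\n'
} > "$workdir/prof/B.report.bracken"

# Merge into a long table: sample, species, estimated reads, fraction.
merged="$workdir/species_long.tsv"
awk -F'\t' -v OFS='\t' '
    FNR == 1 {                      # header line of each file: derive sample name, skip it
        sample = FILENAME
        sub(/.*\//, "", sample)
        sub(/\.report\.bracken$/, "", sample)
        next
    }
    { print sample, $1, $6, $7 }
' "$workdir"/prof/*.report.bracken > "$merged"

cat "$merged"
```

A long table like this can be pivoted into a species-by-sample matrix in R or Python, both of which are listed among the dependencies.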



Fungal genome identification

Raw bins were generated with the binning process described above, which integrates single-coverage binning and Mash-based multiple-coverage binning. EukRep and BUSCO were then used to extract fungal genomes from the raw bins.

  • Required dependencies

    EukRep 0.6.7
    BUSCO 5.4.2
    perl 5.16.3
    GNU parallel 20201122

  • Step 1: Remove the prokaryotic sequences from the raw bins.

    find bins/*/*.[ms]bin.[0-9]*.fa -size +3M | parallel -j 20 EukRep --min 2000 -i {} -o euk_bin/{/.}.fa

  • Step 2: Assess the genome quality of candidate bins, and extract fungal genomes.

    find euk_bin/*.fa -size +3M | parallel -j 20 busco -m genome -l fungi_odb10 -i {} -q -o busco/{/.}

    grep " C:" busco/*/*txt | perl -ne '/fungi_odb10.(.*).txt.*C:(.*?)%.*D:(.*?)%/;print "ln -s ../euk_bin/$1.fa fungal_bins/\n" if $2>50 and $3<5' |sh
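
The filter in Step 2 keeps a bin when BUSCO completeness (C) is above 50% and duplication (D) is below 5%. The sketch below restates that rule as a small shell function so the thresholds are easy to see and test. The function name `passes_filter`, the bin names, and the score strings are all hypothetical. The strings follow the usual BUSCO one-line summary notation.

```shell
#!/usr/bin/env bash
set -e

# Return success when a BUSCO score string has C > 50 and D < 5.
# $1: e.g. C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:758
passes_filter() {
    local c d
    c=$(printf '%s\n' "$1" | sed -E 's/.*C:([0-9.]+)%.*/\1/')
    d=$(printf '%s\n' "$1" | sed -E 's/.*D:([0-9.]+)%.*/\1/')
    awk -v c="$c" -v d="$d" 'BEGIN { exit !(c > 50 && d < 5) }'
}

# Toy candidates in the form name=score (made-up values).
kept=""
for entry in \
    'binA=C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:758' \
    'binB=C:42.0%[S:41.0%,D:1.0%],F:8.0%,M:50.0%,n:758' \
    'binC=C:80.0%[S:70.0%,D:10.0%],F:5.0%,M:15.0%,n:758'
do
    name=${entry%%=*}
    score=${entry#*=}
    if passes_filter "$score"; then
        kept="$kept $name"
    fi
done

echo "kept:$kept"
```

Here binB fails on completeness and binC fails on duplication, so only binA would be linked into fungal_bins/.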



Viral genome identification

  • Required dependencies

    seqkit 2.1.0
    checkv 0.7.0
    VIBRANT 1.2.1
    DeepVirFinder 1.0
    BLAST 2.12.0+
    HMMER 3.3.2
    MinCED 0.4.2
    Prodigal 2.6.3
    diamond 2.0.13.151
    perl 5.16.3
    Python 3.8.16
    GNU Awk 4.0.2
    GNU parallel 20201122
    virusFind_flow.sh
    virusClstr.sh
    virusHost.sh
    virusTaxAnno.sh

  • Step 1: Extract the viral sequences from the assembly files.

    find contigs/*.fasta | parallel -j 20 virusFind_flow.sh {} 5000 {/.} virus/{/.}

  • Step 2: Cluster viral sequences to generate Viral Operational Taxonomic Units (vOTUs).

    virusClstr.sh virus/*.virus.fa votu clstr 50

  • Step 3: Taxonomic annotation for vOTUs.

    prodigal -q -d clstr/votu.ffn -o clstr/votu.gff -p meta -a clstr/votu.faa -f gff

    virusTaxAnno.sh clstr/votu.faa clstr/votu.fa tax/votu

  • Step 4: Virus-Host prediction based on the 19,542 prokaryotic MAGs in the VMGC.

    • Step 4.1: Find CRISPRs in all MAGs, and build BLAST databases.

      find mag_fasta/*.fa | parallel -j 20 minced -minNR 2 {} csp/{/.}.csp csp/{/.}.gff

      cat csp/*.csp |grep '^Sequence\|\[' | perl -ne 'if(/^Sequence.*/){/'\''(\S+)'\''/; $a=$1;$b=1}else{/(\d+)(\s+)(\S+)(\s+)(\S+)/;print ">$a.sp$b\n$5\n";$b++}' > csp.tot.fa

      makeblastdb -in csp.tot.fa -dbtype nucl -out csp.tot.fa

      cat mag_fasta/*.fa > mag.fa && makeblastdb -in mag.fa -dbtype nucl -out mag.fa

    • Step 4.2: Predict hosts.

      virusHost.sh clstr/votu.fa csp.tot.fa mag.fa mag.tax

Note that the helper scripts called from within the shell scripts are also available at https://github.com/RChGO/VMGC/tree/main/Pipelines.




Correspondence and requests for materials should be addressed to grchun@hotmail.com
