This is the GitHub repository for the Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP). A growing number of tools is available for analysing bacterial metagenomic and metatranscriptomic samples, but these tools cannot profile specific metabolic gene clusters (MGCs), which have been shown to be major phenotype drivers. BiG-MAP therefore focuses on finding the representation of MGCs and their homologs in metagenomic and metatranscriptomic samples. These pathways are readily obtained from (draft) bacterial genomes using antiSMASH or gutSMASH. To process the outputs of these tools into abundance and expression values, BiG-MAP consists of the following four scripts:
- BiG-MAP.download.py
- BiG-MAP.family.py
- BiG-MAP.map.py
- BiG-MAP.analyse.py
For information on how to use this program, scroll down to Overview and example run.
Install the BiG-MAP dependencies using conda. Conda can be installed from miniconda_link. First clone the BiG-MAP repository from GitHub:
~$ git clone https://github.com/KoenvdBerg/BiG-MAP.git
Then install the dependencies from the BiG-MAP_process.yml and BiG-MAP_analyse.yml files with:
# For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py
~$ conda env create -f BiG-MAP_process.yml BiG-MAP_process
~$ conda activate BiG-MAP_process
# For BiG-MAP.analyse.py
~$ conda env create -f BiG-MAP_analyse.yml BiG-MAP_analyse
~$ conda activate BiG-MAP_analyse
All dependencies are now installed and BiG-MAP is ready to use.
A typical workflow for BiG-MAP consists of the following 4 consecutive steps:
- Downloading WGS data using BiG-MAP.download.py
- Generating gene cluster families (GCFs) and housekeeping gene families (HGFs) using BiG-MAP.family.py
- Computing abundance and expression profiles of selected representatives from each GCF and HGF using BiG-MAP.map.py
- Analysing the resulting BIOM file for profiles using BiG-MAP.analyse.py
The four steps are described below, and for each an example is provided.
BiG-MAP.download.py downloads metagenomic and/or metatranscriptomic samples from the online NCBI repository. The samples are first downloaded in .SRA format and then converted into .fastq pairs using fastq-dump.
conda activate BiG-MAP_process
python3 BiG-MAP.download.py -h
python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]
To download the samples, go to the SRA run selector and fill in the study code. For the IBD cohort of Schirmer et al. (2018), that is PRJNA389280. Next, select the accessions and click Accession List to download the accession file. Use this file in the following command:
python3 BiG-MAP.download.py -A Acc_list.txt -O /mnt/scratch/usr001/fastq/schirmer/
Acc_list.txt:
SRR5983273
SRR5983265
SRR5983266
SRR5983268
SRR5983270
SRR5983271
SRR5983275
...
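Before starting a long download it can help to check the accession list. The sketch below is not part of BiG-MAP: `read_accessions` is a hypothetical helper, and it assumes that every non-empty line holds one run accession of the form SRR, ERR or DRR followed by digits.

```python
import re

# Assumption: run accessions are SRR/ERR/DRR followed by digits only.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_accessions(lines):
    """Return unique run accessions in input order; raise on malformed lines."""
    accessions = []
    for number, raw in enumerate(lines, start=1):
        acc = raw.strip()
        if not acc:
            continue  # skip blank lines
        if not RUN_ACCESSION.match(acc):
            raise ValueError(f"line {number}: not a run accession: {acc!r}")
        if acc not in accessions:
            accessions.append(acc)  # drop duplicates, keep first occurrence
    return accessions


if __name__ == "__main__":
    example = ["SRR5983273\n", "SRR5983265\n", "\n", "SRR5983273\n"]
    print(read_accessions(example))
```

With a real file, the same function can be called as `read_accessions(open("Acc_list.txt"))`.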
The main purpose of BiG-MAP.family.py is to compute GCFs and HGFs using sequence similarity as the sole metric. Protein sequences are used for the GCF computation, while DNA sequences are used for the HGF computation. FastANI is used to compute the GCFs and HGFs. The input consists of the output directories of antiSMASH or gutSMASH. All options can be listed with the -h flag. General usage is:
conda activate BiG-MAP_process
python3 BiG-MAP.family.py -h
python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]
For example, for a gutSMASH run on 1520 (draft) reference genomes that are present in the gut, with a fastANI threshold of 0.6 for GCFs and 0.8 for HGFs, no flanking genes around the core, no genome fasta file outputs and 6 cores, run:
python3 BiG-MAP.family.py -tg 0.6 -th 0.8 -f 0 -g False -p 6 -D /mnt/scratch/usr001/gutSMASH-output/ -O /mnt/scratch/usr001/results/
This yields:
BiG-MAP.GCF_HGF.bed = Bedfile to extract core regions in BiG-MAP.map.py
BiG-MAP.GCF_HGF.fna = Reference file to map the WGS reads to
BiG-MAP.GCF_HGF.json = Dictionary that contains the GCFs and HGFs
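To get a feel for the core regions in the bed file, the interval lengths can be summed per reference sequence. This is a minimal sketch, not BiG-MAP code: `core_lengths` is a hypothetical helper, it assumes the standard BED layout (name, 0-based start, exclusive end, tab-separated) and non-overlapping intervals, and the sequence names in the example are made up.

```python
from collections import defaultdict


def core_lengths(bed_lines):
    """Sum the length of all BED intervals per reference sequence."""
    totals = defaultdict(int)
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED uses a 0-based start and an exclusive end, so length = end - start.
        totals[name] += end - start
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sequence names, for illustration only.
    example = ["cluster_A\t100\t400", "cluster_A\t900\t1000", "cluster_B\t0\t250"]
    print(core_lengths(example))
```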
BiG-MAP.map.py aligns the WGS reads (paired or unpaired) to the reference representative of each GCF and HGF using bowtie2. The following metrics are computed: RPKM, coverage and core coverage. Coverage is calculated using Bedtools, and the read counts using Samtools. The general usage is:
conda activate BiG-MAP_process
python3 BiG-MAP.map.py -h
python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} -R [reference] -O [outdir] -F [family] [Options*]
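The RPKM and coverage metrics named above can be illustrated with the textbook definitions. This is a sketch under stated assumptions, not the BiG-MAP implementation: it assumes RPKM is normalised by the total number of mapped reads in the sample, and that coverage means the fraction of positions covered by at least one read. The function names and the numbers are hypothetical.

```python
def rpkm(reads_on_cluster, cluster_length_bp, total_mapped_reads):
    """Reads per kilobase of reference per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return reads_on_cluster * 1e9 / (cluster_length_bp * total_mapped_reads)


def coverage(per_base_depth):
    """Fraction of positions with at least one aligned read."""
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return covered / len(per_base_depth)


if __name__ == "__main__":
    # Illustrative numbers only.
    print(rpkm(500, 10_000, 2_000_000))
    print(coverage([0, 3, 1, 0]))
```

Core coverage would be the same calculation restricted to the positions listed in the bed file.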
To map 10 reads from Schirmer et al. to the reference representatives of the GCFs and HGFs, and also calculate the core metrics, run:
NOTE: It is important for downstream analysis to also use the -b flag.
python3 BiG-MAP.map.py -f False -s fast -th 10 -b /mnt/scratch/usr001/results/schirmer_metadata.txt -cc /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.bed -R /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.fna -I1 /mnt/scratch/usr001/fastq/schirmer/*pass_1* -I2 /mnt/scratch/usr001/fastq/schirmer/*pass_2* -O /mnt/scratch/usr001/results/ -F /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.json
The schirmer_metadata.txt file is set up as follows (tab-delimited):
#run.ID	host.ID	SampleType	DiseaseStatus
SRR5947852	C3001C10_MGX	METAGENOMIC	CD
SRR5947945	C3001C10_MTX	METATRANSCRIPTOMIC	CD
SRR5947826	C3001C5_MGX	METAGENOMIC	CD
SRR5947900	C3001C5_MTX	METATRANSCRIPTOMIC	CD
SRR5947876	C3001C9_MGX	METAGENOMIC	CD
SRR5947934	C3001C9_MTX	METATRANSCRIPTOMIC	CD
Note the '#' that marks the header row.
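A quick way to check that a metadata file follows this layout is to parse it before mapping. The sketch below is not part of BiG-MAP: `read_metadata` is a hypothetical helper that assumes the first row is the header, that it starts with '#', and that the first column holds the run ID.

```python
import csv
import io


def read_metadata(handle):
    """Parse tab-delimited metadata into {run_id: {column: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if not header[0].startswith("#"):
        raise ValueError("the header row must start with '#'")
    header[0] = header[0].lstrip("#")  # "#run.ID" -> "run.ID"
    rows = {}
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        record = dict(zip(header, fields))
        rows[record[header[0]]] = record
    return rows


if __name__ == "__main__":
    text = (
        "#run.ID\thost.ID\tSampleType\tDiseaseStatus\n"
        "SRR5947852\tC3001C10_MGX\tMETAGENOMIC\tCD\n"
        "SRR5947945\tC3001C10_MTX\tMETATRANSCRIPTOMIC\tCD\n"
    )
    metadata = read_metadata(io.StringIO(text))
    print(metadata["SRR5947945"]["SampleType"])
```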
BiG-MAP.analyse.py is a wrapper script for BiG-MAP.norm.R. The R script can also be run locally in RStudio, which is recommended for creating visualizations. The main drawback is that all dependencies must then be installed locally: the BiG-MAP_analyse environment takes care of them for the command line, but not for local RStudio analyses. The comments in the script explain how that works. For example:
- Scroll down to the main in BiG-MAP.norm.R
- Edit and uncomment:
biom_file <- path/to/biom-file
MT <- condition
sampletype <- "METATRANSCRIPTOMIC" | "METAGENOMIC"
group_1 <- condition_1
group_2 <- condition_2
explore <- TRUE/FALSE
- Run all the functions and analyse locally
If you want to run the analysis from the command line (e.g. in an automated analysis), first install all dependencies using the BiG-MAP_analyse.yml file, if not done already. Then, it works as follows:
python3 BiG-MAP.analyse.py inspect -h
python3 BiG-MAP.analyse.py inspect -B [biom_file] [options*]
Example:
python3 BiG-MAP.analyse.py inspect -B /mnt/scratch/usr001/BiG-MAP.map.biom -e /mnt/scratch/usr001/ -s metagenomic -m DiseaseStatus
Output:
- which conditions can be analysed
- heatmap
To perform statistical testing on the biom file, use:
python3 BiG-MAP.analyse.py test -h
python3 BiG-MAP.analyse.py test -B [biom_file] -T [SampleType] -M [meta_group] -G [[groups]] -O [outdir]
Example:
python3 BiG-MAP.analyse.py test -B /mnt/scratch/usr001/BiG-MAP.map.biom -T metagenomic -M DiseaseStatus -G UC non-IBD -O /mnt/scratch/usr001/
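For automated runs, the four steps above can be collected in one place. The sketch below only builds the argument lists and runs nothing. It uses only flags that appear in the examples above. `build_commands` and all paths in the example are hypothetical, and switching between the BiG-MAP_process and BiG-MAP_analyse environments is left to the caller.

```python
import shlex
from pathlib import PurePosixPath


def build_commands(acc_list, fastq_dir, smash_dir, results_dir,
                   mates_1, mates_2, metadata, biom_file,
                   sample_type, meta_group, groups):
    """Return the four BiG-MAP command lines as lists of arguments."""
    results = PurePosixPath(results_dir)
    # Output names of BiG-MAP.family.py, as listed in this README.
    reference = str(results / "BiG-MAP.GCF_HGF.fna")
    families = str(results / "BiG-MAP.GCF_HGF.json")
    return [
        ["python3", "BiG-MAP.download.py", "-A", acc_list, "-O", fastq_dir],
        ["python3", "BiG-MAP.family.py", "-D", smash_dir, "-O", results_dir],
        ["python3", "BiG-MAP.map.py", "-b", metadata,
         "-I1", *mates_1, "-I2", *mates_2,
         "-R", reference, "-F", families, "-O", results_dir],
        ["python3", "BiG-MAP.analyse.py", "test", "-B", biom_file,
         "-T", sample_type, "-M", meta_group, "-G", *groups,
         "-O", results_dir],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    commands = build_commands(
        "Acc_list.txt", "fastq/", "gutSMASH-output/", "results/",
        ["fastq/S1_pass_1.fastq"], ["fastq/S1_pass_2.fastq"],
        "results/metadata.txt", "results/BiG-MAP.map.biom",
        "metagenomic", "DiseaseStatus", ["UC", "non-IBD"],
    )
    for command in commands:
        print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the matching conda environment is active.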
- Python 3+
- R statistics
- fastq-dump
- fastANI
- HMMer
- Bowtie2
- Samtools
- Bedtools
- biom
- BioPython
- pandas
- metagenomeSeq
- biomformat
- ComplexHeatmap=2.0.0
- viridisLite
- RColorBrewer
- tidyverse