The Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP)

https://github.com/KoenvdBerg/BiG-MAP/blob/master/BiG-MAP.pipeline.png

This is the GitHub repository for the Biosynthetic Gene cluster Meta’omics abundance Profiler (BiG-MAP). A growing number of tools is available for analysing bacterial metagenomic and metatranscriptomic samples, but these tools cannot profile specific metabolic gene clusters (MGCs), which have been shown to be major phenotype drivers. BiG-MAP therefore focuses on measuring the representation of MGCs and their homologs in metagenomic and metatranscriptomic samples. These pathways are readily obtained from (draft) bacterial genomes using antiSMASH or gutSMASH. To process the output of these tools into abundance and expression values, BiG-MAP consists of the following four scripts:

  • BiG-MAP.download.py
  • BiG-MAP.family.py
  • BiG-MAP.map.py
  • BiG-MAP.analyse.py

For information on how to use these scripts, see Overview and example run below.

Installation

Install the BiG-MAP dependencies using conda, which can be installed via Miniconda. First clone the BiG-MAP repository from GitHub:

~$ git clone https://github.com/KoenvdBerg/BiG-MAP.git

Then create the two conda environments from the BiG-MAP_process.yml and BiG-MAP_analyse.yml files in the repository:

# For BiG-MAP.download.py, BiG-MAP.family.py and BiG-MAP.map.py
~$ conda env create -f BiG-MAP_process.yml -n BiG-MAP_process
~$ conda activate BiG-MAP_process

# For BiG-MAP.analyse.py
~$ conda env create -f BiG-MAP_analyse.yml -n BiG-MAP_analyse
~$ conda activate BiG-MAP_analyse

After this all the dependencies are installed. BiG-MAP can now be used.

Overview and example run

A typical workflow for BiG-MAP consists of the following 4 consecutive steps:

  1. Downloading WGS data using BiG-MAP.download.py
  2. Generating gene cluster families (GCFs) and housekeeping gene families (HGFs) using BiG-MAP.family.py
  3. Computing abundance and expression profiles of selected representatives from each GCF and HGF using BiG-MAP.map.py
  4. Analysing the resulting BIOM file for profiles using BiG-MAP.analyse.py

The four steps are described below, and for each an example is provided.

1) BiG-MAP.download.py

This script downloads metagenomic and/or metatranscriptomic samples from the NCBI Sequence Read Archive (SRA). The samples are first downloaded in .SRA format and then converted into .fastq pairs using fastq-dump.

conda activate BiG-MAP_process
python3 BiG-MAP.download.py -h
python3 BiG-MAP.download.py [Options]* -A [accession_list_file] -O [path_to_outdir]

To download the samples, go to the SRA Run Selector and enter the study accession. For the IBD cohort of Schirmer et al. (2018) this is PRJNA389280. Next, select the runs of interest and click Accession List to download their accessions. Pass this accession file to the following command:

python3 BiG-MAP.download.py -A Acc_list.txt -O /mnt/scratch/usr001/fastq/schirmer/

Acc_list.txt:
SRR5983273
SRR5983265
SRR5983266
SRR5983268
SRR5983270
SRR5983271
SRR5983275
...
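
Before starting a long download it can help to sanity-check the accession list. The sketch below is not part of BiG-MAP: `parse_accessions` and `fastq_dump_command` are hypothetical helpers, and the fastq-dump options shown (`--split-3`, `--gzip`, `--outdir`) are only one reasonable way to get paired .fastq files, not necessarily the options that BiG-MAP.download.py passes.

```python
import re

# SRA run accessions start with SRR, ERR or DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def parse_accessions(text):
    """Return unique run accessions in file order; raise on malformed lines."""
    seen, accessions = set(), []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        acc = raw.strip()
        if not acc:
            continue  # tolerate blank lines
        if not RUN_ACCESSION.match(acc):
            raise ValueError(f"line {lineno}: not a run accession: {acc!r}")
        if acc not in seen:
            seen.add(acc)
            accessions.append(acc)
    return accessions


def fastq_dump_command(accession, outdir):
    """Build (but do not run) a fastq-dump call that splits paired reads."""
    return ["fastq-dump", "--split-3", "--gzip", "--outdir", str(outdir), accession]


# Inline stand-in for the contents of Acc_list.txt.
example = "SRR5983273\nSRR5983265\n\nSRR5983265\n"
for acc in parse_accessions(example):
    print(" ".join(fastq_dump_command(acc, "fastq/schirmer")))
```

Duplicated accessions are dropped so that no run is downloaded twice, and a malformed line is reported with its line number instead of failing halfway through the download.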

2) BiG-MAP.family.py

The main purpose of this script is to compute GCFs and HGFs using sequence similarity as the sole metric. Protein sequences are used for the GCF computation and DNA sequences for the HGF computation. fastANI is used to compute both the GCFs and the HGFs. The input consists of the output directories of antiSMASH or gutSMASH. Run the script with the -h flag to list all options. General usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.family.py -h
python3 BiG-MAP.family.py [Options]* -D [input dir(s)] -O [output dir]
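
To illustrate what grouping sequences into families by a similarity threshold means, here is a minimal sketch of greedy, representative-based clustering. It is not BiG-MAP's implementation: the function name, the toy identifiers and the hand-written `scores` table are all hypothetical, and in the real pipeline the similarity values come from fastANI.

```python
def greedy_families(names, similarity, threshold):
    """Assign each sequence to the first representative it matches at >= threshold.

    names      -- sequence identifiers, in the order they are considered
    similarity -- dict mapping frozenset({a, b}) to a score between 0 and 1
    threshold  -- minimum score for joining an existing family
    """
    families = {}  # representative -> members (representative included)
    for name in names:
        for rep in families:
            if similarity.get(frozenset((name, rep)), 0.0) >= threshold:
                families[rep].append(name)
                break
        else:
            # No representative is similar enough: start a new family.
            families[name] = [name]
    return families


# Toy pairwise scores for three hypothetical gene clusters.
scores = {
    frozenset(("bgc_A", "bgc_B")): 0.75,
    frozenset(("bgc_A", "bgc_C")): 0.20,
    frozenset(("bgc_B", "bgc_C")): 0.30,
}
print(greedy_families(["bgc_A", "bgc_B", "bgc_C"], scores, threshold=0.6))
```

With a threshold of 0.6, `bgc_B` joins the family represented by `bgc_A`, while `bgc_C` starts its own family. Raising the threshold produces more, smaller families.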

The following example processes a gutSMASH run on 1520 (draft) reference genomes that are present in the gut, using a fastANI threshold of 0.6 for GCFs and 0.8 for HGFs, no genes flanking the core, no genome fasta file output and 6 cores:

python3 BiG-MAP.family.py -tg 0.6 -th 0.8 -f 0 -g False -p 6 -D /mnt/scratch/usr001/gutSMASH-output/ -O /mnt/scratch/usr001/results/

This yields:
BiG-MAP.GCF_HGF.bed = Bedfile to extract core regions in BiG-MAP.map.py
BiG-MAP.GCF_HGF.fna = Reference file to map the WGS reads to
BiG-MAP.GCF_HGF.json = Dictionary that contains the GCFs and HGFs
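
A quick way to inspect the family dictionary is to count the members per family. The sketch below assumes that the JSON maps each representative to a list of its members. That layout is an assumption, not something stated in this README, so check the actual file before relying on it. `family_sizes` is a hypothetical helper.

```python
import json


def family_sizes(families):
    """Return (representative, member_count) pairs, largest family first."""
    sizes = [(rep, len(members)) for rep, members in families.items()]
    return sorted(sizes, key=lambda pair: (-pair[1], pair[0]))


# Inline stand-in for the contents of BiG-MAP.GCF_HGF.json (layout assumed).
example_json = '{"GCF_rep_1": ["GCF_rep_1", "member_a"], "HGF_rep_1": ["HGF_rep_1"]}'
for rep, size in family_sizes(json.loads(example_json)):
    print(f"{rep}\t{size}")
```

For a real run, replace the inline string with `json.load(open(path))` on the file written by BiG-MAP.family.py.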

3) BiG-MAP.map.py

This module aligns the WGS reads (paired or unpaired) to the reference representative of each GCF and HGF using bowtie2. It computes RPKM, coverage and core coverage. The coverage is calculated using Bedtools and the read counts using Samtools. The general usage is:

conda activate BiG-MAP_process
python3 BiG-MAP.map.py -h
python3 BiG-MAP.map.py {-I1 [mate-1s] -I2 [mate-2s] | -U [samples]} -R [reference] -O [outdir] -F [family] [Options*]
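
For reference, the two kinds of metric mentioned above can be written down in a few lines. This is a sketch of the textbook definitions (RPKM as reads per kilobase of reference per million mapped reads, and coverage as the fraction of positions hit by at least one read). The function names are hypothetical and the exact calculation inside BiG-MAP.map.py may differ.

```python
def rpkm(mapped_reads, length_bp, total_mapped_reads):
    """Reads Per Kilobase of reference per Million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return mapped_reads * 1e9 / (length_bp * total_mapped_reads)


def breadth_of_coverage(per_base_depth):
    """Fraction of positions covered by at least one read."""
    if not per_base_depth:
        return 0.0
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return covered / len(per_base_depth)


# 100 reads on a 1 kb reference in a sample with one million mapped reads.
print(rpkm(mapped_reads=100, length_bp=1000, total_mapped_reads=1_000_000))
# Two of four positions have non-zero depth.
print(breadth_of_coverage([0, 3, 5, 0]))
```

Core coverage would apply the same breadth calculation to only the positions inside the core region given in the bed file.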

To map 10 reads from Schirmer et al. to the reference representatives of the GCFs and HGFs, and also calculate the core metrics, run:

NOTE: For downstream analysis it is important to also pass the metadata file with the -b flag.

python3 BiG-MAP.map.py -f False -s fast -th 10 -b /mnt/scratch/usr001/results/schirmer_metadata.txt -cc /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.bed -R /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.fna -I1 /mnt/scratch/usr001/fastq/schirmer/*pass_1* -I2 /mnt/scratch/usr001/fastq/schirmer/*pass_2* -O /mnt/scratch/usr001/results/ -F /mnt/scratch/usr001/results/BiG-MAP.GCF_HGF.json
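
The -I1 and -I2 patterns in the command above only work if every mate-1 file has a matching mate-2 file. The following sketch checks that before mapping. `pair_mates` is a hypothetical helper, and the file names in the example are assumed from the `*pass_1*` and `*pass_2*` patterns.

```python
from pathlib import Path


def pair_mates(filenames, tag1="pass_1", tag2="pass_2"):
    """Match mate-1 and mate-2 files by the name left after removing the tag."""
    mates1 = {Path(f).name.replace(tag1, "", 1): f
              for f in filenames if tag1 in Path(f).name}
    mates2 = {Path(f).name.replace(tag2, "", 1): f
              for f in filenames if tag2 in Path(f).name}
    unpaired = sorted(set(mates1) ^ set(mates2))
    if unpaired:
        raise ValueError(f"files without a mate: {unpaired}")
    return [(mates1[key], mates2[key]) for key in sorted(mates1)]


# Assumed file names; in practice collect them with Path(fastq_dir).glob("*").
files = [
    "fastq/SRR5983265_pass_2.fastq.gz",
    "fastq/SRR5983273_pass_1.fastq.gz",
    "fastq/SRR5983265_pass_1.fastq.gz",
    "fastq/SRR5983273_pass_2.fastq.gz",
]
for mate1, mate2 in pair_mates(files):
    print(mate1, mate2)
```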

The schirmer_metadata.txt file is set up as follows (tab-delimited):
#run.ID	host.ID	SampleType	DiseaseStatus
SRR5947852	C3001C10_MGX	METAGENOMIC	CD
SRR5947945	C3001C10_MTX	METATRANSCRIPTOMIC	CD
SRR5947826	C3001C5_MGX	METAGENOMIC	CD
SRR5947900	C3001C5_MTX	METATRANSCRIPTOMIC	CD
SRR5947876	C3001C9_MGX	METAGENOMIC	CD
SRR5947934	C3001C9_MTX	METATRANSCRIPTOMIC	CD

Note the leading '#', which marks the header row.
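
Because a stray space or a missing '#' in the metadata file is easy to overlook, a small check can save a failed run. The sketch below parses the layout shown above using only the Python standard library. `read_metadata` is a hypothetical helper and not part of BiG-MAP.

```python
import csv
import io


def read_metadata(handle):
    """Parse a tab-delimited metadata table whose header row starts with '#'."""
    header = None
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        fields = [field.strip() for field in fields]
        if not any(fields):
            continue  # skip blank lines
        if header is None:
            if not fields[0].startswith("#"):
                raise ValueError("header row must start with '#'")
            header = [fields[0].lstrip("#")] + fields[1:]
            continue
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(fields)}: {fields}")
        rows.append(dict(zip(header, fields)))
    return rows


example = (
    "#run.ID\thost.ID\tSampleType\tDiseaseStatus\n"
    "SRR5947852\tC3001C10_MGX\tMETAGENOMIC\tCD\n"
    "SRR5947945\tC3001C10_MTX\tMETATRANSCRIPTOMIC\tCD\n"
)
for row in read_metadata(io.StringIO(example)):
    print(row["run.ID"], row["SampleType"], row["DiseaseStatus"])
```

A row that was separated with spaces instead of tabs ends up with too few columns and is reported, rather than silently producing a sample without metadata.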

4) BiG-MAP.analyse.py

This module is a wrapper script for BiG-MAP.norm.R. The R script can also be run locally in RStudio, which is recommended for creating nice visualizations. The main drawback is that all dependencies must then be installed locally: the BiG-MAP_analyse environment takes care of this for the command line, but not for local RStudio analyses. The comments in the script explain how to set this up. For example:

Scroll down to the main in BiG-MAP.norm.R
Edit and uncomment:
biom_file <- path/to/biom-file
MT <- condition
sampletype <- "METATRANSCRIPTOMIC" | "METAGENOMIC"
group_1 <- condition_1
group_2 <- condition_2
explore <- TRUE/FALSE

Run all the functions and analyse locally

If you want to run the analysis from the command line (e.g. in an automated analysis), first install all dependencies using the BiG-MAP_analyse.yml file, if not done already. Then use it as follows:

python3 BiG-MAP.analyse.py inspect -h
python3 BiG-MAP.analyse.py inspect -B [biom_file] [options*]

Example:
python3 BiG-MAP.analyse.py inspect -B /mnt/scratch/usr001/BiG-MAP.map.biom -e /mnt/scratch/usr001/ -s metagenomic -m DiseaseStatus

Output:
  • an overview of which conditions can be analysed
  • a heatmap

To perform statistical testing on the biom file, use:

python3 BiG-MAP.analyse.py test -h
python3 BiG-MAP.analyse.py test -B [biom_file] -T [SampleType] -M [meta_group] -G [[groups]] -O [outdir]

Example:
python3 BiG-MAP.analyse.py test -B /mnt/scratch/usr001/BiG-MAP.map.biom -T metagenomic -M DiseaseStatus -G UC non-IBD -O /mnt/scratch/usr001/
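
For automated analyses, the test command can be assembled in a script. The sketch below only uses the flags from the usage line above (-B, -T, -M, -G, -O). `analyse_test_command` is a hypothetical helper, and two details are assumptions: that exactly two groups are compared, and that `metatranscriptomic` is accepted as a sample type in the same way as `metagenomic`.

```python
import shlex


def analyse_test_command(biom_file, sample_type, meta_group, groups, outdir):
    """Build the argument list for `BiG-MAP.analyse.py test`."""
    if len(groups) != 2:
        # Assumption: the test compares exactly two groups.
        raise ValueError("exactly two groups are expected")
    return [
        "python3", "BiG-MAP.analyse.py", "test",
        "-B", biom_file,
        "-T", sample_type,
        "-M", meta_group,
        "-G", *groups,
        "-O", outdir,
    ]


for sample_type in ("metagenomic", "metatranscriptomic"):
    cmd = analyse_test_command(
        "/mnt/scratch/usr001/BiG-MAP.map.biom", sample_type,
        "DiseaseStatus", ["UC", "non-IBD"], "/mnt/scratch/usr001/",
    )
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with group names that contain spaces or special characters.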

Requirements

Software:

  • Python 3+
  • R statistics
  • fastq-dump
  • fastANI
  • HMMER
  • Bowtie2
  • Samtools
  • Bedtools
  • biom
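
To check whether the command-line tools listed above are available in the active environment, something like the following sketch can be used. The executable names are assumptions about how each tool is usually installed (for example `hmmsearch` for HMMER and `Rscript` for R), and `find_tools` is a hypothetical helper.

```python
import shutil

# Executable names are assumptions about how each tool is usually installed.
REQUIRED_TOOLS = [
    "python3", "Rscript", "fastq-dump", "fastANI", "hmmsearch",
    "bowtie2", "samtools", "bedtools", "biom",
]


def find_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to its location on PATH, or None if it is not found."""
    return {tool: shutil.which(tool) for tool in tools}


missing = [tool for tool, path in find_tools().items() if path is None]
print("missing tools:", ", ".join(missing) if missing else "none")
```

Since the dependencies are split over the BiG-MAP_process and BiG-MAP_analyse environments, some tools are expected to be reported as missing in whichever environment is currently active.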

Packages:

Python

  • BioPython
  • pandas

R

  • metagenomeSeq
  • biomformat
  • ComplexHeatmap=2.0.0
  • viridisLite
  • RColorBrewer
  • tidyverse
