Rauf Salamzade edited this page Apr 19, 2023 · 35 revisions

lsaBGC Suite Overview

lsaBGC consists of several individual programs that together provide a broad suite of functions for comparative analysis of biosynthetic gene clusters (BGCs) across a single focal lineage or taxon (recommended and tested at the species or genus level). The suite can be used to understand the allelic variability observed in BGC genes and to mine for novel SNVs within those genes that represent previously unidentified allelic variants.

For a very straightforward introduction we recommend visiting the lsaBGC-Easy.py wiki page first!

Installation

To learn more about the installation of lsaBGC and its dependencies, please take a look at the Installation wiki page.

Background / Introduction

What functionalities does lsaBGC offer to users? Learn more about the suite's intended uses and where it should not be used, along with recommendations for other great software for exploring and wrangling comparative analyses of secondary metabolite genetic architectures, on the Background wiki page!

Detailed Walkthrough and Test Cases

A very quick start to begin using lsaBGC can be achieved via lsaBGC-Easy.py.

For a more detailed tutorial on using the lsaBGC suite with the latest recommended workflow, please see this Wiki page. The older workflow and framework used for our study are detailed on this Wiki page.

We found that Corynebacterium kefirresidentii is a common species complex of the skin microbiome whose members harbor several BGCs across their compact genomes. We use the publicly available genomes from the complex as a small and simple test set to demonstrate the exploratory power of lsaBGC. Please have a look at the lsaBGC_Ckefir_Testing_Cases GitHub repo for further details.

Main Programs

lsaBGC comprises 8 primary programs:

Many of the main programs use an object-oriented infrastructure for processing and analysis. More information on this infrastructure can be found on the OOP Framework wiki page.

lsaBGC-Ready.py
Description: Takes existing antiSMASH results (and optionally BiG-SCAPE results) and creates the inputs necessary to run downstream lsaBGC analyses (reformats BGC GenBank files, groups orthologs, finds genome-wide paralogs, etc.).
Input:
  • antiSMASH Results Directory
  • (Optional) BiG-SCAPE Results Directory
Output:
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Listing of antiSMASH BGCs
  • Listing of Sample Predicted Proteomes/GenBank files
  • (If BiG-SCAPE results provided) GCF Listings Directory

lsaBGC-Cluster.py
Description: Takes the comprehensive list of BGCs and clusters them into GCFs using MCL.
Input:
  • Comprehensive listing of antiSMASH BGC predictions in GenBank format (from completed/high-quality genomes)
  • OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Summary of GCFs
  • Automated report to inform on best clustering parameter choices (if requested)
  • List of BGC members for each GCF

lsaBGC-Refiner.py
Description: Refines the boundaries of BGCs belonging to a single GCF according to user specifications.
Input:
  • BGC instances for focal GCF in GenBank format
  • OrthoFinder Homolog Group vs. Sample Matrix
  • Boundary Homolog Group ID #1
  • Boundary Homolog Group ID #2
Output:
  • BGC instances for focal GCF in GenBank format, edited for the requested refinement

lsaBGC-Expansion.py
Description: Uses an HMM-based approach to quickly find homologous instances of a GCF in draft-quality genomes.
Input:
  • BGC instances for focal GCF in GenBank format
  • Additional genomic assemblies listing (post gene-calling)
Output:
  • Expanded list of BGCs belonging to GCF
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix

lsaBGC-MIBiGMapper.py
Description: Maps MIBiG BGCs to a GCF.
Input:
  • BGC instances for focal GCF in GenBank format
Output:
  • Table listing associations between MIBiG BGCs/proteins and GCF homolog groups

lsaBGC-See.py
Description: Visualizes BGC instances of a GCF across a phylogeny.
Input:
  • BGC instances for focal GCF in GenBank format
  • (Optional) Species phylogeny
Output:
  • Modified species phylogeny, expanded for samples which feature multiple BGCs for the GCF (if a species phylogeny was provided)
  • (Optional) Single-copy-core phylogeny of GCF
  • Automated visualization of BGC gene architectures across the species or BGC phylogeny in PDF format
  • Track file for visualizing the gene architecture of BGCs in the GCF, to be input into iTOL

lsaBGC-Divergence.py
Description: Determines the 𝜷-RT statistic for assessing BGC divergence relative to genome-wide divergence between isolate pairs.
Input:
  • BGC instances for focal GCF in GenBank format
  • Pairwise ANI or AAI estimates between samples/genomes with GCF
Output:
  • Report with the 𝜷-RT statistic, showing the ratio of the genome-wide similarity to the GCF-specific similarity between pairs of isolates with the GCF

lsaBGC-PopGene.py
Description: Looks at sequence conservation and performs population genetic analyses for each homolog group found in the GCF.
Input:
  • BGC instances for focal GCF in GenBank format
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Report with conservation and population-genetics statistics for each homolog group associated with the GCF
  • Automated visualization, in PDF format, of the genetic variability present in the lineage for each homolog group
  • Codon alignment for each homolog group in GCF

lsaBGC-DiscoVary.py
Description: Identifies GCF instances in metagenomes and looks for base-resolution novelty within genes, using raw sequencing data, that is not observed in genomic assemblies for the taxonomy.
Input:
  • BGC instances for focal GCF in GenBank format
  • Metagenomic/sequencing readsets
  • Codon alignments for homolog groups in GCF
Output:
  • Listing of which metagenomic/sequencing readsets are predicted to contain the GCF
  • Table report with novel variants never previously observed in genomic assemblies
  • (Optional) Phased homolog group alleles found in metagenomic/sequencing data (uses DESMAN)
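
To make the lsaBGC-Divergence.py output concrete, here is a minimal Python sketch of a per-pair similarity ratio. It takes the wording above literally (genome-wide similarity divided by GCF-specific similarity); the orientation and exact calculation used by lsaBGC-Divergence.py itself may differ. The function names, isolate names, and similarity values are hypothetical and only serve as an illustration.

```python
def pair_key(sample_a, sample_b):
    """Order-independent key for a pair of isolates (hypothetical helper)."""
    return tuple(sorted((sample_a, sample_b)))


def similarity_ratios(genome_wide, gcf_specific):
    """Divide genome-wide similarity by GCF-specific similarity per isolate pair.

    Both arguments map pair_key(a, b) -> percent similarity (e.g. ANI or AAI).
    Pairs lacking a GCF-specific value, or with a value of zero, are skipped.
    """
    ratios = {}
    for pair, gw_sim in genome_wide.items():
        gcf_sim = gcf_specific.get(pair)
        if not gcf_sim:  # missing or zero similarity
            continue
        ratios[pair] = gw_sim / gcf_sim
    return ratios


# Made-up values for illustration only (assumption, not real data).
genome_wide = {
    pair_key("isolate_A", "isolate_B"): 98.0,
    pair_key("isolate_A", "isolate_C"): 95.0,
}
gcf_specific = {
    pair_key("isolate_B", "isolate_A"): 98.0,
    pair_key("isolate_C", "isolate_A"): 99.5,
}

ratios = similarity_ratios(genome_wide, gcf_specific)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 3))
```

Under this orientation, a ratio below 1 means the GCF is more similar between the two isolates than their genomes are overall, and a ratio above 1 means the GCF has diverged more than the genomic background.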

Also provided are three workflow/pipeline programs, lsaBGC-Easy.py, lsaBGC-AutoExpansion.py, and lsaBGC-AutoAnalyze.py, which simplify the generation of inputs necessary for the lsaBGC framework and allow for the automatic processing of each GCF post-clustering through standard analysis:

lsaBGC-(Euk-)Easy.py
Description: Automatically run an investigation using lsaBGC, made easy!
Input:
  • Species/genus name
  • (Optional) User-provided genomes
Output:
  • Results from lsaBGC-AutoAnalyze.py

lsaBGC-AutoExpansion.py
Description: Automatically runs lsaBGC-Expansion.py for all GCFs and resolves conflicts (e.g. overlapping BGCs for different GCFs).
Input:
  • Directory with BGC listings for each GCF
  • Additional genomic assemblies listing (post gene-calling)
  • OrthoFinder Homolog Group vs. Sample Matrix
Output:
  • Expanded list of BGCs belonging to GCF
  • Expanded OrthoFinder Homolog Group vs. Sample Matrix

lsaBGC-AutoAnalyze.py
Description: Automatically runs lsaBGC-See.py, lsaBGC-PopGene.py, lsaBGC-Divergence.py, and lsaBGC-DiscoVary.py for each GCF.
Input:
  • Genomic listing file
  • Directory with BGC listings for each GCF
  • Additional options
Output:
  • Consolidated reports for lsaBGC-PopGene.py and lsaBGC-Divergence.py results
  • Visualizations providing an overview of lsaBGC analyses

lsaBGC-AutoProcess.py (defunct; use lsaBGC-Ready.py instead)
Description: Automatically runs Prokka, antiSMASH, and OrthoFinder.
Input:
  • Genomic assemblies
Output:
  • antiSMASH BGC predictions in GenBank format
  • OrthoFinder Homolog Group vs. Sample Matrix
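
Many of the programs above pass around an "OrthoFinder Homolog Group vs. Sample Matrix". As a rough illustration of that data structure, the Python sketch below assumes the matrix follows the layout of OrthoFinder's Orthogroups.tsv (tab-separated, one row per homolog group, one column per sample, comma-separated gene IDs in each cell). The function names and the example data are hypothetical, and this is not the parser lsaBGC itself uses.

```python
import csv
import io


def read_homolog_matrix(handle):
    """Parse a tab-separated homolog group vs. sample matrix.

    Returns {homolog_group: {sample: [gene_id, ...]}}.
    """
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # first header cell labels the group column
    matrix = {}
    for row in reader:
        if not row:
            continue
        group, cells = row[0], row[1:]
        cells += [""] * (len(samples) - len(cells))  # pad short rows
        matrix[group] = {
            sample: [gene.strip() for gene in cell.split(",") if gene.strip()]
            for sample, cell in zip(samples, cells)
        }
    return matrix


def conservation(matrix):
    """Fraction of samples carrying at least one gene for each homolog group."""
    return {
        group: sum(1 for genes in per_sample.values() if genes) / len(per_sample)
        for group, per_sample in matrix.items()
        if per_sample
    }


def single_copy_core(matrix):
    """Homolog groups present exactly once in every sample."""
    return sorted(
        group
        for group, per_sample in matrix.items()
        if per_sample and all(len(genes) == 1 for genes in per_sample.values())
    )


# Tiny made-up matrix for illustration (assumption, not real data).
EXAMPLE = (
    "Orthogroup\tS1\tS2\tS3\n"
    "OG1\tg1\tg2\tg3\n"
    "OG2\tg4, g5\t\tg6\n"
    "OG3\tg7\tg8\t\n"
)

matrix = read_homolog_matrix(io.StringIO(EXAMPLE))
print(conservation(matrix))
print(single_copy_core(matrix))
```

In this toy matrix only OG1 is single-copy in every sample, while OG2 has a paralog in S1 and is absent from S2. Keeping gene IDs per sample, rather than just counts, makes it possible to trace a homolog group back to the individual BGC genes it came from.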

Future plans include rewriting these workflows in a workflow DSL framework such as Nextflow.

Additional Programs / Scripts

Several additional programs and scripts are included in the lsaBGC suite. Major scripts of potential interest are described here.
