05 Output Files

Matin Nuhamunada edited this page Dec 14, 2023 · 5 revisions

Pipeline Output

Data Structure

The output of BGCFlow is a processed folder that contains the following subdirectories and files:

.
├── antismash
├── automlst_wrapper
├── bgcflow_wrapper.log
├── bigscape
│   ├── for_cytoscape_antismash_7.0.0
│   ├── Lactobacillus_delbrueckii_bigscape_as_7.0.0_mapping.csv
│   └── result_as7.0.0
├── bigslice
│   ├── cluster_as_7.0.0
│   └── query_as_7.0.0
├── cblaster
├── data_warehouse
├── dbt
│   └── antiSMASH_7.0.0
│       └── dbt_bgcflow.duckdb
├── docs
├── fastani
├── genbank
├── log_changes
├── main.py
├── mash
├── metadata
├── mkdocs.yml
├── README.md
├── roary
└── tables
    ├── df_antismash_7.0.0_summary.csv
    ├── df_arts_as-7.0.0.csv
    ├── df_deeptfactor.csv
    ├── df_gtdb_meta.csv
    ├── df_ncbi_meta.csv
    ├── df_regions_antismash_7.0.0.csv
    └── df_seqfu_stats.csv

This processed folder combines the MkDocs report, the data build tool (dbt) project, and the results of the bioinformatic pipelines in the BGCFlow workflow. Each entry is described below:

| File / Directory | Description |
| --- | --- |
| `antismash` | A directory containing the antiSMASH results. antiSMASH predicts and annotates secondary metabolite biosynthetic gene clusters (BGCs) in bacterial and fungal genomes. |
| `automlst_wrapper` | A directory containing the genome tree built using the simplified autoMLST wrapper. The `*.newick` file can be used for further tree visualization. |
| `bgcflow_wrapper.log` | A log file generated upon serving the MkDocs report. |
| `bigscape` | A directory containing the results of the BiG-SCAPE tool, which clusters BGCs into families based on their biosynthetic gene content. |
| `bigslice` | A directory containing the results of the BiG-SLiCE tool, which clusters BGCs using the BIRCH algorithm. |
| `cblaster` | A directory containing the DIAMOND database of the dataset generated by cblaster, which can be used for BLAST searches. |
| `data_warehouse` | A directory containing Parquet tables generated by the different tools in the workflow. |
| `dbt` | A directory containing the dbt SQL schema for transforming BGCFlow results into a DuckDB database. Inspired by: https://github.com/dbt-labs/jaffle_shop_duckdb |
| `docs` | A directory containing the Jupyter notebooks and Markdown reports that are served in the report. |
| `fastani` | A directory containing the results of the FastANI tool, which performs pairwise genome comparisons. |
| `genbank` | A directory containing the GenBank files for the bacterial genomes used in the BGCFlow workflow. |
| `log_changes` | A directory recording the BGC ID changes made by antiSMASH and BGCFlow. |
| `main.py` | A Python script generated by the BGCFlow wrapper to serve the Markdown report. |
| `mash` | A directory containing the results of the Mash tool, which performs pairwise genome comparisons. |
| `metadata` | A directory containing the metadata and dependency versions used in the project. |
| `mkdocs.yml` | The configuration file for the MkDocs tool, which generates the documentation for the BGCFlow report. |
| `overrides` | A directory containing the overrides for the MkDocs tool. |
| `__pycache__` | A directory containing the compiled Python bytecode files. |
| `README.md` | The README file for the BGCFlow project result. |
| `roary` | A directory containing the results of the Roary tool, which performs pan-genome analysis on bacterial genomes. |
| `tables` | A directory containing the tables generated by the BGCFlow workflow. |
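
As a quick orientation, the CSV summaries under `tables` can be inspected with Python's standard library alone. The sketch below is not part of BGCFlow: the helper names `list_tables` and `read_table` and the example folder path are hypothetical, and the only assumption about the files is that they are comma-separated with a header row.

```python
import csv
from pathlib import Path


def list_tables(processed_dir):
    """Return the sorted CSV file names found in <processed_dir>/tables."""
    tables_dir = Path(processed_dir) / "tables"
    return sorted(p.name for p in tables_dir.glob("*.csv"))


def read_table(processed_dir, name):
    """Read one CSV table into a list of dicts keyed by the header row."""
    path = Path(processed_dir) / "tables" / name
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical location of a processed project folder; adjust as needed.
    processed = Path("data/processed/my_project")
    if (processed / "tables").is_dir():
        for table_name in list_tables(processed):
            rows = read_table(processed, table_name)
            print(f"{table_name}: {len(rows)} rows")
```

All values are returned as strings, so numeric columns need explicit conversion before any calculation.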

Summary of Available Pipeline (Main Workflow)

The table below lists the pipeline keywords that can be run with the main Snakefile of BGCFlow.

|  | Keyword | Description | Links |
| --- | --- | --- | --- |
| 0 | eggnog | Annotate samples with the eggNOG database (http://eggnog5.embl.de). | eggnog-mapper |
| 1 | mash | Calculate distance estimation for all samples using MinHash. | Mash |
| 2 | fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI |
| 3 | automlst-wrapper | Simplified tree building using autoMLST. | automlst-simplified-wrapper |
| 4 | roary | Build pangenome using Roary. | Roary |
| 5 | eggnog-roary | Annotate Roary output using eggNOG mapper. | eggnog-mapper |
| 6 | seqfu | Calculate sequence statistics using SeqFu. | seqfu2 |
| 7 | bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice). | bigslice |
| 8 | query-bigslice | Map BGCs to the BiG-FAM database (https://bigfam.bioinformatics.nl/). | bigfam.bioinformatics.nl |
| 9 | checkm | Assess genome quality with CheckM. | CheckM |
| 10 | gtdbtk | Taxonomic placement with GTDB-Tk. | GTDBTk |
| 11 | prokka-gbk | Copy annotated GenBank results. | prokka |
| 12 | antismash | Summarize antiSMASH results. | antismash |
| 13 | arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts |
| 14 | deeptfactor | Use deep learning to find transcription factors. | deeptfactor |
| 15 | deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary |
| 16 | cblaster-genome | Build DIAMOND database of genomes for cblaster search. | cblaster |
| 17 | cblaster-bgc | Build DIAMOND database of BGCs for cblaster search. | cblaster |
| 18 | bigscape | Cluster BGCs using BiG-SCAPE. | BiG-SCAPE |
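
Because the keywords must be spelled exactly as listed, a small check can catch typos before a run is started. The following is only an illustrative sketch: `PIPELINE_KEYWORDS` is copied from the table above, while the helper `select_pipelines` and the keyword-to-boolean mapping it returns are hypothetical conveniences, not BGCFlow's own API or configuration format.

```python
# Keywords copied from the table above, in the same order.
PIPELINE_KEYWORDS = (
    "eggnog", "mash", "fastani", "automlst-wrapper", "roary", "eggnog-roary",
    "seqfu", "bigslice", "query-bigslice", "checkm", "gtdbtk", "prokka-gbk",
    "antismash", "arts", "deeptfactor", "deeptfactor-roary",
    "cblaster-genome", "cblaster-bgc", "bigscape",
)


def select_pipelines(requested):
    """Return {keyword: bool} for every known keyword; raise on unknown names."""
    requested = set(requested)
    unknown = sorted(requested - set(PIPELINE_KEYWORDS))
    if unknown:
        raise ValueError(f"Unknown pipeline keyword(s): {', '.join(unknown)}")
    return {keyword: keyword in requested for keyword in PIPELINE_KEYWORDS}


if __name__ == "__main__":
    selection = select_pipelines(["seqfu", "mash", "antismash"])
    enabled = [name for name, on in selection.items() if on]
    print("enabled:", ", ".join(enabled))
```

A misspelled name such as `big-scape` raises a `ValueError` that names the offending keyword, rather than being silently ignored.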

Network Analysis with Cytoscape

A GraphML file containing the annotated BiG-SCAPE network is generated by the automated report and can be explored with Cytoscape. A guide to network analysis with Cytoscape can be found in the Cytoscape documentation: https://manual.cytoscape.org/en/stable/
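
Before opening the network in Cytoscape, it can help to check its size and which annotations it carries. The sketch below uses only Python's standard library; the function name `summarize_graphml` is hypothetical, and it assumes the file declares the standard GraphML namespace, which may need adjusting for a given export.

```python
import xml.etree.ElementTree as ET

# Standard GraphML XML namespace (assumed to be declared by the file).
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def summarize_graphml(path):
    """Count nodes and edges and list declared attribute names in a GraphML file."""
    root = ET.parse(path).getroot()
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    # <key> elements declare the node/edge attributes carried by the graph.
    attributes = sorted(
        key.get("attr.name")
        for key in root.findall("g:key", GRAPHML_NS)
        if key.get("attr.name")
    )
    return {"nodes": len(nodes), "edges": len(edges), "attributes": attributes}


if __name__ == "__main__":
    import sys

    # Usage: python summarize_graphml.py <path-to-network.graphml>
    if len(sys.argv) == 2:
        print(summarize_graphml(sys.argv[1]))
```

The attribute names reported here correspond to the columns that appear in Cytoscape's node and edge tables after import.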