This tutorial shows how to modify BGCFlow template notebooks and generate Markdown reports from them.

We will go through a simple analysis of filtering results from different genome mining tools and finding our BGC of interest.

### Prerequisites
This tutorial uses the example dataset from the [BGCFlow manuscript](https://github.com/NBChub/saccharopolyspora_manuscript).

To run this tutorial, you will need to finish the `mq_saccharopolyspora` run and build the report.

[MQ_Saccharopolyspora Project Configuration](https://github.com/NBChub/bgcflow/tree/dev-0.8.2-notebooks_update4/.examples/mq_saccharopolyspora){:target="_blank" .md-button}

Create a new project configuration folder in the `config` folder and copy all the necessary project files there. Then run and build the report as follows:

```bash
# Finish running BGCFlow
bgcflow run

# Build the report
bgcflow build report
```

Once the run finishes, check the results in the processed directory (`data/processed/mq_saccharopolyspora`). It should contain a folder structure similar to this:
```
.
├── antismash
├── bigscape
├── bigslice
├── data_warehouse
├── docs
│   ├── antismash.ipynb
│   ├── antismash.md
│   ├── arts.ipynb
│   ├── arts.md
│   ├── assets
│   ├── bigscape.ipynb
│   ├── bigscape.md
│   ├── index.md
│   ├── query-bigslice.ipynb
│   └── query-bigslice.md
├── log_changes
├── main.py
├── metadata
├── mkdocs.yml
├── overrides
├── README.md
└── tables
```
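
Before moving on, it can help to confirm that the report was actually built. The following is a minimal sketch, not part of BGCFlow: the helper name `missing_report_entries` and the list of expected entries are assumptions based on the folder listing above.

```python
from pathlib import Path

# Entries taken from the folder listing above; adjust to match your own run.
EXPECTED_ENTRIES = ["docs", "tables", "metadata", "mkdocs.yml"]


def missing_report_entries(report_dir, expected=EXPECTED_ENTRIES):
    """Return the expected files or folders that are absent from report_dir."""
    report_dir = Path(report_dir)
    return [name for name in expected if not (report_dir / name).exists()]
```

Calling `missing_report_entries("data/processed/mq_saccharopolyspora")` returns an empty list when every expected entry is present.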

BGCFlow generates starter Jupyter notebook templates in the `docs` folder. These notebooks are then converted to Markdown files, which `mkdocs` uses to generate a static HTML site.

For more details about customizing the HTML report, see [https://squidfunk.github.io/mkdocs-material/](https://squidfunk.github.io/mkdocs-material/)

### Default conda environments for the notebooks
When you run `bgcflow build report`, you can see which conda environments `Snakemake` uses to generate the report. You can either reuse these existing environments or set them up yourself from the recipes.

BGCFlow uses two environments to make reports: a `Python` environment and a mixed `R` and `Python` environment:

- **Python** - [bgcflow_notes.yaml](https://github.com/NBChub/bgcflow/blob/main/workflow/envs/bgcflow_notes.yaml)
- **R** - [r_notebook.yaml](https://github.com/NBChub/bgcflow/blob/main/workflow/envs/r_notebook.yaml)

You can install the conda environments from those YAML files with:

```bash
# use conda or mamba
mamba env create -f bgcflow_notes.yaml # or r_notebook.yaml
```

### Adding a new notebook to the analysis

To add your own analysis, start a Jupyter session using one of the environments above and create a new notebook inside the `docs` folder. Let's name our notebook `unique_polyketides.ipynb`.

You can also download the notebook for this tutorial here:

[Download Notebook](https://github.com/NBChub/bgcflow/blob/main/.examples/notebooks/unique_polyketides.ipynb){:target="_blank" .md-button}

## Tutorial: Identifying Unique Polyketide BGCs with Specific Criteria
In this notebook, we aim to identify unique polyketide biosynthetic gene clusters (BGCs) that have no hits in the BiG-FAM or MIBiG databases but do have hits in ARTS. This analysis can serve as a first step in exploring potentially novel polyketide BGCs with unique functions.

### Setting up libraries and environment variables

We will use Python and pandas for data manipulation and filtering.

First, let's import the necessary Python libraries and define the directory containing our report data.

In [None]:
from pathlib import Path
import json
import pandas as pd

from IPython.display import display, Markdown, HTML

from itables import to_html_datatable as DT
import itables.options as opt

opt.css = """
.itables table td { font-style: italic; font-size: .8em;}
.itables table th { font-style: oblique; font-size: .8em; }
"""

opt.classes = ["display", "compact"]
opt.lengthMenu = [5, 10, 20, 50, 100, 200, 500]

# Define the directory containing the report
report_directory = Path("../")

Note that the report directory is one level above this notebook, which is located in the `docs` folder.

The `metadata` folder records important software versions and other information about the BGCFlow run. First, we will fetch the `antiSMASH` version used in the run from the metadata.

In [None]:
# Load the dependency versions
dependency_versions_file = report_directory / "metadata/dependency_versions.json"
with open(dependency_versions_file, "r") as file:
    dependency_versions = json.load(file)

# Extract the version of antiSMASH used
antismash_version = dependency_versions["antismash"]

display(Markdown(f"> antiSMASH version is: `{antismash_version}`"))
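
If a tool is missing from the metadata, the lookup above fails with a bare `KeyError`. Here is a minimal sketch of a more descriptive lookup, assuming (as the cell above suggests) that the JSON file is a flat mapping of tool name to version. The helper name `read_tool_version` is hypothetical.

```python
import json
from pathlib import Path


def read_tool_version(versions_file, tool):
    """Return the recorded version of `tool`, or raise a descriptive error."""
    with open(Path(versions_file), "r") as handle:
        versions = json.load(handle)
    if tool not in versions:
        known = ", ".join(sorted(versions))
        raise KeyError(f"'{tool}' not found in {versions_file}; available keys: {known}")
    return versions[tool]
```

For example, `read_tool_version(report_directory / "metadata/dependency_versions.json", "antismash")` would return the same value as the cell above.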

### Setting up input files
First, we need to find all the necessary tables or files required for our analysis. Here, we will need the results from antiSMASH, BiG-SCAPE, BiG-FAM query, and also ARTS2.

Most of the tables can be found in the `tables` folder, such as the antiSMASH summary region tables:

In [None]:
# Define the paths to the input files
antismash_regions_file = report_directory / "tables/df_regions_antismash_7.0.0.csv"
display(Markdown(f">`{antismash_regions_file}`"))

Some other tables are located in their own directories, such as the BiG-SCAPE results:

In [None]:
# Define the directory containing the BIG-SCAPE data
bigscape_directory = report_directory / f"bigscape/for_cytoscape_antismash_{antismash_version}/"

# Find the cluster table file
cluster_table_file = [i for i in bigscape_directory.glob("*_df_clusters_0.30.csv")][0]
display(Markdown(f">`{cluster_table_file}`"))
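
Indexing the glob result with `[0]` raises an `IndexError` when nothing matches and silently picks one file when several match. As a sketch only, a small helper (the name `find_single_file` is hypothetical) can make both cases explicit:

```python
from pathlib import Path


def find_single_file(directory, pattern):
    """Return the only file in `directory` matching `pattern`, else raise."""
    matches = sorted(Path(directory).glob(pattern))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"Expected exactly one match for '{pattern}' in {directory}, "
            f"found {len(matches)}"
        )
    return matches[0]
```

It could be used as `find_single_file(bigscape_directory, "*_df_clusters_0.30.csv")`.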

In [None]:
# Define the directory containing the query data
bigfam_query = report_directory / f"bigslice/query_as_{antismash_version}/query_network.csv"
display(Markdown(f">`{bigfam_query}`")) 

Other tables are created by the Jupyter notebook templates and can usually be downloaded interactively from the HTML report. By convention, these are stored in the `assets` directory within the `docs` folder.

In [None]:
arts_table_file = Path(f"assets/tables/arts_hits_as{antismash_version}.csv")
display(Markdown(f">`{arts_table_file}`"))

### Using pandas and itables to show tables interactively

While pandas can show a nice summary of a table, the summary is not interactive and does not render well in the HTML report. We can use [itables](https://mwouts.github.io/itables/quick_start.html) to display our tables as interactive datatables that we can sort, paginate, scroll, or filter. To enable this feature in the final report, we convert the tables to HTML datatables and display them with IPython.

In [None]:
# Load the data from the input files
df_antismash_regions = pd.read_csv(antismash_regions_file)
# Cap similarity values at 1 and fill null values with 0
df_antismash_regions["similarity"] = df_antismash_regions["similarity"].clip(upper=1).fillna(0)

df_bigscape = pd.read_csv(cluster_table_file)
df_arts_hits = pd.read_csv(arts_table_file)
df_bigfam_hits = pd.read_csv(bigfam_query)

In [None]:
display(Markdown(f">`{antismash_regions_file}`"))
display(HTML(DT(df_antismash_regions, scrollX=True)))

In [None]:
display(Markdown(f">`{cluster_table_file}`"))
display(HTML(DT(df_bigscape, scrollX=True)))

In [None]:
display(Markdown(f">`{arts_table_file}`"))
display(HTML(DT(df_arts_hits, scrollX=True)))

In [None]:
display(Markdown(f">`{bigfam_query}`"))
display(HTML(DT(df_bigfam_hits, scrollX=True)))

Note that these tables are downsampled by default to keep the Markdown report from becoming too heavy. See [https://mwouts.github.io/itables/downsampling.html](https://mwouts.github.io/itables/downsampling.html) for more details.

We do not recommend using the HTML reports to show big tables, as their purpose is to give a quick summary of the analysis.

### Exploratory Data Analysis

#### BiG-SCAPE filtering
Let's start by looking at the BiG-SCAPE category of unknown PKS. We will build a filter using the following logic:

- **Define Classes to Filter By**: We select all PKS categories in BiG-SCAPE and put them in a list named `bigscape_class`, containing `PKSI`, `PKSother`, and `PKS-NRP_Hybrids`. To see all available classes, run `df_bigscape["bigscape_class"].unique()`.

- **Filter for unknown families**: The column `fam_type_0.30` indicates whether a BGC belongs to a known or unknown GCF at a cutoff value of 0.30. We set the string variable `family_type` to `"unknown_family"`, which is then used to filter rows on this column.

- **Create Masks for Filtering**:

    - **mask1**: A boolean mask created by checking whether the values in the column `fam_type_0.30` of `df_bigscape` are equal to `family_type` (`"unknown_family"`). This mask is true for rows where the family type is unknown.

    - **mask2**: Another boolean mask created by checking whether the values in the column `bigscape_class` are in the list `bigscape_class` defined at the start. This mask is true for rows that match any of the specified classes (`PKSI`, `PKSother`, `PKS-NRP_Hybrids`).

- **Filter DataFrame**: The DataFrame `df_bigscape` is filtered using the logical AND (`&`) of `mask1` and `mask2`. Only rows where both conditions are true (the family type is `"unknown_family"` and the BiG-SCAPE class is one of the specified classes) are selected.

In [None]:
bigscape_class = ['PKSI', 'PKSother', 'PKS-NRP_Hybrids']
family_type = "unknown_family"
mask1 = df_bigscape["fam_type_0.30"] == family_type
mask2 = df_bigscape["bigscape_class"].isin(bigscape_class)
df_bigscape_PKS_unknown = df_bigscape[mask1 & mask2]
display(HTML(DT(df_bigscape_PKS_unknown, scrollX=True)))
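
To see how the two masks combine, here is a small self-contained sketch on toy data. The BGC IDs are invented, and the value `known_family` is an assumption used only for contrast; only the column names mirror the BiG-SCAPE cluster table.

```python
import pandas as pd

# Toy data; IDs and the "known_family" label are invented for illustration
toy = pd.DataFrame(
    {
        "bgc_id": ["bgc_a", "bgc_b", "bgc_c", "bgc_d"],
        "bigscape_class": ["PKSI", "NRPS", "PKSother", "PKSI"],
        "fam_type_0.30": [
            "unknown_family",
            "unknown_family",
            "known_family",
            "known_family",
        ],
    }
)

pks_classes = ["PKSI", "PKSother", "PKS-NRP_Hybrids"]
is_unknown = toy["fam_type_0.30"] == "unknown_family"
is_pks = toy["bigscape_class"].isin(pks_classes)

# Only rows that satisfy both conditions survive
toy_filtered = toy[is_unknown & is_pks]
print(toy_filtered["bgc_id"].to_list())  # ['bgc_a']
```

Here `bgc_b` is dropped because it is not a PKS class, while `bgc_c` and `bgc_d` are dropped because their family is not unknown.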

In [None]:
number_of_gcf = df_bigscape_PKS_unknown["fam_id_0.30"].nunique()

# Report the number of filtered BGCs (not the size of the full table)
text = f"""We found {df_bigscape_PKS_unknown.shape[0]} BGC regions in the categories {', '.join(bigscape_class)} that belong to an {family_type.replace('_', ' ')}.\
 These BGCs can be grouped into {number_of_gcf} GCFs."""
Markdown(text)
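
Beyond the total number of GCFs, it can be useful to see how the BGCs are distributed across them. The following is a sketch on toy data with invented IDs, assuming the `fam_id_0.30` column holds one family identifier per BGC:

```python
import pandas as pd

# Invented IDs; only the column names mirror the BiG-SCAPE cluster table
toy_gcf = pd.DataFrame(
    {
        "bgc_id": ["bgc_a", "bgc_b", "bgc_c", "bgc_d"],
        "fam_id_0.30": [101, 101, 205, 309],
    }
)

# Number of distinct GCFs
n_gcf = toy_gcf["fam_id_0.30"].nunique()

# Number of BGCs in each GCF, largest first
gcf_sizes = toy_gcf["fam_id_0.30"].value_counts()

# GCFs that contain a single BGC
singleton_gcfs = sorted(gcf_sizes[gcf_sizes == 1].index)
```

Singleton GCFs may deserve a closer look, since they have no close relatives within the dataset.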

#### antiSMASH KnownClusterBlast filtering
We will now look at the KnownClusterBlast similarity score and decide whether we need to filter our search further. We will subset the antiSMASH BGC regions table so that it contains only the BGCs identified in the previous step and observe the KnownClusterBlast similarity values. We will narrow down our search by setting a threshold for the similarity score.

In [None]:
# Extract the BGC IDs of the unknown families
unknown_family_bgcs = df_bigscape_PKS_unknown.bgc_id.to_list()

In [None]:
mask_unknown_family = df_antismash_regions.bgc_id.isin(unknown_family_bgcs)
columns_to_show = ["bgc_id", "genome_id", "product", "similarity", 
                   "most_similar_known_cluster_id", "most_similar_known_cluster_description", "region", "contig_edge", "region_length", ]
display(HTML(DT(df_antismash_regions[mask_unknown_family].loc[:, columns_to_show], scrollX=True)))

In [None]:
similarity_cutoff = 0.30
mask_similarity = df_antismash_regions.similarity < similarity_cutoff
text2 = f"From those **{df_antismash_regions[mask_unknown_family].shape[0]} BGCs**, we actually still have a few BGCs with high similarity based on KnownClusterBlast.\
 Let us focus on BGCs with low similarity **(<{similarity_cutoff})**, and now we are left with **{df_antismash_regions[mask_unknown_family & mask_similarity].shape[0]} BGCs**."
Markdown(text2)

In [None]:
display(HTML(DT(df_antismash_regions[mask_unknown_family & mask_similarity].loc[:, columns_to_show], scrollX=True)))

#### BiG-FAM query filtering
We will also remove all BGCs that match other BGCs in the BiG-FAM database.

In [None]:
mask_no_bigfam_hit = ~df_antismash_regions.bgc_id.isin(df_bigfam_hits.bgc_id.unique())
display(HTML(DT(df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit].loc[:, columns_to_show], scrollX=True)))

#### ARTS hits filtering
We will now select only BGCs in close proximity to one of the ARTS2 models. Note that ARTS2 has several criteria for prioritising BGCs with possible antibiotic activity; here we simply search for all BGCs that have hits to an ARTS2 profile.

In [None]:
mask_arts_hit = df_antismash_regions.bgc_id.isin(df_arts_hits.bgc_id.dropna().unique())
df_pks_filtered = df_antismash_regions[mask_unknown_family & mask_similarity & mask_no_bigfam_hit & mask_arts_hit]
display(HTML(DT(df_pks_filtered.loc[:, columns_to_show], scrollX=True)))
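
When several masks are chained, it helps to know how many BGCs each step removes. Below is a minimal sketch of such a "filter funnel"; the helper name `filter_funnel` and the toy data are assumptions, not part of BGCFlow.

```python
import pandas as pd


def filter_funnel(df, masks):
    """Apply boolean masks cumulatively and count surviving rows per step.

    masks: mapping of step name -> boolean Series aligned with df's index.
    Returns (filtered DataFrame, list of (step name, remaining rows)).
    """
    keep = pd.Series(True, index=df.index)
    counts = [("start", len(df))]
    for name, mask in masks.items():
        keep = keep & mask
        counts.append((name, int(keep.sum())))
    return df[keep], counts


# Toy demonstration with invented data
toy_regions = pd.DataFrame({"bgc_id": ["a", "b", "c", "d"]})
toy_masks = {
    "unknown family": pd.Series([True, True, True, False]),
    "low similarity": pd.Series([True, False, True, True]),
}
toy_result, toy_counts = filter_funnel(toy_regions, toy_masks)
print(toy_counts)  # [('start', 4), ('unknown family', 3), ('low similarity', 2)]
```

In this notebook, the same helper could be called with `df_antismash_regions` and a dictionary holding `mask_unknown_family`, `mask_similarity`, `mask_no_bigfam_hit`, and `mask_arts_hit`.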

### Final Remarks

In [None]:
text3 = f"We are now left with **{df_pks_filtered.shape[0]} PKS BGCs** of interest which can be further investigated in-depth."
Markdown(text3)

This tutorial helps us narrow down our search to potentially novel PKS. Nevertheless, an in-depth investigation of each BGC structure is still needed to see whether it has all the genes required for natural product biosynthesis.

We also recommend trying out `Metabase` for a more interactive experience of exploring the datasets. See the how-to guide in our [wiki](https://github.com/NBChub/bgcflow/wiki/04-Building-and-Serving-OLAP-Database).

## Displaying the notebook in the report
To display this notebook in the HTML report, we need to convert this notebook into a Markdown file.

First, we need to serve the BGCFlow report with this command:
```bash
bgcflow serve --project <project name>
```
The report will be served locally at [http://localhost:8001/](http://localhost:8001/).

Then, to generate the Markdown file, run `nbconvert` in the terminal:

```bash
# cd to the docs folder containing this notebook
jupyter nbconvert --to markdown \
    --execute "unique_polyketides.ipynb" \
    --output "unique_polyketides.md" \
    --template "admonition" \
    --TemplateExporter.extra_template_basedirs="../../../../workflow/notebook/nb_convert"
```

You can also use the `--no-input` flag if you don't want to show the code cells.
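
If you convert notebooks regularly, the command can be assembled from Python. This is only a sketch: the function names are hypothetical, and every flag is taken from the command shown in this section rather than from any BGCFlow API.

```python
import subprocess
from pathlib import Path


def build_nbconvert_command(notebook, template_basedir, hide_code=False):
    """Assemble the nbconvert call from this section as an argument list."""
    output_name = Path(notebook).with_suffix(".md").name
    command = [
        "jupyter", "nbconvert",
        "--to", "markdown",
        "--execute", notebook,
        "--output", output_name,
        "--template", "admonition",
        f"--TemplateExporter.extra_template_basedirs={template_basedir}",
    ]
    if hide_code:
        # Hide the code cells in the generated Markdown
        command.append("--no-input")
    return command


def convert_notebook(notebook, template_basedir, docs_dir=".", hide_code=False):
    """Run the conversion from inside the docs folder."""
    command = build_nbconvert_command(notebook, template_basedir, hide_code)
    subprocess.run(command, cwd=docs_dir, check=True)
```

For this tutorial, the call would be `convert_notebook("unique_polyketides.ipynb", "../../../../workflow/notebook/nb_convert")`, run from the `docs` folder.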

Then, add the Markdown file to the navigation in `mkdocs.yml`:

```yaml
extra:
  social:
  - icon: fontawesome/brands/twitter
    link: https://twitter.com/NPGMgroup
  - icon: fontawesome/brands/github
    link: https://github.com/NBChub/bgcflow
markdown_extensions:
- attr_list
nav:
- Home: index.md
- QC and Data Selection:
  - seqfu: seqfu.md
  - mash: mash.md
  - fastani: fastani.md
  - checkm: checkm.md
- Functional Annotation:
  - prokka-gbk: prokka-gbk.md
  - deeptfactor: deeptfactor.md
- Genome Mining:
  - antismash: antismash.md
  - query-bigslice: query-bigslice.md
  - bigscape: bigscape.md
  - bigslice: bigslice.md
  - arts: arts.md
  - cblaster-genome: cblaster-genome.md
  - cblaster-bgc: cblaster-bgc.md
- Phylogenomic Placement:
  - automlst-wrapper: automlst-wrapper.md
- Comparative Genomics:
  - roary: roary.md
  - eggnog-roary: eggnog-roary.md
- Custom Reports:
  - How to Add Custom Analysis to the BGCFlow Report: unique_polyketides.md
...
```
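
The navigation can also be updated programmatically. The sketch below works on the configuration after it has been parsed into a Python dictionary (for example with a YAML library, which is left out here); the helper name `add_nav_entry` is hypothetical.

```python
def add_nav_entry(config, section, title, markdown_file):
    """Add {title: markdown_file} under `section` in a parsed mkdocs config.

    The section is created at the end of the navigation if it does not
    exist yet, and an identical entry is never added twice.
    """
    nav = config.setdefault("nav", [])
    entry = {title: markdown_file}
    for item in nav:
        # Sections are single-key dictionaries whose value is a list of pages
        if isinstance(item, dict) and isinstance(item.get(section), list):
            if entry not in item[section]:
                item[section].append(entry)
            return config
    nav.append({section: [entry]})
    return config


# Toy configuration mirroring part of the navigation above
example_config = {
    "nav": [
        {"Home": "index.md"},
        {"Genome Mining": [{"antismash": "antismash.md"}]},
    ]
}
add_nav_entry(
    example_config,
    "Custom Reports",
    "How to Add Custom Analysis to the BGCFlow Report",
    "unique_polyketides.md",
)
```

After the call, `example_config["nav"]` ends with a `Custom Reports` section containing the new page.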

Once you have updated the `mkdocs.yml` file, the site will be regenerated to show your newly added report.
The report can be opened at [http://localhost:8001/unique_polyketides/](http://localhost:8001/unique_polyketides/).