# Data collection and project setup

Summary of data collection steps followed in the Streptomyces pangenome project. This notebook describes all the projects and BGCFlow runs carried out in the study.  

In [None]:
import pandas as pd
from pathlib import Path
import altair as alt
import yaml
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

### File Configuration

This includes the path to the directory of BGCFlow runs on the dataset.

In [None]:
with open("config.yaml", "r") as f:
    notebook_configuration = yaml.safe_load(f)
notebook_configuration
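
The cells below only read the `bgcflow_dir` key from this configuration. A minimal sketch of a check that fails early when a required key is missing; the helper name `require_keys` and the example path are illustrative assumptions, not part of the notebook or of BGCFlow.

```python
from pathlib import Path


def require_keys(config: dict, keys: list) -> None:
    """Raise a KeyError listing every required key that is absent."""
    missing = [key for key in keys if key not in config]
    if missing:
        raise KeyError(f"Missing keys in config.yaml: {missing}")


# Hypothetical configuration, shaped like the dict returned by yaml.safe_load
example_configuration = {"bgcflow_dir": "/path/to/bgcflow"}
require_keys(example_configuration, ["bgcflow_dir"])
example_bgcflow_dir = Path(example_configuration["bgcflow_dir"])
```

In the notebook itself, the same call could be made on `notebook_configuration` right after loading it.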

# 1. Setup qc project with all genomes from NCBI

2938 genome accessions belonging to the Streptomycetaceae family were collected from the NCBI RefSeq database on 30-06-2023.

In [None]:
tables_dir = Path("assets/tables/")
ncbi_accn_table = tables_dir / "streptomycetaceae_refseq_30062023.tsv"

In [None]:
df_ncbi_accn = pd.read_csv(ncbi_accn_table, sep="\t", index_col=2)
df_ncbi_accn.head()

In [None]:
samples_columns = ["source","organism","genus","species","strain","closest_placement_reference"]
df_samples_strepto_ncbi_all = pd.DataFrame(index=df_ncbi_accn.index, columns=samples_columns)
df_samples_strepto_ncbi_all.index.name = "genome_id"
df_samples_strepto_ncbi_all.source = "ncbi"
df_samples_strepto_ncbi_all.head()

In [None]:
df_samples_strepto_ncbi_all.to_csv(tables_dir / "df_samples_strepto_ncbi_all.csv")

### Setup the config project using BGCFlow wrapper

Follow the guide to set up a project called "qc_strepto_ncbi" with the above df_samples table and a project config with the following rules

```
cd bgcflow_dir
bgcflow init --project qc_strepto_ncbi
cp .examples/_genome_project_example/prokka-db.csv config/qc_strepto_ncbi/
```

In [None]:
# Write samples.csv table to config directory of qc_strepto_ncbi project
bgcflow_dir = Path(notebook_configuration["bgcflow_dir"])
project_name_1 = "qc_strepto_ncbi"

config_dir_1 = bgcflow_dir / f"config/{project_name_1}"
df_samples_strepto_ncbi_all.to_csv(config_dir_1 / "samples.csv")

## Update project_config

Update the file `config/qc_strepto_ncbi/project_config.yaml` to turn on the seqfu rule.

Note that GTDB release 214 (the latest at the time) was used here, with GTDB-Tk version 2.3.0 set in the `workflow/envs/gtdbtk.yaml` file.

```
name: qc_strepto_ncbi
pep_version: 2.1.0
description: '2938 genome accessions were collected from the NCBI RefSeq database 
on the 30-06-2023 for genomes belonging to Streptomycetaceae family'
sample_table: samples.csv
prokka-db: 'prokka-db.csv'
rules:
  seqfu: true

#### CUSTOM RULE CONFIGURATION ####
rule_parameters:
  install_gtdbtk:
    release: "214"
    release_version: "214"
```
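
A quick way to confirm that an edited project config parses as intended is to load it and inspect the nested keys. The sketch below uses an abbreviated copy of the config above as an inline string (an assumption made for self-containment; in practice the file would be read from `config/qc_strepto_ncbi/project_config.yaml`).

```python
import yaml

# Abbreviated copy of the project_config.yaml content shown above
project_config_text = """
name: qc_strepto_ncbi
sample_table: samples.csv
rules:
  seqfu: true
rule_parameters:
  install_gtdbtk:
    release: "214"
    release_version: "214"
"""

project_config = yaml.safe_load(project_config_text)

# Rules that are switched on for this project
enabled_rules = sorted(
    rule for rule, enabled in project_config["rules"].items() if enabled
)
# GTDB release requested for the install_gtdbtk rule
gtdb_release = project_config["rule_parameters"]["install_gtdbtk"]["release"]
print(enabled_rules, gtdb_release)
```

If `install_gtdbtk` is not nested under `rule_parameters`, the last lookup raises a `KeyError`, which makes indentation mistakes easy to spot.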

## Evaluate the results of the above first BGCFlow run

Key tables generated:

1. df_ncbi_meta : NCBI metadata
2. df_gtdb_meta: GTDB metadata for genomes available in the release 214
3. df_seqfu_meta: Genome assembly statistics using seqfu rule

In [None]:
processed_dir_1 = bgcflow_dir / "data" / "processed" / project_name_1

# Read output tables from the processed directory
ncbi_meta_table_1 = processed_dir_1 / "tables"/ "df_ncbi_meta.csv"
df_ncbi_meta_1 = pd.read_csv(ncbi_meta_table_1, index_col= 0)

gtdb_meta_table_1 = processed_dir_1 / "tables"/ "df_gtdb_meta.csv"
df_gtdb_meta_1 = pd.read_csv(gtdb_meta_table_1, index_col= 0)

seqfu_meta_table_1 = processed_dir_1 / "tables"/ "df_seqfu_stats.csv"
df_seqfu_meta_1 = pd.read_csv(seqfu_meta_table_1, index_col= 0)

### Genomes not found in GTDB 

A particular release of GTDB might not contain all of the genomes from NCBI. Some genomes therefore lack CheckM information.

In [None]:
df_gtdb_meta_1.detail.value_counts()

In [None]:
df_gtdb_meta_1.Genus.value_counts()

In [None]:
df_gtdb_absent = df_gtdb_meta_1[df_gtdb_meta_1.detail != "Genome found"]
gtdb_absent_table = processed_dir_1 / "tables"/ "df_gtdb_absent.csv"

df_gtdb_absent.to_csv(gtdb_absent_table)

# 2. Set up GTDBtk run for the remaining genomes

In this second BGCFlow run, a project is set up to run GTDB-Tk and CheckM to assess the taxonomy and assembly quality of genomes not found in release R214 of GTDB, along with the genomes from the NBC strain collection.

1. df_gtdb_absent : 396 NCBI genomes that were not found in the GTDB server.
2. df_samples_NBC_1023: List of 1023 actinomycetes from the NBC collection. Note that some of these are redundant with the NCBI dataset and are therefore removed.

### Setup the config project using BGCFlow wrapper

Follow the guide to set up a project called "qc_gtdbtk" with samples tables and project config with the following rules

```
cd bgcflow_dir
bgcflow init --project qc_gtdbtk
cp .examples/_genome_project_example/prokka-db.csv config/qc_gtdbtk/
```

## Update project_config

Update the file `config/qc_gtdbtk/project_config.yaml` to configure the rules listed below.

Note that GTDB release 214 (the latest at the time) was used here, with GTDB-Tk version 2.3.0 set in the `workflow/envs/gtdbtk.yaml` file.

```
name: qc_gtdbtk
pep_version: 2.1.0
description: 'Project to assess taxonomy and assembly quality of 396 NCBI + 902 NBC genomes.'
sample_table: samples.csv
prokka-db: 'prokka-db.csv'
rules:
  seqfu: true
  checkm: false
  gtdbtk: false

#### CUSTOM RULE CONFIGURATION ####
rule_parameters:
  install_gtdbtk:
    release: "214"
    release_version: "214"
```

In [None]:
# Define the config directory of the qc_gtdbtk project
project_name_2 = "qc_gtdbtk"

config_dir_2 = bgcflow_dir / f"config/{project_name_2}"

In [None]:
# Read NBC data file (manually created for custom genomes). Follow documentation of BGCFlow for custom samples.
samples_NBC_table = config_dir_2 / "samples_NBC_1023.csv"
df_samples_NBC_1023 = pd.read_csv(samples_NBC_table, index_col=0)

### Remove redundant genomes

As some of the NBC genomes were already part of the NCBI dataset on 30 June 2023, we removed them from the custom project and kept the NCBI versions. All these genomes were part of the BioProject PRJNA747871.

In [None]:
redundant_strain_list = df_ncbi_meta_1[df_ncbi_meta_1.BioProject == "PRJNA747871"].strain.tolist()
print("The number of genomes that were removed from NBC dataset:", len(redundant_strain_list))
df_samples_NBC_902 = df_samples_NBC_1023.drop(redundant_strain_list)
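
`DataFrame.drop` raises a `KeyError` if any label in the list is missing from the index. A minimal sketch of a more defensive variant, assuming (as in the cell above) that the NBC table is indexed by strain name; the toy identifiers below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for df_samples_NBC_1023, indexed by strain name
toy_nbc = pd.DataFrame(
    {"source": ["custom"] * 3},
    index=["NBC_00001", "NBC_00002", "NBC_00003"],
)
# The second label is deliberately absent from toy_nbc
toy_redundant = ["NBC_00002", "NBC_99999"]

# Keep only the labels that actually occur in the index before dropping
present = toy_nbc.index.intersection(toy_redundant)
toy_nbc_filtered = toy_nbc.drop(present)
print(len(present), "genomes removed")
```

Comparing `len(present)` with `len(toy_redundant)` shows whether every redundant strain was matched.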

In [None]:
df_ncbi_meta_1.assembly_level.value_counts()

### Add fasta files for custom genomes

The fasta files for all 902 genomes were manually added to the directory `data/raw/fasta/`.

### Combine the GTDBTk project 

The samples table for the GTDB-Tk project is created from the 902 NBC genomes and the 396 NCBI genomes without taxonomic or assembly quality information.

In [None]:
df_samples_gtdb_absent = df_samples_strepto_ncbi_all.loc[df_gtdb_absent.index,:] 

df_samples_gtdbtk = pd.concat([df_samples_NBC_902, df_samples_gtdb_absent])

In [None]:
# Save the samples tables
df_samples_gtdb_absent.to_csv(config_dir_2/ "samples_NCBI_gtdb_absent.csv")
df_samples_NBC_902.to_csv(config_dir_2/ "samples_NBC_902.csv")
df_samples_gtdbtk.to_csv(config_dir_2/ "samples.csv")
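
Since the combined table must have one row per genome, it can help to let pandas reject duplicate genome IDs at concatenation time. The sketch below uses toy tables with made-up identifiers and the `verify_integrity` option of `pd.concat`; it illustrates the check and is not the code used in the study.

```python
import pandas as pd

# Toy stand-ins for the NBC and NCBI samples tables
toy_nbc_samples = pd.DataFrame({"source": ["custom"]}, index=["NBC_00001"])
toy_ncbi_samples = pd.DataFrame({"source": ["ncbi"]}, index=["GCF_000000001.1"])

# verify_integrity=True raises a ValueError if any index label is duplicated
toy_samples = pd.concat([toy_nbc_samples, toy_ncbi_samples], verify_integrity=True)
toy_samples.index.name = "genome_id"
print(toy_samples)
```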

## Evaluate the results of the above BGCFlow run on gtdbtk

Key tables generated:

1. gtdbtk.bac120.summary.tsv: GTDB-Tk based taxonomic identification with version 214
2. df_seqfu_stats: Genome assembly statistics using seqfu rule
3. df_checkm_stats: Genome completeness and contamination statistics using CheckM rule

In [None]:
processed_dir_2 = bgcflow_dir / "data" / "processed" / project_name_2

# Read output tables from the processed directory
ncbi_meta_table_2 = processed_dir_2 / "tables"/ "df_ncbi_meta.csv"
df_ncbi_meta_2 = pd.read_csv(ncbi_meta_table_2, index_col= 0)

gtdb_meta_table_2 = processed_dir_2 / "tables"/ "df_gtdb_meta.csv"
df_gtdb_meta_2 = pd.read_csv(gtdb_meta_table_2, index_col= 0)

seqfu_meta_table_2 = processed_dir_2 / "tables"/ "df_seqfu_stats.csv"
df_seqfu_meta_2 = pd.read_csv(seqfu_meta_table_2, index_col= 0)

check_meta_table_2 = processed_dir_2 / "tables"/ "df_checkm_stats.csv"
df_checkm_meta_2 = pd.read_csv(check_meta_table_2, index_col= 0)

gtdbtk_meta_table_2 = processed_dir_2 / "tables"/ "gtdbtk.bac120.summary.tsv"
df_gtdbtk_meta_2 = pd.read_csv(gtdbtk_meta_table_2, index_col= 0, sep="\t")

In [None]:
df_checkm_meta_2.sort_values("Completeness")

# 3. Rerun checkM on NBC_01635

The above assessment showed that CheckM reported a completeness of 0 for the genome NBC_01635. We therefore reran CheckM on this genome only.

NOTE: It seemed that translation table 4 was being used instead of 11. We manually edited the CheckM rule to use the custom Prokka-based gene annotation as input instead of the nucleotide fasta file.

### Setup the config project using BGCFlow wrapper

Follow the guide to set up a project called "checkm_rerun" with samples tables and project config with the following rules

```
cd bgcflow_dir
bgcflow init --project checkm_rerun
cp .examples/_genome_project_example/prokka-db.csv config/checkm_rerun/
```

## Update project_config

Update the file `config/checkm_rerun/project_config.yaml` to turn on the checkm rule.

```
name: checkm_rerun
pep_version: 2.1.0
description: 'CheckM rerun for the genome NBC_01635'
sample_table: samples.csv
prokka-db: 'prokka-db.csv'
rules:
  checkm: True
```

NOTE: We manually edited the CheckM rule to use the custom Prokka-based gene annotation as input instead of the nucleotide fasta file. Thus the results will not be generated automatically.

In [None]:
# Write samples.csv table to config directory of checkm_rerun project
project_name_3 = "checkm_rerun"

config_dir_3 = bgcflow_dir / f"config/{project_name_3}"

samples_columns = ["source","organism","genus","species","strain","closest_placement_reference"]
df_samples_checkm_rerun = pd.DataFrame(index=["NBC_01635"], columns=samples_columns)
df_samples_checkm_rerun.index.name = "genome_id"
df_samples_checkm_rerun.loc["NBC_01635", :] = df_samples_gtdbtk.loc["NBC_01635", :]
df_samples_checkm_rerun.to_csv(config_dir_3 / "samples.csv")

In [None]:
processed_dir_3 = bgcflow_dir / "data" / "processed" / project_name_3

check_meta_table_3 = processed_dir_3 / "tables"/ "df_checkm_stats.csv"
df_checkm_meta_3 = pd.read_csv(check_meta_table_3, index_col= 0)

### Update the checkM table to include the updated results

In [None]:
df_checkm_meta_2.loc["NBC_01635", :] = df_checkm_meta_3.loc["NBC_01635", :]
# df_checkm_meta_2.to_csv(processed_dir_2 / "tables" / "df_checkm_stats_curated.csv")

### Assess gtdbtk data and extract Streptomyces genus

In [None]:
# Remove genomes without classification
df_gtdb_meta = df_gtdb_meta_1.copy()
df_gtdbtk_meta_2 = df_gtdbtk_meta_2[df_gtdbtk_meta_2.classification != "Unclassified Bacteria"]

tax_mapping = {"d" : "Domain",
              "p" : "Phylum",
              "c" : "Class",
              "o" : "Order",
              "f" : "Family",
              "g" : "Genus",
              "s" : "Organism"}

for index in df_gtdbtk_meta_2.index:
    tax = [i for i in df_gtdbtk_meta_2.loc[index, "classification"].split(";")]
    for level in tax:
        key = level.split("__")[0]
        if key == "g":
            genus = level.split("__")[-1]
        if level == "s__":
            level = f"s__{genus} sp."
        df_gtdb_meta.loc[index, tax_mapping[key]] = level
    df_gtdb_meta.loc[index, "gtdb_release"] = "R214"
    df_gtdb_meta.loc[index, "detail"] = "found using gtdbtk"
df_gtdb_meta.Species = [i.split()[-1] for i in df_gtdb_meta.Organism]
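
The per-genome parsing done in the loop above can be isolated into a small function, which makes the handling of an empty species field (`s__`) easier to see. This is a sketch using plain Python; the function name `parse_gtdb_classification` and the example lineage string are illustrative assumptions.

```python
def parse_gtdb_classification(classification: str) -> dict:
    """Split a GTDB-Tk lineage string into one entry per taxonomic rank."""
    rank_names = {
        "d": "Domain", "p": "Phylum", "c": "Class", "o": "Order",
        "f": "Family", "g": "Genus", "s": "Organism",
    }
    parsed = {}
    genus = ""
    for level in classification.split(";"):
        prefix, _, name = level.partition("__")
        if prefix == "g":
            genus = name
        # An empty species field is replaced by "<genus> sp."
        if prefix == "s" and not name:
            level = f"s__{genus} sp."
        parsed[rank_names[prefix]] = level
    return parsed


# Illustrative lineage string with an unassigned species
example_lineage = (
    "d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Streptomycetales;"
    "f__Streptomycetaceae;g__Streptomyces;s__"
)
parsed_example = parse_gtdb_classification(example_lineage)
print(parsed_example["Genus"], "|", parsed_example["Organism"])
```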

In [None]:
df_gtdbtk_meta_curated_2 = df_gtdb_meta.loc[df_seqfu_meta_2.index,:]
df_gtdbtk_meta_curated_2.to_csv(processed_dir_2/ "tables"/ "df_gtdbtk_meta_curated.csv")
df_gtdb_meta.to_csv(processed_dir_1/ "tables"/ "df_gtdb_meta_curated.csv")

### Assess checkM, SeqFu data and define quality filters

In [None]:
df_seqfu_meta = df_seqfu_meta_1.combine_first(df_seqfu_meta_2)
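
`combine_first` returns the union of both indexes and, where a genome occurs in both tables, keeps the value from the first table. A toy illustration with made-up genome IDs and N50 values:

```python
import pandas as pd

# genome_A occurs in both toy tables, genome_B only in the second
toy_first = pd.DataFrame({"N50": [500_000]}, index=["genome_A"])
toy_second = pd.DataFrame({"N50": [100_000, 200_000]}, index=["genome_A", "genome_B"])

toy_combined = toy_first.combine_first(toy_second)
print(toy_combined)
```

Here `genome_A` keeps the value from `toy_first`, while `genome_B` is filled in from `toy_second`.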

In [None]:
# Manually load NBC genomes assembly_level status
df_nbc_assembly_level = pd.read_csv(config_dir_2/ "df_nbc_assembly_level.csv", index_col=0)

In [None]:
df = pd.DataFrame(index=df_gtdb_meta.index, columns=["genome_id", "genus", "source", "species", "quality", "completeness", "contamination", "N50", "contigs", "genome_len"])
df["genome_id"] = df.index


for genome_id in df.index:
    # Define genus
    if genome_id in df_gtdb_meta.index:
        genus = df_gtdb_meta.loc[genome_id, "Genus"].split("__")[1]
        species = df_gtdb_meta.loc[genome_id, "Species"]
        
        df.loc[genome_id, "species"] = species
        df.loc[genome_id, "genus"] = genus
    else:
        print(genome_id, "not found in df_gtdb_meta!")
        
    # Define quality levels
    if genome_id in df_ncbi_meta_1.index:
        df.loc[genome_id, "source"] = "NCBI"
        assembly_level = df_ncbi_meta_1.loc[genome_id, "assembly_level"]
    else:
        df.loc[genome_id, "source"] = "NBC"
        if genome_id in df_nbc_assembly_level.index:
            assembly_level = df_nbc_assembly_level.loc[genome_id, "assembly_level"]
        else:
            assembly_level = "Complete Genome"

            
    # Define checkM metrics of completeness and contamination
    if genome_id in df_checkm_meta_2.index:
        completeness = df_checkm_meta_2.loc[genome_id, "Completeness"]
        contamination = df_checkm_meta_2.loc[genome_id, "Contamination"]
    else:
        completeness = df_gtdb_meta.loc[genome_id, "checkm_completeness"]
        contamination = df_gtdb_meta.loc[genome_id, "checkm_contamination"]

    df.loc[genome_id, 'completeness'] = completeness
    df.loc[genome_id, 'contamination'] = contamination

    # Define seqfu metrics of N50 and Contigs
    df.loc[genome_id, 'N50'] = df_seqfu_meta.loc[genome_id, "N50"]
    df.loc[genome_id, 'contigs'] = df_seqfu_meta.loc[genome_id, "Count"]
    df.loc[genome_id, 'genome_len'] = df_seqfu_meta.loc[genome_id, "Total"]
    df.loc[genome_id, 'gc'] = df_seqfu_meta.loc[genome_id, "gc"]

    # Define the quality of genomes
    if contamination > 5 or completeness < 90:
        df.loc[genome_id, "quality"] = "LQ"
        print(genome_id, "removed.", "Contamination:", contamination, "Completeness:",completeness)
    else:
        if assembly_level in ["Complete Genome", "Chromosome"]:
            df.loc[genome_id, "quality"] = "HQ"
        else:
            contigs = df_seqfu_meta.loc[genome_id, "Count"]
            N50 = df_seqfu_meta.loc[genome_id, "N50"]
            if contigs <= 100 and  N50 >= 100000:
                df.loc[genome_id, "quality"] = "MQ"
            else:
                print(genome_id, "removed.", "Contigs:", contigs, "N50:", N50)
                df.loc[genome_id, "quality"] = "LQ"

df_stats = df.copy()

In [None]:
df_stats.quality.value_counts()

# 4. Streptomyces genus medium quality project

Following the above assessment, different quality filters were defined to select medium to high quality genomes for the next BGCFlow project called `mq_strepto`

Low quality
1. CheckM contamination > 5 %
2. CheckM completeness < 90 %
3. Number of contigs > 100
4. N50 < 100 kb

Medium quality
1. NCBI assembly level = Contig, Scaffold
2. CheckM contamination <= 5 %
3. CheckM completeness >= 90 %
4. Number of contigs <= 100
5. N50 >= 100 kb

High quality
1. NCBI assembly level = Complete genome, Chromosome
2. CheckM contamination <= 5 %
3. CheckM completeness >= 90 %
4. Number of contigs <= 100
5. N50 >= 100 kb
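
The quality tiers can be written as one small function. This sketch mirrors the order of checks in the loop earlier in this notebook; as in that loop, genomes at the Complete Genome or Chromosome level are not checked against the contig and N50 thresholds. The function name `classify_quality` and its argument names are illustrative, not part of BGCFlow.

```python
def classify_quality(completeness, contamination, assembly_level, contigs, n50):
    """Return "HQ", "MQ" or "LQ" for one genome."""
    # CheckM thresholds apply to every genome
    if contamination > 5 or completeness < 90:
        return "LQ"
    # Complete assemblies that pass CheckM are high quality
    if assembly_level in ("Complete Genome", "Chromosome"):
        return "HQ"
    # Draft assemblies must also pass the contiguity thresholds
    if contigs <= 100 and n50 >= 100_000:
        return "MQ"
    return "LQ"


# Made-up example values
print(classify_quality(99.5, 0.8, "Complete Genome", 1, 8_500_000))
print(classify_quality(98.0, 1.2, "Scaffold", 60, 250_000))
print(classify_quality(98.0, 1.2, "Contig", 350, 40_000))
```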








### Setup the config project using BGCFlow wrapper

Follow the guide to set up a project called "mq_strepto" with samples tables and project config with the following rules

```
cd bgcflow_dir
bgcflow init --project mq_strepto
cp .examples/_genome_project_example/prokka-db.csv config/mq_strepto/
```

## Update project_config

Update the file `config/mq_strepto/project_config.yaml` to turn on the seqfu, mash and prokka rules.

Note that GTDB release 214 (the latest at the time) was used here, with GTDB-Tk version 2.3.0 set in the `workflow/envs/gtdbtk.yaml` file.

```
name: mq_strepto
pep_version: 2.1.0
description: 'Project of 2371 medium to high quality genomes of the Streptomyces genus.'
sample_table: samples.csv
prokka-db: 'prokka-db.csv'
rules:
  seqfu: true
  mash: true
  prokka: true

#### CUSTOM RULE CONFIGURATION ####
rule_parameters:
  install_gtdbtk:
    release: "214"
    release_version: "214"
```

### Extract Streptomyces genus and MQ, HQ genomes

In [None]:
df_streptomyces = df_stats[df_stats.genus == "Streptomyces"]
df_strepto_mq = df_streptomyces[df_streptomyces.quality.isin(["HQ", "MQ"])]
df_strepto_mq.quality.value_counts()

In [None]:
df_samples_mq_ncbi = df_samples_strepto_ncbi_all.loc[df_strepto_mq[df_strepto_mq.source == "NCBI"].index,:] 
df_samples_mq_nbc = df_samples_NBC_902.loc[df_strepto_mq[df_strepto_mq.source == "NBC"].index,:] 
df_samples_mq_strepo = pd.concat([df_samples_mq_ncbi, df_samples_mq_nbc])
df_samples_mq_strepo

In [None]:
df_strepto_mq.species.value_counts()[:40]

In [None]:
import plotly.express as px

df = df_strepto_mq.copy()

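# filter top 20 species (species_counts is sorted by decreasing count)
species_counts = df['species'].value_counts()
top_species_names = species_counts[:20].index
# copy the slice so that new columns can be added without pandas warnings
df_top_species = df[df['species'].isin(top_species_names)].copy()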

# create a new column combining 'source' and 'quality'
df_top_species['source_quality'] = df_top_species['source'] + '-' + df_top_species['quality']

# compute total counts for each species
species_total_counts = df_top_species['species'].value_counts()

# sort species by total counts
df_top_species['species'] = pd.Categorical(df_top_species['species'], categories=species_total_counts.index, ordered=True)

fig = px.histogram(df_top_species, x="species", color="source_quality",
                   barmode="stack",
                   title="Distribution of Top 20 Species by Source and Quality",
                   labels={'species':'Species', 'source_quality':'Source and Quality', 'count':'Count'})

fig.update_layout(height=600, showlegend=True)
fig.show()

In [None]:
# Write samples.csv table to config directory of mq_strepto project
project_name_4 = "mq_strepto"
config_dir_4 = bgcflow_dir / f"config/{project_name_4}"

samples_mq_strepto_table = config_dir_4 / "samples.csv"
# df_samples_mq_strepo.to_csv(samples_mq_strepto_table)

In [None]:
processed_dir_4 = bgcflow_dir / "data" / "processed" / project_name_4

In [None]:
df_stats.to_csv(processed_dir_1 / "tables" / "df_filters.csv")
df_stats.to_csv(processed_dir_2 / "tables" / "df_filters.csv")
df_strepto_mq.to_csv(processed_dir_4 / "tables" / "df_filters.csv")

# 5. Streptomyces Mash-cluster projects 

We used Mash-based clustering to define consistent Mash-clusters. A separate project can be created for each Mash-cluster for detailed analysis.

This requires that the Mash-clustering has already been done and that the resulting table is stored at `assets/tables/df_mash_clusters_main_reduced.csv`. The M1 Mash-cluster shown here is just one example; the same steps can be used to set up similar BGCFlow projects on any other subset of genomes.

### Setup the config project using BGCFlow wrapper

Follow the guide to set up a project called "M1" with samples tables and project config with the following rules

```
cd bgcflow_dir
bgcflow init --project M1
cp .examples/_genome_project_example/prokka-db.csv config/M1/
```

## Update project_config

Update the file `config/M1/project_config.yaml` to turn on the seqfu and bigscape rules.

```
name: M1
pep_version: 2.1.0
description: 'Project of the individual Mash-cluster M1.'
sample_table: samples.csv
prokka-db: 'prokka-db.csv'
rules:
  seqfu: true
  bigscape: true

#### CUSTOM RULE CONFIGURATION ####
rule_parameters:
  install_gtdbtk:
    release: "214"
    release_version: "214"
```

In [None]:
samples_mq = bgcflow_dir / "config" / "mq_strepto" / "samples.csv"
df_samples_mq = pd.read_csv(samples_mq, index_col= 0)

In [None]:
df_mash_clusters = pd.read_csv("assets/tables/df_mash_clusters_main_reduced.csv", index_col=0)
cluster_list = sorted(df_mash_clusters.Cluster.unique())
for cluster in cluster_list:
    mash_cluster = "M" + str(cluster)
    genome_list = df_mash_clusters[df_mash_clusters.Cluster == cluster].index.tolist()
    df_samples_mash_cluster = df_samples_mq.loc[genome_list, :]
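    samples_phylo_table = bgcflow_dir / "config" / mash_cluster / "samples.csv"
    # Create the project config directory if `bgcflow init` has not been run for it yet
    samples_phylo_table.parent.mkdir(parents=True, exist_ok=True)
    df_samples_mash_cluster.to_csv(samples_phylo_table)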
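
The same split can be expressed with `groupby`, which avoids filtering the cluster table once per cluster. The sketch below uses a toy table with made-up genome IDs and assumes, as in the cell above, a `Cluster` column holding integer cluster numbers.

```python
import pandas as pd

# Toy stand-in for df_mash_clusters, indexed by genome ID
toy_clusters = pd.DataFrame(
    {"Cluster": [1, 2, 1]},
    index=["genome_A", "genome_B", "genome_C"],
)

# Map each Mash-cluster name (M1, M2, ...) to its list of genome IDs
genomes_per_cluster = {
    f"M{cluster}": group.index.tolist()
    for cluster, group in toy_clusters.groupby("Cluster")
}
print(genomes_per_cluster)
```

Each list can then be used to select rows from the samples table, as done with `df_samples_mq.loc[genome_list, :]` above.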