## Manuscript Workflow

### Install this package


In [None]:
!pip install ../

### Download `prodigal` and `diamond`

Place the following two programs in `bin` and specify their paths in `config.toml`:

**prodigal:** https://github.com/hyattpd/Prodigal  
**diamond:** https://github.com/bbuchfink/diamond
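
Before running the pipeline, it can help to confirm that both executables are actually in place. The following is a minimal sketch, not part of this package: the helper `find_missing_tools` is hypothetical, and it assumes the binaries are named `prodigal` and `diamond` and that `bin` sits one level above the notebook. Adjust the names and the path if your downloaded files or layout differ.

```python
import os


def find_missing_tools(bin_dir, names=("prodigal", "diamond")):
    """Return the tool names that are not present as executable files in bin_dir.

    Hypothetical helper; the executable names are assumptions.
    """
    missing = []
    for name in names:
        path = os.path.join(bin_dir, name)
        # A tool counts as available only if it is a regular file with execute permission.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "../bin" is an assumed location relative to this notebook.
    print("Missing tools:", find_missing_tools("../bin"))
```

An empty list means both programs were found; the paths written in `config.toml` should point at those same files.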


### Download the genomes used in this study


In [None]:
import os
import tarfile
import requests
from tqdm.auto import tqdm

core_files = [
    "core_Acidobacteria.tar.gz",
    "core_Actinobacteria.tar.gz",
    "core_Bacteroidetes.tar.gz",
    "core_Firmicutes.tar.gz",
    "core_Proteobacteria.tar.gz",
]

positive_files = [
    "positive_validation.tar.gz",
]

negative_files = [
    "negative_validation.tar.gz",
]

base_url = "https://zenodo.org/records/15860571/files/"
data_dir = "../data"
positive_validation_dir = "../positive_data"
negative_validation_dir = "../negative_data"


for files, download_dir in zip(
    (core_files, positive_files, negative_files),
    (data_dir, positive_validation_dir, negative_validation_dir),
    strict=True,
):
    os.makedirs(download_dir, exist_ok=True)
    for fname in files:
        url = f"{base_url}{fname}?download=1"
        local_path = os.path.join(download_dir, fname)

        print(f"Downloading {fname}...")
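        # Stream the response so large archives are never held fully in memory.
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            # Content-Length may be absent; tqdm then shows progress without a total.
            total = int(r.headers.get("content-length", 0)) or None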
            with (
                open(local_path, "wb") as f,
                tqdm(
                    desc=fname,
                    total=total,
                    unit="B",
                    unit_scale=True,
                    unit_divisor=1024,
                ) as bar,
            ):
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
                        bar.update(len(chunk))

        print(f"Extracting {fname}...")
        with tarfile.open(local_path, "r:gz") as tar:
            tar.extractall(path=download_dir)

        print(f"Cleaning up {fname}...")
        os.remove(local_path)


### Rename genome files to sequence names


In [None]:
# python ../bcpip/utils/rename_sequence_label.py [PATH_TO_DOWNLOADED_GENOMES] fna
!python ../bcpip/utils/rename_sequence_label.py ../data fna

### Run prediction pipeline

The parameters used in this study are specified in `config.toml`.

In the sequence alignment step, results were filtered to retain only those with at least 50% query coverage (`-f coverage=50`). For each query sequence, the best hit, if available, was selected based on the highest bit-score (`-c score`).
In the prediction score calculation step, both the binary (`-m binary`) and the logistic (`-m prob`) models were run.
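
The coverage filter and best-hit selection described above can be illustrated with a short sketch. This is not the `bcpip` implementation: the `Hit` record, its field names, and the `best_hits` function are hypothetical and only show the logic of keeping hits with at least 50% query coverage and then choosing the highest bit-score per query.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical alignment record; field names are illustrative only."""
    query: str
    subject: str
    query_coverage: float  # percentage of the query covered by the alignment
    bit_score: float


def best_hits(hits, min_coverage=50.0):
    """Return a dict mapping each query to its highest-scoring hit above the coverage cutoff."""
    best = {}
    for hit in hits:
        # Discard hits below the query-coverage threshold.
        if hit.query_coverage < min_coverage:
            continue
        current = best.get(hit.query)
        # Keep the hit with the highest bit-score for each query.
        if current is None or hit.bit_score > current.bit_score:
            best[hit.query] = hit
    return best
```

A query whose hits all fall below the threshold does not appear in the result, which matches the "best hit, if available" wording.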

_Note: If a parameter is not explicitly specified via command-line arguments, the corresponding default value from `config.toml` will be used. Only parameters provided by the user will override the defaults._
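
The override rule in the note can be sketched as follows. This is an illustration under stated assumptions, not the package's actual code: the function `resolve_params` and the parameter names in the example are hypothetical, and an argument that was not supplied on the command line is assumed to be represented as `None`.

```python
def resolve_params(defaults, cli_args):
    """Merge config defaults with command-line values (hypothetical helper).

    Only arguments the user actually supplied (not None) replace a default.
    """
    resolved = dict(defaults)
    resolved.update({key: value for key, value in cli_args.items() if value is not None})
    return resolved


# Illustrative values only; the real keys in config.toml may differ.
defaults = {"model": "binary", "coverage": 50}
cli_args = {"model": "prob", "coverage": None}
params = resolve_params(defaults, cli_args)
```

Here `model` takes the command-line value, while `coverage` keeps its default because it was not supplied.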


#### Binary Model


In [None]:
# Binary Model
# Note: This step can take several hours to complete, depending on dataset size and system resources.
!bcpip -i ../data/ -o ../output_binary -m binary -f coverage=50 -c score

In [None]:
# Equivalent to
# Gene prediction (prodigal) and BLAST may take several hours depending on dataset size and system resources.
# !bcpip prodigal -i ../data -o ../output_binary
# !bcpip blastp -i ../output_binary/prodigal -o ../output_binary
# !bcpip parse_xml -i ../output_binary/blast -o ../output_binary
# !bcpip best_blast -i ../output_binary/parse_blast -o ../output_binary -f coverage=50 -c score
# !bcpip match_enzyme -i ../output_binary/best_blast -o ../output_binary -m binary
# !bcpip result_summary -i ../output_binary/match_enzyme_result -o ../output_binary

#### Logistic Model


In [None]:
# Logistic Model
# Note: This step can take several hours to complete, depending on input size and system resources.
!bcpip -i ../data/ -o ../output_logistic -m prob -f coverage=50 -c score

### Results

Each run saves a summary of its prediction results to the `result_summary` directory inside its output directory (`../output_binary` or `../output_logistic` above).  
The file `prediction_output.csv` contains the IAA production prediction score for each genome.  
Summary statistics across all genomes are provided in `compound_output.csv` and `enzyme_output.csv`.  
Individual predictions for each genome are saved in the `match_enzyme_result` directory of the same output directory.
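
To inspect the scores programmatically, the summary file can be read with the standard `csv` module. This is a minimal sketch: the helper `load_rows` is hypothetical, the path in the usage comment assumes the binary-model output directory used earlier, and no column names are assumed because they are not listed here.

```python
import csv


def load_rows(csv_path):
    """Read a CSV file into a list of dicts keyed by the header row (hypothetical helper)."""
    with open(csv_path, newline="") as handle:
        return list(csv.DictReader(handle))


# Example usage (assumed path; run after the pipeline has finished):
# rows = load_rows("../output_binary/result_summary/prediction_output.csv")
# print(rows[0].keys())  # shows the column names actually present in the file
```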
