## Manuscript Workflow

### Install this package


In [None]:
!pip install ../

### Download `prodigal` and `diamond`

Place the following two programs in `bin` and specify their paths in `config.toml`:

**prodigal:** https://github.com/hyattpd/Prodigal  
**diamond:** https://github.com/bbuchfink/diamond
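
Before running the pipeline, it can help to confirm that both binaries are in place and executable. The following is a minimal sketch, assuming the executables are named `prodigal` and `diamond` and sit directly in a `bin` directory. The `check_tools` helper and the `../bin` location are illustrative, so adjust them to match the paths in your `config.toml`.

```python
import os
from pathlib import Path


def check_tools(bin_dir, names=("prodigal", "diamond")):
    """Return {tool name: True if an executable file with that name exists in bin_dir}."""
    bin_dir = Path(bin_dir)
    return {
        name: (bin_dir / name).is_file() and os.access(bin_dir / name, os.X_OK)
        for name in names
    }


if __name__ == "__main__":
    # "../bin" is an assumed location relative to this notebook; change as needed.
    for name, ok in check_tools("../bin").items():
        print(f"{name}: {'found' if ok else 'missing or not executable'}")
```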


### Download the genomes used in this study

TODO: Upload genomes to Zenodo and download them to `../data/`


### Optional: Rename genome files to sequence names


In [None]:
# python ../biopathpred/utils/rename_sequence_label.py [PATH_TO_DOWNLOADED_GENOMES] fna
!python ../biopathpred/utils/rename_sequence_label.py ../data fna

### Run prediction pipeline

The parameters used in this study are specified in `config.toml`.

In the sequence alignment step, results were filtered to retain only hits with at least 50% query coverage (`-f coverage=50`). For each query sequence, the best hit, if any, was selected by highest bit-score (`-c score`).
In the prediction score calculation step, both the binary (`-m binary`) and the logistic (`-m prob`) models were run.
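
The coverage filter and best-hit selection described above can be sketched in a few lines of Python. This is an illustration only, not the `biopathpred` implementation: the `Hit` record, its field names, and the `best_hits` helper are hypothetical, and query coverage is assumed to be expressed as a percentage.

```python
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    # Hypothetical record for one alignment hit; field names are illustrative.
    query: str
    subject: str
    coverage: float  # query coverage in percent
    bitscore: float


def best_hits(hits: Iterable[Hit], min_coverage: float = 50.0) -> dict:
    """Keep hits with coverage >= min_coverage, then pick the top bit-score per query."""
    best = {}
    for hit in hits:
        if hit.coverage < min_coverage:
            continue  # fails the coverage filter
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


hits = [
    Hit("gene1", "enzA", coverage=80.0, bitscore=120.0),
    Hit("gene1", "enzB", coverage=95.0, bitscore=200.0),
    Hit("gene1", "enzC", coverage=30.0, bitscore=500.0),  # dropped: coverage < 50
    Hit("gene2", "enzA", coverage=40.0, bitscore=90.0),   # dropped: gene2 ends up with no hit
]
for query, hit in best_hits(hits).items():
    print(query, hit.subject, hit.bitscore)
```

In this sketch a query whose hits all fall below the coverage threshold simply has no entry in the result, which mirrors the "if any" wording above. Ties on bit-score keep the first hit encountered; the real pipeline may break ties differently.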

_Note: If a parameter is not explicitly specified via command-line arguments, the corresponding default value from `config.toml` will be used. Only parameters provided by the user will override the defaults._
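
The override behaviour in the note above follows a common pattern: command-line options default to `None`, and only the values the user actually supplied replace the configuration defaults. The sketch below illustrates that pattern with `argparse`. It is not the `biopathpred` source, and the `defaults` dictionary with its keys is a hypothetical stand-in for whatever `config.toml` really contains.

```python
import argparse

# Hypothetical stand-in for values loaded from config.toml.
defaults = {"model": "binary", "filter": "coverage=50", "criterion": "score"}


def resolve_settings(argv, defaults):
    """Merge CLI arguments over config defaults; unspecified options keep the default."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not given" apart from an explicit value.
    parser.add_argument("-m", dest="model", default=None)
    parser.add_argument("-f", dest="filter", default=None)
    parser.add_argument("-c", dest="criterion", default=None)
    args = vars(parser.parse_args(argv))
    overrides = {key: value for key, value in args.items() if value is not None}
    return {**defaults, **overrides}


# Only -m is given, so the filter and criterion fall back to the defaults.
print(resolve_settings(["-m", "prob"], defaults))
```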


#### Binary Model


In [None]:
# Binary Model
# Note: This step can take several hours to complete, depending on dataset size and system resources.
!biopathpred -i ../data/ -o ../output_binary -m binary -f coverage=50 -c score

In [None]:
# Equivalent to
# Gene prediction (prodigal) and BLAST may take several hours depending on dataset size and system resources.
# !biopathpred prodigal -i ../data -o ../output_binary
# !biopathpred blastp -i ../output_binary/prodigal -o ../output_binary
# !biopathpred parse_xml -i ../output_binary/blast -o ../output_binary
# !biopathpred best_blast -i ../output_binary/parse_blast -o ../output_binary -f coverage=50 -c score
# !biopathpred match_enzyme -i ../output_binary/best_blast -o ../output_binary -m binary
# !biopathpred result_summary -i ../output_binary/match_enzyme_result -o ../output_binary

#### Logistic Model


In [None]:
# Logistic Model
# Note: This step can take several hours to complete, depending on input size and system resources.
!biopathpred -i ../data/ -o ../output -m prob -f coverage=50 -c score

### Results

For the logistic model run, a summary of the prediction results is saved to `../output/result_summary`; the binary model run writes its results under `../output_binary`.  
The file `prediction_output.csv` contains the prediction scores for IAA production for each genome.  
Summary statistics for all genomes are provided in `compound_output.csv` and `enzyme_output.csv`.  
Individual predictions for each genome are saved in `../output/match_enzyme_result`.
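
For a quick look at the prediction scores, the CSV can be read with Python's standard library. This sketch makes no assumption about the column names in `prediction_output.csv`: it only prints the header and the first few rows. The `preview_csv` helper is illustrative, and the path is taken from the text above.

```python
import csv
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows rows) of a CSV file without assuming column names."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


path = Path("../output/result_summary/prediction_output.csv")
if path.exists():
    header, rows = preview_csv(path)
    print(header)
    for row in rows:
        print(row)
else:
    print(f"{path} not found; run the pipeline first.")
```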
