# BioPKS Pipeline Tutorial 01 - the basics

In this tutorial, we explain the main settings and parameters that users can tune when running BioPKS Pipeline.

In [1]:
from biopks_pipeline import biopks_pipeline
from DORA_XGB import DORA_XGB
import warnings
warnings.simplefilter('ignore')

import os
os.chdir('../../BioPKS-Pipeline/notebooks')
print("Now working in:", os.getcwd())



Now working in: /Users/yashchainani96/PycharmProjects/BioPKS-Pipeline/notebooks


Users need to specify the following input parameters when attempting to synthesize a molecule *in silico* with BioPKS Pipeline:

`pathway_sequence`: this can be either of the lists `['pks']` or `['pks', 'bio']`. It determines whether only PKSs should be used to synthesize a target molecule, or whether post-PKS modifications should be allowed as well. The goal of BioPKS Pipeline is to expand the accessible chemical space by combining PKSs with regular, monofunctional enzymes, so using both types of enzymes is likely to give a higher chance of reaching one's target chemical than using either alone.

`target_smiles`: the SMILES string of the desired target chemical. Regardless of whether the input SMILES string contains stereochemical information about a molecule or not, BioPKS Pipeline will automatically convert the input SMILES to its canonical form and remove any chirality. This is because in this work, we have focused only on getting the correct 2D structure of a target molecule rather than its 3D structure. We anticipate future releases of our tool to be able to achieve the correct 3D structure of a target.

`target_name`: the name of the input target. This will be used to save BioPKS Pipeline's results in the folder `data/results_logs`.

`pks_release_mechanism`: the offloading reaction with which to release the PKS product from RetroTide. Users can select either `thiolysis` or `cyclization`. A `thiolysis` reaction will form a carboxylic acid while a `cyclization` reaction will form a lactone. Several other termination reactions are catalyzed by unique thioesterase (TE) domains, but we have included only `thiolysis` and `cyclization` here since these are the most common.

In [2]:
pathway_sequence = ['pks','bio']  # choose between ['pks'] or ['pks','bio']
target_smiles = 'C(C1C(C(C(C(=O)O1)O)O)O)O'
target_name = 'gluconic_lactone'
pks_release_mechanism = 'thiolysis' # choose from 'cyclization' or 'thiolysis'
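
The options described above can be checked before a run is started. The snippet below is a minimal sketch of such a check. The helper `check_inputs` and the two constants are hypothetical names introduced here for illustration and are not part of BioPKS Pipeline, which may perform its own validation.

```python
# Hypothetical helper (not part of BioPKS Pipeline): checks the four inputs
# against the options described in this tutorial.
ALLOWED_PATHWAY_SEQUENCES = (['pks'], ['pks', 'bio'])
ALLOWED_RELEASE_MECHANISMS = ('thiolysis', 'cyclization')


def check_inputs(pathway_sequence, target_smiles, target_name, pks_release_mechanism):
    """Return True if the inputs match the documented options, else raise ValueError."""
    if list(pathway_sequence) not in ALLOWED_PATHWAY_SEQUENCES:
        raise ValueError(f"pathway_sequence must be ['pks'] or ['pks', 'bio'], got {pathway_sequence}")
    if pks_release_mechanism not in ALLOWED_RELEASE_MECHANISMS:
        raise ValueError(f"pks_release_mechanism must be 'thiolysis' or 'cyclization', got {pks_release_mechanism!r}")
    if not target_smiles or not target_name:
        raise ValueError("target_smiles and target_name must be non-empty strings")
    return True


# Same values as in the cell above
inputs_ok = check_inputs(['pks', 'bio'],
                         'C(C1C(C(C(C(=O)O1)O)O)O)O',
                         'gluconic_lactone',
                         'thiolysis')
print("Inputs look valid:", inputs_ok)
```

A check like this only catches misspelled options early; it does not verify that the SMILES string itself is chemically valid.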

In addition to the parameters above, users can specify further parameters via a config file. The config file used in this tutorial is `notebooks/input_config_file_tutorial_1.json`. The following parameters can be tuned within the config file:

`pks_starters_filepath`: list of PKS starter units available to RetroTide (default: `../biopks_pipeline/retrotide/data/starters.smi`).

`pks_extenders_filepath`: list of PKS extender units available to RetroTide (default: `"../biopks_pipeline/retrotide/data/extenders.smi"`).

`pks_starters`: list of PKS **starter units** to run BioPKS Pipeline with. This is the list that users can edit to control which starting acyl-CoA derivatives can be used when designing synthesis pathways for small molecules. In our manuscript, for non-aromatic molecules, we constrained the list of starter units to only malonyl-CoA ("mal"), methylmalonyl-CoA ("mmal"), methoxymalonyl-CoA ("mxmal"), hydroxymalonyl-CoA ("hmal"), and allylmalonyl-CoA ("allylmal"). This list of starter units can be written as: `["mal", "mmal", "mxmal", "hmal", "allylmal"]`. For aromatic molecules, however, we allowed all starter units to be used by BioPKS Pipeline. This can be enabled by simply writing "all" for the `pks_starters` field. For this tutorial, we will use the 5 malonyl-CoA type starter units mentioned above.

`pks_extenders`: similar to above, this is the list of PKS **extender units** to run BioPKS Pipeline with. This is the list that users can edit to control which extender acyl-CoA derivatives can be used when designing synthesis pathways for small molecules. In our manuscript, for non-aromatic molecules, we constrained the list of extender units to only malonyl-CoA ("mal"), methylmalonyl-CoA ("mmal"), methoxymalonyl-CoA ("mxmal"), hydroxymalonyl-CoA ("hmal"), and allylmalonyl-CoA ("allylmal"). This list of extender units can be written as: `["mal", "mmal", "mxmal", "hmal", "allylmal"]`. For aromatic molecules, however, we allowed all extender units to be used by BioPKS Pipeline. This can be enabled by simply writing "all" for the `pks_extenders` field. For this tutorial, we will use the 5 malonyl-CoA type extender units mentioned above.

`pks_similarity_metric`: chemical similarity metric to use for ranking RetroTide's PKS products. We recommend using `mcs_without_stereo`, which prioritizes the maximum common substructure (MCS) between an intermediate PKS molecule and the final, target product. In performing this comparison, we ignore any stereochemical considerations. Other options include `atompairs` and `atomatompath`. Users are also welcome to add their own chemical similarity metrics of interest in `biopks_pipeline/retrotide/retrotide.py`.

`non_pks_similarity_metric`: chemical similarity metric to use for ranking DORAnet's post-PKS products. We also recommend using `mcs_without_stereo` here. This metric enables BioPKS Pipeline to retrieve the most chemically similar post-PKS product with respect to the target chemical in the event that the final target is not reached.

`non_pks_steps`: number of non-PKS reaction steps with which to modify PKS products.

`non_pks_cores`: number of computing cores used by DORAnet when searching for pathways after reaction networks have been generated. 

`bio_max_atoms`: the maximum number of atoms allowed by DORAnet. This is a filter with which users can specify an upper bound on the number of atoms of each element. For instance, to limit both the number of carbon atoms and the number of nitrogen atoms, users can enter a dictionary such as `{"C":"6", "N":"2"}`. To not use this filter at all, users can simply specify `"None"`.
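
To make the list above concrete, the following sketch assembles a config dictionary with these keys and round-trips it through JSON. The key names and default filepaths are taken from the descriptions above. The values for `non_pks_steps` and `non_pks_cores`, the value types, and the output filename are assumptions made for illustration, so the actual `input_config_file_tutorial_1.json` may differ.

```python
import json
import os
import tempfile

malonyl_type_units = ["mal", "mmal", "mxmal", "hmal", "allylmal"]

# Keys follow the parameter names described above; values are illustrative only.
config = {
    "pks_starters_filepath": "../biopks_pipeline/retrotide/data/starters.smi",
    "pks_extenders_filepath": "../biopks_pipeline/retrotide/data/extenders.smi",
    "pks_starters": malonyl_type_units,      # or "all"
    "pks_extenders": malonyl_type_units,     # or "all"
    "pks_similarity_metric": "mcs_without_stereo",
    "non_pks_similarity_metric": "mcs_without_stereo",
    "non_pks_steps": 1,                      # assumed value and type
    "non_pks_cores": 1,                      # assumed value and type
    "bio_max_atoms": "None",                 # filter disabled, as described above
}

# Write to a temporary folder so that no real config file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example_config_sketch.json")  # hypothetical filename
    with open(example_path, "w") as f:
        json.dump(config, f, indent=4)
    with open(example_path) as f:
        loaded_config = json.load(f)

print(json.dumps(loaded_config, indent=4))
```

When editing a real config file, starting from the file shipped with the tutorial is safer than writing one from scratch, since it reflects the exact keys and value types the pipeline expects.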

In [3]:
config_filepath = os.path.join('input_config_file_tutorial_1.json')

Here, we initialize DORA-XGB, a supervised learning classifier that helps predict the feasibility of reactions catalyzed by regular, monofunctional enzymes. After performing post-PKS modifications, DORAnet uses this previously published model to rank any post-PKS pathways found between the RetroTide product and the final target molecule. More details on DORA-XGB can be found in our previous publication: https://pubs.rsc.org/en/content/articlehtml/2024/me/d4me00118d

In [4]:
post_pks_rxn_model = DORA_XGB.feasibility_classifier(cofactor_positioning = 'add_concat',
                                                     model_type = "spare")

Now, with all the parameters defined, we can initialize an object of the `biopks_pipeline` class: 

In [5]:
biopks_pipeline_object = biopks_pipeline.biopks_pipeline(
                                             pathway_sequence = pathway_sequence,
                                             target_smiles = target_smiles,
                                             target_name = target_name,
                                             feasibility_classifier = post_pks_rxn_model,
                                             pks_release_mechanism = pks_release_mechanism,
                                             config_filepath = config_filepath)


Extender units successfully chosen for polyketide synthases

Starter units successfully chosen for polyketide synthases


Finally, we can begin a combined PKS and post-PKS synthesis with BioPKS Pipeline using the method `run_combined_synthesis`. This method accepts `max_designs` as an argument and this corresponds to the number of alternate PKS designs and consequently, the number of **unique** alternate PKS products that will be expanded upon in order to reach the final, downstream target.

In [6]:
### ----- Start synthesis -----
if __name__ == "__main__":
    biopks_pipeline_object.run_combined_synthesis(max_designs = 4)
    biopks_pipeline_object.save_results_logs()


Starting PKS synthesis with RetroTide
---------------------------------------------
computing module 1
   testing 120 designs
   best score is 0.6666666666666666
computing module 2
   testing 600 designs
   best score is 0.9166666666666666
computing module 3
   testing 600 designs
   best score is 0.8571428571428571

Best PKS design: [["AT{'substrate': 'Hydroxymalonyl-CoA'}", 'loading: True'], ["AT{'substrate': 'Hydroxymalonyl-CoA'}", "KR{'type': 'B1'}", 'loading: False'], ["AT{'substrate': 'Hydroxymalonyl-CoA'}", "KR{'type': 'B1'}", 'loading: False']]

Closest final product is: O=C(O)[C@H](O)[C@H](O)[C@H](O)[C@H](O)CO

Finished PKS synthesis: closest product to the target using the top PKS design of [["AT{'substrate': 'Hydroxymalonyl-CoA'}", 'loading: True'], ["AT{'substrate': 'Hydroxymalonyl-CoA'}", "KR{'type': 'B1'}", 'loading: False'], ["AT{'substrate': 'Hydroxymalonyl-CoA'}", "KR{'type': 'B1'}", 'loading: False']] is O=C(O)[C@H](O)[C@H](O)[C@H](O)[C@H](O)CO.

Moving onto non-PKS 



By product_number ranking finished
min score 6.0
max score 6.0

Pathway ranking finished. Pathway scores:

ranking 1
final score 6.0
Max reaction enthalpy score 0  x  2  =  0
Number of reactions score 1.0  x  4  =  4.0
By-product score 1.0  x  2  =  2.0
Pathway atom economy score 0.0  x  1  =  0.0
Salt score 0  x  0  =  0
Reaxys score 0  x  0  =  0
Cool score 0  x  0  =  0
atom economy 1.0
pathway by-product 0
intermediate by-product {None: 0}
O=C(O)C(O)C(O)C(O)C(O)CO>>O.O=C1OC(CO)C(O)C(O)C1O
rule0032_01
No_Thermo

Time used for pathway ranking: 0.11 minutes

Job name: gluconic_lactone_PKS0_BIO1
Job type: pathway visualization
Job started on: 2025-04-07 11:43:02.169029
pygraphviz is NOT installed.
Graphviz is NOT installed.
A custom node layout will be used for pathway visualization
Number of pathways:  1
Number of reactions in reaxys:  0
Working on creating pages
You can adjust multi-processing number to speed up PDF generation




page done: 1
Finished with pages, writing to pdf
Time used for pathway visualization: 0.07 minutes

A custom node layout was used for pathway visualization
For a better layout, please install pygraphviz and Graphviz with the following command:
conda install conda-forge::pygraphviz
which should install both packages together

Pathways found in 1 step/s between the top PKS product O=C(O)[C@H](O)[C@H](O)[C@H](O)[C@H](O)CO and the eventual target product O=C1OC(CO)C(O)C(O)C1O !!!
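
The `final score` reported in the pathway ranking log above appears to be the sum of the individual lines of the form `score  x  weight`. The short calculation below reproduces it from the numbers printed in the log. Reading the first number on each line as the component score and the second as its weight is an assumption based only on the printed output, not on DORAnet's source code.

```python
# (component score, weight) pairs copied from the ranking log above.
# Interpreting the pairs as "score x weight" is an assumption.
score_components = {
    "Max reaction enthalpy": (0, 2),
    "Number of reactions": (1.0, 4),
    "By-product": (1.0, 2),
    "Pathway atom economy": (0.0, 1),
    "Salt": (0, 0),
    "Reaxys": (0, 0),
    "Cool": (0, 0),
}

# Weighted sum over all components
final_score = sum(score * weight for score, weight in score_components.values())
print("final score", final_score)  # 6.0, matching the log
```

Here only the number of reactions (4.0) and the by-product term (2.0) contribute, which is why the single one-step pathway receives a final score of 6.0.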
