# Route clustering by strategic bonds

Retrosynthesis often yields many routes. We group routes by high‑level strategy so you can review distinct ideas, not duplicates.

What we use
- CGR: one graph per reaction step (formed/broken bonds).
- RouteCGR: all steps merged into one graph for a route.
- SB‑CGR: RouteCGR reduced to atoms of the target; shows only strategic bond changes.

Goal: routes with the same SB‑CGR go to the same cluster, even if they differ in step order, protections, or leaving groups.

In [None]:
#@title SynPlanner Installation
%pip install -q "synplanner>=1.2.0"

## 1. Data download
Download only the assets needed for planning and clustering. You can replace them with your own later (e.g., building blocks).

In [None]:
from pathlib import Path
from synplan.utils.loading import download_selected_files

# download only necessary SynPlanner assets
assets = [
    ("building_blocks", "building_blocks_em_sa_ln.smi"),
    ("uspto", "uspto_reaction_rules.pickle"),
    ("uspto/weights", "ranking_policy_network.ckpt"),
]

data_folder = download_selected_files(
    files_to_get=assets,
    save_to="synplan_data",
    extract_zips=True,
)

# input data
ranking_policy_network = data_folder.joinpath("uspto/weights/ranking_policy_network.ckpt").resolve(strict=True)
reaction_rules_path = data_folder.joinpath("uspto/uspto_reaction_rules.pickle")

# use your custom building blocks if needed
building_blocks_path = data_folder.joinpath("building_blocks/building_blocks_em_sa_ln.smi")

# planning and clustering results folders
planning_results_folder = Path("planning_with_downloaded_data").resolve()
planning_results_folder.mkdir(exist_ok=True)
clustering_results_folder = Path("clustering_with_downloaded_data").resolve()
clustering_results_folder.mkdir(exist_ok=True)

## 2. Retrosynthetic planning
Run a short example to produce routes, or skip this section if you already have `routes_*.json` or `routes_*.csv` files.

In [None]:
from synplan.chem.utils import mol_from_smiles

# example target: capivasertib, an anti-cancer drug for the treatment
# of breast cancer, approved by the FDA in 2023
example_smiles = "NC1(C(=O)N[C@@H](CCO)c2ccc(Cl)cc2)CCN(c2nc[nH]c3nccc2-3)CC1"

target_molecule = mol_from_smiles(
    example_smiles,
    clean2d=True,
    standardize=True,
    clean_stereo=True,
)

Run example planning (optional)

In [None]:
from synplan.mcts.tree import Tree
from synplan.utils.config import TreeConfig
from synplan.utils.loading import load_building_blocks, load_reaction_rules, load_policy_function

building_blocks = load_building_blocks(building_blocks_path, standardize=False)
reaction_rules = load_reaction_rules(reaction_rules_path)

policy_network = load_policy_function(weights_path=ranking_policy_network)

tree_config = TreeConfig(
    search_strategy="expansion_first",
    max_iterations=300,
    max_time=120,
    max_depth=9,
    min_mol_size=1,
    init_node_value=0.5,
    ucb_type="uct",
    c_ucb=0.1,
)

tree = Tree(
    target=target_molecule,
    config=tree_config,
    reaction_rules=reaction_rules,
    building_blocks=building_blocks,
    expansion_function=policy_network,
    # you can also specify evaluation_function=ValueNetwork(...), by default it is None
)

tree_solved = False
for solved, node_id in tree:
    if solved:
        tree_solved = True
tree

Export planned routes as JSON (AiZynthFinder‑compatible) or CSV for later clustering.

In [None]:
from synplan.chem.reaction_routes.io import export_tree_to_json, export_tree_to_csv

export_tree_to_json(tree, clustering_results_folder.joinpath("routes_1_1.json"))
export_tree_to_csv(tree, clustering_results_folder.joinpath("routes_1_1.csv"))

## 3. Load routes
Input formats
- CSV: rows of (route_id, step_id, reaction_smiles)
- JSON: route tree with `mol`/`reaction` nodes

Both can be converted to `routes_dict` (route_id → {step_id → ReactionContainer}).
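
To make the JSON layout concrete, here is a minimal sketch that walks an AiZynthFinder-style tree and collects one reaction string per step. The field names (`type`, `smiles`, `children`), the step numbering, and the helper `collect_steps` are assumptions for illustration only. `read_routes_json` remains the supported loader and yields `ReactionContainer` objects rather than strings.

```python
def collect_steps(node, steps=None):
    """Depth-first walk over a mol/reaction tree; returns {step_id: reaction_smiles}.

    Hypothetical helper: step ids are assigned in visiting order, starting at the target.
    """
    if steps is None:
        steps = {}
    if node.get("type") == "reaction":
        steps[len(steps)] = node["smiles"]
    for child in node.get("children", []):
        collect_steps(child, steps)
    return steps


# Toy two-step route (assumed layout): acid -> acyl chloride -> amide
toy_route = {
    "type": "mol", "smiles": "CC(=O)NC",
    "children": [{
        "type": "reaction", "smiles": "CC(=O)Cl.CN>>CC(=O)NC",
        "children": [
            {"type": "mol", "smiles": "CN", "children": []},
            {"type": "mol", "smiles": "CC(=O)Cl",
             "children": [{
                 "type": "reaction", "smiles": "CC(=O)O>>CC(=O)Cl",
                 "children": [{"type": "mol", "smiles": "CC(=O)O", "children": []}],
             }]},
        ],
    }],
}

# route_id -> {step_id -> reaction string}, mirroring the shape of routes_dict
toy_routes_dict = {0: collect_steps(toy_route)}
print(toy_routes_dict)
```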

In [None]:
from synplan.chem.reaction_routes.io import read_routes_csv, read_routes_json, make_json

Load from CSV

In [None]:
csv_path = clustering_results_folder.joinpath("routes_1_1.csv")
routes_dict_1 = read_routes_csv(csv_path)

Load from JSON

In [None]:
json_path = clustering_results_folder.joinpath("routes_1_1.json")
routes_dict_2 = read_routes_json(file_path=json_path, to_dict=True)

In [None]:
routes_json = make_json(routes_dict_2)

## 4. Build RouteCGR and SB‑CGR

We convert each multi‑step route into compact graph objects:

- CGR: overlay reactants/products for one step using atom mapping (formed/broken bonds).
- RouteCGR: fold all step CGRs into one graph for the whole route.
- SB‑CGR: keep only atoms of the target and their dynamic bonds (strategic bonds).

Why: SB‑CGR captures the core disconnections used to make the target and ignores non‑strategic details.
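
The three objects can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions, not SynPlanner's implementation: atoms are integers taken from the atom mapping, a bond is an unordered pair of atoms, and bond orders are ignored. All function names below are hypothetical.

```python
def bond(a, b):
    """Unordered bond between two mapped atoms."""
    return frozenset((a, b))


def step_cgr(reactant_bonds, product_bonds):
    """Toy CGR for one step: bonds formed and broken by the reaction."""
    return {
        "formed": product_bonds - reactant_bonds,
        "broken": reactant_bonds - product_bonds,
    }


def route_cgr(step_cgrs):
    """Toy RouteCGR: merge the bond changes of all steps."""
    return {
        "formed": set().union(*(s["formed"] for s in step_cgrs)),
        "broken": set().union(*(s["broken"] for s in step_cgrs)),
    }


def sb_cgr(rcgr, target_atoms):
    """Toy SB-CGR: keep formed bonds whose atoms both belong to the target."""
    return {b for b in rcgr["formed"] if b <= target_atoms}


# Step 1: C(1)-Cl(3) is replaced by C(1)-N(2); N(2) carries a protecting group at atom 5
toy_step_1 = step_cgr({bond(1, 3), bond(2, 5)}, {bond(1, 2), bond(2, 5)})
# Step 2: deprotection, the N(2)-PG(5) bond is broken
toy_step_2 = step_cgr({bond(1, 2), bond(2, 5)}, {bond(1, 2)})

toy_route_cgr = route_cgr([toy_step_1, toy_step_2])
toy_sb_cgr = sb_cgr(toy_route_cgr, target_atoms={1, 2})
print(toy_sb_cgr)
```

In this toy route the leaving group (atom 3) and the protecting group (atom 5) appear in the merged graph but drop out of the reduced one, leaving only the bond between atoms 1 and 2.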

In [None]:
from synplan.chem.reaction_routes.route_cgr import *
from synplan.chem.reaction_routes.clustering import *
from synplan.chem.reaction_routes.visualisation import cgr_display
from IPython.display import display, HTML, SVG
from synplan.utils.visualisation import get_route_svg_from_json

`compose_all_route_cgrs` builds RouteCGRs from a `Tree` or from a `routes_dict` (loaded from CSV/JSON).

In [None]:
all_route_cgrs = compose_all_route_cgrs(routes_dict_2)
# or, if you ran the planning step above:
# all_route_cgrs = compose_all_route_cgrs(tree)

In [None]:
from itertools import islice

# show the first three routes only
for route_id, route_cgr in islice(all_route_cgrs.items(), 3):
    print(route_id)
    # split the RouteCGR into connected components and take the first one as the target CGR
    cgr_prods = [route_cgr.substructure(c) for c in route_cgr.connected_components]
    target_cgr = cgr_prods[0]
    display(SVG(cgr_display(target_cgr)))
    display(SVG(get_route_svg_from_json(routes_json, route_id)))
    # or
    # display(SVG(get_route_svg(tree, route_id)))  # currently, a pathway from a serialized tree cannot be depicted


In [None]:
all_reduced_route_cgrs = compose_all_reduced_route_cgrs(all_route_cgrs)

In [None]:
from itertools import islice

# show the first three reduced routes (SB-CGRs) only
for route_id, route_cgr in islice(all_reduced_route_cgrs.items(), 3):
    print(route_id)
    cgr_prods = [route_cgr.substructure(c) for c in route_cgr.connected_components]
    target_cgr = cgr_prods[0]
    display(SVG(cgr_display(target_cgr)))
    display(SVG(get_route_svg_from_json(routes_json, route_id)))
    # or
    # display(SVG(get_route_svg(tree, route_id)))

## 5. Cluster routes

**How it works**

- Cluster by SB‑CGR: routes with identical SB‑CGR end up together (same strategic bonds).
- `use_strat=False` (default here): compare SB‑CGR graph signatures; robust to atom mapping.
- Output: dict keyed by `NSB.index` (e.g., `3.2`) with route IDs and a representative SB‑CGR.
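
The grouping idea can be sketched without SynPlanner. The code below assumes that `NSB` in the key stands for the number of strategic bonds and that `index` is a running counter within that number; both readings, the field names, and the helper `cluster_by_signature` are assumptions for illustration, not the behavior of `cluster_routes`.

```python
from collections import defaultdict


def cluster_by_signature(signatures):
    """Group routes with identical strategic-bond sets.

    signatures: {route_id: frozenset of bonds}. Returns {"N.i": {...}} where
    N is the number of bonds and i a running index (larger groups first).
    """
    groups = defaultdict(list)
    for route_id, signature in signatures.items():
        groups[signature].append(route_id)

    clusters = {}
    counters = defaultdict(int)
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[0]), -len(kv[1])))
    for signature, route_ids in ordered:
        n_bonds = len(signature)
        counters[n_bonds] += 1
        clusters[f"{n_bonds}.{counters[n_bonds]}"] = {
            "route_ids": route_ids,
            "signature": signature,
        }
    return clusters


# Toy data: routes 0 and 1 share the same two strategic bonds
toy_signatures = {
    0: frozenset({(1, 2), (3, 4)}),
    1: frozenset({(3, 4), (1, 2)}),
    2: frozenset({(1, 2), (5, 6)}),
    3: frozenset({(1, 2)}),
}
toy_clusters = cluster_by_signature(toy_signatures)
for key, cluster in toy_clusters.items():
    print(key, cluster["route_ids"])
```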

In [None]:
# use_strat: if True, clustering will use the CGRContainer’s structural signature
#            to ensure that routes which are chemically identical but differ only
#            in their atom mappings are grouped together instead of split apart
clusters = cluster_routes(all_reduced_route_cgrs, use_strat=False)

In [None]:
clusters

### Cluster report (HTML)
For any cluster ID (e.g., `2.1`), generate a compact HTML summary:
- target SMILES
- cluster index and size
- SB‑CGR (strategic bonds)
- each route: steps, optional score, SVG and reaction SMILES

In [None]:
cluster_index = '2.1'
if cluster_index in clusters:
    # display(HTML(routes_clustering_report(tree, clusters, cluster_index,
    #                      all_reduced_route_cgrs)))
    # or
    display(HTML(routes_clustering_report(routes_json, clusters, cluster_index,
                         all_reduced_route_cgrs)))
else:
    print(f"Cluster {cluster_index} not found in the clustering results.")


## 6. Subclustering
Refines each main cluster by abstracting non‑strategic details.

What happens
- Replace leaving/protecting groups in RouteCGR by generic X‑labels (synthon).
- Build a “pseudo‑reaction” between labeled building blocks and target.
- Collect and tabulate leaving groups per position across routes.

Why: routes may share the same SB‑CGR but differ in tactics (e.g., different leaving groups or protections). Subclustering separates these variants.
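
The tabulation step can be pictured as a small table with one row per route and one column per X-labeled position. The sketch below uses made-up leaving groups and a hypothetical helper `tabulate_leaving_groups`; the data layout is an assumption and does not reflect how `subcluster_all_clusters` stores its results.

```python
# Toy input (assumed layout): route_id -> {position label -> leaving group}
toy_leaving_groups = {
    0: {"X1": "Cl", "X2": "OH"},
    1: {"X1": "Br", "X2": "OH"},
    2: {"X1": "Cl", "X2": "OH"},
}


def tabulate_leaving_groups(lg_by_route):
    """Return (sorted position labels, {route_id: tuple of leaving groups})."""
    positions = sorted({p for lgs in lg_by_route.values() for p in lgs})
    rows = {
        route_id: tuple(lgs.get(p, "") for p in positions)
        for route_id, lgs in lg_by_route.items()
    }
    return positions, rows


toy_positions, toy_rows = tabulate_leaving_groups(toy_leaving_groups)
print("route", *toy_positions)
for route_id, row in toy_rows.items():
    print(route_id, *row)
```

Routes 0 and 2 use the same leaving groups at every position, while route 1 differs at `X1`; this is the kind of tactical difference that subclustering is meant to expose.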

In [None]:
all_subclusters = subcluster_all_clusters(clusters, all_reduced_route_cgrs, all_route_cgrs)

In [None]:
cluster_index = '3.1'
subcluster_num = 1

# check both the cluster and the subcluster to avoid a KeyError
if cluster_index in all_subclusters and subcluster_num in all_subclusters[cluster_index]:
    subgroup = all_subclusters[cluster_index][subcluster_num]
    display(HTML(routes_subclustering_report(tree, subgroup, cluster_index, subcluster_num, all_reduced_route_cgrs, aam=False)))
else:
    print(f"Subcluster {subcluster_num} of cluster {cluster_index} not found in the subclustering results.")

### Post‑processing (experimental)
- Remove leaving‑group columns that are constant across all routes in a subgroup.
- Merge routes with identical leaving‑group sets.
- Merge symmetric pseudo‑reactions that differ only by atom mapping.
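
The first two post-processing steps can be sketched on a toy leaving-group table. The helpers `drop_constant_columns` and `merge_identical_rows` and the table layout are hypothetical; they illustrate the idea only and are not what `post_process_subgroup` does internally. The symmetry-based merge is not covered here.

```python
def drop_constant_columns(positions, rows):
    """Remove positions whose leaving group is the same in every route."""
    keep = [
        i for i in range(len(positions))
        if len({row[i] for row in rows.values()}) > 1
    ]
    kept_positions = [positions[i] for i in keep]
    reduced = {rid: tuple(row[i] for i in keep) for rid, row in rows.items()}
    return kept_positions, reduced


def merge_identical_rows(rows):
    """Group route ids that share the same tuple of leaving groups."""
    merged = {}
    for route_id, row in rows.items():
        merged.setdefault(row, []).append(route_id)
    return merged


# Toy table (assumed layout): one row per route, one column per position
toy_lg_positions = ["X1", "X2"]
toy_lg_rows = {0: ("Cl", "OH"), 1: ("Br", "OH"), 2: ("Cl", "OH")}

kept_positions, reduced_rows = drop_constant_columns(toy_lg_positions, toy_lg_rows)
merged_routes = merge_identical_rows(reduced_rows)
print(kept_positions)
print(merged_routes)
```

Here `X2` is dropped because every route uses the same group there, and routes 0 and 2 collapse into one entry.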

In [None]:
if len(subgroup['nodes_data']) != 1:
    new_subgroup = post_process_subgroup(subgroup)
    display(HTML(routes_subclustering_report(tree, new_subgroup, cluster_index, subcluster_num, all_reduced_route_cgrs, if_lg_group=True)))

## References & further reading
- SynPlanner docs: [official documentation](https://synplanner.readthedocs.io/)
- Concepts used here: CGR, RouteCGR, SB‑CGR, strategic bond patterns (see paper).