# Tutorial on SynTool from data to planning

## Introduction to the SynTool Tutorial

Welcome to the SynTool tutorial. SynTool is a retrosynthesis planning tool that combines Monte Carlo Tree Search (MCTS) with neural networks. This tutorial is intended for synthetic chemists and chemoinformaticians who want to understand SynTool and apply it in their work.

## Understanding SynTool
SynTool is a tool developed to enhance the efficiency of retrosynthetic analysis in chemical synthesis. It provides insights into possible synthetic pathways for target molecules, leveraging the power of MCTS and neural networks to streamline this process.

## Tutorial's Focus
This tutorial is specifically designed to provide a rapid overview of SynTool's basic functionalities, focusing on a simple yet common setup in retrosynthesis tools. We will delve into the implementation of a ranking policy network and the rollout evaluation function, akin to those used in popular tools such as 3N-MCTS and AiZynthFinder.

## Tutorial Structure
1. **Handling Reaction Data**: Techniques for processing reaction data effectively for retrosynthesis.
2. **Deriving Reaction Rules**: Strategies for extracting valuable rules from reaction data.
3. **Neural Network Training for Reaction Ranking**: Steps to train neural networks to identify feasible synthetic pathways.
4. **Practical Use of MCTS in Retrosynthesis**: Insights into how MCTS aids in decision-making in retrosynthetic analysis.

## Prerequisites
- **Knowledge Base**: A good understanding of synthetic chemistry and basic principles of machine learning is expected.
- **Technical Requirements**: Familiarity with Python and its relevant libraries, which will be covered in the tutorial.

## 0. General imports

In [None]:
import os
from pathlib import Path  # needed for paths validations in this tutorial

In [None]:
root_folder = "."  # replace with the folder where you want to keep all your results
root_folder = Path(root_folder).resolve(strict=True)

## 1. Reaction data filtration

In SynTool, reaction data filtration is a crucial step to ensure the quality and accuracy of the data used for retrosynthesis analysis. The USPTO dataset, a standardized but unfiltered collection of reaction records, serves as the primary data source. However, this dataset may contain records with no reaction center or atom-to-atom mapping errors. To address this, we apply several filters:

1. **No Reaction Filter**: Removes reactions with identical reactants and products.
2. **Small Molecules Filter**: Excludes reactions where every reactant and product has at most a specified number of heavy atoms.
3. **Reaction Distance Filter**: Discards reactions in which more than a specified number of bonds change.
4. **Multi-Centre Reaction Filter**: Eliminates records with multiple reaction centers.
5. **Csp3-C Breaking Filter**: Removes reactions where a bond between two sp3 carbons is broken, indicating potential mapping errors.
6. **C-C Ring Breaking Filter**: Filters out reactions breaking a bond between two carbons in the same ring (sizes 5, 6, or 7), identified using the SSSR algorithm.
7. **C-H Breaking Filter**: Removes reactions breaking a C-H bond to form a C-C bond, with exceptions for condensation reactions or those involving carbenes.
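
To make the "no reaction" and "reaction distance" filters more concrete, here is a minimal sketch in plain Python. It does not use SynTool or CGRtools: a reaction is reduced to two sets of atom-mapped bonds, and the helper names (`changed_bonds`, `is_no_reaction`, `passes_distance_filter`) are hypothetical and chosen only for illustration.

```python
# Toy illustration only: a bond is a pair of atom-map numbers (a, b) with a < b.
# None of these helpers belong to SynTool.

def changed_bonds(reactant_bonds, product_bonds):
    """Bonds present on only one side of the reaction (broken or formed)."""
    return set(reactant_bonds) ^ set(product_bonds)


def is_no_reaction(reactant_bonds, product_bonds):
    """True when no bond changes between reactants and products."""
    return not changed_bonds(reactant_bonds, product_bonds)


def passes_distance_filter(reactant_bonds, product_bonds, min_bonds=1, max_bonds=6):
    """Keep reactions whose number of changed bonds lies in [min_bonds, max_bonds]."""
    n_changed = len(changed_bonds(reactant_bonds, product_bonds))
    return min_bonds <= n_changed <= max_bonds


# Toy reaction: bond 1-2 is broken and bond 1-3 is formed.
reactants = {(1, 2), (3, 4)}
products = {(1, 3), (3, 4)}

print(len(changed_bonds(reactants, products)))
print(is_no_reaction(reactants, reactants))
print(passes_distance_filter(reactants, products))
```

The real filters operate on full molecular graphs with atom-to-atom mapping, so this sketch only conveys the counting logic behind the thresholds.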

The `ReactionCheckConfig` class in SynTool manages the configuration settings for these filters. It includes paths, file formats, checker-specific parameters, and a range of options to tailor the filtration process to specific requirements. This step is vital to ensure that the data fed into the retrosynthesis model is reliable and accurate.


First, we import the main configuration class `ReactionCheckConfig`, the filtering function `filter_reactions`, and the configuration classes of the filters that we want to apply to the input reaction data.

In [None]:
# output file for the filtered reaction data
filtered_data_path = root_folder.joinpath("uspto_filtered.smi")

In [None]:
from SynTool.chem.data.filtering import (
    filter_reactions,
    ReactionCheckConfig,
    CCRingBreakingConfig,
    WrongCHBreakingConfig,
    CCsp3BreakingConfig,
    DynamicBondsConfig,
    MultiCenterConfig,
    NoReactionConfig,
    SmallMoleculesConfig,
)

In [None]:
# download the standardized USPTO dataset yourself, or set this path to your own reaction database
reaction_data_path = Path("uspto_standardized.smi").resolve(strict=True)

### Setting Up the Filtration Configuration

Once we've established the importance of filtering our reaction data, the next step is to configure the filtration process. We do this using the `ReactionCheckConfig` class in SynTool. This class allows us to specify various parameters and settings for our filtration process. Let's break down what each line in our configuration setup does:

In [None]:
filtration_config = ReactionCheckConfig(
    remove_small_molecules=False,  # Flag to determine whether to remove small molecules.
    small_molecules_config=SmallMoleculesConfig(
        limit=6  # Setting the heavy atoms limit for the small molecules filter.
    ),
    dynamic_bonds_config=DynamicBondsConfig(
        min_bonds_number=1,  # Minimum number of dynamic bonds for a reaction to be considered.
        max_bonds_number=6  # Maximum number of dynamic bonds.
    ),
    no_reaction_config=NoReactionConfig(),  # Configuration for the 'no reaction' filter.
    multi_center_config=MultiCenterConfig(),  # Configuration for the 'multi-center reaction' filter.
    wrong_ch_breaking_config=WrongCHBreakingConfig(),  # Configuration for the 'C-H breaking' filter.
    cc_sp3_breaking_config=CCsp3BreakingConfig(),  # Configuration for the 'Csp3-C breaking' filter.
    cc_ring_breaking_config=CCRingBreakingConfig()  # Configuration for the 'C-C ring breaking' filter.
)
filtration_config

By setting up `filtration_config`, we are essentially telling SynTool what filters to apply and how to apply them. This step is crucial for ensuring that the data we use for further analysis is as accurate and reliable as possible. The filters we apply here are based on the specific needs of our analysis and the characteristics of the USPTO dataset.

Once this configuration is in place, we can proceed to apply these filters to our reaction data:

In [None]:
filter_reactions(config=filtration_config,
                 reaction_database_path=reaction_data_path, # The path to our reaction database file.
                 result_reactions_file_name=filtered_data_path, # The file where the results will be stored.
                 append_results=True,
                 num_cpus=4,
                 batch_size=100)

## 2. Reaction rules extraction

In [None]:
# output file for the extracted reaction rules
reaction_rules_path = root_folder.joinpath("reaction_rules.pickle")

In [None]:
from SynTool.utils.config import RuleExtractionConfig
from SynTool.chem.reaction_rules.extraction import extract_rules_from_reactions

After filtering the reaction data, the next crucial step in SynTool is the extraction of reaction rules. This process is vital for retrosynthetic analysis, as it defines the patterns and transformations that can be applied to synthesize target molecules.

#### The Extraction Protocol
The protocol for extracting reaction rules from reactions in SynTool utilizes the CGRtools Python library. This procedure involves the following steps:
1. **Substructure Extraction**: For each reactant and product in a given reaction, substructures containing the atoms of the reaction center and their immediate environment are extracted.
2. **Substructure Exchange**: The reactant and product substructures are then exchanged.
3. **Reagents Handling**: If the reaction includes reagents, they are not incorporated into the retro-rule.
4. **Label Preservation**: All labels related to the atoms of the reaction center, such as hybridization, the number of neighbors, and the ring sizes in which the atoms participate, are preserved. For atoms in the first environment, only the sizes of rings are preserved.

A retrosynthetic transformation (reaction rule) formed by this protocol is applied to the product of the original reaction. If it successfully generates the reactants of the reaction, the rule is considered valid.

#### Configuring Rule Extraction
The `RuleExtractionConfig` class in SynTool allows for the fine-tuning of how reaction rules are extracted. Key parameters of this class include:

- **multicenter_rules**: Determines whether a single rule is extracted for all centers in multicenter reactions (`True`) or if separate rules are generated for each center (`False`). Default is `True`.
- **as_query_container**: When set to `True`, the extracted rules are formatted as `QueryContainer` objects, similar to SMARTS for chemical pattern matching. Default is `True`.
- **reverse_rule**: If `True`, the direction of the reaction is reversed during rule extraction, which is useful for retrosynthesis. Default is `True`.
- **reactor_validation**: Activates the validation of each generated rule in a chemical reactor to confirm accurate generation of products from reactants when set to `True`. Default is `True`.
- **include_func_groups**: If `True`, specific functional groups are included in the reaction rule in addition to the reaction center and its environment. Default is `False`.
- **func_groups_list**: Specifies a list of functional groups to be included when `include_func_groups` is `True`.
- **include_rings**: Includes ring structures in the reaction rules connected to the reaction center atoms if set to `True`. Default is `False`.
- **keep_leaving_groups**: Keeps the leaving groups in the extracted reaction rule when set to `True`. Default is `False`.
- **keep_incoming_groups**: Retains incoming groups in the extracted reaction rule if set to `True`. Default is `False`.
- **keep_reagents**: Includes reagents in the extracted reaction rule when `True`. Default is `False`.
- **environment_atom_count**: Sets the number of layers of atoms around the reaction center to be included in the rule. A value of `0` includes only the reaction center, `1` includes the first surrounding layer, and so on. Default is `1`.
- **min_popularity**: Establishes the minimum number of times a rule must be applied to be included in further analysis. Default is `3`.
- **keep_metadata**: Preserves associated metadata with the reaction in the extracted rule when set to `True`. Default is `False`.
- **single_reactant_only**: Limits the extracted rules to those with only a single reactant molecule if `True`. Default is `True`.
- **atom_info_retention**: Dictates the level of detail retained about atoms in the reaction center and their environment. By default, information about neighbors, hybridization, implicit hydrogens, and ring sizes is retained for both the reaction center and its environment:
    ```python
    
    {
        "reaction_center": {
            "neighbors": True,
            "hybridization": True,
            "implicit_hydrogens": True,
            "ring_sizes": True,
        },
        "environment": {
            "neighbors": True,
            "hybridization": True,
            "implicit_hydrogens": True,
            "ring_sizes": True,
        },
    }
    ```

These settings are crucial for tailoring the rule extraction process to the specific needs of the retrosynthesis analysis, ensuring that the rules are both accurate and relevant.
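
As an illustration of what `environment_atom_count` controls, the following sketch collects the atoms within a given number of bonds of the reaction center using a breadth-first search. It is a simplified stand-in, not SynTool's implementation: the molecule is a plain adjacency dictionary, and `atoms_within_layers` is a hypothetical helper.

```python
from collections import deque


def atoms_within_layers(adjacency, center_atoms, n_layers):
    """Return atoms at most `n_layers` bonds away from any reaction-center atom."""
    selected = set(center_atoms)
    queue = deque((atom, 0) for atom in center_atoms)
    while queue:
        atom, depth = queue.popleft()
        if depth == n_layers:
            continue  # do not expand beyond the requested layer
        for neighbor in adjacency[atom]:
            if neighbor not in selected:
                selected.add(neighbor)
                queue.append((neighbor, depth + 1))
    return selected


# Linear chain 1-2-3-4-5 with the reaction center on atoms 2 and 3.
chain = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

print(sorted(atoms_within_layers(chain, {2, 3}, 0)))  # reaction center only
print(sorted(atoms_within_layers(chain, {2, 3}, 1)))  # center plus first layer
```

With `n_layers=0` only the center atoms are kept, which mirrors the description of `environment_atom_count=0`; each additional layer adds the next shell of neighbors.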

#### Example Configuration
Here's an example of setting up the `RuleExtractionConfig` for a specific use case:

In [None]:
extraction_config = RuleExtractionConfig(
    keep_leaving_groups=True,
    atom_info_retention={
        "reaction_center": {
            "neighbors": True,  # Retains the number of neighboring atoms of reaction-center atoms.
            "hybridization": True,  # Retains the hybridization state of reaction-center atoms.
            "implicit_hydrogens": False,  # Drops implicit hydrogen counts of reaction-center atoms.
            "ring_sizes": False,  # Drops the sizes of rings that reaction-center atoms are part of.
        },
        "environment": {
            "neighbors": True,  # Retains the number of neighboring atoms of environment atoms.
            "hybridization": False,  # Drops the hybridization state of environment atoms.
            "implicit_hydrogens": False,  # Drops implicit hydrogen counts of environment atoms.
            "ring_sizes": False,  # Drops the sizes of rings that environment atoms are part of.
        },
    }
)
extraction_config

This setup, for instance, retains leaving groups and specifies the level of detail for atom information retention around the reaction center. Each parameter in `RuleExtractionConfig` is designed to give users the flexibility to optimize rule extraction according to their requirements.

### Extracting Reaction Rules Using `extract_rules_from_reactions`

After configuring the rule extraction settings in SynTool, the next step is to apply these configurations to extract reaction rules from our dataset. This is achieved using the `extract_rules_from_reactions` function. Here's a breakdown of the function call in the tutorial:

In [None]:
extract_rules_from_reactions(config=extraction_config, # The configuration settings for rule extraction.
                             reaction_file=reaction_data_path, # Path to the file containing the reaction data.
                             rules_file_name=reaction_rules_path, # Name of the file to store the extracted rules.
                             num_cpus=4,
                             batch_size=100)

#### Understanding the Parameters

1. **config**: Passes the `extraction_config` object, which contains all the rule extraction settings we defined earlier. This configures how the rules are extracted from the reaction data.

2. **reaction_file**: Specifies the location of the reaction database file. The reactions in this file will be used to extract the rules.

3. **results_root**: Determines the directory where the results, i.e., the extracted rules, are saved. It is not passed in the call above; there the output location comes from `rules_file_name`.

4. **rules_file_name**: Sets the file where the extracted rules will be written. Here it is `reaction_rules_path`, i.e. `reaction_rules.pickle` inside `root_folder`.

5. **num_cpus**: Indicates the number of CPU cores to use for processing. Setting this to 4 allows for parallel processing, making the extraction process faster and more efficient.

6. **batch_size**: Defines the number of reactions to be processed in each batch. A batch size of 100 is chosen to strike a balance between processing speed and memory usage.
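
The idea behind `batch_size` can be sketched with a generic batching helper. How SynTool forms its batches internally is not shown in this tutorial, so `iter_batches` below is a hypothetical, standard-library-only illustration of splitting a stream of reactions into chunks of at most 100 records.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 placeholder reaction records split into batches of 100.
reactions = [f"reaction_{i}" for i in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(reactions, 100)]
print(batch_lengths)
```

Larger batches reduce scheduling overhead per reaction, while smaller ones keep the memory held by each worker low.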

#### The Extraction Process

The `extract_rules_from_reactions` function goes through each reaction in the provided database and extracts a rule from it according to the configured settings. The function uses a Ray environment for distributed computing, which lets it handle reactions in batches and parallelize the rule extraction. This approach improves efficiency and scales well to large datasets.

Once the rules are extracted, they are written to RDF files. Additionally, the function sorts the rules based on their popularity and saves this sorted list, providing a valuable resource for retrosynthesis analysis.
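
The popularity-based sorting, combined with the `min_popularity` threshold described earlier, can be sketched as follows. This is an assumption-laden toy: rules are represented by placeholder strings, and `popular_rules` is a hypothetical helper rather than a SynTool function.

```python
from collections import Counter


def popular_rules(rule_per_reaction, min_popularity=3):
    """Count rule occurrences and keep those seen at least `min_popularity` times,
    ordered from most to least popular."""
    counts = Counter(rule_per_reaction)
    return [(rule, n) for rule, n in counts.most_common() if n >= min_popularity]


# Placeholder rule identifiers, one entry per reaction that produced the rule.
extracted = ["amide", "amide", "suzuki", "amide", "suzuki", "amide", "suzuki", "rare"]
print(popular_rules(extracted))
```

Rules seen only once or twice are often artifacts of unusual or mis-mapped reactions, which is one reason to apply a popularity cutoff.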

## 3. Ranking policy training

In [None]:
# folder for storing policy network training results
policy_network_root = root_folder.joinpath("ranking_policy")
policy_dataset_file = os.path.join(policy_network_root, 'policy_dataset.dt')

In [None]:
# reaction data used to build the policy training dataset
molecules_or_reactions_path = reaction_data_path

In [None]:
from SynTool.utils.config import PolicyNetworkConfig
from SynTool.ml.training.supervised import create_policy_dataset, run_policy_training

### Ranking Policy Training in SynTool

After extracting the reaction rules, the next crucial step in SynTool involves training a ranking policy. This policy is a neural network designed to rank retrosynthetic transformations (or retro-rules) based on their suitability for a given reaction. Let's explore this process and the corresponding code in the tutorial.

#### Overview of the Ranking Policy

The ranking policy typically employs a supervised neural network trained on molecular representations like Morgan fingerprints or molecular graphs. The network is trained to perform multi-class classification where, for a given reaction:
- The retro-rule extracted from that reaction is assigned as the positive class.
- All other retro-rules extracted from the reaction database are assigned as negative classes.

This approach biases the network to prioritize transformations that are likely to produce reactions similar to real ones, even without specific reaction conditions. However, it's important to note that reaction databases reflect historical data, and some useful rules might be underrepresented.
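
The multi-class setup described above can be illustrated with a small, dependency-free sketch. The scores below are hypothetical network outputs, and `softmax`, `cross_entropy`, and `rank_rules` are illustrative helpers, not part of SynTool; the actual training in SynTool is handled by `run_policy_training`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, positive_index):
    """Loss is low when the rule observed for the reaction gets high probability."""
    return -math.log(softmax(logits)[positive_index])


def rank_rules(logits, top_k=3):
    """Indices of the `top_k` highest-scoring rules, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]


# Hypothetical network outputs for a database of five retro-rules.
logits = [0.2, 2.5, -1.0, 1.1, 0.0]
print(rank_rules(logits))
print(cross_entropy(logits, positive_index=1) < cross_entropy(logits, positive_index=2))
```

During the search, only the top-ranked rules are typically tried for a molecule, which is why a well-calibrated ranking matters more than the absolute probabilities.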

#### Setting Up the Training Configuration

First, we define the training configuration using the `PolicyNetworkConfig` class. This configuration includes various hyperparameters for the neural network:

In [None]:
training_config = PolicyNetworkConfig(
    batch_size=500,  # The size of each batch of data.
    dropout=0.4,  # Dropout rate for regularization.
    learning_rate=0.0008,  # Learning rate for the training process.
    num_conv_layers=5,  # Number of convolutional layers in the network.
    num_epoch=100,  # Number of epochs for training.
    vector_dim=256,  # Dimensionality of the feature vectors.
    policy_type='ranking',  # The mode of operation, set to 'ranking'.
)

#### Creating the Policy Dataset

Next, we create the policy dataset using the `create_policy_dataset` function. This involves specifying paths to the reaction rules and the reaction data:

In [None]:
datamodule = create_policy_dataset(reaction_rules_path=reaction_rules_path,
                                   molecules_or_reactions_path=molecules_or_reactions_path,
                                   output_path=policy_dataset_file,
                                   dataset_type='ranking',
                                   batch_size=training_config.batch_size,
                                   num_cpus=4)

#### Running the Policy Training

Finally, we train the policy network using the `run_policy_training` function. This step involves feeding the dataset and the training configuration into the network:

In [None]:
run_policy_training(datamodule, # The prepared data module for training.
                    config=training_config, # The training configuration.
                    results_path=policy_network_root # Path to save the training results.
                   ) 

## 4. Tree search with the ranking policy

The fourth part of the SynTool tutorial covers the tree search algorithm driven by the trained ranking policy. This section explains how the `Tree` class is initialized and used for Monte Carlo Tree Search (MCTS) in retrosynthesis, with the ranking policy guiding the search.

In [None]:
from CGRtools import smiles
from IPython.display import SVG, display

from SynTool.utils.config import TreeConfig, PolicyNetworkConfig
from SynTool.mcts.tree import Tree
from SynTool.mcts.expansion import PolicyFunction
from SynTool.interfaces.visualisation import get_route_svg

In [None]:
ranking_policy_weights = os.path.join(policy_network_root, 'policy_network.ckpt')
building_blocks_path = 'building_blocks.smi'

In [None]:
policy_config = PolicyNetworkConfig(weights_path=ranking_policy_weights)
policy_function = PolicyFunction(policy_config=policy_config)

#### Configuring the Tree Search

The tree search is configured using the `TreeConfig` class. Key parameters include:

- **max_iterations**: Defines the total number of iterations the algorithm will perform, essentially setting a limit on how many times the tree search loop can run. Default is 100 iterations.
- **max_tree_size**: Sets the upper limit on the total number of nodes that can exist in the search tree, controlling the size and complexity of the tree. Default value is 10,000 nodes.
- **max_time**: Specifies a time limit for the algorithm's execution, measured in seconds. This prevents the search from running indefinitely. The default time limit is 600 seconds (10 minutes).
- **max_depth**: Determines the maximum depth of the tree, effectively controlling how far the search can go from the root node. The default maximum depth is 6 levels.
- **ucb_type**: Chooses the type of Upper Confidence Bound (UCB) formula used in the search. Options include "puct" (predictor UCT, which weights exploration by the policy prior), "uct" (standard UCT), and "value". The default is "uct".
- **c_ucb**: This is the exploration-exploitation balance coefficient in UCB, which influences how much the algorithm favors exploration of new paths versus exploitation of known paths. The default coefficient is 0.1.
- **backprop_type**: Selects the backpropagation method used during the search. Options are "muzero" (model-based approach) and "cumulative" (cumulative reward approach). The default is "muzero".
- **search_strategy**: Determines the strategy for navigating the tree. Options are "expansion_first" (prioritizing the expansion of new nodes) and "evaluation_first" (prioritizing the evaluation of existing nodes). The default strategy is "expansion_first".
- **exclude_small**: A boolean setting that, when true, excludes small molecules from the search, typically to focus on more complex molecules. The default is set to True.
- **evaluation_agg**: This setting determines how the evaluation scores are aggregated. Options are "max" (using the maximum score) and "average" (using the average score). The default method is "max".
- **evaluation_type**: Defines the method used for node evaluation. Options include "random" (random evaluations), "rollout" (using rollout simulations), and "gcn" (graph convolutional networks). The default is "gcn".
- **init_node_value**: Sets the initial value for newly created nodes in the tree. This can impact how nodes are prioritized during the search. The default initial value is 0.0.
- **epsilon**: This parameter is used in the epsilon-greedy strategy during node selection, representing the probability of choosing a random action for exploration. A higher value leads to more exploration. The default value is 0.0.
- **min_mol_size**: Defines the minimum size of a molecule (in terms of the number of heavy atoms) to be considered in the search. Molecules smaller than this threshold are typically considered as readily available building blocks. The default is set to 6 heavy atoms.
- **silent**: When set to True, this option suppresses the progress output of the tree search, keeping the output clean and focused. The default setting is False.
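
To show how `ucb_type` and `c_ucb` interact, the sketch below implements the textbook UCT and PUCT selection scores. These are the standard forms from the MCTS literature, written here as an assumption; SynTool's exact expressions may differ, and `uct_score` and `puct_score` are hypothetical names.

```python
import math


def uct_score(node_value, parent_visits, child_visits, c_ucb=0.1):
    """Textbook UCT: exploitation term plus a visit-count exploration bonus."""
    return node_value + c_ucb * math.sqrt(math.log(parent_visits) / child_visits)


def puct_score(node_value, prior, parent_visits, child_visits, c_ucb=0.1):
    """Textbook PUCT: the exploration bonus is weighted by the policy prior."""
    return node_value + c_ucb * prior * math.sqrt(parent_visits) / (1 + child_visits)


# Two children with equal value: the less visited one gets the larger UCT score.
rarely_visited = uct_score(0.5, parent_visits=20, child_visits=2)
often_visited = uct_score(0.5, parent_visits=20, child_visits=15)
print(rarely_visited > often_visited)
```

A larger `c_ucb` increases the exploration bonus relative to the node value, so the search spreads over more branches; a smaller value concentrates visits on the branches that already look promising.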

In [None]:
tree_config = TreeConfig(search_strategy="expansion_first",
                         evaluation_type="rollout",
                         min_mol_size=0,
                         init_node_value=0.5,
                         ucb_type="uct",
                         c_ucb=0.1,
                         max_iterations=100,
                         max_depth=9)

In [None]:
example_molecule = 'Cc1cc(C)c(C2=Nn3c(C)nnc3SC2C)cc1C'

target = smiles(example_molecule)
target.canonicalize()
target.clean2d()
target

#### Initializing the Tree

The `Tree` class is initialized with the target molecule, the path to reaction rules, building blocks, the tree configuration, and the policy function. The policy function, obtained from the trained ranking policy, guides the selection of retrosynthetic transformations.

In [None]:
tree = Tree(target=target,
            tree_config=tree_config,
            reaction_rules_path=reaction_rules_path,
            building_blocks_path=building_blocks_path,
            policy_function=policy_function,
            value_function=None)

#### Running the Tree Search

The tree search is executed by iterating over the `Tree` object. Each iteration of the tree explores new nodes and expands the search space, guided by the ranking policy and the MCTS algorithm.

In [None]:
tree_solved = False
for solved, node_id in tree:
    if solved:
        tree_solved = True

In [None]:
tree

#### Retrosynthesis path visualisation

After the tree search is complete, we can visualize the retrosynthesis paths that were found. The visualization uses the `get_route_svg` function imported above from the SynTool visualisation interface.

In [None]:
for n, node_id in enumerate(tree.winning_nodes):
    print(f'-------- Path starts from node #{node_id} with total path score {tree.path_score(node_id)} --------')
    display(SVG(get_route_svg(tree, node_id)))  # render the path ending at this node as SVG
    if n == 3:  # show at most the first four paths
        break