# Ranking policy training

## Introduction

The tree nodes in MCTS are expanded by an expansion function approximated by a policy graph neural network. The policy network is composed of two parts: a molecular representation part and a reaction rule prediction part. In the representation part, the molecular graph is converted to a single vector by graph convolutional layers. The structure of the training set and the architecture of the prediction part depend on the type of policy network: ranking or filtering.

**Ranking policy network**. The training dataset for the ranking policy network consists of pairs of reactions and the reaction rules extracted from them. The products of each reaction are transformed to the CGR encoded as a molecular graph and paired with a one-hot encoded label vector in which the positive label corresponds to the extracted reaction rule. The prediction part ends with a softmax function that generates the “probability of successful application” of each reaction rule to a given input molecular graph, which can be used for “ranking” the reaction rules.
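The one-hot label and softmax output described above can be illustrated with a minimal, library-free sketch. The logits and the rule count are made-up values for illustration, and the cross-entropy loss is an assumption (the usual pairing with softmax and one-hot labels); the actual network computes these quantities on tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(rule_index, num_rules):
    """Label vector with a single positive entry at rule_index."""
    return [1.0 if i == rule_index else 0.0 for i in range(num_rules)]


# Hypothetical raw scores for four reaction rules (illustrative only).
logits = [2.0, 0.5, -1.0, 3.0]
probs = softmax(logits)

# The reaction in this example was produced by rule 3.
label = one_hot(3, len(logits))

# Cross-entropy between the one-hot label and the predicted distribution
# (assumed loss, not confirmed by the text above).
loss = -sum(y * math.log(p) for y, p in zip(label, probs))

# Rule indices sorted from most to least probable: the "ranking".
ranking = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(ranking)
```

Because softmax normalizes the scores to sum to one, the rules compete with each other, which is what makes the output directly usable as a ranking.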

**Filtering policy network**. The training dataset for the filtering policy is formed by applying all reaction rules to the training molecules. The label vector is filled with positive labels in the positions corresponding to the successfully applied reaction rules. The prediction part of the filtering policy consists of two linear layers with a sigmoid function that assign probabilities to the “regular” and the “priority” reaction rules (cyclization reaction rules). These two vectors are then combined with a coefficient α ranging from 0 to 1. This approach ensures that the priority reaction rules receive the highest score, followed by the other regular reaction rules.
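The text does not give the exact formula for combining the two vectors. As a sketch under a stated assumption, the example below uses a convex combination, `(1 - alpha) * regular + alpha * priority`; the function name `combine_scores` and all input values are hypothetical, and the real implementation may weight the vectors differently.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_scores(regular_logits, priority_logits, alpha):
    """Blend per-rule sigmoid outputs of the two heads.

    ASSUMPTION: a convex combination weighted by alpha. The source only
    states that the vectors are combined with a coefficient in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    regular = [sigmoid(x) for x in regular_logits]
    priority = [sigmoid(x) for x in priority_logits]
    return [(1.0 - alpha) * r + alpha * p for r, p in zip(regular, priority)]


# Hypothetical outputs for three rules: rule 0 is applicable and a priority
# rule, rule 1 is applicable but regular, rule 2 is not applicable.
regular_logits = [2.0, 2.0, -2.0]
priority_logits = [3.0, -3.0, -3.0]

scores = combine_scores(regular_logits, priority_logits, alpha=0.5)
print(scores)
```

Unlike softmax, each sigmoid output is independent, so several rules can receive a high score for the same molecule. Under the assumed formula, the applicable priority rule ends up above the applicable regular rule.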
 
<div class="alert alert-info">
**Note**  
    
The filtering policy network requires far more computational resources to generate the training dataset than the ranking policy, but it can be used with any set of reaction rules because the original reaction dataset is not needed. This makes reaction rules portable: rules extracted with other software from any source of reaction data can be used.
</div> 

All the reaction rules predicted by the ranking or filtering policy network are sorted by predicted probability, and the first N reaction rules (usually N = 50) are selected to be applied to the current precursor in the expansion step.
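The selection step can be sketched in a few lines. The function name `select_top_rules` and the probability values are hypothetical; the sketch only shows the sort-and-truncate logic, not how the library applies the selected rules.

```python
def select_top_rules(probabilities, top_n=50):
    """Return (rule_index, probability) pairs for the top_n most probable rules."""
    ranked = sorted(
        enumerate(probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:top_n]


# Hypothetical predicted probabilities for four reaction rules.
predicted = [0.10, 0.60, 0.05, 0.25]

# With N = 2, only the two most probable rules are kept for expansion.
selected = select_top_rules(predicted, top_n=2)
print(selected)
```

If fewer than N rules are available, the slice simply returns all of them in ranked order.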

First, we define the training configuration using the `PolicyNetworkConfig` class. This configuration includes various hyperparameters for the neural network:

In [19]:
from synplan.utils.config import PolicyNetworkConfig
from synplan.ml.training.supervised import create_policy_dataset, run_policy_training

## Configuration
**Configuration parameters**:

| Parameter        | Default | Description                                                     |
|------------------|---------|-----------------------------------------------------------------|
| vector_dim       | 512     | The dimension of the hidden layers                              |
| num_conv_layers  | 5       | The number of convolutional layers                              |
| learning_rate    | 0.0005  | The learning rate                                               |
| dropout          | 0.4     | The dropout value                                               |
| num_epoch        | 100     | The number of training epochs                                   |
| batch_size       | 1000    | The size of the training batch of input molecular graphs        |

The architecture and training hyperparameters of the ranking or filtering policy network can be adjusted in the training configuration below.

In [20]:
training_config = PolicyNetworkConfig(
    policy_type="ranking",  # the type of policy network
    num_conv_layers=5,  # the number of graph convolutional layers in the network
    vector_dim=512,  # the dimensionality of the final embedding vector
    learning_rate=0.0008,  # the learning rate for the training process
    dropout=0.4,  # the dropout rate
    num_epoch=100,  # the number of epochs for training
    batch_size=100,  # the size of the training batch of input data
)

#### Creating the training set

Next, we create the policy dataset using the `create_policy_dataset` function. This involves specifying paths to the reaction rules and the reaction data:

In [21]:
ranking_policy_network_folder = root_folder.joinpath("ranking_policy_network")
ranking_policy_dataset_path = ranking_policy_network_folder.joinpath("ranking_policy_dataset.pt")

datamodule = create_policy_dataset(
    dataset_type="ranking",
    reaction_rules_path=reaction_rules_path,
    molecules_or_reactions_path=filtered_data_path,
    output_path=ranking_policy_dataset_path,
    batch_size=training_config.batch_size,
    num_cpus=4,
)

Number of reactions processed: 69445 [11:33]


Training set size: 52509, validation set size: 13128


#### Run the policy network training

Finally, we train the policy network using the `run_policy_training` function. This step involves feeding the dataset and the training configuration into the network:

In [22]:
run_policy_training(
    datamodule,  # the prepared data module for training
    config=training_config,  # the training configuration
    results_path=ranking_policy_network_folder,  # path to save the training results
)

Weight decoupling enabled in AdaBelief
Rectification enabled in AdaBelief
Epoch 00012: reducing learning rate of group 0 to 6.4000e-04.
Epoch 00016: reducing learning rate of group 0 to 5.1200e-04.
Epoch 00020: reducing learning rate of group 0 to 4.0960e-04.
Epoch 00024: reducing learning rate of group 0 to 3.2768e-04.
Epoch 00028: reducing learning rate of group 0 to 2.6214e-04.
Epoch 00032: reducing learning rate of group 0 to 2.0972e-04.
Epoch 00036: reducing learning rate of group 0 to 1.6777e-04.
Epoch 00040: reducing learning rate of group 0 to 1.3422e-04.
Epoch 00044: reducing learning rate of group 0 to 1.0737e-04.
Epoch 00048: reducing learning rate of group 0 to 8.5899e-05.
Epoch 00052: reducing learning rate of group 0 to 6.8719e-05.
Epoch 00056: reducing learning rate of group 0 to 5.4976e-05.
Epoch 00060: reducing learning rate of group 0 to 5.0000e-05.
Policy network balanced accuracy: 0.985
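The learning rates in the log above are consistent with a multiplicative decay: starting from the configured `learning_rate=0.0008`, each reduction multiplies the rate by 0.8 until it reaches a floor of 5e-05. The factor and the floor are inferred from the logged values, not taken from the library's source, and the helper name `decayed_learning_rates` is hypothetical. The following sketch reproduces the arithmetic:

```python
def decayed_learning_rates(initial_lr, factor, min_lr, num_reductions):
    """Apply repeated multiplicative decay, clamped from below by min_lr."""
    rates = []
    lr = initial_lr
    for _ in range(num_reductions):
        lr = max(lr * factor, min_lr)
        rates.append(lr)
    return rates


# Values inferred from the training log (assumptions, see the text above).
rates = decayed_learning_rates(
    initial_lr=0.0008, factor=0.8, min_lr=5e-5, num_reductions=13
)

# Print in the same scientific notation as the log.
for rate in rates:
    print(f"{rate:.4e}")
```

The thirteenth reduction would give about 4.4e-05 without the floor, which is why the last logged value is exactly 5.0000e-05.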
