# Data curation

This tutorial walks through reaction rule extraction, training of retrosynthetic models (ranking policy network), and retrosynthetic planning in SynPlanner.

### Prerequisites

Two versions of the reaction dataset are available:

1) The standardized USPTO dataset (around 1.3 M reactions), used to prepare a real retrosynthetic planner
2) The USPTO tutorial dataset, reduced to 72K reactions

For educational and demonstration purposes, and on machines with limited computational resources (CPU and RAM), the USPTO tutorial dataset is recommended. In practice, the full reaction dataset can currently be processed only on computational servers.

## 1. Download input data

The input data will be downloaded from the [HuggingFace repository](https://huggingface.co/Laboratoire-De-Chemoinformatique/SynPlanner) to the current directory.

In [None]:
from pathlib import Path
from synplan.utils.loading import download_all_data

# replace this path with the folder where you want to keep all your results
root_folder = Path(".").resolve()
root_folder.mkdir(parents=True, exist_ok=True)

download_all_data(save_to=root_folder)
tutorial_folder = root_folder.joinpath("tutorial")
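
After the download finishes, it can be useful to check what actually landed on disk. The helper below is a minimal sketch using only the standard library. `list_downloaded_files` is a hypothetical name, not part of SynPlanner, and it makes no assumption about which files the repository contains.

```python
from pathlib import Path


def list_downloaded_files(folder: Path) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under `folder`."""
    folder = Path(folder)
    return sorted(
        (p.relative_to(folder).as_posix(), p.stat().st_size)
        for p in folder.rglob("*")
        if p.is_file()
    )


# Usage (hypothetical): pass `root_folder` from the cell above.
# for name, size in list_downloaded_files(root_folder):
#     print(f"{size:>12,d}  {name}")
```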

## 2. Reaction data standardization

The reaction data are standardized using an original protocol for reaction data curation
[published earlier](https://doi.org/10.1002/minf.202100119). This protocol includes two layers:
standardization of individual molecules (reactants, reagents, products) and reaction standardization.
The steps for standardizing individual molecules are as follows:

* dearomatization
* isotope removal
* stereo mark removal
* explicit hydrogen removal
* small fragment removal
* solvent removal
* salt stripping
* charge neutralization
* functional group standardization
* valence checking 
* aromatization

The reaction standardization layer consists of the following steps:

* reaction role assignment
* reaction equation balancing
* atom-to-atom mapping
* duplicate reactions removal
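
The two layers can be pictured as an ordered chain of steps, each of which either returns a cleaned record or rejects it. The sketch below is a toy illustration on plain reaction SMILES strings using only the standard library. The step functions (`strip_whitespace`, `require_products`) and `run_pipeline` are hypothetical stand-ins and do not reproduce SynPlanner's chemistry-aware standardizers.

```python
from typing import Callable, Iterable, Optional

Step = Callable[[str], Optional[str]]


def strip_whitespace(rxn: str) -> Optional[str]:
    """Toy cleaning step: trim surrounding whitespace, reject empty records."""
    return rxn.strip() or None


def require_products(rxn: str) -> Optional[str]:
    """Toy validity check: reject records without a reactant or product side."""
    parts = rxn.split(">")
    return rxn if len(parts) == 3 and parts[0] and parts[2] else None


def run_pipeline(reactions: Iterable[str], steps: list[Step]) -> list[str]:
    """Apply the steps in order, drop rejected records, then remove duplicates."""
    seen, result = set(), []
    for rxn in reactions:
        current: Optional[str] = rxn
        for step in steps:
            current = step(current)
            if current is None:  # a step rejected the record
                break
        if current is not None and current not in seen:
            seen.add(current)
            result.append(current)
    return result


raw = ["CCO.CC(=O)O>>CC(=O)OCC ", "CCO.CC(=O)O>>CC(=O)OCC", "CCO>>", "  "]
cleaned = run_pipeline(raw, [strip_whitespace, require_products])
print(cleaned)
```

Running duplicate removal last, as in the list above, means that records which only differ before cleaning collapse into a single entry.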

<div class="alert alert-info">
**Note**  
    
In this tutorial, the input data have already been standardized with a protocol that differs slightly from the one in the original paper: it omits the major tautomer selection performed by the ChemAxon standardizer.
</div>

### Reaction atom mapping
Reaction atom-to-atom mapping (AAM) in SynPlanner is performed with GraphormerMapper,
a new AAM algorithm based on a transformer neural network adapted for the direct processing of molecular graphs
as sets of atoms and bonds, as opposed to SMILES/SELFIES sequence-based approaches, in combination with the
Bidirectional Encoder Representations from Transformers (BERT) network. The graph transformer extracts molecular features that are tied to atoms and bonds, and the BERT network learns chemical transformations.
In a [benchmarking study](https://doi.org/10.1021/acs.jcim.2c00344), GraphormerMapper was shown to outperform
the state-of-the-art IBM RxnMapper algorithm on the “Golden” benchmarking data set
(total correctly mapped reactions 89.5% vs. 84.5%).
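
To make the output of atom-to-atom mapping concrete: in a mapped reaction SMILES, each atom carries a map number written as `:n` inside its brackets, and the same number marks the same atom on both sides of the reaction. The sketch below is a standard-library illustration of reading those numbers with a regular expression. It is not how GraphormerMapper works, and `map_numbers` and `mapping_summary` are hypothetical helper names.

```python
import re

# Matches a bracket atom carrying an atom-map number, e.g. "[CH3:1]" -> "1".
MAPPED_ATOM = re.compile(r"\[[^\]]*:(\d+)\]")


def map_numbers(smiles: str) -> set[int]:
    """Collect the atom-map numbers found in one side of a reaction."""
    return {int(n) for n in MAPPED_ATOM.findall(smiles)}


def mapping_summary(reaction_smiles: str) -> dict[str, set[int]]:
    """Compare map numbers of reactants and products ("reactants>reagents>products")."""
    reactants, _reagents, products = reaction_smiles.split(">")
    r, p = map_numbers(reactants), map_numbers(products)
    return {"kept": r & p, "lost": r - p, "unmatched_in_products": p - r}


# Acetic acid + methanol -> methyl acetate (water is not written on the product side).
rxn = "[CH3:1][C:2](=[O:3])[OH:4].[CH3:5][OH:6]>>[CH3:1][C:2](=[O:3])[O:6][CH3:5]"
summary = mapping_summary(rxn)
print(summary)
```

A product atom whose number has no counterpart among the reactants would appear under `unmatched_in_products`, which is one simple symptom of an unbalanced or incorrectly mapped record.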

### Configuration

The reaction standardization protocol can be adjusted through a YAML configuration file. The following 14 options are available:

| Reaction standardizer       | Description                                                     |
|-----------------------------|-----------------------------------------------------------------|
| reaction_mapping_config     | Maps atoms of the reaction using GraphormerMapper               |
| functional_groups_config    | Standardizes functional groups                                  |
| kekule_form_config          | Transforms molecules to Kekule form when possible               |
| check_valence_config        | Checks atom valences                                            |
| implicify_hydrogens_config  | Removes hydrogen atoms                                          |
| check_isotopes_config       | Checks and cleans isotope atoms when possible                   |
| split_ions_config           | Splits ions in reaction when possible                           |
| aromatic_form_config        | Transforms molecules to aromatic form when possible             |
| mapping_fix_config          | Fixes atom-to-atom mapping in reaction when needed and possible |
| unchanged_parts_config      | Removes unchanged parts in reaction                             |
| small_molecules_config      | Removes small molecules from reaction                           |
| remove_reagents_config      | Removes reagents from reaction                                  |
| rebalance_reaction_config   | Rebalances reaction                                             |
| duplicate_reaction_config   | Removes duplicate reactions                                     |


<div class="alert alert-info">
**Note**  
    
A standardizer is activated when its name (`..._config`) is listed in the configuration file `standardization.yaml` or passed to `ReactionStandardizationConfig` (see below).
</div>
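
The "listed means active" convention can be modelled with optional fields. The sketch below is a toy stand-in built with standard-library dataclasses. `StandardizationToyConfig`, `KekuleFormToy`, and `CheckValenceToy` are hypothetical classes, and representing a switched-off step as `None` is an assumption made for illustration, not a description of how SynPlanner stores its settings.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class KekuleFormToy:
    """Placeholder settings object for one step."""


@dataclass
class CheckValenceToy:
    """Placeholder settings object for one step."""


@dataclass
class StandardizationToyConfig:
    # Assumption: a field left as None means "this step is switched off".
    kekule_form_config: Optional[KekuleFormToy] = None
    check_valence_config: Optional[CheckValenceToy] = None

    def active_steps(self) -> list[str]:
        """Names of the steps that were explicitly provided."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


config = StandardizationToyConfig(kekule_form_config=KekuleFormToy())
print(config.active_steps())
```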


### Setting up the reaction standardization configuration

All configuration objects and functions can be imported from the `synplan.chem.data.standardizing` module:

In [6]:
from synplan.chem.data.standardizing import (
    ReactionStandardizationConfig,  # the main config class
    standardize_reactions_from_file,  # reaction standardization function
    # reaction standardizers:
    ReactionMappingConfig,
    FunctionalGroupsConfig,
    KekuleFormConfig,
    CheckValenceConfig,
    ImplicifyHydrogensConfig,
    CheckIsotopesConfig,
    AromaticFormConfig,
    MappingFixConfig,
    UnchangedPartsConfig,
    DuplicateReactionConfig,
)

The next step is to configure the reaction standardization process with the `ReactionStandardizationConfig` class, which collects the settings of every standardizer to be applied.

In [7]:
standardization_config = ReactionStandardizationConfig(
    reaction_mapping_config=ReactionMappingConfig(),
    functional_groups_config=FunctionalGroupsConfig(),
    kekule_form_config=KekuleFormConfig(),
    check_valence_config=CheckValenceConfig(),
    implicify_hydrogens_config=ImplicifyHydrogensConfig(),
    check_isotopes_config=CheckIsotopesConfig(),
    aromatic_form_config=AromaticFormConfig(),
    mapping_fix_config=MappingFixConfig(),
    unchanged_parts_config=UnchangedPartsConfig(),
    duplicate_reaction_config=DuplicateReactionConfig(),
)

As mentioned before, you do not have to provide every standardization config to `ReactionStandardizationConfig`. For example, to perform only atom mapping, two configs are enough:

``` python 

standardization_config = ReactionStandardizationConfig(
    reaction_mapping_config=ReactionMappingConfig(),
    mapping_fix_config=MappingFixConfig(),
)
```

### Running the standardization

Once this standardization configuration is in place, we can proceed to apply these standardizers to our reaction data:

In [8]:
reaction_data = tutorial_folder.joinpath("uspto_tutorial.smi").resolve(strict=True)
standardized_data = tutorial_folder.joinpath("data_curation/uspto_standardized.smi").resolve()

standardize_reactions_from_file(
    config=standardization_config,
    input_reaction_data_path=reaction_data,
    standardized_reaction_data_path=standardized_data,
    num_cpus=4,
    batch_size=100,
)

Number of reactions processed: 71832 [08:45]


Initial number of parsed reactions: 71832
Standardized number of reactions: 69446
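
The numbers in this log can be turned into a removal rate and a rough throughput. The snippet below is plain arithmetic on the values printed above, assuming the bracketed `[08:45]` is elapsed time in minutes and seconds.

```python
initial = 71832          # "Initial number of parsed reactions"
standardized = 69446     # "Standardized number of reactions"
elapsed_s = 8 * 60 + 45  # assumed meaning of [08:45]: 8 min 45 s

removed = initial - standardized
removed_share = removed / initial
throughput = initial / elapsed_s

print(f"removed:    {removed} ({removed_share:.2%})")
print(f"retained:   {standardized / initial:.2%}")
print(f"throughput: {throughput:.0f} reactions/s")
```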


<div class="alert alert-info">
**Note**  
    
We do not recommend using more than 4-8 CPUs during data curation: the bottleneck is I/O, which is not yet parallelized. If your machine does not have enough RAM, reduce `batch_size`.
</div>
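
Why a smaller `batch_size` lowers memory use can be seen from a generic batching loop: only one batch of records is held in memory at a time. The sketch below uses only the standard library. `batched` is a hypothetical helper, and how SynPlanner batches reactions internally is not shown in this tutorial.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


reactions = (f"rxn_{i}" for i in range(250))  # stand-in for lines read lazily from a file
sizes = [len(batch) for batch in batched(reactions, batch_size=100)]
print(sizes)
```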

## 3. Reaction data filtration

In SynPlanner, reaction data filtration is a crucial step to ensure the quality and accuracy of the data used for retrosynthetic analysis. The USPTO dataset, a standardized but unfiltered collection of reaction records, serves as the primary data source. However, this dataset may contain records with no reaction center or atom-to-atom mapping errors.

### Configuration

The current version of SynPlanner includes 11 reaction filters, listed in the table below.
The first column gives the name under which each filter must be listed in the configuration file to be activated.

| Reaction filter                 | Description                                                                                                |
|---------------------------------|------------------------------------------------------------------------------------------------------------|
| compete_products_config         | Checks if there are competing reactions                                                                    |
| dynamic_bonds_config            | Checks if there is an unacceptable number of dynamic bonds in the Condensed Graph of Reaction (CGR)        |
| small_molecules_config          | Checks if there are only small molecules in the reaction or if there is only one small reactant or product |
| cgr_connected_components_config | Checks if the CGR contains unrelated components (without reagents)                                         |
| rings_change_config             | Checks if the number of rings changes in the reaction                                                      |
| strange_carbons_config          | Checks if there are 'strange' carbons in the reaction                                                      |
| no_reaction_config              | Checks if there is no reaction in the provided reaction container                                          |
| multi_center_config             | Checks if there is a multicenter reaction                                                                  |
| wrong_ch_breaking_config        | Checks for incorrect C-C bond formation from breaking a C-H bond                                           |
| cc_sp3_breaking_config          | Checks if there is C(sp3)-C bond breaking                                                                  |
| cc_ring_breaking_config         | Checks if a reaction involves ring C-C bond breaking                                                       |




### Setting up the reaction filtration configuration

In this section, we will walk through the steps to configure and apply a reaction filtration process using the `ReactionFilterConfig` class from the SynPlanner library. This class is essential for specifying various parameters and settings needed for the filtration.

In [9]:
from synplan.chem.data.filtering import (
    ReactionFilterConfig,  # the main config class
    filter_reactions_from_file,  # reaction filtration function
    # reaction filters:
    CCRingBreakingConfig,
    WrongCHBreakingConfig,
    CCsp3BreakingConfig,
    DynamicBondsConfig,
    MultiCenterConfig,
    NoReactionConfig,
    SmallMoleculesConfig,
)

Next, we need to specify the parameters and settings for the reaction filtration process. These parameters can be customized according to your needs. Here’s how you can set up the configuration:

In [10]:
filtration_config = ReactionFilterConfig(
    dynamic_bonds_config=DynamicBondsConfig(
        min_bonds_number=1, # minimum number of dynamic bonds for a reaction
        max_bonds_number=6, # maximum number of dynamic bonds for a reaction
    ),  
    no_reaction_config=NoReactionConfig(),  
    multi_center_config=MultiCenterConfig(),  
    wrong_ch_breaking_config=WrongCHBreakingConfig(),  
    cc_sp3_breaking_config=CCsp3BreakingConfig(),
    cc_ring_breaking_config=CCRingBreakingConfig(),
)

By setting up `filtration_config`, we are essentially telling SynPlanner what filters to apply and how to apply them. This step is crucial for ensuring that the data we use for further analysis is as accurate and reliable as possible. The reaction filters we apply here are based on the specific needs of our analysis and the characteristics of the USPTO dataset.
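
As an illustration of what a parameterized filter decides, the sketch below mimics the range check implied by `min_bonds_number` and `max_bonds_number`. `DynamicBondsToy` is a hypothetical class and the inclusive bounds are an assumption. The real filter counts dynamic bonds on the Condensed Graph of Reaction, which is replaced here by a precomputed integer.

```python
from dataclasses import dataclass


@dataclass
class DynamicBondsToy:
    min_bonds_number: int = 1
    max_bonds_number: int = 6

    def is_filtered(self, n_dynamic_bonds: int) -> bool:
        """Return True when the reaction should be removed (bounds assumed inclusive)."""
        return not (self.min_bonds_number <= n_dynamic_bonds <= self.max_bonds_number)


toy = DynamicBondsToy(min_bonds_number=1, max_bonds_number=6)
flags = {n: toy.is_filtered(n) for n in (0, 1, 6, 7)}
print(flags)
```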

Once this configuration is in place, we can proceed to apply these filters to our reaction data:

In [11]:
# previously standardized reaction data file
standardized_data = tutorial_folder.joinpath("data_curation/uspto_standardized.smi").resolve(strict=True)
# filtered reaction data file
filtered_data_path = tutorial_folder.joinpath("data_curation/uspto_filtered.smi")

filter_reactions_from_file(
    config=filtration_config,
    input_reaction_data_path=standardized_data, 
    filtered_reaction_data_path=filtered_data_path,
    num_cpus=4,
    batch_size=100,
)

Number of reactions processed: 69446 [04:45]


Initial number of reactions: 69446
Removed number of reactions: 1
