# Step 1: Data Curation

This tutorial demonstrates how to prepare data (reaction standardization and filtration) before reaction rule extraction and retrosynthetic model training in ``SynPlanner``.

## Basic recommendations

**1. Always do reaction data filtration**

Reaction data filtration is a crucial step in the reaction data curation pipeline. It ensures the validity of the extracted reaction rules and is required for the code to run correctly (some erroneous reactions may crash the current version of ``SynPlanner``). Always filter the reaction data before extracting reaction rules and training retrosynthetic models.

**2. Input and output reaction representation can be different after filtration**

The current version of the reaction data filtration protocol in ``SynPlanner`` includes some functions for additional standardization of input reactions. As a result, the output reaction SMILES of a reaction that passes all the filters may not exactly match the input reaction SMILES.

**3. Do not use more than 4 CPUs**

The current version of ``SynPlanner`` does not use memory efficiently when running on many CPUs (this problem will be fixed in future versions). Moreover, overall performance is limited by the reading and parsing of input SMILES, which is not parallelized yet. For these reasons, it is recommended to use no more than 4 CPUs for the data curation steps.
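
The recommendation above can be enforced with a small helper that caps the requested CPU count. This is a minimal sketch using only the standard library; `pick_num_cpus` and `MAX_CURATION_CPUS` are hypothetical names introduced here for illustration and are not part of ``SynPlanner``.

```python
import os

MAX_CURATION_CPUS = 4  # upper bound recommended in this tutorial (hypothetical constant)


def pick_num_cpus(requested=None, limit=MAX_CURATION_CPUS):
    """Return a CPU count that exceeds neither `limit` nor the machine's core count."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    wanted = available if requested is None else max(1, requested)
    return min(wanted, available, limit)


num_cpus = pick_num_cpus()
print(num_cpus)
```

The resulting value can be passed as `num_cpus` to the standardization and filtration calls later in this tutorial.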

## 1. Set up input and output data locations

The ``SynPlanner`` input data will be downloaded from the ``HuggingFace`` repository to the specified directory.

In [1]:
import os
import shutil
from pathlib import Path
from synplan.utils.loading import download_all_data

# download SynPlanner data
data_folder = Path("synplan_data").resolve()
download_all_data(save_to=data_folder)

# results folder
results_folder = Path("tutorial_results").resolve()
results_folder.mkdir(exist_ok=True)

# input data
original_data_path = data_folder.joinpath("uspto/uspto_standardized.smi").resolve(strict=True) # replace with custom data if needed
shutil.copy(original_data_path, results_folder.joinpath('uspto_original.smi')) # copy original data to the results folder for consistency

# output_data
standardized_data_path = results_folder.joinpath("uspto_standardized.smi")
filtered_data_path = results_folder.joinpath("uspto_filtered.smi")

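
Before running the pipeline on custom data, it can help to look at the first few lines of the input file. The sketch below assumes the `.smi` file stores one reaction SMILES per line, possibly followed by whitespace-separated extra fields; `preview_reactions` and `looks_like_reaction` are hypothetical helpers, not ``SynPlanner`` functions, and the format check is only a rough heuristic.

```python
from itertools import islice
from pathlib import Path


def looks_like_reaction(line):
    """Rough format check: a reaction SMILES contains exactly two '>' separators."""
    fields = line.split()
    return bool(fields) and fields[0].count(">") == 2


def preview_reactions(path, n=3):
    """Return the first `n` non-empty lines of a reaction SMILES file."""
    with Path(path).open(encoding="utf-8") as handle:
        non_empty = (line.rstrip("\n") for line in handle if line.strip())
        return list(islice(non_empty, n))
```

In this tutorial, `preview_reactions(original_data_path)` would return the first three reactions of the downloaded dataset.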

## 2. Reaction standardization

The reaction data standardization protocol includes the standardization of individual molecules (reagents, reactants, and products) and the standardization of reactions (e.g. reaction equation balancing). 

More details about the reaction standardization protocol in ``SynPlanner`` can be found in the <a href="https://synplanner.readthedocs.io/en/latest/methods/standardization.html">official documentation</a>.


<div class="alert alert-info">
<b>Note</b>

In this tutorial, the input data are already standardized by a slightly different protocol. It omits major tautomer selection done by ChemAxon standardizer.
</div>

### Standardization configuration

The next step is to configure the reaction standardization process. We do this using the `ReactionStandardizationConfig` class in ``SynPlanner``. This class allows for the specification of various parameters and settings for the standardization process.

More details about the reaction standardization configuration in ``SynPlanner`` can be found in the <a href="https://synplanner.readthedocs.io/en/latest/configuration/standardization.html">official documentation</a>.

In [2]:
from synplan.utils.logging import init_logger

# Initialize the logger before importing the standardizing module
logger, log_file_path = init_logger(
    name="synplan",
    console_level="ERROR",
    file_level="INFO",
)


from synplan.chem.data.standardizing import (
    ReactionStandardizationConfig, # the main config class
    standardize_reactions_from_file, # reaction standardization function
    # reaction standardizers
    ReactionMappingConfig,
    KekuleFormConfig,
    CheckValenceConfig,
    ImplicifyHydrogensConfig,
    CheckIsotopesConfig,
    AromaticFormConfig,
    MappingFixConfig,
    UnchangedPartsConfig,
    DuplicateReactionConfig,
)

# specify the list of applied reaction standardizers
standardization_config = ReactionStandardizationConfig(
    reaction_mapping_config=ReactionMappingConfig(),
    kekule_form_config=KekuleFormConfig(),
    check_valence_config=CheckValenceConfig(),
    implicify_hydrogens_config=ImplicifyHydrogensConfig(),
    check_isotopes_config=CheckIsotopesConfig(),
    aromatic_form_config=AromaticFormConfig(),
    mapping_fix_config=MappingFixConfig(),
    unchanged_parts_config=UnchangedPartsConfig(),
    duplicate_reaction_config=DuplicateReactionConfig(),
)

<div class="alert alert-info">
<b>Note</b>

If the reaction standardizer name (`..._config`) is listed in the `ReactionStandardizationConfig` (see above), it means that this standardizer will be activated.
</div>

As noted above, only the standardizers listed in the config are applied. For example, to perform only reaction mapping, specify a single config in `ReactionStandardizationConfig`:

``` python 

standardization_config = ReactionStandardizationConfig(
    reaction_mapping_config=ReactionMappingConfig(),
)
```

### Running standardization

Once this standardization configuration is in place, we can proceed to apply these standardizers to the source reaction data:

In [None]:
standardize_reactions_from_file(
    config=standardization_config,
    input_reaction_data_path=original_data_path, # original input data
    standardized_reaction_data_path=standardized_data_path, # standardized output data
    silent=False,
    num_cpus=4,
    batch_size=100,
    worker_log_level="INFO",
    log_file_path=log_file_path
)
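
After the run finishes, comparing the number of reactions in the input and output files gives a quick sanity check. The sketch below assumes one reaction per non-empty line in each `.smi` file; `count_reactions` and `retention_report` are hypothetical helpers written for this illustration.

```python
from pathlib import Path


def count_reactions(path):
    """Count the non-empty lines (assumed to be one reaction each) in a file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def retention_report(input_path, output_path):
    """Return (input count, output count, fraction of reactions kept)."""
    n_in = count_reactions(input_path)
    n_out = count_reactions(output_path)
    kept = n_out / n_in if n_in else 0.0
    return n_in, n_out, kept
```

In this tutorial, the call would be `retention_report(original_data_path, standardized_data_path)`.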

## 3. Reaction filtration

In ``SynPlanner``, reaction data filtration is a crucial step to ensure the validity of reaction rules used in retrosynthetic planning.

More details about the reaction filtration protocol in ``SynPlanner`` can be found in the <a href="https://synplanner.readthedocs.io/en/latest/methods/filtration.html">official documentation</a>.

### Filtration configuration

The next step is to configure the reaction filtration process. We do this using the `ReactionFilterConfig` class in ``SynPlanner``. This class allows for the specification of various parameters and settings for the filtration process.

More details about the reaction filtration configuration in ``SynPlanner`` can be found in the <a href="https://synplanner.readthedocs.io/en/latest/configuration/filtration.html">official documentation</a>.

In [4]:
from synplan.chem.data.filtering import (
    ReactionFilterConfig,  # the main config class
    filter_reactions_from_file,  # reaction filtration function
    # reaction filters:
    CCRingBreakingConfig,
    WrongCHBreakingConfig,
    CCsp3BreakingConfig,
    DynamicBondsConfig,
    MultiCenterConfig,
    NoReactionConfig,
)

# specify the list of applied reaction filters
filtration_config = ReactionFilterConfig(
    dynamic_bonds_config=DynamicBondsConfig(
        min_bonds_number=1, # minimum number of dynamic bonds for a reaction
        max_bonds_number=6, # maximum number of dynamic bonds for a reaction
    ),  
    no_reaction_config=NoReactionConfig(),  
    multi_center_config=MultiCenterConfig(),  
    wrong_ch_breaking_config=WrongCHBreakingConfig(),  
    cc_sp3_breaking_config=CCsp3BreakingConfig(),
    cc_ring_breaking_config=CCRingBreakingConfig(),
)

<div class="alert alert-info">
<b>Note</b>

If the reaction filter name (`..._config`) is listed in the `ReactionFilterConfig` (see above), it means that this filter will be activated.
</div>

### Running filtration

Once the filtration configuration is in place, we can proceed to apply these filters to the source reaction data:

In [5]:
filter_reactions_from_file(
    config=filtration_config,
    input_reaction_data_path=standardized_data_path, # standardized input data
    filtered_reaction_data_path=filtered_data_path, # filtered output data
    num_cpus=4,
    batch_size=100,
)

Number of reactions processed: 1314804 [1:38:05]


Initial number of reactions: 1314804
Removed number of reactions: 295500


## Results

If the tutorial runs successfully, the results folder will contain three reaction data files:
- original reaction data
- standardized reaction data
- filtered reaction data

In [6]:
sorted(Path(results_folder).iterdir(), key=os.path.getmtime, reverse=False)

[PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_original.smi'),
 PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_standardized.smi'),
 PosixPath('/home1/dima/synplanner/tutorials/tutorial_results/uspto_filtered.smi')]
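
The same check can be done programmatically, for example at the start of the next tutorial step. This is a minimal sketch that assumes the three file names used in this tutorial; `missing_outputs` is a hypothetical helper, not a ``SynPlanner`` function.

```python
from pathlib import Path

# File names produced in this tutorial
EXPECTED_FILES = (
    "uspto_original.smi",
    "uspto_standardized.smi",
    "uspto_filtered.smi",
)


def missing_outputs(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in `folder`."""
    folder = Path(folder)
    return [
        name
        for name in expected
        if not (folder / name).is_file() or (folder / name).stat().st_size == 0
    ]
```

An empty list from `missing_outputs(results_folder)` indicates that all three files were written and are non-empty.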