<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">


<img src="../Images/logo.png" alt="RetroChem Logo" style="max-width: 100%; height: auto;">

# **Introduction**

The following Jupyter notebook briefly presents **RetroChem**, a pip-installable Python package designed for retrosynthetic analysis. This package was developed to assist chemists and chemical engineers in predicting possible synthetic pathways for target molecules, using a machine learning model trained on the USPTO-50K dataset. Retrosynthesis is a central concept in organic chemistry, enabling the design of efficient synthetic routes by working backward from the desired product.

This package was created as a collaborative project for the EPFL course Practical Programming in Chemistry. [![GitHub3](https://img.shields.io/badge/EPFL-CH200-red.svg)](https://edu.epfl.ch/studyplan/en/bachelor/chemistry-and-chemical-engineering/coursebook/practical-programming-in-chemistry-CH-200)

Before diving into the code and functionalities of the package, let’s briefly explore the motivations and core concepts that shaped its development.

# **How RetroChem came to mind**

The idea for RetroChem emerged from our shared interest in organic synthesis and the growing importance of computational tools in modern chemistry. During our organic chemistry courses and laboratories, we often faced the challenge of synthesizing a target molecule from known reactants, a task that requires extensive chemical expertise and is very time-consuming.

At first, we envisioned RetroChem as a tool that would search through a large database of known reactions, both organic and inorganic, to identify possible transformations for a given target molecule. The idea was to use the most comprehensive reaction datasets available and search whether a synthesis existed for the molecule in question.

However, we quickly realized the scale of this task. The chemical universe is immensely large: it’s estimated there are up to 10⁶⁰ possible compounds. Even the most complete databases, such as CAS, which contains over 70 million registered compounds, are just a small fraction of that space. Searching such a large database for each input would not only be computationally intensive, potentially taking multiple minutes for even simple queries, but also fundamentally limited in scope.

This insight led us to turn toward **machine learning**. Instead of exhaustively searching for known reactions, we decided to train a model that could generalize from reaction data and predict retrosynthetic steps based on learned patterns. This approach allows RetroChem to make educated predictions even for molecules it has never seen before. 

# **Step 1: Training the model**

### General Pipeline

* **Data Loading**: We began by loading three preprocessed datasets derived from **USPTO-50K**: training, validation, and test. Each file contains cleaned reaction SMILES strings representing the chemical transformations, along with their associated reaction templates. 

* **Data Merging**: The three datasets were merged into a single DataFrame and saved as **combined_data.csv**. Duplicate reactions were removed during this step to avoid redundant entries and prevent data leakage.

* **Fingerprint Generation**: Before a machine learning model can understand molecules, we need to convert them from their chemical structure (SMILES format) into a numerical form. To do this, we use **Morgan fingerprints**.
These fingerprints are binary vectors that represent the presence or absence of specific structural patterns or substructures within the molecule. In our case:
    * We use a radius of 3, which means we look at circular substructures around each atom up to 3 bonds away.
    * We generate a 2048-bit vector, where each bit corresponds to a certain chemical feature.
    * For each valid molecule (reactant or product), we generate such a fingerprint vector.

These vectors become the input X to the machine learning model; this is how the model sees molecules.

* **Label Preparation**: For the model to learn what kind of transformation (reaction template) a molecule underwent, we also need to provide a target label. These labels, called template hashes, are strings that uniquely identify the type of reaction used. However, machine learning models don’t work with string labels; they require numbers. To solve this, we use scikit-learn’s LabelEncoder, which converts each unique string into a unique integer. This step produces the output vector y, which the model uses to learn how to classify different types of reactions.

Together, X and y now represent the complete training data: X contains the structural features of molecules, and y contains the corresponding reaction template the model is expected to predict.

* **Dataset Splitting**: The data was split into training (70%), validation (15%), and test (15%) sets.

* **Normalization**: The input vectors were standardized using **StandardScaler** to help the neural network learn more effectively.

* **Model Training**: A multi-layer perceptron (**MLPClassifier** from the scikit-learn library) was trained with three hidden layers. Early stopping was used to prevent overfitting, and training progress was monitored using the loss curve. Cross-validation was used to maximize the accuracy of the model.

* **Evaluation**: The model was evaluated on both the validation and test sets using accuracy as the main metric.

* **Saving Outputs**: Finally, the trained model, together with the scaler and label encoder, was saved to disk for use in future prediction steps.
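
The label encoding, splitting, scaling, and training steps above can be sketched as follows. This is a minimal sketch on synthetic data, not the RetroChem training script: the random bit vectors, the `hash_*` labels, the layer sizes `(64, 32, 16)`, the random seeds, and the choice to fit the scaler on the training split only are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for fingerprints: 300 random 64-bit vectors (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 64))

# Synthetic string labels playing the role of template hashes (assumption).
names = np.array(["hash_a", "hash_b", "hash_c"])
y_str = names[X[:, 0] + X[:, 1]]

# Map each unique string label to an integer class index.
encoder = LabelEncoder()
y = encoder.fit_transform(y_str)

# 70% train, then split the remaining 30% in half: 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=0, stratify=y_tmp
)

# Fit the scaler on the training split, then apply it to all three splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s, X_test_s = (
    scaler.transform(a) for a in (X_train, X_val, X_test)
)

# Three hidden layers with early stopping (layer sizes are illustrative).
model = MLPClassifier(
    hidden_layer_sizes=(64, 32, 16),
    early_stopping=True,
    max_iter=200,
    random_state=0,
)
model.fit(X_train_s, y_train)

val_acc = model.score(X_val_s, y_val)
test_acc = model.score(X_test_s, y_test)
print(f"Validation accuracy: {val_acc:.2f} | Test accuracy: {test_acc:.2f}")
```

To turn a predicted integer back into its string label, `encoder.inverse_transform` can be applied to the model output.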

### Functions to train the model

In this part we will focus on the functions used to train the model and take a closer look at how they work using simple examples.

Before that, we can run the following block to suppress non-critical warnings from RDKit and Streamlit. This helps keep the notebook output clean and focused as we move forward.

In [2]:
import warnings
from rdkit import RDLogger
import logging

# Hide Streamlit warnings for more readable output
logging.getLogger('streamlit').setLevel(logging.ERROR)

# Hide RDKit warnings for more readable output
warnings.filterwarnings("ignore", category=DeprecationWarning)
RDLogger.DisableLog('rdApp.*')

<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">
  <li>
    <strong>remove_atom_mapping(smiles)</strong>:
    <br><br>
    This function removes atom mappings (the ":number") from SMILES.<br>
    <br>&nbsp;&nbsp;&nbsp;&nbsp;1. Uses a regular expression to identify ":number" mappings.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;2. Removes these mappings from the SMILES string.<br>
    <br>It returns the SMILES string without any atom mapping.<br><br>
  </li>
</div>


In [3]:
from RetroChem.Package_functions.Model_training_functions import remove_atom_mapping

mapped_rxn = "[CH3:1][CH2:2][OH:3]>>[CH3:1][CH:2]=[O:3]"
clean_rxn = remove_atom_mapping(mapped_rxn)

print(f"Original: {mapped_rxn}")
print(f"Cleaned : {clean_rxn}")


Original: [CH3:1][CH2:2][OH:3]>>[CH3:1][CH:2]=[O:3]
Cleaned : [CH3][CH2][OH]>>[CH3][CH]=[O]
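
A regular-expression approach like the one described above could look like the following. This is a minimal sketch and not the package's actual implementation: the helper name `strip_atom_maps` and the exact pattern are assumptions. The lookahead restricts matches to a `:number` that sits directly before a closing bracket, so aromatic-bond colons outside bracket atoms are left untouched.

```python
import re

# Matches ":<digits>" only when directly followed by "]", i.e. an atom map
# inside a bracket atom. Hypothetical helper, not part of RetroChem.
_ATOM_MAP = re.compile(r":\d+(?=\])")


def strip_atom_maps(smiles: str) -> str:
    """Return the SMILES string with atom-map numbers removed."""
    return _ATOM_MAP.sub("", smiles)


cleaned = strip_atom_maps("[CH3:1][CH2:2][OH:3]>>[CH3:1][CH:2]=[O:3]")
print(cleaned)  # [CH3][CH2][OH]>>[CH3][CH]=[O]
```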


<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">
  <li>
    <strong>split_rxn_smiles(rxn_smiles)</strong>:
    <br>
    <br>This function splits a reaction SMILES into two parts: reactants and products.<br>
    <br>&nbsp;&nbsp;&nbsp;&nbsp;1. Splits the reaction at <code>&gt;&gt;</code> into reactants and products.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;2. Separates individual molecules by <code>.</code> in each part.<br>
    <br>The output is two lists: one for reactants and one for products.<br><br>
  </li>
</div>


In [None]:
from RetroChem.Package_functions.Model_training_functions import split_rxn_smiles

rxn_smiles = "CC(C)CO.O=C=O>>CC(C)COC(=O)O"

reactants, products = split_rxn_smiles(rxn_smiles)
print(f"Reactants: {reactants}")
print(f"Products : {products}")


Reactants: ['CC(C)CO', 'O=C=O']
Products : ['CC(C)COC(=O)O']
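
The two splitting steps can be written with plain string operations. The sketch below is an illustration only: the name `split_reaction` is hypothetical, and it assumes the `reactants>>products` form without an agents field between the two `>` characters.

```python
def split_reaction(rxn_smiles: str) -> tuple[list[str], list[str]]:
    """Split 'reactants>>products' into two lists of molecule SMILES."""
    left, sep, right = rxn_smiles.partition(">>")
    if not sep:
        raise ValueError(f"Not a reaction SMILES (no '>>'): {rxn_smiles!r}")
    # Drop empty fragments so a stray '.' does not yield an empty molecule.
    reactants = [s for s in left.split(".") if s]
    products = [s for s in right.split(".") if s]
    return reactants, products


r, p = split_reaction("CC(C)CO.O=C=O>>CC(C)COC(=O)O")
print(r, p)  # ['CC(C)CO', 'O=C=O'] ['CC(C)COC(=O)O']
```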


<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">
  <li>
    <strong>smiles_to_fingerprints(rxn_smiles)</strong>:
    <br><br>
    This function converts SMILES into molecular fingerprints.<br>
    <br>&nbsp;&nbsp;&nbsp;&nbsp;1. Splits the reaction SMILES into reactants and products.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;2. Cleans the SMILES by removing atom mappings.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;3. Creates molecular fingerprints using Morgan fingerprints (radius 3).<br>
    &nbsp;&nbsp;&nbsp;&nbsp;4. Converts each fingerprint into a binary vector of length 2048.<br>
    <br>It returns the fingerprints of both reactants and products as lists of binary arrays.<br><br>
  </li>
</div>


In [4]:
from RetroChem.Package_functions.Model_training_functions import smiles_to_fingerprints

rxn_smiles = "CC(C)CO.O=C=O>>CC(C)COC(=O)O"

reactants_fps, products_fps = smiles_to_fingerprints(rxn_smiles)
print(products_fps)

[array([0, 1, 0, ..., 0, 0, 0], shape=(2048,))]
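
For readers who want to see the fingerprint step in isolation, here is a minimal sketch using RDKit directly with the radius (3) and length (2048 bits) given in the text. The helper name `morgan_bits` is hypothetical, and the use of RDKit's `rdFingerprintGenerator` module is an assumption; the package may call a different RDKit function internally.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Radius 3 and 2048 bits, as described in the text.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=2048)


def morgan_bits(smiles: str) -> np.ndarray:
    """Return a 2048-element 0/1 array; raise if the SMILES is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return generator.GetFingerprintAsNumPy(mol)


bits = morgan_bits("CC(C)COC(=O)O")
print(bits.shape, int(bits.sum()))  # vector length and number of bits set
```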


<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">
  <li>
    <strong>prepare_fingerprints_for_training(df)</strong>:
    <br><br>
    This function prepares the fingerprint data for training the machine learning model.<br>
    <br>&nbsp;&nbsp;&nbsp;&nbsp;1. Iterates through the dataset to process each reaction's SMILES.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;2. Converts SMILES to fingerprints using the <code>smiles_to_fingerprints</code> function.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;3. Stores the reaction fingerprints and corresponding template hashes.<br>
    &nbsp;&nbsp;&nbsp;&nbsp;4. Returns the feature matrix <code>X</code> and target vector <code>y</code> for training.<br>
    <br>The output is a dataset of molecular fingerprints (X) and their associated reaction templates (y).<br><br>
  </li>
</div>


In [5]:
from RetroChem.Package_functions.Model_training_functions import prepare_fingerprints_for_training

import pandas as pd
sample_data = {
    'RxnSmilesClean': ['C=C.CO>>CCCO'],
    'TemplateHash': [12345] # Some arbitrary target labels
}

df = pd.DataFrame(sample_data)
X, y = prepare_fingerprints_for_training(df)

print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")



Start of data processing.
Index 0 - SMILES: C=C.CO>>CCCO | Target: 12345
Fingerprint preparation finished. Total examples: 3
Shape of X: (3, 2048)
Shape of y: (3,)


<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">

# **Step 2: Visualizing the prediction and Interface**

### General Pipeline

* **Molecule Input**: The user provides a molecule either by drawing it or by searching for it by name (via PubChem).

* **Template Prediction**: The input molecule is then converted into a fingerprint, and the trained model predicts the 50 reaction templates most likely to fit the input. These templates are general transformation rules describing how certain bonds or functional groups are typically broken or formed in known reactions.

* **Template Filtering**: Each predicted template is individually tested on the input molecule using RDKit’s reaction engine, which attempts to apply the transformation pattern and generate a valid set of reactants: 
    * If it can be applied (e.g., it generates valid reactants), the prediction is accepted.
    * If it can’t be applied (e.g., due to structural mismatch), the prediction is discarded.

This step ensures that only chemically meaningful and syntactically valid retrosynthesis steps are shown to the user. It also prevents the model from suggesting transformations that, although statistically likely, make no chemical sense for the specific molecule in question.


* **Confidence Normalization**: The confidence scores (probabilities) of the successful predictions are renormalized over the accepted predictions only.

* **Reactant Visualization**: For each valid prediction, the resulting reactants are displayed graphically using RDKit.

* **Step 2 Prediction**: If the first-step prediction results in only one reactant, the model performs a second retrosynthesis step on that molecule to further break it down.

See below for a visual representation of these steps, with acetophenone as the input molecule:

<img src="../Images/Prediction_pipeline.png" alt="visualization_prediction_pipeline" style="max-width: 100%; height: auto;">
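
The template filtering and confidence normalization steps can be sketched in a few lines. This is an illustration under stated assumptions, not the app's code: `filter_and_renormalize` is a hypothetical helper, the applicability check is passed in as a plain function, and "normalized" is interpreted here as dividing each accepted probability by the sum of the accepted probabilities.

```python
from typing import Callable


def filter_and_renormalize(
    predictions: list[tuple[str, float]],
    is_applicable: Callable[[str], bool],
) -> list[tuple[str, float]]:
    """Keep applicable templates and rescale their probabilities to sum to 1."""
    accepted = [(t, p) for t, p in predictions if is_applicable(t)]
    total = sum(p for _, p in accepted)
    if total == 0:
        return []
    return [(t, p / total) for t, p in accepted]


# Toy input: three predicted templates, of which "t2" cannot be applied.
preds = [("t1", 0.5), ("t2", 0.3), ("t3", 0.2)]
kept = filter_and_renormalize(preds, lambda t: t != "t2")
print(kept)
```

After filtering, the two remaining scores keep their relative ratio (0.5 to 0.2) but now sum to 1.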

### Functions to predict and show the reactants

Just like for training the model, this part will focus on the functions that make our app work:

* **smiles_to_fingerprint(smiles)**

Converts the user input (a single molecule in SMILES format) into a 2048-bit binary Morgan fingerprint using RDKit, which is what the model reads to make its prediction. It is similar to the smiles_to_fingerprints function used for training, but has some key differences: it handles a single molecule, produces a fixed-size binary vector, and uses the exact settings (radius, bit length, format) required by the trained model. These constraints ensure compatibility with the classifier, making it well suited for inference, whereas smiles_to_fingerprints is more flexible but not suitable for prediction tasks.

In [3]:
from RetroChem.Package_functions.Interface_functions import smiles_to_fingerprint

# Example: isobutanol
smiles = "CC(C)CO"
fp_array = smiles_to_fingerprint(smiles)

print(f"Fingerprint for SMILES '{smiles}':")
print(fp_array)
print("Shape:", fp_array.shape)


Fingerprint for SMILES 'CC(C)CO':
[0 1 0 ... 0 0 0]
Shape: (2048,)


<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">

* **predict_topk_templates(smiles_input, topk=50)**

This function predicts the most likely retrosynthesis templates that can be applied to a given molecule, represented by its SMILES string. It serves as the core of the retrosynthetic prediction pipeline.

1. **Loads model components**: a trained neural network `MLPClassifier`, a `StandardScaler` for fingerprint normalization, and a `LabelEncoder` to map model outputs back to template hashes.
2. **Processes the input SMILES**: converts the molecule into a binary Morgan fingerprint using the `smiles_to_fingerprint` function.
3. **Ranks predictions**: the model outputs class probabilities, and the top-k predictions (by likelihood) are selected.
4. **Retrieves templates**: for each top prediction, the corresponding retrosynthesis SMARTS template is retrieved from the dataset `combined_data`.

The output is a list of `(TemplateHash, RetroTemplate, Probability)` tuples. This function is essential for turning a molecule into actionable retrosynthesis suggestions.

In [None]:
from RetroChem.Package_functions.Interface_functions import predict_topk_templates

# Example: isobutanol
smiles_input = "CC(C)CO"

# Top-5 template predictions for readability (the app normally uses the top 50 templates)
top_predictions = predict_topk_templates(smiles_input, topk=5)

print(top_predictions)


[(np.str_('57b89b59164193fe08ede6224e3e385a09e578f6a73e3c6db88eafe75e2cfc26'), '[C:7]-[O;H0;D2;+0:8]-[CH2;D2;+0:1]-[C:2]=[C:3]-[C:4]-[#8:5]-[c:6]>>Br-[CH2;D2;+0:1]-[C:2]=[C:3]-[C:4]-[#8:5]-[c:6].[C:7]-[OH;D1;+0:8]', np.float64(0.5069113160978944)), (np.str_('a2e56378f9f6fce5d0fd7d6197bb2c7c473932ece3b2762f04edf4162fe28825'), '[#7:3]-[C:4](=[O;D1;H0:5])-[c:6]1:[c:7]:[c:8]:[c:9]:[c:10]:[c:11]:1-[O;H0;D2;+0:12]-[CH2;D2;+0:1]-[C:2]>>O-[CH2;D2;+0:1]-[C:2].[#7:3]-[C:4](=[O;D1;H0:5])-[c:6]1:[c:7]:[c:8]:[c:9]:[c:10]:[c:11]:1-[OH;D1;+0:12]', np.float64(0.10153721169819968)), (np.str_('f5bb42323bea905ccb17c1458c688d451278dbc978b9cf452a9d4a73cbf675cc'), '[C:4]-[O;H0;D2;+0:5]-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3]>>Cl-[C;H0;D3;+0:1](-[C:2])=[O;D1;H0:3].[C:4]-[OH;D1;+0:5]', np.float64(0.09742005112252344)), (np.str_('d36315c1871207ae6c3297ccbbc1a329b0e7e57965eae0ed92c1def4764fd433'), '[C:2]-[CH2;D2;+0:1]-[O;H0;D2;+0:3]-[c:4]>>Cl-[CH2;D2;+0:1]-[C:2].[OH;D1;+0:3]-[c:4]', np.float64(0.03743849985962636)),
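
The ranking step (selecting the top-k classes from a probability vector) can be illustrated on its own with NumPy. This is a minimal sketch with toy values: `top_k_classes` is a hypothetical helper, and the probabilities and `hash_*` labels are made up rather than taken from the trained model.

```python
import numpy as np


def top_k_classes(probabilities, class_labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = np.asarray(probabilities, dtype=float)
    # Sorting the negated values gives indices in descending probability order.
    best = np.argsort(-probabilities)[:k]
    return [(class_labels[i], float(probabilities[i])) for i in best]


# Toy probability vector standing in for one row of class probabilities.
probs = [0.10, 0.55, 0.05, 0.30]
labels = ["hash_a", "hash_b", "hash_c", "hash_d"]
top2 = top_k_classes(probs, labels, k=2)
print(top2)  # [('hash_b', 0.55), ('hash_d', 0.3)]
```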

<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">

* **apply_template(template_smarts, smiles_input)**

This function applies a retrosynthesis SMARTS template to a molecule given as a SMILES string. It checks whether the molecule matches the product side of the template and, if so, generates possible reactants.
1. The input SMILES is converted into an RDKit molecule object.
2. The input SMARTS template is parsed into a reaction object using RDKit.
3. The molecule is matched against the product side of the template. If a match is found, the reaction is applied in reverse to generate possible reactants.
4. If the template can be applied, a set of reactant molecules is returned as a list of SMILES strings.



In [9]:
from RetroChem.Package_functions.Interface_functions import apply_template

# Example input molecule: methanol
smiles_input = "CO"

# Hand-written retrosynthesis template (product>>reactants) used for illustration
template_smarts = "[C:1][O:2]>>[C:1][Br].[O:2][H]"

# Generate possible reactants using the template
reactants = apply_template(template_smarts, smiles_input)

print(reactants)


[['CBr', '[H]O']]
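
The steps listed above map closely onto RDKit's reaction API. The following is a minimal sketch, assuming that templates are written in the retro direction (product pattern on the left of `>>`, reactants on the right) with a single product pattern, so that running the reaction on the target molecule yields candidate reactants. The helper name `run_retro_template` is hypothetical and the package's own `apply_template` may differ in its details.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def run_retro_template(template_smarts: str, product_smiles: str) -> list[list[str]]:
    """Apply a 'product>>reactants' template; return unique reactant sets."""
    mol = Chem.MolFromSmiles(product_smiles)
    if mol is None:
        return []
    rxn = AllChem.ReactionFromSmarts(template_smarts)
    seen, results = set(), []
    # RunReactants returns one tuple of molecules per match of the pattern;
    # it returns an empty tuple when the pattern does not match.
    for outcome in rxn.RunReactants((mol,)):
        smiles = tuple(Chem.MolToSmiles(m) for m in outcome)
        if smiles not in seen:
            seen.add(smiles)
            results.append(list(smiles))
    return results


out = run_retro_template("[C:1][O:2]>>[C:1][Br].[O:2][H]", "CO")
print(out)
```

Deduplicating on the tuple of SMILES avoids showing the same reactant set several times when a symmetric molecule matches the pattern in more than one way.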
