<div style="background-color: #f0efec; color: #14334a ; padding: 20px; border-radius: 20px; font-family: sans-serif; font-size: 14px">


<img src="../Images/logo.png" alt="RetroChem Logo" style="max-width: 100%; height: auto;">

# Introduction 

The following Jupyter notebook briefly presents **RetroChem**, a pip-installable Python package designed for retrosynthetic analysis. This package was developed to assist chemists and chemical engineers in predicting possible synthetic pathways for target molecules, using a machine learning model trained on the USPTO-50K dataset. Retrosynthesis is a central concept in organic chemistry, enabling the design of efficient synthetic routes by working backward from the desired product.

This package was created as a collaborative project for the EPFL course Practical Programming in Chemistry. [![GitHub3](https://img.shields.io/badge/EPFL-CH200-red.svg)](https://edu.epfl.ch/studyplan/en/bachelor/chemistry-and-chemical-engineering/coursebook/practical-programming-in-chemistry-CH-200)

Before diving into the code and functionalities of the package, let’s briefly explore the motivations and core concepts that shaped its development.

# How RetroChem came to mind

The idea for RetroChem emerged from our shared interest in organic synthesis and the growing importance of computational tools in modern chemistry. During our organic chemistry courses and laboratories, we often faced the challenge of designing a synthesis of a target molecule from known reactants, a task that requires extensive chemical expertise and is very time-consuming.

At first, we envisioned RetroChem as a tool that would search through a large database of known reactions, both organic and inorganic, to identify possible transformations for a given target molecule. The idea was to use the most comprehensive reaction datasets available and search whether a synthesis existed for the molecule in question.

However, we quickly realized the scale of this task. The chemical universe is immense: commonly cited estimates put the number of possible drug-like molecules as high as 10⁶⁰. Even the most complete databases, such as CAS, which contains over 70 million registered compounds, cover only a tiny fraction of that space. Searching such a large database for each input would not only be computationally intensive, potentially taking multiple minutes for even simple queries, but also fundamentally limited in scope.

This insight led us to turn toward **machine learning**. Instead of exhaustively searching for known reactions, we decided to train a model that could generalize from reaction data and predict retrosynthetic steps based on learned patterns. This approach allows RetroChem to make educated predictions even for molecules it has never seen before. 

# Step 1: Training the model

## General Pipeline

* **Data Loading**: We began by loading three preprocessed datasets derived from **USPTO-50K**: training, validation, and test. Each file contains cleaned reaction SMILES strings representing the chemical transformations, along with their associated reaction templates. 

* **Data Merging**: The three datasets were merged into a single DataFrame and saved as **combined_data.csv**. Duplicate reactions were removed during this step so that the same reaction cannot end up in more than one split later on, which would leak information between training and evaluation.

* **Fingerprint Generation**: Before a machine learning model can work with molecules, we need to convert them from their chemical structure (SMILES format) into a numerical form. To do this, we use **Morgan fingerprints**. These fingerprints are binary vectors that represent the presence or absence of specific structural patterns or substructures within the molecule. In our case:
  * We use a radius of 3, which means we look at circular substructures around each atom up to 3 bonds away.
  * We generate a 2048-bit vector, where each bit corresponds to a certain chemical feature.
  * For each valid molecule (reactant or product), we generate such a fingerprint vector.

  These vectors become the input X to the machine learning model: this is how the model sees molecules.

* **Label Preparation**: For the model to learn what kind of transformation (reaction template) a molecule underwent, we also need to provide a target label. These labels, called template hashes, are strings that uniquely identify the type of reaction used. However, the classifier cannot work with string labels directly; it needs numbers. To solve this, we use scikit-learn’s **LabelEncoder**, which converts each unique string into a unique integer. This step produces the output vector y, which the model uses to learn how to classify different types of reactions.

Together, X and y now represent the complete training data: X contains the structural features of molecules, and y contains the corresponding reaction template the model is expected to predict.
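The data preparation steps above can be sketched as follows. This is a minimal illustration, not RetroChem's actual code: the tiny DataFrames, the column names `product` and `template_hash`, and the helper `smiles_to_fingerprint` are all hypothetical, and the sketch assumes RDKit's `rdFingerprintGenerator` module is used to build the Morgan fingerprints.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the three USPTO-50K splits; the real files
# and column names may differ.
train = pd.DataFrame({"product": ["CC(=O)OC", "CCO"],
                      "template_hash": ["hash_a", "hash_b"]})
valid = pd.DataFrame({"product": ["CCO", "CC(=O)NC"],
                      "template_hash": ["hash_b", "hash_a"]})
test = pd.DataFrame({"product": ["not_a_smiles"],
                     "template_hash": ["hash_a"]})

# Merge the splits and drop exact duplicate rows.
combined = pd.concat([train, valid, test], ignore_index=True).drop_duplicates()

# Radius 3 and 2048 bits, as described above.
generator = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=2048)


def smiles_to_fingerprint(smiles):
    """Return a 2048-element 0/1 array, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return generator.GetFingerprintAsNumPy(mol)


# Keep only rows whose SMILES string is valid.
features, labels = [], []
for smiles, template_hash in zip(combined["product"], combined["template_hash"]):
    fingerprint = smiles_to_fingerprint(smiles)
    if fingerprint is not None:
        features.append(fingerprint)
        labels.append(template_hash)

X = np.array(features)                 # one fingerprint per row
encoder = LabelEncoder()
y = encoder.fit_transform(labels)      # one integer class per template hash

print(X.shape, y)
```

Invalid SMILES strings are skipped here rather than raising an error, so X and y always stay aligned row by row. `encoder.inverse_transform` maps a predicted integer back to its template hash.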

* **Dataset Splitting**: The data was split into training (70%), validation (15%), and test (15%) sets.

* **Normalization**: The input vectors were standardized using **StandardScaler** to help the neural network learn more effectively.

* **Model Training**: A multi-layer perceptron (scikit-learn’s **MLPClassifier**) with three hidden layers was trained. Early stopping was used to prevent overfitting, and training progress was monitored using the loss curve.

* **Evaluation**: The model was evaluated on both the validation and test sets using accuracy as the main metric.

* **Saving Outputs**: Finally, the trained model, the scaler, and the label encoder were saved to disk for use in future prediction steps.
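The splitting, scaling, training, evaluation, and saving steps could look roughly like the sketch below. It runs on small synthetic data so that it stays self-contained. Several details are assumptions rather than RetroChem's actual settings: the hidden layer sizes, `max_iter`, the random seeds, fitting the scaler on the training set only, the use of `joblib` for persistence, and the output file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-ins for the fingerprint matrix and the template hashes.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 64)).astype(float)
encoder = LabelEncoder()
y = encoder.fit_transform(rng.choice(["hash_a", "hash_b", "hash_c"], size=400))

# 70% training, then split the remaining 30% evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

# Fit the scaler on the training set only, then reuse it for the other splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

# Three hidden layers; the sizes here are assumed, not taken from RetroChem.
model = MLPClassifier(hidden_layer_sizes=(128, 64, 32), early_stopping=True,
                      max_iter=50, random_state=0)
model.fit(X_train_s, y_train)

val_accuracy = accuracy_score(y_val, model.predict(X_val_s))
test_accuracy = accuracy_score(y_test, model.predict(X_test_s))
print(f"validation accuracy: {val_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
print(f"epochs run: {len(model.loss_curve_)}")  # loss_curve_ has one entry per epoch

# Persist everything needed at prediction time (file name is hypothetical).
out_path = os.path.join(tempfile.mkdtemp(), "retrochem_artifacts.joblib")
joblib.dump({"model": model, "scaler": scaler, "encoder": encoder}, out_path)
restored = joblib.load(out_path)
```

Saving the scaler and label encoder together with the model matters because a new molecule must be scaled with the same statistics before prediction, and the predicted integer must be decoded back into a template hash. The labels here are random, so the accuracies will be near chance level; the sketch only illustrates the workflow.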

