This is the source code for the paper "FRAIL: Fragment-Based Reinforcement Learning for Molecular Design and Benchmarking on Fatty Acid Amide Hydrolase 1 (FAAH-1)".
The system combines:
- Generator based on DrugEx with fragment-based representation (graph transformer) and multi-objective reinforcement learning training.
- Predictor (QSAR) for properties such as SA, MW, TPSA, HBD, HBA, LogP, and pIC50 on FAAH-1, using RDKit, PyTDC, and a trained Chemprop model.
- Complete inference pipeline for generating and filtering FAAH-1 molecules.
It is recommended to use conda to install RDKit and create a separate environment.
- Step 1: Create environment
conda create -n frail python=3.10 -y
conda activate frail
- Step 2: Install RDKit (required for RDKit, PyTDC, Chemprop)
conda install -c rdkit rdkit -y
- Step 3: Install remaining Python libraries
pip install -r requirements.txt
- Step 4: Check path configuration
Adjust the following configuration files to match your system (especially absolute paths to data and checkpoints):
- src/configs/generators/DrugexConfigs.py contains DATASETS_PATH, DATASETS_ENCODED_PATH, DATA_FILE, TARGET_NAME, MODEL_PATH, VOCAB_PATH, ...
- src/configs/predictors/ChempropConfigs.py contains PIC50_PATH, ROUND_DIGITS, FILTER_THRES
- (optional) src/settings.py if you want to synchronize paths.
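Before running anything, it can help to confirm that the configured paths actually exist on your machine. The following is a minimal sketch, not part of the repo: `find_missing_paths` is a hypothetical helper, and the dictionary values are placeholders that you would replace with the constants imported from the config modules listed above.

```python
from pathlib import Path


def find_missing_paths(paths: dict[str, str]) -> list[str]:
    """Return the names whose configured path does not exist on disk."""
    return [
        name
        for name, value in paths.items()
        if not Path(value).expanduser().exists()
    ]


if __name__ == "__main__":
    # Placeholder values for illustration; in practice, import the real
    # constants (e.g. DATASETS_PATH, MODEL_PATH, PIC50_PATH) from the configs.
    configured = {
        "DATASETS_PATH": "data/",
        "PIC50_PATH": "path/to/chemprop_checkpoint",
    }
    for name in find_missing_paths(configured):
        print(f"{name} points to a missing path: {configured[name]}")
```

Only pass path-like settings to the helper; values such as TARGET_NAME are plain names, not locations on disk.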
Key files:
- src/run.py: complete inference pipeline (generate + filter) for FAAH-1.
- src/engine/generator.py: wrapper around DrugEx for fine-tuning and reinforcement learning based on fragment graph.
- src/engine/predictor.py: property predictor using RDKit + PyTDC + Chemprop pIC50, supports molecule filtering.
Example data directories:
data/: contains input dataset (SMILES, pIC50 labels, etc.)
By default, src/run.py defines the Pipeline class:
- Pipeline.generator: instance of Generator (DrugEx) from src/engine/generator.py
- Pipeline.predictor: instance of Predictor from src/engine/predictor.py
- Method Pipeline.generate(...):
  - generates molecules using DrugEx from input fragments
  - predicts properties
  - filters / writes to CSV file.
From the root directory:
conda activate frail
python src/run.py
Output:
- A CSV file containing generated molecules and properties, for example:
SMILES,SA,MW,TPSA,HBD,HBA,LogP,pIC50
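As a rough sketch of how the output CSV could be post-processed, the snippet below ranks rows by predicted pIC50 using only the standard library. The function name `rank_by_pic50` is hypothetical, and the two sample rows use made-up values purely to show the column layout; they are not real model output.

```python
import csv
import io


def rank_by_pic50(csv_file, k: int = 5) -> list[dict]:
    """Read rows from an open CSV file and return the k rows with highest pIC50."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row["pIC50"]), reverse=True)
    return rows[:k]


# Made-up rows for illustration only (same header as the pipeline output).
sample = io.StringIO(
    "SMILES,SA,MW,TPSA,HBD,HBA,LogP,pIC50\n"
    "CCO,1.0,46.07,20.23,1,1,-0.03,4.2\n"
    "c1ccccc1,1.0,78.11,0.0,0,0,1.69,5.1\n"
)
best = rank_by_pic50(sample, k=1)
print(best[0]["SMILES"], best[0]["pIC50"])
```

For a real run, pass an opened results file instead of the in-memory sample, for example `open("gen_data/my_gen_results.csv", newline="")`.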
- Edit directly in src/run.py:
  - NUM_SAMPLES: desired number of molecules.
  - INPUT_FRAGMENTS: seed fragments (SMILES); change these if you want to try targets beyond FAAH-1.
  - output_file: location for saving results (e.g., in gen_data/).
Or use Pipeline in your own script:
from src.run import Pipeline
pipeline = Pipeline()
smiles_out_path = "gen_data/my_gen_results.csv"
pipeline.generate(
input_fragments=["ClC1=CC=C2CCNCC2=C1Cl"],
num_samples=1000,
output_file=smiles_out_path,
upscale=5, # optional, increase raw samples generated for filtering
)
The file src/engine/generator.py defines the Generator class with main methods:
- finetune(...): fine-tuning a pretrained DrugEx model on your FAAH-1 dataset.
- train_rl(...): multi-objective RL training with DrugExEnvironment + custom scorers.
- generate(...): generate molecules from a trained model.
General steps:
- Prepare data:
  - Configure paths in configs/generators/DrugexConfigs.py:
    - DATASETS_PATH: directory containing raw data files.
    - DATA_FILE: name of the CSV file containing a SMILES column (and pIC50 label if needed).
    - TARGET_NAME: target name (e.g., "FAAH"), used to name encoded train/test files.
  - Generator will:
    - read data,
    - standardize SMILES,
    - fragment and encode,
    - create train_set and test_set in DATASETS_ENCODED_PATH.
- Fine-tune DrugEx model:
Simple example script:
from engine.generator import Generator
from configs.generators.DrugexConfigs import MODEL_PATH, VOCAB_PATH
generator = Generator()
finetuned_model, vocab = generator.finetune(
model_path=MODEL_PATH, # pretrained DrugEx checkpoint
vocab_path=VOCAB_PATH, # corresponding vocabulary
epochs=10,
batch_size=64,
save_path="data/models/finetune/FAAH_FT_ep10"
)
- Train Reinforcement Learning:
After having a good model (pretrained or fine-tuned), you can run RL:
from engine.generator import Generator
from configs.generators.DrugexConfigs import MODEL_PATH, VOCAB_PATH
generator = Generator()
rl_explorer, env = generator.train_rl(
agent_model_path=MODEL_PATH, # or finetuned checkpoint
agent_vocab_path=VOCAB_PATH,
mutate_model_path=None, # default uses agent as mutate
mutate_vocab_path=None,
epochs=20,
batch_size=64,
epsilon=0.2,
save_path="data/models/rl/FAAH_RL_ep20"
)
You can put the above code into a separate script (e.g., train_generator.py) and run:
python train_generator.py
The file src/engine/predictor.py defines the Predictor class:
- Combines:
  - PyTDC Oracles for logP and SA.
  - RDKit for MW, TPSA, HBD, HBA.
  - Chemprop model for pIC50 (FAAH-1), loaded from PIC50_PATH.
- Provides functions:
  - predict_* for each property.
  - predict(smiles) returns a dict of properties.
  - is_valid(smiles, props) to check if molecule is within filter range in FILTER_THRES.
  - filter(smiles_list) to compute properties & filter molecule list (used in RL and pipeline).
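To illustrate the idea behind range-based filtering, here is a minimal standalone sketch. It is not the repo's implementation: the actual structure of FILTER_THRES is not shown in this README, so `EXAMPLE_THRES` (a mapping of property name to a `(min, max)` pair, with illustrative rule-of-five style bounds) and `within_thresholds` are assumptions.

```python
# Hypothetical thresholds: property name -> (min, max); None means unbounded.
# The bounds are illustrative only and are NOT the repo's FILTER_THRES.
EXAMPLE_THRES = {
    "MW": (None, 500.0),
    "LogP": (None, 5.0),
    "HBD": (None, 5),
    "HBA": (None, 10),
}


def within_thresholds(props: dict, thres: dict) -> bool:
    """Return True only if every thresholded property is present and in range."""
    for name, (low, high) in thres.items():
        value = props.get(name)
        if value is None:
            return False  # a missing property counts as a failed check
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


candidate = {"MW": 320.4, "LogP": 2.1, "HBD": 1, "HBA": 4}
print(within_thresholds(candidate, EXAMPLE_THRES))
```

Treating a missing property as a failure is a conservative design choice in this sketch; the repo's is_valid may behave differently.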
Example:
from engine.predictor import Predictor
predictor = Predictor()
smiles = "COc1ccc(OC(=O)N2CCC(c3nc(C4=NOC(c5ccccc5)C4)cs3)CC2)cc1"
props = predictor.predict(smiles)
is_ok = predictor.is_valid(smiles, props)
print(props)
print("Valid:", is_ok)
In this repo, Predictor assumes you already have a trained Chemprop checkpoint for pIC50 on FAAH-1:
- Checkpoint path is configured in PIC50_PATH (file configs/predictors/ChempropConfigs.py).
- Model is loaded via engine.predictors.chemprop.chemprop.
To retrain the pIC50 model:
- Use original Chemprop code or the engine/predictors/chemprop directory (following Chemprop instructions).
- Train model on pIC50 dataset for FAAH-1.
- Update PIC50_PATH to point to the new checkpoint.
After updating, Predictor will automatically use the new model when you run:
- src/run.py (inference pipeline)
- src/engine/custom_scorers.py during RL.
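When preparing a pIC50 dataset for retraining, raw activity data is often reported as IC50 rather than pIC50. By definition, pIC50 is the negative base-10 logarithm of IC50 in mol/L. The sketch below assumes your raw values are in nanomolar; `ic50_nm_to_pic50` is a hypothetical helper, not a function from this repo.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 mol/L, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


print(round(ic50_nm_to_pic50(1000.0), 2))  # 1000 nM = 1 uM
```

Check the units of your source data before converting; values in micromolar would need an offset of 6 instead of 9.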
- Paths in the current repo are absolute paths from the experimental environment; when running on other machines, update these paths in the configuration files.
- RL training and inference with DrugEx may require a GPU for efficient execution; ensure you install CUDA/cuDNN and an appropriate torch version if using GPU.
- If you change the data structure or add new properties, please update accordingly:
  - engine/custom_scorers.py
  - engine/predictor.py
  - configuration in configs/.
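To check whether your torch installation can see a GPU before starting a long RL run, something like the following can be used. This is a minimal sketch; `pick_device` is a hypothetical helper, and how the repo itself selects a device is not covered in this README.

```python
def pick_device() -> str:
    """Return "cuda" if torch is installed and sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Using device:", pick_device())
```

If this prints "cpu" on a machine with a GPU, the installed torch build likely does not match your CUDA setup.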
