# Model External Validation

This document describes the external validation of the RXNGraphormer model on published literature datasets. The datasets span distinct reaction classes and experimental setups, so the evaluation tests performance, robustness, and generalizability beyond the original training domain.

## Primary Validation Datasets

### Sulfoxonium Ylide Dataset
- **Reference**: [Lin et al., *Sci. China Chem.* **68**, 679–686 (2025)](https://doi.org/10.1007/s11426-024-2313-5)
- **Description**: This dataset comprises results from high-throughput experimentation on Ru-catalyzed P(O)O–H insertion reactions of sulfoxonium ylides. It features broad coverage of substrate scope and reaction conditions, with well-characterized α-phosphoryloxy ketone products and experimentally measured yields.

### Meta-C–H Functionalization Dataset
- **Reference**: [Selective functionalization of hindered meta-C–H bond of o-alkylaryl ketones, *Chem* (2022)](https://www.cell.com/chem/fulltext/S2451-9294(22)00421-1)
- **Description**: A dataset derived from automated synthesis platforms performing meta-selective C–H functionalization. It includes sterically congested substrates and diverse coupling partners, presenting a challenging benchmark for selectivity prediction.

### Amide Coupling Reaction HTE Dataset
- **Reference**: [Zhang et al., *ChemRxiv* (2025)](https://chemrxiv.org/engage/chemrxiv/article-details/67d288b3fa469535b90eb631)
- **Description**: A preprint dataset of amide coupling reactions generated via machine-guided high-throughput experimentation (HTE). The dataset features high-quality yield measurements and was constructed using unbiased sampling strategies, ensuring broad chemical diversity and minimal selection bias.

### Amide Coupling Literature Dataset
- **Description**: A curated compilation of amide coupling reactions extracted from recent chemical literature (past three years). This dataset provides broad coverage of reagent systems, functional groups, and substrate architectures, serving as an independent benchmark for model generalizability.

---

## Methodological Limitations in Dataset Utilization

Because the source datasets lack mechanistic intermediate annotations, the intermediate generation analysis reported in the original RXNGraphormer study could not be reproduced here. To stay consistent with the experimental protocols and to avoid artifacts from unverified intermediate predictions, all analyses for these datasets use substrate–product pairs only. This limitation applies to every dataset marked **no mech**.

---

## Evaluation Protocol

For the Sulfoxonium Ylide and Meta-C–H Functionalization datasets (both marked **no mech**), model evaluation was performed using pre-trained optimal checkpoints. These models were initially trained and validated to convergence, with the best-performing weights saved based on validation set performance.

During evaluation, the saved optimal models were directly loaded, and inference was conducted on the test set by modifying only the data path in `model/parameters.json`. This approach ensures consistent and reproducible assessment without retraining or fine-tuning.
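
The path swap described above can be sketched as follows, assuming the config is plain JSON and the dataset location sits under a top-level key. The key name `data_path` and the helper `override_data_path` are hypothetical; check the actual layout of `model/parameters.json` before adapting this.

```python
import json
import tempfile
from pathlib import Path


def override_data_path(config_file, new_data_path, key="data_path"):
    """Rewrite one key of a JSON config in place and return the updated dict.

    `key` is a hypothetical name used for illustration only.
    """
    config_file = Path(config_file)
    config = json.loads(config_file.read_text())
    if key not in config:
        raise KeyError(f"{key!r} not found in {config_file}")
    config[key] = str(new_data_path)
    config_file.write_text(json.dumps(config, indent=2))
    return config


# Demonstration on a throwaway file so that no real config is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "parameters.json"
    demo_file.write_text(json.dumps({"data_path": "train_split", "seed": 0}))
    updated = override_data_path(demo_file, "test_split")
    reloaded = json.loads(demo_file.read_text())
```

Every other field is left untouched, so the checkpoint is evaluated with the same hyperparameters it was trained with.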

To support flexible evaluation, the `eval_regression_performance` function in `rxngraphormer/eval.py` was extended with two binary switches, giving four evaluation configurations:
- Use of a specific validation/test split (`specific_val`)
- Activation of intermediate information inference (`eval(pretrained_config.model.use_mid_inf)`)

All four combinations of these settings can be run, which allows systematic ablation studies and performance benchmarking across different modeling assumptions and data configurations.
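
As an illustration, the four combinations of the two switches can be enumerated as below. This sketch only lists the settings and does not call `eval_regression_performance`. In the source, `use_mid_inf` is read from the pretrained config rather than passed as an argument, so treating it as a plain boolean here is an assumption.

```python
from itertools import product

# Names of the two binary switches described above.
SWITCHES = ("specific_val", "use_mid_inf")


def enumerate_eval_configs():
    """Return every on/off combination of the two switches as a list of dicts."""
    return [
        dict(zip(SWITCHES, values))
        for values in product((False, True), repeat=len(SWITCHES))
    ]


configs = enumerate_eval_configs()
for cfg in configs:
    print(cfg)
```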

## 1. Shell Example

In [19]:
#  python train_model.py --config_json ./config/Test/sulfoxonium_seed68909_parameters.json
#  python train_model.py --config_json ./config/Test/meta_C_H_parameters.json

## 2. Base Function

In [20]:
import glob
import os
import warnings

from rxngraphormer.eval import eval_regression_performance

warnings.filterwarnings("ignore")
cur_dir = os.getcwd()
father_dir = os.path.abspath(os.path.join(cur_dir, '..'))
all_results = {}

In [21]:
def evaluate_task(task_name, pattern, results_dict, father_dir, cur_dir, specific_val=True):
    """Evaluate every model directory matching `pattern` and store the metrics.

    Results are stored as results_dict[task_name] = [r2s, maes, preds, targets].
    """
    try:
        # `pattern` is resolved relative to `father_dir`.
        os.chdir(father_dir)
        path_lst = sorted(glob.glob(pattern), key=os.path.basename)
        all_r2, all_mae, all_preds, all_targets = [], [], [], []

        for path in path_lst:
            name = os.path.basename(path)
            # Alternative call that also returns training metrics:
            # tr_r2, tr_mae, _, _, r2, mae, preds, targets = eval_regression_performance(
            #     path, ckpt_file="valid_checkpoint.pt", scale=100,
            #     yield_constrain=True, specific_val=True, return_train_results=True
            # )
            r2, mae, preds, targets = eval_regression_performance(
                path, ckpt_file="valid_checkpoint.pt", scale=100,
                yield_constrain=True, specific_val=specific_val
            )
            # print( f"{task_name}_Train/{name}, Train R2: {tr_r2:.4f},Train MAE: {tr_mae:.4f},Test R2: {r2:.4f},Test MAE: {mae:.4f}")
            print(f"{name}, Test R2: {r2:.4f},Test MAE: {mae:.4f}")

            all_r2.append(r2)
            all_mae.append(mae)
            all_preds.append(preds)
            all_targets.append(targets)

        results_dict[task_name] = [all_r2, all_mae, all_preds, all_targets]
    finally:
        # Always return to the notebook directory, even if evaluation fails.
        os.chdir(cur_dir)
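
Once `evaluate_task` has filled the results dictionary, the per-task lists can be reduced to summary statistics. The helper below is a hypothetical sketch that assumes each entry has the `[all_r2, all_mae, all_preds, all_targets]` layout stored above. It runs on toy numbers rather than real model output.

```python
from statistics import mean


def summarize_results(results_dict):
    """Average R2 and MAE per task; tasks with no evaluated models are skipped."""
    summary = {}
    for task_name, (r2_list, mae_list, _preds, _targets) in results_dict.items():
        if not r2_list:
            continue  # glob matched nothing for this task
        summary[task_name] = {
            "n_models": len(r2_list),
            "mean_r2": mean(r2_list),
            "mean_mae": mean(mae_list),
        }
    return summary


# Toy data in the same layout as `all_results`; the values are made up.
toy_results = {
    "demo_task": [[0.5, 0.7], [10.0, 12.0], [[], []], [[], []]],
    "empty_task": [[], [], [], []],
}
summary = summarize_results(toy_results)
print(summary)
```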

## 3. Sulfoxonium Dataset

In [22]:
# Maps each task directory to the `specific_val` setting used for its evaluation.
Sulfoxonium_tasks = {"Sulfoxonium": False, "Sulfoxonium_eval": True}
for task, use_specific_val in Sulfoxonium_tasks.items():
    evaluate_task(task, f"./model_path/Test/{task}", all_results, father_dir, cur_dir,
                  specific_val=use_specific_val)

Sulfoxonium, Test R2: 0.9085,Test MAE: 5.7049
Sulfoxonium_eval, Test R2: 0.6048,Test MAE: 9.4620


## 4. Meta_C_H Dataset (no mech)

In [23]:
# Maps each task directory to the `specific_val` setting used for its evaluation.
Meta_C_H_tasks = {"Meta_C_H": False, "Meta_C_H_eval": True, "Meta_C_H_strict": False, "Meta_C_H_strict_eval": True}
for task, use_specific_val in Meta_C_H_tasks.items():
    evaluate_task(task, f"./model_path/Test/{task}", all_results, father_dir, cur_dir,
                  specific_val=use_specific_val)

Meta_C_H, Test R2: 0.8227,Test MAE: 5.7645
Meta_C_H_eval, Test R2: 0.7759,Test MAE: 6.4859
Meta_C_H_strict, Test R2: 0.8121,Test MAE: 5.9319
Meta_C_H_strict_eval, Test R2: -1.1942,Test MAE: 23.2152
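
When reading the scores above, recall that R2 = 1 - SS_res / SS_tot, so it falls below zero whenever the predictions fit worse than a constant prediction at the mean of the test targets. The sketch below implements the standard definitions in plain Python on made-up numbers. It is not the metric code used inside `rxngraphormer`, which is not shown here.

```python
def r2_score(targets, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(targets, preds):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)


# Made-up yields (in percent) for illustration only.
targets = [40.0, 50.0, 60.0]
good_preds = [42.0, 49.0, 61.0]
bad_preds = [80.0, 20.0, 90.0]

good_r2 = r2_score(targets, good_preds)
bad_r2 = r2_score(targets, bad_preds)  # negative: worse than predicting the mean
good_mae = mean_absolute_error(targets, good_preds)
bad_mae = mean_absolute_error(targets, bad_preds)
print(good_r2, bad_r2, good_mae, bad_mae)
```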


## 5. Amide Coupling Reaction Datasets (HTE)
### 5.1 Introduced Intermediate

In [24]:
AmCouple_tasks = ['AmCouple_hte_random', 'AmCouple_hte_part', 'AmCouple_hte_full']
for task in AmCouple_tasks:
    evaluate_task(task, f"./model_path/Test/{task}", all_results, father_dir, cur_dir, specific_val=True)

AmCouple_hte_random, Test R2: 0.5761,Test MAE: 14.1392
AmCouple_hte_part, Test R2: 0.5908,Test MAE: 14.3076
AmCouple_hte_full, Test R2: 0.5763,Test MAE: 12.5638


### 5.2 No Intermediate Introduced

In [25]:
AmCouple_tasks = ['AmCouple_hte_random_nomech', 'AmCouple_hte_part_nomech', 'AmCouple_hte_full_nomech']
for task in AmCouple_tasks:
    evaluate_task(task, f"./model_path/Test/{task}", all_results, father_dir, cur_dir, specific_val=True)

AmCouple_hte_random_nomech, Test R2: 0.5844,Test MAE: 14.3055
AmCouple_hte_part_nomech, Test R2: 0.6583,Test MAE: 13.1205
AmCouple_hte_full_nomech, Test R2: 0.4430,Test MAE: 14.9298
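
To compare the runs in 5.1 and 5.2 split by split, the printed metrics can be differenced directly. The sketch below copies the R2 and MAE values from the outputs above into two dictionaries. The helper name `metric_deltas` is hypothetical.

```python
# (R2, MAE) per split, copied from the outputs in sections 5.1 and 5.2.
with_mech = {
    "random": (0.5761, 14.1392),
    "part": (0.5908, 14.3076),
    "full": (0.5763, 12.5638),
}
no_mech = {
    "random": (0.5844, 14.3055),
    "part": (0.6583, 13.1205),
    "full": (0.4430, 14.9298),
}


def metric_deltas(baseline, variant):
    """Return (delta R2, delta MAE) per split, computed as variant minus baseline."""
    deltas = {}
    for split, (base_r2, base_mae) in baseline.items():
        var_r2, var_mae = variant[split]
        deltas[split] = (round(var_r2 - base_r2, 4), round(var_mae - base_mae, 4))
    return deltas


deltas = metric_deltas(with_mech, no_mech)
for split, (d_r2, d_mae) in deltas.items():
    print(f"{split:>6}: dR2 = {d_r2:+.4f}, dMAE = {d_mae:+.4f}")
```

A positive R2 difference together with a negative MAE difference, as for the `part` split, means the model without intermediates scored better on that split.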


### 5.3 Using Original Model's Intermediate

Selected conditions, evaluated with the original model and its embedded intermediate knowledge:
- DCC Dataset
- EDC Dataset
- HATU Dataset
- PyBOP Dataset
- TBTU Dataset (no mech)
- HBTU Dataset (no mech)

In [34]:
AmCouple_conditions_tasks = [
    'DCC_random', 'DCC_part', 'DCC_full',
    'EDC_random', 'EDC_part', 'EDC_full',
    'HATU_random', 'HATU_part', 'HATU_full',
    'PyBOP_random', 'PyBOP_part', 'PyBOP_full',
    'HBTU_random_nomech', 'HBTU_part_nomech', 'HBTU_full_nomech',
    'TBTU_random_nomech', 'TBTU_part_nomech', 'TBTU_full_nomech',
]

for i, task in enumerate(AmCouple_conditions_tasks):
    evaluate_task(task, f"./model_path/Test/{task}", all_results, father_dir, cur_dir, specific_val=True)
    if (i + 1) % 3 == 0:
        print("-" * 50)

DCC_random, Test R2: 0.3654,Test MAE: 16.6271
DCC_part, Test R2: 0.2671,Test MAE: 15.9877
DCC_full, Test R2: -0.4138,Test MAE: 13.0522
--------------------------------------------------
EDC_random, Test R2: 0.2310,Test MAE: 18.4555
EDC_part, Test R2: 0.1951,Test MAE: 18.3237
EDC_full, Test R2: -0.0951,Test MAE: 22.2301
--------------------------------------------------
HATU_random, Test R2: 0.0756,Test MAE: 19.0816
HATU_part, Test R2: -0.0867,Test MAE: 20.5697
HATU_full, Test R2: -0.3705,Test MAE: 16.3399
--------------------------------------------------
PyBOP_random, Test R2: 0.3463,Test MAE: 15.5401
PyBOP_part, Test R2: 0.3794,Test MAE: 14.0850
PyBOP_full, Test R2: 0.2164,Test MAE: 15.5574
--------------------------------------------------
HBTU_random_nomech, Test R2: 0.2276,Test MAE: 18.0151
HBTU_part_nomech, Test R2: 0.0387,Test MAE: 18.7460
HBTU_full_nomech, Test R2: -0.1053,Test MAE: 17.8836
--------------------------------------------------
TBTU_random_nomech, Test R2: 0.4880,T

## 6. Amide Coupling Literature Dataset

In [33]:
evaluate_task("AmCouple_lit", "./model_path/Test/AmCouple_lit", all_results, father_dir, cur_dir, specific_val=False)

AmCouple_lit, Test R2: 0.3506,Test MAE: 12.4794
