# Using Benchmark Systems

This notebook demonstrates how to use the `openfe_benchmarks.data` module to access remediated benchmark system inputs that are ready for use with the OpenFE toolkit.

The module provides methods to:
- List available benchmark sets and systems.
- Filter systems by tags or calculation types.
- Access ligand files and other system components.
- Load and inspect ligands for molecular modeling tasks.

## Setup

First, let's import what this notebook needs from `openfe_benchmarks.data`: the two loader functions, the `PARTIAL_CHARGE_TYPES` list, and the `BenchmarkIndex` class used to list benchmark sets and systems.

In [1]:
from openfe_benchmarks.data import (
    get_benchmark_data_system,
    get_benchmark_set_data_systems,
    PARTIAL_CHARGE_TYPES,
    BenchmarkIndex,
)

## Discovering Available Benchmark Sets

The module automatically discovers all available benchmark sets in the directory structure. Let's see what's available and explore the systems within each set.

In [2]:
index = BenchmarkIndex()
index.list_benchmark_sets()

['charge_annihilation_set',
 'fragments',
 'jacs_set',
 'janssen_bace',
 'mcs_docking_set',
 'merck',
 'miscellaneous_set',
 'water_set']

In [3]:
index.list_systems_by_benchmark_set("jacs_set")

['bace', 'cdk2', 'jnk1', 'mcl1', 'p38', 'ptp1b', 'thrombin', 'tyk2']
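
To make the idea of directory-based discovery concrete, here is a minimal sketch using only the standard library. It is not the module's implementation: `discover_benchmark_sets` is a hypothetical helper, and the `<root>/<set>/<system>/` layout is an assumption based on the protein path printed later in this notebook.

```python
from pathlib import Path
import tempfile


def discover_benchmark_sets(root: Path) -> dict[str, list[str]]:
    """Map each benchmark-set directory under ``root`` to its system directories.

    Hypothetical helper: assumes a ``<root>/<set>/<system>/`` layout.
    """
    discovered = {}
    for set_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Every sub-directory of a set is treated as one system.
        discovered[set_dir.name] = sorted(
            p.name for p in set_dir.iterdir() if p.is_dir()
        )
    return discovered


# Build a throwaway directory tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for set_name, system_name in [
        ("jacs_set", "tyk2"),
        ("jacs_set", "bace"),
        ("merck", "tnks2"),
    ]:
        (root / set_name / system_name).mkdir(parents=True)
    layout = discover_benchmark_sets(root)

print(layout)
```

Sorting both levels keeps the listing stable across file systems, which matches the alphabetical order seen in the outputs above.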

## Filtering Systems by Calculation Type

You can filter systems by tag. The tags mark the calculation types a system supports (`bfe` or `sfe`) and whether it contains cofactors (`cofactor`). Passing several tags returns only the systems that carry all of them, which is useful for selecting systems relevant to your specific modeling tasks.

In [4]:
print("Available tags:", index.list_available_tags())

Available tags: {'bfe', 'sfe', 'cofactor'}


In [5]:
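# `_data` is private by convention (leading underscore); it is used here only to inspect the raw tag mapping.
index._data["systems"]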

{'charge_annihilation_set': {'cdk2': ['bfe', 'sfe'],
  'dlk': ['bfe', 'sfe'],
  'egfr': ['bfe', 'sfe'],
  'ephx2': ['bfe', 'sfe'],
  'irak4_s2': ['bfe', 'sfe'],
  'irak4_s3': ['bfe', 'sfe'],
  'itk': ['bfe', 'sfe'],
  'jak1': ['bfe', 'sfe'],
  'jnk1': ['bfe', 'sfe'],
  'ptp1b': ['bfe', 'sfe'],
  'thrombin': ['bfe', 'sfe', 'cofactor'],
  'tyk2': ['bfe', 'sfe']},
 'fragments': {'hsp90_2rings': ['bfe', 'sfe'],
  'hsp90_single_ring': ['bfe', 'sfe'],
  'jak2_set1': ['bfe', 'sfe'],
  'jak2_set2': ['bfe', 'sfe'],
  'liga': ['bfe', 'sfe'],
  'mcl1': ['bfe', 'sfe'],
  'mup1': ['bfe', 'sfe'],
  'p38': ['bfe', 'sfe'],
  't4_lysozyme': ['bfe', 'sfe']},
 'jacs_set': {'bace': ['bfe', 'sfe'],
  'cdk2': ['bfe', 'sfe'],
  'jnk1': ['bfe', 'sfe'],
  'mcl1': ['bfe', 'sfe'],
  'p38': ['bfe', 'sfe'],
  'ptp1b': ['bfe', 'sfe'],
  'thrombin': ['bfe', 'sfe'],
  'tyk2': ['bfe', 'sfe']},
 'janssen_bace': {'bace_ciordia_prospective': ['bfe', 'sfe'],
  'bace_p3_arg368_in': ['bfe', 'sfe'],
  'ciordia_retro': ['bfe'

In [6]:
rbfe_systems = index.list_systems_by_tag(['bfe', 'cofactor'])
rbfe_systems

[('charge_annihilation_set', 'thrombin'),
 ('mcs_docking_set', 'hne'),
 ('merck', 'pfkfb3'),
 ('merck', 'tnks2'),
 ('water_set', 'hsp90_woodhead')]
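
The result above only contains systems that carry both requested tags. The following sketch reproduces that "all tags must match" behaviour on a small excerpt of the tag mapping. `systems_with_all_tags` is a hypothetical function written for illustration; how `list_systems_by_tag` is implemented internally is not shown in this notebook.

```python
def systems_with_all_tags(systems, required_tags):
    """Return (benchmark_set, system) pairs whose tags include every required tag."""
    required = set(required_tags)
    return [
        (set_name, system_name)
        for set_name, members in sorted(systems.items())
        for system_name, tags in sorted(members.items())
        if required <= set(tags)  # subset test: all required tags are present
    ]


# Excerpt of the tag mapping printed earlier in this section.
systems = {
    "charge_annihilation_set": {
        "cdk2": ["bfe", "sfe"],
        "thrombin": ["bfe", "sfe", "cofactor"],
    },
    "jacs_set": {
        "thrombin": ["bfe", "sfe"],
        "tyk2": ["bfe", "sfe"],
    },
}

matches = systems_with_all_tags(systems, ["bfe", "cofactor"])
print(matches)
```

On this excerpt, only the `thrombin` entry of `charge_annihilation_set` qualifies, because the `jacs_set` entry of the same name has no `cofactor` tag.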

## Loading a Benchmark System

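Now let's load a specific benchmark system with the factory function `get_benchmark_data_system`. We'll use the HNE system from the MCS Docking set. Note that the variable in the cells below is named `p38_system`, although it holds the HNE system: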

In [7]:
p38_system = get_benchmark_data_system('mcs_docking_set', 'hne')
p38_system

2026-02-04 13:49:18 | INFO     | Loaded system 'hne' from benchmark set 'mcs_docking_set' with 5 ligand file(s), and 5 cofactor file(s).
Found protein file: True.
Found 1 ligand network files


BenchmarkData(name='hne', benchmark_set='mcs_docking_set', protein=protein.pdb, ligands=['openeye_am1bcc', 'nagl_openff-gnn-am1bcc-1.0.0.pt', 'no_charges', 'antechamber_am1bcc', 'openeye_am1bccelf10'], cofactors=['antechamber_am1bcc', 'openeye_am1bccelf10', 'nagl_openff-gnn-am1bcc-1.0.0.pt', 'no_charges', 'openeye_am1bcc'], ligand_network=['industry_benchmarks_network']

## Accessing System Components

The `BenchmarkData` object provides easy access to all components:

### System Metadata

In [8]:
print(f"System name: {p38_system.name}")
print(f"Benchmark set: {p38_system.benchmark_set}")

System name: hne
Benchmark set: mcs_docking_set


In [9]:
print(p38_system.details)

# GSK Industry Benchmark
Prepared by Alexander Williams
## HNE MCS Set
## Software Used
1. Maestro Protein Prep v.2024-02
2. Cresset Flare v8
## Protein Preparation: Steps Taken
1. Protein from original structures was imported into Maestro and using the protein preparation panel the interactive mode was activated following settings were applied on the preprocess panel.
### Maestro Protein Preparation Workflow Selected Settings
   1. Convert selenomethionines to methionines
   2. Include Peptides when capping termini.
   3. For HNE, two additional hydrogens on ASN98 and ASN159 had to be added onto the sidechain nitrogen.

1. Cofactor EPE was extracted to its own SDF file cofactor.sdf, other cofactors NAG and FUC were deleted.
2. Using Flare v8. cofactors.sdf was loaded in and was resaved to add appropriate hydrogens not prepared by maestro.
3. Final NMA peptide prepared by maestro was converted to NME by using the builder panel and selecting other edits ... change atom properties.
4. Pr

### Protein Structure

In [10]:
print(f"Protein PDB file: {p38_system.protein}")
print(f"File exists: {p38_system.protein.exists()}")
print(f"File size: {p38_system.protein.stat().st_size / 1024:.2f} KB")

Protein PDB file: /Users/jenniferclark/bin/openfe-benchmarks/openfe_benchmarks/data/benchmark_systems/mcs_docking_set/hne/protein.pdb
File exists: True
File size: 311.10 KB


### Ligands with Different Partial Charges

A system can provide the same ligands with several partial charge types, each in its own file, alongside an uncharged `no_charges` file. Let's see what's available:

In [11]:
print(f"Available partial charge types in this module: {PARTIAL_CHARGE_TYPES}")
print("\nLigand files available for HNE system:")
for charge_type, ligand_path in p38_system.ligands.items():
    print(f"  - {charge_type}: {ligand_path.name}")

Available partial charge types in this module: ['antechamber_am1bcc', 'nagl_openff-gnn-am1bcc-1.0.0.pt', 'openeye_am1bcc', 'openeye_am1bccelf10']

Ligand files available for HNE system:
  - openeye_am1bcc: ligands_openeye_am1bcc.sdf
  - nagl_openff-gnn-am1bcc-1.0.0.pt: ligands_nagl_openff-gnn-am1bcc-1.0.0.pt.sdf
  - no_charges: ligands.sdf
  - antechamber_am1bcc: ligands_antechamber_am1bcc.sdf
  - openeye_am1bccelf10: ligands_openeye_am1bccelf10.sdf
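
The file names listed above follow a visible pattern: `ligands_<charge_type>.sdf`, with plain `ligands.sdf` for the uncharged file. The sketch below recovers the charge-type key from a file name under that assumption. `charge_type_from_filename` is a hypothetical helper, not part of the module, and the pattern is inferred only from the output shown here.

```python
from pathlib import Path


def charge_type_from_filename(path, prefix="ligands"):
    """Infer the charge-type key from a file name such as ``ligands_<type>.sdf``.

    Hypothetical helper: the naming pattern is inferred from the listing above.
    """
    stem = Path(path).stem  # drops only the final suffix (".sdf")
    if stem == prefix:
        return "no_charges"
    if not stem.startswith(prefix + "_"):
        raise ValueError(f"Unexpected file name: {Path(path).name!r}")
    return stem[len(prefix) + 1:]


for name in [
    "ligands_openeye_am1bcc.sdf",
    "ligands_nagl_openff-gnn-am1bcc-1.0.0.pt.sdf",
    "ligands.sdf",
]:
    print(f"{name} -> {charge_type_from_filename(name)}")
```

Because `Path.stem` removes only the last suffix, the `.pt` inside the NAGL model name is preserved. Passing `prefix="cofactors"` would apply the same rule to cofactor files, assuming they follow the same pattern.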


### Cofactors

Some systems may include cofactors. Let's check:

In [12]:
if p38_system.cofactors:
    print("Cofactor files available:")
    for charge_type, cofactor_path in p38_system.cofactors.items():
        print(f"  - {charge_type}: {cofactor_path.name}")
else:
    print("No cofactors for this system.")

Cofactor files available:
  - antechamber_am1bcc: cofactors_antechamber_am1bcc.sdf
  - openeye_am1bccelf10: cofactors_openeye_am1bccelf10.sdf
  - nagl_openff-gnn-am1bcc-1.0.0.pt: cofactors_nagl_openff-gnn-am1bcc-1.0.0.pt.sdf
  - no_charges: cofactors.sdf
  - openeye_am1bcc: cofactors_openeye_am1bcc.sdf


### Networks

Systems include network files (e.g., LOMAP networks):

In [13]:
print("Network files available:")
for network_name, network_path in p38_system.ligand_networks.items():
    print(f"  - {network_name}: {network_path.stat().st_size / 1024:.2f} KB")

Network files available:
  - industry_benchmarks_network: 162.91 KB


## Working with Multiple Systems

Let's load and compare multiple systems from the same benchmark set:

In [14]:
# Load multiple systems
systems = get_benchmark_set_data_systems('jacs_set')

# Compare them
print("System comparison:")
print(f"{'System':<10} {'Networks':<10} {'Cofactors':<10} {'Charge Types':<30}")
print("="*70)
for name, system in systems.items():
    charge_types = ', '.join(system.ligands.keys())
    has_cofactors = 'Yes' if system.cofactors else 'No'
    has_network = 'Yes' if system.ligand_networks else 'No'
    print(f"{name:<10} {has_network:<10} {has_cofactors:<10} {charge_types:<30}")

2026-02-04 13:49:18 | INFO     | Loaded system 'bace' from benchmark set 'jacs_set' with 5 ligand file(s), and 0 cofactor file(s).
Found protein file: True.
Found 1 ligand network files
2026-02-04 13:49:18 | INFO     | Loaded system 'cdk2' from benchmark set 'jacs_set' with 5 ligand file(s), and 0 cofactor file(s).
Found protein file: True.
Found 1 ligand network files
2026-02-04 13:49:18 | INFO     | Loaded system 'jnk1' from benchmark set 'jacs_set' with 5 ligand file(s), and 0 cofactor file(s).
Found protein file: True.
Found 1 ligand network files
2026-02-04 13:49:18 | INFO     | Loaded system 'mcl1' from benchmark set 'jacs_set' with 5 ligand file(s), and 0 cofactor file(s).
Found protein file: True.
Found 1 ligand network files
2026-02-04 13:49:18 | INFO     | Loaded system 'p38' from benchmark set 'jacs_set' with 5 ligand file(s), and 0 cofactor file(s).
Found

System comparison:
System     Networks   Cofactors  Charge Types
bace       Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, openeye_am1bccelf10
cdk2       Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, openeye_am1bccelf10
jnk1       Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, openeye_am1bccelf10
mcl1       Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, openeye_am1bccelf10
p38        Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, openeye_am1bccelf10
ptp1b      Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, openeye_am1bccelf10
thrombin   Yes        No         openeye_am1bcc, nagl_openff-gnn-am1bcc-1.0.0.pt, no_charges, antechamber_am1bcc, 
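
When comparing systems, a common follow-up question is which charge types every system provides. The sketch below answers it with a set intersection over plain lists. `common_charge_types` is a hypothetical helper, the first two entries are copied from the table above, and the third entry is invented purely to show a system with fewer files.

```python
def common_charge_types(charge_types_by_system):
    """Return the charge types present in every system, sorted by name."""
    type_sets = [set(types) for types in charge_types_by_system.values()]
    if not type_sets:
        return []
    return sorted(set.intersection(*type_sets))


all_five = [
    "openeye_am1bcc",
    "nagl_openff-gnn-am1bcc-1.0.0.pt",
    "no_charges",
    "antechamber_am1bcc",
    "openeye_am1bccelf10",
]

charge_types_by_system = {
    "bace": all_five,
    "cdk2": all_five,
    # Hypothetical system without the OpenEye-derived files (illustration only).
    "hypothetical_system": [
        "nagl_openff-gnn-am1bcc-1.0.0.pt",
        "no_charges",
        "antechamber_am1bcc",
    ],
}

shared = common_charge_types(charge_types_by_system)
print(shared)
```

With the objects loaded in this notebook, the input mapping could presumably be built as `{name: list(system.ligands) for name, system in systems.items()}`, since `system.ligands` is iterated by key in the cell above.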

## Error Handling

The module provides helpful error messages when you try to access non-existent benchmark sets or systems:

In [15]:
# Try to load a non-existent benchmark set
try:
    system = get_benchmark_data_system('nonexistent_set', 'p38')
except ValueError as e:
    print(f"Error: {e}")

Error: Benchmark set 'nonexistent_set' not found. Available benchmark sets: ['charge_annihilation_set', 'fragments', 'jacs_set', 'janssen_bace', 'mcs_docking_set', 'merck', 'miscellaneous_set', 'water_set']


In [16]:
# Try to load a non-existent system
try:
    system = get_benchmark_data_system('jacs_set', 'nonexistent_system')
except ValueError as e:
    print(f"Error: {e}")

Error: System 'nonexistent_system' not found in benchmark set 'jacs_set'. Available systems in 'jacs_set': ['bace', 'cdk2', 'jnk1', 'mcl1', 'p38', 'ptp1b', 'thrombin', 'tyk2']
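
The two messages above share a pattern: name what was not found, then list the valid choices. The sketch below shows one way to write such a check against a plain mapping. `resolve_system` and the `available` mapping are hypothetical and only mimic the message format printed above; they are not the module's own code.

```python
def resolve_system(available, benchmark_set, system):
    """Validate a (benchmark_set, system) pair against a mapping of set -> systems."""
    if benchmark_set not in available:
        raise ValueError(
            f"Benchmark set '{benchmark_set}' not found. "
            f"Available benchmark sets: {sorted(available)}"
        )
    systems = available[benchmark_set]
    if system not in systems:
        raise ValueError(
            f"System '{system}' not found in benchmark set '{benchmark_set}'. "
            f"Available systems in '{benchmark_set}': {sorted(systems)}"
        )
    return benchmark_set, system


# Small illustrative mapping using names that appear earlier in this notebook.
available = {"jacs_set": ["bace", "cdk2", "tyk2"], "merck": ["pfkfb3", "tnks2"]}

for request in [
    ("jacs_set", "tyk2"),
    ("nonexistent_set", "p38"),
    ("jacs_set", "nonexistent_system"),
]:
    try:
        print("Resolved:", resolve_system(available, *request))
    except ValueError as error:
        print(f"Error: {error}")
```

Checking the benchmark set before the system means the caller always sees the most specific message that applies.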
