Added the auto-database feature #2909

Open
alongd wants to merge 6 commits into main from auto_db

@alongd alongd commented Mar 27, 2026

Adds automatic library and kinetics family selection to RMG. Users can now write thermoLibraries='auto' (and likewise for reaction libraries, transport libraries, seed mechanisms, and kinetics families) in their input file, and RMG will pick libraries based on the species and reactor conditions in the input. The quality of the selection depends on the correctness of recommended_libraries.yml.

The selection logic detects elements (N, S, O, halogens, Li), reactor type (gas/liquid/surface), and temperature to trigger the appropriate
chemistry sets defined in a new recommended_libraries.yml file in RMG-database. Kinetics families are similarly auto-selected from the existing
recommended.py sets.

Key design choices:

  • 'auto' is opt-in, not the default, so existing input files work unchanged (backwards compatible)
  • Users can mix manual and auto: ['myLib', 'auto'] gives myLib higher priority
  • PAH formation libraries (70+ kinetics libs) are only auto-included for pure C/H pyrolysis; oxygenated systems need the explicit <PAH_libs> keyword
    to opt in
  • Families support ['!H_Abstraction', 'auto'] to exclude specific families from the auto set
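
To make the mixing and exclusion rules above concrete, here is a minimal sketch of how a families list such as ['!H_Abstraction', 'auto'] could be resolved. The helper name `resolve_families` and the example auto set are illustrative assumptions, not code from this PR; the actual logic is in rmgpy/data/auto_database.py and may differ in detail.

```python
from typing import List, Union

AUTO = 'auto'  # token name as used in this PR


def resolve_families(user_spec: Union[str, List[str]],
                     auto_families: List[str]) -> List[str]:
    """Hypothetical helper: expand 'auto' and apply '!Name' exclusions."""
    if user_spec == AUTO:
        return list(auto_families)

    # Names prefixed with '!' are removed from the final result.
    excluded = {item[1:] for item in user_spec if item.startswith('!')}

    resolved: List[str] = []
    for item in user_spec:
        if item.startswith('!'):
            continue
        # 'auto' expands in place; anything else is a user-chosen family.
        candidates = auto_families if item == AUTO else [item]
        for name in candidates:
            if name not in excluded and name not in resolved:
                resolved.append(name)
    return resolved


# Made-up auto set, for illustration only.
auto_set = ['H_Abstraction', 'R_Recombination', 'Disproportionation']
print(resolve_families(['!H_Abstraction', 'auto'], auto_set))
```

In this sketch, entries listed before 'auto' keep their position, which is how a manual entry would end up with higher priority than the auto-selected ones.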

A preview notebook is available at ipython/auto_library_selection.ipynb.

We should first merge the db branch ReactionMechanismGenerator/RMG-database#712

PR adapted from the existing implementation in T3 (libraries.yml and code)

@alongd alongd requested a review from Copilot March 27, 2026 23:53

alongd commented Mar 28, 2026

If this feature is triggered, RMG reports something like:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Auto-selecting libraries for: [C/H/O] T_max=2000 K
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Chemistry sets triggered: primary, oxidation, CH_pyrolysis_core
  Thermo libraries (14): 
    primaryThermoLibrary, BurkeH2O2, NOx2018, Butadiene_Dimerization
    CurranPentane, Chernov, heavy_oil_ccsdtf12_1dHR, Klippenstein_Glarborg2016
    Spiekermann_refining_elementary_reactions, thermo_DFT_CCSDTF12_BAC
    DFT_QCI_thermo, CBS_QB3_1dHR, FFCM1(-), Narayanaswamy
  Reaction libraries (8): 
    primaryH2O2, FFCM1(-), NOx2018, 2006_Joshi_OH_CO, 2005_Senosiain_OH_C2H2
    C2H2_init, Klippenstein_Glarborg2016, Chernov
  Seed mechanisms (1): 
    primaryH2O2
  Transport libraries (4): 
    PrimaryTransportLibrary, OneDMinN2, NOx2018, GRI-Mech
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copilot AI left a comment

Pull request overview

Adds an opt-in “auto-database” mode to RMG that can automatically select thermo/kinetics/transport libraries, seed mechanisms, and kinetics families based on detected chemistry (elements, phase, surface, temperature), driven by recommended_libraries.yml in RMG-database and the existing kinetics/families/recommended.py sets.

Changes:

  • Add rmgpy/data/auto_database.py implementing chemistry detection + auto-selection logic (including <PAH_libs> handling and family exclusion via !FamilyName).
  • Extend input parsing and startup initialization to accept/pass through 'auto' and <PAH_libs> and to preserve reaction-library “output edge” flags.
  • Add unit tests + user documentation + a preview Jupyter notebook.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| rmgpy/rmg/input.py | Accepts 'auto' / <PAH_libs> tokens in database() and stores reaction libraries as strings + a sidecar reaction_libraries_output_edge set. |
| rmgpy/rmg/main.py | Runs auto-selection during initialize() and converts reaction libraries back to (name, output_edge) tuples before database load. |
| rmgpy/data/auto_database.py | New module implementing detection, YAML expansion, merging logic, and kinetics family resolution. |
| test/rmgpy/rmg/inputTest.py | Updates reaction library parsing expectations; adds token-handling tests. |
| test/rmgpy/data/autoDatabaseTest.py | New test suite for chemistry detection, YAML expansion/merge behavior, and end-to-end selection outcomes. |
| documentation/source/users/rmg/input.rst | Documents 'auto', mixed manual/auto lists, and <PAH_libs> behavior + notebook preview. |
| ipython/auto_library_selection.ipynb | Notebook to preview what auto-selection would choose for a given input file. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

{
"cell_type": "code",
"id": "eft98a4ciwl",
"source": "from rmgpy.data.auto_database import auto_select_libraries, PAH_LIBS, _to_reaction_library_tuples\n\n# Work on a fresh copy so we don't mutate the rmg object used above\nimport copy\nrmg2 = copy.deepcopy(rmg)\n\n# Run the same auto-selection that main.py would run\nauto_select_libraries(rmg2)\n\n# Convert reaction libraries to tuples (as main.py does before load_database)\nif isinstance(rmg2.reaction_libraries, list):\n output_edge = getattr(rmg2, 'reaction_libraries_output_edge', set())\n rmg2.reaction_libraries = _to_reaction_library_tuples(rmg2.reaction_libraries, output_edge)\n\nhas_auto = any(\n getattr(rmg, attr, None) == 'auto'\n or (isinstance(getattr(rmg, attr, None), list) and 'auto' in getattr(rmg, attr))\n for attr in ('thermo_libraries', 'reaction_libraries', 'seed_mechanisms',\n 'transport_libraries', 'kinetics_families')\n)\n\nif not has_auto:\n print('The input file does not use \\'auto\\' in any database field.')\n print('The settings below are exactly what was specified in the input file.\\n')\n\nprint_list('Thermo libraries', rmg2.thermo_libraries)\n\nrxn_lib_names = [name for name, _ in rmg2.reaction_libraries] if isinstance(rmg2.reaction_libraries, list) else rmg2.reaction_libraries\nprint_list('Reaction libraries', rxn_lib_names or [])\n\nedge_libs = [name for name, flag in rmg2.reaction_libraries if flag] if isinstance(rmg2.reaction_libraries, list) else []\nif edge_libs:\n print(f'\\n (output unused edge reactions for: {\", \".join(edge_libs)})')\n\nprint_list('Seed mechanisms', rmg2.seed_mechanisms or [])\nprint_list('Transport libraries', rmg2.transport_libraries or [])\n\nif isinstance(rmg2.kinetics_families, list):\n print_list('Kinetics families', rmg2.kinetics_families)\nelse:\n print(f'\\nKinetics families: {rmg2.kinetics_families!r} (resolved at database load time)')",

Copilot AI Mar 28, 2026

This notebook imports and calls _to_reaction_library_tuples, but the module exports to_reaction_library_tuples (no leading underscore). As written, running the notebook will raise ImportError/NameError in the “Actual Resolution” cell. Update the import and call sites to use the real function name.

Suggested change
"source": "from rmgpy.data.auto_database import auto_select_libraries, PAH_LIBS, to_reaction_library_tuples\n\n# Work on a fresh copy so we don't mutate the rmg object used above\nimport copy\nrmg2 = copy.deepcopy(rmg)\n\n# Run the same auto-selection that main.py would run\nauto_select_libraries(rmg2)\n\n# Convert reaction libraries to tuples (as main.py does before load_database)\nif isinstance(rmg2.reaction_libraries, list):\n output_edge = getattr(rmg2, 'reaction_libraries_output_edge', set())\n rmg2.reaction_libraries = to_reaction_library_tuples(rmg2.reaction_libraries, output_edge)\n\nhas_auto = any(\n getattr(rmg, attr, None) == 'auto'\n or (isinstance(getattr(rmg, attr, None), list) and 'auto' in getattr(rmg, attr))\n for attr in ('thermo_libraries', 'reaction_libraries', 'seed_mechanisms',\n 'transport_libraries', 'kinetics_families')\n)\n\nif not has_auto:\n print('The input file does not use \\'auto\\' in any database field.')\n print('The settings below are exactly what was specified in the input file.\\n')\n\nprint_list('Thermo libraries', rmg2.thermo_libraries)\n\nrxn_lib_names = [name for name, _ in rmg2.reaction_libraries] if isinstance(rmg2.reaction_libraries, list) else rmg2.reaction_libraries\nprint_list('Reaction libraries', rxn_lib_names or [])\n\nedge_libs = [name for name, flag in rmg2.reaction_libraries if flag] if isinstance(rmg2.reaction_libraries, list) else []\nif edge_libs:\n print(f'\\n (output unused edge reactions for: {\", \".join(edge_libs)})')\n\nprint_list('Seed mechanisms', rmg2.seed_mechanisms or [])\nprint_list('Transport libraries', rmg2.transport_libraries or [])\n\nif isinstance(rmg2.kinetics_families, list):\n print_list('Kinetics families', rmg2.kinetics_families)\nelse:\n print(f'\\nKinetics families: {rmg2.kinetics_families!r} (resolved at database load time)')",

Copilot uses AI. Check for mistakes.
Comment on lines +208 to +411
def determine_chemistry_sets(profile: ChemistryProfile,
                             pah_libs_requested: bool = False,
                             ) -> List[str]:
    """
    Determine which chemistry sets to activate based on the detected profile.

    CH pyrolysis logic:
      - CH_pyrolysis_core is always added when C present AND T >= 800 K.
      - PAH_formation is added when:
          (a) C + T >= 800 K + no O in species (pure C/H pyrolysis), OR
          (b) C + T >= 800 K + <PAH_libs> keyword requested by user.

    Args:
        profile: ChemistryProfile instance.
        pah_libs_requested: bool, True if user included <PAH_libs> keyword.

    Returns:
        List of ChemistrySet values in priority order.
    """
    sets = [ChemistrySet.PRIMARY]

    if profile.has_nitrogen:
        sets.append(ChemistrySet.NITROGEN)

    if profile.has_sulfur:
        sets.append(ChemistrySet.SULFUR)

    if profile.has_oxygen:
        sets.append(ChemistrySet.OXIDATION)

    high_T_carbon = profile.has_carbon and profile.max_temperature >= CH_PYROLYSIS_T_THRESHOLD

    if high_T_carbon:
        sets.append(ChemistrySet.CH_PYROLYSIS_CORE)

        if not profile.has_oxygen or pah_libs_requested:
            sets.append(ChemistrySet.PAH_FORMATION)

    if profile.has_liquid and profile.has_oxygen:
        sets.append(ChemistrySet.LIQUID_OXIDATION)

    if profile.has_surface:
        sets.append(ChemistrySet.SURFACE)

    if profile.has_surface and profile.has_nitrogen:
        sets.append(ChemistrySet.SURFACE_NITROGEN)

    if profile.has_halogens:
        sets.append(ChemistrySet.HALOGENS)

    if profile.has_electrochem:
        sets.append(ChemistrySet.ELECTROCHEM)

    return sets


def determine_kinetics_families(profile: ChemistryProfile) -> List[str]:
    """
    Determine which kinetics family sets to activate based on the detected profile.

    These correspond to named sets in RMG-database/input/kinetics/families/recommended.py.

    Args:
        profile: ChemistryProfile instance.

    Returns:
        List of FamilySet values to combine.
    """
    family_sets = [FamilySet.DEFAULT]

    if profile.has_carbon and profile.max_temperature >= CH_PYROLYSIS_T_THRESHOLD:
        family_sets.append(FamilySet.CH_PYROLYSIS)

    if profile.has_liquid and profile.has_oxygen:
        family_sets.append(FamilySet.LIQUID_PEROXIDE)

    if profile.has_surface:
        family_sets.append(FamilySet.SURFACE)

    if profile.has_halogens:
        family_sets.append(FamilySet.HALOGENS)

    if profile.has_electrochem:
        family_sets.append(FamilySet.ELECTROCHEM)

    return family_sets


def load_recommended_yml(database_directory: str) -> dict:
    """
    Load the recommended_libraries.yml file from the RMG database.

    Args:
        database_directory: path to the RMG database 'input' directory.

    Returns:
        dict parsed from YAML.
    """
    yml_path = os.path.join(database_directory, 'recommended_libraries.yml')
    if not os.path.isfile(yml_path):
        raise InputError(f"Could not find recommended_libraries.yml at {yml_path}. "
                         f"This file is required for 'auto' library selection.")
    with open(yml_path, 'r') as f:
        return yaml.safe_load(f)


def expand_chemistry_sets(recommended_data: dict,
                          set_names: List[str],
                          ) -> Tuple[List[str], List[str], List[str], List[str]]:
    """
    Expand named chemistry sets into concrete library lists.

    Args:
        recommended_data: dict from recommended_libraries.yml.
        set_names: list of chemistry set names to expand.

    Returns:
        Tuple of (thermo_libraries, kinetics_libraries, transport_libraries, seed_libraries)
        where each is a list of library name strings.
    """
    # Primary must always be expanded first so its libraries have highest priority.
    primary_val = ChemistrySet.PRIMARY.value
    has_primary = any(str(s) == primary_val for s in set_names)
    other_sets = [s for s in set_names if str(s) != primary_val]
    set_names = ([ChemistrySet.PRIMARY] if has_primary else []) + other_sets

    thermo, kinetics, transport, seed = [], [], [], []

    for set_name in set_names:
        if set_name not in recommended_data:
            raise InputError(f"Chemistry set '{set_name}' not found in recommended_libraries.yml. "
                             f"Available sets: {list(recommended_data.keys())}")
        set_data = recommended_data[set_name]

        for entry in set_data.get('thermo', []):
            name = entry if isinstance(entry, str) else entry['name']
            if name not in thermo:
                thermo.append(name)

        for entry in set_data.get('kinetics', []):
            if isinstance(entry, str):
                if entry not in kinetics:
                    kinetics.append(entry)
            elif isinstance(entry, dict):
                name = entry['name']
                if entry.get('seed', False):
                    if name not in seed:
                        seed.append(name)
                else:
                    if name not in kinetics:
                        kinetics.append(name)

        for entry in set_data.get('transport', []):
            name = entry if isinstance(entry, str) else entry['name']
            if name not in transport:
                transport.append(name)

    return thermo, kinetics, transport, seed


def merge_with_user_libraries(user_spec: Any, auto_libs: List[str]) -> list:
    """
    Merge user-specified libraries with auto-selected libraries,
    respecting the position of the 'auto' token. <PAH_libs> tokens
    are silently removed (they've already been used as a signal).

    Args:
        user_spec: the user's library specification. Can be:
            - 'auto' (string): fully replace with auto_libs
            - list containing 'auto' token: replace token in-place with auto_libs
            - list without 'auto': return as-is (with <PAH_libs> stripped)
            - None or []: return as-is
        auto_libs: list of auto-selected library names.

    Returns:
        Resolved list of library names.
    """
    if user_spec == AUTO:
        return list(auto_libs)

    if not isinstance(user_spec, list):
        return user_spec

    # Collect all user-specified library names (excluding special tokens)
    user_lib_names = set()
    for item in user_spec:
        if item not in (AUTO, PAH_LIBS):
            name = item[0] if isinstance(item, tuple) else item
            user_lib_names.add(name)

    # Filter auto libs to exclude any already specified by user
    filtered_auto = [lib for lib in auto_libs if lib not in user_lib_names]

    # Replace tokens in-place
    result = []
    for item in user_spec:
        if item == AUTO:
            result.extend(filtered_auto)
        elif item == PAH_LIBS:
            continue
        else:
            result.append(item)

    return result

Copilot AI Mar 28, 2026

Type annotations don’t match the actual values returned/consumed: determine_chemistry_sets() is annotated to return List[str] but returns ChemistrySet enum members, and determine_kinetics_families() similarly returns FamilySet members. Likewise merge_with_user_libraries() is annotated to return list but can return None or a non-list user_spec. Please update the return type hints (and any dependent parameter hints like expand_chemistry_sets(set_names=...)) to reflect the real runtime types, so static analysis and IDE tooling don’t mislead callers.
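
One way the hints could be aligned with the runtime types is sketched below. The `ChemistrySet` members and the trimmed-down `ChemistryProfile` are stand-ins invented for this example, and the function bodies are heavily simplified; only the annotation pattern is the point.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, List, Optional


class ChemistrySet(Enum):
    # Stand-in members; the enum in the PR has more entries.
    PRIMARY = 'primary'
    OXIDATION = 'oxidation'


@dataclass
class ChemistryProfile:
    # Reduced stand-in for the PR's profile object.
    has_oxygen: bool = False


def determine_chemistry_sets(profile: ChemistryProfile) -> List[ChemistrySet]:
    """Annotated with the enum type that is actually returned."""
    sets = [ChemistrySet.PRIMARY]
    if profile.has_oxygen:
        sets.append(ChemistrySet.OXIDATION)
    return sets


def merge_with_user_libraries(user_spec: Any,
                              auto_libs: List[str]) -> Optional[list]:
    """Optional[list] covers the None pass-through case."""
    if user_spec == 'auto':
        return list(auto_libs)
    if user_spec is None or isinstance(user_spec, list):
        return user_spec  # simplified: token replacement omitted here
    # Assumption of this sketch: other types are rejected rather than returned.
    raise TypeError(f'Unsupported library specification: {user_spec!r}')
```

Rejecting unexpected types is only one option; widening the return annotation to match the pass-through behavior would be the less invasive alternative.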

Comment on lines +330 to +332
has_primary = any(str(s) == primary_val for s in set_names)
other_sets = [s for s in set_names if str(s) != primary_val]
set_names = ([ChemistrySet.PRIMARY] if has_primary else []) + other_sets

Copilot AI Mar 28, 2026

expand_chemistry_sets() tries to ensure the primary set is expanded first, but it checks str(s) == 'primary'. For ChemistrySet enum members, str(ChemistrySet.PRIMARY) is 'ChemistrySet.PRIMARY', so has_primary will be false and the reordering logic won’t work if set_names contains enum members (as it does when passed the output of determine_chemistry_sets()). Consider normalizing via getattr(s, 'value', s) (or comparing s == ChemistrySet.PRIMARY / s.value) before doing the primary-first reordering.

Suggested change
# Normalize set names so we can handle both strings and ChemistrySet enum members.
normalized_names = [getattr(s, "value", s) for s in set_names]
has_primary = any(n == primary_val for n in normalized_names)
other_sets = [n for n in normalized_names if n != primary_val]
# After this point, set_names contains only string names (e.g., "primary").
set_names = ([primary_val] if has_primary else []) + other_sets
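
The mismatch is easy to reproduce outside RMG. The sketch below assumes `ChemistrySet` is a plain `Enum` whose values are the YAML keys (the two member values are guessed from the log output earlier in this thread), and the helper name `primary_first` is hypothetical. It shows both the failing `str()` comparison and the `getattr(s, 'value', s)` normalization.

```python
from enum import Enum


class ChemistrySet(Enum):
    # Assumed values; the PR's enum is not shown in full here.
    PRIMARY = 'primary'
    OXIDATION = 'oxidation'


def primary_first(set_names):
    """Return plain string names with 'primary' moved to the front."""
    primary_val = ChemistrySet.PRIMARY.value
    # Enum members expose .value; plain strings pass through unchanged.
    names = [getattr(s, 'value', s) for s in set_names]
    rest = [n for n in names if n != primary_val]
    return ([primary_val] if primary_val in names else []) + rest


# str() of a plain Enum member is the qualified name, not the value,
# so comparing it against 'primary' never matches.
print(str(ChemistrySet.PRIMARY) == 'primary')
print(primary_first([ChemistrySet.OXIDATION, ChemistrySet.PRIMARY, 'nitrogen']))
```

Because the normalized list holds only strings, the later dictionary lookups against the parsed YAML would no longer depend on how the enum hashes or compares.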

Comment on lines +91 to +128
# Handle 'auto' token: pass through for later resolution by auto_select_libraries().
# '<PAH_libs>' is only valid as a token inside a list, not as a standalone value.
_LIST_TOKENS = (AUTO, PAH_LIBS)

if thermoLibraries == AUTO:
    rmg.thermo_libraries = AUTO
else:
    rmg.thermo_libraries = as_list(thermoLibraries, default=[])

if transportLibraries == AUTO:
    rmg.transport_libraries = AUTO
else:
    rmg.transport_libraries = as_list(transportLibraries, default=None)

# Store reaction libraries as plain strings; remember which ones had True option
# (the bool indicates "also output unused edge reactions to chemkin file")
if reactionLibraries == AUTO:
    rmg.reaction_libraries = AUTO
    rmg.reaction_libraries_output_edge = set()
else:
    reaction_libraries = as_list(reactionLibraries, default=[])
    rmg.reaction_libraries = []
    rmg.reaction_libraries_output_edge = set()
    for item in reaction_libraries:
        if item in _LIST_TOKENS:
            rmg.reaction_libraries.append(item)
        elif isinstance(item, tuple):
            name, option = item
            rmg.reaction_libraries.append(name)
            if option:
                rmg.reaction_libraries_output_edge.add(name)
        else:
            rmg.reaction_libraries.append(item)

if seedMechanisms == AUTO:
    rmg.seed_mechanisms = AUTO
else:
    rmg.seed_mechanisms = as_list(seedMechanisms, default=[])

Copilot AI Mar 28, 2026

The comment says '<PAH_libs>' is only valid inside a list, but the code currently accepts thermoLibraries == PAH_LIBS / transportLibraries == PAH_LIBS / seedMechanisms == PAH_LIBS (and also reactionLibraries == PAH_LIBS) without raising. This leads to confusing downstream behavior (e.g., auto_select_libraries() will treat it as a special token and _log_lib_list() will iterate over a string). Please add explicit validation to raise InputError when any of these fields is set to '<PAH_libs>' as a standalone value, and also reject tuples like ('<PAH_libs>', True) in reactionLibraries.
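
A possible shape for that validation is sketched below. `InputError` is redefined locally so the snippet runs on its own, and `check_library_field` is a hypothetical helper name; the token strings follow the ones used in this PR.

```python
AUTO = 'auto'
PAH_LIBS = '<PAH_libs>'


class InputError(Exception):
    """Local stand-in for RMG's InputError."""


def check_library_field(field_name, value):
    """Reject '<PAH_libs>' as a standalone value or wrapped in a tuple."""
    if value == PAH_LIBS:
        raise InputError(f"{field_name}: '{PAH_LIBS}' is only valid inside a list, "
                         f"e.g. ['{PAH_LIBS}', '{AUTO}'].")
    if isinstance(value, list):
        for item in value:
            # Tuples are (name, output_edge) pairs; the token is not a library name.
            if isinstance(item, tuple) and item and item[0] == PAH_LIBS:
                raise InputError(f"{field_name}: '{PAH_LIBS}' cannot be given "
                                 f"as a (name, option) tuple.")


check_library_field('thermoLibraries', ['myLib', PAH_LIBS, AUTO])  # accepted
try:
    check_library_field('reactionLibraries', [(PAH_LIBS, True)])
except InputError as err:
    print(err)
```

Calling such a check once per field before the `== AUTO` branches would keep the parsing code itself unchanged.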
