# Advanced search

NIST Chemistry WebBook supports structure search; however, to the best of our knowledge, there is no straightforward way to expose it through a Python API. To work around this, as well as the WebBook's limit on the number of compounds a search returns, the **NistChemPy** package ships a dataframe with the main information on all NIST Chemistry WebBook compounds:

In [1]:
import nistchempy as nist
import pandas as pd

pd.set_option('display.max_columns', None)
df = nist.get_all_data()
df

Unnamed: 0,ID,name,synonyms,formula,mol_weight,inchi,inchi_key,cas_rn,mol2D,mol3D,Gas phase thermochemistry data,Condensed phase thermochemistry data,Phase change data,Reaction thermochemistry data,Gas phase ion energetics data,Ion clustering data,IR Spectrum,THz IR spectrum,Mass spectrum (electron ionization),UV/Visible spectrum,Gas Chromatography,Vibrational and/or electronic energy levels,Constants of diatomic molecules,Henry's Law data,Fluid Properties,Computational Chemistry Comparison and Benchmark Database,Electron-Impact Ionization Cross Sections (on physics web site),Gas Phase Kinetics Database,Microwave spectra (on physics lab web site),NIST Atomic Spectra Database - Ground states and ionization energies (on physics web site),NIST Atomic Spectra Database - Levels Holdings (on physics web site),NIST Atomic Spectra Database - Lines Holdings (on physics web site),NIST Polycyclic Aromatic Hydrocarbon Structure Index,Reference simulation,Reference simulation: SPC/E Water,Reference simulation: TraPPE Carbon Dioxide,"X-ray Photoelectron Spectroscopy Database, version 5.0","NIST / TRC Web Thermo Tables, ""lite"" edition (thermophysical and thermochemical data)","NIST / TRC Web Thermo Tables, professional edition (thermophysical and thermochemical data)"
0,B100,iron oxide anion,,FeO-,71.8450,,,,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,,,,,,,,,,,,,,,,,,,,,,,,
1,B1000,AsF3..Cl anion,,AsClF3-,167.3700,,,,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,B1000000,AgH2-,,AgH2-,109.8846,,,,,,,,,,,,,,,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,,,,,,,,,,,,,,,,,
3,B1000001,HAg(H2),,AgH3,110.8920,,,,,,,,,,,,,,,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,,,,,,,,,,,,,,,,,
4,B1000002,AgNO+,,AgNO+,137.8738,,,,,,,,,,,,,,,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=B100...,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129340,U99777,"Methyl 3-hydroxycholest-5-en-26-oate, TMS deri...","Methyl (25RS)-3β-hydroxy-5-cholesten-26-oate, ...",C31H54O3Si,502.8442,InChI=1S/C31H54O3Si/c1-21(10-9-11-22(2)29(32)3...,DNXGNXYNSBCWGX-QBUYVTDMSA-N,,https://webbook.nist.gov/cgi/cbook.cgi?Str2Fil...,,,,,,,,,,https://webbook.nist.gov/cgi/cbook.cgi?ID=U997...,,https://webbook.nist.gov/cgi/cbook.cgi?ID=U997...,,,,,,,,,,,,,,,,,,
129341,U99830,"2-Methyl-3-oxovaleric acid, O,O'-bis(trimethyl...","3-Oxopentanoic acid, 2-methyl, TMS\n2-Methyl-3...",C12H26O3Si2,274.5040,"InChI=1S/C12H26O3Si2/c1-9-11(14-16(3,4)5)10(2)...",LXAIQDVPXKOIGO-KHPPLWFESA-N,,https://webbook.nist.gov/cgi/inchi?Str2File=U9...,,,,,,,,,,https://webbook.nist.gov/cgi/inchi?ID=U99830&M...,,https://webbook.nist.gov/cgi/inchi?ID=U99830&M...,,,,,,,,,,,,,,,,,,
129342,U99942,3-Hydroxy-3-(4'-hydroxy-3'-methoxyphenyl)propi...,"Vanillylhydracrylic acid, tri-TMS\nVanillylhyd...",C19H36O5Si3,428.7426,InChI=1S/C19H36O5Si3/c1-21-18-13-15(11-12-16(1...,QCMUGKOFXVYNCF-UHFFFAOYSA-N,,https://webbook.nist.gov/cgi/inchi?Str2File=U9...,,,,,,,,,,https://webbook.nist.gov/cgi/inchi?ID=U99942&M...,,https://webbook.nist.gov/cgi/inchi?ID=U99942&M...,,,,,,,,,,,,,,,,,,
129343,U99947,"2-Propylpentanoic acid, 2,3,4,6-tetra(trimethy...","Valproic acid, glucuronide, TMS",C26H58O7Si4,595.0765,InChI=1S/C26H58O7Si4/c1-15-17-20(18-16-2)25(27...,OVXMRISJDUWFKB-UHFFFAOYSA-N,,https://webbook.nist.gov/cgi/inchi?Str2File=U9...,,,,,,,,,,https://webbook.nist.gov/cgi/inchi?ID=U99947&M...,,https://webbook.nist.gov/cgi/inchi?ID=U99947&M...,,,,,,,,,,,,,,,,,,


Its columns can be divided into five groups:

1. General properties:

      - `ID`: NIST Compound ID

      - `name`: chemical name

      - `synonyms`: synonyms

      - `formula`: chemical formula

      - `mol_weight`: molecular weight

      - `inchi` / `inchi_key`: InChI / InChIKey strings

      - `cas_rn`: CAS Registry Number

2. Molecular files:

      - `mol2D` / `mol3D`: 2D and 3D MOL-files

3. NIST Chemistry WebBook data:

      - Gas phase thermochemistry data

      - Condensed phase thermochemistry data

      - Phase change data

      - Reaction thermochemistry data

      - Gas phase ion energetics data

      - Ion clustering data

      - IR Spectrum

      - THz IR spectrum

      - Mass spectrum (electron ionization)

      - UV/Visible spectrum

      - Gas Chromatography

      - Vibrational and/or electronic energy levels

      - Constants of diatomic molecules

      - Henry's Law data

      - Fluid Properties

4. NIST public data:

      - Computational Chemistry Comparison and Benchmark Database

      - Electron-Impact Ionization Cross Sections (on physics web site)

      - Gas Phase Kinetics Database

      - Microwave spectra (on physics lab web site)

      - NIST Atomic Spectra Database - Ground states and ionization energies (on physics web site)

      - NIST Atomic Spectra Database - Levels Holdings (on physics web site)

      - NIST Atomic Spectra Database - Lines Holdings (on physics web site)

      - NIST Polycyclic Aromatic Hydrocarbon Structure Index

      - Reference simulation

      - Reference simulation: SPC/E Water

      - Reference simulation: TraPPE Carbon Dioxide

      - X-ray Photoelectron Spectroscopy Database, version 5.0

5. NIST subscription data:

      - NIST / TRC Web Thermo Tables, "lite" edition (thermophysical and thermochemical data)

      - NIST / TRC Web Thermo Tables, professional edition (thermophysical and thermochemical data)


All columns except for the first group contain URLs for the corresponding data, allowing one to parse the relevant pages without the need to preload the compounds themselves:

In [2]:
col = 'NIST Atomic Spectra Database - Ground states and ionization energies (on physics web site)'
df.loc[~df[col].isna(), ['ID', 'inchi', col]]

Unnamed: 0,ID,inchi,NIST Atomic Spectra Database - Ground states and ionization energies (on physics web site)
11510,C10028145,InChI=1S/No,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
11593,C10043922,InChI=1S/Rn,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
11749,C10097322,InChI=1S/Br,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
16928,C12385136,InChI=1S/H,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
18624,C13494809,InChI=1S/Te,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
...,...,...,...
59700,C7440735,InChI=1S/Fr,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
59701,C7440746,InChI=1S/In,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
60912,C7704349,InChI=1S/S,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
61010,C7723140,InChI=1S/P,https://physics.nist.gov/cgi-bin/ASD/ie.pl?spe...
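
As a minimal sketch of how such a URL column might be consumed, the snippet below collects the non-empty links of one column into an ID-to-URL mapping. The `collect_urls` helper is hypothetical (it is not part of NistChemPy), and the toy dataframe with made-up rows and `example.org` URLs only mimics the layout of the real one; with the actual data, `df` from `nist.get_all_data()` would take the place of `toy`.

```python
import pandas as pd

def collect_urls(df, col):
    """Return {ID: URL} for the rows where `col` is not empty (hypothetical helper)."""
    present = df.loc[df[col].notna(), ['ID', col]]
    return dict(zip(present['ID'], present[col]))

# toy stand-in for the dataframe returned by nist.get_all_data();
# IDs and URLs are invented for illustration
toy = pd.DataFrame({
    'ID': ['X1', 'X2', 'X3'],
    'IR Spectrum': ['https://example.org/ir?ID=X1', None, 'https://example.org/ir?ID=X3'],
})

urls = collect_urls(toy, 'IR Spectrum')
print(urls)
```

The resulting dictionary can then be iterated over to request and parse only the pages of interest.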


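This dataframe can be used to narrow the full set of entries down to those with the desired properties. The `nist.get_search_parameters` function returns a dictionary that maps short parameter names to their descriptions; for the `c*` keys, the descriptions match the WebBook data column names of the dataframe, so the short names can be used to select columns: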

In [3]:
ps = nist.get_search_parameters()
ps

{'use_SI': 'Units for thermodynamic data, "SI" if True and "calories" if False',
 'match_isotopes': 'Exactly match the specified isotopes (formula search only)',
 'allow_other': 'Allow elements not specified in formula (formula search only)',
 'allow_extra': 'Allow more atoms of elements in formula than specified (formula search only)',
 'no_ion': 'Exclude ions from the search (formula search only)',
 'cTG': 'Gas phase thermochemistry data',
 'cTC': 'Condensed phase thermochemistry data',
 'cTP': 'Phase change data',
 'cTR': 'Reaction thermochemistry data',
 'cIE': 'Gas phase ion energetics data',
 'cIC': 'Ion clustering data',
 'cIR': 'IR Spectrum',
 'cTZ': 'THz IR spectrum',
 'cMS': 'Mass spectrum (electron ionization)',
 'cUV': 'UV/Visible spectrum',
 'cGC': 'Gas Chromatography',
 'cES': 'Vibrational and/or electronic energy levels',
 'cDI': 'Constants of diatomic molecules',
 'cSO': "Henry's Law data"}

In [4]:
pd.set_option('display.max_columns', 20)
sub = df.loc[~df.inchi.isna() & ~df.mol2D.isna() & ~df[ps['cMS']].isna() & ~df[ps['cUV']].isna()]
sub

Unnamed: 0,ID,name,synonyms,formula,mol_weight,inchi,inchi_key,cas_rn,mol2D,mol3D,...,NIST Atomic Spectra Database - Ground states and ionization energies (on physics web site),NIST Atomic Spectra Database - Levels Holdings (on physics web site),NIST Atomic Spectra Database - Lines Holdings (on physics web site),NIST Polycyclic Aromatic Hydrocarbon Structure Index,Reference simulation,Reference simulation: SPC/E Water,Reference simulation: TraPPE Carbon Dioxide,"X-ray Photoelectron Spectroscopy Database, version 5.0","NIST / TRC Web Thermo Tables, ""lite"" edition (thermophysical and thermochemical data)","NIST / TRC Web Thermo Tables, professional edition (thermophysical and thermochemical data)"
11398,C100016,p-Nitroaniline,"Benzenamine, 4-nitro-\nAniline, p-nitro-\np-Am...",C6H6N2O2,138.1240,InChI=1S/C6H6N2O2/c7-5-1-3-6(4-2-5)8(9)10/h1-4...,TYMLOMAKGOJONV-UHFFFAOYSA-N,100-01-6,https://webbook.nist.gov/cgi/inchi?Str2File=C1...,https://webbook.nist.gov/cgi/inchi?Str3File=C1...,...,,,,,,,,https://srdata.nist.gov/xps/SpectralByCompdDd/...,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
11399,C100027,"Phenol, 4-nitro-","Phenol, p-nitro-\np-Hydroxynitrobenzene\np-Nit...",C6H5NO3,139.1088,"InChI=1S/C6H5NO3/c8-6-3-1-5(2-4-6)7(9)10/h1-4,8H",BTJIUGUIPKRLHP-UHFFFAOYSA-N,100-02-7,https://webbook.nist.gov/cgi/inchi?Str2File=C1...,https://webbook.nist.gov/cgi/inchi?Str3File=C1...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
11418,C100094,"Benzoic acid, 4-methoxy-",p-Anisic acid\np-Methoxybenzoic acid\nDraconic...,C8H8O3,152.1473,InChI=1S/C8H8O3/c1-11-7-4-2-6(3-5-7)8(9)10/h2-...,ZEYHEAKUIGZSGI-UHFFFAOYSA-N,100-09-4,https://webbook.nist.gov/cgi/inchi?Str2File=C1...,https://webbook.nist.gov/cgi/inchi?Str3File=C1...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
11423,C100107,"Benzaldehyde, 4-(dimethylamino)-","Benzaldehyde, p-(dimethylamino)-\np-(Dimethyla...",C9H11NO,149.1897,InChI=1S/C9H11NO/c1-10(2)9-5-3-8(7-11)4-6-9/h3...,BGNGWHSBYQYVRX-UHFFFAOYSA-N,100-10-7,https://webbook.nist.gov/cgi/inchi?Str2File=C1...,https://webbook.nist.gov/cgi/inchi?Str3File=C1...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
11428,C100129,"Benzene, 1-ethyl-4-nitro-",p-Ethylnitrobenzene\np-Nitroethylbenzene\np-Ni...,C8H9NO2,151.1626,InChI=1S/C8H9NO2/c1-2-7-3-5-8(6-4-7)9(10)11/h3...,RESTWAHJFMZUIZ-UHFFFAOYSA-N,100-12-9,https://webbook.nist.gov/cgi/inchi?Str2File=C1...,https://webbook.nist.gov/cgi/inchi?Str3File=C1...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66269,C99934,"Acetophenone, 4'-hydroxy-","Ethanone, 1-(4-hydroxyphenyl)-\np-Hydroxyaceto...",C8H8O2,136.1479,"InChI=1S/C8H8O2/c1-6(9)7-2-4-8(10)5-3-7/h2-5,1...",TXFPEBPIARQUIG-UHFFFAOYSA-N,99-93-4,https://webbook.nist.gov/cgi/inchi?Str2File=C9...,https://webbook.nist.gov/cgi/inchi?Str3File=C9...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
66272,C99945,"Benzoic acid, 4-methyl-",p-Toluic acid\np-Methylbenzoic acid\np-Toluyli...,C8H8O2,136.1479,"InChI=1S/C8H8O2/c1-6-2-4-7(5-3-6)8(9)10/h2-5H,...",LPNBBFKOUUSUDB-UHFFFAOYSA-N,99-94-5,https://webbook.nist.gov/cgi/inchi?Str2File=C9...,https://webbook.nist.gov/cgi/inchi?Str3File=C9...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
66279,C99967,"Benzoic acid, 4-hydroxy-","Benzoic acid, p-hydroxy-\np-Hydroxybenzoic aci...",C7H6O3,138.1207,"InChI=1S/C7H6O3/c8-6-3-1-5(2-4-6)7(9)10/h1-4,8...",FJKROLUGYXJWQN-UHFFFAOYSA-N,99-96-7,https://webbook.nist.gov/cgi/inchi?Str2File=C9...,https://webbook.nist.gov/cgi/inchi?Str3File=C9...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
66284,C99978,"Benzenamine, N,N,4-trimethyl-","p-Toluidine, N,N-dimethyl-\np-Methyl-N,N-dimet...",C9H13N,135.2062,"InChI=1S/C9H13N/c1-8-4-6-9(7-5-8)10(2)3/h4-7H,...",GYVGXEWAOAAJEU-UHFFFAOYSA-N,99-97-8,https://webbook.nist.gov/cgi/inchi?Str2File=C9...,https://webbook.nist.gov/cgi/inchi?Str3File=C9...,...,,,,,,,,,,https://wtt-pro.nist.gov/wtt-pro/index.html?cm...
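
When many properties are required at once, chaining `~df[...].isna()` terms becomes hard to read. One possible alternative, sketched below on toy data, is to build the list of required columns from the short names and test them in a single call. The `has_all` helper is hypothetical, and the two-entry `ps` dictionary and the `toy` dataframe are stand-ins for the objects returned by `nist.get_search_parameters()` and `nist.get_all_data()`.

```python
import pandas as pd

def has_all(df, columns):
    """Boolean mask: True where every listed column is non-null (hypothetical helper)."""
    return df[columns].notna().all(axis=1)

# toy stand-ins; the real objects come from NistChemPy
ps = {'cMS': 'Mass spectrum (electron ionization)', 'cUV': 'UV/Visible spectrum'}
toy = pd.DataFrame({
    'ID': ['X1', 'X2', 'X3'],
    'inchi': ['InChI=1S/CH4/h1H4', None, 'InChI=1S/H2O/h1H2'],
    ps['cMS']: ['url', 'url', 'url'],
    ps['cUV']: ['url', 'url', None],
})

# required columns: a general property plus two WebBook data columns
wanted = ['inchi'] + [ps[code] for code in ('cMS', 'cUV')]
toy_sub = toy.loc[has_all(toy, wanted)]
print(toy_sub['ID'].tolist())
```

Only the row with every required field filled in survives the filter; adding another requirement is a matter of appending one more short name.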


One can also run a substructure search, e.g. to keep only the non-aromatic compounds:

In [5]:
from rdkit import Chem

# suppress RDKit warnings
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

# prepare molecules for search
mols = [(ID, Chem.MolFromInchi(inchi)) for ID, inchi in zip(sub.ID, sub.inchi)]
mols = [(ID, mol) for ID, mol in mols if mol]

# search
pat = Chem.MolFromSmarts('[a]')
hits = [ID for ID, mol in mols if not mol.HasSubstructMatch(pat)]
print(f'{len(hits)} of {len(sub)} compounds were selected')

338 of 1555 compounds were selected
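
The same pattern works for positive matches, i.e. keeping the compounds that do contain a given fragment. The sketch below is an illustration on toy input rather than on the NIST data: the three molecules are given as SMILES for brevity, and the SMARTS pattern for a carboxylic acid group is just one example of a query.

```python
from rdkit import Chem

# toy input: names and SMILES chosen for illustration only
smiles = {
    'acetic acid': 'CC(=O)O',
    'benzene': 'c1ccccc1',
    'benzoic acid': 'OC(=O)c1ccccc1',
}
mols = {name: Chem.MolFromSmiles(smi) for name, smi in smiles.items()}

# SMARTS for a carboxylic acid group: carbonyl carbon bonded to an -OH oxygen
pat = Chem.MolFromSmarts('C(=O)[OX2H1]')

# keep molecules that parsed successfully and contain the pattern
hits = [name for name, mol in mols.items()
        if mol is not None and mol.HasSubstructMatch(pat)]
print(hits)
```

With the real dataframe, the molecules would be built from the `inchi` column via `Chem.MolFromInchi`, as in the cell above.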


These compounds can then be retrieved via the `nist.get_compound` function.