# NistChemPy Tutorial

## Compound and Spectrum

To retrieve a NIST compound, initialize a `Compound` object with its NIST ID. The main properties, including the name, chemical formula, InChI, and links to physico-chemical data, are parsed on initialization:

In [1]:
import nistchempy as nist
X = nist.Compound('C85018')
X.__dict__

{'ID': 'C85018',
 'name': 'Phenanthrene',
 'synonyms': ['Phenanthren', 'Phenanthrin', 'Phenantrin'],
 'formula': 'C14 H10',
 'mol_weight': 178.2292,
 'inchi': 'InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H',
 'inchi_key': 'YNPNZTXNASCQKK-UHFFFAOYSA-N',
 'cas_rn': '85-01-8',
 'IR': [],
 'TZ': [],
 'MS': [],
 'UV': [],
 'mol2D': None,
 'mol3D': None,
 'data_refs': {'mol2D': 'https://webbook.nist.gov/cgi/cbook.cgi?Str2File=C85018',
  'mol3D': 'https://webbook.nist.gov/cgi/cbook.cgi?Str3File=C85018',
  'cTG': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C85018&Units=SI&Mask=1#Thermo-Gas'],
  'cTC': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C85018&Units=SI&Mask=2#Thermo-Condensed'],
  'cTP': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C85018&Units=SI&Mask=4#Thermo-Phase'],
  'cTR': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C85018&Units=SI&Mask=8#Thermo-React'],
  'cSO': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C85018&Units=SI&Mask=10#Solubility'],
  'cIE': ['https:/

Abbreviations of the available data types can be viewed using the `print_search_parameters` function:

In [2]:
nist.print_search_parameters()

Units      :   Units for thermodynamic data, "SI" or "CAL" for calorie-based
MatchIso   :   Exactly match the specified isotopes (formula search only)
AllowOther :   Allow elements not specified in formula (formula search only)
AllowExtra :   Allow more atoms of elements in formula than specified (formula search only)
NoIon      :   Exclude ions from the search (formula search only)
cTG        :   Contains gas-phase thermodynamic data
cTC        :   Contains condensed-phase thermodynamic data
cTP        :   Contains phase-change thermodynamic data
cTR        :   Contains reaction thermodynamic data
cIE        :   Contains ion energetics thermodynamic data
cIC        :   Contains ion cluster thermodynamic data
cIR        :   Contains IR data
cTZ        :   Contains THz IR data
cMS        :   Contains MS data
cUV        :   Contains UV/Vis data
cGC        :   Contains gas chromatography data
cES        :   Contains vibrational and electronic energy levels
cDI        :   Contains constant

MOL files with 2D and 3D coordinates and spectroscopic data are not loaded on initialization, because each of these properties requires an additional request. They can be downloaded later:

In [3]:
X.get_3D() # X.get_2D() for 2D coordinates
print(X.mol3D[:1000])


  NIST    07011517253D 1   1.00000  -539.53865
Copyright by the U.S. Sec. Commerce on behalf of U.S.A. All rights reserved.
 24 26  0  0  0  0  0  0  0  0999 V2000
    4.2671    4.2111    6.0319 C    0  0  0  0  0  0  0  0  0  0  0  0
    3.4011    3.3615    5.3683 C    0  0  0  0  0  0  0  0  0  0  0  0
    3.4337    3.2256    3.9602 C    0  0  0  0  0  0  0  0  0  0  0  0
    5.2115    4.9687    5.3136 C    0  0  0  0  0  0  0  0  0  0  0  0
    5.2684    4.8584    3.9386 C    0  0  0  0  0  0  0  0  0  0  0  0
    4.3927    3.9962    3.2378 C    0  0  0  0  0  0  0  0  0  0  0  0
    4.4609    3.8894    1.8079 C    0  0  0  0  0  0  0  0  0  0  0  0
    2.5375    2.3405    3.2259 C    0  0  0  0  0  0  0  0  0  0  0  0
    2.6439    2.2686    1.8051 C    0  0  0  0  0  0  0  0  0  0  0  0
    1.5565    1.5396    3.8570 C    0  0  0  0  0  0  0  0  0  0  0  0
    3.6253    3.0639    1.1234 C    0  0  0  0  0  0  0  0  0  0  0  0
    1.7801    1.4135    1.0811 C    0  
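
The downloaded text is a V2000 MOL block, so the atom and bond counts can be read from its fourth line. The following is a minimal sketch, assuming the block starts with an empty title line as in the output above; `parse_counts_line` and `parse_atom_line` are hypothetical helpers, not part of NistChemPy, and the sample string repeats only the first lines of that output:

```python
def parse_counts_line(molblock):
    """Return (n_atoms, n_bonds) from the counts line of a V2000 MOL block."""
    lines = molblock.splitlines()
    counts = lines[3]  # three header lines precede the counts line
    if 'V2000' not in counts:
        raise ValueError('expected a V2000 counts line')
    # Fixed-width fields: characters 0-2 hold atoms, 3-5 hold bonds
    return int(counts[0:3]), int(counts[3:6])


def parse_atom_line(line):
    """Return (symbol, x, y, z) from one line of the atom block."""
    x, y, z, symbol = line.split()[:4]
    return symbol, float(x), float(y), float(z)


# First lines of the phenanthrene MOL block shown above
sample_mol = (
    "\n"
    "  NIST    07011517253D 1   1.00000  -539.53865\n"
    "Copyright by the U.S. Sec. Commerce on behalf of U.S.A. All rights reserved.\n"
    " 24 26  0  0  0  0  0  0  0  0999 V2000\n"
    "    4.2671    4.2111    6.0319 C    0  0  0  0  0  0  0  0  0  0  0  0\n"
)

n_atoms, n_bonds = parse_counts_line(sample_mol)
first_atom = parse_atom_line(sample_mol.splitlines()[4])
print(n_atoms, n_bonds, first_atom)
```

For real work a cheminformatics toolkit is a better choice than hand parsing; the sketch only illustrates what the string contains.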


In [4]:
X.get_spectra('IR')
X.IR

[Spectrum(C85018, IR spectrum #0),
 Spectrum(C85018, IR spectrum #1),
 Spectrum(C85018, IR spectrum #2),
 Spectrum(C85018, IR spectrum #3),
 Spectrum(C85018, IR spectrum #4)]

The spectra are stored as a list of `Spectrum` objects, each holding the text of a JCAMP-DX file:

In [5]:
spec = X.IR[2]
print(spec.compound, spec.spec_type, spec.spec_idx)
print('='*20)
print(spec.jdx_text[:1000])

Compound(C85018) IR 2
##TITLE=PHENANTHRENE
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##CLASS=COBLENTZ
##ORIGIN=CENTRE D'ETUDES NUCLEAIRES DE GRENOBLE
##OWNER=COBLENTZ SOCIETY
Collection (C) 2018 copyright by the U.S. Secretary of Commerce
on behalf of the United States of America. All rights reserved.
##DATE=Not specified, most likely prior to 1970
##CAS REGISTRY NO=85-01-8
##MOLFORM=C14 H10
##SOURCE REFERENCE=COBLENTZ NO. 4253
##$NIST SOURCE=COBLENTZ
##$NIST IMAGE=cob4253
##SPECTROMETER/DATA SYSTEM=Not specified, most likely a prism, grating, or hybrid spectrometer.
##STATE=SOLUTION (SATURATED IN HEPTANE)
##PATH LENGTH=0.05 CM
$$PURITY 99.99%
##SAMPLING PROCEDURE=TRANSMISSION
##RESOLUTION=4
##DATA PROCESSING=DIGITIZED BY NIST FROM HARD COPY
##XUNITS=MICROMETERS
##YUNITS=TRANSMITTANCE
##XFACTOR=1.000000
##YFACTOR=1
##DELTAX=000.011124
##FIRSTX=14.665
##LASTX=35.1221
##FIRSTY=0.843
##MAXX=35.1221
##MINX=14.665
##MAXY=0.93
##MINY=0.358
##NPOINTS=1840
##XYDATA=(X++(Y..Y))
14.665000 0.
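
Since `jdx_text` is plain text, the header labels can be collected into a dictionary before touching the data block. The sketch below is one simple way to do it, assuming labels start with `##`, comments start with `$$`, and unlabelled lines continue the previous value; `parse_jdx_header` and `micrometers_to_wavenumbers` are hypothetical helpers, not part of NistChemPy, and the sample repeats a few lines of the output above:

```python
def parse_jdx_header(jdx_text):
    """Collect '##LABEL=value' records that precede the XYDATA block."""
    header = {}
    last_key = None
    for line in jdx_text.splitlines():
        if line.startswith('$$'):
            continue  # JCAMP-DX comment line
        if line.startswith('##'):
            key, _, value = line[2:].partition('=')
            key = key.strip()
            if key == 'XYDATA':
                break  # tabulated data starts here
            header[key] = value.strip()
            last_key = key
        elif last_key is not None and line.strip():
            # Unlabelled line: treat it as a continuation of the previous value
            header[last_key] += ' ' + line.strip()
    return header


def micrometers_to_wavenumbers(x_um):
    """Convert a wavelength in micrometers to a wavenumber in cm^-1."""
    return 10000.0 / x_um


sample_jdx = "\n".join([
    "##TITLE=PHENANTHRENE",
    "##OWNER=COBLENTZ SOCIETY",
    "Collection (C) 2018 copyright by the U.S. Secretary of Commerce",
    "on behalf of the United States of America. All rights reserved.",
    "$$PURITY 99.99%",
    "##XUNITS=MICROMETERS",
    "##YUNITS=TRANSMITTANCE",
    "##FIRSTX=14.665",
    "##NPOINTS=1840",
    "##XYDATA=(X++(Y..Y))",
])

header = parse_jdx_header(sample_jdx)
print(header['XUNITS'], int(header['NPOINTS']))
print(micrometers_to_wavenumbers(float(header['FIRSTX'])))
```

Checking `XUNITS` first matters here because this particular spectrum is tabulated in micrometers rather than wavenumbers.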

## Search

There are four available search types: by [name](https://webbook.nist.gov/chemistry/name-ser/), [InChI](https://webbook.nist.gov/chemistry/inchi-ser/), [CAS RN](https://webbook.nist.gov/chemistry/cas-ser/), and [chemical formula](https://webbook.nist.gov/chemistry/form-ser/). In addition to the main identifier, you can limit the search using several parameters, which can be viewed using the `print_search_parameters` function:

In [6]:
nist.print_search_parameters()

Units      :   Units for thermodynamic data, "SI" or "CAL" for calorie-based
MatchIso   :   Exactly match the specified isotopes (formula search only)
AllowOther :   Allow elements not specified in formula (formula search only)
AllowExtra :   Allow more atoms of elements in formula than specified (formula search only)
NoIon      :   Exclude ions from the search (formula search only)
cTG        :   Contains gas-phase thermodynamic data
cTC        :   Contains condensed-phase thermodynamic data
cTP        :   Contains phase-change thermodynamic data
cTR        :   Contains reaction thermodynamic data
cIE        :   Contains ion energetics thermodynamic data
cIC        :   Contains ion cluster thermodynamic data
cIR        :   Contains IR data
cTZ        :   Contains THz IR data
cMS        :   Contains MS data
cUV        :   Contains UV/Vis data
cGC        :   Contains gas chromatography data
cES        :   Contains vibrational and electronic energy levels
cDI        :   Contains constant

These options can be specified when initializing the `Search` object or later in the `find_compounds` method as `**kwargs`:

In [7]:
search = nist.Search(NoIon = True, cMS = True)
search.parameters

SearchParameters(Units=SI, NoIon=True, cMS=True)

After setting the parameters, you can start searching for compounds. Let's start with the name search. The `Search` object has four properties, which are updated after each run of the `find_compounds` method:
* `success`: was the search successful?
* `lost`: did the search exceed the limit of 400 compounds, so that some matches were lost?
* `IDs`: NIST IDs of the found compounds (`Compound` objects are not initialized at this point, to avoid spending time on internet requests);
* `compounds`: list of `Compound` objects, which stays empty until the compounds are loaded.

In [8]:
search.find_compounds(identifier = '1,2,3*-butane', search_type = 'name')
print(search)
print(search.success, search.lost, search.IDs, search.compounds)

Search(Success=True, Lost=False, Found=4)
True False ['C1871585', 'C298180', 'C1529686', 'C1464535'] []


After the search has finished, you can initialize the `Compound` objects:

In [9]:
search.load_found_compounds()
print(search.compounds)
print(search.compounds[0].name)
print(search.compounds[0].synonyms)

[Compound(C1871585), Compound(C298180), Compound(C1529686), Compound(C1464535)]
Propane, 1,2,3-trichloro-2-methyl-
['1,2,3-Trichloro-2-methylpropane', '1,2,3-Trichloroisobutane']


Searches by CAS registry number and InChI ignore some search parameters. Take AgCl as an example: the search below keeps `cMS = True`, yet AgCl is returned even though no MS data are available for it:

In [10]:
search.find_compounds('7783-90-6', 'cas')
search.parameters

SearchParameters(Units=SI, NoIon=True, cMS=True)

In [11]:
search.load_found_compounds()
X = search.compounds[0]
X.data_refs

{'mol2D': 'https://webbook.nist.gov/cgi/cbook.cgi?Str2File=C7783906',
 'mol3D': 'https://webbook.nist.gov/cgi/cbook.cgi?Str3File=C7783906',
 'cTC': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C7783906&Units=SI&Mask=2#Thermo-Condensed'],
 'cTP': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C7783906&Units=SI&Mask=4#Thermo-Phase'],
 'cTR': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C7783906&Units=SI&Mask=8#Thermo-React'],
 'cIE': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C7783906&Units=SI&Mask=20#Ion-Energetics'],
 'cDI': ['https://webbook.nist.gov/cgi/cbook.cgi?ID=C7783906&Units=SI&Mask=1000#Diatomic']}
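
Because these searches may ignore filters such as `cMS`, it can be worth checking the loaded compound's `data_refs` yourself. A minimal sketch follows; `missing_data_types` is a hypothetical helper, not part of NistChemPy, and the dictionary reproduces only the keys of the AgCl output above, with the URLs left out for brevity:

```python
def missing_data_types(data_refs, required):
    """Return the requested data-type keys that have no links in data_refs."""
    return [key for key in required if not data_refs.get(key)]


# Keys copied from the AgCl output above; values are placeholders
agcl_refs = {
    'mol2D': 'placeholder',
    'mol3D': 'placeholder',
    'cTC': ['placeholder'],
    'cTP': ['placeholder'],
    'cTR': ['placeholder'],
    'cIE': ['placeholder'],
    'cDI': ['placeholder'],
}

print(missing_data_types(agcl_refs, ['cMS', 'cTC']))
```

With a loaded compound, the same check would be `missing_data_types(X.data_refs, ['cMS'])`; an empty list means every requested data type is present.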

Search by chemical formula is the most powerful way of retrieving data. The only problem is that the number of found entries may exceed the limit of 400 compounds. To check whether this happened, inspect the `lost` property:

In [12]:
search = nist.Search(NoIon = True, cMS = True)
search.find_compounds('C6H*O?', 'formula')
search

Search(Success=True, Lost=True, Found=400)

To overcome this when searching for a large number of substances, try splitting the chemical formula into subsets:

In [13]:
overflows = []
for i in range(1, 7):
    search.find_compounds(f'C6H?O{i}', 'formula')
    overflows.append( (len(search.IDs), search.lost) )
overflows

[(170, False), (178, False), (80, False), (42, False), (7, False), (24, False)]

This strategy can be used to combine search results and use the found identifiers to collect spectroscopic data.
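
One way to combine the results of such subset queries is sketched below. It merges the IDs in order, drops duplicates, and records which queries still hit the limit. `collect_ids` and the `run_query` callable are hypothetical names, not part of NistChemPy, and the demo uses made-up IDs instead of real requests:

```python
def collect_ids(run_query, formulas):
    """Merge IDs from several queries; report the queries that overflowed.

    run_query(formula) must return a pair (ids, lost).
    """
    found, overflowed = [], []
    seen = set()
    for formula in formulas:
        ids, lost = run_query(formula)
        if lost:
            overflowed.append(formula)  # this subset needs further splitting
        for nist_id in ids:
            if nist_id not in seen:
                seen.add(nist_id)
                found.append(nist_id)
    return found, overflowed


# A possible adapter around the Search object used in this tutorial:
# def run_query(formula):
#     search.find_compounds(formula, 'formula')
#     return list(search.IDs), search.lost

# Offline demo with made-up results
fake_results = {
    'C6H?O1': (['ID_A', 'ID_B'], False),
    'C6H?O2': (['ID_B', 'ID_C'], True),
}
found, overflowed = collect_ids(fake_results.__getitem__, fake_results)
print(found, overflowed)
```

Passing the query as a callable keeps the merging logic testable without network access; any formula reported in `overflowed` is a candidate for a narrower pattern.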

## Extracted data on NIST compounds

The limit of 400 substances per search and the impossibility of creating an external API for substructure search make the search process inconvenient. To overcome this problem, we extracted all NIST Chemistry WebBook compounds using the [sitemap](https://webbook.nist.gov/sitemap_index.xml) and organized the data as a pandas data frame. It consists of 23 columns:
* columns **1–7** contain the compound description:

In [14]:
df = nist.get_all_data()
df.loc[:, df.columns[:7]]

| | ID | name | formula | mol_weight | inchi | inchi_key | cas_rn |
|---|---|---|---|---|---|---|---|
| 0 | B100 | iron oxide anion | FeO- | 71.8450 | | | |
| 1 | B1000 | AsF3..Cl anion | AsClF3- | 167.3700 | | | |
| 2 | B1000000 | AgH2- | AgH2- | 109.8846 | | | |
| 3 | B1000001 | HAg(H2) | AgH3 | 110.8920 | | | |
| 4 | B1000002 | AgNO+ | AgNO+ | 137.8738 | | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 129000 | U99777 | Methyl 3-hydroxycholest-5-en-26-oate, TMS deri... | C31 H54 O3 Si | 502.8442 | InChI=1S/C31H54O3Si/c1-21(10-9-11-22(2)29(32)3... | DNXGNXYNSBCWGX-QBUYVTDMSA-N | |
| 129001 | U99830 | 2-Methyl-3-oxovaleric acid, O,O'-bis(trimethyl... | C12 H26 O3 Si2 | 274.5040 | InChI=1S/C12H26O3Si2/c1-9-11(14-16(3,4)5)10(2)... | LXAIQDVPXKOIGO-KHPPLWFESA-N | |
| 129002 | U99942 | 3-Hydroxy-3-(4'-hydroxy-3'-methoxyphenyl)propi... | C19 H36 O5 Si3 | 428.7426 | InChI=1S/C19H36O5Si3/c1-21-18-13-15(11-12-16(1... | QCMUGKOFXVYNCF-UHFFFAOYSA-N | |
| 129003 | U99947 | 2-Propylpentanoic acid, 2,3,4,6-tetra(trimethy... | C26 H58 O7 Si4 | 595.0765 | InChI=1S/C26H58O7Si4/c1-15-17-20(18-16-2)25(27... | OVXMRISJDUWFKB-UHFFFAOYSA-N | |


* columns **8–23** correspond to the available compound properties, including atomic coordinates, spectra, and thermodynamic data (for the full description see the `print_search_parameters` function):

In [15]:
df.loc[:, df.columns[7:]]

| | mol2D | mol3D | cIR | cTZ | cMS | cUV | cGC | cTG | cTC | cTP | cSO | cTR | cIE | cIC | cES | cDI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | True | False | False | False | True | True | False | False | False |
| 1 | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129000 | True | False | False | False | True | False | True | False | False | False | False | False | False | False | False | False |
| 129001 | True | False | False | False | True | False | True | False | False | False | False | False | False | False | False | False |
| 129002 | True | False | False | False | True | False | True | False | False | False | False | False | False | False | False | False |
| 129003 | True | False | False | False | True | False | True | False | False | False | False | False | False | False | False | False |


These data make it easy to get the full list of compounds with the desired properties, and cheminformatics libraries can then be used to filter substances by structure:

In [16]:
IDs = df.ID[~df.inchi.isna() & df.cMS & df.cUV]
compounds = [nist.Compound(ID) for ID in IDs[:5]]
compounds

[Compound(C100016),
 Compound(C100027),
 Compound(C100094),
 Compound(C100107),
 Compound(C100129)]
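
A real structure filter needs a cheminformatics toolkit. As a dependency-free stand-in, the sketch below filters by elemental composition read from the formula layer of the `inchi` column. `inchi_formula`, `element_counts`, and `contains_element` are hypothetical helpers, not part of NistChemPy, and the parser assumes single-component InChI strings (no `.` separators or leading multipliers in the formula layer). The second sample is the truncated string shown in the table above, which is enough because only the formula layer is read:

```python
import re


def inchi_formula(inchi):
    """Return the formula layer, i.e. the text between the first two slashes."""
    parts = inchi.split('/')
    if len(parts) < 2 or not parts[0].startswith('InChI='):
        raise ValueError(f'not an InChI string: {inchi!r}')
    return parts[1]


def element_counts(formula):
    """Count atoms per element in a simple Hill-style formula such as C14H10."""
    counts = {}
    for symbol, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


def contains_element(inchi, symbol):
    """Check whether the compound described by the InChI contains the element."""
    return symbol in element_counts(inchi_formula(inchi))


phenanthrene = 'InChI=1S/C14H10/c1-3-7-13-11(5-1)9-10-12-6-2-4-8-14(12)13/h1-10H'
silylated = 'InChI=1S/C12H26O3Si2/c1-9-11(14-16(3,4)5)10(2)...'  # truncated in the table

print(element_counts(inchi_formula(phenanthrene)))
print(contains_element(phenanthrene, 'Si'), contains_element(silylated, 'Si'))
```

On the data frame, a filter of this kind could be applied as `df.inchi.dropna().map(lambda s: contains_element(s, 'Si'))`, assuming every non-missing value is a valid single-component InChI.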