# Label QM9 as Synthesizable or Not
QM9 is a well-known dataset of molecular energies spanning most molecules with between 1 and 9 heavy atoms.
Because these molecules are so small, we assume that any molecule that has not been reported in the literature and indexed in PubChem by now is not synthesizable.
This is not a perfect assumption, but it is one we can use to get a rough idea of the thermodynamic bounds of molecular synthesizability.

In [1]:
from emin.source import get_inchi_keys_from_pubchem
from rdkit import Chem, RDLogger
from pathlib import Path
from tqdm import tqdm
import pandas as pd
import requests
RDLogger.DisableLog('rdApp.*')

## Download QM9-G4MP2
We are going to use [a version of QM9 with energies computed at the high-accuracy, G4MP2 level](https://pubs.rsc.org/en/content/articlehtml/2019/sc/c9sc02834j) as a starting point.

In [2]:
%%time
qm9 = pd.read_json('https://github.com/globus-labs/g4mp2-atomization-energy/raw/master/data/output/g4mp2_data.json.gz', lines=True)
print(f'Loaded {len(qm9)} molecules')

Loaded 130258 molecules
CPU times: user 3.12 s, sys: 1.42 s, total: 4.54 s
Wall time: 6.42 s


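Remove duplicates, keeping the lowest-energy entry for each InChI

In [3]:
# Sort in place so that keep='first' retains the lowest-energy duplicate
qm9.sort_values('g4mp2_0k', ascending=True, inplace=True)
qm9.drop_duplicates('inchi_0', inplace=True, keep='first')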
print(f'Trimmed down to {len(qm9)} unique molecules')

Trimmed down to 126405 unique molecules
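
Sorting by energy before dropping duplicates with `keep='first'` appears intended to retain the lowest-energy record for each InChI. The same idea can be sketched in plain Python. The helper name `keep_lowest_energy` and the records below are made up for illustration and are not part of this notebook's pipeline.

```python
def keep_lowest_energy(records, key="inchi_0", energy="g4mp2_0k"):
    """Return one record per `key`, choosing the one with the lowest `energy`."""
    best = {}
    for record in records:
        current = best.get(record[key])
        # Keep the first record seen, or replace it if this one is lower in energy
        if current is None or record[energy] < current[energy]:
            best[record[key]] = record
    return list(best.values())


# Made-up records for illustration only (energies are not real G4MP2 values)
records = [
    {"inchi_0": "InChI=1S/CH4/h1H4", "g4mp2_0k": -40.1},
    {"inchi_0": "InChI=1S/CH4/h1H4", "g4mp2_0k": -40.4},
    {"inchi_0": "InChI=1S/H2O/h1H2", "g4mp2_0k": -76.3},
]
unique = keep_lowest_energy(records)
print(f"Trimmed {len(records)} records down to {len(unique)}")
```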


## Compute Composition
Get the chemical formula of each molecule, which we will use to look up the molecules with the same composition in PubChem.

In [4]:
def get_composition(inchi: str):
    """Get the chemical composition from an InChI string
    
    Args:
        inchi: InChI string
    Returns:
        Chemical formula
    """
    return inchi.split("/")[1]

In [5]:
qm9['formula'] = qm9.inchi_1.apply(get_composition)
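
As a quick check of what `get_composition` returns, the sketch below applies the same one-line split to a few standard InChI strings. The function is repeated so the block runs on its own. Note that isomers such as ethanol and dimethyl ether share a formula, which is why the formula only narrows the search and a per-molecule identifier is still needed afterwards.

```python
def get_composition(inchi: str) -> str:
    """Return the formula layer, which is the second '/'-separated field of an InChI."""
    return inchi.split("/")[1]


examples = {
    "methane": "InChI=1S/CH4/h1H4",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "dimethyl ether": "InChI=1S/C2H6O/c1-3-2/h1-2H3",
}
formulas = {name: get_composition(inchi) for name, inchi in examples.items()}
for name, formula in formulas.items():
    print(f"{name}: {formula}")
```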

## Find in PubChem
PubChem has a [fantastic API](https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest), which we can use to find which molecules in QM9 are also in PubChem.

Start by getting an InChI Key, which we can use to detect whether molecules are held in PubChem

In [6]:
%%time
qm9['inchi_key'] = qm9['inchi_1'].apply(Chem.MolFromInchi).apply(lambda x: Chem.MolToInchiKey(x) if x is not None else x)

CPU times: user 30.9 s, sys: 3.02 s, total: 34 s
Wall time: 34 s
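
An InChIKey is a fixed-length, 27-character hash of the InChI: a 14-letter block for the connectivity, a 10-letter block for the remaining layers plus flag characters, and a final protonation character. Because the cell above leaves `None` wherever RDKit could not parse the InChI, it can be useful to check the keys before comparing them. The sketch below is one way to do that; `is_valid_inchi_key` and `connectivity_block` are hypothetical helper names, not RDKit functions.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in total)
INCHI_KEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def is_valid_inchi_key(key) -> bool:
    """Check that `key` is a string with the InChIKey layout (None gives False)."""
    return isinstance(key, str) and INCHI_KEY_PATTERN.match(key) is not None


def connectivity_block(key: str) -> str:
    """Return the first block, which hashes the molecular skeleton."""
    return key.split("-")[0]


methane_key = "VNWKTOKETHGBQD-UHFFFAOYSA-N"
print(is_valid_inchi_key(methane_key), connectivity_block(methane_key))
print(is_valid_inchi_key(None))
```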


Label whether every entry is in PubChem

In [7]:
qm9['in_pubchem'] = None

In [9]:
for formula, group in tqdm(qm9.groupby('formula')):
    if not all(x is not None for x in group['in_pubchem']):  # Allows restarting if the cell fails
        known_inchi_keys = get_inchi_keys_from_pubchem(formula)
        in_pubchem = group['inchi_key'].apply(known_inchi_keys.__contains__)
        qm9.loc[in_pubchem.index, 'in_pubchem'] = in_pubchem

100%|██████████| 702/702 [07:27<00:00,  1.57it/s]
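
The implementation of `get_inchi_keys_from_pubchem` lives in the `emin` package and is not shown here. As a rough sketch of how such a lookup could work, the block below builds a PUG REST request that asks for the `InChIKey` property of every compound matching a formula, then collects the keys into a set. The `fastformula` URL pattern and the helper names (`build_formula_url`, `parse_inchi_keys`, `fetch_inchi_keys`) are assumptions for illustration, not the `emin` code. The sketch also ignores rate limits, timeouts on very large result sets, and the error PubChem returns when nothing matches.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_formula_url(formula: str) -> str:
    """Build a formula-search URL that requests the InChIKey property (assumed pattern)."""
    return f"{PUG_REST}/compound/fastformula/{quote(formula)}/property/InChIKey/JSON"


def parse_inchi_keys(payload: dict) -> set:
    """Collect the InChIKeys from a PUG REST property-table response."""
    properties = payload.get("PropertyTable", {}).get("Properties", [])
    return {entry["InChIKey"] for entry in properties if "InChIKey" in entry}


def fetch_inchi_keys(formula: str, timeout: float = 60.0) -> set:
    """Query PubChem over the network; not called in the demo below."""
    with urlopen(build_formula_url(formula), timeout=timeout) as response:
        return parse_inchi_keys(json.load(response))


# Offline demo with a hand-written payload in the expected response shape
sample = {"PropertyTable": {"Properties": [
    {"CID": 297, "InChIKey": "VNWKTOKETHGBQD-UHFFFAOYSA-N"},
]}}
known = parse_inchi_keys(sample)
print("VNWKTOKETHGBQD-UHFFFAOYSA-N" in known)
```

Returning a set keeps the membership test used in the labeling loop (`known_inchi_keys.__contains__`) cheap even when a formula matches many compounds.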


In [10]:
frac = qm9['in_pubchem'].mean()
print(f'{frac*100:.1f}% of QM9 is in PubChem')

16.9% of QM9 is in PubChem


Save the content to disk

In [11]:
data_dir = Path('data')
data_dir.mkdir(exist_ok=True)
qm9.to_json(data_dir / 'qm9.json.gz', lines=True, index=False, orient='records')
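
The file written above is gzip-compressed JSON lines, with one record per line. It can be loaded again with `pd.read_json(..., lines=True)`, as in the download step, or with the standard library alone. The sketch below shows the second route on a tiny made-up file; `read_json_lines` is a hypothetical helper and the two records are illustrative, not real rows of the dataset.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_json_lines(path):
    """Read a gzip-compressed JSON-lines file into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.gz"
    # Write two made-up records in the same one-record-per-line layout
    with gzip.open(demo_path, "wt", encoding="utf-8") as fp:
        for record in [
            {"formula": "CH4", "in_pubchem": True},
            {"formula": "C2H6O", "in_pubchem": False},
        ]:
            fp.write(json.dumps(record) + "\n")
    loaded = read_json_lines(demo_path)

print(f"Loaded {len(loaded)} records")
```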

Save the SMILES strings, one per line (the file is plain text despite its `.sdf` extension)

In [12]:
with (data_dir / 'qm9.sdf').open('w') as fp:
    for smiles in qm9['smiles_0']:
        print(smiles, file=fp)