# QM9 Molecule Dataset [1, 2]

#### Format of the XYZ-like files [1]

- 1$^{st}$ line: number of atoms ($n_a$) that compose the molecule


- 2$^{nd}$ line: scalar properties, as defined by the following sequence

| index number | observable | units | description |
| :- | :- | :- | :- |
| 1 | tag | | gdb9 string |
| 2 | identifier | | Unique integer identifier of molecule |
| 3 | A | GHz | Rotational constant |
| 4 | B | GHz | Rotational constant |
| 5 | C | GHz | Rotational constant |
| 6 | μ | Debye | Dipole moment magnitude |
| 7 | α | $a^3_0$ | Isotropic polarizability |
| 8 | ϵHOMO | Hartree | Energy of HOMO |
| 9 | ϵLUMO | Hartree | Energy of LUMO |
| 10 | ϵgap | Hartree | Gap (ϵLUMO − ϵHOMO) |
| 11 | 〈R2〉 | $a^2_0$ | Electronic spatial extent |
| 12 | zpve | Hartree | Zero point vibrational energy |
| 13 | U(0K) | Hartree | Internal energy at 0.0 K |
| 14 | U(298K) | Hartree | Internal energy at 298.15 K |
| 15 | H(298K) | Hartree | Enthalpy at 298.15 K |
| 16 | G(298K) | Hartree | Free energy at 298.15 K |
| 17 | Cv(298K) | $\frac{\mathrm{cal}}{\mathrm{mol\,K}}$ | Heat capacity at 298.15 K |

- 3$^{rd}$ to ($n_a$ + 2)$^{th}$ lines: per-atom properties (one atom per line), given in the following column order

| description | units |
| :- | :- |
| Element abbreviation (e.g. C) |   |
| Coordinate x | Å | 
| Coordinate y | Å |
| Coordinate z | Å |
| Mulliken partial atomic charges | e |

- ($n_a$ + 3)$^{th}$ line: harmonic vibrational frequencies (cm$^{-1}$), where the total number of frequencies is
    - linear molecules: $3 n_a - 5$
    - non-linear molecules: $3 n_a - 6$
    
    
- ($n_a$ + 4)$^{th}$ line: SMILES strings from GDB-17 and from B3LYP relaxation


- ($n_a$ + 5)$^{th}$ line: InChI strings for Corina and B3LYP geometries
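
The layout above is one-based, whereas Python's `enumerate` counts lines from zero. As a rough sketch of that translation (the helper names `qm9_line_roles` and `expected_n_frequencies` are hypothetical and not part of the dataset or its tooling):

```python
def qm9_line_roles(n_atoms: int) -> dict:
    """Map each block of a QM9 XYZ-like file to zero-based line indices."""
    return {
        'n_atoms': 0,                    # 1st line
        'scalars': 1,                    # 2nd line
        'atoms': range(2, n_atoms + 2),  # 3rd to (n_a + 2)th lines
        'frequencies': n_atoms + 2,      # (n_a + 3)th line
        'smiles': n_atoms + 3,           # (n_a + 4)th line
        'inchi': n_atoms + 4,            # (n_a + 5)th line
    }


def expected_n_frequencies(n_atoms: int, linear: bool) -> int:
    """Number of harmonic modes: 3n - 5 (linear) or 3n - 6 (non-linear)."""
    return 3 * n_atoms - 5 if linear else 3 * n_atoms - 6


roles = qm9_line_roles(n_atoms=5)                 # a 5-atom molecule
print(list(roles['atoms']))                       # [2, 3, 4, 5, 6]
print(roles['frequencies'])                       # 7
print(expected_n_frequencies(5, linear=False))    # 9
print(expected_n_frequencies(3, linear=True))     # 4
```

Comparing the number of frequencies actually read against these two counts is one way to infer whether a molecule is linear.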

---
#### Sources:
1. Ramakrishnan, R., Dral, P., Rupp, M. et al. Quantum chemistry structures and properties of 134 kilo molecules. Sci Data 1, 140022 (2014). https://doi.org/10.1038/sdata.2014.22
2. http://quantum-machine.org/datasets

In [2]:
import csv
import sys

from pathlib import Path

Specify the following directories:
1. the path to the original QM9 XYZ-like files (i.e. they have the .xyz file extension, even though they are not true XYZ files)
2. the path to where you want to save the cleaned up files

---

#### Note
Below, for testing, I have 10 QM9 XYZ-like files in a subdirectory called `structures_tests`
 (e.g. `cp structures/dsgdb9nsd_00001*.xyz structures_tests/`).
 
I am saving the cleaned-up files to `structures_clean`.

For the full downloaded dataset, I untarred the file (`dsgdb9nsd.xyz.tar`) in the folder `structures`. Be aware that this produces a large number of files.
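
The test subset can also be created from Python instead of the shell. The following is only a sketch: `copy_subset` is a hypothetical helper, and the directory names simply mirror the ones mentioned above.

```python
import shutil
from pathlib import Path


def copy_subset(source_dir: Path, target_dir: Path, pattern: str) -> int:
    """Copy files in source_dir that match pattern into target_dir; return the count."""
    target_dir.mkdir(parents=True, exist_ok=True)
    n_copied = 0
    for file in sorted(source_dir.glob(pattern)):
        shutil.copy(file, target_dir / file.name)
        n_copied += 1
    return n_copied


# Only runs when the untarred dataset is present in the working directory.
if Path('structures').is_dir():
    n = copy_subset(Path('structures'), Path('structures_tests'), 'dsgdb9nsd_00001*.xyz')
    print(f'Copied {n} files.')
```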

---

In [3]:
path_qm9_original = Path('structures')
path_qm9_clean = Path('structures_clean')

In [4]:
def replace_hidden_char(input_file: str=None, source_dir: str=None, output_dir: str=None):
    ''' Replaces the following hidden characters in a file:
            a. tabs (i.e. `\t`)
            b. double white spaces (i.e. `  `), and
            c. end-of-line white spaces, while preserving the newline

        The cleaned file is saved in a specified directory.

        Input
            input_file: file name (including its path)
            source_dir: directory where the input_file is located
            output_dir: directory where the output file will be written
        Output
            a new file with the same prefix name, but whose extension is changed
                to 'txt' and then saved to the output_dir
    '''

    if not isinstance(input_file, str):
        sys.exit(f'Error: {input_file} is not a string.')
    elif not isinstance(source_dir, str):
        sys.exit(f'Error: {source_dir} is not a string.')
    elif not isinstance(output_dir, str):
        sys.exit(f'Error: {output_dir} is not a string.')
    else:
        ## build the output path from the file name only, so that an 'xyz'
        ## elsewhere in the path is never rewritten; create output_dir if needed
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        output = Path(output_dir) / Path(input_file).with_suffix('.txt').name

        with open(input_file) as file_in, open(output, 'w') as file_out:
            for line in file_in:
                file_out.write(line.replace('\t', ' ').replace('  ', ' ').replace(' \n', '\n'))
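
Note that a single `.replace('  ', ' ')` pass halves runs of spaces rather than collapsing them fully, so three consecutive spaces become two. This does not affect the later `line.split()` calls, but if fully normalized files are wanted, a regular expression is one alternative. The sketch below uses a hypothetical helper, `clean_line`, and a made-up sample line.

```python
import re


def clean_line(line: str) -> str:
    """Collapse runs of spaces/tabs to one space and drop end-of-line blanks."""
    has_newline = line.endswith('\n')
    collapsed = re.sub(r'[ \t]+', ' ', line.rstrip('\r\n')).rstrip(' ')
    return collapsed + ('\n' if has_newline else '')


raw = 'gdb 1\t157.7118   157.70997 \t\n'   # made-up sample, not taken from the dataset

# Single-pass chain, as in replace_hidden_char:
print(repr(raw.replace('\t', ' ').replace('  ', ' ').replace(' \n', '\n')))
# 'gdb 1 157.7118  157.70997\n'   (two spaces survive)

# Regex-based variant:
print(repr(clean_line(raw)))
# 'gdb 1 157.7118 157.70997\n'
```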

In [5]:
files = list(path_qm9_original.glob('*.xyz'))

for file in files:
    replace_hidden_char(input_file=str(file),
                        source_dir=str(path_qm9_original),
                        output_dir=str(path_qm9_clean))

In [6]:
def process_qm9_file(input_file: str=None, source_dir: str=None):
    ''' Process a single XYZ-like file from the QM9 dataset.

        Input
            input_file: file name
            source_dir: directory for where the input_file is located
        Return
            n_atoms (int): number of atoms in molecule
            scalar_dict (dictionary): scalar properties
            coords (list): element, x coordinate, y coordinate, z coordinate, Mulliken partial atomic charge
            freq (list): vibrational frequencies
            smiles (list): SMILES strings
            inchi (list): InChI strings
    '''

    n_atoms = None
    scalar_values = None
    coords = []
    freq = None
    smiles = None
    inchi = None

    if not isinstance(input_file, str):
        sys.exit(f'Error: {input_file} is not a string.')
    elif not isinstance(source_dir, str):
        sys.exit(f'Error: {source_dir} is not a string.')
    else:
        with open(input_file) as file:
            for line_number, line in enumerate(file):
                if line_number == 0:
                    n_atoms = int(line)
                elif line_number == 1:
                    scalar_values = line.split()
                elif 2 <= line_number <= n_atoms + 1:   ## coordinates (one atom per line)
                    coords.append(line.split())
                elif line_number == n_atoms + 2:
                    freq = line.split()
                elif line_number == n_atoms + 3:
                    smiles = line.split()
                elif line_number == n_atoms + 4:
                    inchi = line.split()

    if len(coords) != n_atoms:
        sys.exit(f'Error: the n_atoms does not match the number of coordinate sets (x,y,z) '
                 f'read in ({input_file}).')

    scalar_observables = ['tag', 'number',
                          'A', 'B', 'C',
                          'mu', 'alpha',
                          'HOMO', 'LUPO', 'gap',   ## 'LUPO' holds the LUMO energy
                          'r2',
                          'ZPVE', 'U0', 'U', 'H', 'G', 'Cv']

    if len(scalar_values) == 17:
        scalar_dict = dict(zip(scalar_observables, scalar_values))
    else:
        sys.exit(f'Error: there is an incorrect number of scalar properties ({input_file} line 2).')

    return n_atoms, scalar_dict, coords, freq, smiles, inchi
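
All values returned by `process_qm9_file` are still strings. A possible follow-up step, sketched here with the hypothetical helpers `to_float` and `typed_scalars`, converts the scalar properties to numbers. The handling of a Mathematica-style `*^` exponent is an assumption about how some QM9 files write very small numbers; check it against your own copy of the data.

```python
def to_float(token: str) -> float:
    """Convert a numeric token; '*^' is treated as an exponent marker (assumption)."""
    return float(token.replace('*^', 'e'))


def typed_scalars(scalar_dict: dict) -> dict:
    """Return a copy with 'number' as int and every field except 'tag' as float."""
    typed = {}
    for key, value in scalar_dict.items():
        if key == 'tag':
            typed[key] = value
        elif key == 'number':
            typed[key] = int(value)
        else:
            typed[key] = to_float(value)
    return typed


example = {'tag': 'gdb', 'number': '23256', 'mu': '1.8504', 'Cv': '28.711'}
print(typed_scalars(example))
# {'tag': 'gdb', 'number': 23256, 'mu': 1.8504, 'Cv': 28.711}
print(to_float('2.5*^-6'))
# 2.5e-06
```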

In [7]:
def split_coord_mulliken(elem_coord_mulliken: list=None, input_file: str=None):
    ''' Takes the 'coordinate' part of QM9's XYZ-like file and
            splits it into true XYZ content (i.e. element, x,y,z-coordinates)
            and the Mulliken partial atomic charges (PAC).
            
        Input
            elem_coord_mulliken (list of lists): element, x,y,z-coordinates, Mulliken PAC
            input_file (str): name of the cleaned 'txt' file; the XYZ file is written next to it
        Output
            a new file that represents a true XYZ formatted file, including the
                QM9 molecule identifier
        Return
            atom_xyz (list of lists): element, x,y,z-coordinates
            mulliken_dict (dict): maps element symbol plus atom index (e.g. 'C0') to its Mulliken charge
    '''
    
    if not isinstance(elem_coord_mulliken, list):
        sys.exit(f'Error: {elem_coord_mulliken} is not a list.')
    elif not isinstance(input_file, str):
        sys.exit(f'Error: {input_file} is not a string.')
    else:
        ## create and save a true xyz formatted file
        atom_xyz = [elem_xyz[:-1] for elem_xyz in elem_coord_mulliken]

        output = str(Path(input_file).with_suffix('.xyz'))

        ## NOTE: the comment line reads `scalar_dict` from the notebook's global scope
        with open(output, 'w', newline='') as file:
            writer = csv.writer(file, delimiter=' ', quoting=csv.QUOTE_NONE, lineterminator='\n')
            writer.writerow([len(elem_coord_mulliken)])   ## number of atoms
            writer.writerow([f"{scalar_dict['tag']}_molecule_number_{scalar_dict['number']}"])
            writer.writerows(atom_xyz)

        ## create mulliken partial atomic charge dictionary
        element_number = []
        mulliken_pac = []
        
        for number, content in enumerate(elem_coord_mulliken):
            element_number.append(f'{content[0]}{number}')
            mulliken_pac.append(float(content[-1]))

        mulliken_dict = dict(zip(element_number, mulliken_pac))
        
        return atom_xyz, mulliken_dict

In [8]:
files = list(path_qm9_clean.glob('*.txt'))

for file in files:
    #print(str(file))
    n_atoms, scalar_dict, coords, freq, smiles, inchi = process_qm9_file(input_file=str(file),
                                                                         source_dir=str(path_qm9_clean))

    atom_xyz, mulliken_charges = split_coord_mulliken(elem_coord_mulliken=coords, input_file=str(file))

    #print(n_atoms)
    #print(scalar_dict)
    #print(atom_xyz)
    #print(freq)
    #print(smiles)
    #print(inchi)
    #print(mulliken_charges)

    linear = None
    if 3 * n_atoms - 5 == len(freq):
        linear = True
    elif 3 * n_atoms - 6 == len(freq):
        linear = False
    #print(f'Molecule has a linear geometry: {linear}')

    print()

structures_clean/dsgdb9nsd_125225.txt
18
{'tag': 'gdb', 'number': '125225', 'A': '5.18676', 'B': '1.00079', 'C': '0.88165', 'mu': '2.4847', 'alpha': '80.76', 'HOMO': '-0.2044', 'LUPO': '0.0303', 'gap': '0.2347', 'r2': '1335.1945', 'ZPVE': '0.149004', 'U0': '-398.126029', 'U': '-398.117489', 'H': '-398.116544', 'G': '-398.160206', 'Cv': '31.161'}
[['C', '0.0819259979', '1.5046804323', '0.0379205798'], ['C', '0.0060284406', '0.0151434746', '-0.046636406'], ['C', '0.1057230572', '-0.8480748649', '-1.1181440296'], ['C', '-0.0526481266', '-2.1427080298', '-0.5629800029'], ['N', '-0.2357463599', '-2.0970816957', '0.7527775634'], ['N', '-0.1988171951', '-0.7718789346', '1.0405520581'], ['N', '-0.0924445885', '-3.3475976214', '-1.2790227992'], ['C', '1.1753692478', '-3.9739077579', '-1.6688048893'], ['C', '0.3250726642', '-4.5659042572', '-0.5987234273'], ['H', '0.9219962281', '1.836002537', '0.6594334047'], ['H', '0.2173663288', '1.9292322623', '-0.958717299'], ['H', '-0.8323416726', '1.93491

21
{'tag': 'gdb', 'number': '106797', 'A': '2.47217', 'B': '1.42117', 'C': '1.08612', 'mu': '0.9586', 'alpha': '78.04', 'HOMO': '-0.2281', 'LUPO': '0.0752', 'gap': '0.3033', 'r2': '1187.2201', 'ZPVE': '0.182415', 'U0': '-424.217521', 'U': '-424.20848', 'H': '-424.207536', 'G': '-424.250908', 'Cv': '34.894'}
[['C', '-0.0685219115', '1.4819785423', '-0.0808243188'], ['C', '0.0441675504', '-0.0091746803', '0.0693384505'], ['C', '1.3523454842', '-0.7490606444', '-0.0964914294'], ['C', '0.616864799', '-0.7607931754', '1.2463456659'], ['C', '1.2495520088', '-0.1192229159', '2.4526960417'], ['O', '2.1346082054', '-0.9993585097', '3.1360822391'], ['C', '-0.3269448431', '-1.9367462004', '1.4653340745'], ['C', '-1.1480023813', '-1.9803869481', '0.161037382'], ['O', '-1.1214647009', '-0.6614740385', '-0.3828337303'], ['H', '0.8263503268', '1.9843095462', '0.2950772511'], ['H', '-0.1889473679', '1.7477828212', '-1.1360105477'], ['H', '-0.9388503945', '1.860624924', '0.4641281562'], ['H', '1.385064

15
{'tag': 'gdb', 'number': '23256', 'A': '2.99209', 'B': '1.50218', 'C': '1.08721', 'mu': '1.8504', 'alpha': '71.47', 'HOMO': '-0.2297', 'LUPO': '-0.075', 'gap': '0.1547', 'r2': '1062.148', 'ZPVE': '0.112597', 'U0': '-453.887471', 'U': '-453.879938', 'H': '-453.878994', 'G': '-453.919666', 'Cv': '28.711'}
[['O', '-0.0536089971', '1.2928168397', '0.232120298'], ['N', '0.0049726238', '-0.0818735214', '0.1516331783'], ['C', '-1.1418803592', '-0.6415736258', '0.0412230413'], ['C', '-1.2746804335', '-2.1079774195', '-0.1131160949'], ['C', '-2.1770147084', '-2.8000348264', '0.856349612'], ['N', '-2.6831453583', '-2.4431042486', '-0.46687532'], ['C', '-3.4527644324', '-1.1903546067', '-0.4697770396'], ['C', '-2.5259446893', '-0.0393587975', '-0.0461903917'], ['O', '-2.8515703917', '1.1034219114', '0.1419022092'], ['H', '0.8781012324', '1.5358163688', '0.2896102358'], ['H', '-0.4885837015', '-2.6575745989', '-0.618862613'], ['H', '-2.0490874709', '-3.8646309058', '1.0283060367'], ['H', '-2.52

21
{'tag': 'gdb', 'number': '114344', 'A': '2.74499', 'B': '1.33648', 'C': '1.20884', 'mu': '0.795', 'alpha': '78.56', 'HOMO': '-0.2342', 'LUPO': '0.0833', 'gap': '0.3175', 'r2': '1161.7497', 'ZPVE': '0.182649', 'U0': '-424.214084', 'U': '-424.205142', 'H': '-424.204198', 'G': '-424.247819', 'Cv': '33.634'}
[['C', '-0.4884969132', '1.1385483464', '-0.1678086787'], ['O', '-0.2397499582', '-0.2535562419', '-0.2653911068'], ['C', '0.7720421011', '-0.5804758686', '-1.1757818597'], ['C', '1.0109517183', '-2.1061299879', '-1.1375938786'], ['C', '0.7335854959', '-2.5913508468', '-2.5529529668'], ['C', '-0.6640187522', '-2.4215823341', '-3.1423172217'], ['C', '-1.9200440569', '-2.2211230839', '-2.3250543296'], ['C', '0.3916025018', '-1.3627288627', '-3.3459060657'], ['O', '0.4043582147', '-0.2323693835', '-2.5084833276'], ['H', '0.4236029198', '1.6840839548', '0.1177661944'], ['H', '-0.8611832383', '1.5463901282', '-1.1139089407'], ['H', '-1.24235117', '1.2733695489', '0.6106961075'], ['H', '1

17
{'tag': 'gdb', 'number': '26951', 'A': '3.0731', 'B': '1.38129', 'C': '1.03576', 'mu': '0.4488', 'alpha': '76.54', 'HOMO': '-0.2228', 'LUPO': '0.0075', 'gap': '0.2303', 'r2': '1154.8888', 'ZPVE': '0.136776', 'U0': '-418.014112', 'U': '-418.006078', 'H': '-418.005134', 'G': '-418.047305', 'Cv': '29.555'}
[['C', '0.1228725442', '1.5043473597', '0.2325865305'], ['C', '-0.0052285204', '0.0473552574', '0.0014441156'], ['C', '-1.0569541838', '-0.7861753748', '-0.2087008196'], ['N', '-0.5907031627', '-2.0919472335', '-0.3697588732'], ['C', '0.6894356591', '-1.9828941441', '-0.2514441212'], ['O', '1.1323752833', '-0.7266622522', '-0.0256063171'], ['N', '-2.4094007951', '-0.4086148328', '-0.2588780345'], ['C', '-3.2721106893', '-1.1524962342', '-1.192282626'], ['C', '-3.3882334596', '-1.3741126784', '0.2689868068'], ['H', '0.5893157865', '1.7228734399', '1.1998791699'], ['H', '0.7282277035', '1.987095337', '-0.5429793048'], ['H', '-0.8765620651', '1.9438087769', '0.2194875794'], ['H', '1.437

