In [1]:
from torch_geometric.datasets import QM9


qm9 = QM9('./datasets/QM9')

Downloading https://data.pyg.org/datasets/qm9_v3.zip
Extracting datasets/QM9/raw/qm9_v3.zip
Processing...
Using a pre-processed version of the dataset. Please install 'rdkit' to alternatively process the raw data.
Done!


In [8]:
qm9

QM9(130831)

In [10]:
qm9[0]

Data(x=[5, 11], edge_index=[2, 8], edge_attr=[8, 4], y=[1, 19], pos=[5, 3], idx=[1], name='gdb_1', z=[5])

How do we convert this graph to a picture of a molecule?

In [15]:
qm9[0].num_nodes

5

There are 5 atoms (nodes) in this graph.

In [14]:
qm9[0].pos # Position Matrix for each of the Atoms

tensor([[-1.2700e-02,  1.0858e+00,  8.0000e-03],
        [ 2.2000e-03, -6.0000e-03,  2.0000e-03],
        [ 1.0117e+00,  1.4638e+00,  3.0000e-04],
        [-5.4080e-01,  1.4475e+00, -8.7660e-01],
        [-5.2380e-01,  1.4379e+00,  9.0640e-01]])
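As a quick sanity check on the position matrix, the sketch below computes the distance from atom 0 to every other atom in plain Python. The coordinates are copied from the output above; treating atom 0 as the central atom and the units as ångströms are assumptions, not something the output itself states.

```python
import math

# Coordinates copied from qm9[0].pos above (units assumed to be ångströms).
pos = [
    [-1.2700e-02, 1.0858e+00, 8.0000e-03],
    [2.2000e-03, -6.0000e-03, 2.0000e-03],
    [1.0117e+00, 1.4638e+00, 3.0000e-04],
    [-5.4080e-01, 1.4475e+00, -8.7660e-01],
    [-5.2380e-01, 1.4379e+00, 9.0640e-01],
]

# Distance from atom 0 (assumed to be the central atom) to each other atom.
center = pos[0]
distances = [math.dist(center, other) for other in pos[1:]]

for atom_index, d in enumerate(distances, start=1):
    print(f"atom 0 -> atom {atom_index}: {d:.3f}")
```

All four distances come out at roughly 1.09, which is what one would expect for the four equivalent bonds of a small tetrahedral molecule if the units are indeed ångströms.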

Let's see what information is available for each node.

In [17]:
qm9[0].x

tensor([[0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

What is this?

According to the implementation details, the node features are the concatenation of two things.

There is a `types` dict:

`{'H': 0, 'C': 1, 'N': 2, 'O': 3, 'F': 4}`

`type_idx` is a list holding, for each atom, the number associated with its symbol in this dict.

It is then passed through the `F.one_hot` function.


In [18]:
# Let's do an example:

import torch
import torch.nn.functional as F

F.one_hot(torch.tensor([0, 0, 1, 0, 0]), num_classes=5)

tensor([[1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]])
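To connect the dict to the one-hot step, here is a minimal pure-Python sketch that builds `type_idx` from a list of element symbols and one-hot encodes it by hand. The `symbols` list and the `one_hot` helper are hypothetical names introduced for illustration; the atom order (carbon first) is an assumption.

```python
# Symbol-to-index dict as given in the implementation.
types = {'H': 0, 'C': 1, 'N': 2, 'O': 3, 'F': 4}

# Hypothetical input: element symbols of a five-atom molecule, carbon first (assumed order).
symbols = ['C', 'H', 'H', 'H', 'H']

# Look up each symbol to get its class index.
type_idx = [types[s] for s in symbols]


def one_hot(index, num_classes):
    """Hand-written stand-in for F.one_hot on a single index."""
    return [1 if i == index else 0 for i in range(num_classes)]


encoded = [one_hot(i, len(types)) for i in type_idx]
print(type_idx)  # [1, 0, 0, 0, 0]
print(encoded[0])  # [0, 1, 0, 0, 0]
```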

In [19]:
qm9[0].x

tensor([[0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

The first 5 columns are the one-hot encoded atom type according to the given dict.
Six other atom features are then appended:

`[atomic_number, aromatic, sp, sp2, sp3, num_hs]`

`atomic_number` is the atom's atomic number (6 for carbon, 1 for hydrogen).

`aromatic` is 1 if the atom is aromatic and 0 if it is not.

`sp`, `sp2`, `sp3` are flags for the different hybridizations of the atom.

`num_hs` is the number of hydrogen atoms connected to the atom.
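Putting the two parts together, the following sketch decodes a single row of `x` back into a readable record. `decode_row` and `feature_names` are hypothetical helpers written for this note, and the row is copied from the first line of the `qm9[0].x` output.

```python
# Symbol-to-index dict and the order of the six appended features, as described above.
types = {'H': 0, 'C': 1, 'N': 2, 'O': 3, 'F': 4}
feature_names = ['atomic_number', 'aromatic', 'sp', 'sp2', 'sp3', 'num_hs']


def decode_row(row):
    """Split an 11-element feature row into atom symbol and named features."""
    one_hot, extras = row[:5], row[5:]
    # Find the symbol whose index is the hot position.
    symbol = next(s for s, i in types.items() if one_hot[i] == 1)
    return {'symbol': symbol, **dict(zip(feature_names, extras))}


# First row of qm9[0].x, copied from the output above.
first_atom = decode_row([0., 1., 0., 0., 0., 6., 0., 0., 0., 0., 4.])
print(first_atom['symbol'], first_atom['atomic_number'], first_atom['num_hs'])
```

For this row the decoded symbol is `C`, with atomic number 6 and 4 attached hydrogens.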

1. The atomic numbers are collected in a list called `atomic_number`; `z` is the tensor built from this list.

2. A new tensor called `hs` is created, which marks the atoms that are hydrogen: `hs = (z == 1).to(torch.float)`
3. `num_hs = scatter(hs[row], col, dim_size=N).tolist()`

`num_hs` seems complicated! It appears to be the number of hydrogen atoms connected to each atom. This matches `x`: there are four hydrogen nodes and one `C` node, and the last column is `4` for the carbon row and `0` for the hydrogen rows.
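The scatter step can be mimicked in plain Python to check this reading. This is a sketch under assumptions: `edge_index` is not printed in this notebook, so the edge list below (carbon 0 bonded to hydrogens 1 to 4, stored in both directions) is assumed, as is the interpretation of `row` and `col` as the source and target rows of `edge_index` with a sum reduction.

```python
# Atomic numbers for the five atoms (carbon first, then four hydrogens).
z = [6, 1, 1, 1, 1]

# Assumed bidirectional edge list: atom 0 is bonded to atoms 1..4.
row = [0, 0, 0, 0, 1, 2, 3, 4]  # source node of each directed edge
col = [1, 2, 3, 4, 0, 0, 0, 0]  # target node of each directed edge

# hs[i] is 1.0 if atom i is a hydrogen, mirroring (z == 1).to(torch.float).
hs = [1.0 if number == 1 else 0.0 for number in z]

# Pure-Python stand-in for scatter(hs[row], col, dim_size=N) with a sum:
# each edge adds "is the source a hydrogen?" to its target node.
num_hs = [0.0] * len(z)
for source, target in zip(row, col):
    num_hs[target] += hs[source]

print(num_hs)  # [4.0, 0.0, 0.0, 0.0, 0.0]
```

The result agrees with the last column of `x`, which supports the interpretation of `num_hs` as a per-atom count of bonded hydrogens.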

What is `edge_attr`?

There is a dict:

`bonds = {BT.SINGLE: 0, BT.DOUBLE: 1, BT.TRIPLE: 2, BT.AROMATIC: 3}`

This specifies which type of bond an edge is.

The edges are bidirectional, meaning the number of edges will always be even. Let's check that.


In [21]:
qm9[0].edge_attr

tensor([[1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.],
        [1., 0., 0., 0.]])

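Every row is one-hot at index 0 (`BT.SINGLE`), so all of the bonds are single. There are 8 rows, which is an even number, as expected for bidirectional edges.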
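The same reading can be done programmatically. The sketch below decodes each `edge_attr` row using the order of the `bonds` dict and halves the counts to get undirected bonds; the rows are copied from the output above, and `bond_names` is a hypothetical list of plain strings standing in for the `BT` enum members.

```python
from collections import Counter

# Bond type names in the index order of the bonds dict (strings stand in for BT members).
bond_names = ['SINGLE', 'DOUBLE', 'TRIPLE', 'AROMATIC']

# qm9[0].edge_attr copied from the output above: eight identical one-hot rows.
edge_attr = [[1., 0., 0., 0.] for _ in range(8)]

# Each bond is stored once per direction, so the row count should be even.
is_even = len(edge_attr) % 2 == 0

# Decode every directed edge, then halve the counts to get undirected bonds.
directed_counts = Counter(bond_names[row.index(1.0)] for row in edge_attr)
undirected_counts = {name: n // 2 for name, n in directed_counts.items()}

print(is_even)  # True
print(undirected_counts)  # {'SINGLE': 4}
```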

`z` holds the atomic number of each atom.

In [23]:
# Let's check whether installing RDKit makes a difference in the targets.
qm9[0].y

tensor([[    0.0000,    13.2100,   -10.5499,     3.1865,    13.7363,    35.3641,
             1.2177, -1101.4878, -1101.4098, -1101.3840, -1102.0229,     6.4690,
           -17.1722,   -17.2868,   -17.3897,   -16.1519,   157.7118,   157.7100,
           157.7070]])

In [24]:
!pip install rdkit


Collecting rdkit
  Using cached rdkit-2022.3.5-cp38-cp38-macosx_11_0_arm64.whl (30.3 MB)
Collecting Pillow
  Downloading Pillow-9.2.0-cp38-cp38-macosx_11_0_arm64.whl (2.8 MB)
Installing collected packages: Pillow, rdkit
Successfully installed Pillow-9.2.0 rdkit-2022.3.5


In [26]:
import rdkit

from torch_geometric.datasets import QM9


qm9 = QM9('./datasets/QM9')

qm9[0].y

Downloading https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/molnet_publish/qm9.zip
Extracting datasets/QM9/raw/qm9.zip
Downloading https://ndownloader.figshare.com/files/3195404


tensor([[    0.0000,    13.2100,   -10.5499,     3.1865,    13.7363,    35.3641,
             1.2177, -1101.4878, -1101.4098, -1101.3840, -1102.0229,     6.4690,
           -17.1722,   -17.2868,   -17.3897,   -16.1519,   157.7118,   157.7100,
           157.7070]])

The answer is no: the targets for `qm9[0]` are identical with and without RDKit installed.