<h1>Loading the QM9 Dataset and Creating DataLoaders with Features Based on the GeoMol Paper</h1>

In this notebook, we recreate the featurization functions from the GeoMol paper. We try to fix the issues we encountered and to gain a deeper understanding of how the featurization was actually implemented.

### 1) Import necessary libraries

In [1]:
import torch
from torch_geometric.datasets import QM9
import pickle
from rdkit import Chem
import torch_geometric.data.data
from torch_geometric.data import Dataset, Data, Batch, InMemoryDataset
from torch_geometric.loader import DataLoader
#from torch.utils.data import DataLoader
import os
import os.path as osp
import sys
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
from torch_geometric.data.collate import collate
from tqdm import tqdm
import glob
from rdkit.Chem.rdchem import HybridizationType
from rdkit.Chem.rdchem import BondType as BT
from torch.nn.functional import one_hot
from torch import scatter
import random
from rdkit.Chem.rdchem import ChiralType
import numpy as np
from torchvision import transforms
from torch import Tensor
from dig.threedgraph.method import SphereNet 
from dig.threedgraph.evaluation import ThreeDEvaluator
from dig.threedgraph.method import run

In [7]:
sys.path.append('..')
from Helper_Libraries.featurization_qm9 import qm9_data

In [2]:
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda', index=0)

Now we load the QM9 dataset, which is already processed, from https://github.com/klicperajo/dimenet/tree/master/data.

In [3]:
datasettorch = QM9(root='../../../others_approaches/embeddings_nets/DIG-dig-stable/tutorials/KDD2022/dataset/')

Let's see how the data is represented in this form.

In [6]:
print(datasettorch[0])

Data(x=[5, 11], edge_index=[2, 8], edge_attr=[8, 4], y=[1, 19], pos=[5, 3], z=[5], name='gdb_1', idx=[1])


Each molecule can be identified through the "name" field (here 'gdb_1', which is a GDB identifier rather than a SMILES string). We can also add further features to the nodes as needed. According to the source code of torch_geometric.datasets.qm9, not all features from the GeoMol paper are used, so we need to create data objects that contain all the necessary features. To do this, we need a function that iterates over the whole QM9 dataset and turns each molecule into a data object suitable for a DataLoader. The input dimension of the models is then defined from the data.
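As a rough illustration of how per-atom features could be assembled and how the input dimension follows from them, here is a minimal, library-free sketch. The helper names (`one_hot_with_unknown`, `atom_features`) and the vocabularies are hypothetical and chosen only for illustration; they are not the feature set used by GeoMol or by the `featurize()` function in this notebook.

```python
from typing import List, Sequence

# Hypothetical vocabularies, for illustration only.
ATOM_TYPES = ["H", "C", "N", "O", "F"]  # the elements that occur in QM9
DEGREES = [0, 1, 2, 3, 4]


def one_hot_with_unknown(value, choices: Sequence) -> List[int]:
    """One-hot encode `value` over `choices`; the last slot catches unseen values."""
    vec = [0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1
    return vec


def atom_features(symbol: str, degree: int, is_aromatic: bool) -> List[int]:
    """Concatenate the individual encodings into one node feature vector."""
    return (
        one_hot_with_unknown(symbol, ATOM_TYPES)
        + one_hot_with_unknown(degree, DEGREES)
        + [int(is_aromatic)]
    )


# Two example atoms: a four-coordinate carbon and a hydrogen bonded to it.
x = [atom_features("C", 4, False), atom_features("H", 1, False)]

# The model's input dimension is simply the length of one feature vector.
num_node_features = len(x[0])
print(num_node_features)
```

In a real pipeline the same idea would be applied to every atom of every RDKit molecule, and the resulting rows stacked into the `x` tensor of a data object.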

Let's first inspect the raw data files in their folder.

In [4]:
path_to_data = '../../../others_approaches/conformation_generation/GeoMol/data/QM9/'


Now we take a look at the data for a single molecule.

In [None]:
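# Open one molecule's pickle file and load its contents;
# the context manager closes the file automatically
with open(osp.join(path_to_data, 'c1cc2c(cn1)CCC2.pickle'), 'rb') as mol_file:
    data = pickle.load(mol_file)

# Inspect the available keys, then print the first conformer as a MolBlock
print(data.keys())
mol = data['conformers'][0]['rd_mol']
print(Chem.MolToMolBlock(mol))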

### 2) Create an InMemoryDataset class
#### InMemoryDataset is a PyTorch Geometric data structure for storing graph-based datasets, which makes it well suited to graph neural networks. It stores all graphs collated into one large data object, together with a map used to access each graph individually.

This class will utilise a custom function called "featurize()" to build the graph objects with the desired set of features.

We therefore need our own dataset class that holds the data objects for all QM9 molecules.
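To make the "one big object plus a map" idea concrete, the following library-free sketch mimics the collated storage with plain Python lists. The function names `collate_graphs` and `get_graph` are hypothetical, and this is only an illustration of the concept, not PyTorch Geometric's actual implementation.

```python
from typing import Dict, List, Tuple


def collate_graphs(graphs: List[Dict[str, list]]) -> Tuple[list, List[int]]:
    """Concatenate the node rows of all graphs and record where each graph starts and ends."""
    x_all: list = []
    slices = [0]
    for graph in graphs:
        x_all.extend(graph["x"])
        slices.append(len(x_all))  # end offset of this graph
    return x_all, slices


def get_graph(x_all: list, slices: List[int], i: int) -> list:
    """Recover the node rows of graph `i` from the collated storage."""
    return x_all[slices[i]:slices[i + 1]]


# Three toy graphs with 2, 1 and 3 nodes (one feature per node).
graphs = [
    {"x": [[0.1], [0.2]]},
    {"x": [[0.3]]},
    {"x": [[0.4], [0.5], [0.6]]},
]
x_all, slices = collate_graphs(graphs)
second_graph = get_graph(x_all, slices, 1)
```

Storing the offsets once means that any single graph can be sliced out of the collated storage without keeping thousands of small objects in memory.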

In [5]:
qm9_set = qm9_data(root= path_to_data)

Processing...
  0%|          | 1/133232 [00:00<05:25, 409.28it/s]


{'geom_id': 121529326, 'set': 1, 'degeneracy': 1, 'totalenergy': -5.20677199, 'relativeenergy': 0.0, 'boltzmannweight': 1.0, 'conformerweights': [1.0], 'rd_mol': <rdkit.Chem.rdchem.Mol object at 0x7fa65f594630>}
{'geom_id': 122742594, 'set': 1, 'degeneracy': 1, 'totalenergy': -9.42852736, 'relativeenergy': 0.0, 'boltzmannweight': 1.0, 'conformerweights': [1.0], 'rd_mol': <rdkit.Chem.rdchem.Mol object at 0x7fa65f5947c0>}


NameError: name 'y' is not defined

In [None]:
print(train_dataset.data)

In [None]:
loader = DataLoader(train_dataset, batch_size=2)

In [None]:
for step, batch_data in enumerate(tqdm(loader)):
    print(type(batch_data.y))


In [None]:
target = 'boltzmann_weight'  # attribute used as the regression target (stock QM9 targets: mu, alpha, homo, lumo, gap, r2, zpve, U0, U, H, G, Cv)
qm9_set.data.y = qm9_set.data[target]


split_idx = qm9_set.get_idx_split(len(qm9_set.data.y), train_size=300, valid_size=100, seed=42)

train_dataset, valid_dataset, test_dataset = qm9_set[split_idx['train']], qm9_set[split_idx['valid']], qm9_set[split_idx['test']]
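The `get_idx_split` call above comes from the custom `qm9_data` class, and its implementation is not shown in this notebook. As a hedged sketch of how a seeded index split of this kind might work, the hypothetical helper `seeded_idx_split` below shuffles all indices with a fixed seed and cuts them into three disjoint parts. It uses only the standard library and is an assumption about the general technique, not the actual method.

```python
import random
from typing import Dict, List


def seeded_idx_split(data_size: int, train_size: int, valid_size: int, seed: int) -> Dict[str, List[int]]:
    """Shuffle indices reproducibly, then cut them into train/valid/test parts."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    ids = list(range(data_size))
    rng.shuffle(ids)
    return {
        "train": ids[:train_size],
        "valid": ids[train_size:train_size + valid_size],
        "test": ids[train_size + valid_size:],  # everything that is left over
    }


split = seeded_idx_split(data_size=1000, train_size=300, valid_size=100, seed=42)
print(len(split["train"]), len(split["valid"]), len(split["test"]))
```

Because the generator is seeded, calling the helper again with the same arguments returns the same split, which keeps experiments comparable across runs.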


In [None]:
model = SphereNet(energy_and_force=False, cutoff=5.0, num_layers=4, 
        hidden_channels=128, out_channels=1, int_emb_size=64, 
        basis_emb_size_dist=8, basis_emb_size_angle=8, basis_emb_size_torsion=8, out_emb_channels=256, 
        num_spherical=3, num_radial=6, envelope_exponent=5, 
        num_before_skip=1, num_after_skip=2, num_output_layers=3, use_node_features=True
        )
loss_func = torch.nn.L1Loss()
evaluation = ThreeDEvaluator()

In [None]:
print(type(train_dataset))                                    

In [None]:
run3d = run()
run3d.run(device, train_dataset, valid_dataset, test_dataset, 
        model, loss_func, evaluation, 
        epochs=20, batch_size=4, vt_batch_size=4, lr=0.0005, lr_decay_factor=0.5, lr_decay_step_size=15)