# Predicting GroundState Energy from Coulomb Matrices

This [dataset](https://www.kaggle.com/datasets/burakhmmtgl/energy-molecule) contains ground state energies of 16,242 molecules calculated by quantum mechanical simulations.

The data contains 1277 columns. The first 1275 columns are entries in the Coulomb matrix that act as molecular features. The 1276th column is the Pubchem Id where the molecular structures are obtained. The 1277th column is the atomization energy calculated by simulations using the Quantum Espresso package.

In the csv file, the first column (X1) is the data index.

The dataset was used for a [publication using a tree based ML Framework](https://arxiv.org/pdf/1609.07124.pdf).

## Getting the data

In this test we do NOT use PyTorch DataLoaders. Instead, we fabricate the batched tensors by hand and ensure that the data is accessed correctly by the training loop.

We create a class that inherits from torch's `Dataset` class to create our own dataset object. It must include the three given methods.

In [61]:
import os
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch
import numpy as np
import pytorch_lightning as pl

In [62]:
def GS_from_CM_Dataset(path_to_data, transform=None):
    if os.path.exists(path_to_data):
        data = pd.read_csv(path_to_data)
    else:
        raise FileExistsError('No dataset found')
    data = data.sample(frac=1)
    X = data.drop(columns=['id', 'pubchem_id', 'Eat'])
    y = data['Eat']
    
    return torch.tensor(X.to_numpy()).to(torch.float), torch.tensor(y.to_numpy()).to(torch.float)

In [63]:
dataset = GS_from_CM_Dataset(path_to_data='../PyTorch_examples/data/GS_from_CM/GS_from_CM_data.csv')
N_all = len(dataset[0])
subset = 0.9

X_train, y_train, X_test, y_test = \
    dataset[0][:int(N_all*subset)], dataset[1][:int(N_all*subset)], dataset[0][int(N_all*subset):], dataset[1][int(N_all*subset):]
N_train = X_train[0].shape[0]
# print(datatrain[0].shape[0] + datatest[1].shape[0], dataset[0].shape[0])
# print(datatrain[0].shape, datatrain[1].shape, datatest[0].shape, datatest[1].shape)


def tensorBatch(tensor, batch_size):
    """I think this only works for one and two dimensional tensors
    Returns the batched tensor
    """
    N_batch = tensor.shape[0] // batch_size
    N_elems = N_batch * batch_size
    
    print(f'Losing {tensor.shape[0] - N_elems} datapoints due to batching.')
    
    return tensor[:N_elems].reshape(N_batch, batch_size, -1).squeeze(), N_batch

batch_size = 128

# Check that it works. I think it does.
# print(X_train.shape, tensorBatch(X_train, batch_size)[0].shape)
# print(y_train.shape, tensorBatch(y_train, batch_size)[0].shape)
# print(X_test.shape, tensorBatch(X_test, batch_size)[0].shape)
# print(y_test.shape, tensorBatch(y_test, batch_size)[0].shape)

X_train, N_batch = tensorBatch(X_train, batch_size)
y_train, _ = tensorBatch(y_train, batch_size)
# X_test, y_test = [tensorBatch(tens, batch_size) for tens in test_data]

def getDataLoader(X, y):
    if X.shape[0] != y.shape[0]:
        raise IndexError(f'Number of batches is different for the inputs',
                         f'First argument -> {X.shape[0]} != {y.shape[0]} <- Second argument')
    batched = []
    for i in range(X.shape[0]):
        batched.append((X[i], y[i]))
    return batched

train_dataload = getDataLoader(X_train, y_train)
test_dataload = (X_test, y_test)

input_len = X_train.shape[-1]

Losing 25 datapoints due to batching.
Losing 25 datapoints due to batching.


Now, instead of vanilla PyTorch, we will create the model using Lightning.

In [64]:
from torch import nn
import torch.nn.functional as F

class GS_from_CM_model(pl.LightningModule):
    
    def __init__(self):
        print('I am here')
        super().__init__()
        self.energy = nn.Sequential(
            nn.Linear(input_len, 16, bias=False),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1)
        )
    
    def forward(self, X):
        return self.energy(X)
    
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
    
    def training_step(self, train_batch, batch_idx):
        X, y = train_batch
        pred = self.energy(X)
        loss = F.mse_loss(pred, y)
        self.log('train_loss', loss)
        return loss


In [65]:
model = GS_from_CM_model()  # Instantiate the model

I am here


In [66]:
trainer = pl.Trainer(max_epochs=20)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [67]:
trainer.fit(model=model, train_dataloaders=iter(train_dataload))  # WHY PASSED AS ITER WORKS¿?¿?!!


  | Name   | Type       | Params
--------------------------------------
0 | energy | Sequential | 20.5 K
--------------------------------------
20.5 K    Trainable params
0         Non-trainable params
20.5 K    Total params
0.082     Total estimated model params size (MB)


Epoch 0: : 19it [00:00, 162.02it/s, loss=117, v_num=1]

  loss = F.mse_loss(pred, y)


Epoch 19: : 0it [00:00, ?it/s, loss=99.8, v_num=1]      
