# Predicting Ground State Energy from Coulomb Matrices

This [dataset](https://www.kaggle.com/datasets/burakhmmtgl/energy-molecule) contains ground state energies of 16,242 molecules calculated by quantum mechanical simulations.

The data contains 1277 columns. The first 1275 columns are entries of the Coulomb matrix and serve as the molecular features. The 1276th column is the PubChem ID from which the molecular structure was obtained. The 1277th column is the atomization energy calculated by simulations using the Quantum ESPRESSO package.

In the CSV file, the first column (X1) is the data index.
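
For readers unfamiliar with the descriptor, the sketch below builds a Coulomb matrix from nuclear charges and positions using the commonly cited definition (diagonal `0.5 * Z_i**2.4`, off-diagonal `Z_i * Z_j / |R_i - R_j|`) and flattens its upper triangle. The function names, the padding size of 50 atoms, and the flattening order are assumptions made for illustration: 1275 = 50 * 51 / 2 is consistent with the upper triangle of a 50 x 50 symmetric matrix, but the exact layout used in the dataset is not confirmed here.

```python
import numpy as np

def coulomb_matrix(charges, positions, max_atoms=50):
    """Zero-padded Coulomb matrix (standard definition; padding size is an assumption)."""
    Z = np.asarray(charges, dtype=float)
    R = np.asarray(positions, dtype=float)
    n = len(Z)
    # Pairwise distances between nuclei
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid dividing by zero; diagonal is overwritten below
    block = np.outer(Z, Z) / dist            # off-diagonal: Z_i * Z_j / |R_i - R_j|
    np.fill_diagonal(block, 0.5 * Z ** 2.4)  # diagonal: 0.5 * Z_i ** 2.4
    C = np.zeros((max_atoms, max_atoms))
    C[:n, :n] = block
    return C

def upper_triangle(C):
    """Flatten the upper triangle (including the diagonal) into a feature vector."""
    return C[np.triu_indices(C.shape[0])]

# Hypothetical H2 molecule, positions in bohr
C = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])
features = upper_triangle(C)
print(features.shape)  # (1275,)
```

Zero padding lets molecules with fewer atoms share the same fixed-length feature vector.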

The dataset was used for a [publication using a tree-based ML framework](https://arxiv.org/pdf/1609.07124.pdf).

## Getting the data

In this test we do not fabricate the batched tensors by hand. Instead, we wrap the data in a custom dataset and let PyTorch `DataLoader`s handle batching and shuffling, so that the training loop accesses the data correctly.

We create a class that inherits from torch's `Dataset` class to create our own dataset object. It implements three methods: `__init__`, `__len__`, and `__getitem__`.

In [2]:
import os
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import torch
import numpy as np
import pytorch_lightning as pl

Create the dataset and dataloaders for PyTorch Lightning. Note that the file is read once in `__init__`, and `__getitem__` simply indexes the tensors already held in memory. This is done on purpose, so that the data does not have to be read or loaded again every time a sample is accessed.

In [3]:
class GS_from_CM_Dataset(Dataset):
    def __init__(self, path_to_data):
        if os.path.exists(path_to_data):
            self.data = pd.read_csv(path_to_data)
        else:
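            raise FileNotFoundError(f'No dataset found at {path_to_data}')
        # Shuffle the rows once at load time
        self.data = self.data.sample(frac=1)
        # Features: every column except the index, the PubChem ID, and the target energy
        self.X = torch.tensor(self.data.drop(columns=['id', 'pubchem_id', 'Eat']).to_numpy()).to(torch.float)
        # Target: atomization energy
        self.y = torch.tensor(self.data['Eat'].to_numpy()).to(torch.float)
        self.length = self.y.shape[0]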
    
    def __len__(self):
        return self.length
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
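
To illustrate how a `DataLoader` consumes a map-style dataset like this one, here is a minimal sketch using a hypothetical `ToyDataset` filled with random tensors (not the real data). The loader uses `__len__` to know how many samples exist and `__getitem__` to fetch each one, then stacks the samples into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in for the real dataset: random features and targets."""
    def __init__(self, n_samples=10, n_features=4):
        self.X = torch.randn(n_samples, n_features)
        self.y = torch.randn(n_samples)

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in loader]
print(batch_shapes)
# [((4, 4), (4,)), ((4, 4), (4,)), ((2, 4), (2,))]
```

The last batch is smaller because 10 samples do not divide evenly into batches of 4.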


In [6]:
data = GS_from_CM_Dataset(path_to_data='../PyTorch_examples/data/GS_from_CM/GS_from_CM_data.csv')

batch_size = 128

data_train, data_test = torch.utils.data.random_split(data, (int(len(data)*0.8), len(data)-int(len(data)*0.8)))
data_train, data_val = torch.utils.data.random_split(data_train, (int(len(data_train)*0.8), len(data_train)-int(len(data_train)*0.8)))

print(f'Train data      -> {len(data_train)}')
print(f'Validation data -> {len(data_val)}')
print(f'Test data       -> {len(data_test)}')
print(f'Total data      -> {len(data)} = {len(data_train)} + {len(data_val)} + {len(data_test)}')

train_loader, test_loader, val_loader = [DataLoader(s, batch_size, shuffle=True, num_workers=4) for s in (data_train, data_test, data_val)]

Train data      -> 10394
Validation data -> 2599
Test data       -> 3249
Total data      -> 16242 = 10394 + 2599 + 3249
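
The sizes of `data_train`, `data_val`, and `data_test` follow from applying an 80/20 split twice: first to separate the test set, then to separate the validation set from what remains. The helper below is a hypothetical function written only to reproduce that arithmetic.

```python
def split_sizes(n_total, frac=0.8):
    """Reproduce the two-stage 80/20 split arithmetic used above."""
    n_trainval = int(n_total * frac)   # first split: train+val vs test
    n_test = n_total - n_trainval
    n_train = int(n_trainval * frac)   # second split: train vs val
    n_val = n_trainval - n_train
    return n_train, n_val, n_test

sizes = split_sizes(16242)
print(sizes)  # (10394, 2599, 3249)
```

The effective proportions are therefore roughly 64% train, 16% validation, and 20% test.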


## Create the Model

Now, instead of vanilla PyTorch, we will create the model using Lightning.

In [7]:
from torch import nn
import torch.nn.functional as F

class GS_from_CM_model(pl.LightningModule):
    
    def __init__(self, input_len):
        super().__init__()
        self.energy = nn.Sequential(
            nn.Linear(input_len, 16, bias=False),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1)
        )
    
    def forward(self, X):
        return self.energy(X)
    
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)
    
    def training_step(self, train_batch, batch_idx):
        X, y = train_batch
        pred = self.energy(X)
        loss = F.mse_loss(pred.squeeze(), y.squeeze())
        self.log('train_loss', loss)
        return loss

    def test_step(self, test_batch, batch_idx):
        X, y = test_batch
        pred = self.energy(X)
        loss = F.mse_loss(pred.squeeze(), y.squeeze())
        self.log('test_loss', loss)
    
    def validation_step(self, val_batch, batch_idx):
        X, y = val_batch
        pred = self.energy(X)
        loss = F.mse_loss(pred.squeeze(), y.squeeze())
        self.log('val_loss', loss)
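
The `squeeze()` calls in the step methods matter because the network outputs a tensor of shape `(batch, 1)` while the targets have shape `(batch,)`. The sketch below, with made-up tensors, shows that without squeezing the two tensors broadcast to `(batch, batch)`, so an error computed from them would average over every prediction/target pair instead of matching pairs only.

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like the model output
y = torch.tensor([1.0, 2.0, 3.0])           # shape (3,), like the targets

# Shapes match after squeezing: the loss compares matching pairs only
loss_ok = F.mse_loss(pred.squeeze(), y)

# Without squeezing, the difference broadcasts to a (3, 3) matrix
diff = pred - y
loss_broadcast = (diff ** 2).mean()

print(loss_ok.item())        # 0.0
print(tuple(diff.shape))     # (3, 3)
print(loss_broadcast.item())
```

Here the predictions are perfect, yet the broadcast version reports a nonzero error (12/9), which is why the shapes are aligned before calling `F.mse_loss`.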


## Training and testing the model

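We can now train the model directly with the Lightning `Trainer`: `fit` runs the training and validation steps automatically, and `test` evaluates the trained model on the test set afterwards.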

In [8]:
model = GS_from_CM_model(input_len=data_train[0][0].shape[-1])  # Instantiate the model

# Get the trainer
trainer = pl.Trainer(max_epochs=20)

# Train the model
trainer.fit(model=model, train_dataloaders=train_loader, val_dataloaders=val_loader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name   | Type       | Params
--------------------------------------
0 | energy | Sequential | 20.5 K
--------------------------------------
20.5 K    Trainable params
0         Non-trainable params
20.5 K    Total params
0.082     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]


Epoch 19: 100%|██████████| 103/103 [00:02<00:00, 45.27it/s, loss=13.8, v_num=14]
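
Two numbers in the output above can be checked by hand. The sketch below assumes an input size of 1275 (the number of feature columns described earlier) and 4-byte float32 parameters. The step count is consistent with the progress bar adding training and validation batches together, which depends on the Lightning version and is an assumption here.

```python
import math

# Layers as (inputs, outputs, has_bias); the first layer is created with bias=False
layers = [(1275, 16, False), (16, 8, True), (8, 1, True)]
n_params = sum(n_in * n_out + (n_out if bias else 0) for n_in, n_out, bias in layers)
size_mb = n_params * 4 / 1e6  # assuming 4 bytes per float32 parameter

batch_size = 128
train_batches = math.ceil(10394 / batch_size)  # batches in the training split
val_batches = math.ceil(2599 / batch_size)     # batches in the validation split

print(n_params, round(size_mb, 3))   # 20545 0.082
print(train_batches + val_batches)   # 103
```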


In [9]:
trainer.test(model, dataloaders=test_loader)


Testing DataLoader 0: 100%|██████████| 26/26 [00:00<00:00, 93.34it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss            13.09735107421875
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_loss': 13.09735107421875}]
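
Because `test_loss` is a mean squared error, its square root gives a root-mean-square error in the same units as the `Eat` column, which is often easier to interpret. A small sketch of that conversion, together with the number of test batches expected for 3249 samples at a batch size of 128:

```python
import math

test_mse = 13.09735107421875          # value reported by trainer.test above
test_rmse = math.sqrt(test_mse)       # same units as the target energy
test_batches = math.ceil(3249 / 128)  # matches the 26/26 in the progress bar

print(f"RMSE = {test_rmse:.2f}, test batches = {test_batches}")
# RMSE = 3.62, test batches = 26
```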

To inspect the logged metrics, run `tensorboard --logdir=PyTorchLightning_examples/lightning_logs/` in a terminal.