# Open MatSci ML Toolkit Tutorial: Training your Custom Model

In this tutorial, we demonstrate how to set up an _**Open MatSci ML Toolkit**_ experiment, from selecting a dataset to implementing your own custom graph neural network (GNN) model. This workflow is recommended for testing custom models on your development machine, such as a laptop, before deploying them on a cluster or a multi-GPU machine and training them on the full dataset. _**Open MatSci ML Toolkit**_ exposes a set of interfaces (as abstract base classes) that, with the help of [PyTorch Lightning](https://www.pytorchlightning.ai/), let you go from nothing to a running experiment in a few lines of code. The rest of this tutorial walks through exactly that.

Let's start by importing a few useful libraries. These include `warnings` from the Python standard library, [PyTorch](https://pytorch.org/), PyTorch Lightning, and [DGL](https://www.dgl.ai/).

In [None]:
# Copyright (C) 2022 Intel Corporation
# SPDX-License-Identifier: MIT License

import warnings

import pytorch_lightning as pl
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

Next we import the abstract classes that implement the _**Open MatSci ML Toolkit**_ interface. In particular, we need a data module and a model module. For this tutorial, we focus on the structure to energy/forces (S2EF) task and import the corresponding modules. The data module is `S2EFDGLDataModule`, which gives access to the development dataset provided with _**Open MatSci ML Toolkit**_ via its `from_devset` method. The model module is `S2EFLitModule`, which ensures that the developed model interfaces properly with _**Open MatSci ML Toolkit**_'s data pipeline and PyTorch Lightning; in particular, it implements the `forward` and `training_step` methods needed for the task. `AbstractEnergyModel` is the base class for the model itself: it registers the model with PyTorch Lightning and specifies that the output should be an energy.

In [None]:
from matsciml.lightning.data_utils import S2EFDGLDataModule
from matsciml.models import AbstractEnergyModel, S2EFLitModule

For reproducibility, we leverage the seeding utilities of both PyTorch Lightning and DGL.

In [None]:
SEED = 42

pl.seed_everything(SEED)
dgl.seed(SEED)
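
As a quick sanity check of what seeding buys us, the sketch below uses plain PyTorch only and does not touch the toolkit: re-seeding the global generator with the same value reproduces the same random draws. `draw_sample` is a hypothetical helper introduced purely for illustration.

```python
import torch


def draw_sample(seed: int, size: int = 4) -> torch.Tensor:
    """Seed PyTorch's global RNG, then draw `size` uniform random numbers."""
    torch.manual_seed(seed)
    return torch.rand(size)


first = draw_sample(42)
second = draw_sample(42)
other = draw_sample(7)

# Same seed, same draws; a different seed gives different draws.
print(torch.equal(first, second))
print(torch.equal(first, other))
```

The same idea applies to weight initialization and data shuffling, which is why the seed is set before the model and data module are created.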

# Model Definition

This section discusses how to construct a model and integrate it into **Open MatSci ML Toolkit** to run on the OCP dataset. The steps can be summarized as follows:

1. Start by implementing/choosing a graph neural network layer
2. Using this layer, construct a layered model that subclasses `AbstractEnergyModel`; this interfaces the model with PyTorch Lightning. Note that any `AbstractEnergyModel` must output a scalar value representing energy
3. Implement any customization of the model's data pipeline; this can be achieved by subclassing `S2EFLitModule` and overriding its `_get_inputs` method

For best practices on designing DGL models, please refer to our model guideline given [here](matsciml/models/README.md).

Now we can look at an example of how to build a new model and integrate it with PyTorch Lightning:

In [None]:
# select/define a convolution layer and create a model

from dgl.nn import GraphConv, AvgPooling


class GraphConvModel(AbstractEnergyModel):
    def __init__(self, num_layers, in_dim, hidden_dim):
        super().__init__()
        # layer widths: in_dim -> hidden_dim -> ... -> hidden_dim
        sizes = [in_dim] + [hidden_dim] * num_layers
        layers = []
        for indx, (_in, _out) in enumerate(zip(sizes[:-1], sizes[1:])):
            # SiLU activation on every layer except the last one
            layers.append(
                GraphConv(
                    _in, _out, activation=F.silu if indx < num_layers - 1 else None
                )
            )
        self.convs = nn.ModuleList(layers)

        # average the node embeddings into one vector per graph
        self.readout = AvgPooling()

        output_dim = 1  # energy is a scalar
        self.proj = nn.Linear(hidden_dim, output_dim)

    def forward(self, graph, features):
        # message passing over the graph
        for layer in self.convs:
            features = layer(graph, features)
        # pool per graph, then project to one energy value per graph
        pooled = self.readout(graph, features)
        out = self.proj(pooled).squeeze()
        return out
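
To make the computation in `GraphConvModel` more concrete, here is a minimal dense-matrix sketch of one graph convolution followed by the readout, written in plain PyTorch without DGL. It assumes the symmetric normalization that `GraphConv` applies by default, and the toy graph, layer sizes, and variable names are illustrative assumptions rather than toolkit code.

```python
import torch

torch.manual_seed(0)

# Toy graph: 3 nodes on a path (0-1-2), with self-loops on the diagonal.
adj = torch.tensor(
    [
        [1.0, 1.0, 0.0],
        [1.0, 1.0, 1.0],
        [0.0, 1.0, 1.0],
    ]
)
num_nodes, in_dim, hidden_dim = 3, 9, 16
features = torch.randn(num_nodes, in_dim)

# Symmetric normalization: D^{-1/2} A D^{-1/2}
deg = adj.sum(dim=1)
norm = deg.pow(-0.5)
adj_norm = norm.unsqueeze(1) * adj * norm.unsqueeze(0)

# One convolution: aggregate neighbor features, apply a linear map,
# then the activation used by the non-final layers.
linear = torch.nn.Linear(in_dim, hidden_dim)
hidden = torch.nn.functional.silu(linear(adj_norm @ features))

# Readout: average over the nodes of the graph, then project to a scalar.
proj = torch.nn.Linear(hidden_dim, 1)
pooled = hidden.mean(dim=0, keepdim=True)  # shape (1, hidden_dim)
energy = proj(pooled).squeeze()  # zero-dimensional tensor for a single graph

print(energy.shape)
```

In the DGL version, `AvgPooling` performs the averaging separately for each graph in a batch, so the model returns one energy per graph.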

Below we specialize our model inputs by subclassing `S2EFLitModule` and overriding `_get_inputs`. This ensures that the DGL graph object in each batch is unpacked into the arguments our model's `forward` expects: the graph itself and a single node feature matrix. If your model accepts a DGL graph directly, this step is not required.

In [None]:
# implement custom pipeline
from dgl import AddSelfLoop


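# this subclass reuses (and shadows) the imported name `S2EFLitModule`,
# so later cells pick up the customized version
class S2EFLitModule(S2EFLitModule):
    def _get_inputs(self, batch):
        # add a self-loop to every node so that no node has zero in-degree
        graph_transform = AddSelfLoop()
        graph = graph_transform(batch.get("graph"))
        # gather every node-level tensor, reshaping 1-D data to column vectors
        features = []
        for _, val in graph.ndata.items():
            features.append(val if val.dim() > 1 else val.view(-1, 1))
        # concatenate into a single (num_nodes, num_features) matrix
        features = torch.hstack(features)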
        return graph, features
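
The feature stacking in `_get_inputs` can be tried in isolation. The sketch below uses a plain dictionary in place of `graph.ndata`; the keys and shapes are hypothetical and not necessarily those of the S2EF development set.

```python
import torch

# Hypothetical node data for a 4-atom structure (illustrative keys/shapes).
ndata = {
    "pos": torch.zeros(4, 3),  # 2-D: one 3-vector per node
    "atomic_numbers": torch.ones(4),  # 1-D: one scalar per node
    "tags": torch.tensor([0.0, 1.0, 1.0, 2.0]),  # 1-D: one scalar per node
}

features = []
for val in ndata.values():
    # Promote per-node scalars to column vectors so every entry is (N, k).
    features.append(val if val.dim() > 1 else val.view(-1, 1))
features = torch.hstack(features)

print(features.shape)  # torch.Size([4, 5])
```

The number of columns in the stacked matrix has to match the `in_dim` passed to the model, since the first convolution layer expects exactly that many input features per node.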

Below we instantiate the GNN and wrap it in the corresponding task module.

In [None]:
# Configuration
REGRESS_FORCES = False

In [None]:
gnn = GraphConvModel(num_layers=3, in_dim=9, hidden_dim=128)
# create the S2EF task; this tutorial does not define a custom optimizer,
# so `lr` and `gamma` are handed to the task module as given
model = S2EFLitModule(gnn, lr=1e-3, gamma=0.1, regress_forces=REGRESS_FORCES)

# Data Module

Each data module, including `S2EFDGLDataModule`, provides a `from_devset` method that loads the smaller development set. It accepts the usual data loading arguments, such as `batch_size` and `num_workers`.

In [None]:
# Configuration
BATCH_SIZE = 16
NUM_WORKERS = 0

In [None]:
# grab the devset; the `DataModule` provides the splits and data loaders
# that the trainer consumes below
data_module = S2EFDGLDataModule.from_devset(
    batch_size=BATCH_SIZE, num_workers=NUM_WORKERS
)

# Training

In [None]:
# Configuration
MAX_EPOCHS = 100

In [None]:
trainer = pl.Trainer(max_epochs=MAX_EPOCHS)

In [None]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    trainer.fit(model, datamodule=data_module)

# Results

In [None]:
%load_ext tensorboard
%tensorboard --logdir $trainer.logger.log_dir