# SchNet S2EF training example

The purpose of this notebook is to demonstrate some of the basics of the Open Catalyst Project's (OCP) codebase and data. In this example, we will train a SchNet model to predict the energy and forces of a given structure (the S2EF task). First, ensure you have installed the OCP `ocp` repo and all of its dependencies according to the [README](https://github.com/Open-Catalyst-Project/ocp/blob/master/README.md).

Disclaimer: This notebook is for tutorial purposes only; training baseline models on our larger datasets in this format is unlikely to be practical. As a next step, we recommend trying the command line examples.

## Imports

In [2]:
import torch
import os
from ocpmodels.trainers import ForcesTrainer, EnergyTrainer
from ocpmodels import models
from ocpmodels.common import logger
from ocpmodels.common.utils import setup_logging
setup_logging()

In [3]:
# a simple sanity check that a GPU is available
print(torch.cuda.is_available())

True


## The essential steps for training an OCP model

1) Download data

2) Preprocess data (if necessary)

3) Define or load a configuration (config), which includes the following
   
   - task
   - model
   - optimizer
   - dataset
   - trainer

4) Train

5) Depending on the model/task, there might be an intermediate relaxation step

6) Predict
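As a rough illustration of step 3, the sketch below checks that a config dictionary carries every section listed above before a trainer is built. The helper name `missing_sections` and the flat dictionary layout are assumptions made for this example; they are not part of the OCP codebase.

```python
# Sections named in step 3 of the list above.
REQUIRED_SECTIONS = ("task", "model", "optimizer", "dataset", "trainer")


def missing_sections(config):
    """Return the required sections absent from `config` (hypothetical helper)."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Placeholder values; the real contents are defined later in this notebook.
example_config = {
    "task": {"dataset": "trajectory_lmdb"},
    "model": {"name": "schnet"},
    "optimizer": {"batch_size": 16},
    "dataset": [{"src": "s2ef"}],
}

# Lists whichever sections still need to be filled in.
print(missing_sections(example_config))
```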

## Dataset

This example uses the LMDB generated in the following [tutorial](http://laikapack.cheme.cmu.edu/notebook/open-catalyst-project/mshuaibi/notebooks/projects/ocp/docs/source/tutorials/lmdb_dataset_creation.ipynb). Please run that notebook before moving on. Alternatively, if you have other LMDBs available, you may specify those instead.

In [13]:
# set the path to your local lmdb directory
train_src = "s2ef"
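Before building a trainer, it can help to confirm that the directory actually contains LMDB files. The sketch below is a minimal check using only the standard library. The helper name `find_lmdbs` is hypothetical, and it assumes the database files carry the `.lmdb` extension.

```python
from pathlib import Path


def find_lmdbs(src):
    """Return the sorted `.lmdb` files under `src` (hypothetical helper).

    `src` may be a directory of LMDB files or a path to a single file.
    """
    path = Path(src)
    if path.is_file():
        return [path] if path.suffix == ".lmdb" else []
    if not path.is_dir():
        return []
    return sorted(path.glob("*.lmdb"))


lmdb_files = find_lmdbs("s2ef")  # same value as train_src above
if not lmdb_files:
    print("No LMDB files found; run the dataset creation tutorial first.")
```

Running a check like this up front gives a clearer message than discovering the problem when the trainer is constructed.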

## Define config

For this example, we will explicitly define the config; however, a set of default config files exists in the config folder of this repository. Default YAML config files can easily be loaded with the `build_config` util (found in `ocp/ocpmodels/common/utils.py`). Loading a YAML config is preferable when launching jobs from the command line. We have included our best models' config files [here](https://github.com/Open-Catalyst-Project/ocp/tree/master/configs/s2ef).

**Task** 

In [14]:
task = {
    'dataset': 'trajectory_lmdb', # dataset used for the S2EF task
    'description': 'Regressing to energies and forces for DFT trajectories from OCP',
    'type': 'regression',
    'metric': 'mae',
    'labels': ['potential energy'],
    'grad_input': 'atomic forces',
    'train_on_free_atoms': True,
    'eval_on_free_atoms': True
}
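To make the effect of `train_on_free_atoms` and `eval_on_free_atoms` concrete, the sketch below computes a force MAE over free atoms only. It is a plain-Python illustration, not the OCP implementation. The `fixed` flag list (1 for a constrained atom, 0 for a free one) and the function name are assumptions for this example.

```python
def free_atom_force_mae(pred_forces, true_forces, fixed):
    """Mean absolute error over the force components of free atoms only.

    pred_forces, true_forces: one [fx, fy, fz] triple per atom.
    fixed: one flag per atom, 1 if constrained, 0 if free (assumed layout).
    """
    errors = [
        abs(p - t)
        for pred, true, is_fixed in zip(pred_forces, true_forces, fixed)
        if not is_fixed
        for p, t in zip(pred, true)
    ]
    if not errors:
        raise ValueError("no free atoms to evaluate")
    return sum(errors) / len(errors)


pred = [[0.0, 0.0, 0.0], [0.1, 0.0, -0.2], [0.5, 0.5, 0.5]]
true = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
fixed = [1, 0, 0]  # the first atom is constrained and therefore ignored
print(free_atom_force_mae(pred, true, fixed))
```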

**Model** - SchNet for this example

In [15]:
model = {
    'name': 'schnet',
    'hidden_channels': 1024, # if training is too slow for example purposes reduce the number of hidden channels
    'num_filters': 256,
    'num_interactions': 3,
    'num_gaussians': 200,
    'cutoff': 6.0
}
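The `cutoff` and `num_gaussians` settings control how interatomic distances are featurized. SchNet-style models typically expand each distance on a grid of Gaussians spanning 0 to `cutoff`; the sketch below shows that expansion in plain Python, with a smaller grid for readability. The function name and the width convention (Gaussian width equal to the grid spacing) are assumptions for illustration and may differ from the model code used here.

```python
import math


def gaussian_expansion(distance, cutoff=6.0, num_gaussians=200):
    """Expand a scalar distance on evenly spaced Gaussian basis functions."""
    spacing = cutoff / (num_gaussians - 1)
    centers = [i * spacing for i in range(num_gaussians)]
    coeff = -0.5 / spacing**2  # width tied to the grid spacing (assumption)
    return [math.exp(coeff * (distance - mu) ** 2) for mu in centers]


features = gaussian_expansion(1.5, cutoff=6.0, num_gaussians=5)
# Centers sit at 0.0, 1.5, 3.0, 4.5 and 6.0, so the second feature peaks.
print([round(f, 3) for f in features])
```

A larger `num_gaussians` gives a finer distance resolution at the cost of a wider input to each interaction block.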

**Optimizer**

In [16]:
optimizer = {
    'batch_size': 16, # if hitting GPU memory issues, lower this
    'eval_batch_size': 8,
    'num_workers': 8,
    'lr_initial': 0.0001,
    'scheduler': "ReduceLROnPlateau",
    'mode': "min",
    'factor': 0.8,
    'patience': 3,
    'max_epochs': 1, # 1 epoch for demonstration purposes (the original setting was 80)
    'force_coefficient': 100,
}
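The `scheduler`, `mode`, `factor`, and `patience` entries describe a reduce-on-plateau policy: the learning rate is multiplied by `factor` once the monitored metric has failed to improve for more than `patience` consecutive evaluations. The sketch below is a simplified pure-Python model of that rule for intuition only. It ignores the improvement threshold, cooldown, and minimum learning rate that the PyTorch scheduler also supports, and the function name and metric values are made up.

```python
def plateau_schedule(metrics, lr=1e-4, factor=0.8, patience=3):
    """Return the learning rate in effect after each metric value is seen."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for value in metrics:
        if value < best:  # mode="min": lower is better
            best = value
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:  # patience exhausted: shrink the rate
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Illustrative validation MAE values (not real results).
val_mae = [1.0, 0.9, 0.95, 0.93, 0.92, 0.91, 0.8]
print(plateau_schedule(val_mae))
```

With `patience=3`, the rate drops only after the fourth consecutive evaluation without improvement.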

**Dataset**

For simplicity, `train_src` is used for the train, val, and test sets alike. Feel free to substitute the actual S2EF val and test sets, but note that this requires additional downloads and preprocessing. To normalize your targets, set `normalize_labels` to `True` and specify the corresponding means and standard deviations (for example `target_mean` and `target_std`). These values have been precomputed for you and can be found in any of the [`base.yml`](https://github.com/Open-Catalyst-Project/ocp/blob/master/configs/s2ef/20M/base.yml#L5-L9) config files.

In [17]:
dataset = [
{'src': train_src, 'normalize_labels': False}, # train set 
{'src': train_src}, # val set (optional)
{'src': train_src} # test set (optional - writes predictions to disk)
]
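Label normalization amounts to standardizing each target with a fixed mean and standard deviation, and undoing that transform on the model's output. The sketch below shows the arithmetic in plain Python. The function names are hypothetical, and the statistics are borrowed from the IS2RE example later in this notebook purely for illustration; they are not the S2EF values.

```python
def normalize(value, mean, std):
    """Map a raw target to zero-mean, unit-variance scale."""
    return (value - mean) / std


def denormalize(value, mean, std):
    """Map a normalized prediction back to the original units."""
    return value * std + mean


# Illustrative statistics taken from the IS2RE dataset entry further below.
target_mean = -1.525913953781128
target_std = 2.279365062713623

raw_energy = -0.75
scaled = normalize(raw_energy, target_mean, target_std)
restored = denormalize(scaled, target_mean, target_std)
print(scaled, restored)
```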

**Trainer**

Use the `ForcesTrainer` for the S2EF and IS2RS tasks, and the `EnergyTrainer` for the IS2RE task.

In [18]:
trainer = ForcesTrainer(
    task=task,
    model=model,
    dataset=dataset,
    optimizer=optimizer,
    identifier="SchNet-example",
    run_dir="./", # directory to save results if is_debug=False. Prediction files are saved here, so be careful not to overwrite them!
    is_debug=False, # if True, do not save checkpoint, logs, or results
    #is_vis=False,
    print_every=5,
    seed=0, # random seed to use
    logger="tensorboard", # logger of choice (tensorboard and wandb supported)
    local_rank=0,
    amp=False, # use PyTorch Automatic Mixed Precision (faster training and less memory usage)
)

amp: false
cmd:
  checkpoint_dir: ./checkpoints/2022-04-21-14-18-40-SchNet-example
  commit: c933847
  identifier: SchNet-example
  logs_dir: ./logs/tensorboard/2022-04-21-14-18-40-SchNet-example
  print_every: 5
  results_dir: ./results/2022-04-21-14-18-40-SchNet-example
  seed: 0
  timestamp_id: 2022-04-21-14-18-40-SchNet-example
dataset:
  normalize_labels: false
  src: s2ef
gpus: 1
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
optim:
  batch_size: 16
  eval_batch_size: 8
  factor: 0.8
  force_coefficient: 100
  lr_initial: 0.0001
  max_epochs: 1
  mode: min
  num_workers: 8
  patience: 3
  scheduler: ReduceLROnPlateau
slurm: {}
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regres

AssertionError: No LMDBs found in 's2ef'

## Check the model

In [21]:
print(trainer.model)

OCPDataParallel(
  (module): SchNetWrap(hidden_channels=1024, num_filters=256, num_interactions=3, num_gaussians=200, cutoff=6.0)
)


## Train

In [22]:
trainer.train()

AttributeError: 'tuple' object has no attribute 'shape'

### Load Checkpoint
Once training has completed, the `Trainer` instance is, by default, loaded with the best checkpoint as determined by training or validation (if available) metrics. To load a `Trainer` directly with a pretrained model, specify the `checkpoint_path` of your previously trained model (located in its `checkpoint_dir`):

In [37]:
checkpoint_path = os.path.join(trainer.config["cmd"]["checkpoint_dir"], "checkpoint.pt")
checkpoint_path

'./checkpoints/2021-09-04-08-51-28-SchNet-example/checkpoint.pt'
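If the original trainer object is no longer in memory, the checkpoint can also be located from disk. Run directories are named with a timestamp followed by the identifier, as the paths above show, so sorting the names alphabetically also sorts them chronologically. The helper below is a hypothetical convenience built on that naming pattern, not an OCP function.

```python
import os


def latest_checkpoint(checkpoints_root, identifier, filename="checkpoint.pt"):
    """Return the checkpoint path of the newest run matching `identifier`, or None."""
    if not os.path.isdir(checkpoints_root):
        return None
    runs = sorted(
        name
        for name in os.listdir(checkpoints_root)
        if name.endswith("-" + identifier)
        and os.path.isfile(os.path.join(checkpoints_root, name, filename))
    )
    # Timestamp prefixes (YYYY-MM-DD-HH-MM-SS) sort chronologically as strings.
    return os.path.join(checkpoints_root, runs[-1], filename) if runs else None


print(latest_checkpoint("./checkpoints", "SchNet-example"))
```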

In [38]:
model = {
    'name': 'schnet',
    'hidden_channels': 1024, # if training is too slow for example purposes reduce the number of hidden channels
    'num_filters': 256,
    'num_interactions': 3,
    'num_gaussians': 200,
    'cutoff': 6.0
}

pretrained_trainer = ForcesTrainer(
    task=task,
    model=model,
    dataset=dataset,
    optimizer=optimizer,
    identifier="SchNet-example",
    run_dir="./", # directory to save results if is_debug=False. Prediction files are saved here, so be careful not to overwrite them!
    is_debug=False, # if True, do not save checkpoint, logs, or results
    is_vis=False,
    print_every=10,
    seed=0, # random seed to use
    logger="tensorboard", # logger of choice (tensorboard and wandb supported)
    local_rank=0,
    amp=False, # use PyTorch Automatic Mixed Precision (faster training and less memory usage)
)

pretrained_trainer.load_checkpoint(checkpoint_path=checkpoint_path)

amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-09-04-08-51-28-SchNet-example
  commit: 98a06d8
  identifier: SchNet-example
  logs_dir: ./logs/tensorboard/2021-09-04-08-51-28-SchNet-example
  print_every: 10
  results_dir: ./results/2021-09-04-08-51-28-SchNet-example
  seed: 0
  timestamp_id: 2021-09-04-08-51-28-SchNet-example
dataset:
  normalize_labels: false
  src: s2ef
gpus: 1
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
optim:
  batch_size: 16
  eval_batch_size: 8
  factor: 0.8
  force_coefficient: 100
  lr_initial: 0.0001
  max_epochs: 1
  mode: min
  num_workers: 8
  patience: 3
  scheduler: ReduceLROnPlateau
slurm: {}
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regre



2021-09-04 08:51:51 (INFO): Loading checkpoint from: ./checkpoints/2021-09-04-08-51-28-SchNet-example/checkpoint.pt


## Predict

If a test set has been provided in your config, predictions are generated and written to disk automatically once training completes. Otherwise, to make predictions on unseen data, a `torch.utils.data.DataLoader` object must be constructed. Here we reuse the loader of our test set to make predictions. Predictions are saved to an `.npz` file in your `results_dir`, named after `results_file` (here `s2ef_s2ef_results.npz`, as the log below shows).

In [39]:
# make predictions on the existing test_loader
predictions = pretrained_trainer.predict(pretrained_trainer.test_loader, results_file="s2ef_results", disable_tqdm=False)

2021-09-04 08:51:51 (INFO): Predicting on test.


device 0: 100%|██████████| 79/79 [00:01<00:00, 44.68it/s]

2021-09-04 08:51:53 (INFO): Writing results to ./results/2021-09-04-08-51-28-SchNet-example/s2ef_s2ef_results.npz





In [40]:
energies = predictions["energy"]
forces = predictions["forces"]

## Energy IS2RE

In [4]:
# Config files found in configs/is2re/

dataset = [{'src': '/network/projects/_groups/ocp/oc20/is2re/10k/train/data.lmdb',
  'normalize_labels': True,
  'target_mean': -1.525913953781128,
  'target_std': 2.279365062713623},
 {'src': '/network/projects/_groups/ocp/oc20/is2re/all/val_id/data.lmdb'}]

model = {
    'name': 'schnet',
    'hidden_channels': 384, # if training is too slow for example purposes reduce the number of hidden channels
    'num_filters': 128,
    'num_interactions': 4,
    'num_gaussians': 200,
    'cutoff': 6.0,
    'use_pbc': True,
    'regress_forces': False
}

optimizer = {
    'batch_size': 32, # if hitting GPU memory issues, lower this
    'eval_batch_size': 32,
    'num_workers': 16,
    'lr_initial': 0.0005,
    'lr_gamma': 0.1,
    'lr_milestones': [15625, 31250, 46875], # steps at which the learning rate is multiplied by lr_gamma
    'warmup_steps': 9375,
    'warmup_factor': 0.2,
    'max_epochs': 30
}

task = {
    'dataset': 'single_point_lmdb', # dataset used for the IS2RE task
    'description': "Relaxed state energy prediction from initial structure.",
    'type': 'regression',
    'metric': 'mae',
    'labels': ['relaxed energy']
}

trainer = EnergyTrainer(
    task=task,
    model=model,
    dataset=dataset,
    optimizer=optimizer,
    identifier="SchNet-example-energy",
    run_dir="./", # directory to save results if is_debug=False. Prediction files are saved here, so be careful not to overwrite them!
    is_debug=False, # if True, do not save checkpoint, logs, or results
    #is_vis=False,
    print_every=5,
    seed=0, # random seed to use
    logger="tensorboard", # logger of choice (tensorboard and wandb supported)
    local_rank=0,
    amp=False, # use PyTorch Automatic Mixed Precision (faster training and less memory usage)
)

amp: false
cmd:
  checkpoint_dir: ./checkpoints/2022-04-21-14-01-36-SchNet-example-energy
  commit: c933847
  identifier: SchNet-example-energy
  logs_dir: ./logs/tensorboard/2022-04-21-14-01-36-SchNet-example-energy
  print_every: 5
  results_dir: ./results/2022-04-21-14-01-36-SchNet-example-energy
  seed: 0
  timestamp_id: 2022-04-21-14-01-36-SchNet-example-energy
dataset:
  normalize_labels: true
  src: /network/projects/_groups/ocp/oc20/is2re/10k/train/data.lmdb
  target_mean: -1.525913953781128
  target_std: 2.279365062713623
gpus: 1
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 384
  num_filters: 128
  num_gaussians: 200
  num_interactions: 4
  regress_forces: false
  use_pbc: true
optim:
  batch_size: 32
  eval_batch_size: 32
  lr_gamma: 0.1
  lr_initial: 0.0005
  lr_milestones:
  - 15625
  - 31250
  - 46875
  max_epochs: 30
  num_workers: 16
  warmup_factor: 0.2
  warmup_steps: 9375
slurm: {}
task:
  dataset: single_point_lmdb
  descriptio
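The `optim` block above can be read as a linear warmup followed by step decay: the rate starts at `lr_initial * warmup_factor`, ramps linearly to `lr_initial` over `warmup_steps`, and is then multiplied by `lr_gamma` at each milestone. The sketch below encodes that reading in plain Python. The function name is hypothetical, and the exact formula is an assumption inferred from the parameter names. It does give 1.00e-04 at step 0, in line with the `lr` reported at the start of the training log further down.

```python
from bisect import bisect


def lr_at_step(
    step,
    lr_initial=0.0005,
    warmup_steps=9375,
    warmup_factor=0.2,
    lr_milestones=(15625, 31250, 46875),
    lr_gamma=0.1,
):
    """Learning rate at a given optimizer step (assumed schedule)."""
    if step <= warmup_steps:
        alpha = step / warmup_steps
        scale = warmup_factor * (1.0 - alpha) + alpha  # linear ramp to 1.0
    else:
        # One factor of lr_gamma for every milestone already reached.
        scale = lr_gamma ** bisect(lr_milestones, step)
    return lr_initial * scale


for step in (0, 9375, 15625, 31250, 46875):
    print(step, f"{lr_at_step(step):.2e}")
```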



In [5]:
print(trainer.model)

OCPDataParallel(
  (module): SchNetWrap(hidden_channels=384, num_filters=128, num_interactions=4, num_gaussians=200, cutoff=6.0)
)


In [6]:
trainer.train()

energy_mae: 5.45e+01, energy_mse: 4.72e+03, energy_within_threshold: 0.00e+00, loss: 2.39e+01, lr: 1.00e-04, epoch: 1.60e-02, step: 5.00e+00
energy_mae: 3.81e+01, energy_mse: 2.66e+03, energy_within_threshold: 0.00e+00, loss: 1.67e+01, lr: 1.00e-04, epoch: 3.19e-02, step: 1.00e+01
energy_mae: 3.15e+01, energy_mse: 1.82e+03, energy_within_threshold: 0.00e+00, loss: 1.38e+01, lr: 1.01e-04, epoch: 4.79e-02, step: 1.50e+01
energy_mae: 2.74e+01, energy_mse: 1.75e+03, energy_within_threshold: 0.00e+00, loss: 1.20e+01, lr: 1.01e-04, epoch: 6.39e-02, step: 2.00e+01
energy_mae: 1.88e+01, energy_mse: 6.48e+02, energy_within_threshold: 0.00e+00, loss: 8.24e+00, lr: 1.01e-04, epoch: 7.99e-02, step: 2.50e+01
energy_mae: 2.53e+01, energy_mse: 1.17e+03, energy_within_threshold: 0.00e+00, loss: 1.11e+01, lr: 1.01e-04, epoch: 9.58e-02, step: 3.00e+01
energy_mae: 2.87e+01, energy_mse: 1.68e+03, energy_within_threshold: 0.00e+00, loss: 1.26e+01, lr: 1.01e-04, epoch: 1.12e-01, step: 3.50e+01
energy_mae: 1

device 0: 100%|██████████| 780/780 [02:20<00:00,  5.56it/s]

2022-04-21 14:06:35 (INFO): energy_mae: 9.1358, energy_mse: 175.6113, energy_within_threshold: 0.0023, loss: 4.0093, epoch: 1.0000





energy_mae: 8.94e+00, energy_mse: 1.54e+02, energy_within_threshold: 0.00e+00, loss: 3.92e+00, lr: 1.13e-04, epoch: 1.01e+00, step: 3.15e+02
energy_mae: 6.94e+00, energy_mse: 8.10e+01, energy_within_threshold: 0.00e+00, loss: 3.04e+00, lr: 1.14e-04, epoch: 1.02e+00, step: 3.20e+02
energy_mae: 5.70e+00, energy_mse: 5.19e+01, energy_within_threshold: 0.00e+00, loss: 2.50e+00, lr: 1.14e-04, epoch: 1.04e+00, step: 3.25e+02
energy_mae: 6.15e+00, energy_mse: 5.33e+01, energy_within_threshold: 0.00e+00, loss: 2.70e+00, lr: 1.14e-04, epoch: 1.05e+00, step: 3.30e+02
energy_mae: 8.04e+00, energy_mse: 1.24e+02, energy_within_threshold: 0.00e+00, loss: 3.53e+00, lr: 1.14e-04, epoch: 1.07e+00, step: 3.35e+02
energy_mae: 1.49e+01, energy_mse: 5.91e+02, energy_within_threshold: 0.00e+00, loss: 6.54e+00, lr: 1.14e-04, epoch: 1.09e+00, step: 3.40e+02
energy_mae: 7.02e+00, energy_mse: 7.63e+01, energy_within_threshold: 0.00e+00, loss: 3.08e+00, lr: 1.15e-04, epoch: 1.10e+00, step: 3.45e+02
energy_mae: 1

KeyboardInterrupt: 
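The log lines above report three energy metrics. As a rough guide to reading them, the sketch below computes the mean absolute error, the mean squared error, and the fraction of predictions whose absolute error falls under a threshold. The function name is hypothetical, and the default threshold of 0.02 (in the same units as the energies) is an assumption rather than a value taken from this notebook.

```python
def energy_metrics(pred, target, threshold=0.02):
    """Compute MAE, MSE, and the within-threshold fraction for energy predictions."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    n = len(errors)
    return {
        "energy_mae": sum(errors) / n,
        "energy_mse": sum(e * e for e in errors) / n,
        # Share of samples whose absolute error is below the threshold.
        "energy_within_threshold": sum(e < threshold for e in errors) / n,
    }


# Made-up numbers for illustration only.
print(energy_metrics([0.0, 1.0, 2.0], [0.01, 1.5, 2.0]))
```

An `energy_within_threshold` of 0.00e+00 early in training therefore means that no prediction in the batch was yet within the threshold of its target.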

In [12]:
# make predictions on the existing test_loader
predictions = trainer.predict(trainer.test_loader, results_file="is2re_results", disable_tqdm=False)

2022-04-21 14:10:29 (INFO): Predicting on test.


AssertionError: 

In [10]:
# Load best checkpoint

checkpoint_path = os.path.join(trainer.config["cmd"]["checkpoint_dir"], "checkpoint.pt")
checkpoint_path

pretrained_trainer = EnergyTrainer(
    task=task,
    model=model,
    dataset=dataset,
    optimizer=optimizer,
    identifier="SchNet-example-energy",
    run_dir="./", # directory to save results if is_debug=False. Prediction files are saved here, so be careful not to overwrite them!
    is_debug=False, # if True, do not save checkpoint, logs, or results
    #is_vis=False,
    print_every=10,
    seed=0, # random seed to use
    logger="tensorboard", # logger of choice (tensorboard and wandb supported)
    local_rank=0,
    amp=False, # use PyTorch Automatic Mixed Precision (faster training and less memory usage)
)

pretrained_trainer.load_checkpoint(checkpoint_path=checkpoint_path)

KeyError: 'name'
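A `KeyError: 'name'` at this point is consistent with the `model` dictionary having been modified when the first trainer was built: the printed config stores the name under `model` and the remaining entries under `model_attributes`. Whether the trainer really removes the key in place is an assumption here. The sketch below uses a hypothetical `split_model_config` function to reproduce that behaviour and shows that passing a copy keeps the original dictionary reusable.

```python
import copy


def split_model_config(model):
    """Hypothetical stand-in for a trainer that pops `name` from the dict it is given."""
    return {"model": model.pop("name"), "model_attributes": model}


model = {"name": "schnet", "hidden_channels": 384, "cutoff": 6.0}

# Passing a deep copy leaves the caller's dictionary untouched.
first = split_model_config(copy.deepcopy(model))
second = split_model_config(copy.deepcopy(model))
print(first["model"], "name" in model)

# Passing the dictionary itself consumes the key, so a second call fails.
split_model_config(model)
try:
    split_model_config(model)
except KeyError as err:
    print("KeyError:", err)
```

Redefining `model` before constructing the second trainer, as the S2EF section of this notebook does, has the same effect as passing a copy.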