
Error Loading pre-trained checkpoints #174

Closed
chaitjo opened this issue Jan 8, 2021 · 5 comments


chaitjo commented Jan 8, 2021

Hi! It seems the recent updates to the codebase have made the pre-trained checkpoints released earlier incompatible with the current models. Here, I tried loading a pre-trained SchNet checkpoint downloaded for S2EF from MODELS.md, following the instructions in TRAIN.md:

$ python main.py --mode predict --config-yml configs/s2ef/200k/schnet/schnet.yml --checkpoint pretrained/schnet_200k.pt 
amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-01-08-12-16-00
  identifier: ''
  logs_dir: ./logs/tensorboard/2021-01-08-12-16-00
  print_every: 10
  results_dir: ./results/2021-01-08-12-16-00
  seed: 0
  timestamp: 2021-01-08-12-16-00
dataset:
  grad_target_mean: 0.0
  grad_target_std: 2.887317180633545
  normalize_labels: true
  src: data/s2ef/200k/train/
  target_mean: -0.7554450631141663
  target_std: 2.887317180633545
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
  use_pbc: true
optim:
  batch_size: 32
  eval_batch_size: 16
  force_coefficient: 100
  lr_gamma: 0.1
  lr_initial: 0.0005
  lr_milestones:
  - 5
  - 8
  - 10
  max_epochs: 30
  num_workers: 64
  warmup_epochs: 3
  warmup_factor: 0.2
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regression
test_dataset:
  src: data/s2ef/all/val_id/
val_dataset:
  src: data/s2ef/all/val_id/

### Loading dataset: trajectory_lmdb
### Loading model: schnet
### Loaded SchNet with 5704193 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
### Loading checkpoint from: pretrained/schnet_200k.pt
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    Runner()(config)
  File "main.py", line 59, in __call__
    trainer.load_pretrained(config["checkpoint"])
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/base_trainer.py", line 352, in load_pretrained
    self.model.load_state_dict(checkpoint["state_dict"])
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for OCPDataParallel:
        Missing key(s) in state_dict: "module.atomic_mass", "module.embedding.weight", "module.distance_expansion.offset", "module.interactions.0.mlp.0.weight", "module.interactions.0.mlp.0.bias", "module.interactions.0.mlp.2.weight", "module.interactions.0.mlp.2.bias", "module.interactions.0.conv.lin1.weight", "module.interactions.0.conv.lin2.weight", "module.interactions.0.conv.lin2.bias", "module.interactions.0.conv.nn.0.weight", "module.interactions.0.conv.nn.0.bias", "module.interactions.0.conv.nn.2.weight", "module.interactions.0.conv.nn.2.bias", "module.interactions.0.lin.weight", "module.interactions.0.lin.bias", "module.interactions.1.mlp.0.weight", "module.interactions.1.mlp.0.bias", "module.interactions.1.mlp.2.weight", "module.interactions.1.mlp.2.bias", "module.interactions.1.conv.lin1.weight", "module.interactions.1.conv.lin2.weight", "module.interactions.1.conv.lin2.bias", "module.interactions.1.conv.nn.0.weight", "module.interactions.1.conv.nn.0.bias", "module.interactions.1.conv.nn.2.weight", "module.interactions.1.conv.nn.2.bias", "module.interactions.1.lin.weight", "module.interactions.1.lin.bias", "module.interactions.2.mlp.0.weight", "module.interactions.2.mlp.0.bias", "module.interactions.2.mlp.2.weight", "module.interactions.2.mlp.2.bias", "module.interactions.2.conv.lin1.weight", "module.interactions.2.conv.lin2.weight", "module.interactions.2.conv.lin2.bias", "module.interactions.2.conv.nn.0.weight", "module.interactions.2.conv.nn.0.bias", "module.interactions.2.conv.nn.2.weight", "module.interactions.2.conv.nn.2.bias", "module.interactions.2.lin.weight", "module.interactions.2.lin.bias", "module.lin1.weight", "module.lin1.bias", "module.lin2.weight", "module.lin2.bias". 
        Unexpected key(s) in state_dict: "module.module.atomic_mass", "module.module.embedding.weight", "module.module.distance_expansion.offset", "module.module.interactions.0.mlp.0.weight", "module.module.interactions.0.mlp.0.bias", "module.module.interactions.0.mlp.2.weight", "module.module.interactions.0.mlp.2.bias", "module.module.interactions.0.conv.lin1.weight", "module.module.interactions.0.conv.lin2.weight", "module.module.interactions.0.conv.lin2.bias", "module.module.interactions.0.conv.nn.0.weight", "module.module.interactions.0.conv.nn.0.bias", "module.module.interactions.0.conv.nn.2.weight", "module.module.interactions.0.conv.nn.2.bias", "module.module.interactions.0.lin.weight", "module.module.interactions.0.lin.bias", "module.module.interactions.1.mlp.0.weight", "module.module.interactions.1.mlp.0.bias", "module.module.interactions.1.mlp.2.weight", "module.module.interactions.1.mlp.2.bias", "module.module.interactions.1.conv.lin1.weight", "module.module.interactions.1.conv.lin2.weight", "module.module.interactions.1.conv.lin2.bias", "module.module.interactions.1.conv.nn.0.weight", "module.module.interactions.1.conv.nn.0.bias", "module.module.interactions.1.conv.nn.2.weight", "module.module.interactions.1.conv.nn.2.bias", "module.module.interactions.1.lin.weight", "module.module.interactions.1.lin.bias", "module.module.interactions.2.mlp.0.weight", "module.module.interactions.2.mlp.0.bias", "module.module.interactions.2.mlp.2.weight", "module.module.interactions.2.mlp.2.bias", "module.module.interactions.2.conv.lin1.weight", "module.module.interactions.2.conv.lin2.weight", "module.module.interactions.2.conv.lin2.bias", "module.module.interactions.2.conv.nn.0.weight", "module.module.interactions.2.conv.nn.0.bias", "module.module.interactions.2.conv.nn.2.weight", "module.module.interactions.2.conv.nn.2.bias", "module.module.interactions.2.lin.weight", "module.module.interactions.2.lin.bias", "module.module.lin1.weight", "module.module.lin1.bias", 
"module.module.lin2.weight", "module.module.lin2.bias".

chaitjo commented Jan 8, 2021

Note that I had updated the dataset config section to test the pre-trained model on the val set:

dataset:
  - src: data/s2ef/200k/train/
    normalize_labels: True
    target_mean: -0.7554450631141663
    target_std: 2.887317180633545
    grad_target_mean: 0.0
    grad_target_std: 2.887317180633545
  - src: data/s2ef/all/val_id/
  - src: data/s2ef/all/val_id/


mshuaibii commented Jan 8, 2021

Hi @chaitjo -

This seems independent of the recent changes. The pretrained models were trained with DDP, and you are attempting to load them without DDP (we can update the docs to clarify this); hence the `module.module` vs. `module` key prefixes. Can you try modifying this line: https://github.com/Open-Catalyst-Project/ocp/blob/c62eb4a8e97ffeb15f857383f160b0f4711c55f0/main.py#L59 to:

trainer.load_pretrained(config["checkpoint"], ddp_to_dp=True)

Otherwise, you could run this with DDP and specify only 1 GPU, to avoid modifying main.py.
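A sketch of what that single-GPU DDP alternative could look like on one machine (the `--distributed` and launcher flags are assumed from TRAIN.md; paths match the original predict invocation, adjust for your setup):

```shell
python -u -m torch.distributed.launch --nproc_per_node=1 main.py \
    --distributed --mode predict \
    --config-yml configs/s2ef/200k/schnet/schnet.yml \
    --checkpoint pretrained/schnet_200k.pt
```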

@chaitjo chaitjo closed this as completed Jan 8, 2021

chaitjo commented Jan 8, 2021

Thanks for the prompt response @mshuaibii. I see, it works now!

Just anecdotally, do you think a single server with 2-4 GPU cards is sufficient for someone to play around with new ideas on OCP with the 200K data split? (Obviously, to do something more serious, I understand one would need larger compute!)


chaitjo commented Jan 8, 2021

Additionally, is there a way to evaluate/test pre-trained models with multi-GPU on a single server, without using distributed data parallel?

mshuaibii (Collaborator) commented:

> Thanks for the prompt response @mshuaibii. I see, it works now!
>
> Just anecdotally, do you think a single server with 2-4 GPU cards is sufficient for someone to play around with new ideas on OCP with the 200K data split? (Obviously, to do something more serious, I understand one would need larger compute!)

Yup - a few of us at CMU have been exploring ideas with similar resources, although it's a little harder to iterate on several new ideas, since training on the 200k split with 2-4 GPUs can take a few days. As for the IS2RE splits, 2-4 GPUs can be used comfortably for all data splits.

> Additionally, is there a way to evaluate/test pre-trained models with multi-GPU on a single server, without using distributed data parallel?

Multi-GPU is enabled in the repo through distributed data parallel, so no. DDP doesn't require multiple nodes to function - it can be run on a single machine just fine with the following command:

python -u -m torch.distributed.launch --nproc_per_node=8 main.py --distributed ... ...

Was there a specific issue you were encountering that prevented you from using DDP on your machine?
