
Error Loading pre-trained checkpoints #174

Closed
chaitjo opened this issue Jan 8, 2021 · 5 comments


chaitjo commented Jan 8, 2021

Hi! It seems the recent updates to the codebase have made the pre-trained checkpoints released earlier incompatible with the current models. Here, I tried loading a pre-trained SchNet checkpoint downloaded for S2EF from MODELS.md, following the instructions in TRAIN.md:

$ python main.py --mode predict --config-yml configs/s2ef/200k/schnet/schnet.yml --checkpoint pretrained/schnet_200k.pt 
amp: false
cmd:
  checkpoint_dir: ./checkpoints/2021-01-08-12-16-00
  identifier: ''
  logs_dir: ./logs/tensorboard/2021-01-08-12-16-00
  print_every: 10
  results_dir: ./results/2021-01-08-12-16-00
  seed: 0
  timestamp: 2021-01-08-12-16-00
dataset:
  grad_target_mean: 0.0
  grad_target_std: 2.887317180633545
  normalize_labels: true
  src: data/s2ef/200k/train/
  target_mean: -0.7554450631141663
  target_std: 2.887317180633545
logger: tensorboard
model: schnet
model_attributes:
  cutoff: 6.0
  hidden_channels: 1024
  num_filters: 256
  num_gaussians: 200
  num_interactions: 3
  use_pbc: true
optim:
  batch_size: 32
  eval_batch_size: 16
  force_coefficient: 100
  lr_gamma: 0.1
  lr_initial: 0.0005
  lr_milestones:
  - 5
  - 8
  - 10
  max_epochs: 30
  num_workers: 64
  warmup_epochs: 3
  warmup_factor: 0.2
task:
  dataset: trajectory_lmdb
  description: Regressing to energies and forces for DFT trajectories from OCP
  eval_on_free_atoms: true
  grad_input: atomic forces
  labels:
  - potential energy
  metric: mae
  train_on_free_atoms: true
  type: regression
test_dataset:
  src: data/s2ef/all/val_id/
val_dataset:
  src: data/s2ef/all/val_id/

### Loading dataset: trajectory_lmdb
### Loading model: schnet
### Loaded SchNet with 5704193 parameters.
NOTE: model gradient logging to tensorboard not yet supported.
### Loading checkpoint from: pretrained/schnet_200k.pt
Traceback (most recent call last):
  File "main.py", line 140, in <module>
    Runner()(config)
  File "main.py", line 59, in __call__
    trainer.load_pretrained(config["checkpoint"])
  File "/mnt/work/chaitanya/ocp-models/ocpmodels/trainers/base_trainer.py", line 352, in load_pretrained
    self.model.load_state_dict(checkpoint["state_dict"])
  File "/home/chaitanya/miniconda3/envs/ocp-models/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for OCPDataParallel:
        Missing key(s) in state_dict: "module.atomic_mass", "module.embedding.weight", "module.distance_expansion.offset", "module.interactions.0.mlp.0.weight", "module.interactions.0.mlp.0.bias", "module.interactions.0.mlp.2.weight", "module.interactions.0.mlp.2.bias", "module.interactions.0.conv.lin1.weight", "module.interactions.0.conv.lin2.weight", "module.interactions.0.conv.lin2.bias", "module.interactions.0.conv.nn.0.weight", "module.interactions.0.conv.nn.0.bias", "module.interactions.0.conv.nn.2.weight", "module.interactions.0.conv.nn.2.bias", "module.interactions.0.lin.weight", "module.interactions.0.lin.bias", "module.interactions.1.mlp.0.weight", "module.interactions.1.mlp.0.bias", "module.interactions.1.mlp.2.weight", "module.interactions.1.mlp.2.bias", "module.interactions.1.conv.lin1.weight", "module.interactions.1.conv.lin2.weight", "module.interactions.1.conv.lin2.bias", "module.interactions.1.conv.nn.0.weight", "module.interactions.1.conv.nn.0.bias", "module.interactions.1.conv.nn.2.weight", "module.interactions.1.conv.nn.2.bias", "module.interactions.1.lin.weight", "module.interactions.1.lin.bias", "module.interactions.2.mlp.0.weight", "module.interactions.2.mlp.0.bias", "module.interactions.2.mlp.2.weight", "module.interactions.2.mlp.2.bias", "module.interactions.2.conv.lin1.weight", "module.interactions.2.conv.lin2.weight", "module.interactions.2.conv.lin2.bias", "module.interactions.2.conv.nn.0.weight", "module.interactions.2.conv.nn.0.bias", "module.interactions.2.conv.nn.2.weight", "module.interactions.2.conv.nn.2.bias", "module.interactions.2.lin.weight", "module.interactions.2.lin.bias", "module.lin1.weight", "module.lin1.bias", "module.lin2.weight", "module.lin2.bias". 
        Unexpected key(s) in state_dict: "module.module.atomic_mass", "module.module.embedding.weight", "module.module.distance_expansion.offset", "module.module.interactions.0.mlp.0.weight", "module.module.interactions.0.mlp.0.bias", "module.module.interactions.0.mlp.2.weight", "module.module.interactions.0.mlp.2.bias", "module.module.interactions.0.conv.lin1.weight", "module.module.interactions.0.conv.lin2.weight", "module.module.interactions.0.conv.lin2.bias", "module.module.interactions.0.conv.nn.0.weight", "module.module.interactions.0.conv.nn.0.bias", "module.module.interactions.0.conv.nn.2.weight", "module.module.interactions.0.conv.nn.2.bias", "module.module.interactions.0.lin.weight", "module.module.interactions.0.lin.bias", "module.module.interactions.1.mlp.0.weight", "module.module.interactions.1.mlp.0.bias", "module.module.interactions.1.mlp.2.weight", "module.module.interactions.1.mlp.2.bias", "module.module.interactions.1.conv.lin1.weight", "module.module.interactions.1.conv.lin2.weight", "module.module.interactions.1.conv.lin2.bias", "module.module.interactions.1.conv.nn.0.weight", "module.module.interactions.1.conv.nn.0.bias", "module.module.interactions.1.conv.nn.2.weight", "module.module.interactions.1.conv.nn.2.bias", "module.module.interactions.1.lin.weight", "module.module.interactions.1.lin.bias", "module.module.interactions.2.mlp.0.weight", "module.module.interactions.2.mlp.0.bias", "module.module.interactions.2.mlp.2.weight", "module.module.interactions.2.mlp.2.bias", "module.module.interactions.2.conv.lin1.weight", "module.module.interactions.2.conv.lin2.weight", "module.module.interactions.2.conv.lin2.bias", "module.module.interactions.2.conv.nn.0.weight", "module.module.interactions.2.conv.nn.0.bias", "module.module.interactions.2.conv.nn.2.weight", "module.module.interactions.2.conv.nn.2.bias", "module.module.interactions.2.lin.weight", "module.module.interactions.2.lin.bias", "module.module.lin1.weight", "module.module.lin1.bias", 
"module.module.lin2.weight", "module.module.lin2.bias".

chaitjo commented Jan 8, 2021

Note that I had updated the dataset config section to test the pre-trained model on the val set:

dataset:
  - src: data/s2ef/200k/train/
    normalize_labels: True
    target_mean: -0.7554450631141663
    target_std: 2.887317180633545
    grad_target_mean: 0.0
    grad_target_std: 2.887317180633545
  - src: data/s2ef/all/val_id/
  - src: data/s2ef/all/val_id/


mshuaibii commented Jan 8, 2021

Hi @chaitjo -

This seems independent of the recent changes. The pretrained models were trained with DDP, and you are attempting to load them without DDP (we can update the docs to clarify this); hence the `module.module` vs. `module` key prefixes. Can you try modifying this line: https://github.com/Open-Catalyst-Project/ocp/blob/c62eb4a8e97ffeb15f857383f160b0f4711c55f0/main.py#L59 to:

trainer.load_pretrained(config["checkpoint"], ddp_to_dp=True)

Otherwise, you could run this with DDP and specify only 1 GPU, to avoid modifying main.py.
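A sketch of what that single-GPU DDP alternative could look like on one machine (the `--distributed` and launcher flags are assumed from TRAIN.md; paths match the original predict invocation, adjust for your setup):

```shell
python -u -m torch.distributed.launch --nproc_per_node=1 main.py \
    --distributed --mode predict \
    --config-yml configs/s2ef/200k/schnet/schnet.yml \
    --checkpoint pretrained/schnet_200k.pt
```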

@chaitjo chaitjo closed this as completed Jan 8, 2021

chaitjo commented Jan 8, 2021

Thanks for the prompt response @mshuaibii. I see, it works now!

Just anecdotally, do you think a single server with 2-4 GPU cards is sufficient for someone to play around with new ideas on OCP with the 200K data split? (Obviously, to do something more serious, I understand one would need larger compute!)


chaitjo commented Jan 8, 2021

Additionally, is there a way to evaluate/test pre-trained models with multi-GPU on a single server, without using distributed data parallel?

mshuaibii (Collaborator) commented:

> Thanks for the prompt response @mshuaibii. I see, it works now!
>
> Just anecdotally, do you think a single server with 2-4 GPU cards is sufficient for someone to play around with new ideas on OCP with the 200K data split? (Obviously, to do something more serious, I understand one would need larger compute!)

Yup - a few of us at CMU have been exploring ideas with similar resources, although it's a little harder to iterate on several new ideas, since training on the 200k split with 2-4 GPUs can take a few days. As for the IS2RE splits, 2-4 GPUs can be used comfortably for all data splits.

> Additionally, is there a way to evaluate/test pre-trained models with multi-GPU on a single server, without using distributed data parallel?

Multi-GPU is enabled in the repo through distributed data parallel, so no. DDP doesn't require multiple nodes to function - it can be run on a single machine just fine with the following command:

python -u -m torch.distributed.launch --nproc_per_node=8 main.py --distributed ... ...

Was there a specific issue you were encountering that prevented you from using DDP on your machine?
