❓ Questions and Help
For context, I trained a number of models over many weeks, tracking loss and accuracy for the train, validation, and test steps.
Now I want to evaluate more metrics on the test dataset. Specifically, I added recall, confusion-matrix, and precision metrics (from the ptl.metrics module) to the test_step and test_epoch_end methods of my LightningModule.
I also replaced my custom accuracy with the class-based Accuracy implemented in the ptl.metrics package.
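For reference, the change looks roughly like this. It is only a simplified sketch with placeholder names (my real module is L_WavenetLSTMClassifier and also adds a confusion matrix), using the class-based metrics from pytorch_lightning.metrics:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.metrics import Accuracy, Precision, Recall


class LitClassifier(pl.LightningModule):
    """Illustrative stand-in for the real classifier."""

    def __init__(self, in_features: int = 128, num_classes: int = 10):
        super().__init__()
        self.classifier = torch.nn.Linear(in_features, num_classes)
        # Class-based metrics keep internal state such as "correct" / "total",
        # which is where key names like "test_acc.correct" come from.
        self.test_acc = Accuracy()
        self.test_precision = Precision(num_classes=num_classes)
        self.test_recall = Recall(num_classes=num_classes)

    def forward(self, x):
        return self.classifier(x)

    def test_step(self, batch, batch_idx):
        x, y = batch
        preds = torch.argmax(self(x), dim=1)
        self.test_acc.update(preds, y)
        self.test_precision.update(preds, y)
        self.test_recall.update(preds, y)

    def test_epoch_end(self, outputs):
        self.log("test_acc", self.test_acc.compute())
        self.log("test_precision", self.test_precision.compute())
        self.log("test_recall", self.test_recall.compute())
```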
When I try to test the model and compute these metrics on the test set, I get the following error while loading the checkpoint:
Traceback (most recent call last):
  File "model_manager.py", line 283, in <module>
    helper.test()
  File "model_manager.py", line 119, in test
    self.trainer.test(self.module)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 748, in test
    results = self.__test_given_model(model, test_dataloaders)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 813, in __test_given_model
    results = self.fit(model)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 459, in fit
    results = self.accelerator_backend.train()
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 61, in train
    self.trainer.train_loop.setup_training(model)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 174, in setup_training
    self.trainer.checkpoint_connector.restore_weights(model)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 75, in restore_weights
    self.restore(self.trainer.resume_from_checkpoint, on_gpu=self.trainer.on_gpu)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 107, in restore
    self.restore_model_state(model, checkpoint)
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 128, in restore_model_state
    model.load_state_dict(checkpoint['state_dict'])
  File "/data/anaconda3/envs/aidio2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for L_WavenetLSTMClassifier:
    Unexpected key(s) in state_dict: "train_acc.correct", "train_acc.total", "val_acc.correct", "val_acc.total", "test_acc.correct", "test_acc.total".
What is your question?
In my case it's impossible to retrain the models because training takes many weeks. So I wonder whether there is a way to load the already trained model anyway and obtain the updated test metrics from a test cycle.
I really only care about loading the model parameters to run the test cycle. I don't understand why loading the other entries is so important; those old metric states don't seem vital to me.
What have you tried?
I read that this exception is raised by PyTorch's model.load_state_dict method and can be avoided with the strict=False parameter.
In my case I load the trained model via the resume_from_checkpoint parameter of the PyTorch Lightning Trainer class, so I have no idea how to pass strict=False along that path.
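Would something along these lines be a valid workaround? This is only a sketch with placeholder names (MyLightningModule and the checkpoint path): skip resume_from_checkpoint and restore the weights manually with strict=False so the extra metric keys are ignored.

```python
import torch
from pytorch_lightning import Trainer

# Placeholder checkpoint path and module class; not my actual code.
ckpt = torch.load("path/to/trained_model.ckpt", map_location="cpu")

model = MyLightningModule(**ckpt.get("hyper_parameters", {}))
# strict=False skips keys that don't match; the return value reports what was ignored.
missing, unexpected = model.load_state_dict(ckpt["state_dict"], strict=False)
print("keys ignored while loading:", unexpected)

# Run only the test cycle; note there is no resume_from_checkpoint here.
trainer = Trainer(gpus=1)
trainer.test(model)
```

If LightningModule.load_from_checkpoint accepts strict=False in the installed version, that might be a shorter way to do the same thing, but I haven't verified it.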
What's your environment?
- OS: Win
- Version: latest master branch (November 13th, 2020)