Resume training #13

Closed
nemeziz69 opened this issue Mar 15, 2023 · 5 comments

@nemeziz69

My training stopped because my PC accidentally shut down. Is it possible to resume the training? If yes, how would I do it?

@Daraan

Daraan commented Mar 15, 2023

Please correct me if I'm wrong. This repo uses the pytorch-accelerated framework, and I don't see checkpointing enabled by default, so sadly I doubt the interrupted run can be recovered.

You would have to manually enable checkpointing beforehand with the trainer; see https://pytorch-accelerated.readthedocs.io/en/latest/callbacks.html.
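
For future runs, here is a rough sketch of what such a callback could look like, assuming the TrainerCallback hooks and the trainer's save_checkpoint method behave as described in the docs linked above (the class name and file name are just placeholders):

    from pytorch_accelerated.callbacks import TrainerCallback

    class SaveLatestCheckpointCallback(TrainerCallback):
        """Illustrative callback: write a resumable checkpoint after every epoch."""

        def __init__(self, save_path="latest_checkpoint.pt"):
            self.save_path = save_path

        def on_train_epoch_end(self, trainer, **kwargs):
            # save_optimizer=True also stores the optimizer state_dict, so momentum
            # buffers and the current learning rate survive a restart
            trainer.save_checkpoint(save_path=self.save_path, save_optimizer=True)

You would pass an instance of it in the trainer's callbacks when the trainer is created, alongside the default callbacks.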

@varshanth

varshanth commented Mar 21, 2023

From the example provided in train_cars.py, the best model is saved as "best_model.pt". You can use the load_checkpoint method before you start training to resume from the last checkpoint. The optimizer state_dict should also be saved, so you should be able to resume training with the checkpointed LR as well. You just need to modify your existing training script a bit.

@nemeziz69


Following @varshanth's suggestion, I tried using the load_checkpoint method before trainer.train to resume my training. Something like this:

    if RESUME_LOCAL_PATH is not None:
        print(f"Resume load checkpoint from: {RESUME_LOCAL_PATH}")
        trainer.load_checkpoint(RESUME_LOCAL_PATH)
        print(optimizer)

    # run training
    trainer.train(
        num_epochs=num_epochs,
        train_dataset=train_yds,
        eval_dataset=eval_yds,
        per_device_batch_size=batch_size,
        create_scheduler_fn=CosineLrScheduler.create_scheduler_fn(
            num_warmup_epochs=NUM_WARMUP_EPOCH,
            num_cooldown_epochs=NUM_COOLDOWN_EPOCH,
            k_decay=2,
        ),
        collate_fn=yolov7_collate_fn,
        gradient_accumulation_steps=num_accumulate_steps,
    )

But this error occurs:

Traceback (most recent call last):
  File "train.py", line 405, in <module>
    main()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\func_to_script\core.py", line 108, in scripted_function
    return func(**args)
  File "train.py", line 389, in main
    trainer.train(
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 468, in train
    self._run_training()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 679, in _run_training
    self._run_train_epoch(self._train_dataloader)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 758, in _run_train_epoch
    self.optimizer_step()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\pytorch_accelerated\trainer.py", line 323, in optimizer_step
    self.optimizer.step()
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\accelerate\optimizer.py", line 134, in step
    self.optimizer.step(closure)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\sgd.py", line 110, in step
    F.sgd(params_with_grad,
  File "D:\NAVIS_ODKI_Yolov7\env_cpu_38\lib\site-packages\torch\optim\_functional.py", line 173, in sgd
    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 1

Is my implementation correct?

@varshanth

varshanth commented May 30, 2023

The error you received basically says that the model you instantiated and the model you loaded from the checkpoint are not the same. There is a mismatch between the parameter shapes the optimizer's momentum buffers were created with and the parameter shapes loaded from the checkpoint for that particular layer. Can you please double-check that the model you trained and the model you loaded are the same, with no changes made between the save and the load?
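
For example, a quick sanity check along these lines might help (a rough sketch: model is assumed to be the YOLOv7 model instantiated in your training script, and the "model_state_dict" key is an assumption about the checkpoint layout, so adjust it to match what your checkpoint file actually contains):

    import torch

    # compare parameter shapes between the freshly built model and the checkpoint
    ckpt = torch.load("best_model.pt", map_location="cpu")
    saved_state = ckpt.get("model_state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

    current_state = model.state_dict()
    for name, saved_param in saved_state.items():
        if not hasattr(saved_param, "shape"):
            continue  # skip non-tensor entries in the checkpoint
        current_param = current_state.get(name)
        if current_param is None:
            print(f"missing in current model: {name}")
        elif tuple(current_param.shape) != tuple(saved_param.shape):
            print(f"shape mismatch for {name}: "
                  f"{tuple(current_param.shape)} vs {tuple(saved_param.shape)}")

If this prints any mismatches, the checkpoint was saved from a different model configuration than the one you are instantiating now.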

@nemeziz69

Hi @varshanth, I confirm that the trained model and the loaded model are the same, with no changes made. However, when I print the optimizer before and after the load_checkpoint call, there is a difference: an initial_lr entry appears and the lr changes:

print(optimizer)
if RESUME_LOCAL_PATH is not None:
    print(f"Resume load checkpoint from: {RESUME_LOCAL_PATH}")
    trainer.load_checkpoint(RESUME_LOCAL_PATH)
    print(optimizer)

Output:

SGD (
Parameter Group 0
    dampening: 0
    lr: 0.01
    momentum: 0.937
    nesterov: True
    weight_decay: 0

Parameter Group 1
    dampening: 0
    lr: 0.01
    momentum: 0.937
    nesterov: True
    weight_decay: 0.000515625
)
Resume load checkpoint from: 230530_145306_ep1_0.00771408797687863_best_model.pt
SGD (
Parameter Group 0
    dampening: 0
    initial_lr: 0.01
    lr: 1e-06
    momentum: 0.937
    nesterov: True
    weight_decay: 0

Parameter Group 1
    dampening: 0
    initial_lr: 0.01
    lr: 1e-06
    momentum: 0.937
    nesterov: True
    weight_decay: 0.000515625
)

Could this be the issue?
