Description
Describe your question
I'm currently working on a transfer learning project for ASR and I followed the related tutorial. Since I'm not able to perform an entire training in a single session, I need to use checkpoints and resume from them.
The steps I'm following are:
- Save the model after training:

```python
quartznet.save_to("./out/my_model.nemo")
```
- Restore the model when resuming:

```python
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("./out/my_model.nemo")
```
- Restore the PyTorch Lightning trainer:

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(
    save_dir=os.getcwd(),
    version=3,
    name='lightning_logs'
)
trainer = pl.Trainer(
    gpus=1, max_epochs=30, precision=16, amp_level='O1', checkpoint_callback=True,
    resume_from_checkpoint='lightning_logs/version_3/checkpoints/epoch=18-step=68893.ckpt',
    logger=logger)
```
- Assign the parameters and the trainer to the model again:

```python
from omegaconf import DictConfig

# Set learning rate
params['model']['optim']['lr'] = 0.001
# Set NovoGrad optimizer betas
params['model']['optim']['betas'] = [0.95, 0.25]
# Set CosineAnnealing learning rate policy's warmup ratio
params['model']['optim']['sched']['warmup_ratio'] = 0.12
# Set training and validation labels
params['model']['train_ds']['labels'] = italian_vocabulary
params['model']['validation_ds']['labels'] = italian_vocabulary
# Set batch size
params['model']['train_ds']['batch_size'] = 16
params['model']['validation_ds']['batch_size'] = 16
# Assign trainer to the model
quartznet.set_trainer(trainer)
# Point to the data we'll use for fine-tuning as the training set
quartznet.setup_training_data(train_data_config=params['model']['train_ds'])
# Point to the new validation data for fine-tuning
quartznet.setup_validation_data(val_data_config=params['model']['validation_ds'])
# Add changes to quartznet model
quartznet.setup_optimization(optim_config=DictConfig(params['model']['optim']))
```
- Resume training:

```python
trainer.fit(quartznet)
```
I would like to know whether this is the correct way to resume from a checkpoint with the old trainer and the previously trained model, or whether I'm missing some steps (or performing redundant ones).
Earlier I tried restoring only the model and declaring a new trainer (`trainer = pl.Trainer(gpus=1, max_epochs=20, precision=16, amp_level='O1', checkpoint_callback=True)`), then training for the same number of epochs as the first session. In the end the resulting WER was roughly the same, so I assumed the problem was that I hadn't also restored my old trainer.
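For context, the difference between the two approaches comes down to what a Lightning `.ckpt` file stores beyond the model weights: the optimizer moments, LR-scheduler progress, and epoch counter live in the Lightning checkpoint and are only restored via `resume_from_checkpoint`, while `restore_from` on a `.nemo` file recovers weights only. A minimal sketch of the top-level layout of a Lightning 1.x checkpoint (key names taken from Lightning 1.1.x; all values here are illustrative placeholders):

```python
# Illustrative top-level layout of a PyTorch Lightning 1.x ".ckpt" file.
# A .ckpt is an ordinary pickled dict; the values below are placeholders.
lightning_ckpt = {
    "epoch": 18,                            # trainer resumes counting from here
    "global_step": 68893,                   # optimizer step counter
    "pytorch-lightning_version": "1.1.5",
    "state_dict": {},                       # model weights (also inside the .nemo file)
    "optimizer_states": [{}],               # e.g. NovoGrad moving averages
    "lr_schedulers": [{}],                  # e.g. CosineAnnealing/warmup progress
}

def is_full_resume(ckpt: dict) -> bool:
    """True if the checkpoint can restore training state, not just weights."""
    return all(k in ckpt for k in ("epoch", "optimizer_states", "lr_schedulers"))

print(is_full_resume(lightning_ckpt))   # full training state present
print(is_full_resume({"state_dict": {}}))  # weights-only, like a bare .nemo restore
```

This would explain the earlier observation: a fresh `Trainer` with only restored weights restarts the optimizer and scheduler from scratch, so repeating the same number of epochs lands at roughly the same WER.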
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install: `pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]`
Environment details
- OS version: Ubuntu 18.04.5 LTS
- PyTorch Lightning version: 1.1.5
- Python version: 3.6.9
Additional context
GPU model: NVIDIA Quadro RTX 4000
CUDA version: 10.1