This repository was archived by the owner on Nov 21, 2022. It is now read-only.
HFSaveCheckpoint does not work with deepspeed #273
Open
Labels
bug / fix (Something isn't working) · help wanted (Extra attention is needed)
Description
🐛 Bug
HFSaveCheckpoint does not save a Hugging Face checkpoint when the model is trained with DeepSpeed, and no message or warning indicates that the HF checkpoint was not saved.
To Reproduce
Use the HFSaveCheckpoint plugin when training with DeepSpeed. I encountered this in both a multi-node (Azure) and a single-node (local) environment.
Code sample
import os
from typing import Any, List, Optional, Dict

from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.utilities.types import _PATH
from transformers import AutoTokenizer

from lightning_transformers.plugins.checkpoint import HFSaveCheckpoint
from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataModule,
    TextClassificationTransformer,
)

model_arch = "prajjwal1/bert-tiny"
seed_everything(102938, workers=True)
tokenizer = AutoTokenizer.from_pretrained(model_arch)
dm = TextClassificationDataModule(
    tokenizer=tokenizer,
    train_val_split=0.01,  # split used for creating a validation set out of the training set
    dataset_name="glue",
    dataset_config_name="cola",
    batch_size=1,
    num_workers=os.cpu_count(),
    padding="max_length",
    truncation=True,
    max_length=512,
)
model = TextClassificationTransformer(pretrained_model_name_or_path=model_arch)
checkpoint_callback = ModelCheckpoint(save_top_k=1, monitor="val_loss")
trainer = Trainer(
    accelerator="gpu",
    plugins=HFSaveCheckpoint(model=model),
    callbacks=[checkpoint_callback],
    logger=False,
    enable_checkpointing=True,
    log_every_n_steps=10,
    limit_train_batches=30,
    limit_val_batches=10,
    max_epochs=2,
    strategy="deepspeed_stage_3",
)
trainer.fit(model, dm)
print(f"Best model path: {checkpoint_callback.best_model_path}")

Expected behavior
Either a warning should be raised or the HF model should save properly.
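As a possible stopgap until the plugin handles DeepSpeed, the sketch below shows one way to build an HF checkpoint by hand from a ZeRO stage-3 checkpoint directory, using DeepSpeed's public zero_to_fp32 utility. This is only a workaround idea, not lightning-transformers API: both function names here are mine, and the `"model."` key prefix is an assumption about how Lightning wraps the HF module.

```python
import os


def is_deepspeed_checkpoint(path: str) -> bool:
    """DeepSpeed (ZeRO) checkpoints are directories of shards, whereas a
    vanilla Lightning checkpoint is a single .ckpt file — so HFSaveCheckpoint
    may silently skip the directory case."""
    return os.path.isdir(path)


def save_hf_checkpoint_from_deepspeed(ckpt_dir: str, hf_model, out_dir: str) -> None:
    """Hypothetical workaround: consolidate a ZeRO checkpoint directory into a
    single fp32 state dict and write it out in Hugging Face format."""
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    state_dict = get_fp32_state_dict_from_zero_checkpoint(ckpt_dir)
    # Assumption: Lightning modules prefix parameter names with "model.";
    # strip it so the keys match the bare HF model.
    state_dict = {k.replace("model.", "", 1): v for k, v in state_dict.items()}
    hf_model.load_state_dict(state_dict)
    hf_model.save_pretrained(out_dir)
```

With this in place, the repro script could call `save_hf_checkpoint_from_deepspeed(checkpoint_callback.best_model_path, model.model, "hf_out")` after `trainer.fit` when `is_deepspeed_checkpoint(...)` is true — again, a sketch I have not verified against this repository.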
Environment
lightning-transformers: 0.2.1
* CUDA:
- GPU:
- Quadro T2000 with Max-Q Design
- available: True
- version: 11.3
* Packages:
- numpy: 1.22.3
- pyTorch_debug: False
- pyTorch_version: 1.11.0
- pytorch-lightning: 1.6.4
- tqdm: 4.64.0
* System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.13
- version: #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022