This repository was archived by the owner on Nov 21, 2022. It is now read-only.

HFSaveCheckpoint does not work with deepspeed #273

@jessecambon

Description

🐛 Bug

HFSaveCheckpoint does not save a HuggingFace checkpoint when the model is trained with deepspeed. No message or warning is emitted to indicate that the HF checkpoint was not saved.

To Reproduce

Use the HFSaveCheckpoint plugin when training with deepspeed. I encountered this in both a multi-node (Azure) and a single-node (local) environment.

Code sample

import os

from pytorch_lightning import Trainer, seed_everything
from pytorch_lightning.callbacks import ModelCheckpoint
from transformers import AutoTokenizer

from lightning_transformers.task.nlp.text_classification import (
    TextClassificationDataModule,
    TextClassificationTransformer,
)
from lightning_transformers.plugins.checkpoint import HFSaveCheckpoint


model_arch = "prajjwal1/bert-tiny"

seed_everything(102938, workers=True)
tokenizer = AutoTokenizer.from_pretrained(model_arch)

dm = TextClassificationDataModule(
    tokenizer=tokenizer,
    train_val_split=0.01,  # fraction of the training set held out for validation
    dataset_name="glue",
    dataset_config_name="cola",
    batch_size=1,
    num_workers=os.cpu_count(),
    padding="max_length",
    truncation=True,
    max_length=512,
)

model = TextClassificationTransformer(pretrained_model_name_or_path=model_arch)

checkpoint_callback = ModelCheckpoint(save_top_k=1, monitor="val_loss")

trainer = Trainer(
    accelerator="gpu",
    plugins=HFSaveCheckpoint(model=model),
    callbacks=[checkpoint_callback],
    logger=False,
    enable_checkpointing=True,
    log_every_n_steps=10,
    limit_train_batches=30,
    limit_val_batches=10,
    max_epochs=2,
    strategy="deepspeed_stage_3",
)

trainer.fit(model, dm)

print(f"Best model path: {checkpoint_callback.best_model_path}")

Expected behavior

Either the HF checkpoint is saved properly, or a warning is raised to indicate that it was not.
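One way to detect the silent failure is to check the save directory for the files a HuggingFace checkpoint normally contains (a config.json plus a weights file). This is a minimal sketch, not part of lightning-transformers; the helper name and the example directory path are hypothetical:

```python
from pathlib import Path


def has_hf_checkpoint(save_dir: str) -> bool:
    """Return True if `save_dir` looks like a saved HuggingFace model:
    it contains config.json and at least one .bin weights file."""
    d = Path(save_dir)
    if not (d / "config.json").is_file():
        return False
    # HF saves weights as pytorch_model.bin (or sharded *.bin files)
    return any(d.glob("*.bin"))


# Example: after trainer.fit(), point this at the directory HFSaveCheckpoint
# was expected to write to (path here is hypothetical).
print(has_hf_checkpoint("outputs/hf_checkpoint"))
```

With deepspeed this returns False in my runs, even though training completes without any error.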

Environment

lightning-transformers: 0.2.1

* CUDA:
        - GPU:
                - Quadro T2000 with Max-Q Design
        - available:         True
        - version:           11.3
* Packages:
        - numpy:             1.22.3
        - pyTorch_debug:     False
        - pyTorch_version:   1.11.0
        - pytorch-lightning: 1.6.4
        - tqdm:              4.64.0
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.8.13
        - version:           #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022

    Labels

    bug / fix: Something isn't working
    help wanted: Extra attention is needed
