optimizer states are saved in different folders #12864

@ShaneTian

Description

πŸ› Bug

For the same step, the model states and optimizer states are split and saved into two different checkpoint folders.

$ tree .
.
├── T5-monitor-epoch=0-step=200-val_loss=0.1618-val_ppl=1.1909-f1_score=0.3613.ckpt
│   ├── checkpoint
│   │   ├── mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_6_mp_rank_00_optim_states.pt
│   ├── latest
│   └── zero_to_fp32.py
├── T5-monitor-epoch=0-step=200-val_loss=0.1618-val_ppl=1.1909-f1_score=0.6021.ckpt
│   └── checkpoint
│       ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│       ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
│       └── zero_pp_rank_7_mp_rank_00_optim_states.pt
├── T5-monitor-epoch=1-step=568-val_loss=0.0352-val_ppl=1.0372-f1_score=0.8835.ckpt
│   ├── checkpoint
│   │   ├── mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_6_mp_rank_00_optim_states.pt
│   ├── latest
│   └── zero_to_fp32.py
└── T5-monitor-epoch=1-step=568-val_loss=0.0352-val_ppl=1.0372-f1_score=0.9235.ckpt
    └── checkpoint
        ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
        ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
        └── zero_pp_rank_7_mp_rank_00_optim_states.pt

8 directories, 22 files

I use the DeepSpeed strategy, running on 8 A100 GPUs.
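The report does not include the Trainer setup. A minimal sketch of what such a configuration might look like; everything here is an assumption (the DeepSpeed stage, the monitored metric, and the filename pattern are guesses inferred from the checkpoint directory names above):

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Assumed configuration: the issue only states "DeepSpeed strategy on 8 A100s".
# The stage, monitored metric, and filename pattern below are reconstructions,
# not taken from the report.
checkpoint_cb = ModelCheckpoint(
    monitor="f1_score",
    mode="max",
    filename="T5-monitor-{epoch}-{step}-{val_loss:.4f}-{val_ppl:.4f}-{f1_score:.4f}",
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="deepspeed_stage_2",  # assumed stage
    callbacks=[checkpoint_cb],
)
```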

Related code:

def validation_step(self, batch, batch_idx):
    inputs = self.create_inputs(batch)
    outputs = self(inputs)
    loss = outputs["loss"]
    ppl = torch.exp(loss)
    metrics = {"val/loss": loss, "val/ppl": ppl.detach()}
    self.log_dict(metrics, prog_bar=True, logger=True, on_epoch=True, batch_size=inputs["batch_size"][0])
    if self.metric is not None:
        # Collect generated and gold sequences for metric computation at epoch end.
        pred_seqs = self.generate(inputs)
        metrics["val_pred_seqs"] = pred_seqs
        metrics["gold_seqs"] = inputs["gold_seqs"]
    return metrics

def validation_epoch_end(self, validation_step_outputs):
    if self.metric is not None:
        pred_seqs = []
        gold_seqs = []
        for batch_out in validation_step_outputs:
            pred_seqs.extend(batch_out["val_pred_seqs"])
            gold_seqs.extend(batch_out["gold_seqs"])
        # Each rank evaluates only its own predictions; the resulting metrics
        # (e.g. f1_score) are logged without cross-rank synchronization.
        eval_results = self.metric.evaluate(pred_seqs, gold_seqs)
        self.log_dict(eval_results, prog_bar=True, logger=True)
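A minimal, self-contained illustration of what the directory listing suggests is happening: each rank formats the checkpoint filename from its own locally computed metrics, so ranks that disagree on f1_score write to different .ckpt directories. The metric values below are taken from the tree output above; the filename template is reconstructed from those directory names, and the per-rank assignment is hypothetical:

```python
# Simulate two ranks formatting a checkpoint filename from unsynchronized
# metric values (numbers taken from the tree listing above).
template = (
    "T5-monitor-epoch={epoch}-step={step}"
    "-val_loss={val_loss:.4f}-val_ppl={val_ppl:.4f}-f1_score={f1:.4f}.ckpt"
)

# Hypothetical per-rank metrics: val_loss/val_ppl agree, f1_score does not,
# because each rank evaluated only its own shard of the validation data.
rank_metrics = {
    0: {"epoch": 0, "step": 200, "val_loss": 0.1618, "val_ppl": 1.1909, "f1": 0.3613},
    2: {"epoch": 0, "step": 200, "val_loss": 0.1618, "val_ppl": 1.1909, "f1": 0.6021},
}

names = {rank: template.format(**m) for rank, m in rank_metrics.items()}

# The ranks disagree on the filename, so the DeepSpeed shard files for the
# same step land in two different checkpoint directories.
assert names[0] != names[2]
```

If this is indeed the cause, a possible workaround (an assumption, not verified against this issue) is to log the evaluation results with sync_dist=True, or to gather pred_seqs/gold_seqs across ranks before computing the metric, so that every rank monitors the same value.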

To Reproduce

Expected behavior

Environment

  • CUDA:
    - GPU:
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - available: True
    - version: 11.3
  • Packages:
    - numpy: 1.20.1
    - pyTorch_debug: False
    - pyTorch_version: 1.10.2+cu113
    - pytorch-lightning: 1.6.0
    - tqdm: 4.63.1
  • System:
    - OS: Linux
    - architecture:
    - 64bit
    -
    - processor: x86_64
    - python: 3.7.10
    - version: #1 SMP Fri Mar 19 10:07:22 CST 2021

Additional context

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @SeanNaren @akihironitta
