Labels: bug, checkpointing, strategy: deepspeed

Description
🐛 Bug
For a single checkpoint step, the model states and optimizer states are split (saved) across two different .ckpt folders instead of one.
$ tree .
.
├── T5-monitor-epoch=0-step=200-val_loss=0.1618-val_ppl=1.1909-f1_score=0.3613.ckpt
│   ├── checkpoint
│   │   ├── mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_6_mp_rank_00_optim_states.pt
│   ├── latest
│   └── zero_to_fp32.py
├── T5-monitor-epoch=0-step=200-val_loss=0.1618-val_ppl=1.1909-f1_score=0.6021.ckpt
│   └── checkpoint
│       ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
│       ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
│       └── zero_pp_rank_7_mp_rank_00_optim_states.pt
├── T5-monitor-epoch=1-step=568-val_loss=0.0352-val_ppl=1.0372-f1_score=0.8835.ckpt
│   ├── checkpoint
│   │   ├── mp_rank_00_model_states.pt
│   │   ├── zero_pp_rank_0_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_1_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_3_mp_rank_00_optim_states.pt
│   │   ├── zero_pp_rank_4_mp_rank_00_optim_states.pt
│   │   └── zero_pp_rank_6_mp_rank_00_optim_states.pt
│   ├── latest
│   └── zero_to_fp32.py
└── T5-monitor-epoch=1-step=568-val_loss=0.0352-val_ppl=1.0372-f1_score=0.9235.ckpt
    └── checkpoint
        ├── zero_pp_rank_2_mp_rank_00_optim_states.pt
        ├── zero_pp_rank_5_mp_rank_00_optim_states.pt
        └── zero_pp_rank_7_mp_rank_00_optim_states.pt

8 directories, 22 files
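Each *.ckpt path above is a DeepSpeed-sharded checkpoint directory rather than a single file: mp_rank_00_model_states.pt holds the model weights, the zero_pp_rank_*_optim_states.pt files hold each rank's ZeRO-partitioned optimizer state, and zero_to_fp32.py is the helper script DeepSpeed drops next to them for consolidating the shards into a single fp32 state dict. As a hedged illustration (not part of the report), consolidation would normally look roughly like the sketch below, using the Lightning utility that wraps that script; with the shards split across two directories as above, neither directory contains a complete set of rank files.

# Sketch only: consolidating a DeepSpeed checkpoint directory into a single
# fp32 state dict. The paths are placeholders taken from the tree above, and
# the output filename is hypothetical.
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    "T5-monitor-epoch=0-step=200-val_loss=0.1618-val_ppl=1.1909-f1_score=0.3613.ckpt",
    "consolidated.pt",
)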
I use the DeepSpeed strategy, running on 8 A100 cards.
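For context (this is not the author's script), here is a minimal sketch of the kind of Trainer/ModelCheckpoint setup that would produce the tree above; the monitor key, mode, filename template, and DeepSpeed stage are assumptions reconstructed from the checkpoint filenames.

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Hypothetical callback: monitor the generation metric and encode several
# logged values into the checkpoint filename, as seen in the tree above.
checkpoint_callback = ModelCheckpoint(
    monitor="f1_score",
    mode="max",
    filename="T5-monitor-{epoch}-{step}-{val_loss:.4f}-{val_ppl:.4f}-{f1_score:.4f}",
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="deepspeed_stage_2",  # assumed ZeRO stage; any stage shards optimizer states per rank
    precision=16,
    callbacks=[checkpoint_callback],
)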
Related code:

def validation_step(self, batch, batch_idx):
    inputs = self.create_inputs(batch)
    outputs = self(inputs)
    loss = outputs["loss"]
    ppl = torch.exp(loss)
    metrics = {"val/loss": loss, "val/ppl": ppl.detach()}
    self.log_dict(metrics, prog_bar=True, logger=True, on_epoch=True, batch_size=inputs["batch_size"][0])
    if self.metric is not None:
        # keep rank-local predictions/references for epoch-level evaluation
        pred_seqs = self.generate(inputs)
        metrics["val_pred_seqs"] = pred_seqs
        metrics["gold_seqs"] = inputs["gold_seqs"]
    return metrics

def validation_epoch_end(self, validation_step_outputs):
    if self.metric is not None:
        pred_seqs = []
        gold_seqs = []
        for batch_out in validation_step_outputs:
            pred_seqs.extend(batch_out["val_pred_seqs"])
            gold_seqs.extend(batch_out["gold_seqs"])
        eval_results = self.metric.evaluate(pred_seqs, gold_seqs)
        # epoch-level metrics such as f1_score are logged here
        self.log_dict(eval_results, prog_bar=True, logger=True)
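One detail visible in the tree (my observation, not the author's): the two directories for the same step differ only in the f1_score part of the filename (0.3613 vs 0.6021 at step 200), and ranks 0/1/3/4/6 wrote into one directory while ranks 2/5/7 wrote into the other. Because eval_results is computed from each rank's local shard of the validation set with no cross-rank reduction, the monitored value can differ per rank, which would explain different ranks resolving different checkpoint paths. Below is a minimal sketch of making the logged value identical on every rank, assuming that divergence is indeed the cause; it is an illustration only, not necessarily the intended fix.

def validation_epoch_end(self, validation_step_outputs):
    if self.metric is not None:
        pred_seqs, gold_seqs = [], []
        for batch_out in validation_step_outputs:
            pred_seqs.extend(batch_out["val_pred_seqs"])
            gold_seqs.extend(batch_out["gold_seqs"])
        eval_results = self.metric.evaluate(pred_seqs, gold_seqs)
        # sync_dist=True reduces (mean) each logged value across ranks, so every
        # process sees the same f1_score when the checkpoint filename is built.
        self.log_dict(eval_results, prog_bar=True, logger=True, sync_dist=True)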
To Reproduce

Expected behavior
Environment
- CUDA:
  - GPU:
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
    - A100-SXM4-40GB
  - available: True
  - version: 11.3
- Packages:
  - numpy: 1.20.1
  - pyTorch_debug: False
  - pyTorch_version: 1.10.2+cu113
  - pytorch-lightning: 1.6.0
  - tqdm: 4.63.1
- System:
  - OS: Linux
  - architecture:
    - 64bit
  - processor: x86_64
  - python: 3.7.10
  - version: #1 SMP Fri Mar 19 10:07:22 CST 2021
Additional context
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @SeanNaren @akihironitta