Fix resuming the tqdm progress bar #13962

Merged
merged 13 commits on Aug 2, 2022

Conversation

@awaelchli (Member) commented Aug 1, 2022

What does this PR do?

The changes in #12889 have caused #13124.

This PR reverts #12889 (except tests) and adds a fix that sets tqdm.initial = 0, which prevents tqdm from incorrectly computing the ETA and iterations/s estimates. All tests added in #12889 remain, as they are not affected by the revert.
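
To illustrate the mechanism, here is a minimal standalone sketch (not code from this PR, and assuming a tqdm release whose format_meter accepts the initial keyword): tqdm's fallback average rate is roughly (n - initial) / elapsed, so a stale non-zero initial on a bar whose counter has been reset skews both the reported it/s and the ETA derived from it.

from tqdm import tqdm

# Identical counter value and elapsed time, different `initial`:
# the fallback average rate is (n - initial) / elapsed, so the first
# call reports ~51 it/s while the second reports ~1 it/s, and the ETA
# shifts accordingly.
print(tqdm.format_meter(n=51, total=100, elapsed=1.0, initial=0))
print(tqdm.format_meter(n=51, total=100, elapsed=1.0, initial=50))

Keeping initial at 0 whenever the bar itself restarts from zero keeps this computation consistent with the counter the bar actually displays.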

Since this is a visual bug and tqdm produces these estimates at runtime, I don't really know how to test this. Suggestions from reviewers are welcome!

Repro example used to debug:
import os
import shutil
from time import sleep

import torch
from torch.utils.data import DataLoader, Dataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        sleep(0.1)  # slow the forward pass down so the bar's it/s and ETA estimates are observable
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        print()
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("val_loss", loss)
        print()

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 32), batch_size=2)

    if os.path.exists("lightning_logs"):
        shutil.rmtree("lightning_logs")

    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=1,
        enable_model_summary=False,
        enable_progress_bar=True,
        # checkpoint every training step so a mid-epoch checkpoint exists to resume from
        callbacks=ModelCheckpoint(monitor="train_loss", save_top_k=-1, every_n_train_steps=1),
    )
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=train_data)

    # second run: resume training from the mid-epoch checkpoint written by the first run
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        max_epochs=3,
        enable_model_summary=False,
        enable_progress_bar=True,
        callbacks=ModelCheckpoint(monitor="train_loss", save_top_k=-1, every_n_train_steps=1),
    )
    trainer.fit(
        model, train_dataloaders=train_data, ckpt_path="lightning_logs/version_0/checkpoints/epoch=0-step=3.ckpt"
    )


if __name__ == "__main__":
    run()
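
Running this script, the second trainer.fit call resumes mid-epoch from the checkpoint written at step 3 of the first run; with the bug present, the it/s and ETA shown on the resumed progress bar are visibly off, which is why the script serves only as a manual repro.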

Does your PR introduce any breaking changes? If yes, please list them.

No.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

cc @Borda @tchaton @rohitgr7 @awaelchli @akihironitta

@github-actions bot added the pl (Generic label for PyTorch Lightning package) label Aug 1, 2022
@awaelchli added the bug (Something isn't working) and progress bar: tqdm labels Aug 1, 2022
@awaelchli added this to the pl:1.7.x milestone Aug 1, 2022
@awaelchli changed the title from "[WIP] Fix resuming the tqdm progress bar" to "Fix resuming the tqdm progress bar" Aug 1, 2022
@awaelchli marked this pull request as ready for review August 1, 2022 22:27
codecov bot commented Aug 1, 2022

Codecov Report

Merging #13962 (67b62e6) into master (af07e75) will decrease coverage by 9%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #13962     +/-   ##
=========================================
- Coverage      86%      76%     -9%     
=========================================
  Files         341      341             
  Lines       26563    26565      +2     
=========================================
- Hits        22732    20221   -2511     
- Misses       3831     6344   +2513     

@awaelchli added the priority: 0 (High priority task) label Aug 2, 2022
@mergify bot added the has conflicts label Aug 2, 2022
@mergify bot added the ready (PRs ready to be merged) label and removed the has conflicts label Aug 2, 2022
@rohitgr7 changed the milestone from pl:1.7.x to pl:1.7 Aug 2, 2022
Labels
bug (Something isn't working), pl (Generic label for PyTorch Lightning package), priority: 0 (High priority task), progress bar: tqdm, ready (PRs ready to be merged)

4 participants