TensorBoardLogger shows epoch numbers much higher than the actual epoch #19828

Open

AlbireoBai opened this issue Apr 30, 2024 · 2 comments

Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.1.x

Comments

@AlbireoBai

Bug description

I used the following code to log the metrics, but I found that the epoch recorded by the TensorBoard logger is much higher than it should be:

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = torch.sqrt(self.loss_fn(y_hat, y))
    self.log("train_loss", loss, logger=True, prog_bar=True, on_epoch=True)
    return loss

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = torch.sqrt(self.loss_fn(y_hat, y))
    self.log("valid_loss", loss, logger=True, prog_bar=True, on_epoch=True)
    return loss

pl.Trainer(..., logger=TensorBoardLogger(save_dir='store', version=log_path), ...)

In the Trainer configuration I set max_epochs=10000, but in the logger I got epoch values of more than 650k:
[Screenshots: TensorBoard chart of the logged epoch metric, and a chart of valid_loss]
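For context, a minimal sketch of the setup described above (a hedged reconstruction, not the reporter's actual code; the model, dataloaders, and log_path are placeholders):

import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

log_path = "run_0"  # placeholder for the version/path the reporter used

trainer = pl.Trainer(
    max_epochs=10000,
    logger=TensorBoardLogger(save_dir="store", version=log_path),
)
# trainer.fit(model, train_dataloaders, val_dataloaders)  # model and dataloaders are assumed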

What version are you seeing the problem on?

v2.1

How to reproduce the bug

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = torch.sqrt(self.loss_fn(y_hat, y))
    self.log("train_loss", loss, logger=True, prog_bar=True, on_epoch=True)
    return loss

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self.forward(x)
    loss = torch.sqrt(self.loss_fn(y_hat, y))
    self.log("valid_loss", loss, logger=True, prog_bar=True, on_epoch=True)
    return loss

pl.Trainer(..., logger=TensorBoardLogger(save_dir='store', version=log_path), ...)  # you can use any path you like

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): 
#- PyTorch Lightning Version (e.g., 1.5.0): 2.1.3
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9): 2.1.2
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip 
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

@ryan597
Contributor

ryan597 commented May 3, 2024

The graphs shown are correct: the x-axis is the step count, and the y-axis shows your epoch number. In the second image, the x-axis is again the steps.

I can see from the y-axis in the first image that your epochs go to 10,000.
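One way to verify this from the written event file (a sketch, not part of the original discussion; the run directory is a placeholder, and "epoch" is the scalar tag Lightning's TensorBoardLogger writes by default):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

acc = EventAccumulator("store/lightning_logs/run_0")  # placeholder run directory
acc.Reload()

events = acc.Scalars("epoch")             # scalar series logged by Lightning
print("last step :", events[-1].step)     # global step, can reach the hundreds of thousands
print("last value:", events[-1].value)    # epoch number, capped at max_epochs (10,000 here)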

@AlbireoBai
Author

The graphs shown are correct: the x-axis is the step count, and the y-axis shows your epoch number. In the second image, the x-axis is again the steps.

I can see from the y-axis in the first image that your epochs go to 10,000.

Thanks a lot. I have realised that I need to take the y-axis (epoch) of the first image as the x-axis of the second image if I want an epoch-vs-valid_loss plot.
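If it helps, one way to get such a chart directly inside TensorBoard is to write an extra scalar whose step axis is the epoch; a minimal sketch, assuming access to the underlying SummaryWriter via self.logger.experiment (the valid_loss_vs_epoch tag and LitModel class are made up for illustration):

import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):  # sketch; the real module is defined elsewhere
    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.forward(x)
        loss = torch.sqrt(self.loss_fn(y_hat, y))
        self.log("valid_loss", loss, logger=True, prog_bar=True, on_epoch=True)
        return loss

    def on_validation_epoch_end(self):
        if self.trainer.sanity_checking:
            return  # skip the sanity-check pass
        # Re-log the epoch-level validation loss with the epoch as the step,
        # so this tag plots valid_loss against epoch rather than global step.
        val = self.trainer.callback_metrics.get("valid_loss")
        if val is not None:
            self.logger.experiment.add_scalar("valid_loss_vs_epoch", val, self.current_epoch)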
