logger in tensorboard.py: ValueError: invalid literal for int() with base 10: '7 (2)' #17691

CherylChaoNYCU · 2023-05-25T03:22:16Z

Bug description

I've been using simpleT5 for text summarization training.
There wasn't any ValueError a few days ago, but when I started training my model again today, I got this error:

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/tensorboard.py in _get_next_version(self)
    305             if self._fs.isdir(d) and bn.startswith("version_"):
    306                 dir_ver = bn.split("_")[1].replace("/", "")
--> 307                 existing_versions.append(int(dir_ver))
    308         if len(existing_versions) == 0:
    309             return 0

ValueError: invalid literal for int() with base 10: '7 (2)'

By the way, the logger is triggered by the model.train() call below:

model.train(train_df=df[0:(int)(0.8*TRAINNING_SIZE)],
            eval_df=df[(int)(0.8*TRAINNING_SIZE):TRAINNING_SIZE], 
            source_max_token_len=MAX_LEN, 
            target_max_token_len=SUMMARY_LEN, 
            batch_size=5, max_epochs=MAX_EPOCHS, outputdir='/content/gdrive/MyDrive/HW5_HL_gen/t5model',use_gpu=True)

Source code showing how simpleT5 uses the logger:
https://github.com/Shivanandroy/simpleT5/blob/main/simplet5/simplet5.py

Why is this happening? Thanks.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

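import torch
from simplet5 import SimpleT5

# df, TRAINNING_SIZE, MAX_LEN and SUMMARY_LEN are not shown in this snippet.

# Load the previously trained checkpoint
model = SimpleT5()
model.load_model("t5",'/content/gdrive/MyDrive/HW5_HL_gen/t5model/simplet5-best', use_gpu=True)
# Alternative: start from the base model instead
# model.from_pretrained(model_type="t5", model_name="t5-base")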
MAX_EPOCHS = 3

torch.cuda.memory_summary(device=None, abbreviated=False)
torch.utils.checkpoint

model.train(train_df=df[0:(int)(0.8*TRAINNING_SIZE)],
            eval_df=df[(int)(0.8*TRAINNING_SIZE):TRAINNING_SIZE], 
            source_max_token_len=MAX_LEN, 
            target_max_token_len=SUMMARY_LEN, 
            batch_size=5, max_epochs=MAX_EPOCHS, outputdir='/content/gdrive/MyDrive/HW5_HL_gen/t5model',use_gpu=True)

Error messages and logs

ValueError                                Traceback (most recent call last)
<ipython-input-8-93e45f3194f4> in <cell line: 13>()
     11 torch.utils.checkpoint
     12 
---> 13 model.train(train_df=df[0:(int)(0.8*TRAINNING_SIZE)],
     14             eval_df=df[(int)(0.8*TRAINNING_SIZE):TRAINNING_SIZE],
     15             source_max_token_len=MAX_LEN,

11 frames
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/tensorboard.py in _get_next_version(self)
    305             if self._fs.isdir(d) and bn.startswith("version_"):
    306                 dir_ver = bn.split("_")[1].replace("/", "")
--> 307                 existing_versions.append(int(dir_ver))
    308         if len(existing_versions) == 0:
    309             return 0

ValueError: invalid literal for int() with base 10: '7 (2)'

Environment

Current environment

#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

Training on Google Colab.

cc @awaelchli


Mionies · 2023-06-18T14:53:23Z

Maybe the cause is the directory names under the log directory: if you train several models, the directories may end up saved with a "(2)" suffix.

awaelchli · 2023-06-20T00:25:41Z

@Mionies That's right. @CherylChaoNYCU for some reason you ended up with folders that have an invalid format. The Trainer would normally save the data in folders like log_dir_name/version_5 where the last part is a number. Perhaps you copied files around manually and then the file browser automatically renamed the colliding files. Is that possible?

On our end, we could improve the error message.
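To illustrate the failure mode described above, here is a minimal sketch of version scanning that skips malformed folder names instead of raising. It is not the actual Lightning code: `next_version` is a hypothetical helper, it uses `os` rather than the logger's filesystem object, and returning the highest existing version plus one is an assumption.

```python
import os


def next_version(root_dir: str) -> int:
    """Hypothetical helper: pick the next version number under root_dir."""
    existing_versions = []
    for name in os.listdir(root_dir):
        path = os.path.join(root_dir, name)
        if os.path.isdir(path) and name.startswith("version_"):
            dir_ver = name.split("_")[1]
            try:
                existing_versions.append(int(dir_ver))
            except ValueError:
                # A name such as "version_7 (2)" is ignored rather than
                # crashing with "invalid literal for int()".
                continue
    if not existing_versions:
        return 0
    # Assumption: the next version is one above the highest valid one.
    return max(existing_versions) + 1
```

With folders version_0, version_7 and "version_7 (2)" present, this sketch ignores the renamed copy and continues numbering from the valid folders.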

renero · 2023-08-20T18:23:43Z

Same problem here. The solution was to rename/remove the invalid name for the TensorBoard logs directory.

Thanks!
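For anyone hitting the same error, a small sketch for locating the offending folders before renaming or removing them by hand. `find_invalid_version_dirs` is a hypothetical helper, not part of Lightning, and the rule that a valid name is exactly version_ followed by digits is an assumption based on the traceback above.

```python
import os
import re

# Assumption: valid folders are named exactly "version_<digits>".
VALID_NAME = re.compile(r"version_\d+")


def find_invalid_version_dirs(log_dir: str) -> list:
    """Return version-like folder names that would not parse as an integer."""
    invalid = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isdir(path) or not name.startswith("version_"):
            continue  # unrelated files and folders are left alone
        if VALID_NAME.fullmatch(name) is None:
            invalid.append(name)
    return invalid
```

Point it at the directory that contains the version_N folders and print the result. The function only reports names, so nothing is deleted automatically.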

CherylChaoNYCU added the bug and needs triage labels May 25, 2023

github-actions bot added the ver: 2.0.x label May 25, 2023

Borda added the logger: tensorboard label and removed the needs triage label Jun 19, 2023

awaelchli added this to the 2.0.x milestone Jun 20, 2023

awaelchli self-assigned this Jul 27, 2023

Borda modified the milestones: 2.0.x, 2.1.x Oct 12, 2023

awaelchli mentioned this issue Oct 31, 2023

Fix parsing of version in TensorBoardLogger and CSVLogger #18897

Merged

awaelchli closed this as completed in #18897 Nov 1, 2023
