logger in tensorboard.py: ValueError: invalid literal for int() with base 10: '7 (2)' #17691

Closed
CherylChaoNYCU opened this issue May 25, 2023 · 3 comments · Fixed by #18897

Comments


CherylChaoNYCU commented May 25, 2023

Bug description

I've been using simpleT5 for text summarization training.
There was no ValueError a few days ago, but when I started training my model again today, I got this error:

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/tensorboard.py in _get_next_version(self)
    305             if self._fs.isdir(d) and bn.startswith("version_"):
    306                 dir_ver = bn.split("_")[1].replace("/", "")
--> 307                 existing_versions.append(int(dir_ver))
    308         if len(existing_versions) == 0:
    309             return 0

ValueError: invalid literal for int() with base 10: '7 (2)'

By the way, the logger is triggered by the model.train() call below:

model.train(train_df=df[0:(int)(0.8*TRAINNING_SIZE)],
            eval_df=df[(int)(0.8*TRAINNING_SIZE):TRAINNING_SIZE], 
            source_max_token_len=MAX_LEN, 
            target_max_token_len=SUMMARY_LEN, 
            batch_size=5, max_epochs=MAX_EPOCHS, outputdir='/content/gdrive/MyDrive/HW5_HL_gen/t5model',use_gpu=True)

Source code for simpleT5, which uses the logger:
https://github.com/Shivanandroy/simpleT5/blob/main/simplet5/simplet5.py

Why is this happening? Thanks.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

from simplet5 import SimpleT5

#load pretrain
model = SimpleT5()
model.load_model("t5",'/content/gdrive/MyDrive/HW5_HL_gen/t5model/simplet5-best', use_gpu=True)
#
#model.from_pretrained(model_type="t5", model_name="t5-base")
MAX_EPOCHS = 3

torch.cuda.memory_summary(device=None, abbreviated=False)
torch.utils.checkpoint

model.train(train_df=df[0:(int)(0.8*TRAINNING_SIZE)],
            eval_df=df[(int)(0.8*TRAINNING_SIZE):TRAINNING_SIZE], 
            source_max_token_len=MAX_LEN, 
            target_max_token_len=SUMMARY_LEN, 
            batch_size=5, max_epochs=MAX_EPOCHS, outputdir='/content/gdrive/MyDrive/HW5_HL_gen/t5model',use_gpu=True)

Error messages and logs

ValueError                                Traceback (most recent call last)
<ipython-input-8-93e45f3194f4> in <cell line: 13>()
     11 torch.utils.checkpoint
     12 
---> 13 model.train(train_df=df[0:(int)(0.8*TRAINNING_SIZE)],
     14             eval_df=df[(int)(0.8*TRAINNING_SIZE):TRAINNING_SIZE],
     15             source_max_token_len=MAX_LEN,

11 frames
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/tensorboard.py in _get_next_version(self)
    305             if self._fs.isdir(d) and bn.startswith("version_"):
    306                 dir_ver = bn.split("_")[1].replace("/", "")
--> 307                 existing_versions.append(int(dir_ver))
    308         if len(existing_versions) == 0:
    309             return 0

ValueError: invalid literal for int() with base 10: '7 (2)'

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

Training on Google Colab.

cc @awaelchli

CherylChaoNYCU added the "bug" and "needs triage" labels on May 25, 2023

Mionies commented Jun 18, 2023

Maybe the cause is the directory names under the log directory: if you train several models, the directories are then saved with "(2)" in the name.
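
For illustration, here is a minimal sketch of why such a name breaks the parsing. It copies only the two statements visible in the traceback (the `split`/`replace` line and the `int()` call); `parse_version` is a hypothetical helper name used here for the demo, not a Lightning function.

```python
def parse_version(basename: str) -> int:
    """Hypothetical helper wrapping the two lines shown in the traceback."""
    # Take everything after the first underscore and drop any slash.
    dir_ver = basename.split("_")[1].replace("/", "")
    # int() only accepts a plain integer literal, so "7 (2)" is rejected.
    return int(dir_ver)


print(parse_version("version_7"))  # 7

try:
    parse_version("version_7 (2)")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: '7 (2)'
```

A directory named `version_7 (2)` still passes the `startswith("version_")` check, so it reaches `int()` with the suffix `7 (2)` and raises.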

Borda added the "logger: tensorboard" label and removed the "needs triage" label on Jun 19, 2023

awaelchli (Member) commented

@Mionies That's right. @CherylChaoNYCU for some reason you ended up with folders that have an invalid format. The Trainer would normally save the data in folders like log_dir_name/version_5 where the last part is a number. Perhaps you copied files around manually and then the file browser automatically renamed the colliding files. Is that possible?

On our end, we could improve the error message.
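
As a rough sketch of what more tolerant version detection could look like (an assumption for illustration, not necessarily what the fix in #18897 does), one could skip any directory whose name is not exactly `version_<digits>`. The function name `next_version` and the use of plain `os` calls instead of the logger's filesystem object are choices made for this example.

```python
import os
import re
import tempfile

# Accept only names of the exact form "version_<digits>".
_VERSION_RE = re.compile(r"version_(\d+)")


def next_version(root_dir: str) -> int:
    """Hypothetical helper: return the next free version number under root_dir."""
    if not os.path.isdir(root_dir):
        return 0
    existing = []
    for name in os.listdir(root_dir):
        match = _VERSION_RE.fullmatch(name)
        # Names such as "version_7 (2)" do not match and are ignored.
        if match and os.path.isdir(os.path.join(root_dir, name)):
            existing.append(int(match.group(1)))
    return max(existing) + 1 if existing else 0


# Small demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("version_0", "version_5", "version_7 (2)"):
        os.mkdir(os.path.join(tmp, name))
    demo_result = next_version(tmp)
    print(demo_result)  # 6
```

Silently skipping malformed names is one design choice; raising an error that names the offending directory would be the alternative closer to "improve the error message".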

awaelchli added this to the 2.0.x milestone on Jun 20, 2023
awaelchli self-assigned this on Jul 27, 2023

renero commented Aug 20, 2023

Same problem here. The solution was to rename or remove the invalidly named directory inside the TensorBoard logs directory.

Thanks!
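
To locate such directories before renaming or removing them by hand, a small script along these lines may help. It is only a sketch: `find_invalid_version_dirs` is a hypothetical helper, and it assumes the problem directories start with `version_` but do not end in a plain number. It only lists paths and deletes nothing.

```python
import os
import tempfile


def find_invalid_version_dirs(root: str) -> list[str]:
    """Hypothetical helper: list "version_*" directories with a non-numeric suffix."""
    bad = []
    for parent, dirnames, _files in os.walk(root):
        for name in dirnames:
            if not name.startswith("version_"):
                continue
            suffix = name[len("version_"):]
            # Valid names end in ASCII digits only, e.g. "version_7".
            if not (suffix.isascii() and suffix.isdigit()):
                bad.append(os.path.join(parent, name))
    return sorted(bad)


# Demo in a throwaway directory; point the function at your own output directory instead.
with tempfile.TemporaryDirectory() as tmp:
    logs = os.path.join(tmp, "lightning_logs")
    for name in ("version_0", "version_7", "version_7 (2)"):
        os.makedirs(os.path.join(logs, name))
    demo_bad = [os.path.relpath(p, tmp) for p in find_invalid_version_dirs(tmp)]
    print(demo_bad)
```

In the demo, only the `version_7 (2)` directory is reported; the properly numbered ones are left alone.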
