Summaries require names of format name/tag #126

Open
nomadbl opened this issue Jul 1, 2023 · 2 comments


nomadbl commented Jul 1, 2023

After updating to ProtoBuf 1.0.0 (#124), I found that summaries are not logged correctly to TensorBoard.
Some of them do get logged but some don't. I suspect that some summaries are fine, but once one of them is logged incorrectly, the file or TensorBoard stops registering the ones that follow.

I prepared a minimal reproduction by updating the Flux example to the new Flux API (the existing example uses a deprecated API):
nomadbl@fc9ba3e

During logging I observe this error message:

[2023-07-01T23:12:43Z WARN  rustboard_core::run] Read error in ./content/log/events.out.tfevents.1.68825314069885e9.lior-HP-Pavilion-Laptop-15-cs3xxx: ReadRecordError(BadLengthCrc(ChecksumError { got: MaskedCrc(0x85987b32), want: MaskedCrc(0x00000000) }))

After some googling, I can only speculate that this is related to multiprocessing, with the file being written by multiple instances of the logger in different threads.
So far I have tried (without success) to fix it under that assumption by making the logger lock the file:
src/TBLogger.jl, line 119:
file = open(fpath, "w"; lock=true)

Any other ideas or insights are welcome. I'll try to isolate the issue using the reproduction above.
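
One way to narrow down where the event file goes bad is to walk its records and recompute the checksums. The sketch below assumes the standard TFRecord framing that TensorBoard event files use (an 8-byte little-endian length, a masked CRC-32C of the length, the payload, then a masked CRC-32C of the payload). The names `masked_crc` and `first_bad_record` are hypothetical helpers for illustration, not part of TensorBoardLogger.jl, and this is not a fix for the underlying bug.

```julia
using CRC32c  # Julia standard library

# TFRecord "masked" CRC-32C: rotate the CRC right by 15 bits, then add a constant.
# UInt32 arithmetic wraps around, which is what the format expects.
function masked_crc(bytes::Vector{UInt8})
    c = crc32c(bytes)
    return ((c >> 15) | (c << 17)) + 0xa282ead8
end

# Hypothetical helper: return the 1-based index of the first record whose
# framing or checksums do not validate, or `nothing` if every record is valid.
function first_bad_record(bytes::Vector{UInt8})
    n = length(bytes)
    pos = 1
    index = 0
    while pos <= n
        index += 1
        pos + 11 <= n || return index                 # truncated header
        header = bytes[pos:pos+7]                     # 8-byte length field
        len = ltoh(reinterpret(UInt64, header)[1])
        len_crc = ltoh(reinterpret(UInt32, bytes[pos+8:pos+11])[1])
        masked_crc(header) == len_crc || return index # bad length checksum
        data_start = pos + 12
        data_end = data_start + Int(len) - 1
        data_end + 4 <= n || return index             # truncated payload
        payload = bytes[data_start:data_end]
        data_crc = ltoh(reinterpret(UInt32, bytes[data_end+1:data_end+4])[1])
        masked_crc(payload) == data_crc || return index # bad payload checksum
        pos = data_end + 5
    end
    return nothing
end
```

Calling `first_bad_record(read(path))` on an event file (with `path` pointing at the file in question) gives the index of the first record that fails, which can then be matched against the order of the logging calls.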


nomadbl commented Jul 2, 2023

I managed to alter the Flux example so that the bug does not occur:
nomadbl@e0f2245

The trick was to change lines like
@info "train" loss=loss_fn(pred, y) acc=accuracy(pred, y)
into
@info "train/vals" loss=loss_fn(pred, y) acc=accuracy(pred, y)

That is, the bug is somehow related to tag names.

nomadbl changed the title from "Summaries missing when logging from Flux training loop" to "Summaries require names of format name/tag" on Jul 2, 2023

nomadbl commented Jul 2, 2023

Since the workaround above seems to work, I'm leaving this for now.
I suspect that this has to be fixed by setting node_name or tag correctly in Summary_Values (i.e. var"Summary.Value").
I wasn't able to determine how to do this from the TensorBoard/TensorFlow documentation. It looks like a fairly in-depth understanding is required there.
