Line delimited JSON for logs #927
Comments
Sorry for the late reply! Before giving any meaningful comments, I would like to ask the following question: do you plan to keep random access possible for this JSON-based log? I do not immediately see how this can be done, whereas for the in-memory and SQLite backends that we currently support it is trivial.
Although I put this as a Blocks issue, because I can't think of a better place, this isn't really a drop-in replacement for either of the logging backends, and the proposal is very much supposed to be framework-agnostic. Blocks' use for logging is quite specific (since extensions use it to communicate), whereas here I'm aiming more for a minimalist "standard" that harmonizes the hacky ways people currently save and analyze logs. Needing to unpickle a Python object and then load that into a Pandas dataframe just to be able to do rudimentary analysis is overkill. Some people instead resort to grepping through the printed log statements, which is equally unsatisfactory. Having a standard logging format that is used throughout the lab would allow us to share simple data analysis and plotting scripts, live plotting servers, etc., regardless of the framework people use. So, the question here is more about what format to store data in on disk and send over the network so that it can easily be consumed and analyzed. Random access isn't really my concern then. To make this work in the context of Blocks, you'd probably have to combine an in-memory or SQLite backend with an extension that adheres to this specification for saving the logs to disk/sending them over the network. I understand that it isn't entirely satisfactory from Blocks' perspective, because we now have a backend which is separate from the serialization, but I personally can't think of a solution that can easily support random access, streaming to disk, and pushing over the network, while also producing files that are universally easy to analyze and interpret.
Okay, I guess we could support access only to the last few iterations in the respective Blocks wrapper. More questions: …
Yes, iteration number would just be a key in the JSON log. That makes the format entirely agnostic to things like iterations or epochs. Tools built on top of this format would just use the default convention that one entry is one datapoint, with the option of using any key in the log as a time measure instead. Although I understand the desire to store the state, I would not want to store it in the same file.
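To make the convention concrete, here's a minimal sketch (the key names are made up for illustration) of what one-entry-per-line JSON logging looks like with nothing but the standard library:

```python
import json

# Hypothetical log entries: one JSON object per line, where "iteration"
# is simply another key rather than a special index.
entries = [
    {"iteration": 0, "loss": 2.31},
    {"iteration": 1, "loss": 1.87, "valid_loss": 1.95},  # entries can be ragged
]

# Writing: serialize each entry and separate entries with newlines.
lines = "\n".join(json.dumps(entry) for entry in entries)

# Reading back is a one-liner; any key (e.g. "iteration") can then serve
# as the time axis for analysis or plotting.
parsed = [json.loads(line) for line in lines.splitlines()]
print(parsed[1]["valid_loss"])
```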
I don't see a problem with having two files, e.g.

```shell
for model in *.config
do
  learning_rate=$(jq .learning_rate < "$model")
  loss=$(jq -s 'min_by(.loss).loss' < "${model%.*}.log")
  echo "Learning rate $learning_rate had loss $loss"
done
```

And do all the processing you want that way.
All right, let's assume for now that you convinced me that the status … My next question is whether it is creation of a new standalone repository … That being said, my opinion is that creating a new repository which … So I think it would be very practical to develop these new mini-frameworks …
I don't think pylearn2 not switching to Fuel should be taken as evidence of …
I think the consideration shouldn't just be the size of the codebase; a bigger consideration is actually how large and changeable the public API is. Take … In the long term, if we end up switching from Theano to TensorFlow for example, having Fuel separate will be a big win. Similarly, this logging mini-framework would have no dependencies on Theano (or Blocks) whatsoever and could be more future-proof as well as more widely applicable as a separate package.
Picklable-itertools has an extremely clear purpose. I don't think that a … Large part of Fuel will not be really useful for Tensorflow, since data … In short, I am really tired of this DJ experience. Btw, look at Tensorflow: …
Within a company it's easier to create a monolithic system, or an intricate web of dependencies, because you have full-time software engineers responsible for resolving dependency issues, and your software is never used as part of a wider software stack that it needs to play nice with. For open source software I think the ideal should be different: separate codebases for separate projects are important, to enforce modularity, avoid coupling, and simplify use. For example, I personally don't like having to install two dozen dependencies for Blocks if I just want one particular, tiny functionality. At Google everything is part of a single, supported software stack that each employee is expected to run, so our approaches aren't very comparable. As an alternative comparison, have a look at all the different repositories under the Torch umbrella.

I think the issues you highlight are basically issues with Python's package management, the way Blocks and Fuel are managed at the lab, our lack of a unified testing infrastructure, and GitHub's inability to make pull requests that span multiple repositories. I'd be much happier trying to find practical solutions to those problems than throwing everything into a single, monolithic codebase which is likely to rot eventually, where interdependencies are easily sneaked in, the list of external dependencies grows with each new functionality, where it becomes harder to swap or replace particular parts of the stack as time goes on, and where the number of issues/PRs per day becomes overwhelming because they are not separated by functionality.

If the management of multiple repositories really becomes too much of a burden, I'd go down the route of writing a few helper scripts instead, which allow us to make pull requests to Blocks and Fuel simultaneously and ensure that their test suites are then run against each other. Likewise, we should maybe find a way of not needing to duplicate …
Concerning practical things: I took a big log from our speech experiment and converted it into JSON Lines format, here is a gist. Surprisingly, JSON takes a bit less space than cPickle, and gzipped JSON is much smaller: …
It might be misleading: the log is not zipped despite …
Thanks, the size of the compressed JSON files is very encouraging!
Indeed! Dima, can you please check that from this particular pickle the … Concerning one repo vs many: I think we should have a meeting next week and …
You are right, it was saved with protocol 0; here are the sizes with protocols 1 and 2, and gzipped: …
I guess that seems more realistic. I'm slightly curious though about the role of NumPy arrays here; if 95% of the log data is array data, the actual numbers don't tell you much about the differences between the data formats. That said, a 35% decrease in size is still pretty good. Also, I guess this is another argument for dropping pickling, because it's clear that even experienced users can forget to specify the right protocol and end up with unnecessarily large files. I'll send an e-mail about a design meeting in a bit.
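The effect of the pickle protocol and of gzip is easy to reproduce on synthetic data. This sketch uses a made-up log of small dicts, so the exact numbers will differ from the speech log, but the ordering (ASCII protocol 0 largest, gzipped JSON smallest) should hold for this kind of repetitive data:

```python
import gzip
import json
import pickle

# A synthetic log of small dicts, standing in for a real training log.
log = [{"iteration": i, "loss": 1.0 / (i + 1)} for i in range(1000)]

raw_json = "\n".join(json.dumps(entry) for entry in log).encode("utf-8")
gz_json = gzip.compress(raw_json)
pickle_p0 = pickle.dumps(log, protocol=0)  # ASCII protocol, typically largest
pickle_p2 = pickle.dumps(log, protocol=2)  # binary protocol, much smaller

for name, blob in [("json", raw_json), ("json.gz", gz_json),
                   ("pickle p0", pickle_p0), ("pickle p2", pickle_p2)]:
    print(name, len(blob))
```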
There are no large numpy arrays in this log. I think it is mostly field …
First code here: https://github.com/bartvm/mimir

Once it's more stable I can merge it into Blocks, but a separate repo was easier to start with (since it contains C and Cython code).
I implemented all the important functionality, and am quite happy with it. In the end it does quite a bit of nice magic (like using threads in order to respond to clients who want data for live plotting or analysis, and using two buffers to make sure that writing the JSON data to disk is both efficient and well-compressed while still being relatively fail-safe). Integrating it into Blocks might not be entirely straightforward because of the nightmares of pickling file objects (and in this case, text wrappers over buffered streams over Cython extension types that hold file handles...).
You are using Bokeh 0.11 inside a notebook for live plotting, right? It seems like a very nice solution to mila-iqia/blocks-extras#35.
I think IPython notebook is a pretty heavy dependency for that, and if I …
Yes, I think I use the latest Bokeh. IPython Notebook is not a dependency though; there's also the option of using the Bokeh server. That said, sending/receiving data is entirely done through ZMQ, so you can use whatever plotting framework you want.
Err, the "Bokeh server" (as in bokeh-server executable) doesn't exist On Fri, Feb 12, 2016 at 12:33 PM, Bart van Merriënboer <
|
There's the http://bokeh.pydata.org/en/latest/docs/user_guide/server.html#specifying-output-server
Looks nice! As for Blocks integration, can't we create a new … I am also curious whether there is support for multiple processes writing to the same file. I recently had a use case where it was necessary.
@rizar, I think we can do it: the logger appends to the file, so it shouldn't be a problem. A small trick is needed, though, to fetch the last row from the log (it's needed by some extensions). I'm trying to implement it now and I'll see if it works fine.
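One way that "small trick" could look (my own sketch, not mimir's actual implementation, and assuming an uncompressed log file): seek to the end and scan backwards until the last complete line is in hand, so the whole log never has to be read:

```python
import json
import os
import tempfile

def last_entry(filename):
    """Return the last JSON entry of a newline-delimited log without
    reading the whole file: read blocks from the end until a full
    last line is available."""
    with open(filename, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        data = b""
        while size > 0:
            step = min(1024, size)
            size -= step
            f.seek(size)
            data = f.read(step) + data
            # Two newlines (or reaching the start) guarantees the last
            # line is complete, including its trailing newline.
            if data.count(b"\n") >= 2 or size == 0:
                break
        return json.loads(data.strip().splitlines()[-1])

# Usage: append two entries, then fetch only the last one.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write(json.dumps({"iteration": 0}) + "\n")
    f.write(json.dumps({"iteration": 1, "loss": 0.5}) + "\n")
print(last_entry(f.name))
```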
Atomicity of the appends is a concern, isn't it? It sounded from what Bart …
Writing to the same file from multiple processes isn't trivial. You'd need to lock the files to make things atomic like @dwf said, and I think that would be way too slow for many workers. Since ZMQ is already a dependency we could consider implementing something like this, but it would mean you need to designate a single master process to collect the log entries from the workers. I think this is overkill though. Since the log is just line-delimited JSON, why not have each process write to a separate log and then just `cat` them together afterwards?
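A sketch of that workflow (the worker logs and key names here are hypothetical): because each entry is a self-contained line, concatenation is trivially valid, and entries can be re-ordered by any key afterwards if needed:

```python
import json

# Two hypothetical per-worker logs, already in line-delimited JSON.
worker_0 = '{"iteration": 0, "worker": 0}\n{"iteration": 2, "worker": 0}\n'
worker_1 = '{"iteration": 1, "worker": 1}\n'

# "cat" the logs together; the result is still valid line-delimited JSON...
combined = worker_0 + worker_1

# ...and, if needed, sort the entries by iteration for analysis.
entries = sorted((json.loads(line) for line in combined.splitlines()),
                 key=lambda e: e["iteration"])
print([e["iteration"] for e in entries])
```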
👍
All right, that sounds reasonable. Sorry for being too lazy to read the code, but can you please explain how …
It's simply a ZMQ publish and subscribe pair of sockets (PUB-SUB). The log simply pushes the log entries as JSON strings over the socket, and all the clients that are subscribed will receive them. There's also a "persistent" client-server interface which is slightly more complex. In this case the server starts a separate thread. Each log entry is pushed over the PUB socket to the clients, but also to the other thread. This separate thread has a ROUTER socket. Now, if a client wants to join, this happens: …
Meanwhile, the main thread sends each log entry to the subscribed clients (PUB) as well as to the second separate thread where it will be stored for potential future clients to receive.
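The PUB-SUB half described above can be sketched with pyzmq. The socket address and entry contents below are made up, and the inproc transport is used only to make the demo self-contained in one process (a real logger would presumably bind a TCP port); the pattern of JSON-strings-over-PUB is the same:

```python
import json
import threading
import time

import zmq

ctx = zmq.Context.instance()

# The logger side: a PUB socket broadcasting each entry as a JSON string.
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://log")

received = []

def subscriber():
    # A client subscribes to everything and decodes entries as they arrive.
    sub = ctx.socket(zmq.SUB)
    sub.connect("inproc://log")
    sub.setsockopt_string(zmq.SUBSCRIBE, "")
    sub.setsockopt(zmq.RCVTIMEO, 5000)  # don't block forever in a demo
    received.append(json.loads(sub.recv_string()))
    sub.close()

t = threading.Thread(target=subscriber)
t.start()
time.sleep(0.2)  # let the subscription register; PUB drops messages otherwise
pub.send_string(json.dumps({"iteration": 0, "loss": 1.5}))
t.join()
pub.close()
print(received)
```

The sleep works around ZMQ's "slow joiner" behavior: a PUB socket silently discards messages sent before a subscriber has finished connecting, which is also why late-joining clients need the separate ROUTER-based persistent interface described above.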
So although catting a bunch of log files together is fine for analysis, it's too bad that you lose the ability to live plot, print to a single output, etc., so I ended up writing a simple log for multi-GPU training anyway: bartvm/mimir@39b9401

Still need to write docstrings, but the usage is really simple:

```python
# Main process
from mimir import ServerLogger

# Accepts the same arguments as a normal Logger object;
# blocks until all workers are done
logger = ServerLogger(filename='combined_log.jsonl.gz', stream=True)
```

```python
# A worker
from mimir import RemoteLogger

# A remote logger only needs connection details, localhost:5555 by default
logger = RemoteLogger()

# Log away
logger.log({'iteration': 0})
```

The server logger receives the log entries, adds a unique ID that identifies each remote logger, and then handles them like it usually would (so it streams them, writes them to a log file, etc.). Some current shortcomings: …
So I've been frustrated with the current state of logging in Blocks for a while. Pickled Python dictionaries have problematic memory requirements and aren't ideal for analysis given all the issues that come with pickling. The SQLite backend is not very user-friendly and doesn't do well with non-tabular logs, and is not straightforward to analyze either if you don't know SQL.
I've been trying to find an alternative that addresses these issues and meets the following requirements: … for example, being able to run

```shell
rm !("$(paste <(ls *.log | get_validation_error) <(ls *.model) | sort -g | cut -f 2)")
```

in order to delete all but the best model.

Given these requirements, I want to propose storing logs in newline-delimited JSON from now on, and to write a small toolbox to deal with this format. The advantages I imagine are as follows:
… logs can be processed with command-line tools like `jq`, enabling things like `jq -s 'max_by(.validation_error).training_error' < log`, and everything is streaming (yay). …

My proposal would be to write a mini-framework using this logging approach that is separate from Blocks, and then write Blocks extensions that use it. Basic functionality would be a `log` method (which just takes a Python `dict`) with several handlers: `stdout` or `stderr`, …, `gunzip` and `jq`.
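A minimal version of what such a mini-framework's core could look like (the class name and handler interface here are my own invention for illustration, not the actual proposal): a `log` method takes a `dict`, serializes it to one JSON line, and fans it out to handlers, which are just callables:

```python
import json
import sys

class Logger:
    """Sketch: log() takes a dict and fans the JSON line out to handlers."""

    def __init__(self, handlers=None):
        self.handlers = handlers if handlers is not None else []

    def log(self, entry):
        line = json.dumps(entry)
        for handler in self.handlers:
            handler(line)

# Handlers are plain callables receiving the serialized line; here one
# collects lines in memory and another prints them to stderr.
lines = []
logger = Logger(handlers=[lines.append,
                          lambda line: print(line, file=sys.stderr)])
logger.log({"iteration": 0, "loss": 2.3})
print(lines[0])
```

A file handler, a gzip handler, or a ZMQ-publishing handler would slot into the same interface without the core needing to know about them.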
I'd love to know if people think this would meet all their logging needs. Logging is something that everyone ends up having to do, regardless of the framework they use, and I think it would be nice to have a clean and simple solution for it that is widely supported and used in the lab.
The main shortcoming of the format I can think of is that it doesn't allow storing any metadata about the model, e.g. the learning rate, momentum velocity, etc. This makes it harder to do things like "plot minimum validation error vs. learning rate". Perhaps we could store this information in a separate JSON file? Or maybe it shouldn't be part of this mini-framework at all?