Consolidate train_net and test_net in memory #119
This will bring substantial improvements in both aspects. I will be able to run a large net on my modest GPU. @jeffdonahue, your SplitLayer #114 is wonderful! Do you have any suggestions about this issue?
Not setting test_net in the solver.prototxt removes the test_net_ initialization altogether, and setting test_interval to 0 effectively disables Solver::test() during training. But combining the nets is still very important when studying optimization algorithms such as adaptive learning rates #30 or accelerated momentum #53. Without test results at fixed intervals, it is impossible to compare the effects of different algorithms and parameters.
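To illustrate the point about test_interval, here is a minimal standalone sketch of a training loop that skips the test pass entirely when the interval is 0. The struct and function names are hypothetical stand-ins, not Caffe's actual Solver code:

```cpp
#include <cstdio>

// Hypothetical stand-in for the solver settings discussed above;
// not Caffe's real SolverParameter proto.
struct SolverSettings {
  int max_iter = 1000;
  int test_interval = 0;  // 0 disables the periodic test pass
};

void TrainStep(int iter) { /* one forward/backward/update step */ (void)iter; }
void TestPass(int iter) { std::printf("testing at iter %d\n", iter); }

int main() {
  SolverSettings s;
  s.test_interval = 0;  // as suggested above, effectively disables testing
  for (int iter = 1; iter <= s.max_iter; ++iter) {
    TrainStep(iter);
    // The test pass only runs when a positive interval is configured.
    if (s.test_interval > 0 && iter % s.test_interval == 0) {
      TestPass(iter);
    }
  }
  return 0;
}
```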
This would be a great pull request if done properly -- and it's far from trivial. None of the core Caffe developers are currently working on this, so we would certainly appreciate a contribution!
If #57 (Consolidate network definitions) is not solved first, the solution to this issue would involve merging the NetParameter of the train net and the test net, which is the reverse operation of @jeffdonahue's src/caffe/util/insert_splits.cpp. Once #57 is resolved, that merging step becomes unnecessary. To distinguish the layers that belong to only one of the nets, the LayerParameter proto needs at least one more field flagging which nets use the layer, which is exactly what #57 requires. In summary, a thorough solution should deal with #57 first.
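A minimal sketch of the per-layer phase flag idea, assuming a hypothetical field on each layer definition (these structs are stand-ins, not Caffe's NetParameter/LayerParameter protos): instantiating a net for one phase keeps only the layers flagged for it, so a single merged definition can yield both nets.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for a merged net definition. Each layer carries the
// set of phases ("train", "test") it participates in -- the extra field
// proposed above; an empty set means the layer is shared by every phase.
struct LayerDef {
  std::string name;
  std::set<std::string> phases;
};

struct NetDef {
  std::vector<LayerDef> layers;
};

// Keep only the layers that are flagged for the requested phase.
NetDef FilterByPhase(const NetDef& merged, const std::string& phase) {
  NetDef out;
  for (const LayerDef& layer : merged.layers) {
    if (layer.phases.empty() || layer.phases.count(phase) > 0) {
      out.layers.push_back(layer);
    }
  }
  return out;
}

int main() {
  NetDef merged;
  merged.layers = {{"train_data", {"train"}},
                   {"test_data", {"test"}},
                   {"conv1", {}},            // shared by both phases
                   {"accuracy", {"test"}}};
  NetDef train_net = FilterByPhase(merged, "train");
  NetDef test_net = FilterByPhase(merged, "test");
  return 0;
}
```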
@kloudkl I don't quite understand "which is the reverse operation of @jeffdonahue's src/caffe/util/insert_splits.cpp". |
I don't think this is a big issue. While it is tempting to save duplicated [...]

If memory is an issue, one can always do training only, and write a [...]

Yangqing
@Yangqing at this point using [...]
Why not just separate the data blobs from the layers so that the layers can accept data of any batch size (#166)? After all, we are all used to functions and methods that can process containers such as vector or map of arbitrary sizes.
That is possible and already supported by caffe, but keep in mind that [...]

Also, the data blobs are separated from the layers; they are managed by [...]

My argument is that, if your testing does not fit into memory, don't do [...]

Yangqing
Some layers fix the batch size num_ in their SetUp methods and iterate over num_ rather than over the actual batch sizes of the blobs passed to the Forward* and Backward* methods. If the actual batch size is smaller than num_, there will be out-of-bounds memory accesses and segmentation faults; if it is larger, some data points go unprocessed. The layers that preset the batch sizes they accept are ConvolutionLayer, LRNLayer, FlattenLayer, InnerProductLayer, and PaddingLayer (already killed in the dev branch). The other layers either perform element-wise operations or already permit flexible batch sizes.

As long as the batch sizes of the bottom and top arguments do not exceed the available memory, there is no need for them to equal a fixed batch size, and the batchsize field in the proto can be removed. To decide when memory needs to be allocated on the fly, we should check that the batch sizes of the top blobs are no less than those of the bottom blobs and allocate if necessary. The memory allocated for each layer is then just big enough to hold the largest batch that has run through it.

With regard to the concern about frequent memory deallocation and reallocation, we can permit the blobs to grow in batch size but not to shrink, so memory is allocated lazily and reused once allocated. If memory is scarce and needs to be reclaimed, shrinking can happen only after a relatively long inactive period. In the use case of merging train_net and test_net, the phase with the smaller batch size simply reuses the portion of the pre-allocated memory that it actually requires.
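A minimal sketch of the grow-only allocation policy described above, using a hypothetical blob class (not Caffe's caffe::Blob): Reshape records the new logical batch size but only reallocates when the requested element count exceeds the current capacity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a data blob: memory is allocated lazily and only
// grows, implementing the "grow but never shrink" policy described above.
class GrowOnlyBlob {
 public:
  void Reshape(int num, int channels, int height, int width) {
    num_ = num;
    count_ = static_cast<std::size_t>(num) * channels * height * width;
    if (count_ > data_.size()) {
      data_.resize(count_);  // grow; smaller batches reuse this buffer
    }
  }

  int num() const { return num_; }               // current logical batch size
  std::size_t capacity() const { return data_.size(); }

 private:
  int num_ = 0;
  std::size_t count_ = 0;
  std::vector<float> data_;
};

int main() {
  GrowOnlyBlob blob;
  blob.Reshape(256, 3, 32, 32);  // train phase: large batch, allocates
  blob.Reshape(50, 3, 32, 32);   // test phase: smaller batch, reuses memory
  blob.Reshape(256, 3, 32, 32);  // back to train: no reallocation needed
  return 0;
}
```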
Thank you for your work on consolidating the weight blobs between the test and train nets. However, I noticed that caffe still allocates separate memory for both the train and test data blobs.
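For illustration, a hedged sketch of the situation described above, with hypothetical stand-in types rather than Caffe's real classes: the two nets can point at the same weight blobs, but each still owns its own activation/data blobs, which is where the duplicated memory remains.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-ins, not Caffe's Net/Blob classes.
struct Blob {
  std::vector<float> data;
};

struct Net {
  std::vector<std::shared_ptr<Blob>> params;  // weights, shareable between nets
  std::vector<Blob> activations;              // data blobs, allocated per net
};

int main() {
  Net train_net;
  train_net.params.push_back(std::make_shared<Blob>());
  train_net.activations.resize(5);

  Net test_net;
  // Sharing the weights is cheap: both nets hold pointers to the same blobs.
  test_net.params = train_net.params;
  // The activation/data blobs, however, are separate allocations per net --
  // the duplicated memory noted in the comment above.
  test_net.activations.resize(5);
  return 0;
}
```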
@shaibagon Did you find any solution for swapping memory when switching between the TRAIN and TEST phases?
@Seanberite I'm afraid not. |
Currently, train_net and test_net are constructed separately from two definition files. As pointed out in #57, the definition files can be consolidated so that a single definition file creates both the train_net and the test_net.
The consolidation can also happen at the memory level, namely using the same net for both training and testing. The layers' forward and backward functions can behave differently at run time according to the Phase parameter.
This will save the memory needed by the test_net, and also save time by avoiding the memory copy from the train_net to the test_net.
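A minimal sketch of phase-dependent forward behavior, assuming a dropout-like operation; the Phase enum and function names here are illustrative, not Caffe's actual Layer interface.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// One layer instance serves both phases; its forward pass checks the phase at
// run time, so a single net in memory can be used for training and testing.
enum class Phase { TRAIN, TEST };

// Dropout-like behavior: random masking during TRAIN, identity during TEST.
std::vector<float> DropoutForward(const std::vector<float>& bottom,
                                  Phase phase, float ratio = 0.5f) {
  std::vector<float> top(bottom.size());
  for (std::size_t i = 0; i < bottom.size(); ++i) {
    if (phase == Phase::TRAIN) {
      const bool keep = (std::rand() / static_cast<float>(RAND_MAX)) >= ratio;
      top[i] = keep ? bottom[i] / (1.0f - ratio) : 0.0f;
    } else {
      top[i] = bottom[i];  // TEST: pass activations through unchanged
    }
  }
  return top;
}

int main() {
  std::vector<float> activations = {0.5f, 1.0f, 2.0f};
  std::vector<float> train_out = DropoutForward(activations, Phase::TRAIN);
  std::vector<float> test_out = DropoutForward(activations, Phase::TEST);
  return 0;
}
```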