Wrap up SyncedMem resize from @kloudkl; make train/test nets share data blobs #355
Conversation
I just edited history to not rename …
```diff
@@ -305,6 +305,10 @@ $(TEST_BIN_DIR)/%.testbin: $(TEST_BUILD_DIR)/%.o $(GTEST_OBJ) $(STATIC_NAME) \
	-o $@ $(CXXFLAGS) $(LDFLAGS) $(WARNINGS)
	@ echo

$(GTEST_OBJ): $(GTEST_SRC) | $(GTEST_BUILD_DIR)
```
What's up with the Makefile changes introduced in 1d4ea4be7e77203306a157580a44f70fb475093e?
@kloudkl had added the `$(OBJ_BUILD_DIR)/%.cuo:` rule (presumably due to the renaming of `syncedmem.cpp` -> `syncedmem.cu` -- we had no rule for building `*.cu` files in the top-level `src/caffe` directory), but I moved it down in the list because older versions of make (including the one I use, the default Ubuntu installation) use the first matching rule instead of the most specific one, so this rule also matched `*.cu` files in subdirectories; I fixed that by moving it after all other `*.cu` rules.
The rest of the changes were basically style changes, plus adding a dependency on the header files `$(HXX_SRCS)` where I happened to notice there wasn't one before (in one case changing from `$(PROTO_GEN_HEADER)`, which is a subset of `$(HXX_SRCS)`). Sorry for mixing these changes into an unrelated PR... I can remove them from history if desired.
Fine to keep it in this PR for convenience, but could you split the Makefile changes into their own commit (or at least mention them in the message for 1d4ea4b)?
Done -- moved it into a different commit right before the "use handwritten resize methods" one (GitHub displays it as the last commit, though).
Nice memory savings, and reducing the number of allocations is good too! This all looks good to me, and the tests cover the key uses, but I'd rather have another reviewer check this too, since memory is slightly important. Thanks @kloudkl for raising the original PR! Thanks Jeff for this and all your recent sharing commits!
Thanks for taking the time to look over the changes, Evan! Agreed that this isn't a PR we want to merge hastily.
```cpp
void Reshape(const int num, const int channels, const int height,
    const int width);
explicit Blob(const int num = 0, const int channels = 0,
    const int height = 0, const int width = 0);
```
Maybe remove the default arguments? It would be odd to set some to 0 and others to nonzero values, but if all are 0, it is effectively just the default constructor `Blob()`.
@kloudkl got rid of the no-arg blob constructor -- the one with the default args sets member variables and does Reshape, so I don't think it's quite the same?
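To illustrate the distinction, here is a toy sketch (hypothetical names, not the actual patch) of why the defaulted constructor is not simply `Blob()`: even with all-zero arguments it still runs `Reshape`, which sets the shape members and the element count.

```cpp
#include <cstddef>

// Toy model of the constructor under discussion (names illustrative).
class Blob {
 public:
  explicit Blob(const int num = 0, const int channels = 0,
                const int height = 0, const int width = 0) {
    Reshape(num, channels, height, width);  // runs even for all-zero defaults
  }
  void Reshape(const int num, const int channels, const int height,
               const int width) {
    num_ = num; channels_ = channels; height_ = height; width_ = width;
    count_ = static_cast<size_t>(num) * channels * height * width;
  }
  size_t count() const { return count_; }
 private:
  int num_, channels_, height_, width_;
  size_t count_;
};

int main() {
  Blob b;              // all defaults: shape (0,0,0,0), count 0 -- but members are set
  Blob c(2, 3, 4, 5);  // count 120
  return 0;
}
```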
There are a few things that I am a little worried about here, but maybe I am overly cautious, so please kindly check with me: (1) Assuming that we have two blobs referring to the same SyncedMem, and one does a reshape or whatever that changes the SyncedMem storage, either by expanding or shrinking and zero-filling it -- will the other blob play nicely with it? (2) It seems that we never shrink the allocated size, which means "once expanded, always expanded". I am wondering if this could cause problems in the future (for now things seem to be fine in our limited usage of this feature), so it is probably better to document this effect in the comments so one may be aware of it?
```cpp
explicit Blob(const int num = 0, const int channels = 0,
    const int height = 0, const int width = 0);
explicit Blob(Blob* memory_share_blob);
```
IMHO we could use a const reference: `explicit Blob(const Blob& memory_share_blob);`? :)
Thanks, done.
Hey Yangqing -- no need to explain your hesitation, I greatly appreciate feedback from you!

(1) The way I implemented this was indeed (as you said in (2)) to only ever expand the SyncedMemory allocations (`cpu_data_` and `gpu_data_`), and to do so lazily, as with the rest of the SyncedMemory implementation. SyncedMemory maintains the variables `cpu_capacity_` and `gpu_capacity_` to track how much memory is currently allocated. For the use in the current codebase, we would only ever expand the original memory allocation once -- specifically during the first training iteration (since as of the dev branch today we first perform a test pass before training, and training blobs are typically larger than test blobs -- if they were <= the size of the test blobs we'd never expand the original allocation). I can't think of any scenarios in which this would cause problems with Blobs interfering with one another, as they always explicitly set the SyncedMemory size before any data accessor/mutator calls. Obviously this can't be used haphazardly to share SyncedMemorys among any arbitrary set of Blobs, but it seems to work for sharing only among pairs of blobs between the train and test net. Let me know if you think there's something I may be missing though.

(2) Agreed on adding a comment to note that SyncedMemory is only ever expanded -- will do.
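To make the lazy, expand-only behavior concrete, here is a minimal CPU-only sketch, assuming the `cpu_capacity_`/`cpu_resize()` names from the PR description (the actual implementation may differ in details):

```cpp
#include <cstdlib>
#include <cstring>

// Sketch of expand-only, lazy growth (CPU side only; names illustrative).
class SyncedMemory {
 public:
  SyncedMemory() : cpu_ptr_(NULL), size_(0), cpu_capacity_(0) {}
  ~SyncedMemory() { std::free(cpu_ptr_); }

  void set_size(size_t size) { size_ = size; }  // never deallocates

  const void* cpu_data() {
    cpu_resize();  // allocation happens lazily, on first access
    return cpu_ptr_;
  }

 private:
  // Grow only if the requested size exceeds capacity, copying any
  // existing contents; capacity never shrinks.
  void cpu_resize() {
    if (size_ <= cpu_capacity_) { return; }
    void* new_ptr = std::malloc(size_);
    if (cpu_ptr_) {
      std::memcpy(new_ptr, cpu_ptr_, cpu_capacity_);
      std::free(cpu_ptr_);
    }
    cpu_ptr_ = new_ptr;
    cpu_capacity_ = size_;
  }

  void* cpu_ptr_;
  size_t size_;          // current "logical" size requested by a Blob
  size_t cpu_capacity_;  // bytes actually allocated; only ever grows
};

int main() {
  SyncedMemory mem;
  mem.set_size(1024);
  mem.cpu_data();      // first allocation
  mem.set_size(512);   // shrink request: no reallocation, nothing freed
  mem.cpu_data();
  return 0;
}
```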
Needs rebase, then let's merge? I think the reasoning in #355 (comment) is sound, but if there are any reservations let me know!
I think that's not quite right. With this PR, you can think of Blobs as having a "virtual" size, while the SyncedMemory (still) has a "physical" size, which is the maximum of all the virtual sizes of the Blobs sharing it, assuming they've all called `set_size`.

So both Blobs have their own "size", but the size is just an int member variable that each of them holds; it does not actually affect the state of the SyncedMemory on its own. When a Blob's data is accessed, it first sets the SyncedMemory's size to its own, and the SyncedMemory expands its allocation only if that size exceeds the current capacity.

Please do let me know if that doesn't make sense or you have any other questions.

I forgot to add the comment suggested by Yangqing that `set_size` will never actually deallocate any memory -- I'll do that, rebase, and let you know that this is ready to merge. I'd like to believe my implementation works and won't cause any problems anywhere, but I'm not sure about immediately merging this into master -- I think it might not be a bad idea to let this simmer in dev a while to make sure nothing funny happens that I/we haven't thought of.
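A toy model of that "virtual vs. physical" relationship (all names hypothetical, with a plain struct standing in for SyncedMemory):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Each Blob keeps its own logical count; the shared storage only ever
// tracks the maximum size requested so far (names illustrative).
struct SharedStorage {
  size_t capacity = 0;  // "physical" size: max of all requests, never shrinks
  void request(size_t n) { capacity = std::max(capacity, n); }
};

struct Blob {
  size_t count = 0;     // "virtual" size: private to this Blob
  SharedStorage* storage = nullptr;
  void access() { storage->request(count); }  // set size before any access
};

int main() {
  SharedStorage mem;
  Blob train{static_cast<size_t>(256) * 3 * 227 * 227, &mem};
  Blob test{static_cast<size_t>(50) * 3 * 227 * 227, &mem};
  test.access();   // physical size grows to the test net's count
  train.access();  // grows once more, to the larger train-net count
  test.access();   // no change: capacity never shrinks
  std::printf("physical capacity: %zu elements\n", mem.capacity);
  return 0;
}
```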
Alright, no need for a trial by fire by sending this to master soon. Thanks for the explanation -- I'd somehow convinced myself we were sharing pointers, although that's clearly impossible with …
One of the most highlighted features of CUDA 6 is unified memory, which automatically manages data transfer between the CPU and the GPU. It provides a much more fundamental solution to memory synchronization. Shall we shift to this new GPU memory management paradigm?
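For reference, a minimal sketch of the unified-memory style being suggested here, using the standard CUDA 6 `cudaMallocManaged` API (illustrative, not code from this PR): one pointer is valid on both host and device, so the explicit `cudaMemcpy` synchronization that SyncedMemory performs would go away.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float alpha) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= alpha;
}

int main() {
  const int n = 1024;
  float* data = nullptr;
  // One allocation, visible to both CPU and GPU -- no explicit cudaMemcpy.
  cudaMallocManaged(&data, n * sizeof(float));
  for (int i = 0; i < n; ++i) data[i] = 1.0f;   // write on the host
  scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
  cudaDeviceSynchronize();                      // required before host reads
  std::printf("data[0] = %f\n", data[0]);       // prints 2.0
  cudaFree(data);
  return 0;
}
```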
@jeffdonahue, do you have the time to rebase again? I want to fork this branch to continue the development of #195. If you don't, I will do it in my fork, but I'm afraid of introducing extra conflicts if we resolve the merge conflicts in different ways.
The quality of memory cards and management systems is usually assured by repeatedly writing, reading, and checking random data with memtest or stress-test utilities. The stability of this PR could be tested similarly.
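A rough sketch of the kind of write/read/verify loop described above, with a plain `std::vector` standing in for SyncedMemory (purely illustrative):

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Memtest-style check for an expand-only buffer: repeatedly pick a random
// size, write a known pattern, then read it back and verify.
int main() {
  std::srand(1234);
  std::vector<unsigned char> buffer;  // stand-in for SyncedMemory
  for (int iter = 0; iter < 1000; ++iter) {
    const size_t size = 1 + std::rand() % (1 << 20);
    if (size > buffer.size()) buffer.resize(size);  // expand-only growth
    const unsigned char pattern = static_cast<unsigned char>(iter);
    for (size_t i = 0; i < size; ++i) buffer[i] = pattern;           // write
    for (size_t i = 0; i < size; ++i) assert(buffer[i] == pattern);  // check
  }
  return 0;
}
```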
Agreed that that would be good, but I'm not familiar with those tools and don't really have the spare time to learn them, unless there's something very simple and straightforward I can do? Otherwise, if you or anyone else would like to stress test this yourselves, by all means feel free (not that I could stop you...).

I'm actually not really sure my implementation is ready for primetime. I have occasionally seen cases where a … I think I might know how to fix it, but probably won't have time to work on it for a few weeks, unfortunately. If you'd like, feel free to take over development yourself again (either with my implementation or yours).
This is worth revising in light of #594 (especially with regards to …) in case anyone in the community with a focus on memory usage is interested in taking up this PR.
This PR wraps up the SyncedMem changes from @kloudkl in #250 (thanks for doing most of the work on this @kloudkl!). I ended up removing the thrust `host_vector`/`device_vector` dependency as I couldn't get them working at all (almost all tests segfaulted), and instead kept the `void*` pointers that we were already using to store the data (still syncing with explicit `cudaMemcpy` calls). I added private fields `cpu_capacity_` and `gpu_capacity_`, which keep track of the current space allocation, and methods `cpu_resize()` and `gpu_resize()`, which allocate additional memory (copying any current contents) if the current allocation is insufficient.

The final commit makes use of this new functionality and my `ShareData` and `ShareDiff` methods from #332 to share all the data blobs between the training and test net (#332 only shared weight blobs). Because of this change I've expanded the imagenet_val batch size from 50 to 250 (as close to the train batch size of 256 as possible while still evenly dividing 50K...) for slightly faster test time at little memory cost.

I ran some tests of the ImageNet model on a Tesla K40 and found little to no difference in training/test time (not counting the test-time improvement due to the expanded batch size), while memory use after the first test iteration improves significantly:
Note that the memory use of the test net is non-zero, because additional memory is still allocated for the test net in at least these three cases (a schematic sketch of the name-matched sharing itself follows this list):

(1) blobs used inside of layers to store intermediate computations (the solver has no way of knowing about these; we would have to change each layer's code to share them);

(2) blobs in the test net which do not share a name with any blob in the train net (e.g. the `accuracy` top blob in the ImageNet test net -- which isn't a great example because it's only 8 bytes, but you get the idea);

(3) blobs in the test net which are larger (in terms of `count`) than the correspondingly named blobs in the train net (there aren't any such cases in Caffe's sample models, but it would obviously happen if you, e.g., used a larger batch size in the test net, or did something weird like swapping the names of `conv1` and `conv5` between the train and test net, etc.).
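As referenced above, a schematic sketch of the name-matched sharing (all types and method names are illustrative stand-ins, not Caffe's exact API):

```cpp
#include <map>
#include <string>

// Stand-in Blob: ShareData points this blob at another's storage.
struct Blob {
  const void* data;
  void ShareData(const Blob& other) { data = other.data; }
};

// For each named blob in the test net, share storage with the train-net
// blob of the same name, if one exists. Unmatched test blobs (case 2
// above) keep their own allocation; a larger matched blob (case 3) would
// expand the shared SyncedMemory on first access.
void ShareMatchingBlobs(const std::map<std::string, Blob*>& train,
                        std::map<std::string, Blob*>& test) {
  for (std::map<std::string, Blob*>::iterator it = test.begin();
       it != test.end(); ++it) {
    std::map<std::string, Blob*>::const_iterator m = train.find(it->first);
    if (m != train.end()) it->second->ShareData(*m->second);
  }
}

int main() { return 0; }
```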