
Wrap up SyncedMem resize from @kloudkl; make train/test nets share data blobs #355

Closed
wants to merge 19 commits

Conversation

jeffdonahue
Contributor

This PR wraps up the SyncedMem changes from @kloudkl in #250 (thanks for doing most of the work on this @kloudkl!). I ended up removing the thrust host/device_vector dependency as I couldn't get it working at all (almost all tests segfaulted), and instead kept the void* pointers that we were already using to store the data (still syncing with explicit cudaMemcpy calls). I added private fields cpu_capacity_ and gpu_capacity_, which track the current space allocation, and methods cpu_resize() and gpu_resize(), which allocate additional memory (copying any existing data) if the current allocation is insufficient.
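
To make that concrete, here's a minimal sketch of the capacity-aware logic (illustrative only; the member and method names follow the description above, but this is not the actual code in the PR, the host/device syncing of the real SyncedMemory is left out, and there's no error handling):

    // Illustrative sketch, not the PR's exact SyncedMemory implementation.
    #include <cstddef>
    #include <cstdlib>
    #include <cuda_runtime.h>

    class SyncedMemSketch {
     public:
      explicit SyncedMemSketch(std::size_t size)
          : cpu_ptr_(NULL), gpu_ptr_(NULL),
            size_(size), cpu_capacity_(0), gpu_capacity_(0) {}
      ~SyncedMemSketch() {
        std::free(cpu_ptr_);
        if (gpu_ptr_) cudaFree(gpu_ptr_);
      }

      // Cheap: only records the logical size; no allocation happens here.
      void set_size(std::size_t size) { size_ = size; }

      const void* cpu_data() { cpu_resize(); return cpu_ptr_; }
      const void* gpu_data() { gpu_resize(); return gpu_ptr_; }

     private:
      // Grow the host allocation only if the current capacity is insufficient;
      // realloc preserves the existing contents while expanding.
      void cpu_resize() {
        if (cpu_capacity_ >= size_) return;
        cpu_ptr_ = std::realloc(cpu_ptr_, size_);
        cpu_capacity_ = size_;
      }

      // Same idea on the device: allocate a larger buffer, copy the old
      // contents over with cudaMemcpy, then release the old buffer.
      void gpu_resize() {
        if (gpu_capacity_ >= size_) return;
        void* new_ptr = NULL;
        cudaMalloc(&new_ptr, size_);
        if (gpu_ptr_) {
          cudaMemcpy(new_ptr, gpu_ptr_, gpu_capacity_, cudaMemcpyDeviceToDevice);
          cudaFree(gpu_ptr_);
        }
        gpu_ptr_ = new_ptr;
        gpu_capacity_ = size_;
      }

      void* cpu_ptr_;
      void* gpu_ptr_;
      std::size_t size_;
      std::size_t cpu_capacity_;
      std::size_t gpu_capacity_;
    };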

The final commit makes use of this new functionality and my ShareData and ShareDiff methods from #332 to share all the data blobs between the training and test net (#332 only shared weight blobs). Because of this change I've expanded the imagenet_val batchsize from 50 to 250 (as close to the train batch size of 256 as possible while still evenly dividing 50K...) for slightly faster test time with little memory cost.

I ran some tests of the ImageNet model on a Tesla K40 and found little to no difference in training/test time (not counting the test time improvement due to the expanded batch size), while memory use after the first test iteration improves significantly:

test batch size 50:

    @ dev:
    before first test iteration: 3014 MB (GPU memory use based on 'nvidia-smi')
     after first test iteration: 3255 MB
    -> 241 MB additional memory used by test net

    @ capacity-aware-memory:
    before first test iteration: 3014 MB
     after first test iteration: 3053 MB
    -> 39 MB additional memory used by test net, 84% reduction vs. dev


test batch size 250:

    @ dev:
    before first test iteration: 3014 MB
     after first test iteration: 4143 MB
    -> 1129 MB additional memory used by test net

    @ capacity-aware-memory:
    before first test iteration: 3014 MB
     after first test iteration: 3141 MB
    -> 127 MB additional memory used by test net, 89% reduction vs. dev

Note that the memory use of the test net is non-zero, because additional memory is still allocated for the test net in at least these three cases:

(1) blobs used inside layers to store intermediate computations (the solver has no way of knowing about these; sharing them would require changing each layer's code);
(2) blobs in the test net that don't share a name with any blob in the train net (e.g., the accuracy top blob in the ImageNet test net -- not a great example since it's only 8 bytes, but you get the idea);
(3) blobs in the test net that are larger (in terms of count) than the correspondingly named blobs in the train net (none of Caffe's sample models have such a case, but it would obviously happen if you, e.g., used a larger batch size in the test net, or did something weird like swapping the names of conv1 and conv5 between the train and test nets).

@jeffdonahue
Contributor Author

I just edited history to not rename syncedmem.cpp to syncedmem.cu (which was done in @kloudkl's first commit), as GitHub's diff is pretty useless for renamed files. I can redo that change if this gets merged, though.

@@ -305,6 +305,10 @@ $(TEST_BIN_DIR)/%.testbin: $(TEST_BUILD_DIR)/%.o $(GTEST_OBJ) $(STATIC_NAME) \
-o $@ $(CXXFLAGS) $(LDFLAGS) $(WARNINGS)
@ echo

$(GTEST_OBJ): $(GTEST_SRC) | $(GTEST_BUILD_DIR)
Member

What's up with the Makefile changes introduced in 1d4ea4be7e77203306a157580a44f70fb475093e?

Contributor Author

@kloudkl had added the +$(OBJ_BUILD_DIR)/%.cuo: rule (presumably due to the renaming of syncedmem.cpp->syncedmem.cu -- we had no rule for building *.cu files in the top-level src/caffe directory), but I moved it down in the list because older versions of make (including the one I use, the default Ubuntu installation) use the first matching rule instead of the most specific one, so this rule also matched *.cu files in subdirectories. I fixed that by moving it after all the other *.cu rules.

The rest of the changes were basically style changes and adding a dependency on the header files $(HXX_SRCS) where I happened to notice there wasn't one before (in one case changing from $(PROTO_GEN_HEADER), which is a subset of $(HXX_SRCS)). Sorry for mixing these changes into an unrelated PR... I can remove them from history if desired.

Member

Fine to keep it in this PR for convenience, but could you split the Makefile changes into their own commit (or at least mention them in the message for 1d4ea4b)?

Contributor Author

Done -- moved into a different commit right before the "use handwritten resize methods" one (GitHub displays it as being the last commit, though).

@shelhamer
Member

Nice memory savings, and reducing the number of allocations is good too!

This all looks good to me, and the tests cover the key uses, but I'd rather have another reviewer check this too since memory is slightly important.

Thanks @kloudkl for raising the original PR!

Thanks Jeff for this and all your recent sharing commits!

@shelhamer
Member

@longjon @sergeyk @sguada one of you please review.

@jeffdonahue
Contributor Author

Thanks for taking the time to look over the changes Evan! Agreed that this isn't a PR we want to merge hastily.

void Reshape(const int num, const int channels, const int height,
const int width);
explicit Blob(const int num = 0, const int channels = 0,
const int height = 0, const int width = 0);
Member

Maybe remove the default arguments? It would be odd to set some of them to 0 and others to nonzero values, but if all are 0, it is effectively just the default constructor Blob().

Contributor Author

@kloudkl got rid of the no-arg blob constructor -- the one with the default args sets member variables and does Reshape, so I don't think it's quite the same?
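
Roughly, the distinction is this (an illustrative sketch, not the actual blob.hpp):

    // Illustrative sketch only. Even with all-zero defaults, Blob() runs the
    // reshaping constructor body rather than leaving the members uninitialized,
    // so it is not simply a compiler-generated default constructor.
    template <typename Dtype>
    class BlobSketch {
     public:
      explicit BlobSketch(const int num = 0, const int channels = 0,
                          const int height = 0, const int width = 0) {
        Reshape(num, channels, height, width);
      }
      void Reshape(const int num, const int channels, const int height,
                   const int width) {
        num_ = num; channels_ = channels; height_ = height; width_ = width;
        count_ = num * channels * height * width;  // all-zero args -> empty blob
      }
      int count() const { return count_; }

     private:
      int num_, channels_, height_, width_, count_;
    };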

@Yangqing
Member

There are a few things that I am a little worried about with this, but maybe I am being overly cautious, so please kindly check with me:

(1) Assuming that we have two blobs referring to the same SyncedMem, and one does a reshape or whatever that changes the SyncedMem storage, either by expanding or by shrinking and zero-fitting -- will the other blob play nicely with it?

(2) It seems that we never shrink the allocated size, which means "once expanded, always expanded". I am wondering whether this could cause problems in the future (for now things seem to be fine in our limited usage of this feature), so it is probably better to document it in the comments so one is aware of this effect?

const int width);
explicit Blob(const int num = 0, const int channels = 0,
const int height = 0, const int width = 0);
explicit Blob(Blob* memory_share_blob);
Member

IMHO we could do const reference

explicit Blob(const Blob& memory_share_blob);

? :)

Contributor Author

thanks, done

@jeffdonahue
Contributor Author

Hey Yangqing -- no need to explain your hesitation, greatly appreciate feedback from you!

(1) The way I implemented this was indeed (as you said in (2)) to only ever expand the SyncedMemory allocations (cpu_data_ and gpu_data_), and to do so lazily, as with the rest of the SyncedMemory implementation. SyncedMemory maintains the variables size_, cpu_capacity_, and gpu_capacity_, with size_ being the only publicly mutable one, via the set_size method, which simply updates the size_ variable. Blob calls to the underlying memory accessors (cpu_data(), gpu_data(), mutable_cpu_data(), etc.) now always call set_size(count_) first. If the SyncedMemory already has adequate storage -- i.e., cpu_capacity_ >= size_ (or gpu_capacity_ >= size_ for a gpu_data call) -- it simply returns the data pointer as before (copying to/from device/host as needed, just like before). If it doesn't have sufficient storage, it allocates the necessary amount of additional memory (keeping any existing data as is, via realloc for the host or an explicit cudaMemcpy for the GPU).

For the use in the current codebase, we would only ever expand the original memory allocation once -- specifically during the first training iteration (since as of the dev branch today we first perform a test pass before training, and training blobs are typically larger than test blobs -- if they were <= the size of the test blobs we'd never expand the original allocation).

I can't think of any scenarios in which this would cause problems with Blobs interfering with one another, as they always explicitly set the SyncedMemory size before any data accessor/mutator calls. Obviously this can't be used haphazardly to share SyncedMemory objects among any arbitrary set of Blobs, but it does seem to work for sharing only among pairs of blobs between the train and test net. Let me know if you think there's something I may be missing though.

(2) Agreed on adding a comment to note that SyncedMemory is only ever expanded -- will do.
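
Something along these lines, for example (suggested wording only, not the exact comment that ends up in the code):

    // Note: set_size() only updates the logical size; allocated capacity is
    // never reduced. Once a SyncedMemory has grown to hold N bytes, that
    // allocation persists for the lifetime of the object
    // ("once expanded, always expanded").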

@shelhamer
Member

Needs rebase, then let's merge? I think the reasoning in #355 (comment) is sound, but if there are any reservations let me know!

Re: (1), the "other" blob is actually a shared pointer to a blob, so the reshape/expansion/whatever is done to "both" blobs. Or have I missed something? (Edit: Blobs are better thought of as virtual views of a SyncedMemory with concrete size; see follow-up comment for details.)

@jeffdonahue
Contributor Author

I think that's not quite right. With this PR, you can think of Blobs as having a "virtual" size, while the SyncedMemory (still) has a "physical" size which is the maximum of all the virtual sizes of the Blobs sharing it, assuming they've all called cpu_data or another data accessor/mutator.

So both Blobs have their own "size", but the size is just an int member variable that each of them holds; it does not actually affect the state of the SyncedMemory on its own. When a Blob's {mutable_,}cpu_data (or another data accessor/mutator) is called, the Blob does a set_size on its SyncedMemory, which it may share with other Blobs. This set_size always takes essentially zero time, just updating the SyncedMemory's size variable. Then, when the SyncedMemory's cpu/gpu_data is accessed, the SyncedMemory checks whether its current allocation is sufficient and allocates additional space if necessary.
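
In usage terms it looks something like this (an illustrative, self-contained sketch -- these are not the real Caffe classes, and the sizes are arbitrary):

    // Two "views" (stand-ins for Blobs) with their own virtual sizes share one
    // storage object (stand-in for SyncedMemory) whose physical capacity only
    // ever grows to the largest size requested so far.
    #include <cstdio>
    #include <cstdlib>
    #include <memory>

    struct StorageSketch {
      StorageSketch() : ptr(NULL), size(0), capacity(0) {}
      ~StorageSketch() { std::free(ptr); }
      void set_size(size_t n) { size = n; }     // cheap: no allocation
      void* data() {
        if (capacity < size) {                  // grow lazily, keep contents
          ptr = std::realloc(ptr, size);
          capacity = size;
        }
        return ptr;
      }
      void* ptr;
      size_t size;      // "virtual" size set by whichever view touched it last
      size_t capacity;  // "physical" size; never shrinks
    };

    struct ViewSketch {                          // a Blob-like view
      std::shared_ptr<StorageSketch> mem;
      size_t count;
      void* data() { mem->set_size(count); return mem->data(); }
    };

    int main() {
      std::shared_ptr<StorageSketch> shared(new StorageSketch);
      ViewSketch test_view = { shared, 50u * 1000u * sizeof(float) };    // smaller
      ViewSketch train_view = { shared, 256u * 1000u * sizeof(float) };  // larger
      test_view.data();    // allocates enough for the test-sized view
      train_view.data();   // expands once to the larger train size
      test_view.data();    // no further allocation: capacity already suffices
      std::printf("physical capacity: %zu bytes\n", shared->capacity);
      return 0;
    }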

Please do let me know if that doesn't make sense or you have any other questions.

I forgot to add the comment suggested by Yangqing that set_size will never actually deallocate any memory -- I'll do that, rebase, and let you know that this is ready to merge. I'd like to believe my implementation works and won't cause any problems anywhere, but I'm not sure about immediately merging this into master -- I think it might not be a bad idea to let this simmer in dev a while to make sure nothing funny happens that I/we haven't thought of.

@shelhamer
Member

Alright, no need for a trial by fire by sending this to master soon. Thanks for the explanation -- I'd somehow convinced myself we were sharing pointers, although that's clearly impossible with ShareData() being a member function of Blob. Oh well.

@kloudkl
Contributor

kloudkl commented May 3, 2014

One of the most highlighted features of CUDA 6 is unified memory, which automatically manages data transfer between the CPU and the GPU. It provides a much more fundamental solution to memory synchronization. Shall we shift to this new GPU memory management paradigm?
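
The basic pattern looks like this (a standalone CUDA sketch, not Caffe code; note there is no explicit cudaMemcpy between separate host and device pointers):

    // Unified-memory sketch: one managed allocation visible to host and device.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float alpha) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] *= alpha;
    }

    int main() {
      const int n = 1 << 20;
      float* data = NULL;
      cudaMallocManaged(&data, n * sizeof(float));  // one pointer for CPU and GPU
      for (int i = 0; i < n; ++i) data[i] = 1.0f;   // written on the host
      scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
      cudaDeviceSynchronize();                      // sync before the host reads
      std::printf("data[0] = %f\n", data[0]);
      cudaFree(data);
      return 0;
    }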

@kloudkl
Contributor

kloudkl commented Jun 10, 2014

@jeffdonahue, do you have the time to rebase again? I want to fork this branch to continue the development of #195. If you don't, I will do it in my fork, but I'm afraid of introducing extra conflicts if we resolve the merge conflicts in different ways.

@kloudkl
Contributor

kloudkl commented Jun 10, 2014

The quality of memory cards and memory management systems is usually assured by repeatedly writing, reading, and checking random data with memtest or stress-test utilities. The stability of this PR can be tested similarly.
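
For instance, something along these lines (a sketch against the Blob interface; sizes, iteration count, and the checking pattern are arbitrary, and the mutable_gpu_data()/cpu_data() round trip assumes a GPU is available):

    // Stress-test sketch: repeatedly reshape, write a pattern, bounce the data
    // through the GPU, and verify nothing is corrupted by the resizes/copies.
    #include <cstdio>
    #include <cstdlib>
    #include "caffe/blob.hpp"

    int main() {
      caffe::Blob<float> blob(1, 1, 1, 1);
      for (int iter = 0; iter < 1000; ++iter) {
        // Random shapes so the logical size grows and shrinks while the
        // underlying capacity only ever grows.
        const int num = 1 + std::rand() % 512;
        const int channels = 1 + std::rand() % 8;
        blob.Reshape(num, channels, 13, 13);

        float* data = blob.mutable_cpu_data();
        for (int i = 0; i < blob.count(); ++i) {
          data[i] = static_cast<float>(i % 97);
        }

        blob.mutable_gpu_data();  // force a host -> device -> host round trip

        const float* check = blob.cpu_data();
        for (int i = 0; i < blob.count(); ++i) {
          if (check[i] != static_cast<float>(i % 97)) {
            std::printf("mismatch at iteration %d, index %d\n", iter, i);
            return 1;
          }
        }
      }
      std::printf("stress test passed\n");
      return 0;
    }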

@jeffdonahue
Contributor Author

The quality of memory cards and memory management systems is usually assured by repeatedly writing, reading, and checking random data with memtest or stress-test utilities. The stability of this PR can be tested similarly.

Agreed that that would be good, but I'm not familiar with those tools and don't really have the spare time to learn them, unless there's something very simple and straightforward I can do? Otherwise if you or anyone else would like to stress test this yourselves, by all means feel free (not that I could stop you...).

I'm actually not really sure my implementation is ready for "primetime". I have occasionally seen cases where a cudaMalloc fails on a gpu_resize call when a lot of extra memory is being allocated (e.g., when the first ImageNet training iteration begins, since this now happens after the first test iteration, which uses a smaller batch size -- when I originally wrote this PR that problem wasn't exposed, because the training net was allocated first; we didn't do a "0th iteration" test then).

I think I might know how to fix it, but probably won't have time to work on it for a few weeks, unfortunately. If you'd like, feel free to take over development yourself again (either with my implementation or yours).

@shelhamer
Member

This is worth revising in light of #594, especially with regard to

  1. reallocation for reshapes to larger sizes
  2. sharing all blobs between train and test nets (and not only the trained parameters)

in case anyone in the community with a focus on memory usage is interested in taking up this PR.

@jeffdonahue
Contributor Author

Closing this as I've abandoned it; @tnarihi feel free to cherry-pick or reference for #1985 if useful.

@jeffdonahue jeffdonahue closed this Mar 9, 2015