Decouple the computational batch size and minibatch size by accumulating gradients #1977
Conversation
This was referenced Feb 26, 2015
shelhamer added a commit to shelhamer/caffe that referenced this pull request on Feb 26, 2015 (4f56f71)
shelhamer added a commit to shelhamer/caffe that referenced this pull request on Feb 26, 2015 (dc3479e)
tnarihi and 2 others commented on an outdated diff on Feb 26, 2015
```diff
@@ -477,7 +502,8 @@ void SGDSolver<Dtype>::ComputeUpdateValue() {
   case Caffe::CPU:
     for (int param_id = 0; param_id < net_params.size(); ++param_id) {
       // Compute the value to history, and then copy them to the blob's diff.
-      Dtype local_rate = rate * net_params_lr[param_id];
+      Dtype local_rate = rate * net_params_lr[param_id]
+          / this->param_.iter_size();
```
tnarihi commented on an outdated diff on Feb 26, 2015
```diff
@@ -513,7 +539,8 @@ void SGDSolver<Dtype>::ComputeUpdateValue() {
 #ifndef CPU_ONLY
     for (int param_id = 0; param_id < net_params.size(); ++param_id) {
       // Compute the value to history, and then copy them to the blob's diff.
-      Dtype local_rate = rate * net_params_lr[param_id];
+      Dtype local_rate = rate * net_params_lr[param_id]
+          / this->param_.iter_size();
```
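The division in these diffs is what keeps accumulated updates consistent: diffs are summed over `iter_size` backward passes, so dividing the local rate by `iter_size` makes the step equal to one computed from the averaged gradient. A minimal sketch under that assumption (names like `sgd_step` are illustrative, not Caffe's):

```python
def sgd_step(weight, grads, lr, iter_size):
    # Diffs are accumulated (summed) over iter_size backward passes.
    accumulated = sum(grads)
    # Dividing the rate by iter_size turns the summed gradient into a
    # mean, so the step matches plain SGD on the averaged gradient.
    return weight - (lr / iter_size) * accumulated

grads = [0.2, 0.4, 0.1, 0.3]               # one gradient per pass
stepped = sgd_step(1.0, grads, lr=0.1, iter_size=4)
reference = 1.0 - 0.1 * (sum(grads) / 4)   # plain SGD on the mean
assert abs(stepped - reference) < 1e-12
```

This shortcut only works unchanged for plain SGD; solvers whose state depends on gradient magnitude (e.g. AdaGrad) need the gradient itself normalized, which comes up later in the thread.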
Commented on the diff. By the way, I don't understand very well what @jeffdonahue mentions. Is there any relation between this PR and weight sharing? Gradient accumulations among shared parameters are computed independently (since …). Anyway, the idea of always accumulating is very good (less memory) if the issues are solved. Both issues do not matter to me since I usually use SGDSolver, and I will notice the behavior change of …
Oops, I haven't actually stepped through this myself, but I think you're totally right @tnarihi -- there shouldn't be an issue with weight sharing in this implementation. I was confusing it with my version -- I had rebased my RNN PR (#1873) on this, and then just threw the additional changes to … Besides the other issues Takuya mentioned, I now think this is strictly good (i.e. it doesn't break anything that works now) and should be merged. Maybe I'll write a new PR based on this, or a commit to append to this one, that does …
I see. Thanks Jeff! Sharing the diff for weight sharing is nice for memory consumption. I think restricting all … The other thing is, I think, notifying developers (especially developers working on PRs for layers that have parameter updates) that …
Another good point. At some point I had modified the gradient checker to check accumulation (by adding some random noise to the param blob diffs, calling …
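The check described here can be sketched as: seed the parameter diff with noise, run accumulating backward passes, and verify the result equals the noise plus the accumulated gradient. The toy `backward_accumulate` below is a stand-in for a Backward pass with accumulation, not the actual gradient checker code:

```python
def backward_accumulate(diff, grad):
    # An accumulating Backward adds into the existing diff rather than
    # overwriting it.
    return [d + g for d, g in zip(diff, grad)]

noise = [0.01, -0.02, 0.03]     # random seed values left in the param diff
true_grad = [1.0, 2.0, 3.0]     # gradient each backward pass produces

diff = list(noise)
for _ in range(2):              # two accumulation passes
    diff = backward_accumulate(diff, true_grad)

# If accumulation works, the diff is noise + 2 * gradient: the seeded
# noise survives untouched rather than being overwritten.
expected = [n + 2 * g for n, g in zip(noise, true_grad)]
assert all(abs(d - e) < 1e-12 for d, e in zip(diff, expected))
```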
That sounds like a nice idea. Abstracting …
Actually I have thought of a simpler way to implement this that is independent of gradient accumulation. Maybe it is too tricky, maybe not. Will update. (I am still mildly in favor of always accumulating gradients, disallowing different …)
This was referenced Mar 4, 2015
longjon added a commit to longjon/caffe that referenced this pull request on Mar 10, 2015 (be026fc)
kuprel referenced this pull request on Mar 10, 2015: Running Over Whole Sets/Computing Epochs Instead of Iterations #1094 (Open)
longjon added a commit to longjon/caffe that referenced this pull request on Mar 10, 2015 (ae12045)
longjon added a commit to longjon/caffe that referenced this pull request on Mar 10, 2015 (10c133a)
weiliu89 added a commit to weiliu89/caffe that referenced this pull request on Apr 1, 2015 (ad6fede)
weiliu89 added a commit to weiliu89/caffe that referenced this pull request on Apr 14, 2015 (0710648)
@shelhamer @jeffdonahue @longjon what is happening with this PR? I think we need to find a solution and merge it as soon as possible. Actually I thought it was already merged, since the solution has been around for a while.
elleryrussell added a commit to elleryrussell/caffe that referenced this pull request on May 1, 2015 (d4ad090)
Accumulating gradients involves subtleties with regard to scaling gradients and hyperparameters w.r.t. the effective batch size vs. the computational batch size. For merge, this needs a test that compares the updates computed by …
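The test being asked for can be sketched as comparing two runs of SGD with momentum: one stepping on full-batch mean gradients, the other accumulating sub-batch gradients over `iter_size` passes and normalizing before the update. Everything below is an illustrative toy, not Caffe's test code; per-sample gradients stand in for the network:

```python
def momentum_sgd(w, v, grad, lr, momentum):
    # Standard momentum update: history v blends old step and new gradient.
    v = momentum * v + lr * grad
    return w - v, v

def run_full_batch(w, batches, lr, momentum):
    v = 0.0
    for batch in batches:
        g = sum(batch) / len(batch)          # mean gradient over the batch
        w, v = momentum_sgd(w, v, g, lr, momentum)
    return w

def run_accumulated(w, batches, lr, momentum, iter_size):
    v = 0.0
    for batch in batches:
        k = len(batch) // iter_size
        diff = 0.0
        for i in range(iter_size):           # accumulate over sub-batches
            chunk = batch[i * k:(i + 1) * k]
            diff += sum(chunk) / len(chunk)  # each pass adds its mean grad
        diff /= iter_size                    # normalize before the update
        w, v = momentum_sgd(w, v, diff, lr, momentum)
    return w

batches = [[0.1, 0.3, 0.2, 0.4], [0.5, 0.1, 0.2, 0.2]]
a = run_full_batch(1.0, batches, lr=0.1, momentum=0.9)
b = run_accumulated(1.0, batches, lr=0.1, momentum=0.9, iter_size=2)
assert abs(a - b) < 1e-9   # updates should match across the whole run
```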
Right, this should be fine after @shelhamer's list. My idea for a simpler implementation did not pan out; it would only have worked for SGD with momentum.
shelhamer referenced this pull request on May 27, 2015: Deduplicate solver regularization, logging, and local rates and decays #2518 (Merged)
longjon and others added some commits on Aug 12, 2014
shelhamer added the "ready for review" label on May 28, 2015
@jeffdonahue @longjon this should finally be ready. I had to set the gradient-based solver test net to constant data to avoid a tricky issue with random number draws, but I think this is fine -- the regression targets are still random, so this does give a sequence of gradients to check.
shelhamer removed the "ready for review" label on May 28, 2015
Accumulation is now checked for … I suspect the issue is due to how the history is recorded: the gradient might not be normalized by …
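The suspicion about history can be illustrated with a toy AdaGrad step (my own sketch, not Caffe code): if only the learning rate is divided by `iter_size`, the unnormalized summed gradient gets squared into the history, making it `iter_size`² too large; normalizing the gradient before both the update and the history restores the non-accumulated behavior.

```python
def adagrad_step(grad, history, lr, eps=1e-8):
    history = history + grad * grad          # squared gradient enters history
    return lr * grad / (history ** 0.5 + eps), history

g, lr, iter_size = 0.5, 0.1, 4
summed = iter_size * g                       # accumulated (summed) gradient

# Reference: non-accumulated step on the mean gradient.
ref_step, ref_hist = adagrad_step(g, 0.0, lr)

# Broken: only the rate is scaled; history records the raw sum squared.
bad_step, bad_hist = adagrad_step(summed, 0.0, lr / iter_size)

# Fixed: normalize the gradient before update and history are computed.
good_step, good_hist = adagrad_step(summed / iter_size, 0.0, lr)

assert abs(good_step - ref_step) < 1e-12 and good_hist == ref_hist
assert abs(bad_step - ref_step) > 1e-3         # the update already differs
assert bad_hist == iter_size ** 2 * ref_hist   # history is 16x too large
```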
shelhamer added and then removed the "ready for review" label on May 28, 2015
0e7a078 makes the normalization for accumulation more obvious and fixes the issue with AdaGrad by normalizing the gradient before the update and history are computed. However, when gradients are accumulated there's overhead for this separate scaling step. The time to update CaffeNet parameters for …
shelhamer added the "ready for review" label on May 28, 2015
Merging for control of memory usage now that this is simple and tested. @sguada sorry for the wait!
shelhamer added a commit that referenced this pull request on May 30, 2015 (aeef453)
shelhamer merged commit aeef453 into BVLC:master on May 30, 2015 (1 check passed)
shelhamer deleted the shelhamer:accum-grad branch on May 30, 2015
hli2020 commented on Jun 1, 2015
I think the PReLU layer also needs to accumulate the gradients. @shelhamer
Here is my implementation of PReLU gradient accumulation: tnarihi/caffe@4d3fbd5
Oh sorry, I missed the cherry pick + commit ID. That'll work fine.
This was referenced Jun 1, 2015
gcr commented on the diff on Jan 12, 2016
```diff
@@ -469,6 +495,32 @@ void SGDSolver<Dtype>::ApplyUpdate() {
 }

 template <typename Dtype>
+void SGDSolver<Dtype>::Normalize(int param_id) {
+  if (this->param_.iter_size() == 1) { return; }
+  // Scale gradient to counterbalance accumulation.
+  const vector<shared_ptr<Blob<Dtype> > >& net_params = this->net_->params();
+  const Dtype accum_normalization = Dtype(1.) / this->param_.iter_size();
```
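Transposed to Python for readability, the `Normalize` step in the diff above amounts to the following (a sketch only: the real code scales blob diffs in place, on CPU or GPU):

```python
def normalize(diffs, iter_size):
    if iter_size == 1:
        return diffs                     # nothing was accumulated; skip
    # Scale gradient to counterbalance accumulation.
    accum_normalization = 1.0 / iter_size
    return [d * accum_normalization for d in diffs]

assert normalize([4.0, 8.0], iter_size=4) == [1.0, 2.0]
assert normalize([4.0, 8.0], iter_size=1) == [4.0, 8.0]
```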
shelhamer commented Feb 26, 2015

Accumulate gradients across batches through the `iter_size` solver field. With this setting, `batch_size: 16` with `iter_size: 1` and `batch_size: 4` with `iter_size: 4` are equivalent. This is the `master` edition of #1663.

- adjust / normalize gradients by `local_rate` and `local_decay` according to `iter_size`

Historical context:
From @longjon
From @jeffdonahue
From @shelhamer
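The equivalence claimed in the description (`batch_size: 16`, `iter_size: 1` versus `batch_size: 4`, `iter_size: 4`) can be checked numerically, assuming each forward/backward yields the mean gradient over its batch and the accumulated diff is normalized by `iter_size`; the per-sample gradients here are made up for illustration:

```python
samples = [0.1 * i for i in range(16)]   # stand-in per-sample gradients

# batch_size: 16, iter_size: 1 -- one pass over all sixteen samples.
full = sum(samples) / 16

# batch_size: 4, iter_size: 4 -- sum four passes of four samples each,
# then normalize the accumulated diff by iter_size.
accum = sum(sum(samples[i:i + 4]) / 4 for i in range(0, 16, 4)) / 4

assert abs(full - accum) < 1e-12         # same effective gradient
```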