Incorrect iteration_to_epoch calculation with batch_accumulation #1240

Closed
bachma opened this issue Nov 3, 2016 · 5 comments
@bachma

bachma commented Nov 3, 2016

Hello,
I'm not 100% sure, but I think the calculation of iteration_to_epoch is incorrect when using batch_accumulation. I suspect this is because the calculation of solver.max_iter doesn't take batch_accumulation into account.

@lukeyeager
Member

I'm pretty sure we're doing it right. Please elaborate on why you think what we're doing is incorrect.

@bachma
Author

bachma commented Nov 4, 2016

As stated in https://github.com/NVIDIA/caffe/blob/caffe-0.15/src/caffe/solver.cpp#L280-L282, the internal iter_ counter in Caffe "indicates the number of times the weights have been updated."
You can also see this in https://github.com/NVIDIA/caffe/blob/caffe-0.15/src/caffe/solver.cpp#L233-L235.

Therefore, the number of samples processed during one Caffe iteration is batch_size x iter_size, and DIGITS should take this into account when calculating iteration_to_epoch. For example, with batch_size = 32 and iter_size = 4, one Caffe iteration consumes 128 samples, so an epoch finishes in a quarter of the iterations DIGITS currently assumes.

This is not done in

# Epochs -> Iterations
train_iter = int(math.ceil(float(self.dataset.get_entry_count(
    constants.TRAIN_DB)) / train_data_layer.data_param.batch_size))
solver.max_iter = train_iter * self.train_epochs
snapshot_interval = self.snapshot_interval * train_iter
and
def iteration_to_epoch(self, it):
    return float(it * self.train_epochs) / self.solver.max_iter
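
A minimal sketch of what the corrected Epochs -> Iterations calculation might look like, assuming the accumulation count is available as solver.iter_size (Caffe's SolverParameter field, which defaults to 1 when unset); the other names simply mirror the DIGITS snippet above:

# Each Caffe iteration processes batch_size * iter_size samples,
# so fewer iterations are needed to cover one epoch.
effective_batch_size = (train_data_layer.data_param.batch_size *
                        max(solver.iter_size, 1))

# Epochs -> Iterations, accounting for batch accumulation
train_iter = int(math.ceil(
    float(self.dataset.get_entry_count(constants.TRAIN_DB)) /
    effective_batch_size))
solver.max_iter = train_iter * self.train_epochs
snapshot_interval = self.snapshot_interval * train_iter

With solver.max_iter computed this way, iteration_to_epoch itself would need no change, since it already scales by train_epochs / max_iter.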

Please correct me if I'm wrong.

@gheinrich
Contributor

What @bachma is saying makes sense. I trained the same model with and without batch accumulation, and it does indeed take much longer to go through the same number of "epochs" (in DIGITS speak) when batch accumulation is enabled.

@lukeyeager
Member

Oh wow, great point! I'm surprised that's how Caffe counts things.

@lukeyeager
Member

Thanks for the report @bachma. I verified the bug with a debug Caffe build - you're exactly right.

Would you like to try out #1262 and verify the fix?
