Incorrect iteration_to_epoch calculation with batch_accumulation #1240

Closed
bachma opened this issue Nov 3, 2016 · 5 comments
@bachma

bachma commented Nov 3, 2016

Hello,
I'm not 100% sure, but I think the calculation of iteration_to_epoch is incorrect when using batch_accumulation. I suspect this is because the calculation of solver.max_iter doesn't take batch_accumulation into account.

@lukeyeager
Member

I'm pretty sure we're doing it right. Please elaborate on why you think what we're doing is incorrect.

@bachma
Author

bachma commented Nov 4, 2016

As stated in https://github.com/NVIDIA/caffe/blob/caffe-0.15/src/caffe/solver.cpp#L280-L282, the internal iter_ counter in Caffe "indicates the number of times the weights have been updated."
You can also see this in https://github.com/NVIDIA/caffe/blob/caffe-0.15/src/caffe/solver.cpp#L233-L235.

Therefore, the number of samples processed during one Caffe iteration is batch_size x iter_size, and DIGITS should take this into account when calculating iteration_to_epoch. For example, with batch_size = 32 and iter_size = 4, one Caffe iteration consumes 128 samples, so an epoch finishes in a quarter of the iterations DIGITS currently assumes.

This is not done in

# Epochs -> Iterations
train_iter = int(math.ceil(float(self.dataset.get_entry_count(
    constants.TRAIN_DB)) / train_data_layer.data_param.batch_size))
solver.max_iter = train_iter * self.train_epochs
snapshot_interval = self.snapshot_interval * train_iter
and
def iteration_to_epoch(self, it):
    return float(it * self.train_epochs) / self.solver.max_iter
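
A minimal sketch of what the corrected Epochs -> Iterations calculation might look like, assuming the accumulation count is available as solver.iter_size (Caffe's SolverParameter field, which defaults to 1 when unset); the other names simply mirror the DIGITS snippet above:

# Each Caffe iteration processes batch_size * iter_size samples,
# so fewer iterations are needed to cover one epoch.
effective_batch_size = (train_data_layer.data_param.batch_size *
                        max(solver.iter_size, 1))

# Epochs -> Iterations, accounting for batch accumulation
train_iter = int(math.ceil(
    float(self.dataset.get_entry_count(constants.TRAIN_DB)) /
    effective_batch_size))
solver.max_iter = train_iter * self.train_epochs
snapshot_interval = self.snapshot_interval * train_iter

With solver.max_iter computed this way, iteration_to_epoch itself would need no change, since it already scales by train_epochs / max_iter.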

Please correct me if I'm wrong.

@gheinrich
Contributor

What @bachma is saying makes sense. I trained the same model with and without batch accumulation, and it does indeed take much longer to go through the same number of "epochs" (in DIGITS speak) when batch accumulation is enabled.

@lukeyeager
Member

Oh wow, great point! I'm surprised that's how Caffe counts things.

@lukeyeager
Member

Thanks for the report @bachma. I verified the bug with a debug Caffe build - you're exactly right.

Would you like to try out #1262 and verify the fix?
