Python manual sgd #3959

Closed
aniketvartak opened this issue Apr 8, 2016 · 10 comments · Fixed by #6238
Comments

@aniketvartak

I am trying to implement the SGD weight update manually in pycaffe, instead of using the solver.step() function. The goal is for the weights obtained after solver.step() to match the weights obtained by updating them manually.

The setup is as follows: use MNIST data. Set the random seed in solver.prototxt as random_seed: 52. Make sure momentum: 0.0, weight_decay: 0.0, base_lr: 0.01, and lr_policy: "fixed". This is done so that I can implement the plain SGD update equation (without momentum, regularization, etc.), which is simply: W_t+1 = W_t - mu * W_t_diff

Following are the two tests:

Test1: Use pycaffe's forward() and backward() to compute the forward and backward propagation. For each layer that contains weights I do:

for k in weight_layer_idx:
    solver.net.layers[k].blobs[0].diff[...] *= lr # weights
    solver.net.layers[k].blobs[1].diff[...] *= lr # biases

Next, I update the weights/biases as:

    solver.net.layers[k].blobs[0].data[...] -= solver.net.layers[k].blobs[0].diff
    solver.net.layers[k].blobs[1].data[...] -= solver.net.layers[k].blobs[1].diff

I run this for 5 iterations.

Test2: Run caffe's solver.step(5).

Now, what I expect is that the two tests should yield exactly the same weights after these iterations.

I save the weight values after each of the above tests and compute the norm of the difference between the weight vectors produced by the two tests, and I see that they are not bit-exact. Can someone spot something that I might be missing?

Following is the entire code for reference:

import caffe
import numpy as np
from copy import copy  # needed for the copy() calls below

caffe.set_device(0)
caffe.set_mode_gpu()

niter = 5
solver = None
solver = caffe.SGDSolver('solver.prototxt')

# Automatic SGD: TEST2
solver.step(niter)
# save the weights to compare later
w_solver_step = copy(solver.net.layers[1].blobs[0].data.astype('float64'))
b_solver_step = copy(solver.net.layers[1].blobs[1].data.astype('float64'))

# Manual SGD: TEST1
solver = None
solver = caffe.SGDSolver('solver.prototxt')
lr = 0.01

# Get layer types
layer_types = []
for ll in solver.net.layers:
    layer_types.append(ll.type)

# Get the indices of layers that have weights in them
weight_layer_idx = [idx for idx,l in enumerate(layer_types) if 'Convolution' in l or 'InnerProduct' in l]

for it in range(1, niter+1):
    solver.net.forward()  # fprop
    solver.net.backward()  # bprop
    for k in weight_layer_idx:
        solver.net.layers[k].blobs[0].diff[...] *= lr
        solver.net.layers[k].blobs[1].diff[...] *= lr
        solver.net.layers[k].blobs[0].data[...] -= solver.net.layers[k].blobs[0].diff
        solver.net.layers[k].blobs[1].data[...] -= solver.net.layers[k].blobs[1].diff

# save the weights to compare later
w_fwdbwd_update = copy(solver.net.layers[1].blobs[0].data.astype('float64'))
b_fwdbwd_update = copy(solver.net.layers[1].blobs[1].data.astype('float64'))

# Compare
print("after iter", niter, ": weight diff:", np.linalg.norm(w_solver_step - w_fwdbwd_update),
      "and bias diff:", np.linalg.norm(b_solver_step - b_fwdbwd_update))

The last line, which compares the weights from the two tests, produces:

after iter 5 : weight diff: 0.000203027766144 and bias diff: 1.78390789051e-05

whereas I expect this difference to be 0.0.

Any ideas?


Also note: if I run these two tests for only 1 iteration, I get exactly matching weight vectors from all layers, but not in subsequent iterations.

seanbell commented Apr 24, 2016

It looks like you're not clearing the diff in each blob. If you want to match the C++ code, you need to clear the diffs manually before each forward/backward pass (set them all to 0): https://github.com/BVLC/caffe/blob/master/src/caffe/solver.cpp#L203

It's not cleared automatically in order to support gradient accumulation (the iter_size solver parameter).
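
For example, applied to the loop from the original post (reusing niter, lr, and weight_layer_idx from there), a minimal sketch of the fix looks like this: zero every parameter diff before the pass so gradients from the previous iteration are not accumulated.

for it in range(niter):
    # clear the accumulated gradients before the new pass
    for k in weight_layer_idx:
        for blob in solver.net.layers[k].blobs:
            blob.diff[...] = 0.0
    solver.net.forward()   # fprop
    solver.net.backward()  # bprop
    # plain SGD update (no momentum, no weight decay)
    for k in weight_layer_idx:
        for blob in solver.net.layers[k].blobs:
            blob.data[...] -= lr * blob.diff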

@nathanin

@aniketvartak Hi, I'm trying to do something similar: I want to perform a forward pass, update some layers, run backpropagation, and then update the weights. My data is manually loaded MNIST images and their labels; manually loading the data is important for my eventual application. I use an Input layer for this: n.data, n.labels = L.Input(shape=[dict(dim=[64,1,28,28]), dict(dim=[64])], transform_param=dict(scale=1./255), ntop=2)
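
For context, here is a rough sketch of how such a net can be written out with caffe.NetSpec; the LeNet-style layers and their names below are just illustrative assumptions, not the exact prototxt used here:

from caffe import layers as L, params as P
import caffe

n = caffe.NetSpec()
# manually fed data and labels come in through Input layers
n.data, n.labels = L.Input(shape=[dict(dim=[64, 1, 28, 28]), dict(dim=[64])],
                           transform_param=dict(scale=1./255), ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=5, num_output=20,
                        weight_filler=dict(type='xavier'))
n.pool1 = L.Pooling(n.conv1, kernel_size=2, stride=2, pool=P.Pooling.MAX)
n.ip1 = L.InnerProduct(n.pool1, num_output=500, weight_filler=dict(type='xavier'))
n.relu1 = L.ReLU(n.ip1, in_place=True)
n.ip2 = L.InnerProduct(n.relu1, num_output=10, weight_filler=dict(type='xavier'))
n.loss = L.SoftmaxWithLoss(n.ip2, n.labels)

with open('train.prototxt', 'w') as f:
    f.write(str(n.to_proto()))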

I am trying to perform the learning without calling solver.step():

for it in range(100):
    # Manually load data - returns batch = ndarray(64,1,28,28), labels = ndarray(64)
    batch, labels = zip(*(get_random_digit() for _ in range(64)))

    # Set data into network
    solver.net.blobs['data'].data[...] = batch
    solver.net.blobs['labels'].data[...] = labels

    solver.net.forward()
    solver.net.backward()

    ## Test 2 code or Test 1 code ....

My problem is that even when I run for many iterations, no learning takes place. On the first iteration the images in the mini-batch are at least assigned different classes; after the first iteration, every image seems to get some random class.

At first I thought it was a problem with my inputs, so I dropped the update snippet into the LeNet example (here). Even there, accuracy stays the same over 1000 iterations, so no learning occurs.

@seanbell Any ideas?

automan000 commented May 12, 2017

@nathanin I think you should update the weights in each blob manually; net.backward() doesn't do this for you.

net.backward()
# manually update
for layer in net.layers:
    for blob in layer.blobs:
        blob.data[...] -= current_lr * blob.diff
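
Putting that together with @seanbell's earlier point about clearing the diffs, a rough sketch of one full manual iteration (assuming net and current_lr as above; this plain update ignores momentum, weight decay and per-blob lr_mult) would be:

# clear leftover gradients so they don't accumulate across iterations
for layer in net.layers:
    for blob in layer.blobs:
        blob.diff[...] = 0.0

net.forward()
net.backward()

# plain SGD step on every parameter blob
for layer in net.layers:
    for blob in layer.blobs:
        blob.data[...] -= current_lr * blob.diff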

@nathanin

@automan000

Hi, thanks for the input. Manually updating like this, I ran into many problems. There were a couple of ways to get around them; the one I chose was to expose Solver::ApplyUpdate() in the Python interface. This way, the equivalent of solver.step(1) is:

solver.net.forward()
solver.net.backward()
solver.update()

I had to move the iter_ increment to inside sgd_solver, but I haven't really found a problem with doing that.

@automan000

@nathanin Thanks for sharing.

@xiao7199

@nathanin
Hi, I tried to expose ApplyUpdate in _caffe.cpp, but the method is protected. How did you solve that?

@nathanin

@xiao7199 I moved it to public in include/caffe/sgd_solvers.hpp. Be careful: I don't know whether this has other consequences. If you find an issue with this solution, please let me know. Good luck :)

mitar commented Oct 28, 2017

I made the changes @nathanin described above in this fork: https://github.com/mitar/caffe

Noiredd commented Feb 14, 2018

Closing as this is not related to Caffe development; also the original question seems to have been answered.

mitar commented Feb 14, 2018

I have opened this pull request with a fix for this issue: #6238
