
Forward and backward for FBNet #13

Open
GuntherZhong opened this issue Mar 18, 2019 · 5 comments

@GuntherZhong

Hi, JunrQ:

Thanks for your work, it is really helpful~
I have a question: I found that in your FBNet source code, you generate batch_size models for the batch_size samples in each batch, yet the total loss is summed and loss.backward() is called once. So how is this backward() applied, to a single model or to the batch_size models? Besides, I wonder why you use this method for FBNet, while in the SNAS code a single model is generated, loss.backward() is called, and then two .step() functions are applied.

@GuntherZhong changed the title from "Forward and backword for FBNet" to "Forward and backward for FBNet" on Mar 19, 2019

JunrQ commented Mar 19, 2019

Hi @GuntherZhong, I don't really understand what you mean by

"generate batch_size models for batch_size samples per batch"

And if I understand you right, the "two .step()" refers to:

NAS/snas/snas/snas.py, lines 318 to 319 (commit f5b0f25):

self.w_opt.step()
self.t_opt.step()

The reason is that the model parameters and the architecture parameters are trained together in SNAS. Actually, I'm not sure whether this search procedure is right.

I ran experiments on CIFAR-10, training them either jointly or alternately, and both converge well. But CIFAR-10 may be too easy.
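
For reference, a minimal sketch of the two training schemes (joint vs. alternating updates of the model weights and the architecture parameters), assuming PyTorch; the w_opt / t_opt split mirrors the snippet above, while ToyCell, alpha, and everything else here is made up for the example and is not the repository code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy cell with ordinary weights plus architecture parameters `alpha` (assumed names).
class ToyCell(nn.Module):
    def __init__(self, dim=16, num_ops=2):
        super().__init__()
        self.ops = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_ops))
        self.alpha = nn.Parameter(torch.zeros(num_ops))  # architecture parameters

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=-1)
        return sum(w[i] * op(x) for i, op in enumerate(self.ops))

model = ToyCell()
w_opt = torch.optim.SGD([p for n, p in model.named_parameters() if n != "alpha"], lr=0.025)
t_opt = torch.optim.Adam([model.alpha], lr=3e-4)

for step in range(4):
    x, y = torch.randn(8, 16), torch.randn(8, 16)
    loss = F.mse_loss(model(x), y)
    w_opt.zero_grad(); t_opt.zero_grad()
    loss.backward()
    w_opt.step()  # update the model weights
    t_opt.step()  # update the architecture parameters in the same step (joint / single-level)
    # Alternating training would instead call only one of the two .step() calls per batch,
    # e.g. step the weights on even batches and the architecture parameters on odd ones.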

@GuntherZhong

@JunrQ Thanks for your reply.

Sorry, I may not have described it clearly. In your source code, in the FBNet.forward() function, theta.repeat(batch_size, 1) is called and then weight = nn.functional.gumbel_softmax(t, temperature) is applied. Since the Gumbel-Softmax weights are different for each sample in the batch, in my opinion this can be seen as generating batch_size models from the theta parameters, one per sample. This implementation is quite different from the one in the SNAS code.
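
For concreteness, a minimal sketch of that per-sample weighting, assuming PyTorch; only theta.repeat(batch_size, 1) and gumbel_softmax come from the description above, the other names and shapes are made up for the example.

import torch
import torch.nn.functional as F

batch_size, num_ops, temperature = 4, 3, 5.0
theta = torch.zeros(1, num_ops, requires_grad=True)            # architecture logits for one layer

t = theta.repeat(batch_size, 1)                                 # [batch_size, num_ops]
weight = F.gumbel_softmax(t, tau=temperature)                   # a different sample per row

# Each row of `weight` mixes the candidate blocks for one input sample,
# so the batch is effectively processed by batch_size sampled models.
block_outputs = torch.randn(num_ops, batch_size, 8)             # stand-in candidate block outputs
mixed = (weight.t().unsqueeze(-1) * block_outputs).sum(dim=0)   # [batch_size, 8]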

And yes, those two .step() calls are exactly what I meant :-). I also found that the single-level optimization in the original SNAS paper is not described very clearly. So it is quite a pity that, although both the SNAS paper and the FBNet paper use Gumbel-Softmax, we don't know how to implement it correctly :-(


JunrQ commented Mar 19, 2019

@GuntherZhong, thank you for your good question.

According to the backpropagation algorithm, when loss.backward() is called, every weight generated by gumbel_softmax gets a gradient of shape [batch, ...], which means every repeated copy of theta has a gradient of the same shape. In the backward pass of repeat, these gradients are summed along the batch_size axis.

Since the loss is a mean over the batch, I think the gradient of theta is the mean over the batch_size generated models.

I don't know which is better: using the mean over different generated models, or using the same model for all samples (like DARTS).
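
A small sketch that checks this gradient flow, assuming PyTorch; the numbers and loss are made up for the example and this is not the FBNet code itself.

import torch

batch_size, num_ops = 4, 3
theta = torch.randn(1, num_ops, requires_grad=True)

t = theta.repeat(batch_size, 1)                  # one copy of theta per sample
weight = torch.softmax(t, dim=-1)                # gumbel_softmax behaves the same w.r.t. repeat
per_sample_loss = (weight * torch.randn(batch_size, num_ops)).sum(dim=-1)
loss = per_sample_loss.mean()                    # loss is a mean over the batch
loss.backward()

# The per-copy gradients (shape [batch_size, num_ops]) are summed by repeat's backward,
# and the 1/batch_size factor from .mean() turns that sum into an average over the
# batch_size sampled weights.
print(theta.grad.shape)                          # torch.Size([1, 3])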

@GuntherZhong

Thanks for your answer, and sorry for replying so late. It is quite an efficient way to solve the expectation minimization problem. However, I wonder where I can find a reference or paper that explains the algorithm you use :-) Thank you ~

@Jihao-Li

@GuntherZhong I think the Gumbel-Softmax coefficients should be multiplied on the block (one coefficient per candidate block), not on each feature map. The DARTS code is implemented the way I describe. However, I haven't run the experiment yet. If you get results, we can discuss it in detail.
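
If I read this correctly, a minimal sketch of that block-level weighting (one Gumbel-Softmax sample shared by the whole batch, multiplied on each candidate block's output), assuming PyTorch; all names and shapes here are made up for the example.

import torch
import torch.nn.functional as F

batch_size, num_ops, feat = 4, 3, 8
theta = torch.zeros(num_ops, requires_grad=True)
block_outputs = torch.randn(num_ops, batch_size, feat)   # stand-in candidate block outputs

w = F.gumbel_softmax(theta, tau=5.0)                     # [num_ops], one sample for the whole batch
mixed = (w.view(-1, 1, 1) * block_outputs).sum(dim=0)    # [batch_size, feat]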
